From a2e76575782c3e085217fc0efa063a498c143c03 Mon Sep 17 00:00:00 2001 From: Eric Peter Date: Thu, 21 Nov 2024 22:09:51 -0500 Subject: [PATCH] filtered-docs-for-demo (#49) --- .../chunked_databricks_docs_filtered.jsonl | 3592 +++++++++++++++++ 1 file changed, 3592 insertions(+) create mode 100644 quick_start_demo/chunked_databricks_docs_filtered.jsonl diff --git a/quick_start_demo/chunked_databricks_docs_filtered.jsonl b/quick_start_demo/chunked_databricks_docs_filtered.jsonl new file mode 100644 index 0000000..4dd24b2 --- /dev/null +++ b/quick_start_demo/chunked_databricks_docs_filtered.jsonl @@ -0,0 +1,3592 @@ +{"content":"# Query data\n### Data format options\n\nDatabricks has built-in keyword bindings for all of the data formats natively supported by Apache Spark. Databricks uses Delta Lake as the default protocol for reading and writing data and tables, whereas Apache Spark uses Parquet. \nThese articles provide an overview of many of the options and configurations available when you query data on Databricks. 
\nThe following data formats have built-in keyword configurations in Apache Spark DataFrames and SQL: \n* [Delta Lake](https:\/\/docs.databricks.com\/delta\/index.html)\n* [Delta Sharing](https:\/\/docs.databricks.com\/query\/formats\/deltasharing.html)\n* [Parquet](https:\/\/docs.databricks.com\/query\/formats\/parquet.html)\n* [ORC](https:\/\/docs.databricks.com\/query\/formats\/orc.html)\n* [JSON](https:\/\/docs.databricks.com\/query\/formats\/json.html)\n* [CSV](https:\/\/docs.databricks.com\/query\/formats\/csv.html)\n* [Avro](https:\/\/docs.databricks.com\/query\/formats\/avro.html)\n* [Text](https:\/\/docs.databricks.com\/query\/formats\/text.html)\n* [Binary](https:\/\/docs.databricks.com\/query\/formats\/binary.html)\n* [XML](https:\/\/docs.databricks.com\/query\/formats\/xml.html) \nDatabricks also provides a custom keyword for loading [MLflow experiments](https:\/\/docs.databricks.com\/query\/formats\/mlflow-experiment.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/index.html"} +{"content":"# Query data\n### Data format options\n#### Data formats with special considerations\n\nSome data formats require additional configuration or special considerations for use: \n* Databricks recommends loading [images](https:\/\/docs.databricks.com\/query\/formats\/image.html) as `binary` data.\n* [Hive tables](https:\/\/docs.databricks.com\/query\/formats\/hive-tables.html) are natively supported by Apache Spark, but require configuration on Databricks.\n* Databricks can directly read compressed files in many file formats. You can also [unzip compressed files](https:\/\/docs.databricks.com\/files\/unzip-files.html) on Databricks if necessary.\n* [LZO](https:\/\/docs.databricks.com\/query\/formats\/lzo.html) requires a codec installation. 
\nFor more information about Apache Spark data sources, see [Generic Load\/Save Functions](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-load-save-functions.html) and [Generic File Source Options](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-generic-options.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/index.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Configure a customer-managed VPC\n###### Overview\n\nBy default, clusters are created in a single AWS VPC (Virtual Private Cloud) that Databricks creates and configures in your AWS account. You can optionally create your Databricks workspaces in your own VPC, a feature known as *customer-managed VPC*. You can use a customer-managed VPC to exercise more control over your network configurations to comply with specific cloud security and governance standards your organization may require. To configure your workspace to use [AWS PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html) for any type of connection, your workspace must use a customer-managed VPC. \nA customer-managed VPC is a good solution if you have: \n* Security policies that prevent PaaS providers from creating VPCs in your own AWS account.\n* An approval process to create a new VPC, in which the VPC is configured and secured in a well-documented way by internal information security or cloud engineering teams. \nBenefits include: \n* **Lower privilege level**: You maintain more control of your own AWS account. And you don\u2019t need to grant Databricks as many permissions via cross-account IAM role as you do for a Databricks-managed VPC. For example, there is no need for permission to create VPCs. 
This limited set of permissions can make it easier to get approval to use Databricks in your platform stack.\n* **Simplified network operations**: Better network space utilization. Optionally configure smaller subnets for a workspace, compared to the default CIDR \/16. And there is no need for the complex VPC peering configurations that might be necessary with other solutions.\n* **Consolidation of VPCs**: Multiple Databricks workspaces can share a single classic compute plane VPC, which is often preferred for billing and instance management.\n* **Limit outgoing connections**: By default, the classic compute plane does not limit outgoing connections from Databricks Runtime workers. For workspaces that are configured to use a customer-managed VPC, you can use an egress firewall or proxy appliance to limit outbound traffic to a list of allowed internal or external data sources. \n![Customer-managed VPC](https:\/\/docs.databricks.com\/_images\/customer-managed-vpc.png) \nTo take advantage of a customer-managed VPC, you must specify a VPC when you first create the Databricks workspace. You cannot move an existing workspace with a Databricks-managed VPC to use a customer-managed VPC. You can, however, move an existing workspace with a customer-managed VPC from one VPC to another VPC by updating the workspace configuration\u2019s network configuration object. See [Update a running or failed workspace](https:\/\/docs.databricks.com\/admin\/workspace\/update-workspace.html). \nTo deploy a workspace in your own VPC, you must: \n1. Create the VPC following the requirements enumerated in [VPC requirements](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#vpc-requirements).\n2. Reference your VPC network configuration with Databricks when you create the workspace. 
\n* [Use the account console](https:\/\/docs.databricks.com\/admin\/workspace\/create-workspace.html) and choose the configuration by name\n* [Use the Account API](https:\/\/docs.databricks.com\/admin\/workspace\/create-workspace-api.html) and choose the configuration by its ID \nYou must provide the VPC ID, subnet IDs, and security group ID when you register the VPC with Databricks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Configure a customer-managed VPC\n###### VPC requirements\n\nYour VPC must meet the requirements described in this section in order to host a Databricks workspace. \nRequirements: \n* [VPC region](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#vpc-region)\n* [VPC sizing](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#vpc-sizing)\n* [VPC IP address ranges](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#vpc-ip-address-ranges)\n* [DNS](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#dns)\n* [Subnets](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#subnets)\n* [Security groups](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#security-groups)\n* [Subnet-level network ACLs](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#subnet-level-network-acls)\n* [AWS PrivateLink support](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#aws-privatelink-support) \n### [VPC region](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#id1) \nFor a list of AWS regions that support customer-managed VPC, see [Databricks clouds and 
regions](https:\/\/docs.databricks.com\/resources\/supported-regions.html). \n### [VPC sizing](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#id2) \nYou can share one VPC with multiple workspaces in a single AWS account. However, Databricks recommends using unique subnets and security groups for each workspace. Be sure to size your VPC and subnets accordingly. Databricks assigns two IP addresses per node, one for management traffic and one for Apache Spark applications. The total number of instances for each subnet is equal to half the number of IP addresses that are available. Learn more in [Subnets](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#subnet). \n### [VPC IP address ranges](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#id3) \nDatabricks doesn\u2019t limit netmasks for the workspace VPC, but each workspace subnet must have a netmask between `\/17` and `\/26`. This means that if your workspace has two subnets and both have a netmask of `\/26`, then the netmask for your workspace VPC must be `\/25` or smaller. \nImportant \nIf you have configured secondary CIDR blocks for your VPC, make sure that the subnets for the Databricks workspace are configured with the same VPC CIDR block. \n### [DNS](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#id4) \nThe VPC must have DNS hostnames and DNS resolution enabled. \n### [Subnets](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#id5) \nDatabricks must have access to at least *two subnets for each workspace*, with each subnet in a different availability zone. You cannot specify more than one Databricks workspace subnet per Availability Zone in the [Create network configuration API call](https:\/\/docs.databricks.com\/api\/account\/networks\/create). 
You can have more than one subnet per availability zone as part of your network setup, but you can choose only one subnet per Availability Zone for the Databricks workspace. \nYou can choose to share one subnet across multiple workspaces or both subnets across workspaces. For example, you can have two workspaces that share the same VPC. One workspace can use subnets `A` and `B` and another workspace can use subnets `A` and `C`. If you plan to share subnets across multiple workspaces, be sure to size your VPC and subnets to be large enough to scale with usage. \nDatabricks assigns two IP addresses per node, one for management traffic and one for Spark applications. The total number of instances for each subnet is equal to half of the number of IP addresses that are available. \nEach subnet must have a netmask between `\/17` and `\/26`. \n#### Additional subnet requirements \n* Subnets must be private.\n* Subnets must have outbound access to the public network using a [NAT gateway](https:\/\/docs.aws.amazon.com\/vpc\/latest\/userguide\/vpc-nat-gateway.html) and [internet gateway](https:\/\/docs.aws.amazon.com\/vpc\/latest\/userguide\/VPC_Internet_Gateway.html), or other similar customer-managed appliance infrastructure.\n* The NAT gateway must be set up [in its own subnet](https:\/\/aws.amazon.com\/premiumsupport\/knowledge-center\/nat-gateway-vpc-private-subnet\/) that routes quad-zero (`0.0.0.0\/0`) traffic to an [internet gateway](https:\/\/docs.aws.amazon.com\/vpc\/latest\/userguide\/VPC_Internet_Gateway.html) or other customer-managed appliance infrastructure. \nImportant \nWorkspaces must have outbound access from the VPC to the public network. If you configure IP access lists, those public networks must be added to an allow list. See [Configure IP access lists for workspaces](https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-workspace.html). 
\n#### Subnet route table \nThe route table for workspace subnets must have quad-zero (`0.0.0.0\/0`) traffic that targets the appropriate network device. Quad-zero traffic must target a NAT Gateway or your own managed NAT device or proxy appliance. \nImportant \nDatabricks requires subnets to add `0.0.0.0\/0` to your allow list. This rule must be prioritized. To control egress traffic, use an egress firewall or proxy appliance to block most traffic but allow the URLs that Databricks needs to connect to. See [Configure a firewall and outbound access](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#firewall). \nThis is a base guideline only. Your configuration requirements may differ. For questions, contact your Databricks account team. \n### [Security groups](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#id6) \nA Databricks workspace must have access to at least one AWS security group and no more than five security groups. You can reuse existing security groups rather than create new ones. However, Databricks recommends using unique subnets and security groups for each workspace. \nSecurity groups must have the following rules: \n**Egress (outbound):** \n* Allow all TCP and UDP access to the workspace security group (for internal traffic)\n* Allow TCP access to `0.0.0.0\/0` for these ports: \n+ 443: for Databricks infrastructure, cloud data sources, and library repositories\n+ 3306: for the metastore\n+ 6666: for secure cluster connectivity. This is only required if you use [PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html).\n+ 2443: Supports FIPS encryption. Only required if you enable the [compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html).\n+ 8443 through 8451: Future extendability. 
Ensure these [ports are open by January 31, 2024](https:\/\/docs.databricks.com\/release-notes\/product\/2023\/august.html#aws-new-egress-ports). \n**Ingress (inbound):** Required for all workspaces (these can be separate rules or combined into one): \n* Allow TCP on all ports when traffic source uses the same security group\n* Allow UDP on all ports when traffic source uses the same security group \n### [Subnet-level network ACLs](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#id7) \nSubnet-level network ACLs must not deny ingress or egress to any traffic. Databricks validates for the following rules while creating the workspace: \n**Egress (outbound):** \n* Allow all traffic to the workspace VPC CIDR, for internal traffic \n+ Allow TCP access to `0.0.0.0\/0` for these ports: \n- 443: for Databricks infrastructure, cloud data sources, and library repositories\n- 3306: for the metastore\n- 6666: only required if you use [PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html) \nImportant \nIf you configure additional `ALLOW` or `DENY` rules for outbound traffic, set the rules required by Databricks to the highest priority (the lowest rule numbers), so that they take precedence. \n**Ingress (inbound):** \n* `ALLOW ALL from Source 0.0.0.0\/0`. This rule must be prioritized. \nNote \nDatabricks requires subnet-level network ACLs to add `0.0.0.0\/0` to your allow list. To control egress traffic, use an egress firewall or proxy appliance to block most traffic but allow the URLs that Databricks needs to connect to. See [Configure a firewall and outbound access](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#firewall). 
\n### [AWS PrivateLink support](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#id8) \nIf you plan to enable AWS PrivateLink on the workspace with this VPC: \n* On the VPC, ensure that you enable both of the settings **DNS Hostnames** and **DNS resolution**.\n* Review the article [Enable AWS PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html) for guidance about creating an extra subnet for VPC endpoints (recommended but not required) and creating an extra security group for VPC endpoints.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Configure a customer-managed VPC\n###### Create a VPC\n\nTo create VPCs you can use various tools: \n* AWS console\n* AWS CLI\n* [Terraform](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html)\n* [AWS Quickstart](https:\/\/docs.databricks.com\/admin\/workspace\/templates.html) (create a new customer-managed VPC and a new workspace) \nTo use AWS Console, the basic instructions for creating and configuring a VPC and related objects are listed below. For complete instructions, see the AWS documentation. \nNote \nThese basic instructions might not apply to all organizations. Your configuration requirements may differ. This section does not cover all possible ways to configure NATs, firewalls, or other network infrastructure. If you have questions, contact your Databricks account team before proceeding. \n1. Go to the [VPCs page in AWS](https:\/\/console.aws.amazon.com\/vpc\/#vpcs:).\n2. See the region picker in the upper-right. If needed, switch to the region for your workspace.\n3. In the upper-right corner, click the orange button **Create VPC**. \n![create new VPC editor](https:\/\/docs.databricks.com\/_images\/customer-managed-vpc-createnew.png)\n4. Click **VPC and more**.\n5. 
In the **Name tag auto-generation** field, type a name for your workspace. Databricks recommends including the region in the name.\n6. For the VPC address range, change the default if desired.\n7. For public subnets, click `2`. Those subnets aren\u2019t used directly by your Databricks workspace, but they are required to enable NATs in this editor.\n8. For private subnets, click `2` for the minimum for workspace subnets. You can add more if desired. \nYour Databricks workspace needs at least two private subnets. To resize them, click **Customize subnet CIDR blocks**.\n9. For NAT gateways, click **In 1 AZ**.\n10. Ensure the following fields at the bottom are enabled: **Enable DNS hostnames** and **Enable DNS resolution**.\n11. Click **Create VPC**.\n12. When viewing your new VPC, click on the left navigation items to update related settings on the VPC. To make it easier to find related objects, in the **Filter by VPC** field, select your new VPC.\n13. Click **Subnets** and what AWS calls the **private** subnets labeled 1 and 2, which are the ones you will use to configure your main workspace subnets. Modify the subnets as specified in [VPC requirements](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#vpc-requirements). \nIf you created an extra private subnet for use with PrivateLink, configure private subnet 3 as specified in [Enable AWS PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html).\n14. Click **Security groups** and modify the security group as specified in [Security groups](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#security-groups). \nIf you will use back-end PrivateLink connectivity, create an additional security group with inbound and outbound rules as specified in the PrivateLink article in the section [Step 1: Configure AWS network objects](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#create-vpc).\n15. 
Click **Network ACLs** and modify the network ACLs as specified in [Subnet-level network ACLs](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#network-acls).\n16. Choose whether to perform the optional configurations that are specified later in this article.\n17. Register your VPC with Databricks to create a network configuration [using the account console](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/networks.html) or by [using the Account API](https:\/\/docs.databricks.com\/admin\/workspace\/create-workspace-api.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Configure a customer-managed VPC\n###### Updating CIDRs\n\nYou might need to, at a later time, update subnet CIDRs that overlap with original subnets. \nTo update the CIDRs and other workspace objects: \n1. Terminate all running clusters (and other compute resources) that are running in the subnets that need to be updated.\n2. Using the AWS console, delete the subnets to update.\n3. Re-create the subnets with updated CIDR ranges.\n4. Update the route table association for the two new subnets. You can reuse the ones in each availability zone for existing subnets. \nImportant \nIf you skip this step or misconfigure the route tables, clusters might fail to launch.\n5. Create a new network configuration object with the new subnets.\n6. 
Update the workspace to use this newly created network configuration object.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Configure a customer-managed VPC\n###### (Recommended) Configure regional endpoints\n\nIf you use a customer-managed VPC (optional), Databricks recommends you configure your VPC to use only regional VPC endpoints to AWS services. Using regional VPC endpoints enables more direct connections to AWS services and reduced cost compared to AWS global endpoints. There are four AWS services that a Databricks workspace with a customer-managed VPC must reach: STS, S3, Kinesis, and RDS. \nThe connection from your VPC to the RDS service is required only if you use the default Databricks legacy Hive metastore and does not apply to Unity Catalog metastores. Although there is no VPC endpoint for RDS, instead of using the default Databricks legacy Hive metastore, you can configure your own external metastore. You can implement an external metastore with a [Hive metastore](https:\/\/docs.databricks.com\/archive\/external-metastores\/external-hive-metastore.html) or [AWS Glue](https:\/\/docs.databricks.com\/archive\/external-metastores\/aws-glue-metastore.html). \nFor the other three services, you can create VPC gateway or interface endpoints so that the relevant in-region traffic from clusters can transit over the secure AWS backbone rather than the public network: \n* **S3**: Create a [VPC gateway endpoint](https:\/\/aws.amazon.com\/blogs\/aws\/new-vpc-endpoint-for-amazon-s3) that is directly accessible from your Databricks cluster subnets. This causes workspace traffic to all in-region S3 buckets to use the endpoint route. To access any cross-region buckets, open up access to S3 global URL `s3.amazonaws.com` in your egress appliance, or route `0.0.0.0\/0` to an AWS internet gateway. 
\nTo use [DBFS mounts](https:\/\/docs.databricks.com\/dbfs\/mounts.html) with regional endpoints enabled: \n+ You must set up an environment variable in the cluster configuration to set `AWS_REGION=`. For example, if your workspace is deployed in the N. Virginia region, set `AWS_REGION=us-east-1`. To enforce it for all clusters, use [cluster policies](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html).\n* **STS**: Create a [VPC interface endpoint](https:\/\/docs.aws.amazon.com\/vpc\/latest\/userguide\/vpce-interface.html#create-interface-endpoint) directly accessible from your Databricks cluster subnets. You can create this endpoint in your workspace subnets. Databricks recommends that you use the same security group that was created for your workspace VPC. This configuration causes workspace traffic to STS to use the endpoint route.\n* **Kinesis**: Create a [VPC interface endpoint](https:\/\/docs.aws.amazon.com\/vpc\/latest\/userguide\/vpce-interface.html#create-interface-endpoint) directly accessible from your Databricks cluster subnets. You can create this endpoint in your workspace subnets. Databricks recommends that you use the same security group that was created for your workspace VPC. This configuration causes workspace traffic to Kinesis to use the endpoint route. 
The only exception to this rule is workspaces in the AWS region `us-west-1` because target Kinesis streams in this region are cross-region to the `us-west-2` region.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Configure a customer-managed VPC\n###### Configure a firewall and outbound access\n\nYou must use an egress firewall or proxy appliance to block most traffic but allow the URLs that Databricks needs to connect to: \n* If the firewall or proxy appliance is in the same VPC as the Databricks workspace VPC, route the traffic and configure it to allow the following connections.\n* If the firewall or proxy appliance is in a different VPC or an on-premises network, route `0.0.0.0\/0` to that VPC or network first and configure the proxy appliance to allow the following connections. \nImportant \nDatabricks strongly recommends that you specify destinations as domain names in your egress infrastructure, rather than as IP addresses. \nAllow the following outgoing connections. For each connection type, follow the link to get IP addresses or domains for your workspace region. \n* **Databricks web application**: Required. Also used for REST API calls to your workspace. \n[Databricks control plane addresses](https:\/\/docs.databricks.com\/resources\/supported-regions.html#control-plane-ip-addresses)\n* **Databricks secure cluster connectivity (SCC) relay**: Required for secure cluster connectivity. \n[Databricks control plane addresses](https:\/\/docs.databricks.com\/resources\/supported-regions.html#control-plane-ip-addresses)\n* **AWS S3 global URL**: Required by Databricks to access the root S3 bucket. Use `s3.amazonaws.com:443`, regardless of region.\n* **AWS S3 regional URL**: Optional. If you use S3 buckets that might be in other regions, you must also allow the S3 regional endpoint. 
Although AWS provides a domain and port for a regional endpoint (`s3.<region>.amazonaws.com:443`), Databricks recommends that you instead use a [VPC endpoint](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#regional-endpoints) so that this traffic goes through the private tunnel over the AWS network backbone. See [(Recommended) Configure regional endpoints](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#regional-endpoints).\n* **AWS STS global URL**: Required. Use the following address and port, regardless of region: `sts.amazonaws.com:443`\n* **AWS STS regional URL**: Required due to expected switch to regional endpoint. Use a VPC endpoint. See [(Recommended) Configure regional endpoints](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#regional-endpoints).\n* **AWS Kinesis regional URL**: Required. The Kinesis endpoint is used to capture logs needed to manage and monitor the software. For the URL for your region, see [Kinesis addresses](https:\/\/docs.databricks.com\/resources\/supported-regions.html#kinesis).\n* **Table metastore RDS regional URL (by compute plane region)**: Required if your Databricks workspace uses the default Hive metastore. \nThe Hive metastore is always in the same region as your compute plane, but it might be in a different region than the control plane. \n[RDS addresses for legacy Hive metastore](https:\/\/docs.databricks.com\/resources\/supported-regions.html#rds) \nNote \nInstead of using the default Hive metastore, you can choose to [implement your own table metastore instance](https:\/\/docs.databricks.com\/archive\/external-metastores\/index.html), in which case you are responsible for its network routing.\n* **Control plane infrastructure**: Required. Used by Databricks for standby Databricks infrastructure to improve the stability of Databricks services. 
\n[Databricks control plane addresses](https:\/\/docs.databricks.com\/resources\/supported-regions.html#control-plane-ip-addresses) \n### Troubleshoot regional endpoints \nIf you followed the instructions above and the VPC endpoints do not work as intended, for example, if your data sources are inaccessible or if the traffic is bypassing the endpoints, you can use one of two approaches to add support for the regional endpoints for S3 and STS instead of using VPC endpoints. \n1. Add the environment variable `AWS_REGION` in the cluster configuration and set it to your AWS region. To enable it for all clusters, use [cluster policies](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html). You might have already configured this environment variable to use DBFS mounts.\n2. Add the required Apache Spark configuration. Use exactly one of the following approaches: \n* **In each source notebook**: \n```\n%scala\nspark.conf.set(\"fs.s3a.stsAssumeRole.stsEndpoint\", \"https:\/\/sts.<region>.amazonaws.com\")\nspark.conf.set(\"fs.s3a.endpoint\", \"https:\/\/s3.<region>.amazonaws.com\")\n\n``` \n```\n%python\nspark.conf.set(\"fs.s3a.stsAssumeRole.stsEndpoint\", \"https:\/\/sts.<region>.amazonaws.com\")\nspark.conf.set(\"fs.s3a.endpoint\", \"https:\/\/s3.<region>.amazonaws.com\")\n\n```\n* **Alternatively, in the Apache Spark config for the cluster**: \n```\nspark.hadoop.fs.s3a.endpoint https:\/\/s3.<region>.amazonaws.com\nspark.hadoop.fs.s3a.stsAssumeRole.stsEndpoint https:\/\/sts.<region>.amazonaws.com\n\n```\n3. If you limit egress from the classic compute plane using a firewall or internet appliance, add these regional endpoint addresses to your allow list. 
\nTo set these values for all clusters, configure the values as part of your [cluster policy](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Configure a customer-managed VPC\n###### (Optional) Access S3 using instance profiles\n\nTo access S3 mounts using [instance profiles](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html), set the following Spark configurations: \n* Either **in each source notebook**: \n```\n%scala\nspark.conf.set(\"fs.s3a.stsAssumeRole.stsEndpoint\", \"https:\/\/sts.<region>.amazonaws.com\")\nspark.conf.set(\"fs.s3a.endpoint\", \"https:\/\/s3.<region>.amazonaws.com\")\n\n``` \n```\n%python\nspark.conf.set(\"fs.s3a.stsAssumeRole.stsEndpoint\", \"https:\/\/sts.<region>.amazonaws.com\")\nspark.conf.set(\"fs.s3a.endpoint\", \"https:\/\/s3.<region>.amazonaws.com\")\n\n```\n* Or **in the Apache Spark config for the cluster**: \n```\nspark.hadoop.fs.s3a.endpoint https:\/\/s3.<region>.amazonaws.com\nspark.hadoop.fs.s3a.stsAssumeRole.stsEndpoint https:\/\/sts.<region>.amazonaws.com\n\n``` \nTo set these values for all clusters, configure the values as part of your [cluster policy](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html). \nWarning \nFor the S3 service, there are limitations to applying additional regional endpoint configurations at the notebook or cluster level. Notably, cross-region S3 access is blocked, even if the global S3 URL is allowed in your egress firewall or proxy. 
If your Databricks deployment might require cross-region S3 access, it is important that you not apply the Spark configuration at the notebook or cluster level.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Configure a customer-managed VPC\n###### (Optional) Restrict access to S3 buckets\n\nMost reads from and writes to S3 are self-contained within the compute plane. However, some management operations originate from the control plane, which is managed by Databricks. To limit access to S3 buckets to a specified set of source IP addresses, create an S3 bucket policy. In the bucket policy, include the IP addresses in the `aws:SourceIp` list. If you use a VPC Endpoint, allow access to it by adding it to the policy\u2019s `aws:sourceVpce`. Databricks uses VPC IDs for accessing S3 buckets in the same region as the Databricks control plane, and NAT IPs for accessing S3 buckets in different regions from the control plane. \nFor more information about S3 bucket policies, see the [bucket policy examples](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/example-bucket-policies.html#example-bucket-policies-use-case-3) in the Amazon S3 documentation. Working [example bucket policies](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#example-bucket-policies) are also included in this topic. 
\n### Requirements for bucket policies \nYour bucket policy must meet these requirements to ensure that your clusters start correctly and that you can connect to them: \n* You must allow access from the [control plane NAT IP and VPC IDs for your region](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#required-ips-and-storage-buckets).\n* You must allow access from the compute plane VPC, by doing one of the following: \n+ (Recommended) Configure a gateway VPC Endpoint in your [Customer-managed VPC](https:\/\/docs.databricks.com\/admin\/cloud-configurations\/aws\/customer-managed-vpc.html) and add it to the `aws:sourceVpce` list in the bucket policy, or\n+ Add the compute plane NAT IP to the `aws:SourceIp` list.\n* **When using [Endpoint policies for Amazon S3](https:\/\/docs.aws.amazon.com\/vpc\/latest\/privatelink\/vpc-endpoints-s3.html#vpc-endpoints-policies-s3)**, your policy must include: \n+ Your workspace\u2019s [root storage bucket](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/storage.html).\n+ The required [artifact, log, system tables, and shared datasets bucket for your region](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#required-ips-and-storage-buckets).\n* **To avoid losing connectivity from within your corporate network**, Databricks recommends always allowing access from at least one known and trusted IP address, such as the public IP of your corporate VPN. This is because Deny conditions apply even within the AWS console. \nNote \nWhen deploying a new workspace with S3 bucket policy restrictions, you must allow access to the control plane NAT-IP for a `us-west` region; otherwise, the deployment fails. After the workspace is deployed, you can remove the `us-west` info and update the control plane NAT-IP to reflect your region. 
\n### Required IPs and storage buckets \nFor the IP addresses and domains that you need for configuring S3 bucket policies and VPC Endpoint policies to restrict access to your workspace\u2019s S3 buckets, see [Outbound from Databricks control plane](https:\/\/docs.databricks.com\/resources\/supported-regions.html#outbound). \n### Example bucket policies \nThese examples use placeholder text to indicate where to specify recommended IP addresses and required storage buckets. Review the [requirements](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#requirements-for-bucket-policies) to ensure that your clusters start correctly and that you can connect to them. \n**Restrict access to the Databricks control plane, compute plane, and trusted IPs:** \nThis S3 bucket policy uses a Deny condition to selectively allow access from the control plane, NAT gateway, and corporate VPN IP addresses you specify. Replace the placeholder text with values for your environment. You can add any number of IP addresses to the policy. Create one policy per S3 bucket you want to protect. \nImportant \nIf you use VPC Endpoints, this policy is not complete. See [Restrict access to the Databricks control plane, VPC endpoints, and trusted IPs](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#example-bucket-policy-vpce). \n```\n{\n\"Sid\": \"IPDeny\",\n\"Effect\": \"Deny\",\n\"Principal\": \"*\",\n\"Action\": \"s3:*\",\n\"Resource\": [\n\"arn:aws:s3:::\",\n\"arn:aws:s3:::\/*\"\n],\n\"Condition\": {\n\"NotIpAddress\": {\n\"aws:SourceIp\": [\n\"\",\n\"\",\n\"\"\n]\n}\n}\n}\n\n``` \n**Restrict access to the Databricks control plane, VPC endpoints, and trusted IPs:** \nIf you use a VPC Endpoint to access S3, you must add a second condition to the policy. This condition allows access from your VPC Endpoint and VPC ID by adding it to the `aws:sourceVpce` list. 
\nThis bucket policy selectively allows access from your VPC Endpoint, and from the control plane and corporate VPN IP addresses you specify. \nWhen using VPC Endpoints, you can use a VPC Endpoint policy instead of an S3 bucket policy. A VPCE policy must allow access to your root S3 bucket and to the required artifact, log, and shared datasets bucket for your region. For the IP addresses and domains for your regions, see [IP addresses and domains](https:\/\/docs.databricks.com\/resources\/supported-regions.html#ip-domain-aws). \nReplace the placeholder text with values for your environment. \n```\n{\n\"Sid\": \"IPDeny\",\n\"Effect\": \"Deny\",\n\"Principal\": \"*\",\n\"Action\": \"s3:*\",\n\"Resource\": [\n\"arn:aws:s3:::\",\n\"arn:aws:s3:::\/*\"\n],\n\"Condition\": {\n\"NotIpAddressIfExists\": {\n\"aws:SourceIp\": [\n\"\",\n\"\"\n]\n},\n\"StringNotEqualsIfExists\": {\n\"aws:sourceVpce\": \"\",\n\"aws:SourceVPC\": \"\"\n}\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### User-defined aggregate functions - Scala\n\nThis article contains an example of a UDAF and how to register it for use in Apache Spark SQL. 
See [User-defined aggregate functions (UDAFs)](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-functions-udf-aggregate.html) for more details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/aggregate-scala.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### User-defined aggregate functions - Scala\n##### Implement a `UserDefinedAggregateFunction`\n\n```\nimport org.apache.spark.sql.expressions.MutableAggregationBuffer\nimport org.apache.spark.sql.expressions.UserDefinedAggregateFunction\nimport org.apache.spark.sql.Row\nimport org.apache.spark.sql.types._\n\nclass GeometricMean extends UserDefinedAggregateFunction {\n\/\/ These are the input fields for your aggregate function.\noverride def inputSchema: org.apache.spark.sql.types.StructType =\nStructType(StructField(\"value\", DoubleType) :: Nil)\n\n\/\/ These are the internal fields you keep for computing your aggregate.\noverride def bufferSchema: StructType = StructType(\nStructField(\"count\", LongType) ::\nStructField(\"product\", DoubleType) :: Nil\n)\n\n\/\/ This is the output type of your aggregation function.\noverride def dataType: DataType = DoubleType\n\noverride def deterministic: Boolean = true\n\n\/\/ This is the initial value for your buffer schema.\noverride def initialize(buffer: MutableAggregationBuffer): Unit = {\nbuffer(0) = 0L\nbuffer(1) = 1.0\n}\n\n\/\/ This is how to update your buffer schema given an input.\noverride def update(buffer: MutableAggregationBuffer, input: Row): Unit = {\nbuffer(0) = buffer.getAs[Long](0) + 1\nbuffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0)\n}\n\n\/\/ This is how to merge two objects with the bufferSchema type.\noverride def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {\nbuffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0)\nbuffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1)\n}\n\n\/\/ This is where you output the final value, given the final value of 
your bufferSchema.\noverride def evaluate(buffer: Row): Any = {\nmath.pow(buffer.getDouble(1), 1.toDouble \/ buffer.getLong(0))\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/aggregate-scala.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### User-defined aggregate functions - Scala\n##### Register the UDAF with Spark SQL\n\n```\nspark.udf.register(\"gm\", new GeometricMean)\n\n```\n\n#### User-defined aggregate functions - Scala\n##### Use your UDAF\n\n```\n\/\/ Create a DataFrame and Spark SQL table\nimport org.apache.spark.sql.functions._\n\nval ids = spark.range(1, 20)\nids.createOrReplaceTempView(\"ids\")\nval df = spark.sql(\"select id, id % 3 as group_id from ids\")\ndf.createOrReplaceTempView(\"simple\")\n\n``` \n```\n-- Use a group_by statement and call the UDAF.\nselect group_id, gm(id) from simple group by group_id\n\n``` \n```\n\/\/ Or use DataFrame syntax to call the aggregate function.\n\n\/\/ Create an instance of UDAF GeometricMean.\nval gm = new GeometricMean\n\n\/\/ Show the geometric mean of values of column \"id\".\ndf.groupBy(\"group_id\").agg(gm(col(\"id\")).as(\"GeometricMean\")).show()\n\n\/\/ Invoke the UDAF by its assigned name.\ndf.groupBy(\"group_id\").agg(expr(\"gm(id) as GeometricMean\")).show()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/aggregate-scala.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-alerts.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Monitor alerts\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis page describes how to create a Databricks SQL alert based on a metric from a monitor metrics table. Some common uses for monitor alerts include: \n* Get notified when a statistic moves out of a certain range. 
For example, you want to receive a notification when the fraction of missing values exceeds a certain level.\n* Get notified of a change in the data. The drift metrics table stores statistics that track changes in the data distribution.\n* Get notified if data has drifted in comparison to the baseline table. You can set up an alert to investigate the data changes or, for `InferenceLog` analysis, to indicate that the model should be retrained. \nMonitor alerts are created and used the same way as other Databricks SQL alerts. You create a [Databricks SQL query](https:\/\/docs.databricks.com\/sql\/user\/queries\/index.html) on the monitor profile metrics table or drift metrics table. You then create a Databricks SQL alert for this query. You can configure the alert to evaluate the query at a desired frequency, and send a notification if the alert is triggered. By default, email notification is sent. You can also set up a webhook or send notifications to other applications such as Slack or Pagerduty. \nYou can also quickly create an alert from the [monitor dashboard](https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-dashboard.html) as follows: \n1. On the dashboard, find the chart for which you want to create an alert.\n2. Click ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) in the upper-right corner of the chart and select **View query**. The SQL editor opens.\n3. In the SQL editor, click ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) above the editor window and select **Create alert**. The **New alert** dialog opens in a new tab.\n4. Configure the alert and click **Create alert**. \nNote that if the query uses parameters, then the alert is based on the default values for these parameters. You should confirm that the default values reflect the intent of the alert. 
\nFor details, see [Databricks SQL alerts](https:\/\/docs.databricks.com\/sql\/user\/alerts\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-alerts.html"} +{"content":"# \n### Compute\n\nDatabricks compute refers to the selection of computing resources available in the Databricks workspace. Users need access to compute to run data engineering, data science, and data analytics workloads, such as production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning. \nUsers can either connect to existing compute or create new compute if they have the proper permissions. \nYou can view the compute you have access to using the **Compute** section of the workspace: \n![All-purpose compute page in Databricks workspace](https:\/\/docs.databricks.com\/_images\/compute-page.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/index.html"} +{"content":"# \n### Compute\n#### Types of compute\n\nThese are the types of compute available in Databricks: \n* **Serverless compute for notebooks (Public Preview)**: On-demand, scalable compute used to execute SQL and Python code in notebooks.\n* **Serverless compute for workflows (Public Preview)**: On-demand, scalable compute used to run your Databricks jobs without configuring and deploying infrastructure. \n* **All-Purpose compute**: Provisioned compute used to analyze data in notebooks. You can create, terminate, and restart this compute using the UI, CLI, or REST API.\n* **Job compute**: Provisioned compute used to run automated jobs. The Databricks job scheduler automatically creates a job compute whenever a job is configured to run on new compute. The compute terminates when the job is complete. You *cannot* restart a job compute. See [Use Databricks compute with your jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/use-compute.html).\n* **Instance pools**: Compute with idle, ready-to-use instances, used to reduce start and autoscaling times. 
You can create this compute using the UI, CLI, or REST API. \n* **Serverless SQL warehouses**: On-demand elastic compute used to run SQL commands on data objects in the SQL editor or interactive notebooks. You can create SQL warehouses using the UI, CLI, or REST API. \n* **Classic SQL warehouses**: Used to run SQL commands on data objects in the SQL editor or interactive notebooks. You can create SQL warehouses using the UI, CLI, or REST API. \nThe articles in this section describe how to work with compute resources using the Databricks UI. For other methods, see [What is the Databricks CLI?](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) and the [Databricks REST API reference](https:\/\/docs.databricks.com\/api\/workspace).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/index.html"} +{"content":"# \n### Compute\n#### Databricks Runtime\n\nDatabricks Runtime is the set of core components that run on your compute. The Databricks Runtime is a configurable setting in all-purpose or jobs compute but is autoselected in SQL warehouses. \nEach Databricks Runtime version includes updates that improve the usability, performance, and security of big data analytics. The Databricks Runtime on your compute adds many features, including: \n* Delta Lake, a next-generation storage layer built on top of Apache Spark that provides ACID transactions, optimized layouts and indexes, and execution engine improvements for building data pipelines. See [What is Delta Lake?](https:\/\/docs.databricks.com\/delta\/index.html).\n* Installed Java, Scala, Python, and R libraries.\n* Ubuntu and its accompanying system libraries.\n* GPU libraries for GPU-enabled clusters.\n* Databricks services that integrate with other components of the platform, such as notebooks, jobs, and cluster management. \nFor information about the contents of each runtime version, see the [release notes](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). 
\n### Runtime versioning \nDatabricks Runtime versions are released on a regular basis: \n* **Long Term Support** versions are represented by an **LTS** qualifier (for example, **3.5 LTS**). For each major release, we declare a \u201ccanonical\u201d feature version, for which we provide three full years of support. See [Databricks runtime support lifecycles](https:\/\/docs.databricks.com\/release-notes\/runtime\/databricks-runtime-ver.html) for more information.\n* **Major** versions are represented by an increment to the version number that precedes the decimal point (the jump from 3.5 to 4.0, for example). They are released when there are major changes, some of which may not be backwards-compatible.\n* **Feature** versions are represented by an increment to the version number that follows the decimal point (the jump from 3.4 to 3.5, for example). Each major release includes multiple feature releases. Feature releases are always backward compatible with previous releases within their major release.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/index.html"} +{"content":"# \n### Compute\n#### What is Serverless Compute?\n\nServerless compute enhances productivity, cost efficiency, and reliability in the following ways: \n* **Productivity**: Cloud resources are managed by Databricks, reducing management overhead and providing instant compute to enhance user productivity.\n* **Efficiency**: Serverless compute offers rapid start-up and scaling times, minimizing idle time and ensuring you only pay for the compute you use.\n* **Reliability**: With serverless compute, capacity handling, security, patching, and upgrades are managed automatically, alleviating concerns about security policies and capacity shortages.\n\n### Compute\n#### What are Serverless SQL Warehouses?\n\nDatabricks SQL delivers optimal price and performance with serverless SQL warehouses. 
Key advantages of serverless warehouses over pro and classic models include: \n* **Instant and elastic compute**: Eliminates waiting for infrastructure resources and avoids resource over-provisioning during usage spikes. Intelligent workload management dynamically handles scaling. See [SQL warehouse types](https:\/\/docs.databricks.com\/admin\/sql\/warehouse-types.html) for more information on intelligent workload management and other serverless features.\n* **Minimal management overhead**: Capacity management, patching, upgrades, and performance optimization are all handled by Databricks, simplifying operations and leading to predictable pricing.\n* **Lower total cost of ownership (TCO)**: Automatic provisioning and scaling of resources as needed helps avoid over-provisioning and reduces idle times, thus lowering TCO.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/index.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Upgrade Hive tables and views to Unity Catalog\n\nThis article describes how to upgrade tables and views registered in your existing workspace-local Hive metastore to Unity Catalog. You can upgrade a Hive table either to a *managed table* or *external table* in Unity Catalog. \n* **Managed tables** are the preferred way to create tables in Unity Catalog. Unity Catalog fully manages their lifecycle, file layout, and storage. Unity Catalog also optimizes their performance automatically. Managed tables always use the [Delta](https:\/\/docs.databricks.com\/delta\/index.html) table format. \nManaged tables reside in a [managed storage location](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#managed-storage) that you reserve for Unity Catalog. 
Because of this storage requirement, you must use [CLONE](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#clone) or [CREATE TABLE AS SELECT](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#create-table-as-select) (CTAS) if you want to copy existing Hive tables to Unity Catalog as managed tables.\n* **External tables** are tables whose data lifecycle, file layout, and storage location are not managed by Unity Catalog. Multiple data formats are supported for external tables. \nTypically you use external tables only when you also need direct access to data using non-Databricks compute (that is, not using Databricks clusters or Databricks SQL warehouses). External tables are also convenient in migration scenarios, because you can register existing data in Unity Catalog quickly without having to copy that data. This is possible because data in external tables doesn\u2019t have to reside in reserved managed storage. \nFor more information about managed and external tables in Unity Catalog, see [Tables](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#table).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Upgrade Hive tables and views to Unity Catalog\n##### Hive to Unity Catalog migration options\n\nWhen you are ready to migrate Hive tables to Unity Catalog, you have several options, depending on your use case: \n| Migration tool | Description | Hive table requirements | Unity Catalog table created | Why should I use it? 
|\n| --- | --- | --- | --- | --- |\n| [UCX](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/ucx.html) | A comprehensive set of command-line utilities and other tools that assess your workspace\u2019s readiness for Unity Catalog migration and perform workflows that migrate identities, permissions, storage locations, and tables to Unity Catalog. UCX is available on GitHub at [databrickslabs\/ucx](https:\/\/github.com\/databrickslabs\/ucx). | Managed or external Hive tables | Managed or external | You want a comprehensive workspace upgrade planning tool that goes beyond upgrading Hive tables to Unity Catalog. You want to upgrade workspaces that have large amounts of data in the Hive metastore. You are comfortable running scripts. If you want to perform a bulk upgrade of Hive tables to Unity Catalog managed tables, this is your only option. UCX, like all Databricks Labs projects, is a public GitHub repo and not supported directly by Databricks. |\n| [Unity Catalog upgrade wizard](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#wizard-bulk) | A Catalog Explorer feature that enables you to bulk-copy entire schemas (databases) and multiple managed and external tables from your Hive metastore to the Unity Catalog metastore as external tables. The upgrade wizard performs the `SYNC` command on the tables that you select, leaving the original Hive tables intact. You have the option to schedule regular upgrades in order to pick up changes to the source Hive tables. | Managed or external Hive tables | External only | You want to quickly upgrade your Hive tables to external tables in Unity Catalog, and you prefer a visual interface. The ability to schedule regular syncs when the source Hive table changes makes it a useful tool for managing a \u201chybrid\u201d Hive and Unity Catalog workspace during the transition to Unity Catalog. 
|\n| [SYNC SQL command](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#sync) | `SYNC` enables you to copy external tables and managed tables (if the managed tables are stored outside of Databricks workspace storage, sometimes known as DBFS root) in your Hive metastore to external tables in Unity Catalog. You can sync individual tables or entire schemas. `SYNC` is designed to be run on a schedule to pick up new changes in the Hive metastore and sync them to Unity Catalog. | Managed or external Hive tables | External only | You want to quickly upgrade your Hive tables to external tables in Unity Catalog, and you prefer to use SQL commands rather than a visual interface. Scheduling regular `SYNC` runs to update existing Unity Catalog tables when the source Hive table changes makes it a useful tool for managing a \u201chybrid\u201d Hive and Unity Catalog workspace during the transition to Unity Catalog. Because you cannot use `SYNC` to upgrade managed tables that are in Databricks workspace storage, use [CREATE TABLE CLONE](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-clone.html) for those tables. |\n| [CREATE TABLE CLONE SQL command](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#clone) | `CREATE TABLE CLONE` enables you to upgrade managed tables in your Hive metastore to managed tables in Unity Catalog. You can clone individual tables. Deep clones are preferred, because they copy source table data to the clone target in addition to the existing table metadata. | Managed Hive tables that are in Delta, Parquet, or Iceberg format. Cloning Parquet and Iceberg source tables has some specific requirements and limitations: see [Requirements and limitations for cloning Parquet and Iceberg tables](https:\/\/docs.databricks.com\/delta\/clone-parquet.html#limitations). 
| Managed only | You want to migrate Hive managed tables to Unity Catalog managed tables to take full advantage of Unity Catalog data governance, and your Hive tables meet the criteria listed in the \u201cHive table requirements\u201d cell. If your Hive tables do not meet the \u201cHive table requirements\u201d, you can use the [CREATE TABLE AS SELECT SQL command](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#create-table-as-select) to upgrade a Hive table to a Unity Catalog managed table. However `CLONE` is almost always preferred. Cloning has simpler syntax than `CREATE TABLE AS SELECT`: you don\u2019t need to specify partitioning, format, invariants, nullability, stream, `COPY INTO`, and other metadata, because these are cloned from the source table. | \nThis article describes how to perform all but the UCX-driven upgrade process. Databricks recommends UCX for most workspace upgrade scenarios. However, for simpler use cases, you might prefer one or more of the tools described here.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Upgrade Hive tables and views to Unity Catalog\n##### Before you begin\n\nThis section describes some of the impacts of migration that you should be prepared for, along with permissions and compute requirements. \n### Understand the impact \nYou should be aware that when you modify your workloads to use the new Unity Catalog tables, you might need to change some behaviors: \n* Unity Catalog manages partitions differently than Hive. Hive commands that directly manipulate partitions are not supported on tables managed by Unity Catalog.\n* Table history is not migrated when you run `CREATE TABLE CLONE`. Any tables in the Hive metastore that you clone to Unity Catalog are treated as new tables. 
You cannot perform Delta Lake time travel or other operations that rely on pre-migration history. \nFor more information, see [Work with Unity Catalog and the legacy Hive metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/hive-metastore.html). \n### Requirements \nTo perform migrations, you must have: \n* A workspace that has a Unity Catalog metastore and at least one Unity Catalog catalog. See [Set up and manage Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html).\n* Privileges on the Unity Catalog catalogs to which you are migrating tables. These privilege requirements are enumerated at the start of each procedure covered in this article.\n* For migration to Unity Catalog external tables: storage credentials and external locations defined in Unity Catalog, and the `CREATE EXTERNAL TABLE` privilege on the external location.\n* Access to Databricks compute that meets both of the following requirements: \n+ Supports Unity Catalog (SQL warehouses or compute resources that use single-user or shared access mode).\n+ Allows access to the tables in the Hive metastore. Because compute resources that use shared access mode are enabled for [legacy table access control](https:\/\/docs.databricks.com\/data-governance\/table-acls\/index.html) by default, if you use that access mode, you must have table access control privileges on the Hive metastore that you are migrating from. You can grant yourself access using the following SQL command: \n```\nGRANT all_privileges ON catalog hive_metastore TO ``\n\n``` \nAlternatively, you can use a compute resource in single-user access mode. \nFor more information about managing privileges on objects in the Hive metastore, see [Hive metastore privileges and securable objects (legacy)](https:\/\/docs.databricks.com\/data-governance\/table-acls\/object-privileges.html). 
For more information about managing privileges on objects in the Unity Catalog metastore, see [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Upgrade Hive tables and views to Unity Catalog\n##### Identify tables that are managed by the Hive metastore\n\nTo determine whether a table is currently registered in Unity Catalog, check the catalog name. Tables in the catalog `hive_metastore` are registered in the workspace-local Hive metastore. Any other catalogs listed are governed by Unity Catalog. \nTo view the tables in the `hive_metastore` catalog using Catalog Explorer: \n1. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** in the sidebar.\n2. In the catalog pane, browse to the `hive_metastore` catalog and expand the schema nodes. \nYou can also search for a specific table using the filter field in the Catalog pane.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Upgrade Hive tables and views to Unity Catalog\n##### Upgrade a schema or multiple tables from the Hive metastore to Unity Catalog external tables using the upgrade wizard\n\nYou can copy complete schemas (databases) and multiple external or managed tables from your Databricks default Hive metastore to the Unity Catalog metastore using the **Catalog Explorer** upgrade wizard. The upgraded tables will be external tables in Unity Catalog. \nFor help deciding when to use the upgrade wizard, see [Hive to Unity Catalog migration options](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#comparison-table). 
\n### Requirements \n**Data format requirements**: \n* See [External tables](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#external-table). \n**Compute requirements**: \n* A compute resource that supports Unity Catalog. See [Before you begin](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#before). \n**Unity Catalog object and permission requirements**: \n* A [storage credential](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html) for an IAM role that authorizes Unity Catalog to access the tables\u2019 location path.\n* An [external location](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html) that references the storage credential you just created and the path to the data on your cloud tenant.\n* `CREATE EXTERNAL TABLE` permission on the external locations of the tables to be upgraded. \n**Hive table access requirements**: \n* If your compute uses shared access mode, you need access to the tables in the Hive metastore, granted using legacy table access control. See [Before you begin](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#before). \n### Upgrade process \n1. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** in the sidebar to open the [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html).\n2. Select `hive_metastore` as your catalog and select the schema (database) that you want to upgrade. \n![Select database](https:\/\/docs.databricks.com\/_images\/data-explorer-select-database.png)\n3. Click **Upgrade** at the top right of the schema detail view.\n4. Select all of the tables that you want to upgrade and click **Next**. 
\nOnly [external tables](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#create-an-external-table) in [formats supported by Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#external-table) can be upgraded using the upgrade wizard.\n5. Set the destination catalog, schema (database), and owner for each table. \nUsers will be able to access the newly created table in the context of their privileges on the [catalog and schema](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#object-model). \nTable owners have all privileges on the table, including `SELECT` and `MODIFY`. If you don\u2019t select an owner, the tables are created with you as the owner. Databricks generally recommends that you grant table ownership to groups. To learn more about object ownership in Unity Catalog, see [Manage Unity Catalog object ownership](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/ownership.html). \nTo assign the same catalog and schema to multiple tables, select the tables and click the **Set destination** button. \nTo assign the same owner to multiple tables, select the tables and click the **Set owner** button.\n6. Review the table configurations. To modify them, click the **Previous** button.\n7. Click **Create Query for Upgrade**. \nA query editor appears with generated SQL statements.\n8. Run the query. \nWhen the query is done, each table\u2019s metadata has been copied from the Hive metastore to Unity Catalog. These tables are marked as upgraded in the upgrade wizard.\n9. Define fine-grained access control using the **Permissions** tab of each new table.\n10. (Optional) Add a comment to each upgraded Hive table that points users to the new Unity Catalog table. \nReturn to the original table in the `hive_metastore` catalog to add the table comment. 
\nIf you use the following syntax in the table comment, notebooks and SQL query editor queries that reference the deprecated Hive table will display the deprecated table name using strikethrough text, display the comment as a warning, and provide a **Quick Fix** link to Databricks Assistant, which can update your code to reference the new table. \n```\nThis table is deprecated. Please use catalog.default.table instead of hive_metastore.schema.table.\n\n``` \nSee [Add comments to indicate that a Hive table has been migrated](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#deprecated-comment).\n11. Modify your workloads to use the new tables. \nIf you added a comment to the original Hive table like the one listed in the optional previous step, you can use the **Quick Fix** link and Databricks Assistant to help you find and modify workloads.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Upgrade Hive tables and views to Unity Catalog\n##### Upgrade a single Hive table to a Unity Catalog external table using the upgrade wizard\n\nYou can copy a single table from your default Hive metastore to the Unity Catalog metastore using the upgrade wizard in **Catalog Explorer**. \nFor help deciding when to use the upgrade wizard, see [Hive to Unity Catalog migration options](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#comparison-table). \n### Requirements \n**Data format requirements**: \n* See [External tables](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#external-table). \n**Compute requirements**: \n* A compute resource that supports Unity Catalog. See [Before you begin](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#before). 
\n**Unity Catalog object and permission requirements**: \n* A [storage credential](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html) for an IAM role that authorizes Unity Catalog to access the table\u2019s location path.\n* An [external location](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html) that references the storage credential you just created and the path to the data on your cloud tenant.\n* `CREATE EXTERNAL TABLE` permission on the external location of the table to be upgraded. \n### Upgrade process \nTo upgrade an external table: \n1. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** in the sidebar to open [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html).\n2. Select the database, then the table, that you want to upgrade.\n3. Click **Upgrade** in the top-right corner of the table detail view.\n4. Select the table to upgrade and click **Next**.\n5. Select your destination catalog, schema (database), and owner. \nUsers will be able to access the newly created table in the context of their privileges on the [catalog and schema](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#object-model). \nTable owners have all privileges on the table, including `SELECT` and `MODIFY`. If you don\u2019t select an owner, the table is created with you as the owner. Databricks generally recommends that you grant table ownership to groups. To learn more about object ownership in Unity Catalog, see [Manage Unity Catalog object ownership](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/ownership.html).\n6. Click **Upgrade** in the top-right corner of the table detail view.\n7. Select the table to upgrade and click **Next**. \nThe table metadata is now copied to Unity Catalog, and a new table has been created. 
\n8. Use the **Permissions** tab to define fine-grained access control.\n9. (Optional) Add a comment to the Hive table that points users to the new Unity Catalog table. \nReturn to the original table in the `hive_metastore` catalog to add the table comment. \nIf you use the following syntax in the table comment, notebooks and SQL query editor queries that reference the deprecated Hive table will display the deprecated table name using strikethrough text, display the comment as a warning, and provide a **Quick Fix** link to Databricks Assistant, which can update your code to reference the new table. \n```\nThis table is deprecated. Please use catalog.default.table instead of hive_metastore.schema.table.\n\n``` \nSee [Add comments to indicate that a Hive table has been migrated](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#deprecated-comment).\n10. Modify existing workloads to use the new table. \nIf you added a comment to the original Hive table like the one listed in the optional previous step, you can use the **Quick Fix** link and Databricks Assistant to help you find and modify workloads. \nNote \nIf you no longer need the old table, you can drop it from the Hive metastore. Dropping an external table does not modify the data files on your cloud tenant.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Upgrade Hive tables and views to Unity Catalog\n##### Upgrade a Hive table to a Unity Catalog external table using SYNC\n\nYou can use the `SYNC` SQL command to copy external tables in your Hive metastore to external tables in Unity Catalog. You can sync individual tables or entire schemas. 
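\nAs a rough sketch of both forms, assume a hypothetical Unity Catalog catalog named `main` and a hypothetical Hive schema `sales` that contains a table `orders`; none of these names come from this article. The statements below use the `DRY RUN` clause described in the [SYNC](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-sync.html) reference to preview the outcome before anything is upgraded. Confirm the clause\u2019s behavior in that reference before relying on it. \n```\n-- Preview the sync of a single external Hive table (hypothetical names).\nSYNC TABLE main.sales.orders FROM hive_metastore.sales.orders DRY RUN;\n\n-- Preview the sync of every eligible table in a Hive schema (hypothetical names).\nSYNC SCHEMA main.sales FROM hive_metastore.sales DRY RUN;\n\n``` 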
\nYou can also use `SYNC` to copy Hive managed tables that are stored outside of Databricks workspace storage (sometimes called DBFS root) to external tables in Unity Catalog. You cannot use it to copy Hive managed tables stored in workspace storage. To copy those tables, use [CREATE TABLE CLONE](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#clone) instead. \nThe `SYNC` command performs a write operation on each source table it upgrades to add table properties for bookkeeping, including a record of the target Unity Catalog external table. \n`SYNC` can also be used to update existing Unity Catalog tables when the source tables in the Hive metastore are changed. This makes it a good tool for transitioning to Unity Catalog gradually. \nFor details, see [SYNC](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-sync.html). For help deciding when to use `SYNC`, see [Hive to Unity Catalog migration options](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#comparison-table). \n### Requirements \n**Data format requirements**: \n* See [External tables](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#external-table). \n**Compute requirements**: \n* A compute resource that supports Unity Catalog. See [Before you begin](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#before). 
\n**Unity Catalog object and permission requirements**: \n* A [storage credential](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html) for an IAM role that authorizes Unity Catalog to access the tables\u2019 location path.\n* An [external location](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html) that references the storage credential you just created and the path to the data on your cloud tenant.\n* `CREATE EXTERNAL TABLE` permission on the external locations of the tables to be upgraded. \n**Hive table access requirements**: \n* If your compute uses shared access mode, you need access to the tables in the Hive metastore, granted using legacy table access control. See [Before you begin](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#before). \n### Upgrade process \nTo upgrade tables in your Hive metastore to Unity Catalog external tables using `SYNC`: \n1. In a notebook or the SQL query editor, run one of the following: \nSync an external Hive table: \n```\nSYNC TABLE <uc-catalog>.<uc-schema>.<new-table> FROM hive_metastore.<source-schema>.<source-table>\nSET OWNER <principal>;\n\n``` \nSync an external Hive schema and all of its tables: \n```\nSYNC SCHEMA <uc-catalog>.<new-schema> FROM hive_metastore.<source-schema>\nSET OWNER <principal>;\n\n``` \nSync a managed Hive table that is stored outside of Databricks workspace storage: \n```\nSYNC TABLE <uc-catalog>.<uc-schema>.<new-table> AS EXTERNAL FROM hive_metastore.<source-schema>.<source-table>\nSET OWNER <principal>;\n\n``` \nSync a schema that contains managed Hive tables that are stored outside of Databricks workspace storage: \n```\nSYNC SCHEMA <uc-catalog>.<new-schema> AS EXTERNAL FROM hive_metastore.<source-schema>\nSET OWNER <principal>;\n\n```\n2. Grant account-level users or groups access to the new table. See [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html).\n3. (Optional) Add a comment to the original Hive table that points users to the new Unity Catalog table. \nReturn to the original table in the `hive_metastore` catalog to add the table comment. 
To learn how to add table comments using Catalog Explorer, see [Add markdown comments to data objects using Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/markdown-data-comments.html#manual). To learn how to add table comments using SQL statements in a notebook or the SQL query editor, see [COMMENT ON](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-comment.html). \nIf you use the following syntax in the table comment, notebooks and SQL query editor queries that reference the deprecated Hive table will display the deprecated table name using strikethrough text, display the comment as a warning, and provide a **Quick Fix** link to Databricks Assistant, which can update your code to reference the new table. \n```\nThis table is deprecated. Please use catalog.default.table instead of hive_metastore.schema.table.\n\n``` \nSee [Add comments to indicate that a Hive table has been migrated](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#deprecated-comment).\n4. After the table is migrated, users should update their existing queries and workloads to use the new table. \nIf you added a comment to the original Hive table like the one listed in the optional previous step, you can use the **Quick Fix** link and Databricks Assistant to help you find and modify workloads.\n5. Before you drop the old table, test for dependencies by revoking access to it and re-running related queries and workloads. \nDon\u2019t drop the old table if you are still relying on deprecation comments to help you find and update existing code that references the old table. 
Likewise, don\u2019t drop the old table if that table has changed since your original sync: `SYNC` can be used to update existing Unity Catalog tables with changes from source Hive tables.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Upgrade Hive tables and views to Unity Catalog\n##### Upgrade a Hive managed table to a Unity Catalog managed table using CLONE\n\nUse `CREATE TABLE CLONE` to upgrade managed tables in your Hive metastore to managed tables in Unity Catalog. You can clone individual tables. *Deep clones* copy source table data to the clone target in addition to the existing table metadata. Use deep clone if you intend to drop the Hive source table. *Shallow clones* do not copy the data files to the clone target but give access to them by reference to the source data: the table metadata is equivalent to the source. Shallow clones are cheaper to create but require that users who query data in the clone target also have access to the source data. \nFor help deciding when to use `CLONE`, see [Hive to Unity Catalog migration options](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#comparison-table). For help deciding which clone type to use, see [Clone a table on Databricks](https:\/\/docs.databricks.com\/delta\/clone.html). \n### Requirements \n**Data format requirements**: \n* Managed Hive tables in Delta, Parquet, or Iceberg format. Cloning Parquet and Iceberg source tables has some specific requirements and limitations. See [Requirements and limitations for cloning Parquet and Iceberg tables](https:\/\/docs.databricks.com\/delta\/clone-parquet.html#limitations). \n**Compute requirements**: \n* A compute resource that supports Unity Catalog. See [Before you begin](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#before). 
\n**Permission requirements**: \n* The `USE CATALOG` and `USE SCHEMA` privileges on the catalog and schema that you add the table to, along with `CREATE TABLE` on the schema, or you must be the owner of the catalog or schema. See [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html).\n* If your compute uses shared access mode, you need access to the tables in the Hive metastore, granted using legacy table access control. See [Before you begin](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#before). \n### Upgrade process \nTo upgrade managed tables in your Hive metastore to managed tables in Unity Catalog: \n1. In a notebook or the SQL query editor, run one of the following: \nDeep clone a managed table in the Hive metastore: \n```\nCREATE OR REPLACE TABLE <uc-catalog>.<uc-schema>.<new-table>\nDEEP CLONE hive_metastore.<source-schema>.<source-table>;\n\n``` \nShallow clone a managed table in the Hive metastore: \n```\nCREATE OR REPLACE TABLE <uc-catalog>.<uc-schema>.<new-table>\nSHALLOW CLONE hive_metastore.<source-schema>.<source-table>;\n\n``` \nFor information about additional parameters, including table properties, see [CREATE TABLE CLONE](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-clone.html).\n2. Grant account-level users or groups access to the new table. See [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html).\n3. (Optional) Add a comment to the original Hive table that points users to the new Unity Catalog table. \nReturn to the original table in the `hive_metastore` catalog to add the table comment. To learn how to add table comments using Catalog Explorer, see [Add markdown comments to data objects using Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/markdown-data-comments.html#manual). 
To learn how to add table comments using SQL statements in a notebook or the SQL query editor, see [COMMENT ON](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-comment.html). \nIf you use the following syntax in the table comment, notebooks and SQL query editor queries that reference the deprecated Hive table will display the deprecated table name using strikethrough text, display the comment as a warning, and provide a **Quick Fix** link to Databricks Assistant, which can update your code to reference the new table. \n```\nThis table is deprecated. Please use catalog.default.table instead of hive_metastore.schema.table.\n\n``` \nSee [Add comments to indicate that a Hive table has been migrated](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#deprecated-comment).\n4. After the table is migrated, users should update their existing queries and workloads to use the new table. \nIf you added a comment to the original Hive table like the one listed in the optional previous step, you can use the **Quick Fix** link and Databricks Assistant to help you find and modify workloads.\n5. Before you drop the old table, test for dependencies by revoking access to it and re-running related queries and workloads. \nDon\u2019t drop the old table if you are still relying on deprecation comments to help you find and update existing code that references the old table. Likewise, don\u2019t drop the old table if you performed a shallow clone. 
Shallow clones reference data from the source Hive table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Upgrade Hive tables and views to Unity Catalog\n##### Upgrade a Hive table to a Unity Catalog managed table using CREATE TABLE AS SELECT\n\nIf you cannot use or prefer not to use `CREATE TABLE CLONE` to migrate a table in your Hive metastore to a managed table in Unity Catalog, you can create a new managed table in Unity Catalog by querying the Hive table using `CREATE TABLE AS SELECT`. For information about the differences between `CREATE TABLE CLONE` and `CREATE TABLE AS SELECT`, see [Hive to Unity Catalog migration options](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#comparison-table). \n### Requirements \n**Compute requirements**: \n* A compute resource that supports Unity Catalog. See [Before you begin](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#before). \n**Permission requirements**: \n* The `USE CATALOG` and `USE SCHEMA` privileges on the catalog and schema that you add the table to, along with `CREATE TABLE` on the schema, or you must be the owner of the catalog or schema. See [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html).\n* If your compute uses shared access mode, you need access to the tables in the Hive metastore, granted using legacy table access control. See [Before you begin](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#before). \n### Upgrade process \nTo upgrade a table in your Hive metastore to a managed table in Unity Catalog using `CREATE TABLE AS SELECT`: \n1. Create a new Unity Catalog table by querying the existing table. 
Replace the placeholder values: \n* `<uc-catalog>`: The Unity Catalog catalog for the new table.\n* `<uc-schema>`: The Unity Catalog schema for the new table.\n* `<new-table>`: A name for the Unity Catalog table.\n* `<source-schema>`: The schema for the Hive table, such as `default`.\n* `<source-table>`: The name of the Hive table. \n```\nCREATE TABLE <uc-catalog>.<uc-schema>.<new-table>\nAS SELECT * FROM hive_metastore.<source-schema>.<source-table>;\n\n``` \n```\ndf = spark.table(\"hive_metastore.<source-schema>.<source-table>\")\n\ndf.write.saveAsTable(\nname = \"<uc-catalog>.<uc-schema>.<new-table>\"\n)\n\n``` \n```\n%r\nlibrary(SparkR)\n\ndf = tableToDF(\"hive_metastore.<source-schema>.<source-table>\")\n\nsaveAsTable(\ndf = df,\ntableName = \"<uc-catalog>.<uc-schema>.<new-table>\"\n)\n\n``` \n```\nval df = spark.table(\"hive_metastore.<source-schema>.<source-table>\")\n\ndf.write.saveAsTable(\ntableName = \"<uc-catalog>.<uc-schema>.<new-table>\"\n)\n\n``` \nIf you want to migrate only some columns or rows, modify the `SELECT` statement. \nNote \nThe commands presented here create a [managed table](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#create-a-managed-table) in which data is copied into a dedicated *managed storage location*. If instead you want to create an [external table](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#create-an-external-table), where the table is registered in Unity Catalog without moving the data in cloud storage, see [Upgrade a single Hive table to a Unity Catalog external table using the upgrade wizard](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#migrate-external). See also [Specify a managed storage location in Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/managed-storage.html).\n2. Grant account-level users or groups access to the new table. See [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html).\n3. (Optional) Add a comment to the original Hive table that points users to the new Unity Catalog table. \nReturn to the original table in the `hive_metastore` catalog to add the table comment. 
To learn how to add table comments using Catalog Explorer, see [Add markdown comments to data objects using Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/markdown-data-comments.html#manual). To learn how to add table comments using SQL statements in a notebook or the SQL query editor, see [COMMENT ON](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-comment.html). \nIf you use the following syntax in the table comment, notebooks and SQL query editor queries that reference the deprecated Hive table will display the deprecated table name using strikethrough text, display the comment as a warning, and provide a **Quick Fix** link to Databricks Assistant, which can update your code to reference the new table. \n```\nThis table is deprecated. Please use catalog.default.table instead of hive_metastore.schema.table.\n\n``` \nSee [Add comments to indicate that a Hive table has been migrated](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#deprecated-comment).\n4. After the table is migrated, users should update their existing queries and workloads to use the new table. \nIf you added a comment to the original Hive table like the one listed in the optional previous step, you can use the **Quick Fix** link and Databricks Assistant to help you find and modify workloads.\n5. Before you drop the old table, test for dependencies by revoking access to it and re-running related queries and workloads. 
\nDon\u2019t drop the old table if you are still relying on deprecation comments to help you find and update existing code that references the old table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Upgrade Hive tables and views to Unity Catalog\n##### Upgrade a view to Unity Catalog\n\nAfter you upgrade all of a view\u2019s referenced tables to the same Unity Catalog metastore, you can [create a new view](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-views.html) that references the new tables.\n\n#### Upgrade Hive tables and views to Unity Catalog\n##### Add comments to indicate that a Hive table has been migrated\n\nWhen you add a comment to the deprecated Hive table that points users to the new Unity Catalog table, notebooks and SQL query editor queries that reference the deprecated Hive table will display the deprecated table name using strikethrough text, display the comment as a warning, and provide a **Quick Fix** link to Databricks Assistant, which can update your code to reference the new table. \n![Hive table deprecation warning](https:\/\/docs.databricks.com\/_images\/hive-migration-table-comment.png) \nYour comment must use the following format: \n```\nThis table is deprecated. Please use catalog.default.table instead of hive_metastore.schema.table.\n\n``` \nTo learn how to add table comments using Catalog Explorer, see [Add markdown comments to data objects using Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/markdown-data-comments.html#manual). 
To learn how to add table comments using SQL statements in a notebook or the SQL query editor, see [COMMENT ON](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-comment.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Upgrade Hive tables and views to Unity Catalog\n##### Use Databricks Assistant to update a deprecated table reference\n\nIf you see strikethrough text on a table name in a notebook cell or statement in the SQL query editor, hover over the table name to reveal a warning notice. If that warning notice describes the table as deprecated and displays the new table name, click **Quick Fix**, followed by **Fix Deprecation**. Databricks Assistant opens, offering to replace the deprecated table name with the new Unity Catalog table name. Follow the prompts to complete the task. \n![Video showing Hive table update using Databricks Assistant](https:\/\/docs.databricks.com\/_images\/hive-to-uc-sql.gif) \nSee also [Use Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/use-databricks-assistant.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create and manage catalogs\n\nThis article shows how to create and manage catalogs in Unity Catalog. A catalog contains [schemas (databases)](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-schemas.html), and a schema contains tables, views, volumes, models, and functions. \nNote \nIn some workspaces that were enabled for Unity Catalog automatically, a *workspace catalog* was created for you by default. If this catalog exists, all users in your workspace (and only your workspace) have access to it by default. 
See [Step 1: Confirm that your workspace is enabled for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#auto-enabled-check). \nNote \nTo learn how to create a *foreign catalog*, a Unity Catalog object that mirrors a database in an external data system, see [Create a foreign catalog](https:\/\/docs.databricks.com\/query-federation\/index.html#foreign-catalog). See also [Manage and work with foreign catalogs](https:\/\/docs.databricks.com\/query-federation\/foreign-catalogs.html).\n\n#### Create and manage catalogs\n##### Requirements\n\nTo create a catalog: \n* You must be a Databricks metastore admin or have the `CREATE CATALOG` privilege on the metastore.\n* You must have a Unity Catalog metastore [linked to the workspace](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html) where you perform the catalog creation.\n* The cluster that you use to run a notebook to create a catalog must use a Unity Catalog-compliant access mode. See [Access modes](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode). \nSQL warehouses always support Unity Catalog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create and manage catalogs\n##### Create a catalog\n\nTo create a catalog, you can use Catalog Explorer or a SQL command. \n1. Log in to a workspace that is linked to the metastore.\n2. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. Click the **Create Catalog** button.\n4. Select the catalog type that you want to create: \n* **Standard** catalog: a securable object that organizes data assets that are managed by Unity Catalog. 
For all use cases except Lakehouse Federation.\n* **Foreign** catalog: a securable object in Unity Catalog that mirrors a database in an external data system using Lakehouse Federation. See [Overview of Lakehouse Federation setup](https:\/\/docs.databricks.com\/query-federation\/index.html#setup-overview).\n5. (Optional but strongly recommended) Specify a managed storage location. Requires the `CREATE MANAGED STORAGE` privilege on the target external location. See [Specify a managed storage location in Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/managed-storage.html). \nImportant \nIf your workspace does not have a metastore-level storage location, you must specify a managed storage location when you create a catalog.\n6. Click **Create**.\n7. (Optional) Specify the workspace that the catalog is bound to. \nBy default, the catalog is shared with all workspaces attached to the current metastore. If the catalog will contain data that should be restricted to specific workspaces, go to the **Workspaces** tab and add those workspaces. \nFor more information, see [(Optional) Assign a catalog to specific workspaces](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#catalog-binding).\n8. Assign permissions for your catalog. See [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html). \n1. Run the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional. Replace the placeholder values: \n* `<catalog-name>`: A name for the catalog.\n* `<location-path>`: Optional but strongly recommended. Provide a storage location path if you want managed tables in this catalog to be stored in a location that is different than the default root storage configured for the metastore. \nImportant \nIf your workspace does not have a metastore-level storage location, you must specify a managed storage location when you create a catalog. 
\nThis path must be defined in an external location configuration, and you must have the `CREATE MANAGED STORAGE` privilege on the external location configuration. You can use the path that is defined in the external location configuration or a subpath (in other words, `'s3:\/\/depts\/finance'` or `'s3:\/\/depts\/finance\/product'`). Requires Databricks Runtime 11.3 and above.\n* `<comment>`: Optional description or other comment.\nNote \nIf you are creating a foreign catalog (a securable object in Unity Catalog that mirrors a database in an external data system, used for Lakehouse Federation), the SQL command is `CREATE FOREIGN CATALOG` and the options are different. See [Create a foreign catalog](https:\/\/docs.databricks.com\/query-federation\/index.html#foreign-catalog). \n```\nCREATE CATALOG [ IF NOT EXISTS ] <catalog-name>\n[ MANAGED LOCATION '<location-path>' ]\n[ COMMENT <comment> ];\n\n``` \nFor example, to create a catalog named `example`: \n```\nCREATE CATALOG IF NOT EXISTS example;\n\n``` \nIf you want to limit catalog access to specific workspaces in your account, also known as workspace-catalog binding, see [Bind a catalog to one or more workspaces](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#bind). \nFor parameter descriptions, see [CREATE CATALOG](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-catalog.html).\n2. Assign privileges to the catalog. See [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html). \nWhen you create a catalog, two schemas (databases) are automatically created: `default` and `information_schema`. \nYou can also create a catalog by using the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_catalog](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/catalog). 
You can retrieve information about catalogs by using [databricks\\_catalogs](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/data-sources\/catalogs).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create and manage catalogs\n##### (Optional) Assign a catalog to specific workspaces\n\nIf you use workspaces to isolate user data access, you may want to limit catalog access to specific workspaces in your account, also known as workspace-catalog binding. The default is to share the catalog with all workspaces attached to the current metastore. \nThe exception to this default is the *workspace catalog* that is created by default in workspaces that are enabled for Unity Catalog automatically (see [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement)). By default, this workspace catalog is bound only to your workspace, unless you choose to give other workspaces access to it. For important information about assigning permissions if you unbind this catalog, see [Unbind a catalog from a workspace](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#unbind). \nYou can allow read and write access to the catalog from a workspace (the default), or you can specify read-only access. If you specify read-only, then all write operations are blocked from that workspace to that catalog. \nTypical use cases for binding a catalog to specific workspaces include: \n* Ensuring that users can only access production data from a production workspace environment.\n* Ensuring that users can only process sensitive data from a dedicated workspace.\n* Giving users read-only access to production data from a developer workspace to enable development and testing. 
\nNote \nYou can also bind external locations and storage credentials to specific workspaces, limiting the ability to access data in external locations to privileged users in those workspaces. See [(Optional) Assign an external location to specific workspaces](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#workspace-binding) and [(Optional) Assign a storage credential to specific workspaces](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html#workspace-binding). \n### Workspace-catalog binding example \nTake the example of production and development isolation. If you specify that your production data catalogs can only be accessed from production workspaces, this supersedes any individual grants that are issued to users. \n![Catalog-workspace binding diagram](https:\/\/docs.databricks.com\/_images\/catalog-bindings.png) \nIn this diagram, `prod_catalog` is bound to two production workspaces. Suppose a user has been granted access to a table in `prod_catalog` called `my_table` (using `GRANT SELECT ON my_table TO `). If the user tries to access `my_table` in the Dev workspace, they receive an error message. The user can access `my_table` only from the Prod ETL and Prod Analytics workspaces. \nWorkspace-catalog bindings are respected in all areas of the platform. For example, if you query the information schema, you see only the catalogs accessible in the workspace where you issue the query. Data lineage and search UIs likewise show only the catalogs that are assigned to the workspace (whether using bindings or by default). \n### Bind a catalog to one or more workspaces \nTo assign a catalog to specific workspaces, you can use Catalog Explorer or the Unity Catalog REST API. \n**Permissions required**: Metastore admin or catalog owner. 
\nNote \nMetastore admins can see all catalogs in a metastore using Catalog Explorer\u2014and catalog owners can see all catalogs they own in a metastore\u2014regardless of whether the catalog is assigned to the current workspace. Catalogs that are not assigned to the workspace appear grayed out, and no child objects are visible or queryable. \n1. Log in to a workspace that is linked to the metastore.\n2. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. In the **Catalog** pane, on the left, click the catalog name. \nThe main Catalog Explorer pane defaults to the **Catalogs** list. You can also select the catalog there.\n4. On the **Workspaces** tab, clear the **All workspaces have access** checkbox. \nIf your catalog is already bound to one or more workspaces, this checkbox is already cleared.\n5. Click **Assign to workspaces** and enter or find the workspaces you want to assign.\n6. (Optional) Limit workspace access to read-only. \nOn the **Manage access level** menu, select **Change access to read-only**. \nYou can reverse this selection at any time by editing the catalog and selecting **Change access to read & write**. \nTo revoke access, go to the **Workspaces** tab, select the workspace, and click **Revoke**. \nThere are two APIs and two steps required to assign a catalog to a workspace. In the following examples, replace `` with your workspace instance name. To learn how to get the workspace instance name and workspace ID, see [Get identifiers for workspace objects](https:\/\/docs.databricks.com\/workspace\/workspace-details.html). To learn about getting access tokens, see [Authentication for Databricks automation - overview](https:\/\/docs.databricks.com\/dev-tools\/auth\/index.html). \n1. 
Use the `catalogs` API to set the catalog\u2019s `isolation mode` to `ISOLATED`: \n```\ncurl -L -X PATCH 'https:\/\/\/api\/2.1\/unity-catalog\/catalogs\/ \\\n-H 'Authorization: Bearer \\\n-H 'Content-Type: application\/json' \\\n--data-raw '{\n\"isolation_mode\": \"ISOLATED\"\n}'\n\n``` \nThe default `isolation mode` is `OPEN` to all workspaces attached to the metastore.\n2. Use the update `bindings` API to assign the workspaces to the catalog: \n```\ncurl -L -X PATCH 'https:\/\/\/api\/2.1\/unity-catalog\/bindings\/catalog\/ \\\n-H 'Authorization: Bearer \\\n-H 'Content-Type: application\/json' \\\n--data-raw '{\n\"add\": [{\"workspace_id\": , \"binding_type\": }...],\n\"remove\": [{\"workspace_id\": , \"binding_type\": \"}...]\n}'\n\n``` \nUse the `\"add\"` and `\"remove\"` properties to add or remove workspace bindings. `` can be either `\u201cBINDING_TYPE_READ_WRITE\u201d` (default) or `\u201cBINDING_TYPE_READ_ONLY\u201d`. \nTo list all workspace assignments for a catalog, use the list `bindings` API: \n```\ncurl -L -X GET 'https:\/\/\/api\/2.1\/unity-catalog\/bindings\/catalog\/ \\\n-H 'Authorization: Bearer \\\n\n``` \n### Unbind a catalog from a workspace \nInstructions for revoking workspace access to a catalog using Catalog Explorer or the `bindings` API are included in [Bind a catalog to one or more workspaces](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#bind). \nImportant \nIf your workspace was enabled for Unity Catalog automatically and you have a *workspace catalog*, workspace admins own that catalog and have all permissions on that catalog **in the workspace only**. If you unbind that catalog or bind it to other workspaces, you must grant any required permissions manually to the members of the workspace admins group as individual users or using account-level groups, because the workspace admins group is a workspace-local group. 
For more information about account groups vs workspace-local groups, see [Difference between account groups and workspace-local groups](https:\/\/docs.databricks.com\/admin\/users-groups\/groups.html#account-vs-workspace-group).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create and manage catalogs\n##### Add schemas to your catalog\n\nTo learn how to add schemas (databases) to your catalog, see [Create and manage schemas (databases)](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-schemas.html).\n\n#### Create and manage catalogs\n##### View catalog details\n\nTo view information about a catalog, you can use Catalog Explorer or a SQL command. \n1. Log in to a workspace that is linked to the metastore.\n2. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. In the **Catalog** pane, find the catalog and click its name. \nSome details are listed at the top of the page. Others can be viewed on the **Schemas**, **Details**, **Permissions**, and **Workspaces** tabs. \nRun the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional. Replace the placeholder ``. \nFor details, see [DESCRIBE CATALOG](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-describe-catalog.html). \n```\nDESCRIBE CATALOG ;\n\n``` \nUse `DESCRIBE CATALOG EXTENDED` to get full details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create and manage catalogs\n##### Delete a catalog\n\nTo delete (or drop) a catalog, you can use Catalog Explorer or a SQL command. To drop a catalog you must be its owner. \nYou must delete all schemas in the catalog except `information_schema` before you can delete a catalog. 
This includes the auto-created `default` schema. \n1. Log in to a workspace that is linked to the metastore.\n2. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. In the **Catalog** pane, on the left, click the catalog you want to delete.\n4. In the detail pane, click the three-dot menu to the left of the **Create database** button and select **Delete**.\n5. On the **Delete catalog** dialog, click **Delete**. \nRun the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional. Replace the placeholder ``. \nFor parameter descriptions, see [DROP CATALOG](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-drop-catalog.html). \nIf you use `DROP CATALOG` without the `CASCADE` option, you must delete all schemas in the catalog except `information_schema` before you can delete the catalog. This includes the auto-created `default` schema. \n```\nDROP CATALOG [ IF EXISTS ] [ RESTRICT | CASCADE ]\n\n``` \nFor example, to delete a catalog named `vaccine` and its schemas: \n```\nDROP CATALOG vaccine CASCADE\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create and manage catalogs\n##### Manage the default catalog\n\nA default catalog is configured for each workspace that is enabled for Unity Catalog. The default catalog lets you perform data operations without specifying a catalog. If you omit the top-level catalog name when you perform data operations, the default catalog is assumed. \nA workspace admin can view or switch the default catalog using the Admin Settings UI. You can also set the default catalog for a cluster using a Spark config. \nCommands that do not specify the catalog (for example `GRANT CREATE TABLE ON SCHEMA myschema TO mygroup`) are evaluated for the catalog in the following order: \n1. 
Is the catalog set for the session using a `USE CATALOG` statement or a JDBC setting?\n2. Is the Spark configuration `spark.databricks.sql.initial.catalog.namespace` set on the cluster?\n3. Is there a workspace default catalog set for the cluster? \n### The default catalog configuration when Unity Catalog is enabled \nThe default catalog that was initially configured for your workspace depends on how your workspace was enabled for Unity Catalog: \n* For some workspaces that were enabled for Unity Catalog automatically, the *workspace catalog* was set as the default catalog. See [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement).\n* For all other workspaces, the `hive_metastore` catalog was set as the default catalog. \nIf you are transitioning from the Hive metastore to Unity Catalog within an existing workspace, it typically makes sense to use `hive_metastore` as the default catalog to avoid impacting existing code that references the hive metastore. \n### Change the default catalog \nA workspace admin can change the default catalog for the workspace. Anyone with permission to create or edit a cluster can set a different default catalog for the cluster. \nWarning \nChanging the default catalog can break existing data operations that depend on it. \nTo configure a different default catalog for a workspace: \n1. Log in to your workspace as a workspace admin.\n2. Click your username in the top bar of the workspace and select **Settings** from the dropdown.\n3. Click the **Advanced** tab.\n4. On the **Default catalog for the workspace** row, enter the catalog name and click **Save**. \nRestart your SQL warehouses and clusters for the change to take effect. All new and restarted SQL warehouses and clusters will use this catalog as the workspace default. \nYou can also override the default catalog for a specific cluster by setting the following Spark configuration on the cluster. 
This approach is not available for SQL warehouses: \n```\nspark.databricks.sql.initial.catalog.name\n\n``` \nFor instructions, see [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration). \n### View the current default catalog \nTo get the current default catalog for your workspace, you can use a SQL statement in a notebook or SQL Editor query. A workspace admin can get the default catalog using the Admin Settings UI. \n1. Log in to your workspace as a workspace admin.\n2. Click your username in the top bar of the workspace and select **Settings** from the dropdown.\n3. Click the **Advanced** tab.\n4. On the **Default catalog for the workspace** row, view the catalog name. \nRun the following command in a notebook or SQL Editor query that is running on a SQL warehouse or Unity Catalog-compliant cluster. The workspace default catalog is returned as long as no `USE CATALOG` statement or JDBC setting has been set on the session, and as long as no `spark.databricks.sql.initial.catalog.namespace` config is set for the cluster. \n```\nSELECT current_catalog();\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html"} +{"content":"# What is Databricks?\n### Databricks concepts\n\nThis article introduces the set of fundamental concepts you need to understand in order to use Databricks effectively.\n\n### Databricks concepts\n#### Accounts and workspaces\n\nIn Databricks, a *workspace* is a Databricks deployment in the cloud that functions as an environment for your team to access Databricks assets. Your organization can choose to have either multiple workspaces or just one, depending on its needs. \nA Databricks *account* represents a single entity that can include multiple workspaces. 
Accounts enabled for [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) can be used to manage users and their access to data centrally across all of the workspaces in the account. Billing and support are also handled at the account level.\n\n### Databricks concepts\n#### Billing: Databricks units (DBUs)\n\nDatabricks bills based on Databricks units (DBUs), units of processing capability per hour based on VM instance type. \nSee the [Databricks on AWS pricing estimator](https:\/\/databricks.com\/product\/aws-pricing\/instance-types).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/concepts.html"} +{"content":"# What is Databricks?\n### Databricks concepts\n#### Authentication and authorization\n\nThis section describes concepts that you need to know when you manage Databricks identities and their access to Databricks assets. \n### User \nA unique individual who has access to the system. User identities are represented by email addresses. See [Manage users](https:\/\/docs.databricks.com\/admin\/users-groups\/users.html). \n### Service principal \nA service identity for use with jobs, automated tools, and systems such as scripts, apps, and CI\/CD platforms. Service principals are represented by an application ID. See [Manage service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html). \n### Group \nA collection of identities. Groups simplify identity management, making it easier to assign access to workspaces, data, and other securable objects. All Databricks identities can be assigned as members of groups. See [Manage groups](https:\/\/docs.databricks.com\/admin\/users-groups\/groups.html) \n### Access control list (ACL) \nA list of permissions attached to the workspace, cluster, job, table, or experiment. An ACL specifies which users or system processes are granted access to the objects, as well as what operations are allowed on the assets. 
Each entry in a typical ACL specifies a subject and an operation. See [Access control lists](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html) \n### Personal access token \nAn opaque string is used to authenticate to the REST API and by tools in the [Technology partners](https:\/\/docs.databricks.com\/integrations\/index.html) to connect to SQL warehouses. See [Databricks personal access token authentication](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). \n### UI \nThe Databricks UI is a graphical interface for interacting with features, such as workspace folders and their contained objects, data objects, and computational resources.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/concepts.html"} +{"content":"# What is Databricks?\n### Databricks concepts\n#### Data science & engineering\n\n[Data science & engineering](https:\/\/docs.databricks.com\/workspace-index.html) tools aid collaboration among data scientists, data engineers, and data analysts. This section describes the fundamental concepts. \n### Workspace \nA [workspace](https:\/\/docs.databricks.com\/workspace\/index.html) is an environment for accessing all of your Databricks assets. A workspace organizes objects (notebooks, libraries, dashboards, and experiments) into [folders](https:\/\/docs.databricks.com\/workspace\/workspace-objects.html#folders) and provides access to data objects and computational resources. \n### Notebook \nA web-based interface for creating data science and machine learning workflows that can contain runnable commands, visualizations, and narrative text. See [Introduction to Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html). \n### Dashboard \nAn interface that provides organized access to visualizations. See [Dashboards in notebooks](https:\/\/docs.databricks.com\/notebooks\/dashboards.html). \n### Library \nA package of code available to the notebook or job running on your cluster. 
Databricks runtimes include many [libraries](https:\/\/docs.databricks.com\/libraries\/index.html) and you can add your own. \n### Git folder (formerly Repos) \nA folder whose contents are co-versioned together by syncing them to a remote Git repository. [Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html) integrate with Git to provide source and version control for your projects. \n### Experiment \nA collection of [MLflow runs](https:\/\/docs.databricks.com\/mlflow\/tracking.html) for training a machine learning model. See [Organize training runs with MLflow experiments](https:\/\/docs.databricks.com\/mlflow\/experiments.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/concepts.html"} +{"content":"# What is Databricks?\n### Databricks concepts\n#### Databricks interfaces\n\nThis section describes the interfaces that Databricks supports, in addition to the UI, for accessing your assets: API and command-line (CLI). \n### REST API \nThe Databricks REST API provides endpoints for modifying or requesting information about Databricks account and workspace objects. See [account reference](https:\/\/docs.databricks.com\/api\/account\/introduction) and [workspace reference](https:\/\/docs.databricks.com\/api\/workspace\/introduction). \n### CLI \nThe Databricks CLI is hosted on [GitHub](https:\/\/github.com\/databricks\/cli). The CLI is built on top of the Databricks REST API.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/concepts.html"} +{"content":"# What is Databricks?\n### Databricks concepts\n#### Data management\n\nThis section describes the objects that hold the data on which you perform analytics and feed into machine learning algorithms. \n### Unity Catalog \nUnity Catalog is a unified governance solution for data and AI assets on Databricks that provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces. 
See [What is Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). \n### DBFS root \nImportant \nStoring and accessing data using DBFS root or DBFS mounts is a deprecated pattern and not recommended by Databricks. Instead, Databricks recommends using Unity Catalog to manage access to all data. See [What is Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). \nThe DBFS root is a storage location available to all users by default. See [What is DBFS?](https:\/\/docs.databricks.com\/dbfs\/index.html). \n### Database \nA collection of data objects, such as tables or views and functions, that is organized so that it can be easily accessed, managed, and updated. See [What is a database?](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#database) \n### Table \nA representation of structured data. You query tables with Apache Spark SQL and Apache Spark APIs. See [What is a table?](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#table) \n### Delta table \nBy default, all tables created in Databricks are Delta tables. Delta tables are based on the [Delta Lake open source project](https:\/\/delta.io\/), a framework for high-performance ACID table storage over cloud object stores. A Delta table stores data as a directory of files on cloud object storage and registers table metadata to the metastore within a catalog and schema. \nFind out more about [technologies branded as Delta](https:\/\/docs.databricks.com\/introduction\/delta-comparison.html). \n### Metastore \nThe component that stores all the structure information of the various tables and partitions in the data warehouse including column and column type information, the serializers and deserializers necessary to read and write data, and the corresponding files where the data is stored. 
See [What is a metastore?](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#metastore) \nEvery Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata. You also have the option to use an existing [external Hive metastore](https:\/\/docs.databricks.com\/archive\/external-metastores\/external-hive-metastore.html). \n### Visualization \nA graphical presentation of the result of running a query. See [Visualizations in Databricks notebooks](https:\/\/docs.databricks.com\/visualizations\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/concepts.html"} +{"content":"# What is Databricks?\n### Databricks concepts\n#### Computation management\n\nThis section describes concepts that you need to know to run computations in Databricks. \n### Cluster \nA set of computation resources and configurations on which you run notebooks and jobs. There are two types of clusters: all-purpose and job. See [Compute](https:\/\/docs.databricks.com\/compute\/index.html). \n* You create an *all-purpose cluster* using the UI, CLI, or REST API. You can manually terminate and restart an all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.\n* The Databricks job scheduler creates *a job cluster* when you run a [job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) on a *new job cluster* and terminates the cluster when the job is complete. You *cannot* restart a job cluster. \n### Pool \nA set of idle, ready-to-use instances that reduce cluster start and auto-scaling times. When attached to a pool, a cluster allocates its driver and worker nodes from the pool. See [Pool configuration reference](https:\/\/docs.databricks.com\/compute\/pools.html). \nIf the pool does not have sufficient idle resources to accommodate the cluster\u2019s request, the pool expands by allocating new instances from the instance provider. 
When an attached cluster is terminated, the instances it used are returned to the pool and can be reused by a different cluster. \n### Databricks runtime \nThe set of core components that run on the clusters managed by Databricks. See [Compute](https:\/\/docs.databricks.com\/compute\/index.html). Databricks has the following runtimes: \n* [Databricks Runtime](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html) includes Apache Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics.\n* [Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html) is built on Databricks Runtime and provides prebuilt machine learning infrastructure that is integrated with all of the capabilities of the Databricks workspace. It contains multiple popular libraries, including TensorFlow, Keras, PyTorch, and XGBoost. \n### Workflows \nFrameworks to develop and run data processing pipelines: \n* [Jobs](https:\/\/docs.databricks.com\/workflows\/index.html#what-is-jobs): A non-interactive mechanism for running a notebook or library either immediately or on a scheduled basis.\n* [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html): A framework for building reliable, maintainable, and testable data processing pipelines. \nSee [Introduction to Databricks Workflows](https:\/\/docs.databricks.com\/workflows\/index.html). \n### Workload \nDatabricks identifies two types of workloads subject to different [pricing](https:\/\/databricks.com\/product\/pricing) schemes: data engineering (job) and data analytics (all-purpose). \n* **Data engineering** An (automated) workload runs on *a job cluster* which the Databricks job scheduler creates for each workload.\n* **Data analytics** An (interactive) workload runs on an *all-purpose cluster*. 
Interactive workloads typically run commands within a Databricks [notebook](https:\/\/docs.databricks.com\/notebooks\/index.html). However, running a *job* on an *existing all-purpose* cluster is also treated as an interactive workload. \n### Execution context \nThe state for a read\u2013eval\u2013print loop (REPL) environment for each supported programming language. The languages supported are Python, R, Scala, and SQL.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/concepts.html"} +{"content":"# What is Databricks?\n### Databricks concepts\n#### Machine learning\n\n[Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html) on Databricks is an integrated end-to-end environment incorporating managed services for experiment tracking, model training, feature development and management, and feature and model serving. \n### Experiments \nThe main unit of organization for tracking machine learning model development. See [Organize training runs with MLflow experiments](https:\/\/docs.databricks.com\/mlflow\/experiments.html). Experiments organize, display, and control access to individual [logged runs of model training code](https:\/\/docs.databricks.com\/mlflow\/tracking.html). \n### Feature Store \nA centralized repository of features. See [What is a feature store?](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html) Feature Store enables feature sharing and discovery across your organization and also ensures that the same feature computation code is used for model training and inference. 
\n### Models & model registry \nA [trained machine learning or deep learning model](https:\/\/docs.databricks.com\/machine-learning\/train-model\/index.html) that has been registered in [Model Registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/concepts.html"} +{"content":"# What is Databricks?\n### Databricks concepts\n#### SQL\n\n### SQL REST API \nAn interface that allows you to automate tasks on SQL objects. See [SQL API](https:\/\/docs.databricks.com\/api\/workspace\/statementexecution). \n### Dashboard \nA presentation of data visualizations and commentary. See [Dashboards](https:\/\/docs.databricks.com\/dashboards\/index.html). For legacy dashboards, see [Legacy dashboards](https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html). \n### SQL queries \nThis section describes concepts that you need to know to run SQL queries in Databricks. \n* **[Query](https:\/\/docs.databricks.com\/sql\/user\/queries\/index.html)**: A valid SQL statement.\n* **[SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html)**: A computation resource on which you execute SQL queries.\n* **[Query history](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html)**: A list of executed queries and their performance characteristics.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/concepts.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### Data objects in the Databricks lakehouse\n\nThe Databricks lakehouse organizes data stored with Delta Lake in cloud object storage with familiar relations like database, tables, and views. This model combines many of the benefits of an enterprise data warehouse with the scalability and flexibility of a data lake. 
Learn more about how this model works, and the relationship between object data and metadata so that you can apply best practices when designing and implementing Databricks lakehouse for your organization.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/data-objects.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### Data objects in the Databricks lakehouse\n##### What data objects are in the Databricks lakehouse?\n\nThe Databricks lakehouse architecture combines data stored with the Delta Lake protocol in cloud object storage with metadata registered to a [metastore](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#metastore). There are five primary objects in the Databricks lakehouse: \n* **[Catalog](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#catalog)**: a grouping of databases.\n* **[Database](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#database)** or schema: a grouping of objects in a catalog. Databases contain tables, views, and functions.\n* **[Table](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#table)**: a collection of rows and columns stored as data files in object storage.\n* **[View](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#view)**: a saved query typically against one or more tables or data sources.\n* **[Function](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#function)**: saved logic that returns a scalar value or set of rows. 
\n![Unity Catalog object model diagram](https:\/\/docs.databricks.com\/_images\/object-model.png) \nFor information on securing objects with Unity Catalog, see [securable objects model](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#object-model).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/data-objects.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### Data objects in the Databricks lakehouse\n##### What is a metastore?\n\nThe metastore contains all of the metadata that defines data objects in the lakehouse. Databricks provides the following metastore options: \n* **[Unity Catalog metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html)**: Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities. You create Unity Catalog metastores at the Databricks account level, and a single metastore can be used across multiple workspaces. \nEach Unity Catalog metastore is configured with a root storage location in an S3 bucket in your AWS account. This storage location is used by default for storing data for managed tables. \nIn Unity Catalog, data is secure by default. Initially, users have no access to data in a metastore. Access can be granted by either a metastore admin or the owner of an object. Securable objects in Unity Catalog are hierarchical and privileges are inherited downward. Unity Catalog offers a single place to administer data access policies. Users can access data in Unity Catalog from any workspace that the metastore is attached to. For more information, see [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html).\n* **Built-in Hive metastore (legacy)**: Each Databricks workspace includes a built-in Hive metastore as a managed service. 
An instance of the metastore deploys to each cluster and securely accesses metadata from a central repository for each customer workspace. \nThe Hive metastore provides a less centralized data governance model than Unity Catalog. By default, a cluster allows all users to access all data managed by the workspace\u2019s built-in Hive metastore unless table access control is enabled for that cluster. For more information, see [Hive metastore table access control (legacy)](https:\/\/docs.databricks.com\/data-governance\/table-acls\/index.html). \nTable access controls are not stored at the account-level, and therefore they must be configured separately for each workspace. To take advantage of the centralized and streamlined data governance model provided by Unity Catalog, Databricks recommends that you [upgrade the tables managed by your workspace\u2019s Hive metastore to the Unity Catalog metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html). \n* **[External Hive metastore (legacy)](https:\/\/docs.databricks.com\/archive\/external-metastores\/index.html)**: You can also bring your own metastore to Databricks. Databricks clusters can connect to existing external Apache Hive metastores or the AWS Glue Data Catalog. You can use table access control to manage permissions in an external metastore. Table access controls are not stored in the external metastore, and therefore they must be configured separately for each workspace. Databricks recommends that you use Unity Catalog instead for its simplicity and account-centered governance model. 
\nRegardless of the metastore that you use, Databricks stores all table data in object storage in your cloud account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/data-objects.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### Data objects in the Databricks lakehouse\n##### What is a catalog?\n\nA catalog is the highest abstraction (or coarsest grain) in the Databricks lakehouse relational model. Every database will be associated with a catalog. Catalogs exist as objects within a metastore. \nBefore the introduction of Unity Catalog, Databricks used a two-tier namespace. Catalogs are the third tier in the Unity Catalog namespacing model: \n```\ncatalog_name.database_name.table_name\n\n``` \nThe built-in Hive metastore only supports a single catalog, `hive_metastore`.\n\n#### Data objects in the Databricks lakehouse\n##### What is a database?\n\nA database is a collection of data objects, such as tables or views (also called \u201crelations\u201d), and functions. In Databricks, the terms \u201cschema\u201d and \u201cdatabase\u201d are used interchangeably (whereas in many relational systems, a database is a collection of schemas). \nDatabases will always be associated with a location on cloud object storage. You can optionally specify a `LOCATION` when registering a database, keeping in mind that: \n* The `LOCATION` associated with a database is always considered a managed location.\n* Creating a database does not create any files in the target location.\n* The `LOCATION` of a database will determine the default location for data of all tables registered to that database.\n* Successfully dropping a database will recursively drop all data and files stored in a managed location. \nThis interaction between locations managed by database and data files is very important. 
To avoid accidentally deleting data: \n* Do not share database locations across multiple database definitions.\n* Do not register a database to a location that already contains data.\n* To manage data life cycle independently of database, save data to a location that is not nested under any database locations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/data-objects.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### Data objects in the Databricks lakehouse\n##### What is a table?\n\nA Databricks table is a collection of structured data. A Delta table stores data as a directory of files on cloud object storage and registers table metadata to the metastore within a catalog and schema. As Delta Lake is the default format for tables created in Databricks, all tables created in Databricks are Delta tables, by default. Because Delta tables store data in cloud object storage and provide references to data through a metastore, users across an organization can access data using their preferred APIs; on Databricks, this includes SQL, Python, PySpark, Scala, and R. \nNote that it is possible to create tables on Databricks that are not Delta tables. These tables are not backed by Delta Lake, and will not provide the ACID transactions and optimized performance of Delta tables. Tables falling into this category include tables registered against data in external systems and tables registered against other file formats in the data lake. See [Connect to data sources](https:\/\/docs.databricks.com\/connect\/index.html). \nThere are two kinds of tables in Databricks, [managed](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#managed-table) and [unmanaged](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#unmanaged-table) (or external) tables. 
\nNote \nThe [Delta Live Tables](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#dlt) distinction between live tables and streaming live tables is not enforced from the table perspective. \n### What is a managed table? \nDatabricks manages both the metadata and the data for a managed table; when you drop a table, you also delete the underlying data. Data analysts and other users that mostly work in SQL may prefer this behavior. Managed tables are the default when creating a table. The data for a managed table resides in the `LOCATION` of the database it is registered to. This managed relationship between the data location and the database means that in order to move a managed table to a new database, you must rewrite all data to the new location. \nThere are a number of ways to create managed tables, including: \n```\nCREATE TABLE table_name AS SELECT * FROM another_table\n\n``` \n```\nCREATE TABLE table_name (field_name1 INT, field_name2 STRING)\n\n``` \n```\ndf.write.saveAsTable(\"table_name\")\n\n``` \n### What is an unmanaged table? \nDatabricks only manages the metadata for unmanaged (external) tables; when you drop a table, you do not affect the underlying data. Unmanaged tables will always specify a `LOCATION` during table creation; you can either register an existing directory of data files as a table or provide a path when a table is first defined. Because data and metadata are managed independently, you can rename a table or register it to a new database without needing to move any data. Data engineers often prefer unmanaged tables and the flexibility they provide for production data. 
\nThere are a number of ways to create unmanaged tables, including: \n```\nCREATE TABLE table_name\nUSING DELTA\nLOCATION '\/path\/to\/existing\/data'\n\n``` \n```\nCREATE TABLE table_name\n(field_name1 INT, field_name2 STRING)\nLOCATION '\/path\/to\/empty\/directory'\n\n``` \n```\ndf.write.option(\"path\", \"\/path\/to\/empty\/directory\").saveAsTable(\"table_name\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/data-objects.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### Data objects in the Databricks lakehouse\n##### What is a view?\n\nA view stores the text for a query typically against one or more data sources or tables in the metastore. In Databricks, a view is equivalent to a Spark DataFrame persisted as an object in a database. Unlike DataFrames, you can query views from any part of the Databricks product, assuming you have permission to do so. Creating a view does not process or write any data; only the query text is registered to the metastore in the associated database.\n\n#### Data objects in the Databricks lakehouse\n##### What is a temporary view?\n\nA temporary view has a limited scope and persistence and is not registered to a schema or catalog. The lifetime of a temporary view differs based on the environment you\u2019re using: \n* In notebooks and jobs, temporary views are scoped to the notebook or script level. They cannot be referenced outside of the notebook in which they are declared, and will no longer exist when the notebook detaches from the cluster.\n* In Databricks SQL, temporary views are scoped to the query level. Multiple statements within the same query can use the temp view, but it cannot be referenced in other queries, even within the same dashboard.\n* Global temporary views are scoped to the cluster level and can be shared between notebooks or jobs that share computing resources. 
Databricks recommends using views with appropriate table ACLs instead of global temporary views.\n\n#### Data objects in the Databricks lakehouse\n##### What is a function?\n\nFunctions allow you to associate user-defined logic with a database. Functions can return either scalar values or sets of rows. Functions are used to aggregate data. Databricks allows you to save functions in various languages depending on your execution context, with SQL being broadly supported. You can use functions to provide managed access to custom logic across a variety of contexts on the Databricks product.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/data-objects.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### Data objects in the Databricks lakehouse\n##### How do relational objects work in Delta Live Tables?\n\n[Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html) uses declarative syntax to define and manage DDL, DML, and infrastructure deployment. Delta Live Tables uses the concept of a \u201cvirtual schema\u201d during logic planning and execution. Delta Live Tables can interact with other databases in your Databricks environment, and Delta Live Tables can publish and persist tables for querying elsewhere by specifying a target database in the pipeline configuration settings. \nAll tables created in Delta Live Tables are Delta tables. When using Unity Catalog with Delta Live Tables, all tables are Unity Catalog managed tables. If Unity Catalog is not active, tables can be declared as either managed or unmanaged tables. \nWhile views can be declared in Delta Live Tables, these should be thought of as temporary views scoped to the pipeline. Temporary tables in Delta Live Tables are a unique concept: these tables persist data to storage but do not publish data to the target database. 
\nSome operations, such as `APPLY CHANGES INTO`, will register both a table and view to the database; the table name will begin with an underscore (`_`) and the view will have the table name declared as the target of the `APPLY CHANGES INTO` operation. The view queries the corresponding hidden table to materialize the results.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/data-objects.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Download lakehouse reference architectures\n\nThis article covers architectural guidance for the lakehouse in terms of data source, ingestion, transformation, querying and processing, serving, analysis\/output, and storage. \nEach reference architecture has a downloadable PDF in 11 x 17 (A3) format.\n\n### Download lakehouse reference architectures\n#### Generic reference architecture\n\n![Generic reference architecture of the lakehouse](https:\/\/docs.databricks.com\/_images\/ref-arch-overview-generic.png) \n**[Download: Generic lakehouse reference architecture for Databricks (PDF)](https:\/\/docs.databricks.com\/_extras\/documents\/reference-architecture-databricks-generic.pdf)**\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reference.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Download lakehouse reference architectures\n#### Organization of the reference architectures\n\nThe reference architecture is structured along the swim lanes *Source*, *Ingest*, *Transform*, *Query and Process*, *Serve*, *Analysis*, and *Storage*: \n* **Source** \nThe architecture distinguishes between semi-structured and unstructured data (sensors and IoT, media, files\/logs), and structured data (RDBMS, business applications). 
SQL sources (RDBMS) can also be integrated into the lakehouse and [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) without ETL through [lakehouse federation](https:\/\/docs.databricks.com\/query-federation\/index.html). In addition, data might be loaded from other cloud providers.\n* **Ingest** \nData can be ingested into the lakehouse via batch or streaming: \n+ Files delivered to cloud storage can be loaded directly using the Databricks [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html).\n+ For batch ingestion of data from enterprise applications into [Delta Lake](https:\/\/docs.databricks.com\/delta\/index.html), the [Databricks lakehouse](https:\/\/docs.databricks.com\/lakehouse\/index.html) relies on [partner ingest tools](https:\/\/docs.databricks.com\/partner-connect\/ingestion.html) with specific adapters for these systems of record.\n+ Streaming events can be ingested directly from event streaming systems such as Kafka using Databricks [Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/index.html). Streaming sources can be sensors, IoT, or [change data capture](https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html) processes.\n* **Storage** \nData is typically stored in the cloud storage system where the ETL pipelines use the [medallion architecture](https:\/\/docs.databricks.com\/lakehouse\/medallion.html) to store data in a curated way as [Delta files\/tables](https:\/\/delta.io\/).\n* **Transform** and **Query and process** \nThe Databricks lakehouse uses its engines [Apache Spark](https:\/\/docs.databricks.com\/spark\/index.html) and [Photon](https:\/\/docs.databricks.com\/compute\/photon.html) for all transformations and queries. 
\nDue to its simplicity, the declarative framework DLT ([Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html)) is a good choice for building reliable, maintainable, and testable data processing pipelines. \nPowered by Apache Spark and Photon, the Databricks Data Intelligence Platform supports both types of workloads: SQL queries via [SQL warehouses](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html), and SQL, Python and Scala workloads via workspace [clusters](https:\/\/docs.databricks.com\/compute\/index.html). \nFor data science (ML Modeling and [Gen AI](https:\/\/docs.databricks.com\/generative-ai\/generative-ai.html)), the Databricks [AI and Machine Learning platform](https:\/\/docs.databricks.com\/machine-learning\/index.html) provides specialized ML runtimes for [AutoML](https:\/\/docs.databricks.com\/machine-learning\/automl\/index.html) and for coding ML jobs. All data science and [MLOps workflows](https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html) are best supported by [MLflow](https:\/\/docs.databricks.com\/mlflow\/index.html).\n* **Serve** \nFor DWH and BI use cases, the Databricks lakehouse provides [Databricks SQL](https:\/\/docs.databricks.com\/sql\/index.html), the data warehouse powered by [SQL warehouses](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html) and [serverless SQL warehouses](https:\/\/docs.databricks.com\/admin\/sql\/serverless.html). \nFor machine learning, [model serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) is a scalable, real-time, enterprise-grade model serving capability hosted in the Databricks control plane. \nOperational databases: [External systems](https:\/\/docs.databricks.com\/connect\/external-systems\/index.html), such as operational databases, can be used to store and deliver final data products to user applications. 
\nCollaboration: Business partners get secure access to the data they need through [Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html). Based on Delta Sharing, the [Databricks Marketplace](https:\/\/docs.databricks.com\/marketplace\/index.html) is an open forum for exchanging data products.\n* **Analysis** \nThe final business applications are in this swim lane. Examples include custom clients such as AI applications connected to [Databricks Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) for real-time inference or applications that access data pushed from the lakehouse to an operational database. \nFor BI use cases, analysts typically use [BI tools to access the data warehouse](https:\/\/docs.databricks.com\/partner-connect\/bi.html). SQL developers can additionally use the [Databricks SQL Editor](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html) (not shown in the diagram) for queries and dashboarding. \nThe Data Intelligence Platform also offers [dashboards](https:\/\/docs.databricks.com\/dashboards\/index.html) to build data visualizations and share insights.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reference.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Download lakehouse reference architectures\n#### Capabilities for your workloads\n\nIn addition, the Databricks lakehouse comes with management capabilities that support all workloads: \n* **Data and AI governance** \nThe central data and AI governance system in the Databricks Data Intelligence Platform is [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). 
Unity Catalog provides a single place to manage data access policies that apply across all workspaces and supports all assets created or used in the lakehouse, such as tables, volumes, features ([feature store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html)), and models ([model registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html)). Unity Catalog can also be used to [capture runtime data lineage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html) across queries run on Databricks. \nDatabricks [lakehouse monitoring](https:\/\/docs.databricks.com\/lakehouse-monitoring\/index.html) allows you to monitor the quality of the data in all of the tables in your account. It can also track the performance of [machine learning models and model-serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/monitor-diagnose-endpoints.html). \nFor Observability, [system tables](https:\/\/docs.databricks.com\/admin\/system-tables\/index.html) is a Databricks-hosted analytical store of your account\u2019s operational data. System tables can be used for historical observability across your account.\n* **Data intelligence engine** \nThe Databricks Data Intelligence Platform allows your entire organization to use data and AI. It is powered by [DatabricksIQ](https:\/\/docs.databricks.com\/databricksiq\/index.html) and combines generative AI with the unification benefits of a lakehouse to understand the unique semantics of your data. \nThe [Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html) is available in Databricks notebooks, SQL editor, and file editor as a context-aware AI assistant for developers. \n* **Orchestration** \n[Databricks Workflows](https:\/\/docs.databricks.com\/workflows\/index.html) orchestrate data processing, machine learning, and analytics pipelines in the Databricks Data Intelligence Platform. 
Workflows has fully managed orchestration services integrated into the Databricks platform, including [Databricks Jobs](https:\/\/docs.databricks.com\/workflows\/index.html#what-is-databricks-jobs) to run non-interactive code in your Databricks workspace and [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html) to build reliable and maintainable ETL pipelines.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reference.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Download lakehouse reference architectures\n#### The Data Intelligence Platform reference architecture on AWS\n\nThe AWS reference architecture is derived from the [generic reference architecture](https:\/\/docs.databricks.com\/lakehouse-architecture\/reference.html#gen-ref-arch) by adding AWS-specific services for the Source, Ingest, Serve, Analysis, and Storage elements. \n![Reference architecture for the Databricks lakehouse on AWS](https:\/\/docs.databricks.com\/_images\/ref-arch-overview-aws.png) \n**[Download: Reference architecture for the Databricks lakehouse on AWS](https:\/\/docs.databricks.com\/_extras\/documents\/reference-architecture-databricks-on-aws.pdf)** \nThe AWS reference architecture shows the following AWS-specific services for Ingest, Storage, Serve, and Analysis\/Output: \n* Amazon Redshift as a source for Lakehouse Federation\n* Amazon AppFlow and AWS Glue for batch ingest\n* AWS IoT Core, Amazon Kinesis, and AWS DMS for streaming ingest\n* Amazon S3 as the object storage\n* Amazon RDS and Amazon DynamoDB as operational databases\n* Amazon QuickSight as BI tool\n* Amazon Bedrock as a unified API to foundation models from leading AI startups and Amazon \nNote \n* This view of the reference architecture focuses only on AWS services and the Databricks lakehouse. 
The lakehouse on Databricks is an open platform that integrates with a [large ecosystem of partner tools](https:\/\/docs.databricks.com\/integrations\/index.html).\n* The cloud provider services shown are not exhaustive. They are selected to illustrate the concept.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reference.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Download lakehouse reference architectures\n#### Use case: Batch ETL\n\n![Batch ETL reference architecture for Databricks on AWS](https:\/\/docs.databricks.com\/_images\/aws-ref-arch-batch.png) \n**[Download: Batch ETL reference architecture for Databricks on AWS](https:\/\/docs.databricks.com\/_extras\/documents\/reference-use-case-batch-for-aws.pdf)** \nIngest tools use source-specific adapters to read data from the source and then either store it in cloud storage, from where Auto Loader can read it, or call Databricks directly (for example, with partner ingest tools integrated into the Databricks lakehouse). To load the data, the Databricks ETL and processing engine runs the queries via DLT. Single or multitask jobs can be orchestrated by Databricks Workflows and governed by Unity Catalog (access control, audit, lineage, and so on). 
If low-latency operational systems require access to specific golden tables, they can be exported to an operational database such as an RDBMS or key-value store at the end of the ETL pipeline.\n\n### Download lakehouse reference architectures\n#### Use case: Streaming and change data capture (CDC)\n\n![Spark structured streaming architecture on Databricks on AWS](https:\/\/docs.databricks.com\/_images\/aws-ref-arch-streaming-cdc.png) \n**[Download: Spark structured streaming architecture for Databricks on AWS](https:\/\/docs.databricks.com\/_extras\/documents\/reference-use-case-streaming-cdc-for-aws.pdf)** \nThe Databricks ETL engine uses Spark Structured Streaming to read from event queues such as Apache Kafka or Amazon Kinesis. The downstream steps follow the approach of the Batch ETL use case above. \nReal-time change data capture (CDC) typically uses an event queue to store the extracted events. From there, the use case follows the streaming use case. \nIf CDC is done in batch, where the extracted records are stored in cloud storage first, Databricks Auto Loader can read them and the use case follows Batch ETL.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reference.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Download lakehouse reference architectures\n#### Use case: Machine learning and AI\n\n![Machine learning and AI reference architecture for Databricks on AWS](https:\/\/docs.databricks.com\/_images\/aws-ref-arch-ai.png) \n**[Download: Machine learning and AI reference architecture for Databricks on AWS](https:\/\/docs.databricks.com\/_extras\/documents\/reference-use-case-ai-for-aws.pdf)** \nFor machine learning, the Databricks Data Intelligence Platform provides Mosaic AI, which comes with state-of-the-art machine and deep learning libraries. 
It provides capabilities such as Feature Store and model registry (both integrated into Unity Catalog), low-code features with AutoML, and MLflow integration into the data science lifecycle. \nAll data science-related assets (tables, features, and models) are governed by Unity Catalog and data scientists can use Databricks Workflows to orchestrate their jobs. \nFor deploying models in a scalable and enterprise-grade way, use the MLOps capabilities to publish the models in model serving.\n\n### Download lakehouse reference architectures\n#### Use case: Retrieval Augmented Generation (Gen AI)\n\n![Gen AI RAG reference architecture for Databricks on AWS](https:\/\/docs.databricks.com\/_images\/aws-ref-arch-ai-rag.png) \n**[Download: Gen AI RAG reference architecture for Databricks on AWS](https:\/\/docs.databricks.com\/_extras\/documents\/reference-use-case-gen-ai-rag-for-aws.pdf)** \nFor generative AI use cases, Mosaic AI comes with state-of-the-art libraries and specific Gen AI capabilities from prompt engineering to fine-tuning of existing models and pre-training from scratch. The above architecture shows an example of how vector search can be integrated to create a RAG (retrieval augmented generation) AI application. 
\nFor deploying models in a scalable and enterprise-grade way, use the MLOps capabilities to publish the models in model serving.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reference.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Download lakehouse reference architectures\n#### Use case: BI and SQL analytics\n\n![BI and SQL analytics reference architecture for Databricks on AWS](https:\/\/docs.databricks.com\/_images\/aws-ref-arch-bi.png) \n**[Download: BI and SQL analytics reference architecture for Databricks on AWS](https:\/\/docs.databricks.com\/_extras\/documents\/reference-use-case-bi-for-aws.pdf)** \nFor BI use cases, business analysts can use dashboards, the Databricks SQL editor or specific BI tools such as Tableau or Amazon QuickSight. In all cases, the engine is Databricks SQL (serverless or non-serverless) and data discovery, exploration, lineage, and access control are provided by Unity Catalog.\n\n### Download lakehouse reference architectures\n#### Use case: Lakehouse federation\n\n![Lakehouse federation reference architecture for Databricks on AWS](https:\/\/docs.databricks.com\/_images\/aws-ref-arch-federation.png) \n**[Download: Lakehouse federation reference architecture for Databricks on AWS](https:\/\/docs.databricks.com\/_extras\/documents\/reference-use-case-federation-for-aws.pdf)** \nLakehouse federation allows external data SQL databases (such as MySQL, Postgres, or Redshift) to be integrated with Databricks. \nAll workloads (AI, DWH, and BI) can benefit from this without the need to ETL the data into object storage first. 
The external source catalog is mapped into Unity Catalog, and fine-grained access control can be applied to access through the Databricks platform.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reference.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Download lakehouse reference architectures\n#### Use case: Enterprise data sharing\n\n![Enterprise data sharing reference architecture for Databricks on AWS](https:\/\/docs.databricks.com\/_images\/aws-ref-arch-collaboration.png) \n**[Download: Enterprise data sharing reference architecture for Databricks on AWS](https:\/\/docs.databricks.com\/_extras\/documents\/reference-use-case-collaboration-for-aws.pdf)** \nEnterprise-grade data sharing is provided by Delta Sharing. It provides direct access to data in the object store secured by Unity Catalog, and Databricks Marketplace is an open forum for exchanging data products.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reference.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Limits & FAQ for Git integration with Databricks Git folders\n##### Errors and troubleshooting for Databricks Git folders\n\nFollow the guidance below to respond to common error messages or to troubleshoot issues with Databricks Git folders.\n\n##### Errors and troubleshooting for Databricks Git folders\n###### `Invalid credentials`\n\nTry the following: \n* Confirm that the Git integration settings (**Settings** > **Linked accounts**) are correct. 
\n+ You must enter both your Git provider username and token.\n* Confirm that you have selected the correct Git provider in [**Settings** > **Linked accounts**](https:\/\/docs.databricks.com\/repos\/repos-setup.html).\n* Ensure your personal access token or app password has the correct repo access.\n* If SSO is enabled on your Git provider, authorize your tokens for SSO.\n* Test your token with the Git command line. Replace the text strings in angle brackets: \n```\ngit clone https:\/\/<username>:<token>@github.com\/<org>\/<repo>.git\n\n```\n\n##### Errors and troubleshooting for Databricks Git folders\n###### `Secure connection...SSL problems`\n\nThis error occurs if your Git server is not accessible from Databricks. To access a private Git server, get in touch with your Databricks account team. \n```\n: Secure connection to could not be established because of SSL problems\n\n```\n\n##### Errors and troubleshooting for Databricks Git folders\n###### Timeout errors\n\nExpensive operations such as cloning a large repo or checking out a large branch might result in timeout errors, but the operation might complete in the background. You can also try again later if the workspace was under heavy load at the time. \nTo work with a large repo, try [sparse checkout](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html#sparse).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/errors-troubleshooting.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Limits & FAQ for Git integration with Databricks Git folders\n##### Errors and troubleshooting for Databricks Git folders\n###### 404 errors\n\nIf you get a 404 error when you try to open a non-notebook file, try waiting a few minutes and then trying again. 
There is a delay of a few minutes between when the workspace is enabled and when the webapp picks up the configuration flag.\n\n##### Errors and troubleshooting for Databricks Git folders\n###### Detached head state\n\nA Databricks Git folder can get into the detached head state if: \n* **The remote branch is deleted**. Databricks tried to recover the uncommitted local changes on the branch by applying those changes to the default branch. If the default branch has conflicting changes, Databricks applies the changes on a snapshot of the default branch (detached head).\n* A user or service principal checked out a remote repo on a tag using the [`update repo` API](https:\/\/docs.databricks.com\/api\/workspace\/repos\/update). \nTo recover from this state: \n1. Click the `create branch` button to create a new branch based on the current commit, or use the \u201cSelect branch\u201d dropdown to check out an existing branch.\n2. Commit and push if you want to keep the changes. To discard the changes, click on the kebab under **Changes**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/errors-troubleshooting.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Limits & FAQ for Git integration with Databricks Git folders\n##### Errors and troubleshooting for Databricks Git folders\n###### Resolve notebook name conflicts\n\nDifferent notebooks with identical or similar filenames can cause an error when you create a repo or pull request, such as `Cannot perform Git operation due to conflicting names` or `A folder cannot contain a notebook with the same name as a notebook, file, or folder (excluding file extensions).` \nA naming conflict can occur even with different file extensions. 
For example, these two files conflict: \n* `notebook.ipynb`\n* `notebook.py` \n![Diagram: Name conflict for notebook, file, or folder.](https:\/\/docs.databricks.com\/_images\/asset-name-conflict.png) \n### To fix the name conflict \n* Rename the notebook, file, or folder contributing to the error state. \n+ If this error occurs when you clone the repo, you need to rename notebooks, files, or folders in the remote Git repo.\n\n##### Errors and troubleshooting for Databricks Git folders\n###### Errors suggest recloning\n\n```\nThere was a problem with deleting folders. The repo could be in an inconsistent state and re-cloning is recommended.\n\n``` \nThis error indicates that a problem occurred while deleting folders from the repo. This could leave the repo in an inconsistent state, where folders that should have been deleted still exist. If this error occurs, Databricks recommends deleting and re-cloning the repo to reset its state.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/errors-troubleshooting.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Limits & FAQ for Git integration with Databricks Git folders\n##### Errors and troubleshooting for Databricks Git folders\n###### `No experiment...found` or MLflow UI errors\n\nYou might see a Databricks error message `No experiment for node found` or an error in MLflow when you work on an\nMLflow notebook experiment last logged to before the [3.72 platform release](https:\/\/docs.databricks.com\/release-notes\/product\/2022\/may.html#databricks-repos-fix-to-issue-with-mlflow-experiment-data-loss).\nTo resolve the error, log a new run in the notebook associated with that experiment. \nNote \nThis applies only to notebook experiments. 
Creation of new experiments in Git folders is [unsupported](https:\/\/docs.databricks.com\/repos\/limits.html#can-i-create-an-mlflow-experiment-in-a-repo).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/errors-troubleshooting.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Limits & FAQ for Git integration with Databricks Git folders\n##### Errors and troubleshooting for Databricks Git folders\n###### Notebooks appear as modified without any visible user edits\n\nIf every line of a notebook appears modified without any user edits, the modifications may be changes in line ending characters. Databricks uses Linux-style LF line ending characters and this may differ from line endings in files committed from Windows systems. \nIf your notebook shows as modified but you can\u2019t see any obvious user edits, the \u201cmodifications\u201d may be changes to the normally invisible \u201cend of line\u201d characters. End-of-line characters can be different across operating systems and file formats. \nTo diagnose this issue, check if you have a `.gitattributes` file. If you do: \n* It must not contain `* text eol=crlf`.\n* If you are **not** using Windows as your environment, remove the setting. Both your native development environment and Databricks use Linux end-of-line characters.\n* If you **are** using Windows, change the setting to `* text=auto`. Git will now internally store all files with Linux-style line endings, but will check out to platform-specific (such as Windows) end-of-line characters automatically. \nIf you have already committed files with Windows end-of-line characters into Git, perform the following steps: \n1. Clear any outstanding changes.\n2. Update the `.gitattributes` file with the recommendation above. Commit the change.\n3. Run `git add --renormalize .`. 
Commit and push all changes.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/errors-troubleshooting.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables Python language reference\n\nThis article provides details for the Delta Live Tables Python programming interface. \nFor information on the SQL API, see the [Delta Live Tables SQL language reference](https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html). \nFor details specific to configuring Auto Loader, see [What is Auto Loader?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html).\n\n##### Delta Live Tables Python language reference\n###### Limitations\n\nThe Delta Live Tables Python interface has the following limitations: \n* The Python `table` and `view` functions must return a DataFrame. Some functions that operate on DataFrames do not return DataFrames and should not be used. Because DataFrame transformations are executed *after* the full dataflow graph has been resolved, using such operations might have unintended side effects. These operations include functions such as `collect()`, `count()`, `toPandas()`, `save()`, and `saveAsTable()`. However, you can include these functions outside of `table` or `view` function definitions because this code is run once during the graph initialization phase.\n* The `pivot()` function is not supported. The `pivot` operation in Spark requires eager loading of input data to compute the schema of the output. This capability is not supported in Delta Live Tables.\n\n##### Delta Live Tables Python language reference\n###### Import the `dlt` Python module\n\nDelta Live Tables Python functions are defined in the `dlt` module. 
Your pipelines implemented with the Python API must import this module: \n```\nimport dlt\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables Python language reference\n###### Create a Delta Live Tables materialized view or streaming table\n\nIn Python, Delta Live Tables determines whether to update a dataset as a materialized view or streaming table based on the defining query. The `@table` decorator is used to define both materialized views and streaming tables. \nTo define a materialized view in Python, apply `@table` to a query that performs a static read against a data source. To define a streaming table, apply `@table` to a query that performs a streaming read against a data source. Both dataset types have the same syntax specification as follows: \n```\nimport dlt\n\n@dlt.table(\nname=\"\",\ncomment=\"\",\nspark_conf={\"\" : \"\", \"\" : \"\"},\ntable_properties={\"\" : \"\", \"\" : \"\"},\npath=\"\",\npartition_cols=[\"\", \"\"],\nschema=\"schema-definition\",\ntemporary=False)\n@dlt.expect\n@dlt.expect_or_fail\n@dlt.expect_or_drop\n@dlt.expect_all\n@dlt.expect_all_or_drop\n@dlt.expect_all_or_fail\ndef ():\nreturn ()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables Python language reference\n###### Create a Delta Live Tables view\n\nTo define a view in Python, apply the `@view` decorator. Like the `@table` decorator, you can use views in Delta Live Tables for either static or streaming datasets. 
The following is the syntax for defining views with Python: \n```\nimport dlt\n\n@dlt.view(\nname=\"\",\ncomment=\"\")\n@dlt.expect\n@dlt.expect_or_fail\n@dlt.expect_or_drop\n@dlt.expect_all\n@dlt.expect_all_or_drop\n@dlt.expect_all_or_fail\ndef ():\nreturn ()\n\n```\n\n##### Delta Live Tables Python language reference\n###### Example: Define tables and views\n\nTo define a table or view in Python, apply the `@dlt.view` or `@dlt.table` decorator to a function. You can use the function name or the `name` parameter to assign the table or view name. The following example defines two different datasets: a view called `taxi_raw` that takes a JSON file as the input source and a table called `filtered_data` that takes the `taxi_raw` view as input: \n```\nimport dlt\n\n@dlt.view\ndef taxi_raw():\nreturn spark.read.format(\"json\").load(\"\/databricks-datasets\/nyctaxi\/sample\/json\/\")\n\n# Use the function name as the table name\n@dlt.table\ndef filtered_data():\nreturn dlt.read(\"taxi_raw\").where(...)\n\n# Use the name parameter as the table name\n@dlt.table(\nname=\"filtered_data\")\ndef create_filtered_data():\nreturn dlt.read(\"taxi_raw\").where(...)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables Python language reference\n###### Example: Access a dataset defined in the same pipeline\n\nIn addition to reading from external data sources, you can access datasets defined in the same pipeline with the Delta Live Tables `read()` function. 
The following example demonstrates creating a `customers_filtered` dataset using the `read()` function: \n```\n@dlt.table\ndef customers_raw():\nreturn spark.read.format(\"csv\").load(\"\/data\/customers.csv\")\n\n@dlt.table\ndef customers_filteredA():\nreturn dlt.read(\"customers_raw\").where(...)\n\n``` \nYou can also use the `spark.table()` function to access a dataset defined in the same pipeline. When using the `spark.table()` function to access a dataset defined in the pipeline, in the function argument prepend the `LIVE` keyword to the dataset name: \n```\n@dlt.table\ndef customers_raw():\nreturn spark.read.format(\"csv\").load(\"\/data\/customers.csv\")\n\n@dlt.table\ndef customers_filteredB():\nreturn spark.table(\"LIVE.customers_raw\").where(...)\n\n```\n\n##### Delta Live Tables Python language reference\n###### Example: Read from a table registered in a metastore\n\nTo read data from a table registered in the Hive metastore, in the function argument omit the `LIVE` keyword and optionally qualify the table name with the database name: \n```\n@dlt.table\ndef customers():\nreturn spark.table(\"sales.customers\").where(...)\n\n``` \nFor an example of reading from a Unity Catalog table, see [Ingest data into a Unity Catalog pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html#ingest-data).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables Python language reference\n###### Example: Access a dataset using `spark.sql`\n\nYou can also return a dataset using a `spark.sql` expression in a query function. 
To read from an internal dataset, prepend `LIVE.` to the dataset name: \n```\n@dlt.table\ndef chicago_customers():\nreturn spark.sql(\"SELECT * FROM LIVE.customers_cleaned WHERE city = 'Chicago'\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables Python language reference\n###### Create a table to use as the target of streaming operations\n\nUse the `create_streaming_table()` function to create a target table for records output by streaming operations, including [apply\\_changes()](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html#cdc) and [@append\\_flow](https:\/\/docs.databricks.com\/delta-live-tables\/flows.html#append-flows) output records. \nNote \nThe `create_target_table()` and `create_streaming_live_table()` functions are deprecated. Databricks recommends updating existing code to use the `create_streaming_table()` function. \n```\ncreate_streaming_table(\nname = \"\",\ncomment = \"\",\nspark_conf={\"\" : \"\"},\ntable_properties={\"\" : \"\", \"\" : \"\"},\npartition_cols=[\"\", \"\"],\npath=\"\",\nschema=\"schema-definition\",\nexpect_all = {\"\" : \"\"},\nexpect_all_or_drop = {\"\" : \"\"},\nexpect_all_or_fail = {\"\" : \"\"}\n)\n\n``` \n| Arguments |\n| --- |\n| **`name`** Type: `str` The table name. This parameter is required. |\n| **`comment`** Type: `str` An optional description for the table. |\n| **`spark_conf`** Type: `dict` An optional list of Spark configurations for the execution of this query. |\n| **`table_properties`** Type: `dict` An optional list of [table properties](https:\/\/docs.databricks.com\/delta-live-tables\/properties.html) for the table. |\n| **`partition_cols`** Type: `array` An optional list of one or more columns to use for partitioning the table. |\n| **`path`** Type: `str` An optional storage location for table data. 
If not set, the system will default to the pipeline storage location. |\n| **`schema`** Type: `str` or `StructType` An optional schema definition for the table. Schemas can be defined as a SQL DDL string, or with a Python `StructType`. |\n| **`expect_all`** **`expect_all_or_drop`** **`expect_all_or_fail`** Type: `dict` Optional data quality constraints for the table. See [multiple expectations](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html#expect-all). |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables Python language reference\n###### Control how tables are materialized\n\nTables also offer additional control of their materialization: \n* Specify how tables are [partitioned](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html#schema-partition-example) using `partition_cols`. You can use partitioning to speed up queries.\n* You can set table properties when you define a view or table. See [Delta Live Tables table properties](https:\/\/docs.databricks.com\/delta-live-tables\/properties.html#table-properties).\n* Set a storage location for table data using the `path` setting. By default, table data is stored in the pipeline storage location if `path` isn\u2019t set.\n* You can use [generated columns](https:\/\/docs.databricks.com\/delta\/generated-columns.html) in your schema definition. See [Example: Specify a schema and partition columns](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html#schema-partition-example). \nNote \nFor tables less than 1 TB in size, Databricks recommends letting Delta Live Tables control data organization. 
Unless you expect your table to grow beyond a terabyte, you should generally not specify partition columns.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables Python language reference\n###### Example: Specify a schema and partition columns\n\nYou can optionally specify a table schema using a Python `StructType` or a SQL DDL string. When specified with a DDL string, the definition can include [generated columns](https:\/\/docs.databricks.com\/delta\/generated-columns.html). \nThe following example creates a table called `sales` with a schema specified using a Python `StructType`: \n```\nsales_schema = StructType([\nStructField(\"customer_id\", StringType(), True),\nStructField(\"customer_name\", StringType(), True),\nStructField(\"number_of_line_items\", StringType(), True),\nStructField(\"order_datetime\", StringType(), True),\nStructField(\"order_number\", LongType(), True)]\n)\n\n@dlt.table(\ncomment=\"Raw data on sales\",\nschema=sales_schema)\ndef sales():\nreturn (\"...\")\n\n``` \nThe following example specifies the schema for a table using a DDL string, defines a generated column, and defines a partition column: \n```\n@dlt.table(\ncomment=\"Raw data on sales\",\nschema=\"\"\"\ncustomer_id STRING,\ncustomer_name STRING,\nnumber_of_line_items STRING,\norder_datetime STRING,\norder_number LONG,\norder_day_of_week STRING GENERATED ALWAYS AS (dayofweek(order_datetime))\n\"\"\",\npartition_cols = [\"order_day_of_week\"])\ndef sales():\nreturn (\"...\")\n\n``` \nBy default, Delta Live Tables infers the schema from the `table` definition if you don\u2019t specify a schema.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live 
Tables Python language reference\n###### Configure a streaming table to ignore changes in a source streaming table\n\nNote \n* The `skipChangeCommits` flag works only with `spark.readStream` using the `option()` function. You cannot use this flag in a `dlt.read_stream()` function.\n* You cannot use the `skipChangeCommits` flag when the source streaming table is defined as the target of an [apply\\_changes()](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html#cdc) function. \nBy default, streaming tables require append-only sources. When a streaming table uses another streaming table as a source, and the source streaming table requires updates or deletes, for example, GDPR \u201cright to be forgotten\u201d processing, the `skipChangeCommits` flag can be set when reading the source streaming table to ignore those changes. For more information about this flag, see [Ignore updates and deletes](https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html#ignore-changes). \n```\n@table\ndef b():\nreturn spark.readStream.option(\"skipChangeCommits\", \"true\").table(\"LIVE.A\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables Python language reference\n###### Python Delta Live Tables properties\n\nThe following tables describe the options and properties you can specify while defining tables and views with Delta Live Tables: \n| @table or @view |\n| --- |\n| **`name`** Type: `str` An optional name for the table or view. If not defined, the function name is used as the table or view name. |\n| **`comment`** Type: `str` An optional description for the table. |\n| **`spark_conf`** Type: `dict` An optional list of Spark configurations for the execution of this query. 
|\n| **`table_properties`** Type: `dict` An optional list of [table properties](https:\/\/docs.databricks.com\/delta-live-tables\/properties.html) for the table. |\n| **`path`** Type: `str` An optional storage location for table data. If not set, the system will default to the pipeline storage location. |\n| **`partition_cols`** Type: `a collection of str` An optional collection, for example, a `list`, of one or more columns to use for partitioning the table. |\n| **`schema`** Type: `str` or `StructType` An optional schema definition for the table. Schemas can be defined as a SQL DDL string, or with a Python `StructType`. |\n| **`temporary`** Type: `bool` Create a table but do not publish metadata for the table. The `temporary` keyword instructs Delta Live Tables to create a table that is available to the pipeline but should not be accessed outside the pipeline. To reduce processing time, a temporary table persists for the lifetime of the pipeline that creates it, and not just a single update. The default is \u2018False\u2019. | \n| Table or view definition |\n| --- |\n| **`def ()`** A Python function that defines the dataset. If the `name` parameter is not set, then `` is used as the target dataset name. |\n| **`query`** A Spark SQL statement that returns a Spark Dataset or Koalas DataFrame. Use `dlt.read()` or `spark.table()` to perform a complete read from a dataset defined in the same pipeline. When using the `spark.table()` function to read from a dataset defined in the same pipeline, prepend the `LIVE` keyword to the dataset name in the function argument. For example, to read from a dataset named `customers`: `spark.table(\"LIVE.customers\")` You can also use the `spark.table()` function to read from a table registered in the metastore by omitting the `LIVE` keyword and optionally qualifying the table name with the database name: `spark.table(\"sales.customers\")` Use `dlt.read_stream()` to perform a streaming read from a dataset defined in the same pipeline. 
Use the `spark.sql` function to define a SQL query to create the return dataset. Use [PySpark](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/dataframe.html) syntax to define Delta Live Tables queries with Python. | \n| Expectations |\n| --- |\n| **`@expect(\"description\", \"constraint\")`** Declare a data quality constraint identified by `description`. If a row violates the expectation, include the row in the target dataset. |\n| **`@expect_or_drop(\"description\", \"constraint\")`** Declare a data quality constraint identified by `description`. If a row violates the expectation, drop the row from the target dataset. |\n| **`@expect_or_fail(\"description\", \"constraint\")`** Declare a data quality constraint identified by `description`. If a row violates the expectation, immediately stop execution. |\n| **`@expect_all(expectations)`** Declare one or more data quality constraints. `expectations` is a Python dictionary, where the key is the expectation description and the value is the expectation constraint. If a row violates any of the expectations, include the row in the target dataset. |\n| **`@expect_all_or_drop(expectations)`** Declare one or more data quality constraints. `expectations` is a Python dictionary, where the key is the expectation description and the value is the expectation constraint. If a row violates any of the expectations, drop the row from the target dataset. |\n| **`@expect_all_or_fail(expectations)`** Declare one or more data quality constraints. `expectations` is a Python dictionary, where the key is the expectation description and the value is the expectation constraint. If a row violates any of the expectations, immediately stop execution. 
|\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables Python language reference\n###### Change data capture with Python in Delta Live Tables\n\nUse the `apply_changes()` function in the Python API to use Delta Live Tables CDC functionality. The Delta Live Tables Python interface also provides the [create\\_streaming\\_table()](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html#create-target-fn) function. You can use this function to create the target table required by the `apply_changes()` function. \n```\napply_changes(\ntarget = \"\",\nsource = \"\",\nkeys = [\"key1\", \"key2\", \"keyN\"],\nsequence_by = \"\",\nignore_null_updates = False,\napply_as_deletes = None,\napply_as_truncates = None,\ncolumn_list = None,\nexcept_column_list = None,\nstored_as_scd_type = ,\ntrack_history_column_list = None,\ntrack_history_except_column_list = None\n)\n\n``` \nNote \nThe default behavior for `INSERT` and `UPDATE` events is to *upsert* CDC events from the source: update any rows in the target table that match the specified key(s) or insert a new row when a matching record does not exist in the target table. Handling for `DELETE` events can be specified with the `APPLY AS DELETE WHEN` condition. \nImportant \nYou must declare a target streaming table to apply changes into. You can optionally specify the schema for your target table. When specifying the schema of the `apply_changes` target table, you must also include the `__START_AT` and `__END_AT` columns with the same data type as the `sequence_by` field. \nSee [APPLY CHANGES API: Simplify change data capture in Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html). \n| Arguments |\n| --- |\n| **`target`** Type: `str` The name of the table to be updated. 
You can use the [create\\_streaming\\_table()](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html#create-target-fn) function to create the target table before executing the `apply_changes()` function. This parameter is required. |\n| **`source`** Type: `str` The data source containing CDC records. This parameter is required. |\n| **`keys`** Type: `list` The column or combination of columns that uniquely identify a row in the source data. This is used to identify which CDC events apply to specific records in the target table. You can specify either: * A list of strings: `[\"userId\", \"orderId\"]` * A list of Spark SQL `col()` functions: `[col(\"userId\"), col(\"orderId\")]` Arguments to `col()` functions cannot include qualifiers. For example, you can use `col(userId)`, but you cannot use `col(source.userId)`. This parameter is required. |\n| **`sequence_by`** Type: `str` or `col()` The column name specifying the logical order of CDC events in the source data. Delta Live Tables uses this sequencing to handle change events that arrive out of order. You can specify either: * A string: `\"sequenceNum\"` * A Spark SQL `col()` function: `col(\"sequenceNum\")` Arguments to `col()` functions cannot include qualifiers. For example, you can use `col(userId)`, but you cannot use `col(source.userId)`. This parameter is required. |\n| **`ignore_null_updates`** Type: `bool` Allow ingesting updates containing a subset of the target columns. When a CDC event matches an existing row and `ignore_null_updates` is `True`, columns with a `null` will retain their existing values in the target. This also applies to nested columns with a value of `null`. When `ignore_null_updates` is `False`, existing values will be overwritten with `null` values. This parameter is optional. The default is `False`. |\n| **`apply_as_deletes`** Type: `str` or `expr()` Specifies when a CDC event should be treated as a `DELETE` rather than an upsert. 
To handle out-of-order data, the deleted row is temporarily retained as a tombstone in the underlying Delta table, and a view is created in the metastore that filters out these tombstones. The retention interval can be configured with the `pipelines.cdc.tombstoneGCThresholdInSeconds` [table property](https:\/\/docs.databricks.com\/delta-live-tables\/properties.html#table-properties). You can specify either: * A string: `\"Operation = 'DELETE'\"` * A Spark SQL `expr()` function: `expr(\"Operation = 'DELETE'\")` This parameter is optional. |\n| **`apply_as_truncates`** Type: `str` or `expr()` Specifies when a CDC event should be treated as a full table `TRUNCATE`. Because this clause triggers a full truncate of the target table, it should be used only for specific use cases requiring this functionality. The `apply_as_truncates` parameter is supported only for SCD type 1. SCD type 2 does not support truncate. You can specify either: * A string: `\"Operation = 'TRUNCATE'\"` * A Spark SQL `expr()` function: `expr(\"Operation = 'TRUNCATE'\")` This parameter is optional. |\n| **`column_list`** **`except_column_list`** Type: `list` A subset of columns to include in the target table. Use `column_list` to specify the complete list of columns to include. Use `except_column_list` to specify the columns to exclude. You can declare either value as a list of strings or as Spark SQL `col()` functions: * `column_list = [\"userId\", \"name\", \"city\"]` * `column_list = [col(\"userId\"), col(\"name\"), col(\"city\")]` * `except_column_list = [\"operation\", \"sequenceNum\"]` * `except_column_list = [col(\"operation\"), col(\"sequenceNum\")]` Arguments to `col()` functions cannot include qualifiers. For example, you can use `col(userId)`, but you cannot use `col(source.userId)`. This parameter is optional. The default is to include all columns in the target table when no `column_list` or `except_column_list` argument is passed to the function. 
|\n| **`stored_as_scd_type`** Type: `str` or `int` Whether to store records as SCD type 1 or SCD type 2. Set to `1` for SCD type 1 or `2` for SCD type 2. This parameter is optional. The default is SCD type 1. |\n| **`track_history_column_list`** **`track_history_except_column_list`** Type: `list` A subset of output columns to be tracked for history in the target table. Use `track_history_column_list` to specify the complete list of columns to be tracked. Use `track_history_except_column_list` to specify the columns to be excluded from tracking. You can declare either value as a list of strings or as Spark SQL `col()` functions: - `track_history_column_list = [\"userId\", \"name\", \"city\"]` - `track_history_column_list = [col(\"userId\"), col(\"name\"), col(\"city\")]` - `track_history_except_column_list = [\"operation\", \"sequenceNum\"]` - `track_history_except_column_list = [col(\"operation\"), col(\"sequenceNum\")]` Arguments to `col()` functions cannot include qualifiers. For example, you can use `col(userId)`, but you cannot use `col(source.userId)`. This parameter is optional. The default is to include all columns in the target table when no `track_history_column_list` or `track_history_except_column_list` argument is passed to the function. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html"} +{"content":"# What is Delta Lake?\n### Use Delta Lake change data feed on Databricks\n\nChange data feed allows Databricks to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records *change events* for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated. \nImportant \nChange data feed works in tandem with table history to provide change information. 
Because cloning a Delta table creates a separate history, the change data feed on cloned tables doesn\u2019t match that of the original table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html"} +{"content":"# What is Delta Lake?\n### Use Delta Lake change data feed on Databricks\n#### Incrementally process change data\n\nDatabricks recommends using change data feed in combination with Structured Streaming to incrementally process changes from Delta tables. You must use Structured Streaming for Databricks to automatically track versions for your table\u2019s change data feed. \nNote \nDelta Live Tables provides functionality for easy propagation of change data and storing results as SCD (slowly changing dimension) type 1 or type 2 tables. See [APPLY CHANGES API: Simplify change data capture in Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html). \nTo read the change data feed from a table, you must enable change data feed on that table. See [Enable change data feed](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html#enable). \nSet the option `readChangeFeed` to `true` when configuring a stream against a table to read the change data feed, as shown in the following syntax example: \n```\n(spark.readStream.format(\"delta\")\n.option(\"readChangeFeed\", \"true\")\n.table(\"myDeltaTable\")\n)\n\n``` \n```\nspark.readStream.format(\"delta\")\n.option(\"readChangeFeed\", \"true\")\n.table(\"myDeltaTable\")\n\n``` \nBy default, the stream returns the latest snapshot of the table when the stream first starts as an `INSERT` and future changes as change data. \nChange data commits as part of the Delta Lake transaction, and becomes available at the same time the new data commits to the table. \nYou can optionally specify a starting version. See [Should I specify a starting version?](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html#version-options). 
\nChange data feed also supports batch execution, which requires specifying a starting version. See [Read changes in batch queries](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html#batch). \nOptions like rate limits (`maxFilesPerTrigger`, `maxBytesPerTrigger`) and `excludeRegex` are also supported when reading change data. \nRate limiting can be atomic for versions other than the starting snapshot version. That is, the entire commit version will be rate limited or the entire commit will be returned.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html"} +{"content":"# What is Delta Lake?\n### Use Delta Lake change data feed on Databricks\n#### Should I specify a starting version?\n\nYou can optionally specify a starting version if you want to ignore changes that happened before a particular version. You can specify a version using a timestamp or the version ID number recorded in the Delta transaction log. \nNote \nA starting version is required for batch reads, and many batch patterns can benefit from setting an optional ending version. \nWhen you\u2019re configuring Structured Streaming workloads involving change data feed, it\u2019s important to understand how specifying a starting version impacts processing. \nMany streaming workloads, especially new data processing pipelines, benefit from the default behavior. With the default behavior, the first batch is processed when the stream first records all existing records in the table as `INSERT` operations in the change data feed. \nIf your target table already contains all the records with appropriate changes up to a certain point, specify a starting version to avoid processing the source table state as `INSERT` events. \nThe following example syntax shows recovering from a streaming failure in which the checkpoint was corrupted. In this example, assume the following conditions: \n1. Change data feed was enabled on the source table at table creation.\n2. 
The target downstream table has processed all changes up to and including version 75.\n3. Version history for the source table is available for versions 70 and above. \n```\n(spark.readStream.format(\"delta\")\n.option(\"readChangeFeed\", \"true\")\n.option(\"startingVersion\", 76)\n.table(\"source_table\")\n)\n\n``` \n```\nspark.readStream.format(\"delta\")\n.option(\"readChangeFeed\", \"true\")\n.option(\"startingVersion\", 76)\n.table(\"source_table\")\n\n``` \nIn this example, you must also specify a new checkpoint location. \nImportant \nIf you specify a starting version, the stream fails to start from a new checkpoint if the starting version is no longer present in the table history. Delta Lake cleans up historic versions automatically, meaning that all specified starting versions are eventually deleted. \nSee [Can I use change data feed to replay the entire history of a table?](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html#replay).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html"} +{"content":"# What is Delta Lake?\n### Use Delta Lake change data feed on Databricks\n#### Read changes in batch queries\n\nYou can use batch query syntax to read all changes starting from a particular version or to read changes within a specified range of versions. \nYou specify a version as an integer and a timestamp as a string in the format `yyyy-MM-dd[ HH:mm:ss[.SSS]]`. \nThe start and end versions are inclusive in the queries. To read the changes from a particular start version to the latest version of the table, specify only the starting version. \nIf you provide a version lower or timestamp older than one that has recorded change events\u2014that is, when the change data feed was enabled\u2014an error is thrown indicating that the change data feed was not enabled. \nThe following syntax examples demonstrate using starting and ending version options with batch reads: \n```\n-- version as ints or longs e.g. 
changes from version 0 to 10\nSELECT * FROM table_changes('tableName', 0, 10)\n\n-- timestamp as string formatted timestamps\nSELECT * FROM table_changes('tableName', '2021-04-21 05:45:46', '2021-05-21 12:00:00')\n\n-- providing only the startingVersion\/timestamp\nSELECT * FROM table_changes('tableName', 0)\n\n-- database\/schema names inside the string for table name, with backticks for escaping dots and special characters\nSELECT * FROM table_changes('dbName.`dotted.tableName`', '2021-04-21 06:45:46' , '2021-05-21 12:00:00')\n\n``` \n```\n# version as ints or longs\nspark.read.format(\"delta\") \\\n.option(\"readChangeFeed\", \"true\") \\\n.option(\"startingVersion\", 0) \\\n.option(\"endingVersion\", 10) \\\n.table(\"myDeltaTable\")\n\n# timestamps as formatted timestamp\nspark.read.format(\"delta\") \\\n.option(\"readChangeFeed\", \"true\") \\\n.option(\"startingTimestamp\", '2021-04-21 05:45:46') \\\n.option(\"endingTimestamp\", '2021-05-21 12:00:00') \\\n.table(\"myDeltaTable\")\n\n# providing only the startingVersion\/timestamp\nspark.read.format(\"delta\") \\\n.option(\"readChangeFeed\", \"true\") \\\n.option(\"startingVersion\", 0) \\\n.table(\"myDeltaTable\")\n\n``` \n```\n\/\/ version as ints or longs\nspark.read.format(\"delta\")\n.option(\"readChangeFeed\", \"true\")\n.option(\"startingVersion\", 0)\n.option(\"endingVersion\", 10)\n.table(\"myDeltaTable\")\n\n\/\/ timestamps as formatted timestamp\nspark.read.format(\"delta\")\n.option(\"readChangeFeed\", \"true\")\n.option(\"startingTimestamp\", \"2021-04-21 05:45:46\")\n.option(\"endingTimestamp\", \"2021-05-21 12:00:00\")\n.table(\"myDeltaTable\")\n\n\/\/ providing only the startingVersion\/timestamp\nspark.read.format(\"delta\")\n.option(\"readChangeFeed\", \"true\")\n.option(\"startingVersion\", 0)\n.table(\"myDeltaTable\")\n\n``` \nNote \nBy default, if a user passes in a version or timestamp exceeding the last commit on a table, the error `timestampGreaterThanLatestCommit` is thrown. 
In Databricks Runtime 11.3 LTS and above, change data feed can handle the out of range version case if the user sets the following configuration to `true`: \n```\nset spark.databricks.delta.changeDataFeed.timestampOutOfRange.enabled = true;\n\n``` \nIf you provide a start version greater than the last commit on a table or a start timestamp newer than the last commit on a table, then when the preceding configuration is enabled, an empty read result is returned. \nIf you provide an end version greater than the last commit on a table or an end timestamp newer than the last commit on a table, then when the preceding configuration is enabled in batch read mode, all changes between the start version and the last commit are returned.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html"} +{"content":"# What is Delta Lake?\n### Use Delta Lake change data feed on Databricks\n#### What is the schema for the change data feed?\n\nWhen you read from the change data feed for a table, the schema for the latest table version is used. \nNote \nMost schema change and evolution operations are fully supported. Tables with column mapping enabled do not support all use cases and demonstrate different behavior. See [Change data feed limitations for tables with column mapping enabled](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html#column-mapping-limitations). \nIn addition to the data columns from the schema of the Delta table, change data feed contains metadata columns that identify the type of change event: \n| Column name | Type | Values |\n| --- | --- | --- |\n| `_change_type` | String | `insert`, `update_preimage`, `update_postimage`, `delete` [(1)](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html#1) |\n| `_commit_version` | Long | The Delta log or table version containing the change. |\n| `_commit_timestamp` | Timestamp | The timestamp of when the commit was created. 
| \n**(1)** `preimage` is the value before the update, `postimage` is the value after the update. \nNote \nYou cannot enable change data feed on a table if the schema contains columns with the same names as these added columns. Rename columns in the table to resolve this conflict before trying to enable change data feed.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html"} +{"content":"# What is Delta Lake?\n### Use Delta Lake change data feed on Databricks\n#### Enable change data feed\n\nYou can only read the change data feed for enabled tables. You must explicitly enable the change data feed option using one of the following methods: \n* **New table**: Set the table property `delta.enableChangeDataFeed = true` in the `CREATE TABLE` command. \n```\nCREATE TABLE student (id INT, name STRING, age INT) TBLPROPERTIES (delta.enableChangeDataFeed = true)\n\n```\n* **Existing table**: Set the table property `delta.enableChangeDataFeed = true` in the `ALTER TABLE` command. \n```\nALTER TABLE myDeltaTable SET TBLPROPERTIES (delta.enableChangeDataFeed = true)\n\n```\n* **All new tables**: \n```\nset spark.databricks.delta.properties.defaults.enableChangeDataFeed = true;\n\n``` \nImportant \nOnly changes made after you enable the change data feed are recorded. Past changes to a table are not captured.\n\n### Use Delta Lake change data feed on Databricks\n#### Change data storage\n\nEnabling change data feed causes a small increase in storage costs for a table. Change data records are generated as the query runs, and are generally much smaller than the total size of rewritten files. \nDatabricks records change data for `UPDATE`, `DELETE`, and `MERGE` operations in the `_change_data` folder under the table directory. Some operations, such as insert-only operations and full-partition deletions, do not generate data in the `_change_data` directory because Databricks can efficiently compute the change data feed directly from the transaction log. 
\nAll reads against data files in the `_change_data` folder should go through supported Delta Lake APIs. \nThe files in the `_change_data` folder follow the retention policy of the table. Change data feed data is deleted when the `VACUUM` command runs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html"} +{"content":"# What is Delta Lake?\n### Use Delta Lake change data feed on Databricks\n#### Can I use change data feed to replay the entire history of a table?\n\nChange data feed is not intended to serve as a permanent record of all changes to a table. Change data feed only records changes that occur after it\u2019s enabled. \nChange data feed and Delta Lake allow you to always reconstruct a full snapshot of a source table, meaning you can start a new streaming read against a table with change data feed enabled and capture the current version of that table and all changes that occur after. \nYou must treat records in the change data feed as transient and only accessible for a specified retention window. The Delta transaction log removes table versions and their corresponding change data feed versions at regular intervals. When a version is removed from the transaction log, you can no longer read the change data feed for that version. \nIf your use case requires maintaining a permanent history of all changes to a table, you should use incremental logic to write records from the change data feed to a new table. The following code example demonstrates using `trigger.AvailableNow`, which leverages the incremental processing of Structured Streaming but processes available data as a batch workload. You can schedule this workload asynchronously with your main processing pipelines to create a backup of change data feed for auditing purposes or full replayability. 
\n```\n(spark.readStream.format(\"delta\")\n.option(\"readChangeFeed\", \"true\")\n.table(\"source_table\")\n.writeStream\n.option(\"checkpointLocation\", )\n.trigger(availableNow=True)\n.toTable(\"target_table\")\n)\n\n``` \n```\nspark.readStream.format(\"delta\")\n.option(\"readChangeFeed\", \"true\")\n.table(\"source_table\")\n.writeStream\n.option(\"checkpointLocation\", )\n.trigger(Trigger.AvailableNow)\n.toTable(\"target_table\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html"} +{"content":"# What is Delta Lake?\n### Use Delta Lake change data feed on Databricks\n#### Change data feed limitations for tables with column mapping enabled\n\nWith column mapping enabled on a Delta table, you can drop or rename columns in the table without rewriting data files for existing data. With column mapping enabled, change data feed has limitations after performing non-additive schema changes such as renaming or dropping a column, changing data type, or nullability changes. \nImportant \n* You cannot read change data feed for a transaction or range in which a non-additive schema change occurs using batch semantics.\n* In Databricks Runtime 12.2 LTS and below, tables with column mapping enabled that have experienced non-additive schema changes do not support streaming reads on change data feed. See [Streaming with column mapping and schema changes](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html#schema-tracking).\n* In Databricks Runtime 11.3 LTS and below, you cannot read change data feed for tables with column mapping enabled that have experienced column renaming or dropping. \nIn Databricks Runtime 12.2 LTS and above, you can perform batch reads on change data feed for tables with column mapping enabled that have experienced non-additive schema changes. Instead of using the schema of the latest version of the table, read operations use the schema of the end version of the table specified in the query. 
Queries still fail if the version range specified spans a non-additive schema change.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Auto Loader FAQ\n\nCommonly asked questions about Databricks Auto Loader.\n\n#### Auto Loader FAQ\n##### Does Auto Loader process the file again when the file gets appended or overwritten?\n\nFiles are processed exactly once unless `cloudFiles.allowOverwrites` is enabled. When a file is appended to or overwritten, Databricks cannot guarantee which version of the file will be processed. You should also use caution when enabling `cloudFiles.allowOverwrites` in file notification mode, where Auto Loader might identify new files through both file notifications and directory listing. Due to the discrepancy between file notification event time and file modification time, Auto Loader might obtain two different timestamps and therefore ingest the same file twice, even when the file is only written once. \nIn general, Databricks recommends you use Auto Loader to ingest only immutable files and avoid setting `cloudFiles.allowOverwrites`. If this does not meet your requirements, contact your Databricks account team.\n\n#### Auto Loader FAQ\n##### If my data files do not arrive continuously, but in regular intervals, for example, once a day, should I still use this source and are there any benefits?\n\nIn this case, you can set up a `Trigger.AvailableNow` (available in Databricks Runtime 10.4 LTS and above) Structured Streaming job and schedule it to run after the anticipated file arrival time. Auto Loader works well with both infrequent and frequent updates. Even if the eventual updates are very large, Auto Loader scales well to the input size. 
Auto Loader\u2019s efficient file discovery techniques and schema evolution capabilities make Auto Loader the recommended method for incremental data ingestion.\n\n#### Auto Loader FAQ\n##### What happens if I change the checkpoint location when restarting the stream?\n\nA checkpoint location maintains important identifying information of a stream. Changing the checkpoint location effectively means that you have abandoned the previous stream and started a new stream.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/faq.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Auto Loader FAQ\n##### Do I need to create event notification services beforehand?\n\nNo. If you choose file notification mode and provide the required permissions, Auto Loader can create file notification services for you. See [What is Auto Loader file notification mode?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html)\n\n#### Auto Loader FAQ\n##### How do I clean up the event notification resources created by Auto Loader?\n\nYou can use the [cloud resource manager](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html#cloud-resource-management) to list and tear down resources.\nYou can also delete these resources manually using the cloud provider\u2019s UI or APIs.\n\n#### Auto Loader FAQ\n##### Can I run multiple streaming queries from different input directories on the same bucket\/container?\n\nYes, as long as they are not parent-child directories; for example, `prod-logs\/` and `prod-logs\/usage\/` would not work because `\/usage` is a child directory of `\/prod-logs`.\n\n#### Auto Loader FAQ\n##### Can I use this feature when there are existing file notifications on my bucket or container?\n\nYes, as long as your input directory does not conflict with the existing notification prefix (for example, the above parent-child 
directories).\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/faq.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Auto Loader FAQ\n##### How does Auto Loader infer schema?\n\nWhen the DataFrame is first defined, Auto Loader lists your source directory and chooses the most recent (by file modification time) 50 GB of data or 1000 files, and uses those to infer your data schema. \nAuto Loader also infers partition columns by examining the source directory structure and looks for file paths that contain the `\/key=value\/` structure. If the source directory has an inconsistent structure, for example: \n```\nbase\/path\/partition=1\/date=2020-12-31\/file1.json\n\/\/ inconsistent because date and partition directories are in different orders\nbase\/path\/date=2020-12-31\/partition=2\/file2.json\n\/\/ inconsistent because the date directory is missing\nbase\/path\/partition=3\/file3.json\n\n``` \nAuto Loader infers the partition columns as empty. Use `cloudFiles.partitionColumns` to explicitly parse columns from the directory structure.\n\n#### Auto Loader FAQ\n##### How does Auto Loader behave when the source folder is empty?\n\nIf the source directory is empty, Auto Loader requires you to provide a schema as there is no data to perform inference.\n\n#### Auto Loader FAQ\n##### When does Autoloader infer schema? Does it evolve automatically after every micro-batch?\n\nThe schema is inferred when the DataFrame is first defined in your code. During each micro-batch, schema changes are evaluated on the fly; therefore, you don\u2019t need to worry about performance hits. 
When the stream restarts, it picks up the evolved schema from the schema location and starts executing without any overhead from inference.\n\n#### Auto Loader FAQ\n##### What\u2019s the performance impact on ingesting the data when using Auto Loader schema inference?\n\nYou should expect schema inference to take a couple of minutes for very large source directories during initial schema inference. You shouldn\u2019t observe significant performance hits otherwise during stream execution. If you run your code in a Databricks notebook, you can see status updates that specify when Auto Loader will be listing your directory for sampling and inferring your data schema.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/faq.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Auto Loader FAQ\n##### Due to a bug, a bad file has changed my schema drastically. What should I do to roll back a schema change?\n\nContact Databricks support for help.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/faq.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Work with features in Workspace Feature Store\n##### Discover features and track feature lineage\n\nWith Databricks Feature Store, you can: \n* Search for feature tables by feature table name, feature, data source, or tag.\n* Control access to feature tables.\n* Identify the data sources used to create a feature table.\n* Identify models that use a particular feature.\n* Add a tag to a feature table.\n* Check feature freshness. \nTo access the Feature Store UI, in the sidebar, select **Machine Learning > Feature Store**. 
The Feature Store UI lists all of the available feature tables, along with the features in the table and the following metadata: \n* Who created the feature table.\n* Data sources used to compute the feature table.\n* Online stores where the feature table has been published.\n* Scheduled jobs that compute the features in the feature table.\n* The last time a notebook or job wrote to the feature table. \n![Feature store page](https:\/\/docs.databricks.com\/_images\/feature-store-ui.png)\n\n##### Discover features and track feature lineage\n###### Search and browse for feature tables\n\nUse the search box to search for feature tables. You can enter all or part of the name of a feature table, a feature, or a data source used for feature computation. You can also enter all or part of the key or value of a tag. Search text is case-insensitive. \n![Feature search example](https:\/\/docs.databricks.com\/_images\/feature-search-example.png)\n\n##### Discover features and track feature lineage\n###### Control access to feature tables\n\nSee [Control access to feature tables](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/access-control.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/ui.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Work with features in Workspace Feature Store\n##### Discover features and track feature lineage\n###### Track feature lineage and freshness\n\nIn the UI you can track both how a feature was created and where it is used. For example, you can track the raw data sources, notebooks, and jobs that were used to compute the features. You can also track the online stores where the feature is published, the models trained with it, the serving endpoints that access it, and the notebooks and jobs that read it. \nIn the Feature Store UI, click the name of any feature table to display the feature table page. 
\nOn the feature table page, the **Producers** table provides information about all of the notebooks and jobs that write to this feature table so you can easily confirm the status of scheduled jobs and the freshness of the feature table. \n![producers table](https:\/\/docs.databricks.com\/_images\/producers-table.png) \nThe **Features** table lists all of the features in the table and provides links to the models, endpoints, jobs, and notebooks that use the feature. \n![features table](https:\/\/docs.databricks.com\/_images\/features-table.png) \nTo return to the main Feature Store UI page, click **Feature Store** near the top of the page.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/ui.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Work with features in Workspace Feature Store\n##### Discover features and track feature lineage\n###### Add a tag to a feature table\n\nTags are key-value pairs that you can create and use to [search for feature tables](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/ui.html#search-and-browse-for-feature-tables). \n1. On the feature table page, click ![Tag icon](https:\/\/docs.databricks.com\/_images\/tags1.png) if it is not already open. The tags table appears. \n![tag table](https:\/\/docs.databricks.com\/_images\/tags-open.png)\n2. Click in the **Name** and **Value** fields and enter the key and value for your tag.\n3. Click **Add**. \n![add tag](https:\/\/docs.databricks.com\/_images\/tag-add.png) \n### Edit or delete a tag \nTo edit or delete an existing tag, use the icons in the **Actions** column. 
\n![tag actions](https:\/\/docs.databricks.com\/_images\/tag-edit-or-delete.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/ui.html"} +{"content":"# Databricks data engineering\n## What are init scripts?\n#### Use cluster-scoped init scripts\n\nCluster-scoped init scripts are init scripts defined in a cluster configuration. Cluster-scoped init scripts apply to both clusters you create and those created to run jobs. \nYou can configure cluster-scoped init scripts using the UI, the CLI, and by invoking the Clusters API. This section focuses on performing these tasks using the UI. For the other methods, see the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) and the [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters). \nYou can add any number of scripts, and the scripts are executed sequentially in the order provided. \nIf a cluster-scoped init script returns a non-zero exit code, the cluster launch *fails*. You can troubleshoot cluster-scoped init scripts by configuring [cluster log delivery](https:\/\/docs.databricks.com\/compute\/configure.html#cluster-log-delivery) and examining the init script log. See [Init script logging](https:\/\/docs.databricks.com\/init-scripts\/logs.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/cluster-scoped.html"} +{"content":"# Databricks data engineering\n## What are init scripts?\n#### Use cluster-scoped init scripts\n##### Configure a cluster-scoped init script using the UI\n\nThis section contains instructions for configuring a cluster to run an init script using the Databricks UI. \nDatabricks recommends managing all init scripts as cluster-scoped init scripts. If you are using compute with shared or single user access mode, store init scripts in Unity Catalog volumes. If you are using compute with no-isolation shared access mode, use workspace files for init scripts. 
\nFor shared access mode, you must add init scripts to the `allowlist`. See [Allowlist libraries and init scripts on shared compute](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/allowlist.html). \nTo use the UI to configure a cluster to run an init script, complete the following steps: \n1. On the cluster configuration page, click the **Advanced Options** toggle.\n2. At the bottom of the page, click the **Init Scripts** tab.\n3. In the **Source** drop-down, select the **Workspace**, **Volume**, or **S3** source type.\n4. Specify a path to the init script, such as one of the following examples: \n* For an init script stored in your home directory with workspace files: `\/Users\/\/.sh`.\n* For an init script stored with Unity Catalog volumes: `\/Volumes\/\/\/\/\/.sh`.\n* For an init script stored with object storage: `s3:\/\/bucket-name\/path\/to\/init-script`.\n5. Click **Add**. \nIn single user access mode, the identity of the assigned principal (a user or service principal) is used. \nIn shared access mode, the identity of the cluster owner is used. \nNote \nNo-isolation shared access mode does not support volumes, but uses the same identity assignment as shared access mode. \nTo remove a script from the cluster configuration, click the trash icon at the right of the script. When you confirm the deletion, you are prompted to restart the cluster. Optionally, you can delete the script file from the location you uploaded it to. \nNote \nIf you configure an init script using the **S3** source type, you must configure access credentials. \nDatabricks recommends using instance profiles to manage access to init scripts stored in S3. Use the documentation linked in the following steps to complete this setup: \n1. Create an IAM role with read and list permissions on your desired buckets. 
See [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html).\n2. Launch a cluster with the instance profile. See [Instance profiles](https:\/\/docs.databricks.com\/compute\/configure.html#instance-profiles). \nWarning \nCluster-scoped init scripts on DBFS are end-of-life. The **DBFS** option in the UI exists in some workspaces to support legacy workloads and is not recommended. All init scripts stored in DBFS should be migrated. For migration instructions, see [Migrate init scripts from DBFS](https:\/\/docs.databricks.com\/init-scripts\/index.html#migrate).\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/cluster-scoped.html"} +{"content":"# Databricks data engineering\n## What are init scripts?\n#### Use cluster-scoped init scripts\n##### Configure S3 region\n\nYou must specify the S3 region for the bucket containing the init script if the bucket is in a different region than your workspace. Select `auto` only if your bucket and workspace share a region.\n\n#### Use cluster-scoped init scripts\n##### Troubleshooting cluster-scoped init scripts\n\n* The script must exist at the configured location. If the script doesn\u2019t exist, attempts to start the cluster or scale up the executors result in failure.\n* The init script cannot be larger than 64KB. If a script exceeds that size, the cluster will fail to launch and a failure message will appear in the cluster log.\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/cluster-scoped.html"} +{"content":"# \n### Use certified answers in Genie spaces\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
\nThis article defines certified answers and explains how to use them to increase trust and confidence in responses provided in a Genie space.\n\n### Use certified answers in Genie spaces\n#### What are certified answers?\n\nCertified answers allow you to explicitly define validated, parameterized SQL queries as recipes for answering common questions. They can reduce the likelihood of non-technical users receiving responses that are misleading, incorrect, or hard to interpret. Certified answers help the Genie space provide accurate answers to common questions and let users know when the response they receive has been verified. \n![Certified answer response](https:\/\/docs.databricks.com\/_images\/certified-answer.png) \nNote \nCertified answers are not a substitute for all other instructions. Databricks recommends using certified answers only for recurring, well-established questions. They provide exact answers to specific questions and are not reused by the Assistant to address adjacent questions.\n\n### Use certified answers in Genie spaces\n#### Why create certified answers?\n\nGenie spaces return the result of a generated SQL query to answer user questions. Business users can potentially include jargon that is hard to parse for the large language model (LLM) that generates queries. Suppose a business user provides a prompt like, \u201cShow me the open pipeline in our APAC region.\u201d If `open pipeline` does not correspond directly to a field in one of the tables in your Genie space, the user might get an empty result set accompanied by a generated SQL query, as in the following response: \n![Empty result response](https:\/\/docs.databricks.com\/_images\/empty-result.png) \nFor most business users, it is difficult to interpret or troubleshoot this response. 
Genie space authors can define certified answers to provide trusted responses for questions like this.\n\n","doc_uri":"https:\/\/docs.databricks.com\/prpr-ans-67656E69652D737061636573.html"} +{"content":"# \n### Use certified answers in Genie spaces\n#### Define a certified answer\n\nTo define a certified answer, identify the question you expect users to ask. Then do the following: \n1. Define and test a SQL query that answers the question. \nThe following is an example query designed to answer the question in the previous example. The table this query returns includes results from all regions in the data. \n```\nSELECT\no.id AS `OppId`,\na.region__c AS `Region`,\no.name AS `Opportunity Name`,\no.forecastcategory AS `Forecast Category`,\no.stagename,\no.closedate AS `Close Date`,\no.amount AS `Opp Amount`\nFROM\nusers.user_name.opportunity o\nJOIN catalog.schema.accounts a ON o.accountid = a.id\n\nWHERE\no.forecastcategory = 'Pipeline' AND\no.stagename NOT LIKE '%closed%';\n\n```\n2. Define a Unity Catalog function. \nYour Unity Catalog function should parameterize the query and produce results matching the specific conditions a user might inquire about. \nSee [Create a SQL table function](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-sql-function.html#create-a-sql-table-function) to learn how to define a Unity Catalog function. \nThe following function takes a list of regions and returns a table. The comments provided in the function definitions are critical for instructing the Genie space on when and how to invoke this function. This example includes comments in the function\u2019s parameter definition and comments defined in the SQL table function that explain what the function does. \n* **Parameter comments**: The `open_opps_in_region` function expects an array of strings as a parameter. The comment includes an example of the expected input. 
If no parameter is supplied, the default value is `NULL`.\n* **Function comments**: The comment in the SQL table function explains what the function does. The associated SQL query has been adjusted to reference the function parameter in the `WHERE` clause. \n```\nCREATE OR REPLACE FUNCTION users.user_name.open_opps_in_region (\nregions ARRAY<STRING> COMMENT 'List of regions. Example: [\"APAC\", \"EMEA\"]' DEFAULT NULL\n) RETURNS TABLE\nCOMMENT 'Addresses questions about the pipeline in a region by returning a list of all the open opportunities.'\nRETURN\n\nSELECT\no.id AS `OppId`,\na.region__c AS `Region`,\no.name AS `Opportunity Name`,\no.forecastcategory AS `Forecast Category`,\no.stagename,\no.closedate AS `Close Date`,\no.amount AS `Opp Amount`\nFROM\nusers.user_name.opportunity o\nJOIN catalog.schema.accounts a ON o.accountid = a.id\nWHERE\no.forecastcategory = 'Pipeline' AND\no.stagename NOT LIKE '%closed%' AND\n(isnull(open_opps_in_region.regions) OR array_contains(open_opps_in_region.regions, region__c));\n\n``` \nWhen you run the code to create a function, it\u2019s registered to the currently active schema by default. See [Custom SQL functions in Unity Catalog](https:\/\/docs.databricks.com\/udf\/unity-catalog.html#custom-uc-functions).\n3. Add the certified answer. \nAfter the function is published to Unity Catalog, a user with at least CAN EDIT permission on the Genie space can add it in the **Instructions** tab of the Genie space. \n![Add certified answer button](https:\/\/docs.databricks.com\/_images\/btn-certified-answers.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/prpr-ans-67656E69652D737061636573.html"} +{"content":"# \n### Use certified answers in Genie spaces\n#### Required permissions\n\nGenie space authors with at least CAN EDIT permission on a Genie space can add or remove certified answers. \nGenie space users must have CAN USE permission on the catalog and schema that contains the function. 
To invoke a certified answer, they must have EXECUTE permission on the function in Unity Catalog. Unity Catalog securable objects inherit permissions from their parent containers. See [Securable objects in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html#securable-objects). \nTo simplify sharing in a Genie space, Databricks recommends creating a dedicated schema to contain all of the functions that you want to use in your Genie space.\n\n","doc_uri":"https:\/\/docs.databricks.com\/prpr-ans-67656E69652D737061636573.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Install libraries from workspace files\n\nThis article walks you through the steps required to upload package or requirements.txt files to workspace files and install them onto clusters in Databricks. You can install libraries onto all-purpose compute or job compute. \nImportant \nThis article describes storing libraries as workspace files. This is different than [workspace libraries](https:\/\/docs.databricks.com\/archive\/legacy\/workspace-libraries.html) which are deprecated. \nFor more information about workspace files, see [Navigate the workspace](https:\/\/docs.databricks.com\/workspace\/index.html). \nFor full library compatibility details, see [Libraries](https:\/\/docs.databricks.com\/libraries\/index.html).\n\n#### Install libraries from workspace files\n##### Load libraries to workspace files\n\nYou can load libraries to workspace files the same way you load other files. \nTo load a library to workspace files: \n1. Click ![Workspace Icon](https:\/\/docs.databricks.com\/_images\/workspace-icon.png) **Workspace** in the left sidebar.\n2. Navigate to the location in the workspace where you want to upload the library.\n3. Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) in the upper right and choose **Import**.\n4. The **Import** dialog appears. 
For **Import from:** choose **File** or **URL**. Drag and drop or browse to the file(s) you want to upload, or provide the URL path to the file.\n5. Click **Import**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/workspace-files-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Install libraries from workspace files\n##### Install libraries from workspace files onto a cluster\n\nWhen you install a library onto a cluster, all notebooks running on that cluster have access to the library. \nTo install a library from workspace files onto a cluster: \n1. Click ![compute icon](https:\/\/docs.databricks.com\/_images\/clusters-icon.png) **Compute** in the left sidebar.\n2. Click the name of the cluster in the cluster list.\n3. Click the **Libraries** tab.\n4. Click **Install new**. The **Install library** dialog appears.\n5. For **Library Source**, select **Workspace**.\n6. Upload the library or requirements.txt file, browse to the library or requirements.txt in the workspace, or enter its workspace location in the **Workspace File Path** field, such as the following:\n`\/Workspace\/Users\/someone@example.com\/\/.`\n7. Click **Install**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/workspace-files-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Install libraries from workspace files\n##### Add dependent libraries to workflow tasks from workspace files\n\nYou can add dependent libraries to tasks from workspace files. See [Configure dependent libraries](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#task-config-dependent-libraries). \nTo configure a workflow task with a dependent library from workspace files: \n1. Select an existing task in a workflow or create a new task.\n2. Next to **Dependent libraries**, click **+ Add**.\n3. In the **Add dependent library** dialog, select **Workspace** for **Library Source**.\n4. 
Upload the library or requirements.txt file, browse to the library or requirements.txt file in the workspace, or enter its workspace location in the **Workspace File Path** field, such as the following:\n`\/Workspace\/Users\/someone@example.com\/\/.`\n5. Click **Install**.\n\n#### Install libraries from workspace files\n##### Install libraries from workspace files to a notebook\n\nYou can install Python libraries directly to a notebook to create custom environments that are specific to the notebook. For example, you can use a specific version of a library in a notebook, without affecting other users on the cluster who may need a different version of the same library. For more information, see [notebook-scoped libraries](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html). \nWhen you install a library to a notebook, only the current notebook and any jobs associated with that notebook have access to that library. Other notebooks attached to the same cluster are not affected.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/workspace-files-libraries.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Load data using COPY INTO with an instance profile\n\nThis article describes how to use the `COPY INTO` command to load data from an Amazon S3 bucket in your AWS account into a table in Databricks SQL. \nThe steps in this article assume that your admin has configured a SQL warehouse to use an AWS instance profile so that you can access your source files in S3. If your admin configured a Unity Catalog external location with a storage credential, see [Load data using COPY INTO with Unity Catalog volumes or external locations](https:\/\/docs.databricks.com\/ingestion\/copy-into\/unity-catalog.html) instead. 
If your admin gave you temporary credentials (an AWS access key ID, a secret key, and a session token), see [Load data using COPY INTO with temporary credentials](https:\/\/docs.databricks.com\/ingestion\/copy-into\/temporary-credentials.html) instead. \nDatabricks recommends using the [COPY INTO](https:\/\/docs.databricks.com\/ingestion\/copy-into\/index.html) command for incremental and bulk data loading with Databricks SQL. \nNote \n`COPY INTO` works well for data sources that contain thousands of files. Databricks recommends that you use [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html) for loading millions of files, which is not supported in Databricks SQL.\n\n#### Load data using COPY INTO with an instance profile\n##### Before you begin\n\nBefore you load data into Databricks, make sure you have the following: \n* Access to data in S3. Your admin must first complete the steps in [Configure data access for ingestion](https:\/\/docs.databricks.com\/ingestion\/copy-into\/configure-data-access.html) so your Databricks SQL warehouse can read your source files.\n* A Databricks SQL warehouse that uses the instance profile that your admin created.\n* The **Can manage** permission on the SQL warehouse.\n* The fully qualified S3 URI.\n* Familiarity with the Databricks SQL user interface.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/tutorial-dbsql.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Load data using COPY INTO with an instance profile\n##### Step 1: Confirm access to data in cloud storage\n\nTo confirm that you have access to the correct data in cloud object storage, do the following: \n1. In the sidebar, click **Create > Query**.\n2. In the SQL editor\u2019s menu bar, select a SQL warehouse.\n3. In the SQL editor, paste the following code: \n```\nselect * from csv.\n\n``` \nReplace `` with the S3 URI that you received from your admin. 
For example, `s3:\/\/\/\/`.\n4. Click **Run**.\n\n#### Load data using COPY INTO with an instance profile\n##### Step 2: Create a table\n\nThis step describes how to create a table in your Databricks workspace to hold the incoming data. \n1. In the SQL editor, paste the following code: \n```\nCREATE TABLE .. (\ntpep_pickup_datetime TIMESTAMP,\ntpep_dropoff_datetime TIMESTAMP,\ntrip_distance DOUBLE,\nfare_amount DOUBLE,\npickup_zip INT,\ndropoff_zip INT\n);\n\n```\n2. Click **Run**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/tutorial-dbsql.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Load data using COPY INTO with an instance profile\n##### Step 3: Load data from cloud storage into the table\n\nThis step describes how to load data from an S3 bucket into the table you created in your Databricks workspace. \n1. In the sidebar, click **Create > Query**.\n2. In the SQL editor\u2019s menu bar, select a SQL warehouse and make sure the SQL warehouse is running.\n3. In the SQL editor, paste the following code. In this code, replace: \n* `` with the name of your S3 bucket.\n* `` with the name of the folder in your S3 bucket.\n```\nCOPY INTO ..\nFROM 's3:\/\/\/\/'\nFILEFORMAT = CSV\nFORMAT_OPTIONS (\n'header' = 'true',\n'inferSchema' = 'true'\n)\nCOPY_OPTIONS (\n'mergeSchema' = 'true'\n);\n\nSELECT * FROM ..;\n\n``` \nNote \n`FORMAT_OPTIONS` differs depending on `FILEFORMAT`. In this case, the `header` option instructs Databricks to treat the first row of the CSV file as a header, and the `inferSchema` option instructs Databricks to automatically determine the data type of each field in the CSV file.\n4. Click **Run**. \nNote \nIf you click **Run** again, no new data is loaded into the table. 
This is because the `COPY INTO` command only processes what it considers to be new data.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/tutorial-dbsql.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Load data using COPY INTO with an instance profile\n##### Clean up\n\nYou can clean up the associated resources in your workspace if you no longer want to keep them. \n### Delete the tables \n1. In the sidebar, click **Create > Query**.\n2. Select a SQL warehouse and make sure that the SQL warehouse is running.\n3. Paste the following code: \n```\nDROP TABLE ..;\n\n```\n4. Click **Run**.\n5. Hover over the tab for this query, and then click the **X** icon. \n### Delete the queries in the SQL editor \n1. In the sidebar, click **SQL Editor**.\n2. In the SQL editor\u2019s menu bar, hover over the tab for each query that you created for this tutorial, and then click the **X** icon.\n\n#### Load data using COPY INTO with an instance profile\n##### Additional resources\n\n* The [COPY INTO](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-copy-into.html) reference article\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/tutorial-dbsql.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Driver capability settings for the Databricks ODBC Driver\n\nThis article describes how to configure special and advanced driver capability settings for the [Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/index.html). \nThe Databricks ODBC Driver provides the following special and advanced driver capability settings. 
\n* [Set the initial schema in ODBC](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html#odbc-native)\n* [ANSI SQL-92 query support in ODBC](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html#odbc-ansi)\n* [Extract large query results in ODBC](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html#odbc-extract)\n* [Arrow serialization in ODBC](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html#odbc-arrow)\n* [Cloud Fetch in ODBC](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html#cloud-fetch-in-odbc)\n* [Advanced configurations](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html#advanced-configurations)\n* [Enable logging](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html#enable-logging)\n\n######### Driver capability settings for the Databricks ODBC Driver\n########## Set the initial schema in ODBC\n\nThe ODBC driver allows you to specify the schema by setting `Schema=` as a connection configuration. This is equivalent to running `USE `.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Driver capability settings for the Databricks ODBC Driver\n########## ANSI SQL-92 query support in ODBC\n\nThe ODBC driver accepts SQL queries in ANSI SQL-92 dialect and translates the queries to the Databricks SQL dialect. However, if your application generates Databricks SQL directly or your application uses any non-ANSI SQL-92 standard SQL syntax specific to Databricks, Databricks recommends that you set `UseNativeQuery=1` as a connection configuration. 
With that setting, the driver passes the SQL queries verbatim to Databricks.\n\n######### Driver capability settings for the Databricks ODBC Driver\n########## Extract large query results in ODBC\n\nTo achieve the best performance when you extract large query results, use the latest version of the ODBC driver that includes the following optimizations.\n\n######### Driver capability settings for the Databricks ODBC Driver\n########## Arrow serialization in ODBC\n\nODBC driver version 2.6.15 and above supports an optimized query results serialization format that uses [Apache Arrow](https:\/\/arrow.apache.org\/docs\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Driver capability settings for the Databricks ODBC Driver\n########## Cloud Fetch in ODBC\n\nODBC driver version 2.6.17 and above supports Cloud Fetch, a capability that fetches query results through the cloud storage that is set up in your Databricks deployment. \nQuery results are uploaded to an internal [DBFS storage location](https:\/\/docs.databricks.com\/dbfs\/index.html) as Arrow-serialized files of up to 20 MB. When the driver sends fetch requests after query completion, Databricks generates and returns [presigned URLs](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/using-presigned-url.html) to the uploaded files. The ODBC driver then uses the URLs to download the results directly from DBFS. \nCloud Fetch is only used for query results larger than 1 MB. Smaller results are retrieved directly from Databricks. \nDatabricks automatically garbage collects the accumulated files, which are marked for deletion after 24 hours. These marked files are completely deleted after an additional 24 hours. 
\nCloud Fetch is only available for E2 workspaces. Also, your corresponding Amazon S3 buckets must not have versioning enabled. If you have versioning enabled, you can still enable Cloud Fetch by following the instructions in [Advanced configurations](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html#advanced-configurations). \nTo learn more about the Cloud Fetch architecture, see [How We Achieved High-bandwidth Connectivity With BI Tools](https:\/\/databricks.com\/blog\/2021\/08\/11\/how-we-achieved-high-bandwidth-connectivity-with-bi-tools.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Driver capability settings for the Databricks ODBC Driver\n########## Advanced configurations\n\nIf you have enabled [S3 bucket versioning](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/Versioning.html) on your [DBFS root](https:\/\/docs.databricks.com\/dbfs\/index.html), then Databricks cannot garbage collect older versions of uploaded query results. We recommend setting an S3 lifecycle policy first that purges older versions of uploaded query results. \nTo set a lifecycle policy follow the steps below: \n1. In the AWS console, go to the **S3** service.\n2. Click on the [S3 bucket](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/storage.html) that you use for your workspace\u2019s root storage.\n3. Open the **Management** tab and choose **Create lifecycle rule**.\n4. Choose any name for the **Lifecycle rule name**.\n5. Keep the prefix field empty.\n6. Under **Lifecycle rule actions** select **Permanently delete noncurrent versions of objects**.\n7. Set a value under **Days after objects become noncurrent**. We recommend using the value 1 here.\n8. Click **Create rule**. 
\n![Lifecycle policy](https:\/\/docs.databricks.com\/_images\/lifecycle-policy-with-tags.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Driver capability settings for the Databricks ODBC Driver\n########## Enable logging\n\nTo enable logging in the ODBC driver for Windows, set the following fields in the ODBC Data Source Administrator for the related DSN: \n* Set the **Log Level** field from **FATAL** to log only severe events through **TRACE** to log all driver activity.\n* Set the **Log Path** field to the full path to the folder where you want to save log files.\n* Set the **Max Number Files** field to the maximum number of log files to keep.\n* Set the **Max File Size** field to the maximum size of each log file in megabytes. \nTo enable logging in the ODBC driver for a non-Windows machine, set the following properties in the related DSN or DSN-less connection string: \n* Set the `LogLevel` property from `1` to log only severe events through `6` to log all driver activity.\n* Set the `LogPath` property to the full path to the folder where you want to save log files.\n* Set the `LogFileCount` property to the maximum number of log files to keep.\n* Set the `LogFileSize` property to the maximum size of each log file in bytes. 
\nFor more information, see the sections `Configuring Logging Options on Windows` and `Configuring Logging Options on a Non-Windows Machine` in the [Databricks JDBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Databricks-JDBC-Driver-Install-and-Configuration-Guide.pdf).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html"} +{"content":"# What is Delta Lake?\n### Data skipping for Delta Lake\n\nData skipping information is collected automatically when you write data into a Delta table. Delta Lake on Databricks takes advantage of this information (minimum and maximum values, null counts, and total records per file) at query time to provide faster queries. \nNote \nIn Databricks Runtime 13.3 and above, Databricks recommends using clustering for Delta table layout. Clustering is not compatible with Z-ordering. See [Use liquid clustering for Delta tables](https:\/\/docs.databricks.com\/delta\/clustering.html). \nYou must have statistics collected for columns that are used in `ZORDER` statements. See [What is Z-ordering?](https:\/\/docs.databricks.com\/delta\/data-skipping.html#delta-zorder).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/data-skipping.html"} +{"content":"# What is Delta Lake?\n### Data skipping for Delta Lake\n#### Specify Delta statistics columns\n\nBy default, Delta Lake collects statistics on the first 32 columns defined in your table schema. For this collection, each field in a nested column is considered an individual column. You can modify this behavior by setting one of the following table properties: \n| Table property | Databricks Runtime supported | Description |\n| --- | --- | --- |\n| `delta.dataSkippingNumIndexedCols` | All supported Databricks Runtime versions | Increase or decrease the number of columns on which Delta collects statistics. Depends on column order. 
|\n| `delta.dataSkippingStatsColumns` | Databricks Runtime 13.3 LTS and above | Specify a list of column names for which Delta Lake collects statistics. Supersedes `dataSkippingNumIndexedCols`. | \nTable properties can be set at table creation or with `ALTER TABLE` statements. See [Delta table properties reference](https:\/\/docs.databricks.com\/delta\/table-properties.html). \nUpdating this property does not automatically recompute statistics for existing data. Rather, it impacts the behavior of future statistics collection when adding or updating data in the table. Delta Lake does not leverage statistics for columns not included in the current list of statistics columns. \nIn Databricks Runtime 14.3 LTS and above, you can manually trigger the recomputation of statistics for a Delta table using the following command: \n```\nANALYZE TABLE table_name COMPUTE DELTA STATISTICS\n\n``` \nNote \nLong strings are truncated during statistics collection. You might choose to exclude long string columns from statistics collection, especially if the columns aren\u2019t used frequently for filtering queries.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/data-skipping.html"} +{"content":"# What is Delta Lake?\n### Data skipping for Delta Lake\n#### What is Z-ordering?\n\nZ-ordering is a [technique](https:\/\/en.wikipedia.org\/wiki\/Z-order_curve) to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the `ZORDER BY` clause: \n```\nOPTIMIZE events\nWHERE date >= current_timestamp() - INTERVAL 1 day\nZORDER BY (eventType)\n\n``` \nIf you expect a column to be commonly used in query predicates and if that column has high cardinality (that is, a large number of distinct values), then use `ZORDER BY`. 
\nYou can specify multiple columns for `ZORDER BY` as a comma-separated list. However, the effectiveness of the locality drops with each extra column. Z-ordering on columns that do not have statistics collected on them would be ineffective and a waste of resources. This is because data skipping requires column-local stats such as min, max, and count. You can configure statistics collection on certain columns by reordering columns in the schema, or you can increase the number of columns to collect statistics on. \nNote \n* Z-ordering is *not idempotent* but aims to be an incremental operation. The time it takes for Z-ordering is not guaranteed to reduce over multiple runs. However, if no new data was added to a partition that was just Z-ordered, another Z-ordering of that partition will not have any effect.\n* Z-ordering aims to produce evenly-balanced data files with respect to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but there can be situations when that is not the case, leading to skew in optimize task times. \nFor example, if you `ZORDER BY` *date* and your most recent records are all much wider (for example longer arrays or string values) than the ones in the past, it is expected that the `OPTIMIZE` job\u2019s task durations will be skewed, as well as the resulting file sizes. This is, however, only a problem for the `OPTIMIZE` command itself; it should not have any negative impact on subsequent queries.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/data-skipping.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Search for workspace objects\n\nThis article describes how to search for tables, notebooks, queries, dashboards, alerts, files, folders, libraries, jobs, repos, partners, and Marketplace listings in your Databricks workspace. \nTables must be registered in Unity Catalog to appear in search results. 
\nNote \nThe search behavior described in this section is not supported for non-E2 workspaces. In those workspaces, you can click ![Search Icon](https:\/\/docs.databricks.com\/_images\/search-icon.png) **Search** in the sidebar and type a search string in the **Search Workspace** field. As you type, objects whose name contains the search string are listed. Click a name from the list to open that item in the workspace. \nIn workspaces that use [customer-managed keys for encryption](https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html), notebook contents and query contents are not available in search.\n\n#### Search for workspace objects\n##### Intelligent search\n\nDatabricks search leverages [DatabricksIQ](https:\/\/docs.databricks.com\/databricksiq\/index.html), the Data Intelligence Engine for Databricks, to provide a more intelligent AI-powered search experience. AI-generated comments use LLMs to automatically add descriptions and tags to tables and columns managed by Unity Catalog. These comments make the search engine aware of unique company jargon, metrics, and semantics, giving it the context needed to make search results more relevant, accurate, and actionable.\n\n","doc_uri":"https:\/\/docs.databricks.com\/search\/index.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Search for workspace objects\n##### Navigational search\n\nTo search the workspace using navigational search in the top bar of the UI, do the following: \n1. Click the **Search** field in the top bar of the Databricks workspace or use the keyboard shortcut Command-P. \n![Navigational search bar](https:\/\/docs.databricks.com\/_images\/navigational-search.png) \nYour recent files, notebooks, queries, alerts, and dashboards are listed under **Recents**, sorted by the last opened date.\n2. Enter your search criteria. \nRecent objects in the list are filtered to match your search criteria. 
Navigational search might also suggest other objects that match your criteria. To perform a complete search of the workspace, use the **Search results** page.\n3. Select an item from the list.\n\n#### Search for workspace objects\n##### Search results page\n\nThe full-page search experience gives you more space to see results, more metadata for your objects, and more filters to narrow down your results. \nTo filter search results by object type, object owner, or last modified date on the **Search results** page, do the following: \n1. Click the **Search** field in the top bar of the Databricks workspace or use the keyboard shortcut Command-P, and then press Enter. \nThe **Search results** page opens.\n2. Enter your search criteria.\n3. Select an item from the list. \nYou can search by text string, by object type, or both. After you type your search criteria and press **Enter**, the system searches the names of all queries, dashboards, alerts, files, folders, notebooks, libraries, repos, partners, and Marketplace listings in the workspace that you have access to. If your workspace is [enabled for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html), the system also searches table names, table comments, column names, and column comments.\n\n","doc_uri":"https:\/\/docs.databricks.com\/search\/index.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Search for workspace objects\n##### Search by text string\n\nTo search for a text string, type the string into the search field and then press Enter. The system searches the names of all objects in the workspace that you have access to. It also searches text in notebook commands, but not in non-notebook files. \nYou can place quotation marks around your search entry to narrow search results to only documents that contain your exact phrase. 
\nExact match search supports the following: \n* Basic quotation marks (for example, `\"spark.sql(\"`)\n* Escaped quotation marks (for example, `\"spark.sql(\\\"select\"`) \nExact match search doesn\u2019t support the following: \n* With quotation marks and without quotation marks (for example, `\"spark.sql\" partition`)\n* Multiple quotation marks (for example, `\"spark.sql\" \"partition\"`)\n\n#### Search for workspace objects\n##### Semantic search\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nYou can use natural language to search Unity Catalog tables. Search returns results that have related semantic meaning. \nFor example, the search query \u201cWhat should I use for geographies\u201d focuses on \u201cgeographies\u201d and finds related terms containing geographic attributes such as cities, countries, territories, and geo-locations. \nSearch can also understand patterns in your search queries by separating what might be a search term from a filter, which means that natural language queries are even more powerful. \nFor example, the search query \u201cShow me tables about inspections\u201d is broken down so that \u201cinspections\u201d is the key term and \u201ctable\u201d is the type of object the user is searching for.\n\n","doc_uri":"https:\/\/docs.databricks.com\/search\/index.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Search for workspace objects\n##### Limit search to a specific object type\n\nYou can search for items by type (such as files, folders, notebooks, libraries, tables, or repos) by clicking the object type on the **Search results** page, either from the **Type** drop-down list or from the tabs on the right side of the page. A text string is not required. If you leave the text field blank and then press Enter, the system searches for all objects of that type. Click a name from the list to open that item in the workspace. 
You can also use dropdown menus to further narrow search results for items of a specific type, such as by owner or last-modified date. \nYou can also specify filters in your search query in the search bar at the top of the UI. For example, you can include the following in your search query to search for tables you own: `type:table owner:me`. To learn more about how to specify your filters via syntax, apply filters on the **Search results** page and see how the query in the search bar automatically updates.\n\n#### Search for workspace objects\n##### Popularity\n\nSearch uses popularity signals based on how often other users in your workspace are interacting with specific tables to improve how tables are ranked. \nWithout popularity boosting, you would have to query the tables returned in the search results to know which is the authoritative table. With popularity boosting, the most popular table is ranked higher so you don\u2019t have to guess which is the correct one. The popularity indicator ![Popularity indicator icon](https:\/\/docs.databricks.com\/_images\/popularity-indicator.png) next to the table name in the search results reflects object ranking. You can also sort search results by popularity.\n\n#### Search for workspace objects\n##### Knowledge cards\n\nWhen search can identify what you\u2019re looking for with high confidence, the top search result turns into a knowledge card. A knowledge card provides additional object metadata. Knowledge cards are supported for Unity Catalog managed tables. 
\n![Example knowledge card](https:\/\/docs.databricks.com\/_images\/knowledge-card.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/search\/index.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Search for workspace objects\n##### Search tables and models in Unity Catalog-enabled workspaces\n\nIn workspaces [enabled for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html), you can search for tables and models registered in Unity Catalog. You can search on any of the following: \n* Table, view, or model names.\n* Table, view, or model comments.\n* Table or view column names.\n* Table or view column comments.\n* Table or view [tag keys](https:\/\/docs.databricks.com\/search\/index.html#tags). \nTo filter search results by parent catalog, parent schema, owner, or tag on the **Search results** page, click the **Type** drop-down menu and select **Tables**. The filter drop-down menus appear at the top of the page. \nYou can also sort the results by the table\u2019s popularity. \nSearch results don\u2019t include: \n* Tables, views, and models that you don\u2019t have permission to see. \nIn other words, for a table or model to appear in your search results, you must have at least the `SELECT` privilege on that table or `EXECUTE` privilege on the model, the `USE SCHEMA` privilege on its parent schema, and the `USE CATALOG` privilege on its parent catalog. Metastore admins have those privileges by default. All other users must be granted those privileges. See [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html).\n* Tables and views in the legacy Hive metastore (that is, in the `hive_metastore` catalog). 
\nTo upgrade these tables to Unity Catalog and make them available for search, follow the instructions in [Upgrade Hive tables and views to Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html).\n* Models in the workspace model registry. \nTo upgrade ML workflows to create models in Unity Catalog, see [Upgrade ML workflows to target models in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html). \n### Use tags to search for tables \nYou can use the Databricks workspace search bar to search for tables, views, and table columns using tag keys and tag values. You can also use tag keys to filter tables and views using workspace search. You cannot search for other tagged objects, like catalogs, schemas, or volumes. See also [Apply tags to Unity Catalog securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/tags.html). \nOnly tables and views that you have permission to see appear in search results. \nTo search for tables, views, and columns using tags: \n1. Click the **Search** field in the top bar of the Databricks workspace or use the keyboard shortcut Command-P. \nYou cannot use the filter field in Catalog Explorer to search by tag.\n2. Enter your search criteria. Search for tagged tables or columns by entering the table or column tag key or value. You must use the exact tag key or value term. \nIf you want to search by tag key alone, use the syntax: `tag:`. To search by both tag key and tag value, omit `tag:`. \n![Search for tables by tag key](https:\/\/docs.databricks.com\/_images\/tag-search.png) \nTo filter table search results using tag keys: \n1. Click the **Search** field in the top bar of the Databricks workspace or use the keyboard shortcut Command-P.\n2. Enter a search term or leave the search field blank.\n3. On the **Search results** page, click the **Type** drop-down menu and select **Tables**.\n4. 
Use the **Tag** filter drop-down menu to select the tag key.\n\n","doc_uri":"https:\/\/docs.databricks.com\/search\/index.html"} +{"content":"# Technology partners\n### Connect to semantic layer partners using Partner Connect\n\nTo connect your Databricks workspace to a semantic layer partner solution using Partner Connect, you typically follow the steps in this article. \nImportant \nBefore you follow the steps in this article, see the appropriate partner article for important partner-specific information. There might be differences in the connection steps between partner solutions. Some partner solutions also allow you to integrate with Databricks SQL warehouses (formerly Databricks SQL endpoints) or Databricks clusters, but not both.\n\n### Connect to semantic layer partners using Partner Connect\n#### Requirements\n\nSee the [requirements](https:\/\/docs.databricks.com\/partner-connect\/index.html#requirements) for using Partner Connect. \nImportant \nFor partner-specific requirements, see the appropriate partner article.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/semantic-layer.html"} +{"content":"# Technology partners\n### Connect to semantic layer partners using Partner Connect\n#### Steps to connect to a semantic layer partner\n\nTo connect your Databricks workspace to a semantic layer partner solution, follow the steps in this section. \nTip \nIf you have an existing partner account, Databricks recommends that you follow the steps to connect to the partner solution manually in the appropriate partner article. This is because the connection experience in Partner Connect is optimized for new partner accounts. \n1. In the sidebar, click ![Partner Connect button](https:\/\/docs.databricks.com\/_images\/partner-connect.png) **Partner Connect**.\n2. Click the partner tile. \nNote \nIf the partner tile has a check mark icon inside it, an administrator has already used Partner Connect to connect the partner to your workspace. Skip to step 8. 
The partner uses the email address for your Databricks account to prompt you to sign in to your existing partner account.\n3. If there are SQL warehouses in your workspace, select a SQL warehouse from the drop-down list. If your SQL warehouse is stopped, click **Start**.\n4. If there are no SQL warehouses in your workspace, do the following: \n1. Click **Create warehouse**. A new tab opens in your browser that displays the **New SQL Warehouse** page in the Databricks SQL UI.\n2. Follow the steps in [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n3. Return to the Partner Connect tab in your browser, then close the partner tile.\n4. Re-open the partner tile.\n5. Select the SQL warehouse you just created from the drop-down list.\n5. Select a catalog and a schema from the drop-down lists, then click **Add**. You can repeat this step to add multiple schemas. \nNote \nIf your workspace is Unity Catalog-enabled, but the partner doesn\u2019t support Unity Catalog with Partner Connect, the workspace default catalog is used. If your workspace isn\u2019t Unity Catalog-enabled, `hive_metastore` is used.\n6. Click **Next**. 
\nPartner Connect creates the following resources in your workspace: \n* A Databricks [service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) named **`_USER`**.\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) that is associated with the **`_USER`** service principal. \nPartner Connect also grants the following privileges to the **`_USER`** service principal: \n* (Unity Catalog) `USE CATALOG`: Required to interact with objects within the selected catalog.\n* (Unity Catalog) `USE SCHEMA`: Required to interact with objects within the selected schema.\n* (Hive metastore) `USAGE`: Required to grant the `SELECT` and `READ METADATA` privileges for the schemas you selected.\n* `SELECT`: Grants the ability to read the schemas you selected.\n* (Hive metastore) `READ METADATA`: Grants the ability to read metadata for the schemas you selected.\n* **CAN\\_USE**: Grants permissions to use the SQL warehouse you selected.\n7. Click **Next**. \nThe **Email** box displays the email address for your Databricks account. The partner uses this email address to prompt you to either create a new partner account or sign in to your existing partner account.\n8. Click **Connect to ``** or **Sign in**. \nA new tab opens in your web browser, which displays the partner website.\n9. Complete the on-screen instructions on the partner website to create your trial partner account or sign in to your existing partner account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/semantic-layer.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Share code between Databricks notebooks\n\nThis article describes how to use files to modularize your code, including how to create and import Python files. \nDatabricks also supports multi-task jobs, which allow you to combine notebooks into workflows with complex dependencies. 
For more information, see [Create and run Databricks Jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html).\n\n#### Share code between Databricks notebooks\n##### Modularize your code using files\n\nWith Databricks Runtime 11.3 LTS and above, you can create and manage source code files in the Databricks workspace, and then import these files into your notebooks as needed. You can also use a Databricks repo to sync your files with a Git repository. For details, see [Work with Python and R modules](https:\/\/docs.databricks.com\/files\/workspace-modules.html) and [Git integration with Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html).\n\n#### Share code between Databricks notebooks\n##### Create a file\n\nTo create a file: \n1. [Navigate to a folder in the workspace](https:\/\/docs.databricks.com\/workspace\/workspace-objects.html).\n2. Right-click on the folder name and select **Create > File**.\n3. Enter a name for the file and click **Create File** or press **Enter**. The file opens in an editor window. Changes are saved automatically.\n\n#### Share code between Databricks notebooks\n##### Open a file\n\nNavigate to the file in your workspace and click on it. 
The file path displays when you hover over the name of the file.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/share-code.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Share code between Databricks notebooks\n##### Import a file into a notebook\n\nYou can import a file into a notebook using standard Python import commands. \nSuppose you have the following file: \n![file that defines functions](https:\/\/docs.databricks.com\/_images\/functions.png) \nYou can import that file into a notebook and call the functions defined in the file: \n![import file into notebook](https:\/\/docs.databricks.com\/_images\/call-functions.png)\n\n#### Share code between Databricks notebooks\n##### Run a file\n\nYou can run a file from the editor. This is useful for testing. To run a file, place your cursor in the code area and press **Shift + Enter** to run the cell, or highlight code in the cell and press **Shift + Ctrl + Enter** to run only the selected code.\n\n#### Share code between Databricks notebooks\n##### Delete a file\n\nSee [Folders](https:\/\/docs.databricks.com\/workspace\/workspace-objects.html#folders) and [Workspace object operations](https:\/\/docs.databricks.com\/workspace\/workspace-objects.html#objects) for information about how to access the workspace menu and delete files or other items in the workspace.\n\n#### Share code between Databricks notebooks\n##### Rename a file\n\nTo change the title of an open file, click the title and edit inline or click **File > Rename**.\n\n#### Share code between Databricks notebooks\n##### Control access to a file\n\nIf your Databricks account has the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons), you can use [Workspace access control](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#files) to control who has access to a 
file.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/share-code.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n\nImportant \nThis article documents the 2.0 version of the Jobs API. However, Databricks recommends using [Jobs API 2.1](https:\/\/docs.databricks.com\/api\/workspace\/jobs) for new and existing clients and scripts. For details on the changes from the 2.0 to 2.1 versions, see [Updating from Jobs API 2.0 to 2.1](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html). \nThe Jobs API allows you to create, edit, and delete jobs. The maximum allowed size of a request to the Jobs API is 10 MB. \nFor details about updates to the Jobs API that support orchestration of multiple tasks with Databricks jobs, see [Updating from Jobs API 2.0 to 2.1](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html). \nWarning \nYou should never hard-code secrets or store them in plain text. Use the [Secrets API](https:\/\/docs.databricks.com\/api\/workspace\/secrets) to manage secrets in the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). Use the [Secrets utility (dbutils.secrets)](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#dbutils-secrets) to reference secrets in notebooks and jobs. \nNote \nIf you receive a 500-level error when making Jobs API requests, Databricks recommends retrying requests for up to 10 minutes (with a minimum 30-second interval between retries). \nImportant \nTo access Databricks REST APIs, you must [authenticate](https:\/\/docs.databricks.com\/dev-tools\/auth\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n##### Create\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/jobs\/create` | `POST` | \nCreate a new job. 
\n### Example \nThis example creates a job that runs a JAR task at 10:15pm each night. \n#### Request \n```\ncurl --netrc --request POST \\\nhttps:\/\/\/api\/2.0\/jobs\/create \\\n--data @create-job.json \\\n| jq .\n\n``` \n`create-job.json`: \n```\n{\n\"name\": \"Nightly model training\",\n\"new_cluster\": {\n\"spark_version\": \"7.3.x-scala2.12\",\n\"node_type_id\": \"r3.xlarge\",\n\"aws_attributes\": {\n\"availability\": \"ON_DEMAND\"\n},\n\"num_workers\": 10\n},\n\"libraries\": [\n{\n\"jar\": \"dbfs:\/my-jar.jar\"\n},\n{\n\"maven\": {\n\"coordinates\": \"org.jsoup:jsoup:1.7.2\"\n}\n}\n],\n\"email_notifications\": {\n\"on_start\": [],\n\"on_success\": [],\n\"on_failure\": []\n},\n\"webhook_notifications\": {\n\"on_start\": [\n{\n\"id\": \"bf2fbd0a-4a05-4300-98a5-303fc8132233\"\n}\n],\n\"on_success\": [\n{\n\"id\": \"bf2fbd0a-4a05-4300-98a5-303fc8132233\"\n}\n],\n\"on_failure\": []\n},\n\"notification_settings\": {\n\"no_alert_for_skipped_runs\": false,\n\"no_alert_for_canceled_runs\": false,\n\"alert_on_last_attempt\": false\n},\n\"timeout_seconds\": 3600,\n\"max_retries\": 1,\n\"schedule\": {\n\"quartz_cron_expression\": \"0 15 22 * * ?\",\n\"timezone_id\": \"America\/Los_Angeles\"\n},\n\"spark_jar_task\": {\n\"main_class_name\": \"com.databricks.ComputeModels\"\n}\n}\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`.\n* The contents of `create-job.json` with fields that are appropriate for your solution. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file and [jq](https:\/\/stedolan.github.io\/jq\/). 
\n#### Response \n```\n{\n\"job_id\": 1\n}\n\n``` \n### Request structure \nImportant \n* When you run a job on a new jobs cluster, the job is treated as a Jobs Compute (automated) workload subject to Jobs Compute pricing.\n* When you run a job on an existing all-purpose cluster, it is treated as an All-Purpose Compute (interactive) workload subject to All-Purpose Compute pricing. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `existing_cluster_id` OR `new_cluster` | `STRING` OR [NewCluster](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsclusterspecnewcluster) | If existing\\_cluster\\_id, the ID of an existing cluster that will be used for all runs of this job. When running jobs on an existing cluster, you may need to manually restart the cluster if it stops responding. We suggest running jobs on new clusters for greater reliability. If new\\_cluster, a description of a cluster that will be created for each run. If specifying a [PipelineTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobspipelinetask), this field can be empty. |\n| `notebook_task` OR `spark_jar_task` OR `spark_python_task` OR `spark_submit_task` OR `pipeline_task` OR `run_job_task` | [NotebookTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsnotebooktask) OR [SparkJarTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobssparkjartask) OR [SparkPythonTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobssparkpythontask) OR [SparkSubmitTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobssparksubmittask) OR [PipelineTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobspipelinetask) OR [RunJobTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsrunjobtask) | If notebook\\_task, indicates that this job should run a notebook. This field may not be specified in conjunction with spark\\_jar\\_task. 
If spark\\_jar\\_task, indicates that this job should run a JAR. If spark\\_python\\_task, indicates that this job should run a Python file. If spark\\_submit\\_task, indicates that this job should be launched by the spark submit script. If pipeline\\_task, indicates that this job should run a Delta Live Tables pipeline. If run\\_job\\_task, indicates that this job should run another job. |\n| `name` | `STRING` | An optional name for the job. The default value is `Untitled`. |\n| `libraries` | An array of [Library](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#managedlibrarieslibrary) | An optional list of libraries to be installed on the cluster that will execute the job. The default value is an empty list. |\n| `email_notifications` | [JobEmailNotifications](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobemailnotifications) | An optional set of email addresses notified when runs of this job begin and complete and when this job is deleted. The default behavior is to not send any emails. |\n| `webhook_notifications` | [WebhookNotifications](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobsystemnotifications) | An optional set of system destinations to notify when runs of this job begin, complete, or fail. |\n| `notification_settings` | [JobNotificationSettings](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobnotificationsettings) | Optional notification settings that are used when sending notifications to each of the `email_notifications` and `webhook_notifications` for this job. |\n| `timeout_seconds` | `INT32` | An optional timeout applied to each run of this job. The default behavior is to have no timeout. |\n| `max_retries` | `INT32` | An optional maximum number of times to retry an unsuccessful run. A run is considered to be unsuccessful if it completes with the `FAILED` result\\_state or `INTERNAL_ERROR` `life_cycle_state`. 
The value -1 means to retry indefinitely and the value 0 means to never retry. The default behavior is to never retry. |\n| `min_retry_interval_millis` | `INT32` | An optional minimum interval in milliseconds between the start of the failed run and the subsequent retry run. The default behavior is that unsuccessful runs are immediately retried. |\n| `retry_on_timeout` | `BOOL` | An optional policy to specify whether to retry a job when it times out. The default behavior is to not retry on timeout. |\n| `schedule` | [CronSchedule](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobscronschedule) | An optional periodic schedule for this job. The default behavior is that the job runs when triggered by clicking **Run Now** in the Jobs UI or sending an API request to `run-now`. |\n| `max_concurrent_runs` | `INT32` | An optional maximum allowed number of concurrent runs of the job. Set this value if you want to be able to execute multiple runs of the same job concurrently. This is useful, for example, if you trigger your job on a frequent schedule and want to allow consecutive runs to overlap with each other, or if you want to trigger multiple runs that differ by their input parameters. This setting affects only new runs. For example, suppose the job\u2019s concurrency is 4 and there are 4 concurrent active runs. Then setting the concurrency to 3 won\u2019t kill any of the active runs. However, from then on, new runs are skipped unless there are fewer than 3 active runs. This value cannot exceed 1000. Setting this value to 0 causes all new runs to be skipped. The default behavior is to allow only 1 concurrent run. | \n### Response structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `job_id` | `INT64` | The canonical identifier for the newly created job. 
|\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n##### List\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/jobs\/list` | `GET` | \nList all jobs. \n### Example \n#### Request \n```\ncurl --netrc --request GET \\\nhttps:\/\/\/api\/2.0\/jobs\/list \\\n| jq .\n\n``` \nReplace `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file and [jq](https:\/\/stedolan.github.io\/jq\/). \n#### Response \n```\n{\n\"jobs\": [\n{\n\"job_id\": 1,\n\"settings\": {\n\"name\": \"Nightly model training\",\n\"new_cluster\": {\n\"spark_version\": \"7.3.x-scala2.12\",\n\"node_type_id\": \"r3.xlarge\",\n\"aws_attributes\": {\n\"availability\": \"ON_DEMAND\"\n},\n\"num_workers\": 10\n},\n\"libraries\": [\n{\n\"jar\": \"dbfs:\/my-jar.jar\"\n},\n{\n\"maven\": {\n\"coordinates\": \"org.jsoup:jsoup:1.7.2\"\n}\n}\n],\n\"email_notifications\": {\n\"on_start\": [],\n\"on_success\": [],\n\"on_failure\": []\n},\n\"timeout_seconds\": 100000000,\n\"max_retries\": 1,\n\"schedule\": {\n\"quartz_cron_expression\": \"0 15 22 * * ?\",\n\"timezone_id\": \"America\/Los_Angeles\",\n\"pause_status\": \"UNPAUSED\"\n},\n\"spark_jar_task\": {\n\"main_class_name\": \"com.databricks.ComputeModels\"\n}\n},\n\"created_time\": 1457570074236\n}\n]\n}\n\n``` \n### Response structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `jobs` | An array of [Job](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjob) | The list of jobs. 
|\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n##### Delete\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/jobs\/delete` | `POST` | \nDelete a job and send an email to the addresses specified in `JobSettings.email_notifications`. No action occurs if the job has already been removed. After the job is removed, neither its details nor its run history is visible in the Jobs UI or API. The job is guaranteed to be removed upon completion of this request. However, runs that were active before the receipt of this request may still be active. They will be terminated asynchronously. \n### Example \n```\ncurl --netrc --request POST \\\nhttps:\/\/\/api\/2.0\/jobs\/delete \\\n--data '{ \"job_id\": }'\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`.\n* `` with the ID of the job, for example `123`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file. \n### Request structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `job_id` | `INT64` | The canonical identifier of the job to delete. This field is required. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n##### Get\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/jobs\/get` | `GET` | \nRetrieve information about a single job. 
\n### Example \n#### Request \n```\ncurl --netrc --request GET \\\n'https:\/\/\/api\/2.0\/jobs\/get?job_id=' \\\n| jq .\n\n``` \nOr: \n```\ncurl --netrc --get \\\nhttps:\/\/\/api\/2.0\/jobs\/get \\\n--data job_id= \\\n| jq .\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`.\n* `` with the ID of the job, for example `123`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file and [jq](https:\/\/stedolan.github.io\/jq\/). \n#### Response \n```\n{\n\"job_id\": 1,\n\"settings\": {\n\"name\": \"Nightly model training\",\n\"new_cluster\": {\n\"spark_version\": \"7.3.x-scala2.12\",\n\"node_type_id\": \"r3.xlarge\",\n\"aws_attributes\": {\n\"availability\": \"ON_DEMAND\"\n},\n\"num_workers\": 10\n},\n\"libraries\": [\n{\n\"jar\": \"dbfs:\/my-jar.jar\"\n},\n{\n\"maven\": {\n\"coordinates\": \"org.jsoup:jsoup:1.7.2\"\n}\n}\n],\n\"email_notifications\": {\n\"on_start\": [],\n\"on_success\": [],\n\"on_failure\": []\n},\n\"webhook_notifications\": {\n\"on_start\": [\n{\n\"id\": \"bf2fbd0a-4a05-4300-98a5-303fc8132233\"\n}\n],\n\"on_success\": [\n{\n\"id\": \"bf2fbd0a-4a05-4300-98a5-303fc8132233\"\n}\n],\n\"on_failure\": []\n},\n\"notification_settings\": {\n\"no_alert_for_skipped_runs\": false,\n\"no_alert_for_canceled_runs\": false,\n\"alert_on_last_attempt\": false\n},\n\"timeout_seconds\": 100000000,\n\"max_retries\": 1,\n\"schedule\": {\n\"quartz_cron_expression\": \"0 15 22 * * ?\",\n\"timezone_id\": \"America\/Los_Angeles\",\n\"pause_status\": \"UNPAUSED\"\n},\n\"spark_jar_task\": {\n\"main_class_name\": \"com.databricks.ComputeModels\"\n}\n},\n\"created_time\": 1457570074236\n}\n\n``` \n### Request structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `job_id` | `INT64` | The canonical identifier of the job to retrieve information about. This field is required. 
| \n### Response structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `job_id` | `INT64` | The canonical identifier for this job. |\n| `creator_user_name` | `STRING` | The creator user name. This field won\u2019t be included in the response if the user has been deleted. |\n| `settings` | [JobSettings](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettings) | Settings for this job and all of its runs. These settings can be updated using the [Reset](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsserviceresetjob) or [Update](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsserviceupdatejob) endpoints. |\n| `created_time` | `INT64` | The time at which this job was created in epoch milliseconds (milliseconds since 1\/1\/1970 UTC). |\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n##### Reset\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/jobs\/reset` | `POST` | \nOverwrite all settings for a specific job. Use the [Update](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsserviceupdatejob) endpoint to update job settings partially. \n### Example \nThis example request makes job 2 identical to job 1 in the [create](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsservicecreatejob) example. 
\n```\ncurl --netrc --request POST \\\nhttps:\/\/\/api\/2.0\/jobs\/reset \\\n--data @reset-job.json \\\n| jq .\n\n``` \n`reset-job.json`: \n```\n{\n\"job_id\": 2,\n\"new_settings\": {\n\"name\": \"Nightly model training\",\n\"new_cluster\": {\n\"spark_version\": \"7.3.x-scala2.12\",\n\"node_type_id\": \"r3.xlarge\",\n\"aws_attributes\": {\n\"availability\": \"ON_DEMAND\"\n},\n\"num_workers\": 10\n},\n\"libraries\": [\n{\n\"jar\": \"dbfs:\/my-jar.jar\"\n},\n{\n\"maven\": {\n\"coordinates\": \"org.jsoup:jsoup:1.7.2\"\n}\n}\n],\n\"email_notifications\": {\n\"on_start\": [],\n\"on_success\": [],\n\"on_failure\": []\n},\n\"webhook_notifications\": {\n\"on_start\": [\n{\n\"id\": \"bf2fbd0a-4a05-4300-98a5-303fc8132233\"\n}\n],\n\"on_success\": [\n{\n\"id\": \"bf2fbd0a-4a05-4300-98a5-303fc8132233\"\n}\n],\n\"on_failure\": []\n},\n\"notification_settings\": {\n\"no_alert_for_skipped_runs\": false,\n\"no_alert_for_canceled_runs\": false,\n\"alert_on_last_attempt\": false\n},\n\"timeout_seconds\": 100000000,\n\"max_retries\": 1,\n\"schedule\": {\n\"quartz_cron_expression\": \"0 15 22 * * ?\",\n\"timezone_id\": \"America\/Los_Angeles\",\n\"pause_status\": \"UNPAUSED\"\n},\n\"spark_jar_task\": {\n\"main_class_name\": \"com.databricks.ComputeModels\"\n}\n}\n}\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`.\n* The contents of `reset-job.json` with fields that are appropriate for your solution. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file and [jq](https:\/\/stedolan.github.io\/jq\/). \n### Request structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `job_id` | `INT64` | The canonical identifier of the job to reset. This field is required. 
|\n| `new_settings` | [JobSettings](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettings) | The new settings of the job. These settings completely replace the old settings. Changes to the field `JobSettings.timeout_seconds` are applied to active runs. Changes to other fields are applied to future runs only. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n##### Update\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/jobs\/update` | `POST` | \nAdd, change, or remove specific settings of an existing job. Use the [Reset](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsserviceresetjob) endpoint to overwrite all job settings. \n### Example \nThis example request removes libraries and adds email notification settings to job 1 defined in the [create](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsservicecreatejob) example. \n```\ncurl --netrc --request POST \\\nhttps:\/\/\/api\/2.0\/jobs\/update \\\n--data @update-job.json \\\n| jq .\n\n``` \n`update-job.json`: \n```\n{\n\"job_id\": 1,\n\"new_settings\": {\n\"existing_cluster_id\": \"1201-my-cluster\",\n\"email_notifications\": {\n\"on_start\": [ \"someone@example.com\" ],\n\"on_success\": [],\n\"on_failure\": []\n}\n},\n\"fields_to_remove\": [\"libraries\"]\n}\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`.\n* The contents of `update-job.json` with fields that are appropriate for your solution. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file and [jq](https:\/\/stedolan.github.io\/jq\/). 
\n### Request structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `job_id` | `INT64` | The canonical identifier of the job to update. This field is required. |\n| `new_settings` | [JobSettings](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettings) | The new settings for the job. Top-level fields specified in `new_settings`, except for arrays, are completely replaced. Arrays are merged based on the respective key fields, such as `task_key` or `job_cluster_key`, and array entries with the same key are completely replaced. Except for array merging, partially updating nested fields is not supported. Changes to the field `JobSettings.timeout_seconds` are applied to active runs. Changes to other fields are applied to future runs only. |\n| `fields_to_remove` | An array of `STRING` | Remove top-level fields in the job settings. Removing nested fields is not supported, except for entries from the `tasks` and `job_clusters` arrays. For example, the following is a valid argument for this field: `[\"libraries\", \"schedule\", \"tasks\/task_1\", \"job_clusters\/Default\"]` This field is optional. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n##### Run now\n\nImportant \n* A workspace is limited to 1000 concurrent task runs. A `429 Too Many Requests` response is returned when you request a run that cannot start immediately.\n* The number of jobs a workspace can create in an hour is limited to 10000 (includes \u201cruns submit\u201d). This limit also affects jobs created by the REST API and notebook workflows. \n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/jobs\/run-now` | `POST` | \nRun a job now and return the `run_id` of the triggered run. 
\nTip \nIf you invoke [Create](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsservicecreatejob) together with [Run now](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsservicerunnow), you can use the [Runs submit](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsservicesubmitrun) endpoint instead, which allows you to submit your workload directly without having to create a job. \n### Example \n```\ncurl --netrc --request POST \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/jobs\/run-now \\\n--data @run-job.json \\\n| jq .\n\n``` \n`run-job.json`: \nAn example request for a notebook job: \n```\n{\n\"job_id\": 1,\n\"notebook_params\": {\n\"name\": \"john doe\",\n\"age\": \"35\"\n}\n}\n\n``` \nAn example request for a JAR job: \n```\n{\n\"job_id\": 2,\n\"jar_params\": [ \"john doe\", \"35\" ]\n}\n\n``` \nReplace: \n* `<databricks-instance>` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`.\n* The contents of `run-job.json` with fields that are appropriate for your solution. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file and [jq](https:\/\/stedolan.github.io\/jq\/). \n### Request structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `job_id` | `INT64` | |\n| `jar_params` | An array of `STRING` | A list of parameters for jobs with JAR tasks, e.g. `\"jar_params\": [\"john doe\", \"35\"]`. The parameters will be used to invoke the main function of the main class specified in the Spark JAR task. If not specified upon `run-now`, it will default to an empty list. jar\\_params cannot be specified in conjunction with notebook\\_params. The JSON representation of this field (i.e. `{\"jar_params\":[\"john doe\",\"35\"]}`) cannot exceed 10,000 bytes. 
|\n| `notebook_params` | A map of [ParamPair](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsparampair) | A map from keys to values for jobs with notebook task, e.g. `\"notebook_params\": {\"name\": \"john doe\", \"age\": \"35\"}`. The map is passed to the notebook and is accessible through the [dbutils.widgets.get](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#dbutils-widgets) function. If not specified upon `run-now`, the triggered run uses the job\u2019s base parameters. You cannot specify notebook\\_params in conjunction with jar\\_params. The JSON representation of this field (i.e. `{\"notebook_params\":{\"name\":\"john doe\",\"age\":\"35\"}}`) cannot exceed 10,000 bytes. |\n| `python_params` | An array of `STRING` | A list of parameters for jobs with Python tasks, e.g. `\"python_params\": [\"john doe\", \"35\"]`. The parameters will be passed to Python file as command-line parameters. If specified upon `run-now`, it would overwrite the parameters specified in job setting. The JSON representation of this field (i.e. `{\"python_params\":[\"john doe\",\"35\"]}`) cannot exceed 10,000 bytes. |\n| `spark_submit_params` | An array of `STRING` | A list of parameters for jobs with spark submit task, e.g. `\"spark_submit_params\": [\"--class\", \"org.apache.spark.examples.SparkPi\"]`. The parameters will be passed to spark-submit script as command-line parameters. If specified upon `run-now`, it would overwrite the parameters specified in job setting. The JSON representation of this field cannot exceed 10,000 bytes. |\n| `idempotency_token` | `STRING` | An optional token to guarantee the idempotency of job run requests. If a run with the provided token already exists, the request does not create a new run but returns the ID of the existing run instead. If a run with the provided token is deleted, an error is returned. If you specify the idempotency token, upon failure you can retry until the request succeeds. 
Databricks guarantees that exactly one run is launched with that idempotency token. This token must have at most 64 characters. For more information, see [How to ensure idempotency for jobs](https:\/\/kb.databricks.com\/jobs\/jobs-idempotency.html). | \n### Response structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `run_id` | `INT64` | The globally unique ID of the newly triggered run. |\n| `number_in_job` | `INT64` | The sequence number of this run among all runs of the job. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n##### Runs submit\n\nImportant \n* A workspace is limited to 1000 concurrent task runs. A `429 Too Many Requests` response is returned when you request a run that cannot start immediately.\n* The number of jobs a workspace can create in an hour is limited to 10000 (includes \u201cruns submit\u201d). This limit also affects jobs created by the REST API and notebook workflows. \n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/jobs\/runs\/submit` | `POST` | \nSubmit a one-time run. This endpoint allows you to submit a workload directly without creating a job. Use the `jobs\/runs\/get` API to check the run state after the job is submitted. 
\n### Example \n#### Request \n```\ncurl --netrc --request POST \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/jobs\/runs\/submit \\\n--data @submit-job.json \\\n| jq .\n\n``` \n`submit-job.json`: \n```\n{\n\"run_name\": \"my spark task\",\n\"new_cluster\": {\n\"spark_version\": \"7.3.x-scala2.12\",\n\"node_type_id\": \"r3.xlarge\",\n\"aws_attributes\": {\n\"availability\": \"ON_DEMAND\"\n},\n\"num_workers\": 10\n},\n\"libraries\": [\n{\n\"jar\": \"dbfs:\/my-jar.jar\"\n},\n{\n\"maven\": {\n\"coordinates\": \"org.jsoup:jsoup:1.7.2\"\n}\n}\n],\n\"spark_jar_task\": {\n\"main_class_name\": \"com.databricks.ComputeModels\"\n}\n}\n\n``` \nReplace: \n* `<databricks-instance>` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`.\n* The contents of `submit-job.json` with fields that are appropriate for your solution. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file and [jq](https:\/\/stedolan.github.io\/jq\/). \n#### Response \n```\n{\n\"run_id\": 123\n}\n\n``` \n### Request structure \nImportant \n* When you run a job on a new jobs cluster, the job is treated as a Jobs Compute (automated) workload subject to Jobs Compute pricing.\n* When you run a job on an existing all-purpose cluster, it is treated as an All-Purpose Compute (interactive) workload subject to All-Purpose Compute pricing. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `existing_cluster_id` OR `new_cluster` | `STRING` OR [NewCluster](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsclusterspecnewcluster) | If existing\\_cluster\\_id, the ID of an existing cluster that will be used for all runs of this job. When running jobs on an existing cluster, you may need to manually restart the cluster if it stops responding. We suggest running jobs on new clusters for greater reliability. 
If new\\_cluster, a description of a cluster that will be created for each run. If specifying a [PipelineTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobspipelinetask), then this field can be empty. |\n| `notebook_task` OR `spark_jar_task` OR `spark_python_task` OR `spark_submit_task` OR `pipeline_task` OR `run_job_task` | [NotebookTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsnotebooktask) OR [SparkJarTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobssparkjartask) OR [SparkPythonTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobssparkpythontask) OR [SparkSubmitTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobssparksubmittask) OR [PipelineTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobspipelinetask) OR [RunJobTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsrunjobtask) | If notebook\\_task, indicates that this job should run a notebook. This field may not be specified in conjunction with spark\\_jar\\_task. If spark\\_jar\\_task, indicates that this job should run a JAR. If spark\\_python\\_task, indicates that this job should run a Python file. If spark\\_submit\\_task, indicates that this job should be launched by the spark submit script. If pipeline\\_task, indicates that this job should run a Delta Live Tables pipeline. If run\\_job\\_task, indicates that this job should run another job. |\n| `run_name` | `STRING` | An optional name for the run. The default value is `Untitled`. |\n| `webhook_notifications` | [WebhookNotifications](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobsystemnotifications) | An optional set of system destinations to notify when runs of this job begin, complete, or fail. 
|\n| `notification_settings` | [JobNotificationSettings](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobnotificationsettings) | Optional notification settings that are used when sending notifications to each of the `webhook_notifications` for this run. |\n| `libraries` | An array of [Library](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#managedlibrarieslibrary) | An optional list of libraries to be installed on the cluster that will execute the job. The default value is an empty list. |\n| `timeout_seconds` | `INT32` | An optional timeout applied to each run of this job. The default behavior is to have no timeout. |\n| `idempotency_token` | `STRING` | An optional token to guarantee the idempotency of job run requests. If a run with the provided token already exists, the request does not create a new run but returns the ID of the existing run instead. If a run with the provided token is deleted, an error is returned. If you specify the idempotency token, upon failure you can retry until the request succeeds. Databricks guarantees that exactly one run is launched with that idempotency token. This token must have at most 64 characters. For more information, see [How to ensure idempotency for jobs](https:\/\/kb.databricks.com\/jobs\/jobs-idempotency.html). | \n### Response structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `run_id` | `INT64` | The canonical identifier for the newly submitted run. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n##### Runs list\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/jobs\/runs\/list` | `GET` | \nList runs in descending order by start time. \nNote \nRuns are automatically removed after 60 days. If you want to reference them beyond 60 days, you should save old run results before they expire. 
To export using the UI, see [Export job run results](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#export-job-runs). To export using the Jobs API, see [Runs export](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsserviceexportrun). \n### Example \n#### Request \n```\ncurl --netrc --request GET \\\n'https:\/\/<databricks-instance>\/api\/2.0\/jobs\/runs\/list?job_id=<job-id>&active_only=<true-false>&offset=<offset>&limit=<limit>&run_type=<run-type>' \\\n| jq .\n\n``` \nOr: \n```\ncurl --netrc --get \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/jobs\/runs\/list \\\n--data 'job_id=<job-id>&active_only=<true-false>&offset=<offset>&limit=<limit>&run_type=<run-type>' \\\n| jq .\n\n``` \nReplace: \n* `<databricks-instance>` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`.\n* `<job-id>` with the ID of the job, for example `123`.\n* `<true-false>` with `true` or `false`.\n* `<offset>` with the `offset` value.\n* `<limit>` with the `limit` value.\n* `<run-type>` with the `run_type` value. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file and [jq](https:\/\/stedolan.github.io\/jq\/). 
\n#### Response \n```\n{\n\"runs\": [\n{\n\"job_id\": 1,\n\"run_id\": 452,\n\"number_in_job\": 5,\n\"state\": {\n\"life_cycle_state\": \"RUNNING\",\n\"state_message\": \"Performing action\"\n},\n\"task\": {\n\"notebook_task\": {\n\"notebook_path\": \"\/Users\/donald@duck.com\/my-notebook\"\n}\n},\n\"cluster_spec\": {\n\"existing_cluster_id\": \"1201-my-cluster\"\n},\n\"cluster_instance\": {\n\"cluster_id\": \"1201-my-cluster\",\n\"spark_context_id\": \"1102398-spark-context-id\"\n},\n\"overriding_parameters\": {\n\"jar_params\": [\"param1\", \"param2\"]\n},\n\"start_time\": 1457570074236,\n\"end_time\": 1457570075149,\n\"setup_duration\": 259754,\n\"execution_duration\": 3589020,\n\"cleanup_duration\": 31038,\n\"run_duration\": 3879812,\n\"trigger\": \"PERIODIC\"\n}\n],\n\"has_more\": true\n}\n\n``` \n### Request structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `active_only` OR `completed_only` | `BOOL` OR `BOOL` | If active\\_only is `true`, only active runs are included in the results; otherwise, lists both active and completed runs. An active run is a run in the `PENDING`, `RUNNING`, or `TERMINATING` [RunLifecycleState](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsrunlifecyclestate). This field cannot be `true` when completed\\_only is `true`. If completed\\_only is `true`, only completed runs are included in the results; otherwise, lists both active and completed runs. This field cannot be `true` when active\\_only is `true`. |\n| `job_id` | `INT64` | The job for which to list runs. If omitted, the Jobs service will list runs from all jobs. |\n| `offset` | `INT32` | The offset of the first run to return, relative to the most recent run. |\n| `limit` | `INT32` | The number of runs to return. This value should be greater than 0 and less than 1000. The default value is 20. If a request specifies a limit of 0, the service will instead use the maximum limit. |\n| `run_type` | `STRING` | The type of runs to return. 
For a description of run types, see [Run](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsrun). | \n### Response structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `runs` | An array of [Run](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsrun) | A list of runs, from most recently started to least. |\n| `has_more` | `BOOL` | If true, additional runs matching the provided filter are available for listing. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n##### Runs get\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/jobs\/runs\/get` | `GET` | \nRetrieve the metadata of a run. \nNote \nRuns are automatically removed after 60 days. If you want to reference them beyond 60 days, you should save old run results before they expire. To export using the UI, see [Export job run results](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#export-job-runs). To export using the Jobs API, see [Runs export](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsserviceexportrun). \n### Example \n#### Request \n```\ncurl --netrc --request GET \\\n'https:\/\/<databricks-instance>\/api\/2.0\/jobs\/runs\/get?run_id=<run-id>' \\\n| jq .\n\n``` \nOr: \n```\ncurl --netrc --get \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/jobs\/runs\/get \\\n--data run_id=<run-id> \\\n| jq .\n\n``` \nReplace: \n* `<databricks-instance>` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`.\n* `<run-id>` with the ID of the run, for example `123`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file and [jq](https:\/\/stedolan.github.io\/jq\/). 
\n#### Response \n```\n{\n\"job_id\": 1,\n\"run_id\": 452,\n\"number_in_job\": 5,\n\"state\": {\n\"life_cycle_state\": \"RUNNING\",\n\"state_message\": \"Performing action\"\n},\n\"task\": {\n\"notebook_task\": {\n\"notebook_path\": \"\/Users\/someone@example.com\/my-notebook\"\n}\n},\n\"cluster_spec\": {\n\"existing_cluster_id\": \"1201-my-cluster\"\n},\n\"cluster_instance\": {\n\"cluster_id\": \"1201-my-cluster\",\n\"spark_context_id\": \"1102398-spark-context-id\"\n},\n\"overriding_parameters\": {\n\"jar_params\": [\"param1\", \"param2\"]\n},\n\"start_time\": 1457570074236,\n\"end_time\": 1457570075149,\n\"setup_duration\": 259754,\n\"execution_duration\": 3589020,\n\"cleanup_duration\": 31038,\n\"run_duration\": 3879812,\n\"trigger\": \"PERIODIC\"\n}\n\n``` \n### Request structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `run_id` | `INT64` | The canonical identifier of the run for which to retrieve the metadata. This field is required. | \n### Response structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `job_id` | `INT64` | The canonical identifier of the job that contains this run. |\n| `run_id` | `INT64` | The canonical identifier of the run. This ID is unique across all runs of all jobs. |\n| `number_in_job` | `INT64` | The sequence number of this run among all runs of the job. This value starts at 1. |\n| `original_attempt_run_id` | `INT64` | If this run is a retry of a prior run attempt, this field contains the run\\_id of the original attempt; otherwise, it is the same as the run\\_id. |\n| `state` | [RunState](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsrunstate) | The result and lifecycle states of the run. |\n| `schedule` | [CronSchedule](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobscronschedule) | The cron schedule that triggered this run if it was triggered by the periodic scheduler. 
|\n| `task` | [JobTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobtask) | The task performed by the run, if any. |\n| `cluster_spec` | [ClusterSpec](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsclusterspec) | A snapshot of the job\u2019s cluster specification when this run was created. |\n| `cluster_instance` | [ClusterInstance](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsclusterinstance) | The cluster used for this run. If the run is specified to use a new cluster, this field will be set once the Jobs service has requested a cluster for the run. |\n| `overriding_parameters` | [RunParameters](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsrunparameters) | The parameters used for this run. |\n| `start_time` | `INT64` | The time at which this run was started in epoch milliseconds (milliseconds since 1\/1\/1970 UTC). This may not be the time when the job task starts executing, for example, if the job is scheduled to run on a new cluster, this is the time the cluster creation call is issued. |\n| `end_time` | `INT64` | The time at which this run ended in epoch milliseconds (milliseconds since 1\/1\/1970 UTC). This field will be set to 0 if the job is still running. |\n| `setup_duration` | `INT64` | The time in milliseconds it took to set up the cluster. For runs that run on new clusters this is the cluster creation time, for runs that run on existing clusters this time should be very short. The total duration of the run is the sum of the `setup_duration`, `execution_duration`, and the `cleanup_duration`. The `setup_duration` field is set to 0 for multitask job runs. The total duration of a multitask job run is the value of the `run_duration` field. |\n| `execution_duration` | `INT64` | The time in milliseconds it took to execute the commands in the JAR or notebook until they completed, failed, timed out, were cancelled, or encountered an unexpected error. 
The total duration of the run is the sum of the `setup_duration`, `execution_duration`, and the `cleanup_duration`. The `execution_duration` field is set to 0 for multitask job runs. The total duration of a multitask job run is the value of the `run_duration` field. |\n| `cleanup_duration` | `INT64` | The time in milliseconds it took to terminate the cluster and clean up any associated artifacts. The total duration of the run is the sum of the `setup_duration`, `execution_duration`, and the `cleanup_duration`. The `cleanup_duration` field is set to 0 for multitask job runs. The total duration of a multitask job run is the value of the `run_duration` field. |\n| `run_duration` | `INT64` | The time in milliseconds it took the job run and all of its repairs to finish. This field is only set for multitask job runs and not task runs. The duration of a task run is the sum of the `setup_duration`, `execution_duration`, and the `cleanup_duration`. |\n| `trigger` | [TriggerType](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobstriggertype) | The type of trigger that fired this run. |\n| `creator_user_name` | `STRING` | The creator user name. This field won\u2019t be included in the response if the user has been deleted |\n| `run_page_url` | `STRING` | The URL to the detail page of the run. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n##### Runs export\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/jobs\/runs\/export` | `GET` | \nExport and retrieve the job run task. \nNote \nOnly notebook runs can be exported in HTML format. Exporting runs of other types will fail. 
\n### Example \n#### Request \n```\ncurl --netrc --request GET \\\n'https:\/\/<databricks-instance>\/api\/2.0\/jobs\/runs\/export?run_id=<run-id>' \\\n| jq .\n\n``` \nOr: \n```\ncurl --netrc --get \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/jobs\/runs\/export \\\n--data run_id=<run-id> \\\n| jq .\n\n``` \nReplace: \n* `<databricks-instance>` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`.\n* `<run-id>` with the ID of the run, for example `123`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file and [jq](https:\/\/stedolan.github.io\/jq\/). \n#### Response \n```\n{\n\"views\": [ {\n\"content\": \"<!DOCTYPE html><html><head>Head<\/head><body>Body<\/body><\/html>\",\n\"name\": \"my-notebook\",\n\"type\": \"NOTEBOOK\"\n} ]\n}\n\n``` \nTo extract the HTML notebook from the JSON response, download and run this [Python script](https:\/\/docs.databricks.com\/_static\/examples\/extract.py). \nNote \nThe notebook body in the `__DATABRICKS_NOTEBOOK_MODEL` object is encoded. \n### Request structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `run_id` | `INT64` | The canonical identifier for the run. This field is required. |\n| `views_to_export` | [ViewsToExport](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsviewstoexport) | Which views to export (CODE, DASHBOARDS, or ALL). Defaults to CODE. | \n### Response structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `views` | An array of [ViewItem](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsviewitem) | The exported content in HTML format (one for every view item). |\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n##### Runs cancel\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/jobs\/runs\/cancel` | `POST` | \nCancel a job run. 
Because the run is canceled asynchronously, the run may still be running when this request completes. The run will be terminated shortly. If the run is already in a terminal `life_cycle_state`, this method is a no-op. \nThis endpoint validates that the `run_id` parameter is valid and for invalid parameters returns HTTP status code 400. \n### Example \n```\ncurl --netrc --request POST \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/jobs\/runs\/cancel \\\n--data '{ \"run_id\": <run-id> }'\n\n``` \nReplace: \n* `<databricks-instance>` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`.\n* `<run-id>` with the ID of the run, for example `123`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file. \n### Request structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `run_id` | `INT64` | The canonical identifier of the run to cancel. This field is required. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n##### Runs cancel all\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/jobs\/runs\/cancel-all` | `POST` | \nCancel all active runs of a job. Because the run is canceled asynchronously, it doesn\u2019t prevent new runs from being started. \nThis endpoint validates that the `job_id` parameter is valid and for invalid parameters returns HTTP status code 400. \n### Example \n```\ncurl --netrc --request POST \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/jobs\/runs\/cancel-all \\\n--data '{ \"job_id\": <job-id> }'\n\n``` \nReplace: \n* `<databricks-instance>` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`.\n* `<job-id>` with the ID of the job, for example `123`. 
\nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file. \n### Request structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `job_id` | `INT64` | The canonical identifier of the job whose runs are to be canceled. This field is required. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n##### Runs get output\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/jobs\/runs\/get-output` | `GET` | \nRetrieve the output and metadata of a single task run. When a notebook task returns a value through the [dbutils.notebook.exit()](https:\/\/docs.databricks.com\/notebooks\/notebook-workflows.html#notebook-workflows-exit) call, you can use this endpoint to retrieve that value. Databricks restricts this API to return the first 5 MB of the output. For returning a larger result, you can store job results in a cloud storage service. \nThis endpoint validates that the `run_id` parameter is valid and for invalid parameters returns HTTP status code 400. \nRuns are automatically removed after 60 days. If you want to reference them beyond 60 days, you should save old run results before they expire. To export using the UI, see [Export job run results](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#export-job-runs). To export using the Jobs API, see [Runs export](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsserviceexportrun). 
\n### Example \n#### Request \n```\ncurl --netrc --request GET \\\n'https:\/\/<databricks-instance>\/api\/2.0\/jobs\/runs\/get-output?run_id=<run-id>' \\\n| jq .\n\n``` \nOr: \n```\ncurl --netrc --get \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/jobs\/runs\/get-output \\\n--data run_id=<run-id> \\\n| jq .\n\n``` \nReplace: \n* `<databricks-instance>` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`.\n* `<run-id>` with the ID of the run, for example `123`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file and [jq](https:\/\/stedolan.github.io\/jq\/). \n#### Response \n```\n{\n\"metadata\": {\n\"job_id\": 1,\n\"run_id\": 452,\n\"number_in_job\": 5,\n\"state\": {\n\"life_cycle_state\": \"TERMINATED\",\n\"result_state\": \"SUCCESS\",\n\"state_message\": \"\"\n},\n\"task\": {\n\"notebook_task\": {\n\"notebook_path\": \"\/Users\/someone@example.com\/my-notebook\"\n}\n},\n\"cluster_spec\": {\n\"existing_cluster_id\": \"1201-my-cluster\"\n},\n\"cluster_instance\": {\n\"cluster_id\": \"1201-my-cluster\",\n\"spark_context_id\": \"1102398-spark-context-id\"\n},\n\"overriding_parameters\": {\n\"jar_params\": [\"param1\", \"param2\"]\n},\n\"start_time\": 1457570074236,\n\"setup_duration\": 259754,\n\"execution_duration\": 3589020,\n\"cleanup_duration\": 31038,\n\"run_duration\": 3879812,\n\"trigger\": \"PERIODIC\"\n},\n\"notebook_output\": {\n\"result\": \"the maybe truncated string passed to dbutils.notebook.exit()\"\n}\n}\n\n``` \n### Request structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `run_id` | `INT64` | The canonical identifier for the run. For a job with multiple tasks, this is the `run_id` of a task run. See [Runs get output](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#get-runs-output). This field is required. 
| \n### Response structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `notebook_output` OR `error` | [NotebookOutput](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsnotebooktasknotebookoutput) OR `STRING` | If notebook\\_output, the output of a notebook task, if available. A notebook task that terminates (either successfully or with a failure) without calling `dbutils.notebook.exit()` is considered to have an empty output. This field will be set but its result value will be empty. If error, an error message indicating why output is not available. The message is unstructured, and its exact format is subject to change. |\n| `metadata` | [Run](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsrun) | All details of the run except for its output. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n##### Runs delete\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/jobs\/runs\/delete` | `POST` | \nDelete a non-active run. Returns an error if the run is active. \n### Example \n```\ncurl --netrc --request POST \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/jobs\/runs\/delete \\\n--data '{ \"run_id\": <run-id> }'\n\n``` \nReplace: \n* `<databricks-instance>` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`.\n* `<run-id>` with the ID of the run, for example `123`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file. \n### Request structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `run_id` | `INT64` | The canonical identifier of the run to delete. 
|\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Jobs API 2.0\n##### Data structures\n\nIn this section: \n* [AutoScale](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#autoscale)\n* [AwsAttributes](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#awsattributes)\n* [AwsAvailability](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#awsavailability)\n* [ClusterInstance](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clusterinstance)\n* [ClusterLogConf](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clusterlogconf)\n* [ClusterSpec](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clusterspec)\n* [ClusterTag](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clustertag)\n* [CronSchedule](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#cronschedule)\n* [DbfsStorageInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#dbfsstorageinfo)\n* [EbsVolumeType](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#ebsvolumetype)\n* [FileStorageInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#filestorageinfo)\n* [InitScriptInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#initscriptinfo)\n* [Job](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#job)\n* [JobEmailNotifications](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobemailnotifications)\n* [JobNotificationSettings](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobnotificationsettings)\n* [JobSettings](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsettings)\n* [JobTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobtask)\n* 
[JobsHealthRule](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobshealthrule)\n* [JobsHealthRules](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobshealthrules)\n* [Library](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#library)\n* [MavenLibrary](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#mavenlibrary)\n* [NewCluster](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#newcluster)\n* [NotebookOutput](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#notebookoutput)\n* [NotebookTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#notebooktask)\n* [ParamPair](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#parampair)\n* [PipelineTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#pipelinetask)\n* [PythonPyPiLibrary](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#pythonpypilibrary)\n* [RCranLibrary](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#rcranlibrary)\n* [Run](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#run)\n* [RunJobTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#runjobtask)\n* [RunLifeCycleState](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#runlifecyclestate)\n* [RunParameters](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#runparameters)\n* [RunResultState](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#runresultstate)\n* [RunState](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#runstate)\n* [S3StorageInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#s3storageinfo)\n* [SparkConfPair](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#sparkconfpair)\n* [SparkEnvPair](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#sparkenvpair)\n* 
[SparkJarTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#sparkjartask)\n* [SparkPythonTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#sparkpythontask)\n* [SparkSubmitTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#sparksubmittask)\n* [TriggerType](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#triggertype)\n* [ViewItem](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#viewitem)\n* [ViewType](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#viewtype)\n* [ViewsToExport](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#viewstoexport)\n* [Webhook](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#webhook)\n* [WebhookNotifications](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#webhooknotifications)\n* [WorkspaceStorageInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#workspacestorageinfo) \n### [AutoScale](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id50) \nRange defining the min and max number of cluster workers. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `min_workers` | `INT32` | The minimum number of workers to which the cluster can scale down when underutilized. It is also the initial number of workers the cluster will have after creation. |\n| `max_workers` | `INT32` | The maximum number of workers to which the cluster can scale up when overloaded. max\\_workers must be strictly greater than min\\_workers. | \n### [AwsAttributes](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id51) \nAttributes set during cluster creation related to Amazon Web Services. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `first_on_demand` | `INT32` | The first first\\_on\\_demand nodes of the cluster will be placed on on-demand instances. 
If this value is greater than 0, the cluster driver node will be placed on an on-demand instance. If this value is greater than or equal to the current cluster size, all nodes will be placed on on-demand instances. If this value is less than the current cluster size, first\\_on\\_demand nodes will be placed on on-demand instances and the remainder will be placed on `availability` instances. This value does not affect cluster size and cannot be mutated over the lifetime of a cluster. |\n| `availability` | [AwsAvailability](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clusterawsavailability) | Availability type used for all subsequent nodes past the first\\_on\\_demand ones. **Note:** If first\\_on\\_demand is zero, this availability type will be used for the entire cluster. |\n| `zone_id` | `STRING` | Identifier for the availability zone (AZ) in which the cluster resides. By default, the setting has a value of **auto**, otherwise known as Auto-AZ. With Auto-AZ, Databricks selects the AZ based on available IPs in the workspace subnets and retries in other availability zones if AWS returns insufficient capacity errors. If you want, you can also specify an availability zone to use. This benefits accounts that have reserved instances in a specific AZ. Specify the AZ as a string (for example, `\"us-west-2a\"`). The provided availability zone must be in the same region as the Databricks deployment. For example, \u201cus-west-2a\u201d is not a valid zone ID if the Databricks deployment resides in the \u201cus-east-1\u201d region. The list of available zones as well as the default value can be found by using the [GET \/api\/2.0\/clusters\/list-zones](https:\/\/docs.databricks.com\/api\/workspace\/clusters\/listzones) call. |\n| `instance_profile_arn` | `STRING` | Nodes for this cluster will only be placed on AWS instances with this instance profile. If omitted, nodes will be placed on instances without an instance profile. 
The instance profile must have previously been added to the Databricks environment by an account administrator. This feature may only be available to certain customer plans. |\n| `spot_bid_price_percent` | `INT32` | The max price for AWS spot instances, as a percentage of the corresponding instance type\u2019s on-demand price. For example, if this field is set to 50, and the cluster needs a new `i3.xlarge` spot instance, then the max price is half of the price of on-demand `i3.xlarge` instances. Similarly, if this field is set to 200, the max price is twice the price of on-demand `i3.xlarge` instances. If not specified, the default value is 100. When spot instances are requested for this cluster, only spot instances whose max price percentage matches this field will be considered. For safety, this field cannot exceed 10000. |\n| `ebs_volume_type` | [EbsVolumeType](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clusterebsvolumetype) | The type of EBS volumes that will be launched with this cluster. |\n| `ebs_volume_count` | `INT32` | The number of volumes launched for each instance. You can choose up to 10 volumes. This feature is only enabled for supported node types. Legacy node types cannot specify custom EBS volumes. For node types with no instance store, at least one EBS volume needs to be specified; otherwise, cluster creation will fail. These EBS volumes will be mounted at `\/ebs0`, `\/ebs1`, and so on. Instance store volumes will be mounted at `\/local_disk0`, `\/local_disk1`, and so on. If EBS volumes are attached, Databricks will configure Spark to use only the EBS volumes for scratch storage because heterogeneously sized scratch devices can lead to inefficient disk utilization. If no EBS volumes are attached, Databricks will configure Spark to use instance store volumes. If EBS volumes are specified, then the Spark configuration `spark.local.dir` will be overridden.
|\n| `ebs_volume_size` | `INT32` | The size of each EBS volume (in GiB) launched for each instance. For general purpose SSD, this value must be within the range 100 - 4096. For throughput optimized HDD, this value must be within the range 500 - 4096. Custom EBS volumes cannot be specified for the legacy node types (*memory-optimized* and *compute-optimized*). |\n| `ebs_volume_iops` | `INT32` | The number of IOPS per EBS gp3 volume. This value must be between 3000 and 16000. The value of IOPS and throughput is calculated based on AWS documentation to match the maximum performance of a gp2 volume with the same volume size. For more information, see the [EBS volume limit calculator](https:\/\/github.com\/awslabs\/aws-support-tools\/tree\/master\/EBS\/VolumeLimitCalculator). |\n| `ebs_volume_throughput` | `INT32` | The throughput per EBS gp3 volume, in MiB per second. This value must be between 125 and 1000. | \nIf neither `ebs_volume_iops` nor `ebs_volume_throughput` is specified, the values are inferred from the disk size: \n| Disk size | IOPS | Throughput |\n| --- | --- | --- |\n| Greater than 1000 | 3 times the disk size, up to 16000 | 250 |\n| Between 170 and 1000 | 3000 | 250 |\n| Below 170 | 3000 | 125 | \n### [AwsAvailability](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id52) \nThe set of AWS availability types supported when setting up nodes for a cluster. \n| Type | Description |\n| --- | --- |\n| `SPOT` | Use spot instances. |\n| `ON_DEMAND` | Use on-demand instances. |\n| `SPOT_WITH_FALLBACK` | Preferably use spot instances, but fall back to on-demand instances if spot instances cannot be acquired (for example, if AWS spot prices are too high). | \n### [ClusterInstance](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id53) \nIdentifiers for the cluster and Spark context used by a run. These two values together identify an execution context across all time. 
\n| Field Name | Type | Description |\n| --- | --- | --- |\n| `cluster_id` | `STRING` | The canonical identifier for the cluster used by a run. This field is always available for runs on existing clusters. For runs on new clusters, it becomes available once the cluster is created. This value can be used to view logs by browsing to `\/#setting\/sparkui\/$cluster_id\/driver-logs`. The logs will continue to be available after the run completes. The response won\u2019t include this field if the identifier is not available yet. |\n| `spark_context_id` | `STRING` | The canonical identifier for the Spark context used by a run. This field will be filled in once the run begins execution. This value can be used to view the Spark UI by browsing to `\/#setting\/sparkui\/$cluster_id\/$spark_context_id`. The Spark UI will continue to be available after the run has completed. The response won\u2019t include this field if the identifier is not available yet. | \n### [ClusterLogConf](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id54) \nPath to cluster log. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `dbfs` OR `s3` | [DbfsStorageInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clusterclusterlogconfdbfsstorageinfo) [S3StorageInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clusterclusterinitscriptinfos3storageinfo) | DBFS location of cluster log. Destination must be provided. For example, `{ \"dbfs\" : { \"destination\" : \"dbfs:\/home\/cluster_log\" } }` S3 location of cluster log. `destination` and either `region` or `warehouse` must be provided. 
For example, `{ \"s3\": { \"destination\" : \"s3:\/\/cluster_log_bucket\/prefix\", \"region\" : \"us-west-2\" } }` | \n### [ClusterSpec](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id55) \nImportant \n* When you run a job on a new jobs cluster, the job is treated as a Jobs Compute (automated) workload subject to Jobs Compute pricing.\n* When you run a job on an existing all-purpose cluster, it is treated as an All-Purpose Compute (interactive) workload subject to All-Purpose Compute pricing. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `existing_cluster_id` OR `new_cluster` | `STRING` OR [NewCluster](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsclusterspecnewcluster) | If existing\\_cluster\\_id, the ID of an existing cluster that will be used for all runs of this job. When running jobs on an existing cluster, you may need to manually restart the cluster if it stops responding. We suggest running jobs on new clusters for greater reliability. If new\\_cluster, a description of a cluster that will be created for each run. If specifying a [PipelineTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobspipelinetask), then this field can be empty. |\n| `libraries` | An array of [Library](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#managedlibrarieslibrary) | An optional list of libraries to be installed on the cluster that will execute the job. The default value is an empty list. | \n### [ClusterTag](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id56) \nCluster tag definition. \n| Type | Description |\n| --- | --- |\n| `STRING` | The key of the tag. The key length must be between 1 and 127 UTF-8 characters, inclusive. For a list of all restrictions, see the AWS tag restrictions documentation. |\n| `STRING` | The value of the tag. The value length must be less than or equal to 255 UTF-8 characters.
For a list of all restrictions, see the AWS tag restrictions documentation. | \n### [CronSchedule](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id57) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `quartz_cron_expression` | `STRING` | A Cron expression using Quartz syntax that describes the schedule for a job. See [Cron Trigger](http:\/\/www.quartz-scheduler.org\/documentation\/quartz-2.3.0\/tutorials\/crontrigger.html) for details. This field is required. |\n| `timezone_id` | `STRING` | A Java timezone ID. The schedule for a job will be resolved with respect to this timezone. See [Java TimeZone](https:\/\/docs.oracle.com\/javase\/7\/docs\/api\/java\/util\/TimeZone.html) for details. This field is required. |\n| `pause_status` | `STRING` | Indicates whether this schedule is paused. Either \u201cPAUSED\u201d or \u201cUNPAUSED\u201d. | \n### [DbfsStorageInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id58) \nDBFS storage information. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `destination` | `STRING` | DBFS destination. Example: `dbfs:\/my\/path` | \n### [EbsVolumeType](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id59) \nDatabricks supports gp2 and gp3 EBS volume types. Follow the instructions at [Manage SSD storage](https:\/\/docs.databricks.com\/admin\/clusters\/manage-ssd.html) to select gp2 or gp3 for your workspace. \n| Type | Description |\n| --- | --- |\n| `GENERAL_PURPOSE_SSD` | Provision extra storage using AWS EBS volumes. |\n| `THROUGHPUT_OPTIMIZED_HDD` | Provision extra storage using AWS st1 volumes. | \n### [FileStorageInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id60) \nFile storage information. \nNote \nThis location type is only available for clusters set up using [Databricks Container Services](https:\/\/docs.databricks.com\/compute\/custom-containers.html).
\n| Field Name | Type | Description |\n| --- | --- | --- |\n| `destination` | `STRING` | File destination. Example: `file:\/my\/file.sh` | \n### [InitScriptInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id61) \nPath to an init script. \nFor instructions on using init scripts with [Databricks Container Services](https:\/\/docs.databricks.com\/compute\/custom-containers.html), see [Use an init script](https:\/\/docs.databricks.com\/compute\/custom-containers.html#containers-init-script). \nNote \nThe file storage type (field name: `file`) is only available for clusters set up using [Databricks Container Services](https:\/\/docs.databricks.com\/compute\/custom-containers.html). See [FileStorageInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clusterclusterinitscriptinfofilestorageinfo). \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `workspace` OR `dbfs` (deprecated) OR `s3` | [WorkspaceStorageInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clusterclusterinitscriptinfoworkspacestorageinfo) [DbfsStorageInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clusterclusterlogconfdbfsstorageinfo) (deprecated) [S3StorageInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clusterclusterinitscriptinfos3storageinfo) | Workspace location of init script. Destination must be provided. For example, `{ \"workspace\" : { \"destination\" : \"\/Users\/someone@domain.com\/init_script.sh\" } }` (Deprecated) DBFS location of init script. Destination must be provided. For example, `{ \"dbfs\" : { \"destination\" : \"dbfs:\/home\/init_script\" } }` S3 location of init script. Destination and either region or warehouse must be provided.
For example, `{ \"s3\": { \"destination\" : \"s3:\/\/init_script_bucket\/prefix\", \"region\" : \"us-west-2\" } }` | \n### [Job](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id62) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `job_id` | `INT64` | The canonical identifier for this job. |\n| `creator_user_name` | `STRING` | The creator user name. This field won\u2019t be included in the response if the user has already been deleted. |\n| `run_as` | `STRING` | The user name that the job will run as. `run_as` is based on the current job settings, and is set to the creator of the job if job access control is disabled, or the `is_owner` permission if job access control is enabled. |\n| `settings` | [JobSettings](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettings) | Settings for this job and all of its runs. These settings can be updated using the `resetJob` method. |\n| `created_time` | `INT64` | The time at which this job was created in epoch milliseconds (milliseconds since 1\/1\/1970 UTC). | \n### [JobEmailNotifications](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id63) \nImportant \nThe on\\_start, on\\_success, and on\\_failure fields accept only Latin characters (ASCII character set). Using non-ASCII characters will return an error. Examples of invalid, non-ASCII characters are Chinese characters, Japanese kanji, and emoji. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `on_start` | An array of `STRING` | A list of email addresses to be notified when a run begins. If not specified on job creation, reset, or update, the list is empty, and notifications are not sent. |\n| `on_success` | An array of `STRING` | A list of email addresses to be notified when a run successfully completes. A run is considered to have completed successfully if it ends with a `TERMINATED` `life_cycle_state` and a `SUCCESSFUL` `result_state`.
If not specified on job creation, reset, or update, the list is empty, and notifications are not sent. |\n| `on_failure` | An array of `STRING` | A list of email addresses to be notified when a run completes unsuccessfully. A run is considered to have completed unsuccessfully if it ends with an `INTERNAL_ERROR` `life_cycle_state` or a `SKIPPED`, `FAILED`, or `TIMED_OUT` `result_state`. If not specified on job creation, reset, or update, the list is empty, and notifications are not sent. |\n| `on_duration_warning_threshold_exceeded` | An array of `STRING` | A list of email addresses to be notified when the duration of a run exceeds the threshold specified for the `RUN_DURATION_SECONDS` metric in the `health` field. If no rule for the `RUN_DURATION_SECONDS` metric is specified in the `health` field for the job, notifications are not sent. |\n| `no_alert_for_skipped_runs` | `BOOL` | If true, do not send email to recipients specified in `on_failure` if the run is skipped. | \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `on_start` | An array of [Webhook](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobwebhook) | An optional list of system destinations to be notified when a run begins. If not specified on job creation, reset, or update, the list is empty, and notifications are not sent. A maximum of 3 destinations can be specified for the `on_start` property. |\n| `on_success` | An array of [Webhook](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobwebhook) | An optional list of system destinations to be notified when a run completes successfully. A run is considered to have completed successfully if it ends with a `TERMINATED` `life_cycle_state` and a `SUCCESSFUL` `result_state`. If not specified on job creation, reset, or update, the list is empty, and notifications are not sent. A maximum of 3 destinations can be specified for the `on_success` property.
|\n| `on_failure` | An array of [Webhook](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobwebhook) | An optional list of system destinations to be notified when a run completes unsuccessfully. A run is considered to have completed unsuccessfully if it ends with an `INTERNAL_ERROR` `life_cycle_state` or a `SKIPPED`, `FAILED`, or `TIMED_OUT` `result_state`. If not specified on job creation, reset, or update, the list is empty, and notifications are not sent. A maximum of 3 destinations can be specified for the `on_failure` property. |\n| `on_duration_warning_threshold_exceeded` | An array of [Webhook](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobwebhook) | An optional list of system destinations to be notified when the duration of a run exceeds the threshold specified for the `RUN_DURATION_SECONDS` metric in the `health` field. A maximum of 3 destinations can be specified for the `on_duration_warning_threshold_exceeded` property. | \n### [JobNotificationSettings](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id64) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `no_alert_for_skipped_runs` | `BOOL` | If true, do not send notifications to recipients specified in `on_failure` if the run is skipped. |\n| `no_alert_for_canceled_runs` | `BOOL` | If true, do not send notifications to recipients specified in `on_failure` if the run is canceled. |\n| `alert_on_last_attempt` | `BOOL` | If true, do not send notifications to recipients specified in `on_start` for the retried runs and do not send notifications to recipients specified in `on_failure` until the last retry of the run.
| \n### [JobSettings](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id65) \nImportant \n* When you run a job on a new jobs cluster, the job is treated as a Jobs Compute (automated) workload subject to Jobs Compute pricing.\n* When you run a job on an existing all-purpose cluster, it is treated as an All-Purpose Compute (interactive) workload subject to All-Purpose Compute pricing. \nSettings for a job. These settings can be updated using the `resetJob` method. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `existing_cluster_id` OR `new_cluster` | `STRING` OR [NewCluster](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsclusterspecnewcluster) | If existing\\_cluster\\_id, the ID of an existing cluster that will be used for all runs of this job. When running jobs on an existing cluster, you may need to manually restart the cluster if it stops responding. We suggest running jobs on new clusters for greater reliability. If new\\_cluster, a description of a cluster that will be created for each run. If specifying a [PipelineTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobspipelinetask), then this field can be empty. 
|\n| `notebook_task` OR `spark_jar_task` OR `spark_python_task` OR `spark_submit_task` OR `pipeline_task` OR `run_job_task` | [NotebookTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsnotebooktask) OR [SparkJarTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobssparkjartask) OR [SparkPythonTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobssparkpythontask) OR [SparkSubmitTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobssparksubmittask) OR [PipelineTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobspipelinetask) OR [RunJobTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsrunjobtask) | If notebook\\_task, indicates that this job should run a notebook. This field may not be specified in conjunction with spark\\_jar\\_task. If spark\\_jar\\_task, indicates that this job should run a JAR. If spark\\_python\\_task, indicates that this job should run a Python file. If spark\\_submit\\_task, indicates that this job should be launched by the spark submit script. If pipeline\\_task, indicates that this job should run a Delta Live Tables pipeline. If run\\_job\\_task, indicates that this job should run another job. |\n| `name` | `STRING` | An optional name for the job. The default value is `Untitled`. |\n| `libraries` | An array of [Library](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#managedlibrarieslibrary) | An optional list of libraries to be installed on the cluster that will execute the job. The default value is an empty list. |\n| `email_notifications` | [JobEmailNotifications](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobemailnotifications) | An optional set of email addresses that will be notified when runs of this job begin or complete as well as when this job is deleted. The default behavior is to not send any emails. 
|\n| `webhook_notifications` | [WebhookNotifications](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobsystemnotifications) | An optional set of system destinations to notify when runs of this job begin, complete, or fail. |\n| `notification_settings` | [JobNotificationSettings](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobnotificationsettings) | Optional notification settings that are used when sending notifications to each of the `email_notifications` and `webhook_notifications` for this job. |\n| `timeout_seconds` | `INT32` | An optional timeout applied to each run of this job. The default behavior is to have no timeout. |\n| `max_retries` | `INT32` | An optional maximum number of times to retry an unsuccessful run. A run is considered to be unsuccessful if it completes with the `FAILED` result\\_state or `INTERNAL_ERROR` `life_cycle_state`. The value -1 means to retry indefinitely and the value 0 means to never retry. The default behavior is to never retry. |\n| `min_retry_interval_millis` | `INT32` | An optional minimal interval in milliseconds between attempts. The default behavior is that unsuccessful runs are immediately retried. |\n| `retry_on_timeout` | `BOOL` | An optional policy to specify whether to retry a job when it times out. The default behavior is to not retry on timeout. |\n| `schedule` | [CronSchedule](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobscronschedule) | An optional periodic schedule for this job. The default behavior is that the job will only run when triggered by clicking \u201cRun Now\u201d in the Jobs UI or sending an API request to `runNow`. |\n| `max_concurrent_runs` | `INT32` | An optional maximum allowed number of concurrent runs of the job. Set this value if you want to be able to execute multiple runs of the same job concurrently. 
This is useful for example if you trigger your job on a frequent schedule and want to allow consecutive runs to overlap with each other, or if you want to trigger multiple runs which differ by their input parameters. This setting affects only new runs. For example, suppose the job\u2019s concurrency is 4 and there are 4 concurrent active runs. Then setting the concurrency to 3 won\u2019t kill any of the active runs. However, from then on, new runs will be skipped unless there are fewer than 3 active runs. This value cannot exceed 1000. Setting this value to 0 causes all new runs to be skipped. The default behavior is to allow only 1 concurrent run. |\n| `health` | [JobsHealthRules](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobshealthrules) | An optional set of health rules defined for the job. | \n### [JobTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id66) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `notebook_task` OR `spark_jar_task` OR `spark_python_task` OR `spark_submit_task` OR `pipeline_task` OR `run_job_task` | [NotebookTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsnotebooktask) OR [SparkJarTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobssparkjartask) OR [SparkPythonTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobssparkpythontask) OR [SparkSubmitTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobssparksubmittask) OR [PipelineTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobspipelinetask) OR [RunJobTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsrunjobtask) | If notebook\\_task, indicates that this job should run a notebook. This field may not be specified in conjunction with spark\\_jar\\_task. If spark\\_jar\\_task, indicates that this job should run a JAR. 
If spark\\_python\\_task, indicates that this job should run a Python file. If spark\\_submit\\_task, indicates that this job should be launched by the spark submit script. If pipeline\\_task, indicates that this job should run a Delta Live Tables pipeline. If run\\_job\\_task, indicates that this job should run another job. | \n### [JobsHealthRule](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id67) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `metric` | `STRING` | Specifies the health metric that is being evaluated for a particular health rule. Valid values are `RUN_DURATION_SECONDS`. |\n| `operator` | `STRING` | Specifies the operator used to compare the health metric value with the specified threshold. Valid values are `GREATER_THAN`. |\n| `value` | `INT32` | Specifies the threshold value that the health metric should meet to comply with the health rule. | \n### [JobsHealthRules](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id68) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `rules` | An array of [JobsHealthRule](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobshealthrule) | An optional set of health rules that can be defined for a job. | \n### [Library](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id69) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `jar` OR `egg` OR `whl` OR `pypi` OR `maven` OR `cran` | `STRING` OR `STRING` OR `STRING` OR [PythonPyPiLibrary](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#managedlibrariespythonpypilibrary) OR [MavenLibrary](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#managedlibrariesmavenlibrary) OR [RCranLibrary](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#managedlibrariesrcranlibrary) | If jar, URI of the JAR to be installed. DBFS and S3 URIs are supported. 
For example: `{ \"jar\": \"dbfs:\/mnt\/databricks\/library.jar\" }` or `{ \"jar\": \"s3:\/\/my-bucket\/library.jar\" }`. If S3 is used, make sure the cluster has read access on the library. You may need to launch the cluster with an instance profile to access the S3 URI. If egg, URI of the egg to be installed. DBFS and S3 URIs are supported. For example: `{ \"egg\": \"dbfs:\/my\/egg\" }` or `{ \"egg\": \"s3:\/\/my-bucket\/egg\" }`. If S3 is used, make sure the cluster has read access on the library. You may need to launch the cluster with an instance profile to access the S3 URI. If whl, URI of the `wheel` or zipped `wheels` to be installed. DBFS and S3 URIs are supported. For example: `{ \"whl\": \"dbfs:\/my\/whl\" }` or `{ \"whl\": \"s3:\/\/my-bucket\/whl\" }`. If S3 is used, make sure the cluster has read access on the library. You may need to launch the cluster with an instance profile to access the S3 URI. Also the `wheel` file name needs to use the [correct convention](https:\/\/www.python.org\/dev\/peps\/pep-0427\/#file-format). If zipped `wheels` are to be installed, the file name suffix should be `.wheelhouse.zip`. If pypi, specification of a PyPI library to be installed. Specifying the `repo` field is optional and if not specified, the default pip index is used. For example: `{ \"package\": \"simplejson\", \"repo\": \"https:\/\/my-repo.com\" }` If maven, specification of a Maven library to be installed. For example: `{ \"coordinates\": \"org.jsoup:jsoup:1.7.2\" }` If cran, specification of a CRAN library to be installed. | \n### [MavenLibrary](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id70) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `coordinates` | `STRING` | Gradle-style Maven coordinates. For example: `org.jsoup:jsoup:1.7.2`. This field is required. |\n| `repo` | `STRING` | Maven repo to install the Maven package from. If omitted, both Maven Central Repository and Spark Packages are searched. 
|\n| `exclusions` | An array of `STRING` | List of dependencies to exclude. For example: `[\"slf4j:slf4j\", \"*:hadoop-client\"]`. Maven dependency exclusions: . | \n### [NewCluster](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id71) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `num_workers` OR `autoscale` | `INT32` OR [AutoScale](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clusterautoscale) | If num\\_workers, number of worker nodes that this cluster should have. A cluster has one Spark driver and num\\_workers executors for a total of num\\_workers + 1 Spark nodes. When reading the properties of a cluster, this field reflects the desired number of workers rather than the actual current number of workers. For example, if a cluster is resized from 5 to 10 workers, this field will immediately be updated to reflect the target size of 10 workers, whereas the workers listed in `spark_info` will gradually increase from 5 to 10 as the new nodes are provisioned. If autoscale, the required parameters to automatically scale clusters up and down based on load. |\n| `spark_version` | `STRING` | The Spark version of the cluster. A list of available Spark versions can be retrieved by using the [GET 2.0\/clusters\/spark-versions](https:\/\/docs.databricks.com\/api\/workspace\/clusters\/sparkversions) call. This field is required. |\n| `spark_conf` | [SparkConfPair](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clustersparkconfpair) | An object containing a set of optional, user-specified Spark configuration key-value pairs. You can also pass in a string of extra JVM options to the driver and the executors via `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` respectively. 
Example Spark confs: `{\"spark.speculation\": true, \"spark.streaming.ui.retainedBatches\": 5}` or `{\"spark.driver.extraJavaOptions\": \"-verbose:gc -XX:+PrintGCDetails\"}` |\n| `aws_attributes` | [AwsAttributes](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clusterawsattributes) | Attributes related to clusters running on Amazon Web Services. If not specified at cluster creation, a set of default values will be used. |\n| `node_type_id` | `STRING` | This field encodes, through a single value, the resources available to each of the Spark nodes in this cluster. For example, the Spark nodes can be provisioned and optimized for memory- or compute-intensive workloads. A list of available node types can be retrieved by using the [GET 2.0\/clusters\/list-node-types](https:\/\/docs.databricks.com\/api\/workspace\/clusters\/listnodetypes) call. This field, the `instance_pool_id` field, or a cluster policy that specifies a node type ID or instance pool ID, is required. |\n| `driver_node_type_id` | `STRING` | The node type of the Spark driver. This field is optional; if unset, the driver node type will be set to the same value as `node_type_id` defined above. |\n| `ssh_public_keys` | An array of `STRING` | SSH public key contents that will be added to each Spark node in this cluster. The corresponding private keys can be used to log in with the user name `ubuntu` on port `2200`. Up to 10 keys can be specified. |\n| `custom_tags` | [ClusterTag](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clusterclustertag) | An object containing a set of tags for cluster resources. Databricks tags all cluster resources (such as AWS instances and EBS volumes) with these tags in addition to default\\_tags. 
**Note**: Tags are not supported on legacy node types such as compute-optimized and memory-optimized. Databricks allows at most 45 custom tags. |\n| `cluster_log_conf` | [ClusterLogConf](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clusterclusterlogconf) | The configuration for delivering Spark logs to a long-term storage destination. Only one destination can be specified for one cluster. If the conf is given, the logs will be delivered to the destination every `5 mins`. The destination of driver logs is `\/\/driver`, while the destination of executor logs is `\/\/executor`. |\n| `init_scripts` | An array of [InitScriptInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clusterclusterinitscriptinfo) | The configuration for storing init scripts. Any number of scripts can be specified. The scripts are executed sequentially in the order provided. If `cluster_log_conf` is specified, init script logs are sent to `\/\/init_scripts`. |\n| `spark_env_vars` | [SparkEnvPair](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#clustersparkenvpair) | An object containing a set of optional, user-specified environment variable key-value pairs. Key-value pairs of the form (X,Y) are exported as is (i.e., `export X='Y'`) while launching the driver and workers. To specify an additional set of `SPARK_DAEMON_JAVA_OPTS`, we recommend appending them to `$SPARK_DAEMON_JAVA_OPTS` as shown in the following example. This ensures that all default Databricks-managed environment variables are included as well. Example Spark environment variables: `{\"SPARK_WORKER_MEMORY\": \"28000m\", \"SPARK_LOCAL_DIRS\": \"\/local_disk0\"}` or `{\"SPARK_DAEMON_JAVA_OPTS\": \"$SPARK_DAEMON_JAVA_OPTS -Dspark.shuffle.service.enabled=true\"}` |\n| `enable_elastic_disk` | `BOOL` | Autoscaling Local Storage: when enabled, this cluster dynamically acquires additional disk space when its Spark workers are running low on disk space. 
This feature requires specific AWS permissions to function correctly - refer to [Enable autoscaling local storage](https:\/\/docs.databricks.com\/compute\/configure.html#autoscaling-local-storage) for details. |\n| `driver_instance_pool_id` | `STRING` | The optional ID of the instance pool to use for the driver node. You must also specify `instance_pool_id`. Refer to the [Instance Pools API](https:\/\/docs.databricks.com\/api\/workspace\/instancepools) for details. |\n| `instance_pool_id` | `STRING` | The optional ID of the instance pool to use for cluster nodes. If `driver_instance_pool_id` is present, `instance_pool_id` is used for worker nodes only. Otherwise, it is used for both the driver node and worker nodes. Refer to the [Instance Pools API](https:\/\/docs.databricks.com\/api\/workspace\/instancepools) for details. | \n### [NotebookOutput](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id72) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `result` | `STRING` | The value passed to [dbutils.notebook.exit()](https:\/\/docs.databricks.com\/notebooks\/notebook-workflows.html#notebook-workflows-exit). Databricks restricts this API to return the first 1 MB of the value. For a larger result, your job can store the results in a cloud storage service. This field will be absent if `dbutils.notebook.exit()` was never called. |\n| `truncated` | `BOOLEAN` | Whether or not the result was truncated. | \n### [NotebookTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id73) \nThe output of each cell is subject to a size limit of 8MB. If the output of a cell has a larger size, the rest of the run will be cancelled and the run will be marked as failed. In that case, some of the content output from other cells may also be missing. 
\nIf you need help finding the cell that exceeds the limit, run the notebook against an all-purpose cluster and use this [notebook autosave technique](https:\/\/kb.databricks.com\/notebooks\/notebook-autosave.html). \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `notebook_path` | `STRING` | The absolute path of the notebook to be run in the Databricks workspace. This path must begin with a slash. This field is required. |\n| `revision_timestamp` | `LONG` | The timestamp of the revision of the notebook. |\n| `base_parameters` | A map of [ParamPair](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsparampair) | Base parameters to be used for each run of this job. If the run is initiated by a call to `run-now` with parameters specified, the two parameter maps will be merged. If the same key is specified in `base_parameters` and in `run-now`, the value from `run-now` will be used. Use [Pass context about job runs into job tasks](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html) to set parameters containing information about job runs. If the notebook takes a parameter that is not specified in the job\u2019s `base_parameters` or the `run-now` override parameters, the default value from the notebook will be used. Retrieve these parameters in a notebook using [dbutils.widgets.get](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#dbutils-widgets). | \n### [ParamPair](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id74) \nName-based parameters for jobs running notebook tasks. \nImportant \nThe fields in this data structure accept only Latin characters (ASCII character set). Using non-ASCII characters will return an error. Examples of invalid, non-ASCII characters are Chinese, Japanese kanji, and emojis. \n| Type | Description |\n| --- | --- |\n| `STRING` | Parameter name. 
Pass to [dbutils.widgets.get](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#dbutils-widgets) to retrieve the value. |\n| `STRING` | Parameter value. | \n### [PipelineTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id75) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `pipeline_id` | `STRING` | The full name of the Delta Live Tables pipeline task to execute. | \n### [PythonPyPiLibrary](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id76) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `package` | `STRING` | The name of the PyPI package to install. An optional exact version specification is also supported. Examples: `simplejson` and `simplejson==3.8.0`. This field is required. |\n| `repo` | `STRING` | The repository where the package can be found. If not specified, the default pip index is used. | \n### [RCranLibrary](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id77) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `package` | `STRING` | The name of the CRAN package to install. This field is required. |\n| `repo` | `STRING` | The repository where the package can be found. If not specified, the default CRAN repo is used. | \n### [Run](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id78) \nAll the information about a run except for its output. The output can be retrieved separately\nwith the `getRunOutput` method. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `job_id` | `INT64` | The canonical identifier of the job that contains this run. |\n| `run_id` | `INT64` | The canonical identifier of the run. This ID is unique across all runs of all jobs. |\n| `creator_user_name` | `STRING` | The creator user name. This field won\u2019t be included in the response if the user has already been deleted. |\n| `number_in_job` | `INT64` | The sequence number of this run among all runs of the job. This value starts at 1. 
|\n| `original_attempt_run_id` | `INT64` | If this run is a retry of a prior run attempt, this field contains the run\\_id of the original attempt; otherwise, it is the same as the run\\_id. |\n| `state` | [RunState](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsrunstate) | The result and lifecycle states of the run. |\n| `schedule` | [CronSchedule](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobscronschedule) | The cron schedule that triggered this run if it was triggered by the periodic scheduler. |\n| `task` | [JobTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobtask) | The task performed by the run, if any. |\n| `cluster_spec` | [ClusterSpec](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsclusterspec) | A snapshot of the job\u2019s cluster specification when this run was created. |\n| `cluster_instance` | [ClusterInstance](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsclusterinstance) | The cluster used for this run. If the run is specified to use a new cluster, this field will be set once the Jobs service has requested a cluster for the run. |\n| `overriding_parameters` | [RunParameters](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsrunparameters) | The parameters used for this run. |\n| `start_time` | `INT64` | The time at which this run was started in epoch milliseconds (milliseconds since 1\/1\/1970 UTC). This may not be the time when the job task starts executing, for example, if the job is scheduled to run on a new cluster, this is the time the cluster creation call is issued. |\n| `setup_duration` | `INT64` | The time it took to set up the cluster in milliseconds. For runs that run on new clusters this is the cluster creation time, for runs that run on existing clusters this time should be very short. 
|\n| `execution_duration` | `INT64` | The time in milliseconds it took to execute the commands in the JAR or notebook until they completed, failed, timed out, were cancelled, or encountered an unexpected error. |\n| `cleanup_duration` | `INT64` | The time in milliseconds it took to terminate the cluster and clean up any associated artifacts. The total duration of the run is the sum of the setup\\_duration, the execution\\_duration, and the cleanup\\_duration. |\n| `end_time` | `INT64` | The time at which this run ended in epoch milliseconds (milliseconds since 1\/1\/1970 UTC). This field will be set to 0 if the job is still running. |\n| `trigger` | [TriggerType](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobstriggertype) | The type of trigger that fired this run. |\n| `run_name` | `STRING` | An optional name for the run. The default value is `Untitled`. The maximum allowed length is 4096 bytes in UTF-8 encoding. |\n| `run_page_url` | `STRING` | The URL to the detail page of the run. |\n| `run_type` | `STRING` | The type of the run.* `JOB_RUN` - Normal job run. A run created with [Run now](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsservicerunnow). * `WORKFLOW_RUN` - Workflow run. A run created with [dbutils.notebook.run](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#dbutils-workflow). * `SUBMIT_RUN` - Submit run. A run created with [Run now](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsservicerunnow). |\n| `attempt_number` | `INT32` | The sequence number of this run attempt for a triggered job run. The initial attempt of a run has an attempt\\_number of 0. If the initial run attempt fails, and the job has a retry policy (`max_retries` > 0), subsequent runs are created with an `original_attempt_run_id` of the original attempt\u2019s ID and an incrementing `attempt_number`. 
Runs are retried only until they succeed, and the maximum `attempt_number` is the same as the `max_retries` value for the job. | \n### [RunJobTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id79) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `job_id` | `INT32` | Unique identifier of the job to run. This field is required. | \n### [RunLifeCycleState](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id80) \nThe life cycle state of a run. Allowed state transitions are: \n* `PENDING` -> `RUNNING` -> `TERMINATING` -> `TERMINATED`\n* `PENDING` -> `SKIPPED`\n* `PENDING` -> `INTERNAL_ERROR`\n* `RUNNING` -> `INTERNAL_ERROR`\n* `TERMINATING` -> `INTERNAL_ERROR` \n| State | Description |\n| --- | --- |\n| `PENDING` | The run has been triggered. If there is not already an active run of the same job, the cluster and execution context are being prepared. If there is already an active run of the same job, the run will immediately transition into the `SKIPPED` state without preparing any resources. |\n| `RUNNING` | The task of this run is being executed. |\n| `TERMINATING` | The task of this run has completed, and the cluster and execution context are being cleaned up. |\n| `TERMINATED` | The task of this run has completed, and the cluster and execution context have been cleaned up. This state is terminal. |\n| `SKIPPED` | This run was aborted because a previous run of the same job was already active. This state is terminal. |\n| `INTERNAL_ERROR` | An exceptional state that indicates a failure in the Jobs service, such as network failure over a long period. If a run on a new cluster ends in the `INTERNAL_ERROR` state, the Jobs service terminates the cluster as soon as possible. This state is terminal. | \n### [RunParameters](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id81) \nParameters for this run. 
Only one of jar\\_params, `python_params`, or notebook\\_params\nshould be specified in the `run-now` request, depending on the type of job task.\nJobs with Spark JAR task or Python task take a list of position-based parameters, and jobs\nwith notebook tasks take a key value map. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `jar_params` | An array of `STRING` | A list of parameters for jobs with Spark JAR tasks, e.g. `\"jar_params\": [\"john doe\", \"35\"]`. The parameters will be used to invoke the main function of the main class specified in the Spark JAR task. If not specified upon `run-now`, it will default to an empty list. jar\\_params cannot be specified in conjunction with notebook\\_params. The JSON representation of this field (i.e. `{\"jar_params\":[\"john doe\",\"35\"]}`) cannot exceed 10,000 bytes. Use [Pass context about job runs into job tasks](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html) to set parameters containing information about job runs. |\n| `notebook_params` | A map of [ParamPair](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsparampair) | A map from keys to values for jobs with notebook task, e.g. `\"notebook_params\": {\"name\": \"john doe\", \"age\": \"35\"}`. The map is passed to the notebook and is accessible through the [dbutils.widgets.get](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#dbutils-widgets) function. If not specified upon `run-now`, the triggered run uses the job\u2019s base parameters. notebook\\_params cannot be specified in conjunction with jar\\_params. Use [Pass context about job runs into job tasks](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html) to set parameters containing information about job runs. The JSON representation of this field (i.e. `{\"notebook_params\":{\"name\":\"john doe\",\"age\":\"35\"}}`) cannot exceed 10,000 bytes. 
|\n| `python_params` | An array of `STRING` | A list of parameters for jobs with Python tasks, e.g. `\"python_params\": [\"john doe\", \"35\"]`. The parameters are passed to the Python file as command-line parameters. If specified upon `run-now`, it overwrites the parameters specified in the job setting. The JSON representation of this field (i.e. `{\"python_params\":[\"john doe\",\"35\"]}`) cannot exceed 10,000 bytes. Use [Pass context about job runs into job tasks](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html) to set parameters containing information about job runs. Important: These parameters accept only Latin characters (ASCII character set). Using non-ASCII characters will return an error. Examples of invalid, non-ASCII characters are Chinese, Japanese kanji, and emojis. |\n| `spark_submit_params` | An array of `STRING` | A list of parameters for jobs with spark submit task, e.g. `\"spark_submit_params\": [\"--class\", \"org.apache.spark.examples.SparkPi\"]`. The parameters are passed to the spark-submit script as command-line parameters. If specified upon `run-now`, it overwrites the parameters specified in the job setting. The JSON representation of this field (i.e. `{\"spark_submit_params\":[\"--class\",\"org.apache.spark.examples.SparkPi\"]}`) cannot exceed 10,000 bytes. Use [Pass context about job runs into job tasks](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html) to set parameters containing information about job runs. Important: These parameters accept only Latin characters (ASCII character set). Using non-ASCII characters will return an error. Examples of invalid, non-ASCII characters are Chinese, Japanese kanji, and emojis. | \n### [RunResultState](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id82) \nThe result state of the run. 
\n* If `life_cycle_state` = `TERMINATED`: if the run had a task, the result is guaranteed to be\navailable, and it indicates the result of the task.\n* If `life_cycle_state` = `PENDING`, `RUNNING`, or `SKIPPED`, the result state is not available.\n* If `life_cycle_state` = `TERMINATING` or `life_cycle_state` = `INTERNAL_ERROR`: the result state\nis available if the run had a task and managed to start it. \nOnce available, the result state never changes. \n| State | Description |\n| --- | --- |\n| `SUCCESS` | The task completed successfully. |\n| `FAILED` | The task completed with an error. |\n| `TIMEDOUT` | The run was stopped after reaching the timeout. |\n| `CANCELED` | The run was canceled at user request. | \n### [RunState](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id83) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `life_cycle_state` | [RunLifeCycleState](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsrunlifecyclestate) | A description of a run\u2019s current location in the run lifecycle. This field is always available in the response. |\n| `result_state` | [RunResultState](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsrunresultstate) | The result state of a run. If it is not available, the response won\u2019t include this field. See [RunResultState](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#runresultstate) for details about the availability of result\\_state. |\n| `user_cancelled_or_timedout` | `BOOLEAN` | Whether a run was canceled manually by a user or by the scheduler because the run timed out. |\n| `state_message` | `STRING` | A descriptive message for the current state. This field is unstructured, and its exact format is subject to change. | \n### [S3StorageInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id84) \nS3 storage information. 
\n| Field Name | Type | Description |\n| --- | --- | --- |\n| `destination` | `STRING` | S3 destination. For example: `s3:\/\/my-bucket\/some-prefix`. You must configure the cluster with an instance profile and the instance profile must have write access to the destination. You *cannot* use AWS keys. |\n| `region` | `STRING` | S3 region. For example: `us-west-2`. Either region or warehouse must be set. If both are set, warehouse is used. |\n| `warehouse` | `STRING` | S3 warehouse. For example: `https:\/\/s3-us-west-2.amazonaws.com`. Either region or warehouse must be set. If both are set, warehouse is used. |\n| `enable_encryption` | `BOOL` | (Optional) Enable server-side encryption. `false` by default. |\n| `encryption_type` | `STRING` | (Optional) The encryption type, either `sse-s3` or `sse-kms`. It is used only when encryption is enabled, and the default type is `sse-s3`. |\n| `kms_key` | `STRING` | (Optional) KMS key used if encryption is enabled and encryption type is set to `sse-kms`. |\n| `canned_acl` | `STRING` | (Optional) Set canned access control list. For example: `bucket-owner-full-control`. If canned\\_acl is set, the cluster instance profile must have `s3:PutObjectAcl` permission on the destination bucket and prefix. The full list of possible canned ACLs can be found at . By default only the object owner gets full control. If you are using a cross-account role for writing data, you may want to set `bucket-owner-full-control` so that the bucket owner can read the logs. | \n### [SparkConfPair](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id85) \nSpark configuration key-value pairs. \n| Type | Description |\n| --- | --- |\n| `STRING` | A configuration property name. |\n| `STRING` | The configuration property value. | \n### [SparkEnvPair](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id86) \nSpark environment variable key-value pairs. 
\nImportant \nWhen specifying environment variables in a job cluster, the fields in this data structure accept only Latin characters (ASCII character set). Using non-ASCII characters will return an error. Examples of invalid, non-ASCII characters are Chinese, Japanese kanjis, and emojis. \n| Type | Description |\n| --- | --- |\n| `STRING` | An environment variable name. |\n| `STRING` | The environment variable value. | \n### [SparkJarTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id87) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `jar_uri` | `STRING` | Deprecated since 04\/2016. Provide a `jar` through the `libraries` field instead. For an example, see [Create](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#create). |\n| `main_class_name` | `STRING` | The full name of the class containing the main method to be executed. This class must be contained in a JAR provided as a library. The code should use `SparkContext.getOrCreate` to obtain a Spark context; otherwise, runs of the job will fail. |\n| `parameters` | An array of `STRING` | Parameters passed to the main method. Use [Pass context about job runs into job tasks](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html) to set parameters containing information about job runs. | \n### [SparkPythonTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id88) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `python_file` | `STRING` | The URI of the Python file to be executed. DBFS and S3 paths are supported. This field is required. |\n| `parameters` | An array of `STRING` | Command line parameters passed to the Python file. Use [Pass context about job runs into job tasks](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html) to set parameters containing information about job runs. 
| \n### [SparkSubmitTask](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id89) \nImportant \n* You can invoke Spark submit tasks only on new clusters.\n* In the new\\_cluster specification, `libraries` and `spark_conf` are not supported. Instead, use `--jars` and `--py-files` to add Java and Python libraries and `--conf` to set the Spark configuration.\n* `master`, `deploy-mode`, and `executor-cores` are automatically configured by Databricks;\nyou *cannot* specify them in parameters.\n* By default, the Spark submit job uses all available memory (excluding reserved memory for\nDatabricks services). You can set `--driver-memory` and `--executor-memory` to a\nsmaller value to leave some room for off-heap usage.\n* The `--jars`, `--py-files`, `--files` arguments support DBFS and S3 paths. \nFor example, assuming the JAR is uploaded to DBFS, you can run `SparkPi` by setting the following parameters. \n```\n{\n\"parameters\": [\n\"--class\",\n\"org.apache.spark.examples.SparkPi\",\n\"dbfs:\/path\/to\/examples.jar\",\n\"10\"\n]\n}\n\n``` \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `parameters` | An array of `STRING` | Command-line parameters passed to spark submit. Use [Pass context about job runs into job tasks](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html) to set parameters containing information about job runs. | \n### [TriggerType](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id90) \nThese are the types of triggers that can fire a run. \n| Type | Description |\n| --- | --- |\n| `PERIODIC` | Schedules that periodically trigger runs, such as a cron scheduler. |\n| `ONE_TIME` | One-time triggers that fire a single run. This occurs when you trigger a single run on demand through the UI or the API. |\n| `RETRY` | Indicates a run that is triggered as a retry of a previously failed run. This occurs when you request to re-run the job in case of failures. 
| \n### [ViewItem](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id91) \nThe exported content is in HTML format. For example, if the view to export is dashboards, one HTML string is returned for every dashboard. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `content` | `STRING` | Content of the view. |\n| `name` | `STRING` | Name of the view item. In the case of code view, the notebook\u2019s name. In the case of dashboard view, the dashboard\u2019s name. |\n| `type` | [ViewType](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsviewtype) | Type of the view item. | \n### [ViewType](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id92) \n| Type | Description |\n| --- | --- |\n| `NOTEBOOK` | Notebook view item. |\n| `DASHBOARD` | Dashboard view item. | \n### [ViewsToExport](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id93) \nView to export: either code, all dashboards, or all. \n| Type | Description |\n| --- | --- |\n| `CODE` | Code view of the notebook. |\n| `DASHBOARDS` | All dashboard views of the notebook. |\n| `ALL` | All views of the notebook. | \n### [Webhook](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id94) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `id` | `STRING` | Identifier referencing a system notification destination. This field is required. | \n### [WebhookNotifications](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id95) \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `on_start` | An array of [Webhook](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobwebhook) | An optional list of system destinations to be notified when a run begins. If not specified on job creation, reset, or update, the list is empty, and notifications are not sent. A maximum of 3 destinations can be specified for the `on_start` property. 
|\n| `on_success` | An array of [Webhook](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobwebhook) | An optional list of system destinations to be notified when a run completes successfully. A run is considered to have completed successfully if it ends with a `TERMINATED` `life_cycle_state` and a `SUCCESS` `result_state`. If not specified on job creation, reset, or update, the list is empty, and notifications are not sent. A maximum of 3 destinations can be specified for the `on_success` property. |\n| `on_failure` | An array of [Webhook](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobwebhook) | An optional list of system destinations to be notified when a run completes unsuccessfully. A run is considered to have completed unsuccessfully if it ends with an `INTERNAL_ERROR` `life_cycle_state` or a `SKIPPED`, `FAILED`, or `TIMEDOUT` `result_state`. If this is not specified on job creation, reset, or update, the list is empty, and notifications are not sent. A maximum of 3 destinations can be specified for the `on_failure` property. |\n| `on_duration_warning_threshold_exceeded` | An array of [Webhook](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsettingsjobwebhook) | An optional list of system destinations to be notified when the duration of a run exceeds the threshold specified for the `RUN_DURATION_SECONDS` metric in the `health` field. A maximum of 3 destinations can be specified for the `on_duration_warning_threshold_exceeded` property. | \n### [WorkspaceStorageInfo](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#id96) \nWorkspace storage information. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `destination` | `STRING` | File destination. 
Example: `\/Users\/someone@domain.com\/init_script.sh` |\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html"} +{"content":"# Databricks reference documentation\n### Delta Live Tables API guide\n\nImportant \nThis article\u2019s content has been retired and might not be updated. See [Delta Live Tables](https:\/\/docs.databricks.com\/api\/workspace\/pipelines) in the Databricks REST API Reference. \nThe Delta Live Tables API allows you to create, edit, delete, start, and view details about pipelines. \nImportant \nTo access Databricks REST APIs, you must [authenticate](https:\/\/docs.databricks.com\/dev-tools\/auth\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html"} +{"content":"# Databricks reference documentation\n### Delta Live Tables API guide\n#### Create a pipeline\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/pipelines` | `POST` | \nCreates a new Delta Live Tables pipeline. \n### Example \nThis example creates a new triggered pipeline. \n#### Request \n```\ncurl --netrc -X POST \\\nhttps:\/\/\/api\/2.0\/pipelines \\\n--data @pipeline-settings.json\n\n``` \n`pipeline-settings.json`: \n```\n{\n\"name\": \"Wikipedia pipeline (SQL)\",\n\"storage\": \"\/Users\/username\/data\",\n\"clusters\": [\n{\n\"label\": \"default\",\n\"autoscale\": {\n\"min_workers\": 1,\n\"max_workers\": 5,\n\"mode\": \"ENHANCED\"\n}\n}\n],\n\"libraries\": [\n{\n\"notebook\": {\n\"path\": \"\/Users\/username\/DLT Notebooks\/Delta Live Tables quickstart (SQL)\"\n}\n}\n],\n\"continuous\": false\n}\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file. 
\n#### Response \n```\n{\n\"pipeline_id\": \"a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5\"\n}\n\n``` \n### Request structure \nSee [PipelineSettings](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipeline-spec). \n### Response structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| pipeline\\_id | `STRING` | The unique identifier for the newly created pipeline. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html"} +{"content":"# Databricks reference documentation\n### Delta Live Tables API guide\n#### Edit a pipeline\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/pipelines\/{pipeline_id}` | `PUT` | \nUpdates the settings for an existing pipeline. \n### Example \nThis example adds a `target` parameter to the pipeline with ID `a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5`: \n#### Request \n```\ncurl --netrc -X PUT \\\nhttps:\/\/\/api\/2.0\/pipelines\/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 \\\n--data @pipeline-settings.json\n\n``` \n`pipeline-settings.json` \n```\n{\n\"id\": \"a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5\",\n\"name\": \"Wikipedia pipeline (SQL)\",\n\"storage\": \"\/Users\/username\/data\",\n\"clusters\": [\n{\n\"label\": \"default\",\n\"autoscale\": {\n\"min_workers\": 1,\n\"max_workers\": 5,\n\"mode\": \"ENHANCED\"\n}\n}\n],\n\"libraries\": [\n{\n\"notebook\": {\n\"path\": \"\/Users\/username\/DLT Notebooks\/Delta Live Tables quickstart (SQL)\"\n}\n}\n],\n\"target\": \"wikipedia_quickstart_data\",\n\"continuous\": false\n}\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file. 
\n### Request structure \nSee [PipelineSettings](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipeline-spec).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html"} +{"content":"# Databricks reference documentation\n### Delta Live Tables API guide\n#### Delete a pipeline\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/pipelines\/{pipeline_id}` | `DELETE` | \nDeletes a pipeline from the Delta Live Tables system. \n### Example \nThis example deletes the pipeline with ID `a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5`: \n#### Request \n```\ncurl --netrc -X DELETE \\\nhttps:\/\/\/api\/2.0\/pipelines\/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html"} +{"content":"# Databricks reference documentation\n### Delta Live Tables API guide\n#### Start a pipeline update\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/pipelines\/{pipeline_id}\/updates` | `POST` | \nStarts an update for a pipeline. You can start an update for the entire pipeline graph, or a selective update of specific tables. \n### Examples \n#### Start a full refresh \nThis example starts an update with full refresh for the pipeline with ID `a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5`: \n##### Request \n```\ncurl --netrc -X POST \\\nhttps:\/\/\/api\/2.0\/pipelines\/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5\/updates \\\n--data '{ \"full_refresh\": \"true\" }'\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`. 
\nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file. \n##### Response \n```\n{\n\"update_id\": \"a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8\",\n\"request_id\": \"a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8\"\n}\n\n``` \n#### Start an update of selected tables \nThis example starts an update that refreshes the `sales_orders_cleaned` and `sales_order_in_chicago` tables in the pipeline with ID `a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5`: \n##### Request \n```\ncurl --netrc -X POST \\\nhttps:\/\/\/api\/2.0\/pipelines\/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5\/updates \\\n--data '{ \"refresh_selection\": [\"sales_orders_cleaned\", \"sales_order_in_chicago\"] }'\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file. \n##### Response \n```\n{\n\"update_id\": \"a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8\",\n\"request_id\": \"a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8\"\n}\n\n``` \n#### Start a full update of selected tables \nThis example starts an update of the `sales_orders_cleaned` and `sales_order_in_chicago` tables, and an update with full refresh of the `customers` and `sales_orders_raw` tables in the pipeline with ID `a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5`. \n##### Request \n```\ncurl --netrc -X POST \\\nhttps:\/\/\/api\/2.0\/pipelines\/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5\/updates \\\n--data '{ \"refresh_selection\": [\"sales_orders_cleaned\", \"sales_order_in_chicago\"], \"full_refresh_selection\": [\"customers\", \"sales_orders_raw\"] }'\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file. 
\n##### Response \n```\n{\n\"update_id\": \"a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8\",\n\"request_id\": \"a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8\"\n}\n\n``` \n### Request structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `full_refresh` | `BOOLEAN` | Whether to reprocess all data. If `true`, the Delta Live Tables system resets all tables that are resettable before running the pipeline. This field is optional. The default value is `false`. An error is returned if `full_refresh` is true and either `refresh_selection` or `full_refresh_selection` is set. |\n| `refresh_selection` | An array of `STRING` | A list of tables to update. Use `refresh_selection` to start a refresh of a selected set of tables in the pipeline graph. This field is optional. If both `refresh_selection` and `full_refresh_selection` are empty, the entire pipeline graph is refreshed. An error is returned if:* `full_refresh` is true and `refresh_selection` is set. * One or more of the specified tables does not exist in the pipeline graph. |\n| `full_refresh_selection` | An array of `STRING` | A list of tables to update with full refresh. Use `full_refresh_selection` to start an update of a selected set of tables. The states of the specified tables are reset before the Delta Live Tables system starts the update. This field is optional. If both `refresh_selection` and `full_refresh_selection` are empty, the entire pipeline graph is refreshed. An error is returned if:* `full_refresh` is true and `refresh_selection` is set. * One or more of the specified tables does not exist in the pipeline graph. * One or more of the specified tables is not resettable. | \n### Response structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `update_id` | `STRING` | The unique identifier of the newly created update. |\n| `request_id` | `STRING` | The unique identifier of the request that started the update. 
|\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html"} +{"content":"# Databricks reference documentation\n### Delta Live Tables API guide\n#### Get the status of a pipeline update request\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/pipelines\/{pipeline_id}\/requests\/{request_id}` | `GET` | \nGets the status and information for the pipeline update associated with `request_id`, where `request_id` is a unique identifier for the request initiating the pipeline update. If the update is retried or restarted, then the new update inherits the request\\_id. \n### Example \nFor the pipeline with ID `a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5`, this example returns status and information for the update associated with request ID `a83d9f7c-d798-4fd5-aa39-301b6e6f4429`: \n#### Request \n```\ncurl --netrc -X GET \\\nhttps:\/\/\/api\/2.0\/pipelines\/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5\/requests\/a83d9f7c-d798-4fd5-aa39-301b6e6f4429\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file. 
\n#### Response \n```\n{\n\"status\": \"TERMINATED\",\n\"latest_update\":{\n\"pipeline_id\": \"a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5\",\n\"update_id\": \"90da8183-89de-4715-b5a9-c243e67f0093\",\n\"config\":{\n\"id\": \"aae89b88-e97e-40c4-8e1a-1b7ac76657e8\",\n\"name\": \"Retail sales (SQL)\",\n\"storage\": \"\/Users\/username\/data\",\n\"configuration\":{\n\"pipelines.numStreamRetryAttempts\": \"5\"\n},\n\"clusters\":[\n{\n\"label\": \"default\",\n\"autoscale\":{\n\"min_workers\": 1,\n\"max_workers\": 5,\n\"mode\": \"ENHANCED\"\n}\n}\n],\n\"libraries\":[\n{\n\"notebook\":{\n\"path\": \"\/Users\/username\/DLT Notebooks\/Delta Live Tables quickstart (SQL)\"\n}\n}\n],\n\"continuous\": false,\n\"development\": true,\n\"photon\": true,\n\"edition\": \"advanced\",\n\"channel\": \"CURRENT\"\n},\n\"cause\": \"API_CALL\",\n\"state\": \"COMPLETED\",\n\"cluster_id\": \"1234-567891-abcde123\",\n\"creation_time\": 1664304117145,\n\"full_refresh\": false,\n\"request_id\": \"a83d9f7c-d798-4fd5-aa39-301b6e6f4429\"\n}\n}\n\n``` \n### Response structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `status` | `STRING` | The status of the pipeline update request. One of* `ACTIVE`: An update for this request is actively running or may be retried in a new update. * `TERMINATED`: The request is terminated and will not be retried or restarted. |\n| `pipeline_id` | `STRING` | The unique identifier of the pipeline. |\n| `update_id` | `STRING` | The unique identifier of the update. |\n| `config` | [PipelineSettings](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipeline-spec) | The pipeline settings. |\n| `cause` | `STRING` | The trigger for the update. One of `API_CALL`, `RETRY_ON_FAILURE`, `SERVICE_UPGRADE`, `SCHEMA_CHANGE`, `JOB_TASK`, or `USER_ACTION`. |\n| `state` | `STRING` | The state of the update. 
One of `QUEUED`, `CREATED`, `WAITING_FOR_RESOURCES`, `INITIALIZING`, `RESETTING`, `SETTING_UP_TABLES`, `RUNNING`, `STOPPING`, `COMPLETED`, `FAILED`, or `CANCELED`. |\n| `cluster_id` | `STRING` | The identifier of the cluster running the update. |\n| `creation_time` | `INT64` | The timestamp when the update was created. |\n| `full_refresh` | `BOOLEAN` | Whether this update resets all tables before running. |\n| `refresh_selection` | An array of `STRING` | A list of tables to update without full refresh. |\n| `full_refresh_selection` | An array of `STRING` | A list of tables to update with full refresh. |\n| `request_id` | `STRING` | The unique identifier of the request that started the update. This is the value returned by the [update](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#start-update) request. If the update is retried or restarted, then the new update inherits the request\\_id. However, the `update_id` will be different. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html"} +{"content":"# Databricks reference documentation\n### Delta Live Tables API guide\n#### Stop any active pipeline update\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/pipelines\/{pipeline_id}\/stop` | `POST` | \nStops any active pipeline update. If no update is running, this request is a no-op. \nFor a continuous pipeline, the pipeline execution is paused. Tables currently processing finish refreshing, but downstream tables are not refreshed. On the next pipeline update, Delta Live Tables performs a selected refresh of tables that did not complete processing, and resumes processing of the remaining pipeline DAG. \nFor a triggered pipeline, the pipeline execution is stopped. Tables currently processing finish refreshing, but downstream tables are not refreshed. On the next pipeline update, Delta Live Tables refreshes all tables. 
\n### Example \nThis example stops an update for the pipeline with ID `a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5`: \n#### Request \n```\ncurl --netrc -X POST \\\nhttps:\/\/\/api\/2.0\/pipelines\/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5\/stop\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html"} +{"content":"# Databricks reference documentation\n### Delta Live Tables API guide\n#### List pipeline events\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/pipelines\/{pipeline_id}\/events` | `GET` | \nRetrieves events for a pipeline. \n### Example \nThis example retrieves a maximum of 5 events for the pipeline with ID `a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5`. \n#### Request \n```\ncurl --netrc -X GET \\\nhttps:\/\/\/api\/2.0\/pipelines\/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5\/events?max_results=5\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file. \n### Request structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `page_token` | `STRING` | Page token returned by previous call. This field is mutually exclusive with all fields in this request except max\\_results. An error is returned if any fields other than max\\_results are set when this field is set. This field is optional. |\n| `max_results` | `INT32` | The maximum number of entries to return in a single page. The system may return fewer than `max_results` events in a response, even if there are more events available. This field is optional. 
The default value is 25. The maximum value is 100. An error is returned if the value of `max_results` is greater than 100. |\n| `order_by` | `STRING` | A string indicating a sort order by timestamp for the results, for example, `[\"timestamp asc\"]`. The sort order can be ascending or descending. By default, events are returned in descending order by timestamp. This field is optional. |\n| `filter` | `STRING` | Criteria to select a subset of results, expressed using a SQL-like syntax. The supported filters are:* `level='INFO'` (or `WARN` or `ERROR`) * `level in ('INFO', 'WARN')` * `id='[event-id]'` * `timestamp > 'TIMESTAMP'` (or `>=`,`<`,`<=`,`=`) Composite expressions are supported, for example: `level in ('ERROR', 'WARN') AND timestamp> '2021-07-22T06:37:33.083Z'` This field is optional. | \n### Response structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `events` | An array of pipeline events. | The list of events matching the request criteria. |\n| `next_page_token` | `STRING` | If present, a token to fetch the next page of events. |\n| `prev_page_token` | `STRING` | If present, a token to fetch the previous page of events. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html"} +{"content":"# Databricks reference documentation\n### Delta Live Tables API guide\n#### Get pipeline details\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/pipelines\/{pipeline_id}` | `GET` | \nGets details about a pipeline, including the pipeline settings and recent updates. \n### Example \nThis example gets details for the pipeline with ID `a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5`: \n#### Request \n```\ncurl --netrc -X GET \\\nhttps:\/\/\/api\/2.0\/pipelines\/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`. 
\nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file. \n#### Response \n```\n{\n\"pipeline_id\": \"a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5\",\n\"spec\": {\n\"id\": \"a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5\",\n\"name\": \"Wikipedia pipeline (SQL)\",\n\"storage\": \"\/Users\/username\/data\",\n\"clusters\": [\n{\n\"label\": \"default\",\n\"autoscale\": {\n\"min_workers\": 1,\n\"max_workers\": 5,\n\"mode\": \"ENHANCED\"\n}\n}\n],\n\"libraries\": [\n{\n\"notebook\": {\n\"path\": \"\/Users\/username\/DLT Notebooks\/Delta Live Tables quickstart (SQL)\"\n}\n}\n],\n\"target\": \"wikipedia_quickstart_data\",\n\"continuous\": false\n},\n\"state\": \"IDLE\",\n\"cluster_id\": \"1234-567891-abcde123\",\n\"name\": \"Wikipedia pipeline (SQL)\",\n\"creator_user_name\": \"username\",\n\"latest_updates\": [\n{\n\"update_id\": \"8a0b6d02-fbd0-11eb-9a03-0242ac130003\",\n\"state\": \"COMPLETED\",\n\"creation_time\": \"2021-08-13T00:37:30.279Z\"\n},\n{\n\"update_id\": \"a72c08ba-fbd0-11eb-9a03-0242ac130003\",\n\"state\": \"CANCELED\",\n\"creation_time\": \"2021-08-13T00:35:51.902Z\"\n},\n{\n\"update_id\": \"ac37d924-fbd0-11eb-9a03-0242ac130003\",\n\"state\": \"FAILED\",\n\"creation_time\": \"2021-08-13T00:33:38.565Z\"\n}\n],\n\"run_as_user_name\": \"username\"\n}\n\n``` \n### Response structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `pipeline_id` | `STRING` | The unique identifier of the pipeline. |\n| `spec` | [PipelineSettings](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipeline-spec) | The pipeline settings. |\n| `state` | `STRING` | The state of the pipeline. One of `IDLE` or `RUNNING`. If state = `RUNNING`, then there is at least one active update. |\n| `cluster_id` | `STRING` | The identifier of the cluster running the pipeline. |\n| `name` | `STRING` | The user-friendly name for this pipeline. |\n| `creator_user_name` | `STRING` | The username of the pipeline creator. 
|\n| `latest_updates` | An array of [UpdateStateInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#update-state-info) | Status of the most recent updates for the pipeline, ordered with the newest update first. |\n| `run_as_user_name` | `STRING` | The username that the pipeline runs as. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html"} +{"content":"# Databricks reference documentation\n### Delta Live Tables API guide\n#### Get update details\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/pipelines\/{pipeline_id}\/updates\/{update_id}` | `GET` | \nGets details for a pipeline update. \n### Example \nThis example gets details for update `9a84f906-fc51-11eb-9a03-0242ac130003` for the pipeline with ID `a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5`: \n#### Request \n```\ncurl --netrc -X GET \\\nhttps:\/\/\/api\/2.0\/pipelines\/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5\/updates\/9a84f906-fc51-11eb-9a03-0242ac130003\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file. 
\n#### Response \n```\n{\n\"update\": {\n\"pipeline_id\": \"a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5\",\n\"update_id\": \"9a84f906-fc51-11eb-9a03-0242ac130003\",\n\"config\": {\n\"id\": \"a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5\",\n\"name\": \"Wikipedia pipeline (SQL)\",\n\"storage\": \"\/Users\/username\/data\",\n\"configuration\": {\n\"pipelines.numStreamRetryAttempts\": \"5\"\n},\n\"clusters\": [\n{\n\"label\": \"default\",\n\"autoscale\": {\n\"min_workers\": 1,\n\"max_workers\": 5,\n\"mode\": \"ENHANCED\"\n}\n}\n],\n\"libraries\": [\n{\n\"notebook\": {\n\"path\": \"\/Users\/username\/DLT Notebooks\/Delta Live Tables quickstart (SQL)\"\n}\n}\n],\n\"target\": \"wikipedia_quickstart_data\",\n\"continuous\": false,\n\"development\": false\n},\n\"cause\": \"API_CALL\",\n\"state\": \"COMPLETED\",\n\"creation_time\": 1628815050279,\n\"full_refresh\": true,\n\"request_id\": \"a83d9f7c-d798-4fd5-aa39-301b6e6f4429\"\n}\n}\n\n``` \n### Response structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `pipeline_id` | `STRING` | The unique identifier of the pipeline. |\n| `update_id` | `STRING` | The unique identifier of this update. |\n| `config` | [PipelineSettings](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipeline-spec) | The pipeline settings. |\n| `cause` | `STRING` | The trigger for the update. One of `API_CALL`, `RETRY_ON_FAILURE`, or `SERVICE_UPGRADE`. |\n| `state` | `STRING` | The state of the update. One of `QUEUED`, `CREATED`, `WAITING_FOR_RESOURCES`, `INITIALIZING`, `RESETTING`, `SETTING_UP_TABLES`, `RUNNING`, `STOPPING`, `COMPLETED`, `FAILED`, or `CANCELED`. |\n| `cluster_id` | `STRING` | The identifier of the cluster running the pipeline. |\n| `creation_time` | `INT64` | The timestamp when the update was created. |\n| `full_refresh` | `BOOLEAN` | Whether this was a full refresh. If true, all pipeline tables were reset before running the update. 
|\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html"} +{"content":"# Databricks reference documentation\n### Delta Live Tables API guide\n#### List pipelines\n\n| Endpoint | HTTP Method |\n| --- | --- |\n| `2.0\/pipelines\/` | `GET` | \nLists pipelines defined in the Delta Live Tables system. \n### Example \nThis example retrieves details for pipelines where the name contains `quickstart`: \n#### Request \n```\ncurl --netrc -X GET \\\nhttps:\/\/\/api\/2.0\/pipelines?filter=name%20LIKE%20%27%25quickstart%25%27\n\n``` \nReplace: \n* `` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file. \n#### Response \n```\n{\n\"statuses\": [\n{\n\"pipeline_id\": \"e0f01758-fc61-11eb-9a03-0242ac130003\",\n\"state\": \"IDLE\",\n\"name\": \"DLT quickstart (Python)\",\n\"latest_updates\": [\n{\n\"update_id\": \"ee9ae73e-fc61-11eb-9a03-0242ac130003\",\n\"state\": \"COMPLETED\",\n\"creation_time\": \"2021-08-13T00:34:21.871Z\"\n}\n],\n\"creator_user_name\": \"username\"\n},\n{\n\"pipeline_id\": \"f4c82f5e-fc61-11eb-9a03-0242ac130003\",\n\"state\": \"IDLE\",\n\"name\": \"My DLT quickstart example\",\n\"creator_user_name\": \"username\"\n}\n],\n\"next_page_token\": \"eyJ...==\",\n\"prev_page_token\": \"eyJ..x9\"\n}\n\n``` \n### Request structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `page_token` | `STRING` | Page token returned by previous call. This field is optional. |\n| `max_results` | `INT32` | The maximum number of entries to return in a single page. The system may return fewer than `max_results` events in a response, even if there are more events available. This field is optional. The default value is 25. The maximum value is 100. An error is returned if the value of `max_results` is greater than 100. 
|\n| `order_by` | An array of `STRING` | A list of strings specifying the order of results, for example, `[\"name asc\"]`. Supported `order_by` fields are `id` and `name`. The default is `id asc`. This field is optional. |\n| `filter` | `STRING` | Select a subset of results based on the specified criteria. The supported filters are: `\"notebook=''\"` to select pipelines that reference the provided notebook path. `name LIKE '[pattern]'` to select pipelines with a name that matches `pattern`. Wildcards are supported, for example: `name LIKE '%shopping%'`. Composite filters are not supported. This field is optional. | \n### Response structure \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `statuses` | An array of [PipelineStateInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipeline-state-info) | The list of pipelines matching the request criteria. |\n| `next_page_token` | `STRING` | If present, a token to fetch the next page of pipelines. |\n| `prev_page_token` | `STRING` | If present, a token to fetch the previous page of pipelines. 
|\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html"} +{"content":"# Databricks reference documentation\n### Delta Live Tables API guide\n#### Data structures\n\nIn this section: \n* [AwsAttributes](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#awsattributes)\n* [AwsAvailability](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#awsavailability)\n* [ClusterLogConf](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#clusterlogconf)\n* [DbfsStorageInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#dbfsstorageinfo)\n* [EbsVolumeType](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#ebsvolumetype)\n* [FileStorageInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#filestorageinfo)\n* [InitScriptInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#initscriptinfo)\n* [KeyValue](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#keyvalue)\n* [NotebookLibrary](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#notebooklibrary)\n* [PipelinesAutoScale](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipelinesautoscale)\n* [PipelineLibrary](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipelinelibrary)\n* [PipelinesNewCluster](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipelinesnewcluster)\n* [PipelineSettings](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipelinesettings)\n* [PipelineStateInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipelinestateinfo)\n* [S3StorageInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#s3storageinfo)\n* [UpdateStateInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#updatestateinfo)\n* [WorkspaceStorageInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#workspacestorageinfo) \n### 
[AwsAttributes](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id37) \nAttributes set during cluster creation related to Amazon Web Services. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `first_on_demand` | `INT32` | The first first\\_on\\_demand nodes of the cluster will be placed on on-demand instances. If this value is greater than 0, the cluster driver node will be placed on an on-demand instance. If this value is greater than or equal to the current cluster size, all nodes will be placed on on-demand instances. If this value is less than the current cluster size, first\\_on\\_demand nodes will be placed on on-demand instances and the remainder will be placed on `availability` instances. This value does not affect cluster size and cannot be mutated over the lifetime of a cluster. |\n| `availability` | [AwsAvailability](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#clusterawsavailability) | Availability type used for all subsequent nodes past the first\\_on\\_demand ones. **Note:** If first\\_on\\_demand is zero, this availability type will be used for the entire cluster. |\n| `zone_id` | `STRING` | Identifier for the availability zone (AZ) in which the cluster resides. By default, the setting has a value of **auto**, otherwise known as Auto-AZ. With Auto-AZ, Databricks selects the AZ based on available IPs in the workspace subnets and retries in other availability zones if AWS returns insufficient capacity errors. If you want, you can also specify an availability zone to use. This benefits accounts that have reserved instances in a specific AZ. Specify the AZ as a string (for example, `\"us-west-2a\"`). The provided availability zone must be in the same region as the Databricks deployment. For example, \u201cus-west-2a\u201d is not a valid zone ID if the Databricks deployment resides in the \u201cus-east-1\u201d region. 
The list of available zones as well as the default value can be found by using the [GET \/api\/2.0\/clusters\/list-zones](https:\/\/docs.databricks.com\/api\/workspace\/clusters\/listzones) call. |\n| `instance_profile_arn` | `STRING` | Nodes for this cluster will only be placed on AWS instances with this instance profile. If omitted, nodes will be placed on instances without an instance profile. The instance profile must have previously been added to the Databricks environment by an account administrator. This feature may only be available to certain customer plans. |\n| `spot_bid_price_percent` | `INT32` | The max price for AWS spot instances, as a percentage of the corresponding instance type\u2019s on-demand price. For example, if this field is set to 50, and the cluster needs a new `i3.xlarge` spot instance, then the max price is half of the price of on-demand `i3.xlarge` instances. Similarly, if this field is set to 200, the max price is twice the price of on-demand `i3.xlarge` instances. If not specified, the default value is 100. When spot instances are requested for this cluster, only spot instances whose max price percentage matches this field will be considered. For safety, we enforce this field to be no more than 10000. |\n| `ebs_volume_type` | [EbsVolumeType](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#clusterebsvolumetype) | The type of EBS volumes that will be launched with this cluster. |\n| `ebs_volume_count` | `INT32` | The number of volumes launched for each instance. You can choose up to 10 volumes. This feature is only enabled for supported node types. Legacy node types cannot specify custom EBS volumes. For node types with no instance store, at least one EBS volume needs to be specified; otherwise, cluster creation will fail. These EBS volumes will be mounted at `\/ebs0`, `\/ebs1`, and so on. Instance store volumes will be mounted at `\/local_disk0`, `\/local_disk1`, and so on. 
If EBS volumes are attached, Databricks will configure Spark to use only the EBS volumes for scratch storage because heterogeneously sized scratch devices can lead to inefficient disk utilization. If no EBS volumes are attached, Databricks will configure Spark to use instance store volumes. If EBS volumes are specified, then the Spark configuration `spark.local.dir` will be overridden. |\n| `ebs_volume_size` | `INT32` | The size of each EBS volume (in GiB) launched for each instance. For general purpose SSD, this value must be within the range 100 - 4096. For throughput optimized HDD, this value must be within the range 500 - 4096. Custom EBS volumes cannot be specified for the legacy node types (*memory-optimized* and *compute-optimized*). |\n| `ebs_volume_iops` | `INT32` | The number of IOPS per EBS gp3 volume. This value must be between 3000 and 16000. The value of IOPS and throughput is calculated based on AWS documentation to match the maximum performance of a gp2 volume with the same volume size. For more information, see the [EBS volume limit calculator](https:\/\/github.com\/awslabs\/aws-support-tools\/tree\/master\/EBS\/VolumeLimitCalculator). |\n| `ebs_volume_throughput` | `INT32` | The throughput per EBS gp3 volume, in MiB per second. This value must be between 125 and 1000. | \nIf neither `ebs_volume_iops` nor `ebs_volume_throughput` is specified, the values are inferred from the disk size: \n| Disk size | IOPS | Throughput |\n| --- | --- | --- |\n| Greater than 1000 | 3 times the disk size, up to 16000 | 250 |\n| Between 170 and 1000 | 3000 | 250 |\n| Below 170 | 3000 | 125 | \n### [AwsAvailability](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id38) \nThe set of AWS availability types supported when setting up nodes for a cluster. \n| Type | Description |\n| --- | --- |\n| `SPOT` | Use spot instances. |\n| `ON_DEMAND` | Use on-demand instances. 
|\n| `SPOT_WITH_FALLBACK` | Preferably use spot instances, but fall back to on-demand instances if spot instances cannot be acquired (for example, if AWS spot prices are too high). | \n### [ClusterLogConf](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id39) \nPath to cluster log. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `dbfs` OR `s3` | [DbfsStorageInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#clusterclusterlogconfdbfsstorageinfo) [S3StorageInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#clusterclusterinitscriptinfos3storageinfo) | DBFS location of cluster log. Destination must be provided. For example, `{ \"dbfs\" : { \"destination\" : \"dbfs:\/home\/cluster_log\" } }` S3 location of cluster log. `destination` and either `region` or `warehouse` must be provided. For example, `{ \"s3\": { \"destination\" : \"s3:\/\/cluster_log_bucket\/prefix\", \"region\" : \"us-west-2\" } }` | \n### [DbfsStorageInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id40) \nDBFS storage information. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `destination` | `STRING` | DBFS destination. Example: `dbfs:\/my\/path` | \n### [EbsVolumeType](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id41) \nDatabricks supports gp2 and gp3 EBS volume types. Follow the instructions at [Manage SSD storage](https:\/\/docs.databricks.com\/admin\/clusters\/manage-ssd.html) to select gp2 or gp3 for your workspace. \n| Type | Description |\n| --- | --- |\n| `GENERAL_PURPOSE_SSD` | Provision extra storage using AWS EBS volumes. |\n| `THROUGHPUT_OPTIMIZED_HDD` | Provision extra storage using AWS st1 volumes. | \n### [FileStorageInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id42) \nFile storage information. 
\nNote \nThis location type is only available for clusters set up using [Databricks Container Services](https:\/\/docs.databricks.com\/compute\/custom-containers.html). \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `destination` | `STRING` | File destination. Example: `file:\/my\/file.sh` | \n### [InitScriptInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id43) \nPath to an init script. \nFor instructions on using init scripts with [Databricks Container Services](https:\/\/docs.databricks.com\/compute\/custom-containers.html), see [Use an init script](https:\/\/docs.databricks.com\/compute\/custom-containers.html#containers-init-script). \nNote \nThe file storage type (field name: `file`) is only available for clusters set up using [Databricks Container Services](https:\/\/docs.databricks.com\/compute\/custom-containers.html). See [FileStorageInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#clusterclusterinitscriptinfofilestorageinfo). \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `workspace` OR `dbfs` (deprecated) OR `s3` | [WorkspaceStorageInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#clusterclusterinitscriptinfoworkspacestorageinfo) [DbfsStorageInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#clusterclusterlogconfdbfsstorageinfo) (deprecated) [S3StorageInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#clusterclusterinitscriptinfos3storageinfo) | Workspace location of init script. Destination must be provided. For example, `{ \"workspace\" : { \"destination\" : \"\/Users\/someone@domain.com\/init_script.sh\" } }` (Deprecated) DBFS location of init script. Destination must be provided. For example, `{ \"dbfs\" : { \"destination\" : \"dbfs:\/home\/init_script\" } }` S3 location of init script. Destination and either region or warehouse must be provided. 
For example, `{ \"s3\": { \"destination\" : \"s3:\/\/init_script_bucket\/prefix\", \"region\" : \"us-west-2\" } }` | \n### [KeyValue](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id44) \nA key-value pair that specifies configuration parameters. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `key` | `STRING` | The configuration property name. |\n| `value` | `STRING` | The configuration property value. | \n### [NotebookLibrary](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id45) \nA specification for a notebook containing pipeline code. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `path` | `STRING` | The absolute path to the notebook. This field is required. | \n### [PipelinesAutoScale](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id46) \nAttributes defining an autoscaling cluster. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `min_workers` | `INT32` | The minimum number of workers to which the cluster can scale down when underutilized. It is also the initial number of workers the cluster will have after creation. |\n| `max_workers` | `INT32` | The maximum number of workers to which the cluster can scale up when overloaded. max\\_workers must be strictly greater than min\\_workers. |\n| `mode` | `STRING` | The autoscaling mode for the cluster:* `ENHANCED` to use [enhanced autoscaling](https:\/\/docs.databricks.com\/delta-live-tables\/auto-scaling.html). * `LEGACY` to use the [cluster autoscaling functionality](https:\/\/docs.databricks.com\/compute\/configure.html#autoscaling). | \n### [PipelineLibrary](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id47) \nA specification for pipeline dependencies. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `notebook` | [NotebookLibrary](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipelines-notebook-library) | The path to a notebook defining Delta Live Tables datasets. 
The path must be in the Databricks workspace, for example: `{ \"notebook\" : { \"path\" : \"\/my-pipeline-notebook-path\" } }`. | \n### [PipelinesNewCluster](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id48) \nA pipeline cluster specification. \nThe Delta Live Tables system sets the following attributes. These attributes cannot be configured by users: \n* `spark_version` \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `label` | `STRING` | A label for the cluster specification, either `default` to configure the default cluster, or `maintenance` to configure the maintenance cluster. This field is optional. The default value is `default`. |\n| `spark_conf` | [KeyValue](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipelines-keyvalue) | An object containing a set of optional, user-specified Spark configuration key-value pairs. You can also pass in a string of extra JVM options to the driver and the executors via `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` respectively. Example Spark confs: `{\"spark.speculation\": true, \"spark.streaming.ui.retainedBatches\": 5}` or `{\"spark.driver.extraJavaOptions\": \"-verbose:gc -XX:+PrintGCDetails\"}` |\n| `aws_attributes` | [AwsAttributes](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#clusterawsattributes) | Attributes related to clusters running on Amazon Web Services. If not specified at cluster creation, a set of default values will be used. |\n| `node_type_id` | `STRING` | This field encodes, through a single value, the resources available to each of the Spark nodes in this cluster. For example, the Spark nodes can be provisioned and optimized for memory-intensive or compute-intensive workloads. A list of available node types can be retrieved by using the [GET 2.0\/clusters\/list-node-types](https:\/\/docs.databricks.com\/api\/workspace\/clusters\/listnodetypes) call. 
|\n| `driver_node_type_id` | `STRING` | The node type of the Spark driver. This field is optional; if unset, the driver node type will be set as the same value as `node_type_id` defined above. |\n| `ssh_public_keys` | An array of `STRING` | SSH public key contents that will be added to each Spark node in this cluster. The corresponding private keys can be used to login with the user name `ubuntu` on port `2200`. Up to 10 keys can be specified. |\n| `custom_tags` | [KeyValue](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipelines-keyvalue) | An object containing a set of tags for cluster resources. Databricks tags all cluster resources with these tags in addition to default\\_tags. **Note**:* Tags are not supported on legacy node types such as compute-optimized and memory-optimized * Databricks allows at most 45 custom tags. |\n| `cluster_log_conf` | [ClusterLogConf](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#clusterclusterlogconf) | The configuration for delivering Spark logs to a long-term storage destination. Only one destination can be specified for one cluster. If this configuration is provided, the logs will be delivered to the destination every `5 mins`. The destination of driver logs is `\/\/driver`, while the destination of executor logs is `\/\/executor`. |\n| `spark_env_vars` | [KeyValue](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipelines-keyvalue) | An object containing a set of optional, user-specified environment variable key-value pairs. Key-value pairs of the form (X,Y) are exported as is (that is, `export X='Y'`) while launching the driver and workers. In order to specify an additional set of `SPARK_DAEMON_JAVA_OPTS`, Databricks recommends appending them to `$SPARK_DAEMON_JAVA_OPTS` as shown in the following example. This ensures that all default Databricks managed environmental variables are included as well. 
Example Spark environment variables: `{\"SPARK_WORKER_MEMORY\": \"28000m\", \"SPARK_LOCAL_DIRS\": \"\/local_disk0\"}` or `{\"SPARK_DAEMON_JAVA_OPTS\": \"$SPARK_DAEMON_JAVA_OPTS -Dspark.shuffle.service.enabled=true\"}` |\n| `init_scripts` | An array of [InitScriptInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#clusterclusterinitscriptinfo) | The configuration for storing init scripts. Any number of destinations can be specified. The scripts are executed sequentially in the order provided. If `cluster_log_conf` is specified, init script logs are sent to `\/\/init_scripts`. |\n| `instance_pool_id` | `STRING` | The optional ID of the instance pool to which the cluster belongs. See [Pool configuration reference](https:\/\/docs.databricks.com\/compute\/pools.html). |\n| `driver_instance_pool_id` | `STRING` | The optional ID of the instance pool to use for the driver node. You must also specify `instance_pool_id`. See [Instance Pools API](https:\/\/docs.databricks.com\/api\/workspace\/instancepools). |\n| `policy_id` | `STRING` | A [cluster policy](https:\/\/docs.databricks.com\/api\/workspace\/clusterpolicies) ID. |\n| `num_workers` OR `autoscale` | `INT32` OR [PipelinesAutoScale](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipelines-autoscale) | If num\\_workers, number of worker nodes that this cluster should have. A cluster has one Spark driver and num\\_workers executors for a total of num\\_workers + 1 Spark nodes. When reading the properties of a cluster, this field reflects the desired number of workers rather than the actual number of workers. For instance, if a cluster is resized from 5 to 10 workers, this field is updated to reflect the target size of 10 workers, whereas the workers listed in executors gradually increase from 5 to 10 as the new nodes are provisioned. If autoscale, parameters needed to automatically scale clusters up and down based on load. This field is optional. 
|\n| `apply_policy_default_values` | `BOOLEAN` | Whether to use [policy](https:\/\/docs.databricks.com\/api\/workspace\/clusterpolicies) default values for missing cluster attributes. | \n### [PipelineSettings](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id49) \nThe settings for a pipeline deployment. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `id` | `STRING` | The unique identifier for this pipeline. The identifier is created by the Delta Live Tables system, and must not be provided when creating a pipeline. |\n| `name` | `STRING` | A user-friendly name for this pipeline. This field is optional. By default, the pipeline name must be unique. To use a duplicate name, set `allow_duplicate_names` to `true` in the pipeline configuration. |\n| `storage` | `STRING` | A path to a DBFS directory for storing checkpoints and tables created by the pipeline. This field is optional. The system uses a default location if this field is empty. |\n| `configuration` | A map of `STRING:STRING` | A list of key-value pairs to add to the Spark configuration of the cluster that will run the pipeline. This field is optional. Elements must be formatted as key:value pairs. |\n| `clusters` | An array of [PipelinesNewCluster](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipelines-new-cluster) | An array of specifications for the clusters to run the pipeline. This field is optional. If this is not specified, the system will select a default cluster configuration for the pipeline. |\n| `libraries` | An array of [PipelineLibrary](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#pipeline-library) | The notebooks containing the pipeline code and any dependencies required to run the pipeline. |\n| `target` | `STRING` | A database name for persisting pipeline output data. See [Publish data from Delta Live Tables to the Hive metastore](https:\/\/docs.databricks.com\/delta-live-tables\/publish.html) for more information. 
|\n| `continuous` | `BOOLEAN` | Whether this is a continuous pipeline. This field is optional. The default value is `false`. |\n| `development` | `BOOLEAN` | Whether to run the pipeline in development mode. This field is optional. The default value is `false`. |\n| `photon` | `BOOLEAN` | Whether Photon acceleration is enabled for this pipeline. This field is optional. The default value is `false`. |\n| `channel` | `STRING` | The Delta Live Tables release channel specifying the runtime version to use for this pipeline. Supported values are:* `preview` to test the pipeline with upcoming changes to the Delta Live Tables runtime. * `current` to use the current Delta Live Tables runtime version. This field is optional. The default value is `current`. |\n| `edition` | `STRING` | The Delta Live Tables product edition to run the pipeline:* `CORE` supports streaming ingest workloads. * `PRO` also supports streaming ingest workloads and adds support for change data capture (CDC) processing. * `ADVANCED` supports all the features of the `PRO` edition and adds support for workloads that require Delta Live Tables expectations to enforce data quality constraints. This field is optional. The default value is `advanced`. | \n### [PipelineStateInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id50) \nThe state of a pipeline, the status of the most recent updates, and information about associated resources. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `state` | `STRING` | The state of the pipeline. One of `IDLE` or `RUNNING`. |\n| `pipeline_id` | `STRING` | The unique identifier of the pipeline. |\n| `cluster_id` | `STRING` | The unique identifier of the cluster running the pipeline. |\n| `name` | `STRING` | The user-friendly name of the pipeline. 
|\n| `latest_updates` | An array of [UpdateStateInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#update-state-info) | Status of the most recent updates for the pipeline, ordered with the newest update first. |\n| `creator_user_name` | `STRING` | The username of the pipeline creator. |\n| `run_as_user_name` | `STRING` | The username that the pipeline runs as. This is a read-only value derived from the pipeline owner. | \n### [S3StorageInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id51) \nS3 storage information. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `destination` | `STRING` | S3 destination. For example: `s3:\/\/my-bucket\/some-prefix`. You must configure the cluster with an instance profile and the instance profile must have write access to the destination. You *cannot* use AWS keys. |\n| `region` | `STRING` | S3 region. For example: `us-west-2`. Either region or warehouse must be set. If both are set, warehouse is used. |\n| `warehouse` | `STRING` | S3 warehouse. For example: `https:\/\/s3-us-west-2.amazonaws.com`. Either region or warehouse must be set. If both are set, warehouse is used. |\n| `enable_encryption` | `BOOL` | (Optional) Enable server-side encryption; `false` by default. |\n| `encryption_type` | `STRING` | (Optional) The encryption type, either `sse-s3` or `sse-kms`. It is used only when encryption is enabled; the default type is `sse-s3`. |\n| `kms_key` | `STRING` | (Optional) KMS key used if encryption is enabled and encryption type is set to `sse-kms`. |\n| `canned_acl` | `STRING` | (Optional) Set canned access control list. For example: `bucket-owner-full-control`. If canned\\_acl is set, the cluster instance profile must have `s3:PutObjectAcl` permission on the destination bucket and prefix. For the full list of possible canned ACLs, see the Amazon S3 documentation. By default, only the object owner gets full control. 
If you are using cross account role for writing data, you may want to set `bucket-owner-full-control` to make bucket owner able to read the logs. | \n### [UpdateStateInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id52) \nThe current state of a pipeline update. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `update_id` | `STRING` | The unique identifier for this update. |\n| `state` | `STRING` | The state of the update. One of `QUEUED`, `CREATED`, `WAITING_FOR_RESOURCES`, `INITIALIZING`, `RESETTING`, `SETTING_UP_TABLES`, `RUNNING`, `STOPPING`, `COMPLETED`, `FAILED`, or `CANCELED`. |\n| `creation_time` | `STRING` | Timestamp when this update was created. | \n### [WorkspaceStorageInfo](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#id53) \nWorkspace storage information. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| `destination` | `STRING` | File destination. Example: `\/Users\/someone@domain.com\/init_script.sh` |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html"} +{"content":"# \n### Sign up for Databricks Community edition\n\nThis article describes how to sign up for **Databricks Community Edition**. Unlike the [Databricks Free Trial](https:\/\/docs.databricks.com\/getting-started\/free-trial.html), Community Edition doesn\u2019t require that you have your own cloud account or supply cloud compute or storage resources. \nHowever, several features available in the Databricks Platform Free Trial, such as the [REST API](https:\/\/docs.databricks.com\/api\/workspace\/introduction), are not available in Databricks Community Edition. For details, see [Databricks Community Edition FAQ](https:\/\/databricks.com\/product\/faq\/community-edition). \nTo sign up for Databricks Community Edition: \n1. Click [Try Databricks](https:\/\/databricks.com\/try-databricks) here or at the top of this page.\n2. Enter your name, company, email, and title, and click **Continue**.\n3. 
On the **Choose a cloud provider** dialog, click the **Get started with Community Edition** link. You\u2019ll see a page announcing that an email has been sent to the address you provided. \n![Try Databricks](https:\/\/docs.databricks.com\/_images\/try.png)\n4. Look for the welcome email and click the link to verify your email address. You are prompted to create your Databricks password.\n5. When you click **Submit**, you\u2019ll be taken to the Databricks Community Edition home page. \n![Community Edition landing page](https:\/\/docs.databricks.com\/_images\/landing-aws-ce.png)\n6. Run the [Get started: Query and visualize data from a notebook](https:\/\/docs.databricks.com\/getting-started\/quick-start.html) quickstart to familiarize yourself with Databricks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/community-edition.html"} +{"content":"# \n### Sign up for Databricks Community edition\n#### Log back in to your Databricks account\n\nTo log back in to your Databricks Community Edition account, visit [community.cloud.databricks.com](https:\/\/community.cloud.databricks.com\/login.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/community-edition.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Work with features in Workspace Feature Store\n\nNote \nThis documentation covers the Workspace Feature Store. Databricks recommends using [Feature Engineering in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html). Workspace Feature Store will be deprecated in the future. \nThis page describes how to create and work with feature tables in the Workspace Feature Store. \nNote \nIf your workspace is enabled for Unity Catalog, any table managed by Unity Catalog that has a primary key is automatically a feature table that you can use for model training and inference. 
All Unity Catalog capabilities, such as security, lineage, tagging, and cross-workspace access, are automatically available to the feature table. For information about working with feature tables in a Unity Catalog-enabled workspace, see [Feature Engineering in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html). \nFor information about tracking feature lineage and freshness, see [Discover features and track feature lineage](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/ui.html). \nNote \nDatabase and feature table names can contain only alphanumeric characters and underscores (\\_).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Work with features in Workspace Feature Store\n##### Create a database for feature tables\n\nBefore creating any feature tables, you must create a [database](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#database) to store them. \n```\n%sql CREATE DATABASE IF NOT EXISTS \n\n``` \nFeature tables are stored as [Delta tables](https:\/\/docs.databricks.com\/delta\/index.html). When you create a feature table with `create_table` (Feature Store client v0.3.6 and above) or `create_feature_table` (v0.3.5 and below), you must specify the database name. For example, this argument creates a Delta table named `customer_features` in the database `recommender_system`. \n`name='recommender_system.customer_features'` \nWhen you publish a feature table to an online store, the default table and database name are the ones specified when you created the table; you can specify different names using the `publish_table` method. 
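As a minimal sketch of the naming convention above, the hypothetical helper below (it is not part of the Feature Store client) builds the `database.table` string passed as the `name` argument and checks both parts against the rule that database and feature table names contain only alphanumeric characters and underscores. Reading that rule as ASCII letters and digits is an assumption; the documentation does not say whether other letters are accepted.

```python
import re

# Assumption: 'alphanumeric' is read as ASCII letters and digits.
_VALID_NAME = re.compile('[A-Za-z0-9_]+')


def qualified_feature_table_name(database, table):
    '''Return 'database.table' for use as the name argument.

    Hypothetical helper for illustration only; it is not part of the
    Feature Store client API.
    '''
    for part in (database, table):
        # fullmatch rejects empty strings and any character outside the set
        if not _VALID_NAME.fullmatch(part):
            raise ValueError(f'invalid name: {part!r}')
    return f'{database}.{table}'


name = qualified_feature_table_name('recommender_system', 'customer_features')
print(name)
```

The resulting string is what the examples on this page pass as `name=` when creating or writing a feature table.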
\nThe Databricks Feature Store UI shows the name of the table and database in the online store, along with other metadata.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Work with features in Workspace Feature Store\n##### Create a feature table in Databricks Feature Store\n\nNote \nYou can also register an existing [Delta table](https:\/\/docs.databricks.com\/delta\/index.html) as a feature table. See [Register an existing Delta table as a feature table](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html#register-delta-table). \nThe basic steps to creating a feature table are: \n1. Write the Python functions to compute the features. The output of each function should be an Apache Spark DataFrame with a unique primary key. The primary key can consist of one or more columns.\n2. Create a feature table by instantiating a `FeatureStoreClient` and using `create_table` (v0.3.6 and above) or `create_feature_table` (v0.3.5 and below).\n3. Populate the feature table using `write_table`. \nFor details about the commands and parameters used in the following examples, see the [Feature Store Python API reference](https:\/\/api-docs.databricks.com\/python\/feature-store\/latest\/index.html). 
\n```\n# Feature Store client v0.3.6 and above\nfrom databricks.feature_store import feature_table\n\ndef compute_customer_features(data):\n    ''' Feature computation code returns a DataFrame with 'customer_id' as primary key'''\n    pass\n\n# create feature table keyed by customer_id\n# take schema from DataFrame output by compute_customer_features\nfrom databricks.feature_store import FeatureStoreClient\n\ncustomer_features_df = compute_customer_features(df)\n\nfs = FeatureStoreClient()\n\ncustomer_feature_table = fs.create_table(\n    name='recommender_system.customer_features',\n    primary_keys='customer_id',\n    schema=customer_features_df.schema,\n    description='Customer features'\n)\n\n# An alternative is to use `create_table` and specify the `df` argument.\n# This code automatically saves the features to the underlying Delta table.\n\n# customer_feature_table = fs.create_table(\n# ...\n# df=customer_features_df,\n# ...\n# )\n\n# To use a composite key, pass all keys in the create_table call\n\n# customer_feature_table = fs.create_table(\n# ...\n# primary_keys=['customer_id', 'date'],\n# ...\n# )\n\n# Use write_table to write data to the feature table\n# Overwrite mode does a full refresh of the feature table\n\nfs.write_table(\n    name='recommender_system.customer_features',\n    df = customer_features_df,\n    mode = 'overwrite'\n)\n\n``` \n```\n# Feature Store client v0.3.5 and below\nfrom databricks.feature_store import feature_table\n\ndef compute_customer_features(data):\n    ''' Feature computation code returns a DataFrame with 'customer_id' as primary key'''\n    pass\n\n# create feature table keyed by customer_id\n# take schema from DataFrame output by compute_customer_features\nfrom databricks.feature_store import FeatureStoreClient\n\ncustomer_features_df = compute_customer_features(df)\n\nfs = FeatureStoreClient()\n\ncustomer_feature_table = fs.create_feature_table(\n    name='recommender_system.customer_features',\n    keys='customer_id',\n    schema=customer_features_df.schema,\n    description='Customer features'\n)\n\n# An alternative is to use `create_feature_table` and specify the `features_df` argument.\n# This code automatically saves the features to the underlying Delta table.\n\n# customer_feature_table = fs.create_feature_table(\n# ...\n# features_df=customer_features_df,\n# ...\n# )\n\n# To use a composite key, pass all keys in the create_feature_table call\n\n# customer_feature_table = fs.create_feature_table(\n# ...\n# keys=['customer_id', 'date'],\n# ...\n# )\n\n# Use write_table to write data to the feature table\n# Overwrite mode does a full refresh of the feature table\n\nfs.write_table(\n    name='recommender_system.customer_features',\n    df = customer_features_df,\n    mode = 
'overwrite'\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Work with features in Workspace Feature Store\n##### Register an existing Delta table as a feature table\n\nWith v0.3.8 and above, you can register an existing [Delta table](https:\/\/docs.databricks.com\/delta\/index.html) as a feature table. The Delta table must exist in the metastore. \nNote \nTo [update a registered feature table](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html#update-a-feature-table), you must use the [Feature Store Python API](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html). \n```\nfs.register_table(\ndelta_table='recommender.customer_features',\nprimary_keys='customer_id',\ndescription='Customer features'\n)\n\n```\n\n#### Work with features in Workspace Feature Store\n##### Control access to feature tables\n\nSee [Control access to feature tables](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/access-control.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Work with features in Workspace Feature Store\n##### Update a feature table\n\nYou can update a feature table by adding new features or by modifying specific rows based on the primary key. \nThe following feature table metadata cannot be updated: \n* Primary key\n* Partition key\n* Name or type of an existing feature \n### Add new features to an existing feature table \nYou can add new features to an existing feature table in one of two ways: \n* Update the existing feature computation function and run `write_table` with the returned DataFrame. 
This updates the feature table schema and merges new feature values based on the primary key.\n* Create a new feature computation function to calculate the new feature values. The DataFrame returned by this new computation function must contain the feature table\u2019s primary keys and partition keys (if defined). Run `write_table` with the DataFrame to write the new features to the existing feature table, using the same primary key. \n### Update only specific rows in a feature table \nUse `mode = \"merge\"` in `write_table`. Rows whose primary key does not exist in the DataFrame sent in the `write_table` call remain unchanged. \n```\nfs.write_table(\nname='recommender.customer_features',\ndf = customer_features_df,\nmode = 'merge'\n)\n\n``` \n### Schedule a job to update a feature table \nTo ensure that features in feature tables always have the most recent values, Databricks recommends that you [create a job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) that runs a notebook to update your feature table on a regular basis, such as every day. If you already have a non-scheduled job created, you can convert it to a [scheduled job](https:\/\/docs.databricks.com\/workflows\/jobs\/schedule-jobs.html#job-schedule) to make sure the feature values are always up-to-date. \nCode to update a feature table uses `mode='merge'`, as shown in the following example. \n```\nfs = FeatureStoreClient()\n\ncustomer_features_df = compute_customer_features(data)\n\nfs.write_table(\ndf=customer_features_df,\nname='recommender_system.customer_features',\nmode='merge'\n)\n\n``` \n### Store past values of daily features \nDefine a feature table with a composite primary key. Include the date in the primary key. For example, for a feature table `store_purchases`, you might use a composite primary key (`date`, `user_id`) and partition key `date` for efficient reads. 
\n```\nfs.create_table(\nname='recommender_system.customer_features',\nprimary_keys=['date', 'customer_id'],\npartition_columns=['date'],\nschema=customer_features_df.schema,\ndescription='Customer features'\n)\n\n``` \nYou can then create code to read from the feature table filtering `date` to the time period of interest. \nYou can also create a [time series feature table](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html) by specifying the `date` column as a timestamp key using the `timestamp_keys` argument. \n```\nfs.create_table(\nname='recommender_system.customer_features',\nprimary_keys=['date', 'customer_id'],\ntimestamp_keys=['date'],\nschema=customer_features_df.schema,\ndescription='Customer timeseries features'\n)\n\n``` \nThis enables point-in-time lookups when you use `create_training_set` or `score_batch`. The system performs an as-of timestamp join, using the `timestamp_lookup_key` you specify. \nTo keep the feature table up to date, set up a regularly scheduled job to write features, or stream new feature values into the feature table. \n### Create a streaming feature computation pipeline to update features \nTo create a streaming feature computation pipeline, pass a streaming `DataFrame` as an argument to `write_table`. This method returns a `StreamingQuery` object. 
 \n```\ndef compute_additional_customer_features(data):\n''' Returns Streaming DataFrame\n'''\npass # not shown\n\ncustomer_transactions = spark.readStream.load(\"dbfs:\/events\/customer_transactions\")\nstream_df = compute_additional_customer_features(customer_transactions)\n\nfs.write_table(\ndf=stream_df,\nname='recommender_system.customer_features',\nmode='merge'\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Work with features in Workspace Feature Store\n##### Read from a feature table\n\nUse `read_table` to read feature values. \n```\nfrom databricks.feature_store import FeatureStoreClient\nfs = FeatureStoreClient()\ncustomer_features_df = fs.read_table(\nname='recommender.customer_features',\n)\n\n```\n\n#### Work with features in Workspace Feature Store\n##### Search and browse feature tables\n\nUse the Feature Store UI to search for or browse feature tables. \n1. In the sidebar, select **Machine Learning > Feature Store** to display the Feature Store UI.\n2. In the search box, enter all or part of the name of a feature table, a feature, or a data source used for feature computation. You can also enter all or part of the [key or value of a tag](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html#work-with-feature-table-tags). Search text is case-insensitive. \n![Feature search example](https:\/\/docs.databricks.com\/_images\/feature-search-example.png)\n\n#### Work with features in Workspace Feature Store\n##### Get feature table metadata\n\nThe API to get feature table metadata depends on the Databricks runtime version you are using. With v0.3.6 and above, use `get_table`. With v0.3.5 and below, use `get_feature_table`. 
\n```\n# this example works with v0.3.6 and above\n# for v0.3.5, use `get_feature_table`\nfrom databricks.feature_store import FeatureStoreClient\nfs = FeatureStoreClient()\nfs.get_table(\"feature_store_example.user_feature_table\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Work with features in Workspace Feature Store\n##### Work with feature table tags\n\nTags are key-value pairs that you can create and use to [search for feature tables](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html#search-and-browse-feature-tables). You can create, edit, and delete tags using the Feature Store UI or the [Feature Store Python API](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html). \n### Work with feature table tags in the UI \nUse the Feature Store UI to search for or browse feature tables. To access the UI, in the sidebar, select **Machine Learning > Feature Store**. \n#### Add a tag using the Feature Store UI \n1. Click ![Tag icon](https:\/\/docs.databricks.com\/_images\/tags1.png) if it is not already open. The tags table appears. \n![tag table](https:\/\/docs.databricks.com\/_images\/tags-open.png)\n2. Click in the **Name** and **Value** fields and enter the key and value for your tag.\n3. Click **Add**. \n#### Edit or delete a tag using the Feature Store UI \nTo edit or delete an existing tag, use the icons in the **Actions** column. \n![tag actions](https:\/\/docs.databricks.com\/_images\/tag-edit-or-delete.png) \n### Work with feature table tags using the Feature Store Python API \nOn clusters running v0.4.1 and above, you can create, edit, and delete tags using the [Feature Store Python API](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html). 
\n#### Requirements \nFeature Store client v0.4.1 and above \n#### Create feature table with tag using the Feature Store Python API \n```\nfrom databricks.feature_store import FeatureStoreClient\nfs = FeatureStoreClient()\n\ncustomer_feature_table = fs.create_table(\n...\ntags={\"tag_key_1\": \"tag_value_1\", \"tag_key_2\": \"tag_value_2\", ...},\n...\n)\n\n``` \n#### Add, update, and delete tags using the Feature Store Python API \n```\nfrom databricks.feature_store import FeatureStoreClient\nfs = FeatureStoreClient()\n\n# Upsert a tag\nfs.set_feature_table_tag(table_name=\"my_table\", key=\"quality\", value=\"gold\")\n\n# Delete a tag\nfs.delete_feature_table_tag(table_name=\"my_table\", key=\"quality\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Work with features in Workspace Feature Store\n##### Update data sources for a feature table\n\nFeature store automatically tracks the data sources used to compute features. You can also manually update the data sources by using the [Feature Store Python API](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html). \n### Requirements \nFeature Store client v0.5.0 and above \n### Add data sources using the Feature Store Python API \nBelow are some example commands. For details, see [the API documentation](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html). 
 \n```\nfrom databricks.feature_store import FeatureStoreClient\nfs = FeatureStoreClient()\n\n# Use `source_type=\"table\"` to add a table in the metastore as data source.\nfs.add_data_sources(feature_table_name=\"clicks\", data_sources=\"user_info.clicks\", source_type=\"table\")\n\n# Use `source_type=\"path\"` to add a data source in path format.\nfs.add_data_sources(feature_table_name=\"user_metrics\", data_sources=\"dbfs:\/FileStore\/user_metrics.json\", source_type=\"path\")\n\n# Use `source_type=\"custom\"` if the source is not a table or a path.\nfs.add_data_sources(feature_table_name=\"user_metrics\", data_sources=\"user_metrics.txt\", source_type=\"custom\")\n\n``` \n### Delete data sources using the Feature Store Python API \nFor details, see [the API documentation](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html). \nNote \nThe following command deletes data sources of all types (\u201ctable\u201d, \u201cpath\u201d, and \u201ccustom\u201d) that match the source names. \n```\nfrom databricks.feature_store import FeatureStoreClient\nfs = FeatureStoreClient()\nfs.delete_data_sources(feature_table_name=\"clicks\", source_names=\"user_info.clicks\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Work with features in Workspace Feature Store\n##### Delete a feature table\n\nYou can delete a feature table using the Feature Store UI or the [Feature Store Python API](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html). \nNote \n* Deleting a feature table can lead to unexpected failures in upstream producers and downstream consumers (models, endpoints, and scheduled jobs). You must delete published online stores with your cloud provider.\n* When you delete a feature table using the API, the underlying Delta table is also dropped. 
When you delete a feature table from the UI, you must drop the underlying Delta table separately. \n### Delete a feature table using the UI \n1. On the feature table page, click ![Button Down](https:\/\/docs.databricks.com\/_images\/button-down.png) at the right of the feature table name and select **Delete**. If you do not have CAN MANAGE permission for the feature table, you will not see this option. \n![Select delete from drop-down menu](https:\/\/docs.databricks.com\/_images\/feature-store-deletion.png)\n2. In the Delete Feature Table dialog, click **Delete** to confirm.\n3. If you also want to [drop the underlying Delta table](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-drop-table.html), run the following command in a notebook, replacing the placeholder with the name of the feature table. \n```\n%sql DROP TABLE IF EXISTS <feature-table-name>;\n\n``` \n### Delete a feature table using the Feature Store Python API \nWith Feature Store client v0.4.1 and above, you can use `drop_table` to delete a feature table. When you delete a table with `drop_table`, the underlying Delta table is also dropped. \n```\nfs.drop_table(\nname='recommender_system.customer_features'\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### What are workspace files?\n\nA workspace file is any file in the Databricks workspace that is not a Databricks notebook. Workspace files can be any file type. Common examples include: \n* `.py` files used in custom modules.\n* `.md` files, such as `README.md`.\n* `.csv` or other small data files.\n* `.txt` files.\n* `.whl` libraries.\n* Log files. \nWorkspace files include files formerly referred to as \u201cFiles in Repos\u201d. For recommendations on working with files, see [Recommendations for files in volumes and workspace files](https:\/\/docs.databricks.com\/files\/files-recommendations.html). 
\nImportant \nWorkspace files are enabled everywhere by default in Databricks Runtime version 11.2, but can be disabled by admins using the REST API. For production workloads, use Databricks Runtime 11.3 LTS or above. Contact your workspace administrator if you cannot access this functionality.\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/workspace.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### What are workspace files?\n##### What you can do with workspace files\n\nDatabricks provides functionality similar to local development for many workspace file types, including a built-in file editor. Not all use cases for all file types are supported. For example, while you can include images in an imported directory or repository, you cannot embed images in notebooks. \nYou can create, edit, and manage access to workspace files using familiar patterns from notebook interactions. You can use relative paths for library imports from workspace files, similar to local development. For more details, see: \n* [Workspace files basic usage](https:\/\/docs.databricks.com\/files\/workspace-basics.html)\n* [Programmatically interact with workspace files](https:\/\/docs.databricks.com\/files\/workspace-interact.html)\n* [Work with Python and R modules](https:\/\/docs.databricks.com\/files\/workspace-modules.html)\n* [Manage notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html)\n* [File ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#files) \nInit scripts stored in workspace files have special behavior. You can use workspace files to store and reference init scripts in any Databricks Runtime versions. See [Store init scripts in workspace files](https:\/\/docs.databricks.com\/files\/workspace-init-scripts.html). 
 \nNote \nIn Databricks Runtime 14.0 and above, the default current working directory (CWD) for code executed locally is the directory containing the notebook or script being run. This is a change in behavior from Databricks Runtime 13.3 LTS and below. See [What is the default current working directory?](https:\/\/docs.databricks.com\/files\/cwd-dbr-14.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/workspace.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### What are workspace files?\n##### Limitations\n\nA complete list of workspace files limitations is found in [Workspace files limitations](https:\/\/docs.databricks.com\/files\/index.html#workspace-files-limitations). \n### File size limit \nIndividual workspace files are limited to 500 MB. \n### Databricks Runtime versions for files in Git folders with a cluster with Databricks Container Services \nOn clusters running Databricks Runtime 11.3 LTS and above, the default settings allow you to use workspace files in Git folders with Databricks Container Services (DCS). \nOn clusters running Databricks Runtime versions 10.4 LTS and 9.1 LTS, you must configure the dockerfile to access workspace files in Git folders on a cluster with DCS. 
Refer to the following dockerfiles for the desired Databricks Runtime version: \n* [Dockerfile for DBR 10.4 LTS](https:\/\/github.com\/databricks\/containers\/tree\/release-10.4-LTS\/experimental\/ubuntu\/files-in-repos)\n* [Dockerfile for DBR 9.1 LTS](https:\/\/github.com\/databricks\/containers\/tree\/release-9.1-LTS\/experimental\/ubuntu\/files-in-repos) \nSee [Customize containers with Databricks Container Service](https:\/\/docs.databricks.com\/compute\/custom-containers.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/workspace.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### What are workspace files?\n##### Enable workspace files\n\nTo enable support for non-notebook files in your Databricks workspace, call the [\/api\/2.0\/workspace-conf](https:\/\/docs.databricks.com\/api\/workspace\/workspaceconf\/setstatus) REST API from a notebook or other environment with access to your Databricks workspace. Workspace files are **enabled** by default. \nTo check whether support for non-notebook files is enabled in your Databricks workspace, call the `\/api\/2.0\/workspace-conf` REST API and get the value of the `enableWorkspaceFileSystem` key. If it is set to `true`, non-notebook files are already enabled for your workspace. \nThe following example demonstrates how you can call this API from a notebook to check whether workspace files are disabled and, if so, re-enable them. 
 \n### Example: Notebook for re-enabling Databricks workspace file support \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/files\/turn-on-files.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/workspace.html"} +{"content":"# Security and compliance guide\n## Secret management\n#### Secret workflow example\n\nIn this workflow example, we use secrets to set up JDBC credentials for connecting to an Azure Data Lake Store.\n\n#### Secret workflow example\n##### Create a secret scope\n\nCreate a secret scope called `jdbc`. \n```\ndatabricks secrets create-scope jdbc\n\n``` \nNote \nIf your account does not have the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons), you must create the scope with MANAGE permission granted to all users (\u201cusers\u201d). For example: \n```\ndatabricks secrets create-scope jdbc --initial-manage-principal users\n\n```\n\n#### Secret workflow example\n##### Create secrets\n\nAdd the secrets `username` and `password`. Run the following commands and enter the secret values in the opened editor. 
 \n```\ndatabricks secrets put-secret jdbc username\ndatabricks secrets put-secret jdbc password\n\n```\n\n#### Secret workflow example\n##### Use the secrets in a notebook\n\nIn a notebook, read the secrets that are stored in the secret scope `jdbc` to configure a JDBC connector: \n```\nval driverClass = \"com.microsoft.sqlserver.jdbc.SQLServerDriver\"\nval connectionProperties = new java.util.Properties()\nconnectionProperties.setProperty(\"Driver\", driverClass)\n\nval jdbcUsername = dbutils.secrets.get(scope = \"jdbc\", key = \"username\")\nval jdbcPassword = dbutils.secrets.get(scope = \"jdbc\", key = \"password\")\nconnectionProperties.put(\"user\", s\"${jdbcUsername}\")\nconnectionProperties.put(\"password\", s\"${jdbcPassword}\")\n\n``` \nYou can now use `connectionProperties` with the JDBC connector to connect to your data source.\nThe values fetched from the scope are never displayed in the notebook (see [Secret redaction](https:\/\/docs.databricks.com\/security\/secrets\/redaction.html)).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/secrets\/example-secret-workflow.html"} +{"content":"# Security and compliance guide\n## Secret management\n#### Secret workflow example\n##### Grant access to another group\n\nNote \nThis step requires that your account have the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons). \nAfter verifying that the credentials were configured correctly, share these credentials with the `datascience` group to use for their analysis by granting them permissions to read the secret scope and list the available secrets. 
\nGrant the `datascience` group the READ permission to these credentials by making the following request: \n```\ndatabricks secrets put-acl jdbc datascience READ\n\n``` \nFor more information about secret access control, see [Secret ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#secrets).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/secrets\/example-secret-workflow.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to dbt Cloud\n\ndbt (data build tool) is a development environment that enables data analysts and data engineers to transform data by simply writing select statements. dbt handles turning these select statements into tables and views. dbt compiles your code into raw SQL and then runs that code on the specified database in Databricks. dbt supports collaborative coding patterns and best practices such as version control, documentation, and modularity. \ndbt does not extract or load data. dbt focuses on the transformation step only, using a \u201ctransform after load\u201d architecture. dbt assumes that you already have a copy of your data in your database. \nThis article focuses on dbt Cloud. dbt Cloud comes equipped with turnkey support for scheduling jobs, CI\/CD, serving documentation, monitoring and alerting, and an integrated development environment (IDE). \nA local version of dbt called dbt Core is also available. dbt Core enables you to write dbt code in the text editor or IDE of your choice on your local development machine and then run dbt from the command line. dbt Core includes the dbt Command Line Interface (CLI). The dbt CLI is free to use and open source. For more information, see [Connect to dbt Core](https:\/\/docs.databricks.com\/partners\/prep\/dbt.html). 
\nBecause dbt Cloud and dbt Core can use hosted git repositories (for example, on GitHub, GitLab or BitBucket), you can use dbt Cloud to create a dbt project and then make it available to your dbt Cloud and dbt Core users. For more information, see [Creating a dbt project](https:\/\/docs.getdbt.com\/docs\/building-a-dbt-project\/projects#creating-a-dbt-project) and [Using an existing project](https:\/\/docs.getdbt.com\/docs\/building-a-dbt-project\/projects#using-an-existing-project) on the dbt website. \nFor a general overview of dbt, watch the following YouTube video (26 minutes).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/dbt-cloud.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to dbt Cloud\n##### Connect to dbt Cloud using Partner Connect\n\nThis section describes how to connect a Databricks SQL warehouse to dbt Cloud using Partner Connect, then give dbt Cloud read access to your data. \n### Differences between standard connections and dbt Cloud \nTo connect to dbt Cloud using Partner Connect, you follow the steps in [Connect to data prep partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/prep.html). The dbt Cloud connection is different from standard data preparation and transformation connections in the following ways: \n* In addition to a service principal and a personal access token, Partner Connect creates a SQL warehouse (formerly SQL endpoint) named **DBT\\_CLOUD\\_ENDPOINT** by default. \n### Steps to connect \nTo connect to dbt Cloud using Partner Connect, do the following: \n1. [Connect to data prep partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/prep.html).\n2. After you connect to dbt Cloud, your dbt Cloud dashboard appears. 
To explore your dbt Cloud project, in the menu bar, next to the dbt logo, select your dbt account name from the first drop-down if it is not displayed, and then select the **Databricks Partner Connect Trial** project from the second drop-down menu if it is not displayed. \nTip \nTo view your project\u2019s settings, click the \u201cthree stripes\u201d or \u201chamburger\u201d menu, click **Account Settings > Projects**, and click the name of the project. To view the connection settings, click the link next to **Connection**. To change any settings, click **Edit**. \nTo view the Databricks personal access token information for this project, click the \u201cperson\u201d icon on the menu bar, click **Profile > Credentials > Databricks Partner Connect Trial**, and click the name of the project. To make a change, click **Edit**. \n### Steps to give dbt Cloud read access to your data \nPartner Connect gives create-only permission to the **DBT\\_CLOUD\\_USER** service principal only on the default catalog. Follow these steps in your Databricks workspace to give the **DBT\\_CLOUD\\_USER** service principal read access to the data that you choose. \nWarning \nYou can adapt these steps to give dbt Cloud additional access across catalogs, databases, and tables within your workspace. However, as a security best practice, Databricks strongly recommends that you give access only to the individual tables that you need the **DBT\\_CLOUD\\_USER** service principal to work with and only read access to those tables. \n1. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** in the sidebar.\n2. Select the SQL warehouse (**DBT\\_CLOUD\\_ENDPOINT**) in the drop-down list at the top right. \n![Select warehouse](https:\/\/docs.databricks.com\/_images\/select-endpoint.png) \n1. Under **Catalog Explorer**, select the catalog that contains the database for your table.\n2. Select the database that contains your table.\n3. 
Select your table.\nTip \nIf you do not see your catalog, database, or table listed, enter any portion of the name in the **Select Catalog**, **Select Database**, or **Filter tables** boxes, respectively, to narrow down the list. \n![Filter tables](https:\/\/docs.databricks.com\/_images\/filter-tables.png)\n3. Click **Permissions**.\n4. Click **Grant**.\n5. For **Type to add multiple users or groups**, select **DBT\\_CLOUD\\_USER**. This is the Databricks service principal that Partner Connect created for you in the previous section. \nTip \nIf you do not see **DBT\\_CLOUD\\_USER**, begin typing `DBT_CLOUD_USER` in the **Type to add multiple users or groups** box until it appears in the list, and then select it.\n6. Grant read access only by selecting `SELECT` and `READ METADATA`.\n7. Click **OK**. \nRepeat steps 4-9 for each additional table that you want to give dbt Cloud read access to. \n### Troubleshoot the dbt Cloud connection \nIf someone deletes the project in dbt Cloud for this account, and you then click the **dbt** tile, an error message appears, stating that the project cannot be found. To fix this, click **Delete connection**, and then start from the beginning of this procedure to create the connection again.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/dbt-cloud.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to dbt Cloud\n##### Connect to dbt Cloud manually\n\nThis section describes how to connect a Databricks cluster or a Databricks SQL warehouse in your Databricks workspace to dbt Cloud. \nImportant \nDatabricks recommends connecting to a SQL warehouse. If you don\u2019t have the Databricks SQL access entitlement, or if you want to run Python models, you can connect to a cluster instead. \n### Requirements \n* A cluster or SQL warehouse in your Databricks workspace. 
\n+ [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n+ [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* The connection details for your cluster or SQL warehouse, specifically the **Server Hostname**, **Port**, and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). To create a personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**.\nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. 
See your workspace administrator or the following: \n+ [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n+ [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \n* To connect dbt Cloud to data managed by Unity Catalog, dbt version 1.1 or above. \nThe steps in this article create a new environment that uses the latest dbt version. For information about upgrading the dbt version for an existing environment, see [Upgrading to the latest version of dbt in Cloud](https:\/\/docs.getdbt.com\/docs\/dbt-versions\/upgrade-core-in-cloud#upgrading-to-the-latest-version-of-dbt-in-cloud) in the dbt documentation. \n### Step 1: Sign up for dbt Cloud \nGo to [dbt Cloud - Signup](https:\/\/www.getdbt.com\/signup\/) and enter your email, name, and company information. Create a password and click **Create my account**. \n### Step 2: Create a dbt project \nIn this step, you create a dbt *project*, which contains a connection to a Databricks cluster or a SQL warehouse, a repository that contains your source code, and one or more environments (such as testing and production environments). \n1. [Sign in to dbt Cloud](https:\/\/cloud.getdbt.com\/login\/).\n2. 
Click the settings icon, and then click **Account Settings**.\n3. Click **New Project**.\n4. For **Name**, enter a unique name for your project, and then click **Continue**.\n5. For **Choose a connection**, click **Databricks**, and then click **Next**.\n6. For **Name**, enter a unique name for this connection.\n7. For **Select Adapter**, click **Databricks (dbt-databricks)**. \nNote \nDatabricks recommends using `dbt-databricks`, which supports Unity Catalog, instead of `dbt-spark`. By default, new projects use `dbt-databricks`. To migrate an existing project to `dbt-databricks`, see [Migrating from dbt-spark to dbt-databricks](https:\/\/docs.getdbt.com\/guides\/migration\/tools\/migrating-from-spark-to-databricks) in the dbt documentation.\n8. Under **Settings**, for **Server Hostname**, enter the server hostname value from the requirements.\n9. For **HTTP Path**, enter the HTTP path value from the requirements.\n10. If your workspace is Unity Catalog-enabled, under **Optional Settings**, enter the name of the catalog for dbt Cloud to use.\n11. Under **Development Credentials**, for **Token**, enter the personal access token from the requirements.\n12. For **Schema**, enter the name of the schema where you want dbt Cloud to create the tables and views (for example, `default`).\n13. Click **Test Connection**.\n14. If the test succeeds, click **Next**. \nFor more information, see [Connecting to Databricks ODBC](https:\/\/docs.getdbt.com\/docs\/dbt-cloud\/cloud-configuring-dbt-cloud\/connecting-your-database#connecting-to-databricks) on the dbt website. \nTip \nTo view or change the settings for this project, or to delete the project altogether, click the settings icon, click **Account Settings > Projects**, and click the name of the project. To change the settings, click **Edit**. To delete the project, click **Edit > Delete Project**. 
\nTo view or change your Databricks personal access token value for this project, click the \u201cperson\u201d icon, click **Profile > Credentials**, and click the name of the project. To make a change, click **Edit**. \nAfter you connect to a Databricks cluster or a Databricks SQL warehouse, follow the on-screen instructions to **Setup a Repository**, and then click **Continue**. \nAfter you set up the repository, follow the on-screen instructions to invite users and then click **Complete**. Or click **Skip & Complete**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/dbt-cloud.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to dbt Cloud\n##### Tutorial\n\nIn this section, you use your dbt Cloud project to work with some sample data. This section assumes that you have already created your project and have the dbt Cloud IDE open to that project. \n### Step 1: Create and run models \nIn this step, you use the dbt Cloud IDE to create and run *models*, which are `select` statements that create either a new view (the default) or a new table in a database, based on existing data in that same database. This procedure creates a model based on the sample `diamonds` table from the [Sample datasets](https:\/\/docs.databricks.com\/discover\/databricks-datasets.html). \nUse the following code to create this table. \n```\nDROP TABLE IF EXISTS diamonds;\n\nCREATE TABLE diamonds USING CSV OPTIONS (path \"\/databricks-datasets\/Rdatasets\/data-001\/csv\/ggplot2\/diamonds.csv\", header \"true\")\n\n``` \nThis procedure assumes this table has already been created in your workspace\u2019s `default` database. \n1. With the project open, click **Develop** at the top of the UI.\n2. Click **Initialize dbt project**.\n3. Click **Commit and sync**, enter a commit message, and then click **Commit**.\n4. Click **Create branch**, enter a name for your branch, and then click **Submit**.\n5. 
Create the first model: Click **Create New File**.\n6. In the text editor, enter the following SQL statement. This statement selects only the carat, cut, color, and clarity details for each diamond from the `diamonds` table. The `config` block instructs dbt to create a table in the database based on this statement. \n```\n{{ config(\nmaterialized='table',\nfile_format='delta'\n) }}\n\n``` \n```\nselect carat, cut, color, clarity\nfrom diamonds\n\n``` \nTip \nFor additional `config` options such as the `merge` incremental strategy, see [Databricks configurations](https:\/\/docs.getdbt.com\/reference\/resource-configs\/databricks-configs) in the dbt documentation.\n7. Click **Save As**.\n8. For the filename, enter `models\/diamonds_four_cs.sql` and then click **Create**.\n9. Create a second model: Click ![Create New File icon](https:\/\/docs.databricks.com\/_images\/dbt-cloud-create-new-file.png) (**Create New File**) in the upper-right corner.\n10. In the text editor, enter the following SQL statement. This statement selects unique values from the `color` column in the `diamonds_four_cs` table, sorting the results in ascending alphabetical order. Because there is no `config` block, this model instructs dbt to create a view in the database based on this statement. \n```\nselect distinct color\nfrom diamonds_four_cs\norder by color asc\n\n```\n11. Click **Save As**.\n12. For the filename, enter `models\/diamonds_list_colors.sql`, and then click **Create**.\n13. Create a third model: Click ![Create New File icon](https:\/\/docs.databricks.com\/_images\/dbt-cloud-create-new-file.png) (**Create New File**) in the upper-right corner.\n14. In the text editor, enter the following SQL statement. This statement averages diamond prices by color, sorting the results by average price from highest to lowest. This model instructs dbt to create a view in the database based on this statement. 
\n```\nselect color, avg(price) as price\nfrom diamonds\ngroup by color\norder by price desc\n\n```\n15. Click **Save As**.\n16. For the filename, enter `models\/diamonds_prices.sql` and click **Create**.\n17. Run the models: In the command line, run the `dbt run` command with the paths to the three preceding files. In the `default` database, dbt creates one table named `diamonds_four_cs` and two views named `diamonds_list_colors` and `diamonds_prices`. dbt gets these view and table names from their related `.sql` file names. \n```\ndbt run --model models\/diamonds_four_cs.sql models\/diamonds_list_colors.sql models\/diamonds_prices.sql\n\n``` \n```\n...\n... | 1 of 3 START table model default.diamonds_four_cs.................... [RUN]\n... | 1 of 3 OK created table model default.diamonds_four_cs............... [OK ...]\n... | 2 of 3 START view model default.diamonds_list_colors................. [RUN]\n... | 2 of 3 OK created view model default.diamonds_list_colors............ [OK ...]\n... | 3 of 3 START view model default.diamonds_prices...................... [RUN]\n... | 3 of 3 OK created view model default.diamonds_prices................. [OK ...]\n... |\n... | Finished running 1 table model, 2 view models ...\n\nCompleted successfully\n\nDone. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3\n\n```\n18. Run the following SQL code to list information about the new views and to select all rows from the table and views. \nIf you are connecting to a cluster, you can run this SQL code from a [notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) that is attached to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL warehouse, you can run this SQL code from a [query](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html#create-a-query). 
\n```\nSHOW views IN default\n\n``` \n```\n+-----------+----------------------+-------------+\n| namespace | viewName | isTemporary |\n+===========+======================+=============+\n| default | diamonds_list_colors | false |\n+-----------+----------------------+-------------+\n| default | diamonds_prices | false |\n+-----------+----------------------+-------------+\n\n``` \n```\nSELECT * FROM diamonds_four_cs\n\n``` \n```\n+-------+---------+-------+---------+\n| carat | cut | color | clarity |\n+=======+=========+=======+=========+\n| 0.23 | Ideal | E | SI2 |\n+-------+---------+-------+---------+\n| 0.21 | Premium | E | SI1 |\n+-------+---------+-------+---------+\n...\n\n``` \n```\nSELECT * FROM diamonds_list_colors\n\n``` \n```\n+-------+\n| color |\n+=======+\n| D |\n+-------+\n| E |\n+-------+\n...\n\n``` \n```\nSELECT * FROM diamonds_prices\n\n``` \n```\n+-------+---------+\n| color | price |\n+=======+=========+\n| J | 5323.82 |\n+-------+---------+\n| I | 5091.87 |\n+-------+---------+\n...\n\n``` \n### Step 2: Create and run more complex models \nIn this step, you create more complex models for a set of related data tables. These data tables contain information about a fictional sports league of three teams playing a season of six games. This procedure creates the data tables, creates the models, and runs the models. \n1. Run the following SQL code to create the necessary data tables. \nIf you are connecting to a cluster, you can run this SQL code from a [notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) that is attached to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL warehouse, you can run this SQL code from a [query](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html#create-a-query). \nThe tables and views in this step start with `zzz_` to help identify them as part of this example. 
You do not need to follow this pattern for your own tables and views. \n```\nDROP TABLE IF EXISTS zzz_game_opponents;\nDROP TABLE IF EXISTS zzz_game_scores;\nDROP TABLE IF EXISTS zzz_games;\nDROP TABLE IF EXISTS zzz_teams;\n\nCREATE TABLE zzz_game_opponents (\ngame_id INT,\nhome_team_id INT,\nvisitor_team_id INT\n) USING DELTA;\n\nINSERT INTO zzz_game_opponents VALUES (1, 1, 2);\nINSERT INTO zzz_game_opponents VALUES (2, 1, 3);\nINSERT INTO zzz_game_opponents VALUES (3, 2, 1);\nINSERT INTO zzz_game_opponents VALUES (4, 2, 3);\nINSERT INTO zzz_game_opponents VALUES (5, 3, 1);\nINSERT INTO zzz_game_opponents VALUES (6, 3, 2);\n\n-- Result:\n-- +---------+--------------+-----------------+\n-- | game_id | home_team_id | visitor_team_id |\n-- +=========+==============+=================+\n-- | 1 | 1 | 2 |\n-- +---------+--------------+-----------------+\n-- | 2 | 1 | 3 |\n-- +---------+--------------+-----------------+\n-- | 3 | 2 | 1 |\n-- +---------+--------------+-----------------+\n-- | 4 | 2 | 3 |\n-- +---------+--------------+-----------------+\n-- | 5 | 3 | 1 |\n-- +---------+--------------+-----------------+\n-- | 6 | 3 | 2 |\n-- +---------+--------------+-----------------+\n\nCREATE TABLE zzz_game_scores (\ngame_id INT,\nhome_team_score INT,\nvisitor_team_score INT\n) USING DELTA;\n\nINSERT INTO zzz_game_scores VALUES (1, 4, 2);\nINSERT INTO zzz_game_scores VALUES (2, 0, 1);\nINSERT INTO zzz_game_scores VALUES (3, 1, 2);\nINSERT INTO zzz_game_scores VALUES (4, 3, 2);\nINSERT INTO zzz_game_scores VALUES (5, 3, 0);\nINSERT INTO zzz_game_scores VALUES (6, 3, 1);\n\n-- Result:\n-- +---------+-----------------+--------------------+\n-- | game_id | home_team_score | visitor_team_score |\n-- +=========+=================+====================+\n-- | 1 | 4 | 2 |\n-- +---------+-----------------+--------------------+\n-- | 2 | 0 | 1 |\n-- +---------+-----------------+--------------------+\n-- | 3 | 1 | 2 |\n-- +---------+-----------------+--------------------+\n-- | 4 | 3 
| 2 |\n-- +---------+-----------------+--------------------+\n-- | 5 | 3 | 0 |\n-- +---------+-----------------+--------------------+\n-- | 6 | 3 | 1 |\n-- +---------+-----------------+--------------------+\n\nCREATE TABLE zzz_games (\ngame_id INT,\ngame_date DATE\n) USING DELTA;\n\nINSERT INTO zzz_games VALUES (1, '2020-12-12');\nINSERT INTO zzz_games VALUES (2, '2021-01-09');\nINSERT INTO zzz_games VALUES (3, '2020-12-19');\nINSERT INTO zzz_games VALUES (4, '2021-01-16');\nINSERT INTO zzz_games VALUES (5, '2021-01-23');\nINSERT INTO zzz_games VALUES (6, '2021-02-06');\n\n-- Result:\n-- +---------+------------+\n-- | game_id | game_date |\n-- +=========+============+\n-- | 1 | 2020-12-12 |\n-- +---------+------------+\n-- | 2 | 2021-01-09 |\n-- +---------+------------+\n-- | 3 | 2020-12-19 |\n-- +---------+------------+\n-- | 4 | 2021-01-16 |\n-- +---------+------------+\n-- | 5 | 2021-01-23 |\n-- +---------+------------+\n-- | 6 | 2021-02-06 |\n-- +---------+------------+\n\nCREATE TABLE zzz_teams (\nteam_id INT,\nteam_city VARCHAR(15)\n) USING DELTA;\n\nINSERT INTO zzz_teams VALUES (1, \"San Francisco\");\nINSERT INTO zzz_teams VALUES (2, \"Seattle\");\nINSERT INTO zzz_teams VALUES (3, \"Amsterdam\");\n\n-- Result:\n-- +---------+---------------+\n-- | team_id | team_city |\n-- +=========+===============+\n-- | 1 | San Francisco |\n-- +---------+---------------+\n-- | 2 | Seattle |\n-- +---------+---------------+\n-- | 3 | Amsterdam |\n-- +---------+---------------+\n\n```\n2. Create the first model: Click ![Create New File icon](https:\/\/docs.databricks.com\/_images\/dbt-cloud-create-new-file.png) (**Create New File**) in the upper-right corner.\n3. In the text editor, enter the following SQL statement. This statement creates a table that provides the details of each game, such as team names and scores. The `config` block instructs dbt to create a table in the database based on this statement. 
\n```\n-- Create a table that provides full details for each game, including\n-- the game ID, the home and visiting teams' city names and scores,\n-- the game winner's city name, and the game date.\n\n``` \n```\n{{ config(\nmaterialized='table',\nfile_format='delta'\n) }}\n\n``` \n```\n-- Step 4 of 4: Replace the visitor team IDs with their city names.\nselect\ngame_id,\nhome,\nt.team_city as visitor,\nhome_score,\nvisitor_score,\n-- Step 3 of 4: Display the city name for each game's winner.\ncase\nwhen\nhome_score > visitor_score\nthen\nhome\nwhen\nvisitor_score > home_score\nthen\nt.team_city\nend as winner,\ngame_date as date\nfrom (\n-- Step 2 of 4: Replace the home team IDs with their actual city names.\nselect\ngame_id,\nt.team_city as home,\nhome_score,\nvisitor_team_id,\nvisitor_score,\ngame_date\nfrom (\n-- Step 1 of 4: Combine data from various tables (for example, game and team IDs, scores, dates).\nselect\ng.game_id,\ngo.home_team_id,\ngs.home_team_score as home_score,\ngo.visitor_team_id,\ngs.visitor_team_score as visitor_score,\ng.game_date\nfrom\nzzz_games as g,\nzzz_game_opponents as go,\nzzz_game_scores as gs\nwhere\ng.game_id = go.game_id and\ng.game_id = gs.game_id\n) as all_ids,\nzzz_teams as t\nwhere\nall_ids.home_team_id = t.team_id\n) as visitor_ids,\nzzz_teams as t\nwhere\nvisitor_ids.visitor_team_id = t.team_id\norder by game_date desc\n\n```\n4. Click **Save As**.\n5. For the filename, enter `models\/zzz_game_details.sql` and then click **Create**.\n6. Create a second model: Click ![Create New File icon](https:\/\/docs.databricks.com\/_images\/dbt-cloud-create-new-file.png) (**Create New File**) in the upper-right corner.\n7. In the text editor, enter the following SQL statement. This statement creates a view that lists team win-loss records for the season. 
\n```\n-- Create a view that summarizes the season's win and loss records by team.\n\n-- Step 2 of 2: Calculate the number of wins and losses for each team.\nselect\nwinner as team,\ncount(winner) as wins,\n-- Each team played in 4 games.\n(4 - count(winner)) as losses\nfrom (\n-- Step 1 of 2: Determine the winner and loser for each game.\nselect\ngame_id,\nwinner,\ncase\nwhen\nhome = winner\nthen\nvisitor\nelse\nhome\nend as loser\nfrom zzz_game_details\n)\ngroup by winner\norder by wins desc\n\n```\n8. Click **Save As**.\n9. For the filename, enter `models\/zzz_win_loss_records.sql` and then click **Create**.\n10. Run the models: In the command line, run the `dbt run` command with the paths to the two preceding files. In the `default` database (as specified in your project settings), dbt creates one table named `zzz_game_details` and one view named `zzz_win_loss_records`. dbt gets these view and table names from their related `.sql` file names. \n```\ndbt run --model models\/zzz_game_details.sql models\/zzz_win_loss_records.sql\n\n``` \n```\n...\n... | 1 of 2 START table model default.zzz_game_details.................... [RUN]\n... | 1 of 2 OK created table model default.zzz_game_details............... [OK ...]\n... | 2 of 2 START view model default.zzz_win_loss_records................. [RUN]\n... | 2 of 2 OK created view model default.zzz_win_loss_records............ [OK ...]\n... |\n... | Finished running 1 table model, 1 view model ...\n\nCompleted successfully\n\nDone. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2\n\n```\n11. Run the following SQL code to list information about the new view and to select all rows from the table and view. \nIf you are connecting to a cluster, you can run this SQL code from a [notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) that is attached to the cluster, specifying SQL as the default language for the notebook. 
If you are connecting to a SQL warehouse, you can run this SQL code from a [query](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html#create-a-query). \n```\nSHOW VIEWS FROM default LIKE 'zzz_win_loss_records';\n\n``` \n```\n+-----------+----------------------+-------------+\n| namespace | viewName | isTemporary |\n+===========+======================+=============+\n| default | zzz_win_loss_records | false |\n+-----------+----------------------+-------------+\n\n``` \n```\nSELECT * FROM zzz_game_details;\n\n``` \n```\n+---------+---------------+---------------+------------+---------------+---------------+------------+\n| game_id | home | visitor | home_score | visitor_score | winner | date |\n+=========+===============+===============+============+===============+===============+============+\n| 1 | San Francisco | Seattle | 4 | 2 | San Francisco | 2020-12-12 |\n+---------+---------------+---------------+------------+---------------+---------------+------------+\n| 2 | San Francisco | Amsterdam | 0 | 1 | Amsterdam | 2021-01-09 |\n+---------+---------------+---------------+------------+---------------+---------------+------------+\n| 3 | Seattle | San Francisco | 1 | 2 | San Francisco | 2020-12-19 |\n+---------+---------------+---------------+------------+---------------+---------------+------------+\n| 4 | Seattle | Amsterdam | 3 | 2 | Seattle | 2021-01-16 |\n+---------+---------------+---------------+------------+---------------+---------------+------------+\n| 5 | Amsterdam | San Francisco | 3 | 0 | Amsterdam | 2021-01-23 |\n+---------+---------------+---------------+------------+---------------+---------------+------------+\n| 6 | Amsterdam | Seattle | 3 | 1 | Amsterdam | 2021-02-06 |\n+---------+---------------+---------------+------------+---------------+---------------+------------+\n\n``` \n```\nSELECT * FROM zzz_win_loss_records;\n\n``` \n```\n+---------------+------+--------+\n| team | wins | losses |\n+===============+======+========+\n| 
Amsterdam | 3 | 1 |\n+---------------+------+--------+\n| San Francisco | 2 | 2 |\n+---------------+------+--------+\n| Seattle | 1 | 3 |\n+---------------+------+--------+\n\n``` \n### Step 3: Create and run tests \nIn this step, you create *tests*, which are assertions you make about your models. When you run these tests, dbt tells you if each test in your project passes or fails. \nThere are two types of tests. *Schema tests*, written in YAML, return the number of records that do not pass an assertion. When this number is zero, all records pass, and therefore the tests pass. *Data tests* are specific queries that must return zero records to pass. \n1. Create the schema tests: Click ![Create New File icon](https:\/\/docs.databricks.com\/_images\/dbt-cloud-create-new-file.png) (**Create New File**) in the upper-right corner.\n2. In the text editor, enter the following content. This file includes schema tests that determine whether the specified columns have unique values, are not null, have only the specified values, or a combination. \n```\nversion: 2\n\nmodels:\n  - name: zzz_game_details\n    columns:\n      - name: game_id\n        tests:\n          - unique\n          - not_null\n      - name: home\n        tests:\n          - not_null\n          - accepted_values:\n              values: ['Amsterdam', 'San Francisco', 'Seattle']\n      - name: visitor\n        tests:\n          - not_null\n          - accepted_values:\n              values: ['Amsterdam', 'San Francisco', 'Seattle']\n      - name: home_score\n        tests:\n          - not_null\n      - name: visitor_score\n        tests:\n          - not_null\n      - name: winner\n        tests:\n          - not_null\n          - accepted_values:\n              values: ['Amsterdam', 'San Francisco', 'Seattle']\n      - name: date\n        tests:\n          - not_null\n  - name: zzz_win_loss_records\n    columns:\n      - name: team\n        tests:\n          - unique\n          - not_null\n          - relationships:\n              to: ref('zzz_game_details')\n              field: home\n      - name: wins\n        tests:\n          - not_null\n      - name: losses\n        tests:\n          - not_null\n\n```\n3. Click **Save As**.\n4. For the filename, enter `models\/schema.yml`, and then click **Create**.\n5. 
Create the first data test: Click ![Create New File icon](https:\/\/docs.databricks.com\/_images\/dbt-cloud-create-new-file.png) (**Create New File**) in the upper-right corner.\n6. In the text editor, enter the following SQL statement. This file includes a data test to determine whether any games happened outside of the regular season. \n```\n-- This season's games happened between 2020-12-12 and 2021-02-06.\n-- For this test to pass, this query must return no results.\n\nselect date\nfrom zzz_game_details\nwhere date < '2020-12-12'\nor date > '2021-02-06'\n\n```\n7. Click **Save As**.\n8. For the filename, enter `tests\/zzz_game_details_check_dates.sql`, and then click **Create**.\n9. Create a second data test: Click ![Create New File icon](https:\/\/docs.databricks.com\/_images\/dbt-cloud-create-new-file.png) (**Create New File**) in the upper-right corner.\n10. In the text editor, enter the following SQL statement. This file includes a data test to determine whether any scores were negative or any games were tied. \n```\n-- This sport allows no negative scores or tie games.\n-- For this test to pass, this query must return no results.\n\nselect home_score, visitor_score\nfrom zzz_game_details\nwhere home_score < 0\nor visitor_score < 0\nor home_score = visitor_score\n\n```\n11. Click **Save As**.\n12. For the filename, enter `tests\/zzz_game_details_check_scores.sql`, and then click **Create**.\n13. Create a third data test: Click ![Create New File icon](https:\/\/docs.databricks.com\/_images\/dbt-cloud-create-new-file.png) (**Create New File**) in the upper-right corner.\n14. In the text editor, enter the following SQL statement. This file includes a data test to determine whether any teams had negative win or loss records, had more win or loss records than games played, or played more games than were allowed. 
\n```\n-- Each team participated in 4 games this season.\n-- For this test to pass, this query must return no results.\n\nselect wins, losses\nfrom zzz_win_loss_records\nwhere wins < 0 or wins > 4\nor losses < 0 or losses > 4\nor (wins + losses) > 4\n\n```\n15. Click **Save As**.\n16. For the filename, enter `tests\/zzz_win_loss_records_check_records.sql`, and then click **Create**.\n17. Run the tests: In the command line, run the `dbt test` command. \n### Step 4: Clean up \nYou can delete the tables and views you created for this example by running the following SQL code. \nIf you are connecting to a cluster, you can run this SQL code from a [notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) that is attached to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL warehouse, you can run this SQL code from a [query](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html#create-a-query). \n```\nDROP TABLE zzz_game_opponents;\nDROP TABLE zzz_game_scores;\nDROP TABLE zzz_games;\nDROP TABLE zzz_teams;\nDROP TABLE zzz_game_details;\nDROP VIEW zzz_win_loss_records;\n\nDROP TABLE diamonds;\nDROP TABLE diamonds_four_cs;\nDROP VIEW diamonds_list_colors;\nDROP VIEW diamonds_prices;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/dbt-cloud.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to dbt Cloud\n##### Next steps\n\n* Learn more about dbt [models](https:\/\/docs.getdbt.com\/docs\/building-a-dbt-project\/building-models).\n* Learn how to [test](https:\/\/docs.getdbt.com\/docs\/building-a-dbt-project\/tests) your dbt projects.\n* Learn how to use [Jinja](https:\/\/docs.getdbt.com\/docs\/building-a-dbt-project\/jinja-macros), a templating language, for programming SQL in your dbt projects.\n* Learn about dbt [best 
practices](https:\/\/docs.getdbt.com\/docs\/guides\/best-practices).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/dbt-cloud.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to dbt Cloud\n##### Additional resources\n\n* [What, exactly, is dbt?](https:\/\/www.getdbt.com\/blog\/what-exactly-is-dbt)\n* [General dbt documentation](https:\/\/docs.getdbt.com\/docs\/introduction)\n* [dbt-core GitHub repository](https:\/\/github.com\/dbt-labs\/dbt)\n* [dbt CLI](https:\/\/docs.getdbt.com\/dbt-cli\/cli-overview)\n* [dbt pricing](https:\/\/www.getdbt.com\/pricing\/)\n* [Analytics Engineering for Everyone: Databricks in dbt Cloud](https:\/\/blog.getdbt.com\/analytics-engineering-for-everyone-databricks-in-dbt-cloud\/)\n* [dbt Cloud overview](https:\/\/docs.getdbt.com\/docs\/dbt-cloud\/cloud-overview)\n* [Connecting to Databricks](https:\/\/docs.getdbt.com\/docs\/dbt-cloud\/cloud-configuring-dbt-cloud\/connecting-your-database#connecting-to-databricks)\n* [dbt Discourse community](https:\/\/discourse.getdbt.com\/)\n* [dbt blog](https:\/\/blog.getdbt.com\/)\n* [Support](https:\/\/docs.getdbt.com\/docs\/dbt-cloud\/cloud-dbt-cloud-support)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/dbt-cloud.html"} +{"content":"# Security and compliance guide\n## Authentication and access control\n#### Access control lists\n\nThis article describes the permissions available for each type of workspace object. \nNote \nAccess control requires the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons). \nAccess control settings are disabled by default on workspaces that are upgraded from the Standard plan to the Premium plan or above. Once an access control setting is enabled, it cannot be disabled. 
For more information, see [Access controls lists can be enabled on upgraded workspaces](https:\/\/docs.databricks.com\/release-notes\/product\/2024\/january.html#acls).\n\n#### Access control lists\n##### Access control lists overview\n\nIn Databricks, you can use access control lists (ACLs) to configure permissions on workspace-level objects. Workspace admins have the CAN MANAGE permission on all objects in their workspace, which lets them manage permissions on every object in that workspace. Users automatically have the CAN MANAGE permission for objects that they create. \nFor an example of how to map typical personas to workspace-level permissions, see the [Proposal for Getting Started With Databricks Groups and Permissions](https:\/\/www.databricks.com\/discover\/pages\/access-control). \n### Manage access control lists with folders \nYou can manage workspace object permissions by adding objects to folders. Objects in a folder inherit all permissions settings of that folder. For example, a user who has the CAN RUN permission on a folder has the CAN RUN permission on the alerts in that folder. 
To learn about organizing objects into folders, see [Workspace browser](https:\/\/docs.databricks.com\/workspace\/workspace-browser\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html"} +{"content":"# Security and compliance guide\n## Authentication and access control\n#### Access control lists\n##### Alerts ACLs\n\n| Ability | NO PERMISSIONS | CAN RUN | CAN MANAGE |\n| --- | --- | --- | --- |\n| See in alert list | | x | x |\n| View alert and result | | x | x |\n| Manually trigger alert run | | x | x |\n| Subscribe to notifications | | x | x |\n| Edit alert | | | x |\n| Modify permissions | | | x |\n| Delete alert | | | x |\n\n#### Access control lists\n##### Compute ACLs\n\n| Ability | NO PERMISSIONS | CAN ATTACH TO | CAN RESTART | CAN MANAGE |\n| --- | --- | --- | --- | --- |\n| Attach notebook to cluster | | x | x | x |\n| View Spark UI | | x | x | x |\n| View cluster metrics | | x | x | x |\n| View driver logs | | x | x | x |\n| Terminate cluster | | | x | x |\n| Start and restart cluster | | | x | x |\n| Edit cluster | | | | x |\n| Attach library to cluster | | | | x |\n| Resize cluster | | | | x |\n| Modify permissions | | | | x |\n\n#### Access control lists\n##### Legacy dashboard ACLs\n\n| Ability | NO PERMISSIONS | CAN VIEW | CAN RUN | CAN EDIT | CAN MANAGE |\n| --- | --- | --- | --- | --- | --- |\n| See in dashboard list | | x | x | x | x |\n| View dashboard and results | | x | x | x | x |\n| Refresh query results in the dashboard (or choose different parameters) | | | x | x | x |\n| Edit dashboard | | | | x | x |\n| Modify permissions | | | | | x |\n| Delete dashboard | | | | | x | \nEditing a legacy dashboard requires the **Run as viewer** sharing setting. 
See [Refresh behavior and execution context](https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html#sharing-setting).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html"} +{"content":"# Security and compliance guide\n## Authentication and access control\n#### Access control lists\n##### Delta Live Tables ACLs\n\n| Ability | NO PERMISSIONS | CAN VIEW | CAN RUN | CAN MANAGE | IS OWNER |\n| --- | --- | --- | --- | --- | --- |\n| View pipeline details and list pipeline | | x | x | x | x |\n| View Spark UI and driver logs | | x | x | x | x |\n| Start and stop a pipeline update | | | x | x | x |\n| Stop pipeline clusters directly | | | x | x | x |\n| Edit pipeline settings | | | | x | x |\n| Delete the pipeline | | | | x | x |\n| Purge runs and experiments | | | | x | x |\n| Modify permissions | | | | x | x |\n\n#### Access control lists\n##### Feature tables ACLs\n\n| Ability | CAN VIEW METADATA | CAN EDIT METADATA | CAN MANAGE |\n| --- | --- | --- | --- |\n| Read feature table | X | X | X |\n| Search feature table | X | X | X |\n| Publish feature table to online store | X | X | X |\n| Write features to feature table | | X | X |\n| Update description of feature table | | X | X |\n| Modify permissions | | | X |\n| Delete feature table | | | X |\n\n#### Access control lists\n##### File ACLs\n\n| Ability | NO PERMISSIONS | CAN READ | CAN RUN | CAN EDIT | CAN MANAGE |\n| --- | --- | --- | --- | --- | --- |\n| Read file | | x | x | x | x |\n| Comment | | x | x | x | x |\n| Attach and detach file | | | x | x | x |\n| Run file interactively | | | x | x | x |\n| Edit file | | | | x | x |\n| Modify permissions | | | | | x |\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html"} +{"content":"# Security and compliance guide\n## Authentication and access control\n#### Access control lists\n##### Folder ACLs\n\n| Ability | NO PERMISSIONS | CAN READ | CAN EDIT | CAN RUN | CAN MANAGE |\n| 
--- | --- | --- | --- | --- | --- |\n| List objects in folder | x | x | x | x | x |\n| View objects in folder | | x | x | x | x |\n| Clone and export items | | | x | x | x |\n| Run objects in the folder | | | | x | x |\n| Create, import, and delete items | | | | | x |\n| Move and rename items | | | | | x |\n| Modify permissions | | | | | x |\n\n#### Access control lists\n##### Git folder ACLs\n\n| Ability | NO PERMISSIONS | CAN READ | CAN RUN | CAN EDIT | CAN MANAGE |\n| --- | --- | --- | --- | --- | --- |\n| List assets in a folder | x | x | x | x | x |\n| View assets in a folder | | x | x | x | x |\n| Clone and export assets | | x | x | x | x |\n| Run executable assets in folder | | | x | x | x |\n| Edit and rename assets in a folder | | | | x | x |\n| Create a branch in a folder | | | | | x |\n| Pull or push a branch into a folder | | | | | x |\n| Create, import, delete, and move assets | | | | | x |\n| Modify permissions | | | | | x |\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html"} +{"content":"# Security and compliance guide\n## Authentication and access control\n#### Access control lists\n##### Job ACLs\n\n| Ability | NO PERMISSIONS | CAN VIEW | CAN MANAGE RUN | IS OWNER | CAN MANAGE |\n| --- | --- | --- | --- | --- | --- |\n| View job details and settings | | x | x | x | x |\n| View results | | x | x | x | x |\n| View Spark UI, logs of a job run | | | x | x | x |\n| Run now | | | x | x | x |\n| Cancel run | | | x | x | x |\n| Edit job settings | | | | x | x |\n| Delete job | | | | x | x |\n| Modify permissions | | | | x | x |\n\n#### Access control lists\n##### Dashboard ACLs\n\n| Ability | NO PERMISSIONS | CAN VIEW\/CAN RUN | CAN EDIT | CAN MANAGE |\n| --- | --- | --- | --- | --- |\n| View dashboard and results | | x | x | x |\n| Interact with widgets | | x | x | x |\n| Refresh the dashboard | | x | x | x |\n| Edit dashboard | | | x | x |\n| Clone dashboard | | x | x | x |\n| Publish dashboard snapshot | | | 
x | x |\n| Modify permissions | | | | x |\n| Delete dashboard | | | | x |\n\n#### Access control lists\n##### MLflow experiment ACLs\n\n| Ability | NO PERMISSIONS | CAN READ | CAN EDIT | CAN MANAGE |\n| --- | --- | --- | --- | --- |\n| View run info, search, and compare runs | | x | x | x |\n| View, list, and download run artifacts | | x | x | x |\n| Create, delete, and restore runs | | | x | x |\n| Log run params, metrics, and tags | | | x | x |\n| Log run artifacts | | | x | x |\n| Edit experiment tags | | | x | x |\n| Purge runs and experiments | | | | x |\n| Modify permissions | | | | x |\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html"} +{"content":"# Security and compliance guide\n## Authentication and access control\n#### Access control lists\n##### MLflow model ACLs\n\n| Ability | NO PERMISSIONS | CAN READ | CAN EDIT | CAN MANAGE STAGING VERSIONS | CAN MANAGE PRODUCTION VERSIONS | CAN MANAGE |\n| --- | --- | --- | --- | --- | --- | --- |\n| View model details, versions, stage transition requests, activities, and artifact download URIs | | x | x | x | x | x |\n| Request a model version stage transition | | x | x | x | x | x |\n| Add a version to a model | | | x | x | x | x |\n| Update model and version description | | | x | x | x | x |\n| Add or edit tags | | | x | x | x | x |\n| Transition model version between stages | | | | x | x | x |\n| Approve a transition request | | | | x | x | x |\n| Cancel a transition request | | | | | | x |\n| Rename model | | | | | | x |\n| Modify permissions | | | | | | x |\n| Delete model and model versions | | | | | | x |\n\n#### Access control lists\n##### Notebook ACLs\n\n| Ability | NO PERMISSIONS | CAN READ | CAN RUN | CAN EDIT | CAN MANAGE |\n| --- | --- | --- | --- | --- | --- |\n| View cells | | x | x | x | x |\n| Comment | | x | x | x | x |\n| Run via %run or notebook workflows | | x | x | x | x |\n| Attach and detach notebooks | | | x | x | x |\n| Run commands | | | x | x | x 
|\n| Edit cells | | | | x | x |\n| Modify permissions | | | | | x |\n\n#### Access control lists\n##### Pool ACLs\n\n| Ability | NO PERMISSIONS | CAN ATTACH TO | CAN MANAGE |\n| --- | --- | --- | --- |\n| Attach cluster to pool | | x | x |\n| Delete pool | | | x |\n| Edit pool | | | x |\n| Modify permissions | | | x |\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html"} +{"content":"# Security and compliance guide\n## Authentication and access control\n#### Access control lists\n##### Query ACLs\n\n| Ability | NO PERMISSIONS | CAN VIEW | CAN RUN | CAN EDIT | CAN MANAGE |\n| --- | --- | --- | --- | --- | --- |\n| View own queries | | x | x | x | x |\n| See in query list | | x | x | x | x |\n| View query text | | x | x | x | x |\n| View query result | | x | x | x | x |\n| Refresh query result (or choose different parameters) | | | x | x | x |\n| Include the query in a dashboard | | | x | x | x |\n| Edit query text | | | | x | x |\n| Change SQL warehouse or data source | | | | | x |\n| Modify permissions | | | | | x |\n| Delete query | | | | | x |\n\n#### Access control lists\n##### Secret ACLs\n\n| Ability | READ | WRITE | MANAGE |\n| --- | --- | --- | --- |\n| Read the secret scope | x | x | x |\n| List secrets in the scope | x | x | x |\n| Write to the secret scope | | x | x |\n| Modify permissions | | | x |\n\n#### Access control lists\n##### Serving endpoint ACLs\n\n| Ability | NO PERMISSIONS | CAN VIEW | CAN QUERY | CAN MANAGE |\n| --- | --- | --- | --- | --- |\n| Get endpoint | | x | x | x |\n| List endpoint | | x | x | x |\n| Query endpoint | | | x | x |\n| Update endpoint config | | | | x |\n| Delete endpoint | | | | x |\n| Modify permissions | | | | x |\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html"} +{"content":"# Security and compliance guide\n## Authentication and access control\n#### Access control lists\n##### SQL warehouse ACLs\n\n| Ability | NO PERMISSIONS | 
CAN USE | IS OWNER | CAN MANAGE |\n| --- | --- | --- | --- | --- |\n| Start the warehouse | | x | x | x |\n| View details for the warehouse | | x | x | x |\n| View all queries for the warehouse | | | x | x |\n| View warehouse monitoring tab | | | x | x |\n| Stop the warehouse | | | x | x |\n| Delete the warehouse | | | x | x |\n| Edit the warehouse | | | x | x |\n| Modify permissions | | | x | x |\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Access and manage saved queries\n\nThis article outlines how to use the Databricks UI to access and manage queries.\n\n### Access and manage saved queries\n#### View queries\n\nYou can view queries using the following methods: \n* Click ![Workspace Icon](https:\/\/docs.databricks.com\/_images\/workspace-icon.png) **Workspace** in the sidebar. Queries are viewable, by default, in the **Home** folder. Users can organize queries into folders in the workspace browser along with other Databricks objects.\n* Click ![Queries Icon](https:\/\/docs.databricks.com\/_images\/queries-icon.png) **Queries** in the sidebar. Objects in the **Queries** window are sorted in reverse chronological order by default. You can reorder the list by clicking the **Created at** column heading. Type into the **Filter queries** text box to filter by Name, Tag, or Owner.\n\n### Access and manage saved queries\n#### Organize queries into folders in the workspace browser\n\nYou can organize queries into folders in the [workspace browser](https:\/\/docs.databricks.com\/workspace\/workspace-browser\/index.html) along with other Databricks objects. 
See [Workspace browser](https:\/\/docs.databricks.com\/workspace\/workspace-browser\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Access and manage saved queries\n#### Transfer ownership of a query\n\nYou must be a workspace admin to transfer ownership of a query. Service principals and groups cannot be assigned ownership of a query. You can also transfer ownership using the [Permissions API](https:\/\/docs.databricks.com\/api\/workspace\/permissions). \n1. As a workspace admin, log in to your Databricks workspace.\n2. In the sidebar, click **Queries**.\n3. Click a query.\n4. Click the **Share** button at the top right to open the **Sharing** dialog.\n5. Click on the gear icon at the top right and click **Assign new owner**. \n![Assign new owner](https:\/\/docs.databricks.com\/_images\/assign-new-owner.png)\n6. Select the user to assign ownership to.\n7. Click **Confirm**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Access and manage saved queries\n#### Configure query permissions\n\nWorkspace admins and the query creator are automatically granted permissions to control which users can manage and run queries. You must have at least CAN MANAGE permission on a query to share queries. \nQueries support two types of sharing settings: \n* **Run as viewer**: The viewer\u2019s credential is used to run the query. The viewer must also have at least CAN USE permissions on the warehouse. \nUsers can only be granted the CAN EDIT permission when the sharing setting is set to Run as viewer.\n* **Run as owner**: The owner\u2019s credential is used to run the query. \nFor more information on query permission levels, see [Query ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#query). \n1. In the sidebar, click **Queries**.\n2. 
Click a query.\n3. Click the ![Share Button](https:\/\/docs.databricks.com\/_images\/share-button.png) button at the top right to open the **Sharing** dialog. \n![Manage query permissions](https:\/\/docs.databricks.com\/_images\/manage-permissions.png)\n4. Follow the steps based on the permission type you want to grant:\n5. Search for and select the groups and users, and assign the permission level.\n6. Click **Add**.\n7. In the **Sharing settings > Credentials** field at the bottom, select either **Run as viewer** or **Run as owner**. \nYou can also copy the link to the query in the Sharing dialog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Access and manage saved queries\n#### Admin access to all queries\n\nA Databricks workspace admin user has view access to all queries in the workspace. When the **All queries** tab is selected, a workspace admin can view and delete any query. However, a workspace admin can\u2019t edit a query when sharing setting credentials are set to **Run as owner**. \nTo view all queries: \n1. Click ![Queries Icon](https:\/\/docs.databricks.com\/_images\/queries-icon.png) **Queries** in the sidebar.\n2. Click the **All queries** tab near the top of the screen.\n\n### Access and manage saved queries\n#### Creating queries in other environments\n\nYou can create queries without the Databricks UI by using the REST API, a JDBC\/ODBC connector, or a partner tool. \nSee [Use a SQL database tool](https:\/\/docs.databricks.com\/dev-tools\/index-sql.html) to run SQL commands and browse database objects in Databricks. \nYou can also create a query with the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_sql\\_query](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/sql_query). 
\nSee [Technology partners](https:\/\/docs.databricks.com\/integrations\/index.html) to learn about partner tools you can use through Partner Connect.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/index.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Authentication settings for the Databricks JDBC Driver\n\nThis article describes how to configure Databricks authentication settings for the [Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/index.html). \nTo configure a Databricks connection for the Databricks JDBC Driver, you must combine your compute resource settings, any driver capability settings, and the following authentication settings, into a JDBC connection URL or programmatic collection of JDBC connection properties. \nJDBC connection URLs use the following format: \n```\njdbc:databricks:\/\/<server-hostname>:443;httpPath=<http-path>[;<setting1>=<value1>;<setting2>=<value2>;<settingN>=<valueN>]\n\n``` \n* To get the values for `<server-hostname>` and `<http-path>`, see [Compute settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/compute.html).\n* Replace `<setting>=<value>` as needed for each of the connection properties as listed in the following sections.\n* You can also add special or advanced [driver capability settings](https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html). 
\nProgrammatic collections of JDBC connection properties can be used in Java code such as the following example: \n```\npackage org.example;\n\nimport java.sql.Connection;\nimport java.sql.DriverManager;\nimport java.sql.ResultSet;\nimport java.sql.ResultSetMetaData;\nimport java.sql.Statement;\nimport java.util.Properties;\n\npublic class Main {\npublic static void main(String[] args) throws Exception {\nClass.forName(\"com.databricks.client.jdbc.Driver\");\nString url = \"jdbc:databricks:\/\/\" + System.getenv(\"DATABRICKS_SERVER_HOSTNAME\") + \":443\";\nProperties p = new java.util.Properties();\np.put(\"httpPath\", System.getenv(\"DATABRICKS_HTTP_PATH\"));\n\/\/ Add the authentication and capability settings as property pairs.\np.put(\"<setting1>\", \"<value1>\");\np.put(\"<setting2>\", \"<value2>\");\np.put(\"<settingN>\", \"<valueN>\");\n\/\/ try-with-resources closes the connection and the result set.\ntry (Connection conn = DriverManager.getConnection(url, p)) {\nStatement stmt = conn.createStatement();\ntry (ResultSet rs = stmt.executeQuery(\"<query>\")) {\nResultSetMetaData md = rs.getMetaData();\nString[] columns = new String[md.getColumnCount()];\nfor (int i = 0; i < columns.length; i++) {\ncolumns[i] = md.getColumnName(i + 1);\n}\nwhile (rs.next()) {\nSystem.out.print(\"Row \" + rs.getRow() + \"=[\");\nfor (int i = 0; i < columns.length; i++) {\nif (i != 0) {\nSystem.out.print(\", \");\n}\nSystem.out.print(columns[i] + \"='\" + rs.getObject(i + 1) + \"'\");\n}\nSystem.out.println(\"]\");\n}\n}\n}\nSystem.exit(0);\n}\n}\n\n``` \n* Set the `DATABRICKS_SERVER_HOSTNAME` and `DATABRICKS_HTTP_PATH` environment variables to the target Databricks compute resource\u2019s **Server Hostname** and **HTTP Path** values, respectively. To get these values, see [Compute settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/compute.html). To set environment variables, see your operating system\u2019s documentation.\n* Replace `<setting>` and `<value>` as needed for each of the connection properties as listed in the following sections.\n* You can also add special or advanced [driver capability settings](https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html), typically as additional `<setting>` and `<value>` pairs.\n* For this example, replace `<query>` with a SQL `SELECT` query string. 
\nWhether you use a connection URL or a collection of connection properties will depend on the requirements of your target app, tool, client, SDK, or API. Examples of JDBC connection URLs and programmatic collections of JDBC connection properties are provided in this article for each supported Databricks authentication type. \nThe Databricks JDBC Driver supports the following Databricks authentication types: \n* [Databricks personal access token](https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html#authentication-pat)\n* [Databricks username and password](https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html#authentication-username-password)\n* [OAuth 2.0 tokens](https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html#authentication-pass-through)\n* [OAuth user-to-machine (U2M) authentication](https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html#authentication-u2m)\n* [OAuth machine-to-machine (M2M) authentication](https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html#authentication-m2m)\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Authentication settings for the Databricks JDBC Driver\n########## Databricks personal access token\n\nTo create a Databricks personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. 
To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**. \nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. See your workspace administrator or the following: \n* [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n* [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nTo authenticate using a Databricks personal access token, set the following configuration. 
\nFor a JDBC connection URL with embedded general configuration properties and sensitive credential properties: \n```\njdbc:databricks:\/\/<server-hostname>:443;httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>\n\n``` \nFor Java code with general configuration properties and sensitive credential properties set outside of the JDBC connection URL: \n```\n\/\/ ...\nString url = \"jdbc:databricks:\/\/<server-hostname>:443\";\nProperties p = new java.util.Properties();\np.put(\"httpPath\", \"<http-path>\");\np.put(\"AuthMech\", \"3\");\np.put(\"UID\", \"token\");\np.put(\"PWD\", \"<personal-access-token>\");\n\/\/ ...\nConnection conn = DriverManager.getConnection(url, p);\n\/\/ ...\n\n``` \n* For a complete Java code example into which you can adapt the preceding code snippet for your own needs, see the code example at the beginning of this article.\n* In the preceding URL or Java code, replace `<personal-access-token>` with the Databricks personal access token for your workspace user.\n* To get the values for `<server-hostname>` and `<http-path>`, see [Compute settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/compute.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Authentication settings for the Databricks JDBC Driver\n########## Databricks username and password\n\nDatabricks username and password authentication is also known as Databricks *basic* authentication. \nUsername and password authentication is possible only if [single sign-on](https:\/\/docs.databricks.com\/admin\/users-groups\/single-sign-on\/index.html) is disabled. \nTo authenticate using a Databricks username and password, set the following configuration. 
\nFor a JDBC connection URL with embedded general configuration properties and sensitive credential properties: \n```\njdbc:databricks:\/\/<server-hostname>:443;httpPath=<http-path>;AuthMech=3;UID=<username>;PWD=<password>\n\n``` \nFor Java code with general configuration properties and sensitive credential properties set outside of the JDBC connection URL: \n```\n\/\/ ...\nString url = \"jdbc:databricks:\/\/<server-hostname>:443\";\nProperties p = new java.util.Properties();\np.put(\"httpPath\", \"<http-path>\");\np.put(\"AuthMech\", \"3\");\np.put(\"UID\", \"<username>\");\np.put(\"PWD\", \"<password>\");\n\/\/ ...\nConnection conn = DriverManager.getConnection(url, p);\n\/\/ ...\n\n``` \n* For a complete Java code example into which you can adapt the preceding code snippet for your own needs, see the code example at the beginning of this article.\n* In the preceding URL or Java code, replace `<username>` and `<password>` with the username and password.\n* To get the values for `<server-hostname>` and `<http-path>`, see [Compute settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/compute.html). \nFor more information, see the `Using User Name and Password` section in the [Databricks JDBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Databricks-JDBC-Driver-Install-and-Configuration-Guide.pdf).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Authentication settings for the Databricks JDBC Driver\n########## OAuth 2.0 tokens\n\nJDBC driver 2.6.36 and above supports an OAuth 2.0 token for a Databricks user or service principal. This is also known as OAuth 2.0 *token pass-through* authentication. 
\nTo create an OAuth 2.0 token for token pass-through authentication, do the following: \n* For a user, you can use the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/install.html) to generate the OAuth 2.0 token by initiating the OAuth U2M process, and then get the generated OAuth 2.0 token by running the `databricks auth token` command. See [OAuth user-to-machine (U2M) authentication](https:\/\/docs.databricks.com\/dev-tools\/cli\/authentication.html#u2m-auth). OAuth 2.0 tokens have a default lifetime of 1 hour. To generate a new OAuth 2.0 token, repeat this process.\n* For a service principal, see [Manually generate and use access tokens for OAuth machine-to-machine (M2M) authentication](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html#oauth-m2m-manual). Make a note of the service principal\u2019s OAuth `access_token` value. OAuth 2.0 tokens have a default lifetime of 1 hour. To generate a new OAuth 2.0 token, repeat this process. \nTo authenticate using OAuth 2.0 token pass-through authentication, set the following configuration. 
\nFor a JDBC connection URL with embedded general configuration properties and sensitive credential properties: \n```\njdbc:databricks:\/\/<server-hostname>:443;httpPath=<http-path>;AuthMech=11;Auth_Flow=0;Auth_AccessToken=<oauth-token>\n\n``` \nFor Java code with general configuration properties and sensitive credential properties set outside of the JDBC connection URL: \n```\n\/\/ ...\nString url = \"jdbc:databricks:\/\/<server-hostname>:443\";\nProperties p = new java.util.Properties();\np.put(\"httpPath\", \"<http-path>\");\np.put(\"AuthMech\", \"11\");\np.put(\"Auth_Flow\", \"0\");\np.put(\"Auth_AccessToken\", \"<oauth-token>\");\n\/\/ ...\nConnection conn = DriverManager.getConnection(url, p);\n\/\/ ...\n\n``` \n* For a complete Java code example into which you can adapt the preceding code snippet for your own needs, see the code example at the beginning of this article.\n* In the preceding URL or Java code, replace `<oauth-token>` with the OAuth 2.0 token.\n* To get the values for `<server-hostname>` and `<http-path>`, see [Compute settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/compute.html). \nFor more information, see the `Token Pass-through` section in the [Databricks JDBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Databricks-JDBC-Driver-Install-and-Configuration-Guide.pdf).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Authentication settings for the Databricks JDBC Driver\n########## OAuth user-to-machine (U2M) authentication\n\nJDBC driver 2.6.36 and above supports OAuth user-to-machine (U2M) authentication for a Databricks user. This is also known as OAuth 2.0 *browser-based* authentication. \nOAuth U2M or OAuth 2.0 browser-based authentication has no prerequisites. OAuth 2.0 tokens have a default lifetime of 1 hour. 
OAuth U2M or OAuth 2.0 browser-based authentication should refresh expired OAuth 2.0 tokens for you automatically. \nNote \nOAuth U2M or OAuth 2.0 browser-based authentication works only with applications that run locally. It does not work with server-based or cloud-based applications. \nTo authenticate using OAuth user-to-machine (U2M) or OAuth 2.0 browser-based authentication, set the following configuration. \nFor a JDBC connection URL with embedded general configuration properties and sensitive credential properties: \n```\njdbc:databricks:\/\/<server-hostname>:443;httpPath=<http-path>;AuthMech=11;Auth_Flow=2;TokenCachePassPhrase=<passphrase>;EnableTokenCache=0\n\n``` \nFor Java code with general configuration properties and sensitive credential properties set outside of the JDBC connection URL: \n```\n\/\/ ...\nString url = \"jdbc:databricks:\/\/<server-hostname>:443\";\nProperties p = new java.util.Properties();\np.put(\"httpPath\", \"<http-path>\");\np.put(\"AuthMech\", \"11\");\np.put(\"Auth_Flow\", \"2\");\np.put(\"TokenCachePassPhrase\", \"<passphrase>\");\np.put(\"EnableTokenCache\", \"0\");\n\/\/ ...\nConnection conn = DriverManager.getConnection(url, p);\n\/\/ ...\n\n``` \n* For a complete Java code example into which you can adapt the preceding code snippet for your own needs, see the code example at the beginning of this article.\n* In the preceding URL or Java code, replace `<passphrase>` with a passphrase of your choice. The driver uses this key for refresh token encryption.\n* To get the values for `<server-hostname>` and `<http-path>`, see [Compute settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/compute.html). 
\nFor more information, see the `Using Browser Based Authentication` section in the [Databricks JDBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Databricks-JDBC-Driver-Install-and-Configuration-Guide.pdf).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Authentication settings for the Databricks JDBC Driver\n########## OAuth machine-to-machine (M2M) authentication\n\nJDBC driver 2.6.36 and above supports OAuth machine-to-machine (M2M) authentication for a Databricks service principal. This is also known as OAuth 2.0 *client credentials* authentication. \nNote \nJDBC does not currently connect using M2M for private link workspaces. \nTo configure OAuth M2M or OAuth 2.0 client credentials authentication, do the following: \n1. Create a Databricks service principal in your Databricks workspace, and create an OAuth secret for that service principal. \nTo create the service principal and its OAuth secret, see [OAuth machine-to-machine (M2M) authentication](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). Make a note of the service principal\u2019s **UUID** or **Application ID** value, and the **Secret** value for the service principal\u2019s OAuth secret.\n2. Give the service principal access to your cluster or warehouse. See [Compute permissions](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions) or [Manage a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html#manage). \nTo authenticate using OAuth machine-to-machine (M2M) or OAuth 2.0 client credentials authentication, set the following configuration. 
\nFor a JDBC connection URL with embedded general configuration properties and sensitive credential properties: \n```\njdbc:databricks:\/\/<server-hostname>:443;httpPath=<http-path>;AuthMech=11;Auth_Flow=1;OAuth2ClientId=<service-principal-application-id>;OAuth2Secret=<service-principal-oauth-secret>\n\n``` \nFor Java code with general configuration properties and sensitive credential properties set outside of the JDBC connection URL: \n```\n\/\/ ...\nString url = \"jdbc:databricks:\/\/<server-hostname>:443\";\nProperties p = new java.util.Properties();\np.put(\"httpPath\", \"<http-path>\");\np.put(\"AuthMech\", \"11\");\np.put(\"Auth_Flow\", \"1\");\np.put(\"OAuth2ClientId\", \"<service-principal-application-id>\");\np.put(\"OAuth2Secret\", \"<service-principal-oauth-secret>\");\n\/\/ ...\nConnection conn = DriverManager.getConnection(url, p);\n\/\/ ...\n\n``` \n* For a complete Java code example into which you can adapt the preceding code snippet for your own needs, see the code example at the beginning of this article.\n* In the preceding URL or Java code, replace the following placeholders: \n+ Replace `<service-principal-application-id>` with the service principal\u2019s **UUID**\/**Application ID** value.\n+ Replace `<service-principal-oauth-secret>` with the service principal\u2019s OAuth **Secret** value.\n+ To get the values for `<server-hostname>` and `<http-path>`, see [Compute settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/compute.html). \nFor more information, see the `Using M2M Based Authentication` section in the [Databricks JDBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Databricks-JDBC-Driver-Install-and-Configuration-Guide.pdf).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### IRAP compliance controls\n\nPreview \nThe ability for admins to add Enhanced Security and Compliance features is a feature in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). The compliance security profile and support for compliance standards are generally available (GA). 
\nIRAP compliance controls provide enhancements that help you with Infosec Registered Assessors Program (IRAP) compliance for your workspace. \nIRAP provides high-quality information and communications technology (ICT) security assessment services to the Australian government. IRAP provides a framework for assessing the implementation and effectiveness of an organization\u2019s security controls against the Australian government\u2019s security requirements. Databricks is IRAP certified. \nIRAP compliance controls require enabling the *compliance security profile*, which adds monitoring agents, enforces instance types for inter-node encryption, provides a hardened compute image, and other features. For technical details, see [Compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html). It is your responsibility to [confirm that each affected workspace has the compliance security profile enabled](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html#verify) and confirm that IRAP is added as a compliance program. \nIRAP compliance controls are only available in the `ap-southeast-2` region.\n\n####### IRAP compliance controls\n######## Which compute resources get enhanced security\n\nThe compliance security profile enhancements apply to compute resources in the [classic compute plane](https:\/\/docs.databricks.com\/getting-started\/overview.html) in all regions. \nSupport for serverless SQL warehouses for the compliance security profile varies by region and it is supported in the `ap-southeast-2` region. 
See [Serverless SQL warehouses support the compliance security profile in some regions](https:\/\/docs.databricks.com\/admin\/sql\/serverless.html#security-profile).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/irap.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### IRAP compliance controls\n######## Requirements\n\n* Your Databricks account must include the Enhanced Security and Compliance add-on. For details, see the [pricing page](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n* Your Databricks workspace is in the `ap-southeast-2` region.\n* Your Databricks workspace is on the Enterprise tier.\n* [Single sign-on (SSO)](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/single-sign-on\/index.html) authentication is configured for the workspace.\n* Your workspace enables the [compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html) and adds the IRAP compliance standard as part of the compliance security profile configuration.\n* You must use the following VM instance types: \n+ **General purpose:** `M-fleet`, `Md-fleet`, `M5dn`, `M5n`, `M5zn`, `M7g`, `M7gd`, `M6i`, `M7i`, `M6id`, `M6in`, `M6idn`, `M6a`, `M7a`\n+ **Compute optimized:** `C5a`, `C5ad`, `C5n`, `C6gn`, `C7g`, `C7gd`, `C7gn`, `C6i`, `C6id`, `C7i`, `C6in`, `C6a`, `C7a`\n+ **Memory optimized:** `R-fleet`, `Rd-fleet`, `R7g`, `R7gd`, `R6i`, `R7i`, `R7iz`, `R6id`, `R6in`, `R6idn`, `R6a`, `R7a`\n+ **Storage optimized:** `D3`, `D3en`, `P3dn`, `R5dn`, `R5n`, `I4i`, `I4g`, `I3en`, `Im4gn`, `Is4gen`\n+ **Accelerated computing:** `G4dn`, `G5`, `P4d`, `P4de`, `P5`\n* Ensure that sensitive information is never entered in customer-defined input fields, such as workspace names, cluster names, and job names.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/irap.html"} +{"content":"# Security and compliance guide\n## 
Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### IRAP compliance controls\n######## Enable IRAP compliance controls\n\nTo configure your workspace to support processing of data regulated by the IRAP standard, the workspace must have the [compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html) enabled. You can enable the compliance security profile and add the IRAP compliance standard across all workspaces or only on some workspaces. \nTo enable the compliance security profile and add the IRAP compliance standard for an existing workspace, see [Enable enhanced security and compliance features on a workspace](https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-compliance.html#aws-workspace-config). To set an account-level setting to enable the compliance security profile and IRAP for new workspaces, see [Set account-level defaults for new workspaces](https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-compliance.html#aws-account-level-defaults).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/irap.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### IRAP compliance controls\n######## Preview features that are supported for processing data under the IRAP Protected standard\n\nThe following preview features are supported for processing data regulated under the IRAP Protected standard: \n* [SCIM provisioning](https:\/\/docs.databricks.com\/admin\/users-groups\/scim\/index.html)\n* [IAM passthrough](https:\/\/docs.databricks.com\/archive\/credential-passthrough\/iam-passthrough.html)\n* [Secret paths in environment variables](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html#spark-conf-env-var)\n* [System tables](https:\/\/docs.databricks.com\/admin\/system-tables\/index.html)\n* [Serverless SQL warehouse usage when 
compliance security profile is enabled](https:\/\/docs.databricks.com\/admin\/sql\/serverless.html#security-profile), with support in some regions\n* [Filtering sensitive table data with row filters and column masks](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html)\n* [Unified login](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/single-sign-on\/index.html#unified-login)\n* [Lakehouse Federation to Redshift](https:\/\/docs.databricks.com\/query-federation\/redshift.html)\n* [Liquid clustering for Delta tables](https:\/\/docs.databricks.com\/delta\/clustering.html)\n* [Unity Catalog-enabled DLT pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html)\n* [Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html)\n* Scala support for shared clusters\n* Delta Live Tables Hive metastore to Unity Catalog clone API\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/irap.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### IRAP compliance controls\n######## Does Databricks permit the processing of data regulated under IRAP Protected standard?\n\nYes, if you comply with the [requirements](https:\/\/docs.databricks.com\/security\/privacy\/irap.html#requirements), enable the compliance security profile, and add the IRAP compliance standard as part of the compliance security profile configuration.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/irap.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Comparing SparkR and sparklyr\n\nR users can choose between two APIs for Apache Spark: [SparkR](https:\/\/spark.apache.org\/docs\/latest\/sparkr.html) and [sparklyr](https:\/\/spark.rstudio.com\/). This article compares these APIs. 
Databricks recommends that you choose one of these APIs to develop a Spark application in R. Combining code from both of these APIs into a single script, Databricks notebook, or job can make your code more difficult to read and maintain.\n\n#### Comparing SparkR and sparklyr\n##### API origins\n\n[SparkR](https:\/\/spark.apache.org\/docs\/latest\/sparkr.html) is built by the Spark community and developers from Databricks. Because of this, SparkR closely follows the Spark [Scala classes](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/index.html) and [DataFrame API](https:\/\/spark.apache.org\/docs\/latest\/sql-getting-started.html#creating-dataframes). \n[sparklyr](https:\/\/spark.rstudio.com\/) started with [RStudio](https:\/\/www.rstudio.com\/) and has since been donated to the Linux Foundation. sparklyr is tightly integrated into the [tidyverse](https:\/\/www.tidyverse.org\/) in both its programming style and through API interoperability with [dplyr](https:\/\/dplyr.tidyverse.org\/). \nSparkR and sparklyr are highly capable of working with big data in R. Over the past few years, their feature sets have moved closer to parity.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/sparkr-vs-sparklyr.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Comparing SparkR and sparklyr\n##### API differences\n\nThe following code example shows how to use SparkR and sparklyr from a Databricks notebook to read a CSV file from the [Sample datasets](https:\/\/docs.databricks.com\/discover\/databricks-datasets.html) into Spark. 
\n```\n# #############################################################################\n# SparkR usage\n\n# Note: To load SparkR into a Databricks notebook, run the following:\n\n# library(SparkR)\n\n# You can then remove \"SparkR::\" from the following function call.\n# #############################################################################\n\n# Use SparkR to read the airlines dataset from 2008.\nairlinesDF <- SparkR::read.df(path = \"\/databricks-datasets\/asa\/airlines\/2008.csv\",\nsource = \"csv\",\ninferSchema = \"true\",\nheader = \"true\")\n\n# Print the loaded dataset's class name.\ncat(\"Class of SparkR object: \", class(airlinesDF), \"\\n\")\n\n# Output:\n#\n# Class of SparkR object: SparkDataFrame\n\n# #############################################################################\n# sparklyr usage\n\n# Note: To install, load, and connect with sparklyr in a Databricks notebook,\n# run the following:\n\n# install.packages(\"sparklyr\")\n# library(sparklyr)\n# sc <- sparklyr::spark_connect(method = \"databricks\")\n\n# If you run \"library(sparklyr)\", you can then remove \"sparklyr::\" from the\n# preceding \"spark_connect\" and from the following function call.\n# #############################################################################\n\n# Use sparklyr to read the airlines dataset from 2007.\nairlines_sdf <- sparklyr::spark_read_csv(sc = sc,\nname = \"airlines\",\npath = \"\/databricks-datasets\/asa\/airlines\/2007.csv\")\n\n# Print the loaded dataset's class name.\ncat(\"Class of sparklyr object: \", class(airlines_sdf))\n\n# Output:\n#\n# Class of sparklyr object: tbl_spark tbl_sql tbl_lazy tbl\n\n``` \nHowever, if you try to run a sparklyr function on a `SparkDataFrame` object from SparkR, or if you try to run a SparkR function on a `tbl_spark` object from sparklyr, it will not work, as shown in the following code example. \n```\n# Try to call a sparklyr function on a SparkR SparkDataFrame object. 
It will not work.\nsparklyr::sdf_pivot(airlinesDF, DepDelay ~ UniqueCarrier)\n\n# Output:\n#\n# Error : Unable to retrieve a Spark DataFrame from object of class SparkDataFrame\n\n# Now try to call a SparkR function on a sparklyr tbl_spark object. It also will not work.\nSparkR::arrange(airlines_sdf, \"DepDelay\")\n\n# Output:\n#\n# Error in (function (classes, fdef, mtable) :\n# unable to find an inherited method for function \u2018arrange\u2019 for signature \u2018\"tbl_spark\", \"character\"\u2019\n\n``` \nThis is because sparklyr translates dplyr functions such as `arrange` into a SQL query plan that is used by SparkSQL. This is not the case with SparkR, which has functions for SparkSQL tables and Spark DataFrames. These behaviors are why Databricks does not recommend combining SparkR and sparklyr APIs in the same script, notebook, or job.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/sparkr-vs-sparklyr.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Comparing SparkR and sparklyr\n##### API interoperability\n\nIn rare cases where you cannot avoid combining the SparkR and sparklyr APIs, you can use SparkSQL as a kind of bridge. For instance, in this article\u2019s first example, sparklyr loaded the airlines dataset from 2007 into a table named `airlines`. 
You can use the SparkR `sql` function to query this table, for example: \n```\ntop10delaysDF <- SparkR::sql(\"SELECT\nUniqueCarrier,\nDepDelay,\nOrigin\nFROM\nairlines\nWHERE\nDepDelay NOT LIKE 'NA'\nORDER BY DepDelay\nDESC LIMIT 10\")\n\n# Print the class name of the query result.\ncat(\"Class of top10delaysDF: \", class(top10delaysDF), \"\\n\\n\")\n\n# Show the query result.\ncat(\"Top 10 airline delays for 2007:\\n\\n\")\nhead(top10delaysDF, 10)\n\n# Output:\n#\n# Class of top10delaysDF: SparkDataFrame\n#\n# Top 10 airline delays for 2007:\n#\n# UniqueCarrier DepDelay Origin\n# 1 AA 999 RNO\n# 2 NW 999 EWR\n# 3 AA 999 PHL\n# 4 MQ 998 RST\n# 5 9E 997 SWF\n# 6 AA 996 DFW\n# 7 NW 996 DEN\n# 8 MQ 995 IND\n# 9 MQ 994 SJT\n# 10 AA 993 MSY\n\n``` \nFor additional examples, see [Work with DataFrames and tables in R](https:\/\/docs.databricks.com\/sparkr\/dataframes-tables.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/sparkr-vs-sparklyr.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Manage VPC endpoint registrations\n\nThis article describes how to manage VPC endpoint registrations in the account console.\n\n##### Manage VPC endpoint registrations\n###### What is a VPC endpoint registration?\n\nThis article discusses how to create Databricks VPC endpoint registration objects, which is a Databricks configuration object wrapping the regional AWS VPC endpoint. You must register AWS VPC endpoints to enable [AWS PrivateLink](https:\/\/aws.amazon.com\/privatelink). An AWS VPC endpoint represents a connection from one VPC to a PrivateLink service in another VPC. \nThis article does not contain all the information necessary to configure PrivateLink for your workspace. For all requirements and steps, see [Enable AWS PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html). 
\nOne of the PrivateLink requirements is to use a [customer-managed VPC](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html), which you register with Databricks to create a network configuration object. For PrivateLink back-end support, that network configuration object must reference your VPC endpoint registrations (your registered VPC endpoints). For more information about network configurations, see [Enable AWS PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html) and [Create network configurations for custom VPC deployment](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/networks.html). \nIf you have multiple workspaces that share the same customer-managed VPC, you can choose to share the AWS VPC endpoints. You can also share these VPC endpoints among multiple Databricks accounts, in which case register the AWS VPC endpoint in each Databricks account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/vpc-endpoints.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Manage VPC endpoint registrations\n###### Register a VPC endpoint\n\nNote \nThese instructions show you how to create the VPC endpoints from the **Cloud resources** page in the account console before you create a new workspace. You can also create the VPC endpoints in a similar way as part of the flow of creating or updating a new workspace and choosing **Register a VPC endpoint** from menus in the network configuration editor. See [Manually create a workspace (existing Databricks accounts)](https:\/\/docs.databricks.com\/admin\/workspace\/create-workspace.html) and [Create network configurations for custom VPC deployment](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/networks.html). \n1. 
In the [account console](https:\/\/docs.databricks.com\/admin\/account-settings\/index.html#account-console), click **Cloud resources**.\n2. Click **Network**.\n3. From the vertical navigation on the page, click **VPC endpoint registrations**.\n4. Click **Register a VPC endpoint**.\n5. In the **VPC endpoint registration name** field , type the human-readable name you\u2019d like for the new configuration. Databricks recommends including the region and the destination of this particular VPC endpoint. For example, if this is a VPC endpoint for back-end PrivateLink connectivity to the Databricks control plane secure cluster connectivity relay, you might name it something like `VPCE us-west-2 for SCC`.\n6. Choose the region. \nImportant \nThe region field must match your workspace region and the region of the AWS VPC endpoints that you are registering. However, Databricks validates this only during workspace creation (or during updating a workspace with PrivateLink), so it is critical that you carefully set the region in this step.\n7. In the **AWS VPC endpoint ID** field, paste the ID from the relevant AWS VPC endpoint.\n8. Click **Register new VPC endpoint**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/vpc-endpoints.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Manage VPC endpoint registrations\n###### Delete a VPC endpoint registration\n\nVPC endpoint registrations cannot be edited after creation. If the configuration has incorrect data or if you no longer need it, delete the VPC endpoint registration: \n1. In the [account console](https:\/\/docs.databricks.com\/admin\/account-settings\/index.html#account-console), click **Cloud resources**.\n2. Click **Network**.\n3. From the vertical navigation on the page, click **VPC endpoint registrations**.\n4. 
On the row for the configuration, click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) on the right, and select **Delete**.\n5. In the confirmation dialog, click **Confirm Delete**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/vpc-endpoints.html"} +{"content":"# Technology partners\n### Connect to ingestion partners using Partner Connect\n\nPartner Connect offers the simplest way to connect your Databricks workspace to a data ingestion partner solution. You typically follow the steps in this article to connect to an ingestion partner solution using Partner Connect.\n\n### Connect to ingestion partners using Partner Connect\n#### Before you begin:\n\n* Confirm that you meet the [requirements](https:\/\/docs.databricks.com\/partner-connect\/index.html#requirements) for using Partner Connect.\n* See the appropriate partner connection guide. \nImportant \nYou might have to meet partner-specific requirements. You might also have to follow different steps than the steps in this article. This is because not all partner solutions are featured in Partner Connect, and because the connection experience can differ between partners in Partner Connect. \nTip \nIf you have an existing partner account, Databricks recommends that you log in to your partner account and connect to Databricks manually. This is because the connection experience in Partner Connect is optimized for new partner accounts.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/ingestion.html"} +{"content":"# Technology partners\n### Connect to ingestion partners using Partner Connect\n#### Steps to connect to a data ingestion partner\n\nTo connect your Databricks workspace to a data ingestion partner solution, do the following: \n1. In the sidebar, click ![Partner Connect button](https:\/\/docs.databricks.com\/_images\/partner-connect.png) **Partner Connect**.\n2. Click the partner tile. 
\nNote \nIf the partner tile has a check mark icon inside it, an administrator has already used Partner Connect to connect the partner to your workspace. Skip to step 5. The partner uses the email address for your Databricks account to prompt you to sign in to your existing partner account.\n3. Select a catalog for the partner to write to, then click **Next**. \nNote \nIf a partner doesn\u2019t support Unity Catalog with Partner Connect, the workspace default catalog is used. If your workspace isn\u2019t Unity Catalog-enabled, `hive_metastore` is used. \nPartner Connect creates the following resources in your workspace: \n* A SQL warehouse named **`_ENDPOINT`** by default. You can change this default name before you click **Next**.\n* A Databricks [service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) named **`_USER`**.\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) that is associated with the **`_USER`** service principal. \nPartner Connect also grants the following privileges to the **`_USER`** service principal: \n* (Unity Catalog) `USE CATALOG`: Required to interact with objects in the selected catalog.\n* (Unity Catalog) `CREATE SCHEMA`: Required to interact with objects in the selected schema.\n* (Hive metastore) `USAGE`: Required to interact with objects in the Hive metastore.\n* (Hive metastore) `CREATE`: Grants the ability to create objects in the Hive metastore.\n4. Click **Next**. \nThe **Email** box displays the email address for your Databricks account. The partner uses this email address to prompt you to either create a new partner account or sign in to your existing partner account.\n5. Click **Connect to ``** or **Sign in**. \nA new tab opens in your web browser, which displays the partner website.\n6. 
Complete the on-screen instructions on the partner website to create your trial partner account or sign in to your existing partner account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/ingestion.html"} +{"content":"# Technology partners\n## Connect to data governance partners using Partner Connect\n#### Connect Databricks to Monte Carlo\n\nThis article describes how to connect your Databricks workspace to Monte Carlo. Monte Carlo monitors your data across your data warehouses, data lakes, ETL pipelines, and business intelligence tools and alerts for issues.\n\n#### Connect Databricks to Monte Carlo\n##### Connect to Monte Carlo using Partner Connect\n\n### Before you connect using Partner Connect \nBefore you connect to Monte Carlo using Partner Connect, review the following requirements and considerations: \n* You must be a Databricks workspace admin.\n* You must belong to the `Account Owners` authorization group for your Monte Carlo account.\n* Any workspace admin can delete a Monte Carlo connection from Partner Connect, but, only users who have Monte Carlo `Account Owner` permissions can delete the associated connection object in the Monte Carlo account. If a Databricks user doesn\u2019t have Monte Carlo `Account Owner` permissions, the deletion only removes the Partner Connect integration from the Databricks workspace. The integration remains intact in the Monte Carlo account.\n* A Monte Carlo account can only connect to one Databricks workspace using Partner Connect. If you try to connect a second workspace to a Monte Carlo account using Partner Connect, an error prompts you to connect manually. 
\n### Steps to connect using Partner Connect \nTo connect to Monte Carlo using Partner Connect, see [Connect to data governance partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/data-governance.html).\n\n#### Connect Databricks to Monte Carlo\n##### Connect to Monte Carlo manually\n\nTo connect to Databricks from Monte Carlo manually, see [Databricks](https:\/\/docs.getmontecarlo.com\/docs\/overview-databricks) in the Monte Carlo documentation.\n\n#### Connect Databricks to Monte Carlo\n##### Additional resources\n\n* [Website](https:\/\/www.montecarlodata.com\/)\n* [Documentation](https:\/\/docs.getmontecarlo.com\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/data-governance\/monte-carlo.html"} +{"content":"# Query data\n## Data format options\n#### Read and write XML files\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes how to read and write XML files. \nExtensible Markup Language (XML) is a markup language for formatting, storing, and sharing data in textual format. It defines a set of rules for serializing data ranging from documents to arbitrary data structures. \nNative XML file format support enables ingestion, querying, and parsing of XML data for batch processing or streaming. It can automatically infer and evolve schema and data types, supports SQL expressions like `from_xml`, and can generate XML documents. It doesn\u2019t require external jars and works seamlessly with Auto Loader, `read_files` and `COPY INTO`.\n\n#### Read and write XML files\n##### Requirements\n\nDatabricks Runtime 14.3 and above\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/xml.html"} +{"content":"# Query data\n## Data format options\n#### Read and write XML files\n##### Parse XML records\n\nXML specification mandates a well-formed structure. However, this specification doesn\u2019t immediately map to a tabular format. 
You must specify the `rowTag` option to indicate the XML element that maps to a `DataFrame` `Row`. The `rowTag` element becomes the top-level `struct`. The child elements of `rowTag` become the fields of the top-level `struct`. \nYou can specify the schema for this record or let it be inferred automatically. Because the parser only examines the `rowTag` elements, DTD and external entities are filtered out. \nThe following examples illustrate schema inference and parsing of an XML file using different `rowTag` options: \n```\nxmlString = \"\"\"\n<books>\n<book id=\"bk103\">\n<author>Corets, Eva<\/author>\n<title>Maeve Ascendant<\/title>\n<\/book>\n<book id=\"bk104\">\n<author>Corets, Eva<\/author>\n<title>Oberon's Legacy<\/title>\n<\/book>\n<\/books>\"\"\"\n\nxmlPath = \"dbfs:\/tmp\/books.xml\"\ndbutils.fs.put(xmlPath, xmlString, True)\n\n``` \n```\nval xmlString = \"\"\"\n<books>\n<book id=\"bk103\">\n<author>Corets, Eva<\/author>\n<title>Maeve Ascendant<\/title>\n<\/book>\n<book id=\"bk104\">\n<author>Corets, Eva<\/author>\n<title>Oberon's Legacy<\/title>\n<\/book>\n<\/books>\"\"\"\nval xmlPath = \"dbfs:\/tmp\/books.xml\"\ndbutils.fs.put(xmlPath, xmlString)\n\n``` \nRead the XML file with `rowTag` option as \u201cbooks\u201d: \n```\ndf = spark.read.option(\"rowTag\", \"books\").format(\"xml\").load(xmlPath)\ndf.printSchema()\ndf.show(truncate=False)\n\n``` \n```\nval df = spark.read.option(\"rowTag\", \"books\").xml(xmlPath)\ndf.printSchema()\ndf.show(truncate=false)\n\n``` \nOutput: \n```\nroot\n|-- book: array (nullable = true)\n| |-- element: struct (containsNull = true)\n| | |-- _id: string (nullable = true)\n| | |-- author: string (nullable = true)\n| | |-- title: string (nullable = true)\n\n+------------------------------------------------------------------------------+\n|book |\n+------------------------------------------------------------------------------+\n|[{bk103, Corets, Eva, Maeve Ascendant}, {bk104, Corets, Eva, Oberon's 
Legacy}]|\n+------------------------------------------------------------------------------+\n\n``` \nRead the XML file with `rowTag` as \u201cbook\u201d: \n```\ndf = spark.read.option(\"rowTag\", \"book\").format(\"xml\").load(xmlPath)\n# Infers three top-level fields and parses `book` in separate rows:\n\n``` \n```\nval df = spark.read.option(\"rowTag\", \"book\").xml(xmlPath)\n\/\/ Infers three top-level fields and parses `book` in separate rows:\n\n``` \nOutput: \n```\nroot\n|-- _id: string (nullable = true)\n|-- author: string (nullable = true)\n|-- title: string (nullable = true)\n\n+-----+-----------+---------------+\n|_id |author |title |\n+-----+-----------+---------------+\n|bk103|Corets, Eva|Maeve Ascendant|\n|bk104|Corets, Eva|Oberon's Legacy|\n+-----+-----------+---------------+\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/xml.html"} +{"content":"# Query data\n## Data format options\n#### Read and write XML files\n##### Data source options\n\nData source options for XML can be specified the following ways: \n* The `.option\/.options` methods of the following: \n+ DataFrameReader\n+ DataFrameWriter\n+ DataStreamReader\n+ DataStreamWriter\n* The following built-in functions: \n+ [from\\_xml](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/from_xml.html)\n+ [to\\_xml](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/to_xml.html)\n+ [schema\\_of\\_xml](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/schema_of_xml.html)\n* The `OPTIONS` clause of [CREATE TABLE USING DATA\\_SOURCE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-table-using.html) \nFor a list of options, see [Auto Loader options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/xml.html"} +{"content":"# Query data\n## Data format options\n#### Read and write XML files\n##### XSD support\n\nYou can optionally 
validate each row-level XML record by an XML Schema Definition (XSD). The XSD file is specified in the `rowValidationXSDPath` option. The XSD does not otherwise affect the schema provided or inferred. A record that fails the validation is marked as \u201ccorrupted\u201d and handled based on the corrupt record handling mode option described in the option section. \nYou can use `XSDToSchema` to extract a Spark DataFrame schema from a XSD file. It supports only simple, complex, and sequence types, and only supports basic XSD functionality. \n```\nimport org.apache.spark.sql.execution.datasources.xml.XSDToSchema\nimport org.apache.hadoop.fs.Path\n\nval xsdPath = \"dbfs:\/tmp\/books.xsd\"\nval xsdString = \"\"\"<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n<xs:schema xmlns:xs=\"http:\/\/www.w3.org\/2001\/XMLSchema\">\n<xs:element name=\"book\">\n<xs:complexType>\n<xs:sequence>\n<xs:element name=\"author\" type=\"xs:string\" \/>\n<xs:element name=\"title\" type=\"xs:string\" \/>\n<xs:element name=\"genre\" type=\"xs:string\" \/>\n<xs:element name=\"price\" type=\"xs:decimal\" \/>\n<xs:element name=\"publish_date\" type=\"xs:date\" \/>\n<xs:element name=\"description\" type=\"xs:string\" \/>\n<\/xs:sequence>\n<xs:attribute name=\"id\" type=\"xs:string\" use=\"required\" \/>\n<\/xs:complexType>\n<\/xs:element>\n<\/xs:schema>\"\"\"\n\ndbutils.fs.put(xsdPath, xsdString, true)\n\nval schema1 = XSDToSchema.read(xsdString)\nval schema2 = XSDToSchema.read(new Path(xsdPath))\n\n``` \nThe following table shows the conversion of XSD data types to Spark data types: \n| XSD Data Types | Spark Data Types |\n| --- | --- |\n| `boolean` | `BooleanType` |\n| `decimal` | `DecimalType` |\n| `unsignedLong` | `DecimalType(38, 0)` |\n| `double` | `DoubleType` |\n| `float` | `FloatType` |\n| `byte` | `ByteType` |\n| `short`, `unsignedByte` | `ShortType` |\n| `integer`, `negativeInteger`, `nonNegativeInteger`, `nonPositiveInteger`, `positiveInteger`, `unsignedShort` | `IntegerType` |\n| `long`, 
`unsignedInt` | `LongType` |\n| `date` | `DateType` |\n| `dateTime` | `TimestampType` |\n| `Others` | `StringType` |\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/xml.html"} +{"content":"# Query data\n## Data format options\n#### Read and write XML files\n##### Parse nested XML\n\nXML data in a string-valued column in an existing DataFrame can be parsed with `schema_of_xml` and `from_xml` that returns the schema and the parsed results as new `struct` columns. XML data passed as an argument to `schema_of_xml` and `from_xml` must be a single well-formed XML record. \n### schema\\_of\\_xml \n**Syntax** \n```\nschema_of_xml(xmlStr [, options] )\n\n``` \n**Arguments** \n* `xmlStr`: A STRING expression specifying a single well-formed XML record.\n* `options`: An optional `MAP<STRING,STRING>` literal specifying directives. \n**Returns** \nA STRING holding a definition of a struct with n fields of strings where the column names are derived from the XML element and attribute names. The field values hold the derived formatted SQL types. \n### from\\_xml \n**Syntax** \n```\nfrom_xml(xmlStr, schema [, options])\n\n``` \n**Arguments** \n* `xmlStr`: A STRING expression specifying a single well-formed XML record.\n* `schema`: A STRING expression or invocation of the `schema_of_xml` function.\n* `options`: An optional `MAP<STRING,STRING>` literal specifying directives. \n**Returns** \nA struct with field names and types matching the schema definition. Schema must be defined as comma-separated column name and data type pairs as used in, for example, `CREATE TABLE`. Most options shown in the [data source options](https:\/\/docs.databricks.com\/query\/formats\/xml.html#options) are applicable with the\nfollowing exceptions: \n* `rowTag`: Because there is only one XML record, the `rowTag` option is not applicable.\n* `mode` (default: `PERMISSIVE`): Allows a mode for dealing with corrupt records during parsing. 
\n+ `PERMISSIVE`: When it meets a corrupted record, puts the malformed string into a field configured by `columnNameOfCorruptRecord`, and sets malformed fields to `null`. To keep corrupt records, you can set a string type field named `columnNameOfCorruptRecord` in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a `columnNameOfCorruptRecord` field in an output schema.\n+ `FAILFAST`: Throws an exception when it meets corrupted records.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/xml.html"} +{"content":"# Query data\n## Data format options\n#### Read and write XML files\n##### Structure conversion\n\nDue to the structure differences between DataFrame and XML, there are some conversion rules from XML data to `DataFrame` and from `DataFrame` to XML data. Note that handling attributes can be disabled with the option `excludeAttribute`. \n### Conversion from XML to DataFrame \n**Attributes**: Attributes are converted as fields with the heading prefix `attributePrefix`. \n```\n<one myOneAttrib=\"AAAA\">\n<two>two<\/two>\n<three>three<\/three>\n<\/one>\n\n``` \nproduces a schema below: \n```\nroot\n|-- _myOneAttrib: string (nullable = true)\n|-- two: string (nullable = true)\n|-- three: string (nullable = true)\n\n``` \n**Character data in an element containing attribute(s) or child element(s):** These are parsed into the `valueTag` field. If there are multiple occurrences of character data, the `valueTag` field is converted to an `array` type. 
\n```\n<one>\n<two myTwoAttrib=\"BBBBB\">two<\/two>\nsome value between elements\n<three>three<\/three>\nsome other value between elements\n<\/one>\n\n``` \nproduces a schema below: \n```\nroot\n|-- _VALUE: array (nullable = true)\n| |-- element: string (containsNull = true)\n|-- two: struct (nullable = true)\n| |-- _VALUE: string (nullable = true)\n| |-- _myTwoAttrib: string (nullable = true)\n|-- three: string (nullable = true)\n\n``` \n### Conversion from DataFrame to XML \n**Element as an array in an array**: Writing a XML file from `DataFrame` having a field\n`ArrayType` with its element as `ArrayType` would have an additional nested field for the\nelement. This would not happen in reading and writing XML data but writing a `DataFrame`\nread from other sources. Therefore, roundtrip in reading and writing XML files has the same\nstructure but writing a `DataFrame` read from other sources is possible to have a different\nstructure. \nDataFrame with a schema below: \n```\n|-- a: array (nullable = true)\n| |-- element: array (containsNull = true)\n| | |-- element: string (containsNull = true)\n\n``` \nand with data below: \n```\n+------------------------------------+\n| a|\n+------------------------------------+\n|[WrappedArray(aa), WrappedArray(bb)]|\n+------------------------------------+\n\n``` \nproduces a XML file below: \n```\n<a>\n<item>aa<\/item>\n<\/a>\n<a>\n<item>bb<\/item>\n<\/a>\n\n``` \nThe element name of the unnamed array in the `DataFrame` is specified by the option `arrayElementName` (Default: `item`).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/xml.html"} +{"content":"# Query data\n## Data format options\n#### Read and write XML files\n##### Rescued data column\n\nThe rescued data column ensures that you never lose or miss out on data during ETL. 
You can enable the rescued data column to capture any data that wasn\u2019t parsed because one or more fields in a record have one of the following issues: \n* Absent from the provided schema\n* Does not match the data type of the provided schema\n* Has a case mismatch with the field names in the provided schema \nThe rescued data column is returned as a JSON document containing the columns that were rescued, and the source file path of the record. To remove the source file path from the rescued data column, you can set the following SQL configuration: \n```\nspark.conf.set(\"spark.databricks.sql.rescuedDataColumn.filePath.enabled\", \"false\")\n\n``` \n```\nspark.conf.set(\"spark.databricks.sql.rescuedDataColumn.filePath.enabled\", \"false\")\n\n``` \nYou can enable the rescued data column by setting the option `rescuedDataColumn` to a column name when reading data, such as `_rescued_data` with `spark.read.option(\"rescuedDataColumn\", \"_rescued_data\").format(\"xml\").load(<path>)`. \nThe XML parser supports three modes when parsing records: `PERMISSIVE`, `DROPMALFORMED`, and `FAILFAST`. When used together with `rescuedDataColumn`, data type mismatches do not cause records to be dropped in `DROPMALFORMED` mode or throw an error in `FAILFAST` mode. Only corrupt records (incomplete or malformed XML) are dropped or throw errors.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/xml.html"} +{"content":"# Query data\n## Data format options\n#### Read and write XML files\n##### Schema inference and evolution in Auto Loader\n\nFor a detailed discussion of this topic and applicable options, see [Configure schema inference and evolution in Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html). You can configure Auto Loader to automatically detect the schema of loaded XML data, allowing you to initialize tables without explicitly declaring the data schema and evolve the table schema as new columns are introduced. 
This eliminates the need to manually track and apply schema changes over time. \nBy default, Auto Loader schema inference seeks to avoid schema evolution issues due to type mismatches. For formats that don\u2019t encode data types (JSON, CSV, and XML), Auto Loader infers all columns as strings, including nested fields in XML files. The Apache Spark `DataFrameReader` uses a different behavior for schema inference, selecting data types for columns in XML sources based on sample data. To enable this behavior with Auto Loader, set the option `cloudFiles.inferColumnTypes` to `true`. \nAuto Loader detects the addition of new columns as it processes your data. When Auto Loader detects a new column, the stream stops with an `UnknownFieldException`. Before your stream throws this error, Auto Loader performs schema inference on the latest micro-batch of data and updates the schema location with the latest schema by merging new columns to the end of the schema. The data types of existing columns remain unchanged. Auto Loader supports different [modes for schema evolution](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html#evolution), which you set in the option `cloudFiles.schemaEvolutionMode`. \nYou can use [schema hints](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html#schema-hints) to enforce the schema information that you know and expect on an inferred schema. When you know that a column is of a specific data type, or if you want to choose a more general data type (for example, a double instead of an integer), you can provide an arbitrary number of hints for column data types as a string using SQL schema specification syntax. When the rescued data column is enabled, fields named in a case other than that of the schema are loaded to the `_rescued_data` column. 
You can change this behavior by setting the option `readerCaseSensitive` to `false`, in which case Auto Loader reads data in a case-insensitive way.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/xml.html"} +{"content":"# Query data\n## Data format options\n#### Read and write XML files\n##### Examples\n\nThe examples in this section use an XML file available for download in the [Apache Spark GitHub repo](https:\/\/github.com\/apache\/spark\/blob\/master\/sql\/core\/src\/test\/resources\/test-data\/xml-resources\/books.xml). \n### Read and write XML \n```\ndf = (spark.read\n.format('xml')\n.options(rowTag='book')\n.load(xmlPath)) # books.xml\n\nselected_data = df.select(\"author\", \"_id\")\n(selected_data.write\n.options(rowTag='book', rootTag='books')\n.xml('newbooks.xml'))\n\n``` \n```\nval df = spark.read\n.option(\"rowTag\", \"book\")\n.xml(xmlPath) \/\/ books.xml\n\nval selectedData = df.select(\"author\", \"_id\")\nselectedData.write\n.option(\"rootTag\", \"books\")\n.option(\"rowTag\", \"book\")\n.xml(\"newbooks.xml\")\n\n``` \n```\ndf <- loadDF(\"books.xml\", source = \"xml\", rowTag = \"book\")\n# In this case, `rootTag` is set to \"ROWS\" and `rowTag` is set to \"ROW\".\nsaveDF(df, \"newbooks.xml\", \"xml\", \"overwrite\")\n\n``` \nYou can manually specify the schema when reading data: \n```\nfrom pyspark.sql.types import StructType, StructField, StringType, DoubleType\n\ncustom_schema = StructType([\nStructField(\"_id\", StringType(), True),\nStructField(\"author\", StringType(), True),\nStructField(\"description\", StringType(), True),\nStructField(\"genre\", StringType(), True),\nStructField(\"price\", DoubleType(), True),\nStructField(\"publish_date\", StringType(), True),\nStructField(\"title\", StringType(), True)\n])\ndf = spark.read.options(rowTag='book').xml('books.xml', schema=custom_schema)\n\nselected_data = df.select(\"author\", \"_id\")\nselected_data.write.options(rowTag='book', rootTag='books').xml('newbooks.xml')\n\n``` 
\n```\nimport org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}\n\nval customSchema = StructType(Array(\nStructField(\"_id\", StringType, nullable = true),\nStructField(\"author\", StringType, nullable = true),\nStructField(\"description\", StringType, nullable = true),\nStructField(\"genre\", StringType, nullable = true),\nStructField(\"price\", DoubleType, nullable = true),\nStructField(\"publish_date\", StringType, nullable = true),\nStructField(\"title\", StringType, nullable = true)))\nval df = spark.read.option(\"rowTag\", \"book\").schema(customSchema).xml(xmlPath) \/\/ books.xml\n\nval selectedData = df.select(\"author\", \"_id\")\nselectedData.write.option(\"rootTag\", \"books\").option(\"rowTag\", \"book\").xml(\"newbooks.xml\")\n\n``` \n```\ncustomSchema <- structType(\nstructField(\"_id\", \"string\"),\nstructField(\"author\", \"string\"),\nstructField(\"description\", \"string\"),\nstructField(\"genre\", \"string\"),\nstructField(\"price\", \"double\"),\nstructField(\"publish_date\", \"string\"),\nstructField(\"title\", \"string\"))\n\ndf <- loadDF(\"books.xml\", source = \"xml\", schema = customSchema, rowTag = \"book\")\n# In this case, `rootTag` is set to \"ROWS\" and `rowTag` is set to \"ROW\".\nsaveDF(df, \"newbooks.xml\", \"xml\", \"overwrite\")\n\n``` \n### SQL API \nXML data source can infer data types: \n```\nDROP TABLE IF EXISTS books;\nCREATE TABLE books\nUSING XML\nOPTIONS (path \"books.xml\", rowTag \"book\");\nSELECT * FROM books;\n\n``` \nYou can also specify column names and types in DDL. In this case, the schema is not inferred automatically. 
\n```\nDROP TABLE IF EXISTS books;\n\nCREATE TABLE books (author string, description string, genre string, _id string,\nprice double, publish_date string, title string)\nUSING XML\nOPTIONS (path \"books.xml\", rowTag \"book\");\n\n``` \n### Load XML using COPY INTO \n```\nDROP TABLE IF EXISTS books;\nCREATE TABLE IF NOT EXISTS books;\n\nCOPY INTO books\nFROM \"\/FileStore\/xmltestDir\/input\/books.xml\"\nFILEFORMAT = XML\nFORMAT_OPTIONS ('mergeSchema' = 'true', 'rowTag' = 'book')\nCOPY_OPTIONS ('mergeSchema' = 'true');\n\n``` \n### Read XML with row validation \n```\ndf = (spark.read\n.format(\"xml\")\n.option(\"rowTag\", \"book\")\n.option(\"rowValidationXSDPath\", xsdPath)\n.load(inputPath))\ndf.printSchema()\n\n``` \n```\nval df = spark.read\n.option(\"rowTag\", \"book\")\n.option(\"rowValidationXSDPath\", xsdPath)\n.xml(inputPath)\ndf.printSchema\n\n``` \n### Parse nested XML (from\\_xml and schema\\_of\\_xml) \n```\nfrom pyspark.sql.functions import from_xml, schema_of_xml, lit, col\n\nxml_data = \"\"\"\n<book id=\"bk103\">\n<author>Corets, Eva<\/author>\n<title>Maeve Ascendant<\/title>\n<genre>Fantasy<\/genre>\n<price>5.95<\/price>\n<publish_date>2000-11-17<\/publish_date>\n<\/book>\n\"\"\"\n\ndf = spark.createDataFrame([(8, xml_data)], [\"number\", \"payload\"])\nschema = schema_of_xml(df.select(\"payload\").limit(1).collect()[0][0])\nparsed = df.withColumn(\"parsed\", from_xml(col(\"payload\"), schema))\nparsed.printSchema()\nparsed.show()\n\n``` \n```\nimport org.apache.spark.sql.functions.{from_xml,schema_of_xml,lit}\n\nval xmlData = \"\"\"\n<book id=\"bk103\">\n<author>Corets, Eva<\/author>\n<title>Maeve Ascendant<\/title>\n<genre>Fantasy<\/genre>\n<price>5.95<\/price>\n<publish_date>2000-11-17<\/publish_date>\n<\/book>\"\"\".stripMargin\n\nval df = Seq((8, xmlData)).toDF(\"number\", \"payload\")\nval schema = schema_of_xml(xmlData)\nval parsed = df.withColumn(\"parsed\", from_xml($\"payload\", schema))\nparsed.printSchema()\nparsed.show()\n\n``` \n### 
from\\_xml and schema\\_of\\_xml with SQL API \n```\nSELECT from_xml('\n<book id=\"bk103\">\n<author>Corets, Eva<\/author>\n<title>Maeve Ascendant<\/title>\n<genre>Fantasy<\/genre>\n<price>5.95<\/price>\n<publish_date>2000-11-17<\/publish_date>\n<\/book>',\nschema_of_xml('\n<book id=\"bk103\">\n<author>Corets, Eva<\/author>\n<title>Maeve Ascendant<\/title>\n<genre>Fantasy<\/genre>\n<price>5.95<\/price>\n<publish_date>2000-11-17<\/publish_date>\n<\/book>')\n);\n\n``` \n### Load XML with Auto Loader \n```\nquery = (spark\n.readStream\n.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"xml\")\n.option(\"rowTag\", \"book\")\n.option(\"cloudFiles.inferColumnTypes\", True)\n.option(\"cloudFiles.schemaLocation\", schemaPath)\n.option(\"cloudFiles.schemaEvolutionMode\", \"rescue\")\n.load(inputPath)\n.writeStream\n.format(\"delta\")\n.option(\"mergeSchema\", \"true\")\n.option(\"checkpointLocation\", checkPointPath)\n# In PySpark, the available-now trigger is a keyword argument, not a Trigger object.\n.trigger(availableNow=True))\n\nquery.start(outputPath).awaitTermination()\ndf = spark.read.format(\"delta\").load(outputPath)\ndf.show()\n\n``` \n```\nimport org.apache.spark.sql.streaming.Trigger\n\nval query = spark\n.readStream\n.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"xml\")\n.option(\"rowTag\", \"book\")\n.option(\"cloudFiles.inferColumnTypes\", true)\n.option(\"cloudFiles.schemaLocation\", schemaPath)\n.option(\"cloudFiles.schemaEvolutionMode\", \"rescue\")\n.load(inputPath)\n.writeStream\n.format(\"delta\")\n.option(\"mergeSchema\", \"true\")\n.option(\"checkpointLocation\", checkPointPath)\n.trigger(Trigger.AvailableNow())\n\nquery.start(outputPath).awaitTermination()\nval df = spark.read.format(\"delta\").load(outputPath)\ndf.show()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/xml.html"} +{"content":"# Query data\n## Data format options\n#### Read and write XML files\n##### Additional resources\n\n[Read and write XML data using the spark-xml 
library](https:\/\/docs.databricks.com\/archive\/connectors\/spark-xml-library.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/xml.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks notebook interface and controls\n\nThe notebook toolbar includes menus and icons that you can use to manage and edit the notebook. \n![Notebook toolbar](https:\/\/docs.databricks.com\/_images\/toolbar.png) \nNext to the notebook name are buttons that let you [change the default language of the notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#default-language) and, if the notebook is included in a Databricks Git folder, [open the Git dialog](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html). \nTo view [previous versions](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#version-control) of the notebook, click the \u201cLast edit\u2026\u201d message to the right of the menus.\n\n#### Databricks notebook interface and controls\n##### Updated cell design\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nAn updated cell design is available. This page includes information about how to use both versions of the cell design. For an orientation to the new UI and answers to common questions, see [Orientation to the new cell UI](https:\/\/docs.databricks.com\/notebooks\/new-cell-ui-orientation.html). \nTo enable or disable the new cell design, open the [editor settings](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#configure-notebook-settings) page in the workspace. In the sidebar, click **Developer**. 
Under **Experimental features**, toggle **New cell UI**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks notebook interface and controls\n##### Notebook cells\n\nNotebooks contain a collection of two types of cells: code cells and Markdown cells. Code cells contain runnable code. Markdown cells contain Markdown code that renders into text and graphics when the cell is executed and can be used to document or illustrate your code. You can add or remove cells to your notebook to structure your work. \nYou can run a single cell, a group of cells, or run the whole notebook at once. A notebook cell can contain at most 10MB. Notebook cell output is limited to 20MB.\n\n#### Databricks notebook interface and controls\n##### Notebook toolbar icons and buttons\n\nThe icons and buttons at the right of the toolbar are described in the following table: \n| Icon | Description |\n| --- | --- |\n| Run all button Interrupt execution button | [Run all cells or stop execution](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html). The name of this button changes depending on the state of the notebook. |\n| Notebook header compute selector | Open [compute selector](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach). When the notebook is connected to a cluster or SQL warehouse, this button shows the name of the compute resource. |\n| Notebook header job scheduler | Open [job scheduler](https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html). |\n| Notebook header DLT selector | Open [Delta Live Tables](https:\/\/docs.databricks.com\/notebooks\/notebooks-dlt-pipeline.html). This button appears only if the notebook is part of a Delta Live Tables pipeline. |\n| Notebook header share button | Open [permissions dialog](https:\/\/docs.databricks.com\/notebooks\/notebooks-collaborate.html). 
|\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks notebook interface and controls\n##### Right sidebar actions\n\nSeveral actions are available from the notebook\u2019s right sidebar, as described in the following table: \n| Icon | Description |\n| --- | --- |\n| Notebook header comments icon | Open [notebook comments](https:\/\/docs.databricks.com\/notebooks\/notebooks-collaborate.html#command-comments). |\n| Notebook header experiment icon | Open [MLflow notebook experiment](https:\/\/docs.databricks.com\/mlflow\/experiments.html#mlflow-notebook-experiments). |\n| Notebook version history icon | Open [notebook version history](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#version-history). |\n| Notebook variable explorer | Open [variable explorer](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#variable-explorer). (Available for Python variables with Databricks Runtime 12.2 LTS and above.) |\n| Notebook environment | Open the Python environment panel. This panel shows all Python libraries available to the notebook, including notebook-scoped libraries, cluster libraries, and libraries included in the Databricks Runtime. Available only when the notebook is attached to a cluster. |\n\n#### Databricks notebook interface and controls\n##### Browse data\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nTo explore tables and volumes available to use in the notebook, click ![notebook data icon](https:\/\/docs.databricks.com\/_images\/notebook-data-icon.png) at the left side of the notebook to open the schema browser. 
See [Browse data](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#browse-data) for more details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks notebook interface and controls\n##### Cell actions menu\n\nThe cell actions menu lets you cut and copy cells, move cells around in the notebook, and hide code or results. The menu has a different appearance in the original UI and the new UI. This section includes instructions for both versions. \nIf Databricks Assistant is enabled in your workspace, you can use it in a code cell to get help or suggestions for your code. To open a Databricks Assistant text box in a cell, click the Databricks Assistant icon ![Databricks Assistant icon](https:\/\/docs.databricks.com\/_images\/cell-assistant-icon.png) in the upper-right corner of the cell. \nYou can easily change a cell between code and markdown, or change the language of a code cell, using the cell language button near the upper-right corner of the cell. \n![Cell language button](https:\/\/docs.databricks.com\/_images\/cell-language.png) \n### Cell actions menu (original UI) \n![Cell actions menu](https:\/\/docs.databricks.com\/_images\/cmd-edit.png) \nFrom this menu you can also run code cells: \n![Cell actions menu - run](https:\/\/docs.databricks.com\/_images\/cell-actions-run.png) \nThe cell action menu also includes buttons that let you hide a cell ![Cell Minimize](https:\/\/docs.databricks.com\/_images\/cell-minimize.png) or delete a cell ![Delete Icon](https:\/\/docs.databricks.com\/_images\/delete-icon.png). \nFor Markdown cells, there is also an option to add the cell to a dashboard. For more information, see [Dashboards in notebooks](https:\/\/docs.databricks.com\/notebooks\/dashboards.html). 
\n![Dashboard](https:\/\/docs.databricks.com\/_images\/cell-actions-dashboard.png) \n### Work with cells in the new UI \nThe following screenshot describes the icons that appear at the upper-right of a notebook cell: \n![upper-right cell icons - new UI](https:\/\/docs.databricks.com\/_images\/notebook-cell-icons.png) \n**Language selector:** Select the language for the cell. \n**Databricks Assistant:** Enable or disable Databricks Assistant for code suggestions in the cell. \n**Cell focus:** Enlarge the cell to make it easier to edit. \n**Display cell actions menu:** Open the cell actions menu. The options in this menu are slightly different for code and Markdown cells. \n![Cell actions menu - new UI](https:\/\/docs.databricks.com\/_images\/new-cell-actions.png) \nTo run code cells in the new UI, click the down arrow at the upper-left of the code cell. \n![Cell run menu - new UI](https:\/\/docs.databricks.com\/_images\/cell-run-new.png) \nAfter a cell has been run, a notice appears to the right of the cell run menu, showing the last time the cell was run and the duration of the run. Hover your cursor over the notice for more details. \n![last run image](https:\/\/docs.databricks.com\/_images\/last-cell-run.png) \nTo add a Markdown cell or a cell that has tabular results to a dashboard, select **Add to dashboard** from the cell actions menu. For more information, see [Dashboards in notebooks](https:\/\/docs.databricks.com\/notebooks\/dashboards.html). \nTo delete a cell, click the trash icon to the right of the cell. This icon only appears when you hover your cursor over the cell. \n![cell trash icon](https:\/\/docs.databricks.com\/_images\/trash-icon.png) \nTo add a comment to code in a cell, highlight the code. To the right of the cell, a comment icon appears. Click the icon to open the comment box. 
 \n![comment icon](https:\/\/docs.databricks.com\/_images\/cell-comment-icon.png) \nTo move a cell up or down, click and hold ![move cell icon](https:\/\/docs.databricks.com\/_images\/move-cell-icon.png) outside the upper-left corner of the cell, and drag the cell to the new location. You can also select **Move up** or **Move down** from the cell actions menu.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks notebook interface and controls\n##### Create cells\n\nNotebooks have two types of cells: code and Markdown. The contents of Markdown cells are rendered into HTML. For example, this snippet contains markup for a level-three heading: \n```\n%md ### Libraries\nImport the necessary libraries.\n\n``` \nrenders as shown: \n![rendered Markdown example](https:\/\/docs.databricks.com\/_images\/rendered-html-cell.png) \n### Create a cell (original UI) \nTo create a new cell in the original UI, hover over a cell at the top or bottom and click the ![Add Cell](https:\/\/docs.databricks.com\/_images\/add-cell.png) icon. You can also use the notebook cell menu: click ![Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret.png) and select **Add Cell Above** or **Add Cell Below**. \nFor a code cell, just type code into the cell. To create a Markdown cell, select **Markdown** from the cell\u2019s language button or type `%md` at the top of the cell. \n### Create a cell (new UI) \nTo create a new cell in the new UI, hover over a cell at the top or bottom. Click on **Code** or **Text** to create a code or Markdown cell, respectively. 
\n![buttons to create a new cell](https:\/\/docs.databricks.com\/_images\/create-cell.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks notebook interface and controls\n##### Cut, copy, and paste cells\n\nThere are several options to cut and copy cells. If you are using the Safari browser, only the keyboard shortcuts are available. \n* From the cell actions menu in the [original UI](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#cell-actions) or the [new UI](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#cell-actions-new-ui), select **Cut cell** or **Copy cell**.\n* Use keyboard shortcuts: `Command-X` or `Ctrl-X` to cut and `Command-C` or `Ctrl-C` to copy.\n* Use the **Edit** menu at the top of the notebook. Select **Cut** or **Copy**. \nAfter you cut or copy cells, you can paste those cells elsewhere in the notebook, into a different notebook, or into a notebook in a different browser tab or window. To paste cells, use the keyboard shortcut `Command-V` or `Ctrl-V`. The cells are pasted below the current cell. \nTo undo cut or paste actions, you can use the keyboard shortcut `Command-Z` or `Ctrl-Z` or the menu options **Edit > Undo cut cells** or **Edit > Undo paste cells**. \nTo select adjacent cells, click in a Markdown cell and then use **Shift** + **Up** or **Down** to select the cells above or below it. Use the edit menu to copy, cut, paste, or delete the selected cells as a group. 
To select all cells, select **Edit > Select all cells** or use the command mode shortcut **Cmd+A**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks notebook interface and controls\n##### Notebook table of contents\n\nTo display an automatically generated table of contents, click the icon at the upper left of the notebook (between the left sidebar and the topmost cell). The table of contents is generated from the Markdown headings used in the notebook. If you are using the new UI, cells with titles also appear in the table of contents. \n![Open TOC](https:\/\/docs.databricks.com\/_images\/open-toc-with-cursor.png)\n\n#### Databricks notebook interface and controls\n##### Cell display options\n\nThere are three display options for notebooks. Use the **View** menu to change the display option. \n* Standard view: results are displayed immediately after code cells.\n* Results only: only results are displayed.\n* Side-by-side: code and results cells are displayed side by side. \nIn the new UI, actions are available from icons in the cell gutter (the area to the right and left of the cell). For example, to move a cell up or down, use the grip dots ![move cell icon](https:\/\/docs.databricks.com\/_images\/move-cell-icon.png) in the left gutter. To delete a cell, use the trash can icon in the right gutter. \nFor easier editing, click the focus mode icon ![cell focus icon](https:\/\/docs.databricks.com\/_images\/focus-icon.png) to display the cell at full width. To exit focus mode, click ![exit cell focus icon](https:\/\/docs.databricks.com\/_images\/exit-focus-icon.png). You can also enlarge the displayed width of a cell by turning off **View > Centered layout**. 
\nTo automatically format all cells in the notebook to industry standard line lengths and spacing, select **Edit > Format notebook**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks notebook interface and controls\n##### Line and command numbers\n\nTo show or hide line numbers or command numbers, select **Line numbers** or **Command numbers** from the **View** menu. For line numbers, you can also use the keyboard shortcut **Control+L**. \nIf you enable line or command numbers, Databricks saves your preference and shows them in all of your other notebooks for that browser. \n### Line and command numbers (original UI) \nCommand numbers above cells link to that specific command. If you click the command number for a cell, it updates your URL to be anchored to that command. To get a URL link to a specific command in your notebook, right-click the command number and choose **Copy Link Address**. \n### Line and command numbers (new UI) \nLine numbers are off by default in the new UI. To turn them on, select **View > Line numbers**. When a cell is in an error state, line numbers are displayed regardless of the selection. \nTo toggle command numbers, select **View > Command numbers**. \nThe new UI does not include cell command number links. To get a URL link to a specific command in your notebook, click ![cell focus icon](https:\/\/docs.databricks.com\/_images\/focus-icon.png) to open focus mode, and copy the URL from the browser address bar. 
To exit focus mode, click ![exit cell focus icon](https:\/\/docs.databricks.com\/_images\/exit-focus-icon.png).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks notebook interface and controls\n##### Add a cell title\n\nTo add a title to a cell using the original UI, select **Show Title** from the [cell actions menu](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#cell-actions). \nTo add a title to a cell using the new UI, do one of the following: \n* Click the cell number shown at the center of the top of the cell and type the title.\n* Select **Add title** from the [cell actions menu](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#cell-actions). \nWith the new UI, cells that have titles appear in the [notebook\u2019s table of contents](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#notebook-toc). \n![add cell title](https:\/\/docs.databricks.com\/_images\/add-cell-title.gif)\n\n#### Databricks notebook interface and controls\n##### View notebooks in dark mode\n\nYou can choose to display notebooks in dark mode. To turn dark mode on or off, select **View > Theme** and select **Light theme** or **Dark theme**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks notebook interface and controls\n##### Hide and show cell content\n\nCell content consists of cell code and the results generated by running the cell. You can hide and show the cell code and result using the [cell actions menu](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#cell-actions) at the upper-right of the cell. \nFor related functionality, see [Collapsible headings](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#collapsible-headings). 
\n### Hide and show cell content (original UI) \nTo hide cell code or results, click ![Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret.png) and select **Hide Code** or **Hide Result**. You can also select ![Cell Minimize](https:\/\/docs.databricks.com\/_images\/cell-minimize.png) to display only the first line of a cell. \nTo show hidden cell code or results, click the **Show** links: \n![Show hidden code and results](https:\/\/docs.databricks.com\/_images\/notebook-cell-show.png) \n### Hide and show cell content (new UI) \nTo hide cell code or results, click the kebab menu ![cell kebab icon](https:\/\/docs.databricks.com\/_images\/kebab-icon-in-cell.png) at the upper-right of the cell and select **Hide code** or **Hide result**. You can also select **Collapse cell** to display only the first line of a cell. To expand a collapsed cell, select **Expand cell**. \nTo show hidden cell code or results, click the show icon: ![show icon](https:\/\/docs.databricks.com\/_images\/show-icon.png).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks notebook interface and controls\n##### Collapsible headings\n\nCells that appear after cells containing Markdown headings can be collapsed into the heading cell. To expand or collapse cells after cells containing Markdown headings throughout the notebook, select **Collapse all headings** from the **View** menu. The rest of this section describes how to expand or collapse a subset of cells. \nFor related functionality, see [Hide and show cell content](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#hide-show-cell). \n### Expand and collapse headings (original UI) \nThe image shows a level-two heading **MLflow setup** with the following two cells collapsed into it. 
\n![Collapsed cells in original UI](https:\/\/docs.databricks.com\/_images\/headings.png) \nTo expand and collapse headings, click the **+** and **-**. \n### Expand and collapse headings (new UI) \nThe image shows a level-two heading **MLflow setup** with the following two cells collapsed into it. \n![Collapsed cells in new UI](https:\/\/docs.databricks.com\/_images\/headings-new-ui.png) \nTo expand and collapse headings, hover your cursor over the Markdown cell. Click the arrow that appears to the left of the cell.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks notebook interface and controls\n##### Compute resources for notebooks\n\nThis section covers the options for notebook compute resources. You can run a notebook on a [Databricks cluster](https:\/\/docs.databricks.com\/compute\/index.html), or, for SQL commands, you also have the option to use a [SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html), a type of compute that is optimized for SQL analytics. \n### Attach a notebook to a cluster \nTo attach a notebook to a cluster, you need the [CAN ATTACH TO cluster-level permission](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions). \nImportant \nAs long as a notebook is attached to a cluster, any user with the [CAN RUN permission on the notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-collaborate.html#notebook-permissions) has implicit permission to access the cluster. \nTo attach a notebook to a [cluster](https:\/\/docs.databricks.com\/compute\/index.html), click the [compute selector in the notebook toolbar](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#notebook-toolbar) and select a cluster from the dropdown menu. \nThe menu shows a selection of clusters that you have used recently or that are currently running. 
\n![Attach notebook](https:\/\/docs.databricks.com\/_images\/cluster-attach.png) \nTo select from all available clusters, click **More\u2026**. Click on the cluster name to display a dropdown menu, and select an existing cluster. \n![more clusters dialog](https:\/\/docs.databricks.com\/_images\/clusters-more.png) \nYou can also [create a new cluster](https:\/\/docs.databricks.com\/compute\/configure.html) by selecting **Create new resource\u2026** from the dropdown menu. \nImportant \nAn attached notebook has the following Apache Spark variables defined. \n| Class | Variable Name |\n| --- | --- |\n| `SparkContext` | `sc` |\n| `SQLContext`\/`HiveContext` | `sqlContext` |\n| `SparkSession` (Spark 2.x) | `spark` | \nDo not create a `SparkSession`, `SparkContext`, or `SQLContext`. Doing so will lead to inconsistent behavior. \n### Use a notebook with a SQL warehouse \nWhen a notebook is attached to a SQL warehouse, you can run SQL and Markdown cells. If you run a cell in any other language (such as Python or R), it throws an error. SQL cells executed on a SQL warehouse appear in the [SQL warehouse\u2019s query history](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html). The user who ran a query can [view the query profile](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-profile.html) from the notebook by clicking the elapsed time at the bottom of the output. \nRunning a notebook requires a Pro or Serverless SQL warehouse. You must have access to the workspace and the SQL warehouse. \nTo attach a notebook to a [SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html) do the following: \n1. Click the [compute selector in the notebook toolbar](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#notebook-toolbar). The dropdown menu shows compute resources that are currently running or that you have used recently. 
SQL warehouses are marked with ![SQL warehouse label](https:\/\/docs.databricks.com\/_images\/sql-warehouse-label.png).\n2. From the menu, select a SQL warehouse. \nTo see all available SQL warehouses, select **More\u2026** from the dropdown menu. A dialog appears showing compute resources available for the notebook. Select **SQL Warehouse**, choose the warehouse you want to use, and click **Attach**. \n![more cluster dialog with SQL warehouse selected](https:\/\/docs.databricks.com\/_images\/clusters-more-sql-warehouse-button.png) \nYou can also select a SQL warehouse as the compute resource for a SQL notebook when you create a workflow or scheduled job. \nLimitations of SQL warehouses include: \n* When attached to a SQL warehouse, execution contexts have an idle timeout of 8 hours.\n* The maximum size for returned results is 10,000 rows or 2MB, whichever is smaller. \n### Detach a notebook \nTo detach a notebook from a compute resource, click the compute selector in the notebook toolbar and hover over the attached cluster or SQL warehouse in the list to display a side menu. From the side menu, select **Detach**. \n![Detach notebook](https:\/\/docs.databricks.com\/_images\/cluster-detach.png) \nYou can also detach notebooks from a cluster using the **Notebooks** tab on the cluster details page. \nWhen you detach a notebook, the [execution context](https:\/\/docs.databricks.com\/notebooks\/execution-context.html) is removed and all computed variable values are cleared from the notebook. \nTip \nDatabricks recommends that you detach unused notebooks from clusters. 
This frees up memory space on the driver.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks notebook interface and controls\n##### Use web terminal and Databricks CLI\n\nTo open the web terminal in a notebook, click ![reopen bottom panel](https:\/\/docs.databricks.com\/_images\/reopen-bottom-panel.png) at the bottom of the right sidebar. \n### Use Databricks CLI in a web terminal \nStarting with Databricks Runtime 15.0, you can use the Databricks CLI from the web terminal in the notebook. \n### Requirements \n* The notebook must be attached to a cluster in **Single user** or **No isolation shared** access mode.\n* The CLI is not available in workspaces enabled for PrivateLink. \nThe installed CLI is always the latest version. Authentication is based on the current user. \nYou cannot use the CLI from a notebook cell. Commands like `%sh databricks ...` in a notebook do not work with Databricks Runtime 15.0 or above.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### Data discovery and collaboration in the lakehouse\n\nDatabricks designed Unity Catalog to help organizations reduce time to insights by empowering a broader set of data users to discover and analyze data at scale. Data stewards can securely grant access to data assets for diverse teams of end users in Unity Catalog. These users can then use a variety of languages and tools, including SQL and Python, to create derivative datasets, models, and dashboards that can be shared across teams.\n\n#### Data discovery and collaboration in the lakehouse\n##### Manage permissions at scale\n\nUnity Catalog provides administrators a unified location to assign permissions for catalogs, databases, tables, and views to groups of users. 
Privileges and metastores are shared across workspaces, allowing administrators to set secure permissions once against groups synced from identity providers and know that end users only have access to the proper data in any Databricks workspace they enter. \nUnity Catalog also allows administrators to define storage credentials, a secure way to store and share permissions on cloud storage infrastructure. You can grant privileges on these securables to power users within the organization so they can define external locations against cloud object storage locations, allowing data engineers to self-service for new workloads without needing to provide elevated permissions in cloud account consoles.\n\n#### Data discovery and collaboration in the lakehouse\n##### Discover data on Databricks\n\nUsers can browse available data objects in Unity Catalog using [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html). Catalog Explorer uses the privileges configured by Unity Catalog administrators to ensure that users are only able to see catalogs, databases, tables, and views that they have permissions to query. Once users find a dataset of interest, they can review field names and types, read comments on tables and individual fields, and preview a sample of the data. Users can also review the full history of the table to understand when and how data has changed, and the lineage feature allows users to track how certain datasets are derived from upstream jobs and used in downstream jobs. 
\nStorage credentials and external locations are also displayed in Catalog Explorer, allowing each user to fully grasp the privileges they have to read and write data across available locations and resources.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/collaboration.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### Data discovery and collaboration in the lakehouse\n##### Accelerate time to production with the lakehouse\n\nDatabricks supports workloads in SQL, Python, Scala, and R, allowing users with diverse skill sets and technical backgrounds to leverage their knowledge to derive analytic insights. You can use all languages supported by Databricks to define production jobs, and notebooks can leverage a combination of languages. This means that you can promote queries written by SQL analysts for last mile ETL into production data engineering code with almost no effort. Queries and workloads defined by personas across the organization leverage the same datasets, so there\u2019s no need to reconcile field names or make sure dashboards are up to date before sharing code and results with other teams. You can securely share code, notebooks, queries, and dashboards, all powered by the same scalable cloud infrastructure and defined against the same curated data sources.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/collaboration.html"} +{"content":"# \n### Initial Setup\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. 
\n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/setup\/index.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Manage external locations\n\nThis article describes how to list, view, update, grant permissions on, and delete [external locations](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html). \nNote \nDatabricks recommends governing file access using volumes. See [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html).\n\n#### Manage external locations\n##### Describe an external location\n\nTo see the properties of an external location, you can use Catalog Explorer or a SQL command. \n1. In the sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. At the bottom of the screen, click **External Data > External Locations**.\n3. Click the name of an external location to view its properties. \nRun the following command in a notebook or the Databricks SQL editor. Replace `<location-name>` with the name of the external location. \n```\nDESCRIBE EXTERNAL LOCATION <location-name>;\n\n```\n\n#### Manage external locations\n##### Show grants on an external location\n\nTo show grants on an external location, use a command like the following. You can optionally filter the results to show only the grants for the specified principal. \n```\nSHOW GRANTS [<principal>] ON EXTERNAL LOCATION <location-name>;\n\n``` \nReplace the placeholder values: \n* `<location-name>`: The name of the external location that authorizes reading from and writing to the S3 bucket in your cloud tenant.\n* `<principal>`: The email address of an account-level user or the name of an account-level group. 
\nNote \nIf a group name contains a space, use back-ticks around it (not apostrophes).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/manage-external-locations.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Manage external locations\n##### Grant permissions on an external location\n\nThis section describes how to grant and revoke permissions on an external location using Catalog Explorer and SQL commands in a notebook or Databricks SQL query. For information about using the Databricks CLI or Terraform instead, see the [Databricks Terraform documentation](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/external_location) and [What is the Databricks CLI?](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). \nYou can grant the following permissions on an external location: \n* `CREATE EXTERNAL TABLE`\n* `CREATE EXTERNAL VOLUME`\n* `CREATE MANAGED STORAGE` \n**Permissions required**: The `CREATE EXTERNAL LOCATION` privilege on both the metastore and the storage credential referenced in the external location. Metastore admins have `CREATE EXTERNAL LOCATION` on the metastore by default. \nTo grant permission to use an external location: \n1. In the sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. At the bottom of the screen, click **External Data > External Locations**.\n3. Click the name of an external location to open its properties.\n4. Click **Permissions**.\n5. To grant permission to users or groups, select each identity, then click **Grant**.\n6. To revoke permissions from users or groups, select each identity, then click **Revoke**. \nRun the following SQL command in a notebook or SQL query editor. 
This example grants the ability to create an external table that references the external location: \n```\nGRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION <location-name> TO <principal>;\n\n``` \nReplace the placeholder values: \n* `<location-name>`: The name of the external location that authorizes reading from and writing to the S3 bucket in your cloud tenant.\n* `<principal>`: The email address of an account-level user or the name of an account-level group. \nNote \nIf a group name contains a space, use back-ticks around it (not apostrophes). \n### Change the owner of an external location \nAn external location\u2019s creator is its initial owner. To change the owner to a different account-level user or group, run the following command in a notebook or the Databricks SQL editor, or use [Catalog Explorer](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/ownership.html). Replace the placeholder values: \n* `<location-name>`: The name of the external location.\n* `<principal>`: The email address of an account-level user or the name of an account-level group. \n```\nALTER EXTERNAL LOCATION <location-name> OWNER TO <principal>;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/manage-external-locations.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Manage external locations\n##### Mark an external location as read-only\n\nIf you want users to have read-only access to an external location, you can use Catalog Explorer to mark the external location as read-only. 
\nMaking external locations read-only: \n* Prevents users from writing to files in those external locations, regardless of any write permissions granted by the IAM role that underlies the storage credential, and regardless of the Unity Catalog permissions granted on that external location.\n* Prevents users from creating tables or volumes (whether external or managed) in those external locations.\n* Enables the system to validate the external location properly at creation time. \nYou can mark external locations as read-only when you create them. \nYou can also use Catalog Explorer to change read-only status after creating an external location: \n1. In the sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. At the bottom of the screen, click **External Data > External Locations**.\n3. Select the external location, click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu (also known as the three-dot menu) on the object row, and select **Edit**.\n4. On the edit dialog, select the **Read only** option.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/manage-external-locations.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Manage external locations\n##### Modify an external location\n\nAn external location\u2019s owner can rename, change the URI, and change the storage credential of the external location. \nTo rename an external location, do the following: \nRun the following command in a notebook or the Databricks SQL editor. Replace the placeholder values: \n* `<location-name>`: The name of the location.\n* `<new-location-name>`: A new name for the location. 
\n```\nALTER EXTERNAL LOCATION <location-name> RENAME TO <new-location-name>;\n\n``` \nTo change the URI that an external location points to in your cloud tenant, do the following: \nRun the following command in a notebook or the Databricks SQL editor. Replace the placeholder values: \n* `<location-name>`: The name of the external location.\n* `<url>`: The new storage URL the location should authorize access to in your cloud tenant. \n```\nALTER EXTERNAL LOCATION <location-name> SET URL '<url>' [FORCE];\n\n``` \nThe `FORCE` option changes the URL even if external tables depend upon the external location. \nTo change the storage credential that an external location uses, do the following: \nRun the following command in a notebook or the Databricks SQL editor. Replace the placeholder values: \n* `<location-name>`: The name of the external location.\n* `<credential-name>`: The name of the storage credential that grants access to the location\u2019s URL in your cloud tenant. \n```\nALTER EXTERNAL LOCATION <location-name> SET STORAGE CREDENTIAL <credential-name>;\n\n```\n\n#### Manage external locations\n##### Delete an external location\n\nTo delete (drop) an external location you must be its owner. To delete an external location, do the following: \nRun the following command in a notebook or the Databricks SQL editor. Items in brackets are optional. Replace `<location-name>` with the name of the external location. 
\n```\nDROP EXTERNAL LOCATION [IF EXISTS] <location-name>;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/manage-external-locations.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### The scope of the lakehouse platform\n#### A modern data and AI platform framework\n\nTo discuss the scope of the Databricks Data intelligence Platform, it is helpful to first define a basic framework for the modern data and AI platform: \n![Cloud data analytics framework](https:\/\/docs.databricks.com\/_images\/scope-cloud-data-framework.png)\n\n### The scope of the lakehouse platform\n#### Overview of the lakehouse scope\n\nThe Databricks Data Intelligence Platform covers the complete modern data platform framework. It is built on the lakehouse architecture and powered by a data intelligence engine that understands the unique qualities of your data. It is an open and unified foundation for ETL, ML\/AI, and DWH\/BI workloads, and provides Unity Catalog as the central data and AI governance solution.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/scope.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### The scope of the lakehouse platform\n#### Personas of the platform framework\n\nThe framework covers the primary data team members (personas) working with the applications in the framework: \n* **Data engineers** provide data scientists and business analysts with accurate and reproducible data for timely decision-making and real-time insights. They implement highly consistent and reliable ETL processes to increase user confidence and trust in data. They ensure that data is well integrated with the various pillars of the business and typically follow software engineering best practices.\n* **Data scientists** blend analytical expertise and business understanding to transform data into strategic insights and predictive models. 
They are adept at translating business challenges into data-driven solutions, be that through retrospective analytical insights or forward-looking predictive modeling. Leveraging data modeling and machine learning techniques, they design, develop, and deploy models that unveil patterns, trends, and forecasts from data. They act as a bridge, converting complex data narratives into comprehensible stories, ensuring business stakeholders not only understand but can also act upon the data-driven recommendations, in turn driving a data-centric approach to problem-solving within an organization.\n* **ML engineers** (machine learning engineers) lead the practical application of data science in products and solutions by building, deploying, and maintaining machine learning models. Their primary focus pivots towards the engineering aspect of model development and deployment. ML Engineers ensure the robustness, reliability, and scalability of machine learning systems in live environments, addressing challenges related to data quality, infrastructure, and performance. By integrating AI and ML models into operational business processes and user-facing products, they facilitate the utilization of data science in solving business challenges, ensuring models don\u2019t just stay in research but drive tangible business value.\n* **Business analysts** empower stakeholders and business teams with actionable data. They often interpret data and create reports or other documentation for leadership using standard BI tools. They are typically the go-to point of contact for non-technical business and operations colleagues for quick analysis questions.\n* **Business partners** are an important stakeholder in an increasingly networked business world. They are defined as a company or individual with whom a business has a formal relationship to achieve a common goal, and can include vendors, suppliers, distributors, and other third-party partners. 
Data sharing is an important aspect of business partnerships, as it enables the transfer and exchange of data to enhance collaboration and data-driven decision-making.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/scope.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### The scope of the lakehouse platform\n#### Domains of the platform framework\n\nThe platform consists of multiple domains: \n* **Storage:** In the cloud, data is mainly stored in scalable, efficient, and resilient object storage provided by the cloud providers.\n* **Governance:** Capabilities around data governance, such as access control, auditing, metadata management, lineage tracking, and monitoring for all data and AI assets.\n* **AI engine:** The AI engine provides generative AI capabilities for the whole platform.\n* **Ingest & transform:** The capabilities for ETL workloads.\n* **Advanced analytics, ML & AI:** All capabilities around machine learning, AI, generative AI, and streaming analytics.\n* **Data warehouse:** The domain supporting DWH and BI use cases.\n* **Orchestration:** The domain for central workflow management.\n* **ETL & DS tools:** The front-end tools that data engineers, data scientists, and ML engineers primarily use for work.\n* **BI tools:** The front-end tools that BI analysts primarily use for work.\n* **Collaboration:** Capabilities for data sharing between two or more parties.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/scope.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### The scope of the lakehouse platform\n#### The scope of the Databricks Platform\n\nThe Databricks Data Intelligence Platform and its components can be mapped to the framework in the following way: \n![Scope of the lakehouse](https:\/\/docs.databricks.com\/_images\/scope-lakehouse-aws.png) \n**[Download: Scope of the lakehouse - Databricks 
components](https:\/\/docs.databricks.com\/_extras\/documents\/scope-of-the-data-intelligence-platform-aws.pdf)** \nMost importantly, the Databricks Data Intelligence Platform covers all relevant workloads for the data domain in one platform, with [Apache Spark](https:\/\/docs.databricks.com\/spark\/index.html)\/[Photon](https:\/\/docs.databricks.com\/compute\/photon.html) as the engine: \n* **Ingestion & transform** \nFor data ingestion, [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html) incrementally and automatically processes files landing in cloud storage in scheduled or continuous jobs - without the need to manage state information. Once ingested, raw data needs to be transformed so it\u2019s ready for BI and ML\/AI. Databricks provides powerful ETL capabilities for data engineers, data scientists, and analysts. \n[Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html) (DLT) allows ETL jobs to be written in a declarative way, simplifying the entire implementation process. Data quality can be improved by defining [data expectations](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html).\n* **Advanced analytics, ML & AI** \nThe platform comes with *Databricks Mosaic AI*, a set of fully integrated machine learning and AI tools for [traditional machine and deep learning](https:\/\/docs.databricks.com\/machine-learning\/index.html) as well as [generative AI and large language models (LLMs)](https:\/\/docs.databricks.com\/generative-ai\/generative-ai.html). 
It covers the entire workflow from [preparing data](https:\/\/docs.databricks.com\/machine-learning\/data-preparation.html) to building [machine learning](https:\/\/docs.databricks.com\/machine-learning\/train-model\/index.html) and [deep learning](https:\/\/docs.databricks.com\/machine-learning\/train-model\/deep-learning.html) models, to [Mosaic AI Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html). \n[Spark Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/index.html) and [DLT](https:\/\/docs.databricks.com\/delta-live-tables\/index.html) enable real-time analytics.\n* **Data warehouse** \nThe Databricks Data Intelligence Platform also provides a complete data warehouse solution with [Databricks SQL](https:\/\/docs.databricks.com\/sql\/index.html), centrally governed by [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) with fine-grained access control. \nThe Databricks Data Intelligence Platform features map to the other layers of the framework as follows, from bottom to top: \n* **Cloud storage** \nAll data for the lakehouse is stored in the cloud provider\u2019s object storage. Databricks supports three cloud providers: AWS, Azure, and GCP. Files in various structured and semi-structured formats (e.g., Parquet, CSV, JSON, Avro) as well as unstructured formats (e.g., images) are ingested and transformed using either batch or streaming processes. \n[Delta Lake](https:\/\/docs.databricks.com\/delta\/index.html) is the recommended data format for the lakehouse (file transactions, reliability, consistency, updates, and so on) and is completely open source to avoid lock-in. In addition, [Delta Universal Format (UniForm)](https:\/\/docs.databricks.com\/delta\/uniform.html) allows you to read Delta tables with Iceberg reader clients. 
\nNo proprietary data formats are used in the Databricks Data Intelligence Platform.\n* **Data governance** \nOn top of the storage layer, [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) offers a wide range of data governance capabilities, including [metadata management](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html) in the metastore, [access control](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html), [auditing](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/audit.html), [data discovery](https:\/\/docs.databricks.com\/catalog-explorer\/index.html), and [data lineage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html). \n[Lakehouse monitoring](https:\/\/docs.databricks.com\/lakehouse-monitoring\/index.html) provides out-of-the-box quality metrics for data and AI assets, and auto-generated dashboards to visualize these metrics. \nExternal SQL sources can be integrated into the lakehouse and Unity Catalog through [lakehouse federation](https:\/\/docs.databricks.com\/query-federation\/index.html).\n* **AI engine** \nThe Data Intelligence Platform is built on the lakehouse architecture and enhanced by the data intelligence engine [DatabricksIQ](https:\/\/docs.databricks.com\/databricksiq\/index.html). DatabricksIQ combines generative AI with the unification benefits of the lakehouse architecture to understand the unique semantics of your data. Intelligent Search and the [Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html) are examples of AI-powered services that simplify working with the platform for every user.\n* **Orchestration** \n[Databricks Workflows](https:\/\/docs.databricks.com\/workflows\/index.html) enable you to run diverse workloads for the full data and AI lifecycle on any cloud. 
They allow you to orchestrate jobs as well as Delta Live Tables for SQL, Spark, notebooks, dbt, ML models, and more.\n* **ETL & DS tools** \nAt the consumption layer, data engineers and ML engineers typically work with the platform using [IDEs](https:\/\/docs.databricks.com\/dev-tools\/index.html). Data scientists often prefer [notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html) and use the ML & AI runtimes, and the machine learning workflow system [MLflow](https:\/\/docs.databricks.com\/mlflow\/index.html) to track experiments and manage the model lifecycle.\n* **BI tools** \nBusiness analysts typically use their preferred BI tool to access the Databricks data warehouse. Databricks SQL can be queried by a range of analysis and BI tools. See [BI and visualization](https:\/\/docs.databricks.com\/sql\/index.html). \nIn addition, the platform offers query and analysis tools out of the box: \n+ [Dashboards](https:\/\/docs.databricks.com\/dashboards\/index.html) to build data visualizations and share insights in a drag-and-drop manner.\n+ [SQL editor](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html) for SQL analysts to analyze data.\n* **Collaboration** \n[Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html) is an [open protocol](https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html) developed by Databricks for secure data sharing with other organizations regardless of the computing platforms they use. \n[Databricks Marketplace](https:\/\/docs.databricks.com\/marketplace\/index.html) is an open forum for exchanging data products. 
It takes advantage of Delta Sharing to give data providers the tools to share data products securely and data consumers the power to explore and expand their access to the data and data services they need.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/scope.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Create visualizations with Databricks Assistant\n\nPreview \nThis feature is currently in Public Preview. \nWhen drafting a dashboard, users can prompt the Databricks Assistant to build charts from any previously defined dashboard dataset. When you want to create a new chart, you can do so by asking what you want to learn from the data. You can also use the Assistant to edit a chart. After generating a chart, you can interact with and edit the generated visualization using the configuration panel. Users should always review visualizations generated by the Assistant to verify correctness. \n![Example showing chart creation using Databricks Assistant.](https:\/\/docs.databricks.com\/_images\/lakeview-db-assist-demo.gif)\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-w-db-assistant.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Create visualizations with Databricks Assistant\n###### How do I use a prompt to generate a visualization?\n\nThe following explains how to create visualizations using an existing dashboard where a dashboard dataset has already been defined. To learn how to create a new dashboard, see [Create a dashboard](https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-dashboard.html). \n1. Click ![Dashboards Icon](https:\/\/docs.databricks.com\/_images\/dashboards-icon.png) **Dashboards** to open the dashboards listing page.\n2. Click a dashboard title to start editing.\n3. Click the **Data** tab to see what dataset has been defined or used in the dashboard. 
\n![The dataset used in this example is defined by a SQL query on the samples catalog](https:\/\/docs.databricks.com\/_images\/lakeview-data-tab-db-assist.png) \n* To generate charts, at least one dataset must be identified in this section.\n* If more than one dataset is specified, Databricks Assistant will attempt to find the best dataset to respond to the user\u2019s input.\n4. Create a visualization widget. \n* Click ![Create Icon](https:\/\/docs.databricks.com\/_images\/lakeview-create.png) **Create a visualization** to create a visualization widget and use your mouse to place it on the canvas.\n5. Type a prompt into the visualization widget and press Enter. \n* The Assistant can take a moment to generate a response. After it generates a chart, you can choose to **Accept** or **Reject** it.\n* If the chart isn\u2019t what you want, retry the input using the ![Regenerate chart icon](https:\/\/docs.databricks.com\/_images\/lakeview-regenerate-chart.png) **Regenerate** button.\n* You can also edit the input and then retry. The updated prompt will modify the previously generated chart.\n6. You can edit the chart using Databricks Assistant or the configuration panel. With your visualization widget selected: \n* Click the ![Databricks Assistant](https:\/\/docs.databricks.com\/_images\/lakeview-sparkle-assist.png) **Assistant** icon. An input prompt appears. Enter a new prompt for your chart.\n* Use the configuration panel on the right side of the screen to adjust the existing chart.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-w-db-assistant.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Create visualizations with Databricks Assistant\n###### Supported capabilities\n\nDatabricks Assistant supports simple chart creation. 
\n### Number of fields and input \nYou can compare up to three fields and answer inputs like: \n* What were the average sales of my product?\n* What were the average sales of my product per week?\n* What were the average sales of my product per week by region? \n### Visualization types \nSupported visualization types include bar, line, area, scatter, pie, and counter. For any chart type, the following settings are supported: \n* Choosing fields for the X, Y, and Color-by encodings.\n* Choosing an available transformation, like SUM or AVG. \nConfiguration settings like Sorting, Title, Description, Labels, and choosing specific colors are not supported.\n\n##### Create visualizations with Databricks Assistant\n###### Tips for increasing the accuracy of created visualizations\n\n* Be specific. Specify the chart type and necessary fields with as much detail as possible.\n* Databricks Assistant has access only to table and column metadata and does not have access to row-level data. Thus, it might not create visualizations correctly if a question relies on the specific values in the data.\n\n##### Create visualizations with Databricks Assistant\n###### What is Databricks Assistant?\n\nDatabricks Assistant is an AI companion that enables users to be more efficient on the Databricks Platform. Visualization creation with Databricks Assistant is intended to help quickly answer questions about dashboard datasets. Its answers are based on table and column metadata. It declines to answer if it cannot find metadata related to the user question. The AI is new, so mistakes are possible. Use the visualization editor to verify that the appropriate fields have been correctly selected. \nCreating visualizations with the Databricks Assistant requires enabling **Partner-powered AI assistive features**. For details on enabling Databricks Assistant, see [What is Databricks Assistant?](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html). 
For questions about privacy and security, see [Privacy and security](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html#privacy-security).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-w-db-assistant.html"} +{"content":"# Technology partners\n## Connect to security partners using Partner Connect\n#### Connect to Hunters\n\nThe Databricks integration with Hunters, a cloud-based platform for security operations, provides security data ingestion, built-in threat detection, investigation, and response.\n\n#### Connect to Hunters\n##### Connect to Hunters using Partner Connect\n\nTo connect to Hunters using Partner Connect, follow the steps in [Connect to security partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/data-security.html).\n\n#### Connect to Hunters\n##### Connect to Hunters manually\n\nTo connect to Databricks from Hunters manually, see [Set up Databricks as your data lake](https:\/\/docs.hunters.ai\/hunters\/docs\/set-up-databricks-as-your-data-lake) in the Hunters documentation.\n\n#### Connect to Hunters\n##### Additional resources\n\n* [Hunters website](https:\/\/www.hunters.security\/)\n* [Hunters documentation](https:\/\/docs.hunters.ai\/hunters\/docs)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/data-security\/hunters.html"} +{"content":"# Connect to data sources\n### Configure streaming data sources\n\nDatabricks can integrate with stream messaging services for near-real time data ingestion into the Databricks lakehouse. Databricks can also sync enriched and transformed data in the lakehouse with other streaming systems. \nStructured Streaming provides native streaming access to file formats supported by Apache Spark, but Databricks recommends Auto Loader for most Structured Streaming operations that read data from cloud object storage. See [What is Auto Loader?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html). 
\nIngesting streaming messages to Delta Lake allows you to retain messages indefinitely, allowing you to replay data streams without fear of losing data due to retention thresholds. \nTo learn more about specific configurations for streaming from message queues, see: \n* [Kafka](https:\/\/docs.databricks.com\/connect\/streaming\/kafka.html)\n* [Kinesis](https:\/\/docs.databricks.com\/connect\/streaming\/kinesis.html)\n* [Pub\/Sub](https:\/\/docs.databricks.com\/connect\/streaming\/pub-sub.html)\n* [Pulsar](https:\/\/docs.databricks.com\/connect\/streaming\/pulsar.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n\nImportant \nThis documentation covers the Workspace Model Registry. If your workspace is enabled for Unity Catalog, do not use the procedures on this page. Instead, see [Models in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html). \nFor guidance on how to upgrade from the Workspace Model Registry to Unity Catalog, see [Migrate workflows and models to Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#migrate-models-to-uc). 
\nIf your workspace\u2019s [default catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#view-the-current-default-catalog) is in Unity Catalog (rather than `hive_metastore`) and you are running a cluster using Databricks Runtime 13.3 LTS or above, models are automatically created in and loaded from the workspace default catalog, with no configuration required. To use the Workspace Model Registry in this case, you must explicitly target it by running `import mlflow; mlflow.set_registry_uri(\"databricks\")` at the start of your workload. A small number of workspaces where both the default catalog was configured to a catalog in Unity Catalog prior to January 2024 and the workspace model registry was used prior to January 2024 are exempt from this behavior and continue to use the Workspace Model Registry by default. \nStarting in April 2024, Databricks disabled Workspace Model Registry for workspaces in new accounts where the workspace\u2019s [default catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#view-the-current-default-catalog) is in Unity Catalog. \nThis article describes how to use the Workspace Model Registry as part of your machine learning workflow to manage the full lifecycle of ML models. The Workspace Model Registry is a Databricks-provided, hosted version of the MLflow Model Registry. \nThe Workspace Model Registry provides: \n* Chronological model lineage (which MLflow experiment and run produced the model at a given time).\n* [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html).\n* Model versioning.\n* Stage transitions (for example, from staging to production or archived).\n* [Webhooks](https:\/\/docs.databricks.com\/mlflow\/model-registry-webhooks.html) so you can automatically trigger actions based on registry events.\n* Email notifications of model events. \nYou can also create and view model descriptions and leave comments. 
\nThis article includes instructions for both the Workspace Model Registry UI and the Workspace Model Registry API. \nFor an overview of Workspace Model Registry concepts, see [ML lifecycle management using MLflow](https:\/\/docs.databricks.com\/mlflow\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Create or register a model\n\nYou can create or register a model using the UI, or [register a model using the API](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#register-model-api). \n### Create or register a model using the UI \nThere are two ways to register a model in the Workspace Model Registry. You can register an existing model that has been logged to MLflow, or you can create and register a new, empty model and then assign a previously logged model to it. \n#### Register an existing logged model from a notebook \n1. In the Workspace, identify the MLflow run containing the model you want to register. \n1. Click the **Experiment** icon ![Experiment icon](https:\/\/docs.databricks.com\/_images\/experiment1.png) in the notebook\u2019s right sidebar. \n![Notebook toolbar](https:\/\/docs.databricks.com\/_images\/notebook-toolbar.png)\n2. In the Experiment Runs sidebar, click the ![External Link](https:\/\/docs.databricks.com\/_images\/external-link.png) icon next to the date of the run. The MLflow Run page displays. This page shows details of the run including parameters, metrics, tags, and list of artifacts.\n2. In the Artifacts section, click the directory named **xxx-model**. \n![Register model](https:\/\/docs.databricks.com\/_images\/register-model.png)\n3. 
Click the **Register Model** button at the far right.\n4. In the dialog, click in the **Model** box and do one of the following: \n* Select **Create New Model** from the drop-down menu. The **Model Name** field appears. Enter a model name, for example `scikit-learn-power-forecasting`.\n* Select an existing model from the drop-down menu.\n![Create new model](https:\/\/docs.databricks.com\/_images\/create-model.png)\n5. Click **Register**. \n* If you selected **Create New Model**, this registers a model named `scikit-learn-power-forecasting`, copies the model into a secure location managed by the Workspace Model Registry, and creates a new version of the model.\n* If you selected an existing model, this registers a new version of the selected model. \nAfter a few moments, the **Register Model** button changes to a link to the new registered model version. \n![Select newly created model](https:\/\/docs.databricks.com\/_images\/registered-model-version.png)\n6. Click the link to open the new model version in the Workspace Model Registry UI. You can also find the model in the Workspace Model Registry by clicking ![Models Icon](https:\/\/docs.databricks.com\/_images\/models-icon.png) **Models** in the sidebar. \n#### Create a new registered model and assign a logged model to it \nYou can use the Create Model button on the registered models page to create a new, empty model and then assign a logged model to it. Follow these steps: \n1. On the registered models page, click **Create Model**. Enter a name for the model and click **Create**.\n2. Follow Steps 1 through 3 in [Register an existing logged model from a notebook](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#register-an-existing-logged-model-from-a-notebook).\n3. In the Register Model dialog, select the name of the model you created in Step 1 and click **Register**. 
This registers a model with the name you created, copies the model into a secure location managed by the Workspace Model Registry, and creates a model version: `Version 1`. \nAfter a few moments, the MLflow Run UI replaces the Register Model button with a link to the new registered model version. You can now select the model from the **Model** drop-down list in the Register Model dialog on the **Experiment Runs** page. You can also register new versions of the model by specifying its name in API commands like [Create ModelVersion](https:\/\/mlflow.org\/docs\/latest\/rest-api.html#create-modelversion). \n### Register a model using the API \nThere are three programmatic ways to register a model in the Workspace Model Registry. All methods copy the model into a secure location managed by the Workspace Model Registry. \n* To log a model and register it with the specified name during an MLflow experiment, use the `mlflow.<model-flavor>.log_model(...)` method. If a registered model with the name doesn\u2019t exist, the method registers a new model, creates Version 1, and returns a `ModelVersion` MLflow object. If a registered model with the name exists already, the method creates a new model version and returns the version object. \n```\nwith mlflow.start_run(run_name=<run-name>) as run:\n    ...\n    mlflow.<model-flavor>.log_model(\n        <model-flavor>=<model>,\n        artifact_path=\"<model-path>\",\n        registered_model_name=\"<model-name>\"\n    )\n\n```\n* To register a model with the specified name after all your experiment runs complete and you have decided which model is most suitable to add to the registry, use the `mlflow.register_model()` method. For this method, you need the run ID to build the `runs:` URI passed as the `model_uri` argument. If a registered model with the name doesn\u2019t exist, the method registers a new model, creates Version 1, and returns a `ModelVersion` MLflow object. If a registered model with the name exists already, the method creates a new model version and returns the version object. 
\n```\nimport mlflow\nresult = mlflow.register_model(\"runs:\/<run-id>\/<model-path>\", \"<model-name>\")\n\n```\n* To create a new registered model with the specified name, use the MLflow Client API `create_registered_model()` method. If the model name exists, this method throws an `MlflowException`. \n```\nfrom mlflow.tracking import MlflowClient\n\nclient = MlflowClient()\nresult = client.create_registered_model(\"<model-name>\")\n\n``` \nYou can also register a model with the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_mlflow\\_model](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/mlflow_model).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Quota limits\n\nStarting May 2024 for all Databricks workspaces, the Workspace Model Registry imposes quota limits on the total number of registered models and model versions per workspace. See [Resource limits](https:\/\/docs.databricks.com\/resources\/limits.html). If you exceed the registry quotas, Databricks recommends that you delete registered models and model versions that you no longer need. Databricks also recommends that you adjust your model registration and retention strategy to stay under the limit. If you require an increase to your workspace limits, reach out to your Databricks account team. \nThe following notebook illustrates how to inventory and delete your model registry entities. 
\n### Inventory workspace model registry entities notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-model-registry-inventory.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### View models in the UI\n\n### Registered models page \nThe registered models page displays when you click ![Models Icon](https:\/\/docs.databricks.com\/_images\/models-icon.png) **Models** in the sidebar. This page shows all of the models in the registry. \nYou can [create a new model](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#create-a-new-registered-model-and-assign-a-logged-model-to-it) from this page. \nAlso from this page, workspace administrators can [set permissions for all models in the Workspace Model Registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#permissions). \n![Registered models](https:\/\/docs.databricks.com\/_images\/registered-models.png) \n### Registered model page \nTo display the registered model page for a model, click a model name in the registered models page. The registered model page shows information about the selected model and a table with information about each version of the model. 
From this page, you can also: \n* Set up [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html).\n* [Automatically generate a notebook to use the model for inference](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#use-model-for-inference).\n* [Configure email notifications](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#email-notification).\n* [Compare model versions](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#compare-model-versions).\n* [Set permissions for the model](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#permissions).\n* [Delete a model](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#delete-a-model-or-model-version). \n![Registered model](https:\/\/docs.databricks.com\/_images\/registered-model1.png) \n### Model version page \nTo view the model version page, do one of the following: \n* Click a version name in the **Latest Version** column on the registered models page.\n* Click a version name in the **Version** column in the registered model page. \nThis page displays information about a specific version of a registered model and also provides a link to the source run (the version of the notebook that was run to create the model). From this page, you can also: \n* [Automatically generate a notebook to use the model for inference](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#use-model-for-inference).\n* [Delete a model](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#delete-a-model-or-model-version). 
\n![Model version](https:\/\/docs.databricks.com\/_images\/model-version1.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Control access to models\n\nYou must have at least CAN MANAGE permission to configure permissions on a model. For information on model permission levels, see [MLflow model ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#models). A model version inherits permissions from its parent model. You cannot set permissions for model versions. \n1. In the sidebar, click ![Models Icon](https:\/\/docs.databricks.com\/_images\/models-icon.png) **Models**.\n2. Select a model name.\n3. Click **Permissions**. The Permission Settings dialog opens. \n![Model permissions button](https:\/\/docs.databricks.com\/_images\/model-permission.png)\n4. In the dialog, select the **Select User, Group or Service Principal\u2026** drop-down and select a user, group, or service principal. \n![Change MLflow model permissions](https:\/\/docs.databricks.com\/_images\/select-permission.png)\n5. Select a permission from the permission drop-down.\n6. Click **Add**, then click **Save**. 
\nWorkspace admins and users with CAN MANAGE permission at the registry-wide level can set permission levels on all models in the workspace by clicking **Permissions** on the Models page.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Transition a model stage\n\nA model version has one of the following stages: **None**, **Staging**, **Production**, or **Archived**. The **Staging** stage is meant for model testing and validating, while the **Production** stage is for model versions that have completed the testing or review processes and have been deployed to applications for live scoring. An Archived model version is assumed to be inactive, at which point you can consider [deleting it](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#delete-a-model-or-model-version). Different versions of a model can be in different stages. \nA user with appropriate [permission](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#permissions) can transition a model version between stages. If you have permission to transition a model version to a particular stage, you can make the transition directly. If you do not have permission, you can request a stage transition and a user that has permission to transition model versions can [approve, reject, or cancel the request](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#str). \nYou can transition a model stage using the UI or [using the API](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#transition-stage-api). 
\n### Transition a model stage using the UI \nFollow these instructions to transition a model\u2019s stage. \n1. To display the list of available model stages and your available options, in a model version page, click the drop down next to **Stage:** and request or select a transition to another stage. \n![Stage transition options](https:\/\/docs.databricks.com\/_images\/stage-options.png)\n2. Enter an optional comment and click **OK**. \n#### Transition a model version to the Production stage \nAfter testing and validation, you can transition or request a transition to the Production stage. \nWorkspace Model Registry allows more than one version of the registered model in each stage. If you want to have only one version in Production, you can transition all versions of the model currently in Production to Archived by checking **Transition existing Production model versions to Archived**. \n#### Approve, reject, or cancel a model version stage transition request \nA user without stage transition permission can request a stage transition. The request appears in the **Pending Requests** section in the model version page: \n![Transition to production](https:\/\/docs.databricks.com\/_images\/handle-transition-request.png) \nTo approve, reject, or cancel a stage transition request, click the **Approve**, **Reject**, or **Cancel** link. \nThe creator of a transition request can also cancel the request. \n#### View model version activities \nTo view all the transitions requested, approved, pending, and applied to a model version, go to the Activities section. This record of activities provides a lineage of the model\u2019s lifecycle for auditing or inspection. \n### Transition a model stage using the API \nUsers with appropriate [permissions](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#permissions) can transition a model version to a new stage. 
\nTo transition a model version to a new stage, use the MLflow Client API `transition_model_version_stage()` method: \n```\nfrom mlflow.tracking import MlflowClient\n\nclient = MlflowClient()\nclient.transition_model_version_stage(\n    name=\"<model-name>\",\n    version=<model-version>,\n    stage=\"<stage>\",\n    description=\"<description>\"\n)\n\n``` \nThe accepted values for `<stage>` are: `\"Staging\"|\"staging\"`, `\"Archived\"|\"archived\"`, `\"Production\"|\"production\"`, `\"None\"|\"none\"`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Use model for inference\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nAfter a model is registered in the Workspace Model Registry, you can automatically generate a notebook to use the model for batch or streaming inference. Alternatively, you can create an endpoint to use the model for real-time serving with [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html). \nIn the upper-right corner of the [registered model page](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#registered-model-page) or the [model version page](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#model-version-page), click ![use model button](https:\/\/docs.databricks.com\/_images\/use-model-for-inference.png). The Configure model inference dialog appears, which allows you to configure batch, streaming, or real-time inference. \nImportant \nAnaconda Inc. updated their [terms of service](https:\/\/www.anaconda.com\/terms-of-service) for anaconda.org channels. 
Based on the new terms of service, you may require a commercial license if you rely on Anaconda\u2019s packaging and distribution. See [Anaconda Commercial Edition FAQ](https:\/\/www.anaconda.com\/blog\/anaconda-commercial-edition-faq) for more information. Your use of any Anaconda channels is governed by their terms of service. \nMLflow models logged before [v1.18](https:\/\/mlflow.org\/news\/2021\/06\/18\/1.18.0-release\/index.html) (Databricks Runtime 8.3 ML or earlier) were by default logged with the conda `defaults` channel (<https:\/\/repo.anaconda.com\/pkgs\/>) as a dependency. Because of this license change, Databricks has stopped the use of the `defaults` channel for models logged using MLflow v1.18 and above. The default channel logged is now `conda-forge`, which points at the community managed <https:\/\/conda-forge.org\/>. \nIf you logged a model before MLflow v1.18 without excluding the `defaults` channel from the conda environment for the model, that model may have a dependency on the `defaults` channel that you may not have intended.\nTo manually confirm whether a model has this dependency, you can examine the `channels` value in the `conda.yaml` file that is packaged with the logged model. For example, a model\u2019s `conda.yaml` with a `defaults` channel dependency may look like this: \n```\nchannels:\n- defaults\ndependencies:\n- python=3.8.8\n- pip\n- pip:\n  - mlflow\n  - scikit-learn==0.23.2\n  - cloudpickle==1.6.0\nname: mlflow-env\n\n``` \nBecause Databricks cannot determine whether your use of the Anaconda repository to interact with your models is permitted under your relationship with Anaconda, Databricks is not forcing its customers to make any changes. If your use of the Anaconda.com repo through the use of Databricks is permitted under Anaconda\u2019s terms, you do not need to take any action. 
\nIf you would like to change the channel used in a model\u2019s environment, you can re-register the model to the Workspace model registry with a new `conda.yaml`. You can do this by specifying the channel in the `conda_env` parameter of `log_model()`. \nFor more information on the `log_model()` API, see the MLflow documentation for the model flavor you are working with, for example, [log\\_model for scikit-learn](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.sklearn.html#mlflow.sklearn.log_model). \nFor more information on `conda.yaml` files, see the [MLflow documentation](https:\/\/www.mlflow.org\/docs\/latest\/models.html#additional-logged-files). \n![Configure model inference dialog](https:\/\/docs.databricks.com\/_images\/configure-model-inference.png) \n### Configure batch inference \nWhen you follow these steps to create a batch inference notebook, the notebook is saved in your user folder under the `Batch-Inference` folder in a folder with the model\u2019s name. You can edit the notebook as needed. \n1. Click the **Batch inference** tab.\n2. From the **Model version** drop-down, select the model version to use. The first two items in the drop-down are the current Production and Staging version of the model (if they exist). When you select one of these options, the notebook automatically uses the Production or Staging version as of the time it is run. You do not need to update the notebook as you continue to develop the model.\n3. Click the **Browse** button next to **Input table**. The **Select input data** dialog appears. If necessary, you can change the cluster in the **Compute** drop-down. \nNote \nFor Unity Catalog enabled workspaces, the **Select input data** dialog allows you to select from three levels, `<catalog-name>.<database-name>.<table-name>`.\n4. Select the table containing the input data for the model, and click **Select**. The generated notebook automatically imports this data and sends it to the model. 
You can edit the generated notebook if the data requires any transformations before it is input to the model.\n5. Predictions are saved in a folder in the directory `dbfs:\/FileStore\/batch-inference`. By default, predictions are saved in a folder with the same name as the model. Each run of the generated notebook writes a new file to this directory with the timestamp appended to the name. You can also choose not to include the timestamp and to overwrite the file with subsequent runs of the notebook; instructions are provided in the generated notebook. \nYou can change the folder where the predictions are saved by typing a new folder name into the **Output table location** field or by clicking the folder icon to browse the directory and select a different folder. \nTo save predictions to a location in Unity Catalog, you must edit the notebook. For an example notebook that shows how to train a machine-learning model that uses data in Unity Catalog and write the results back to Unity Catalog, see [Machine learning tutorial](https:\/\/docs.databricks.com\/machine-learning\/train-model\/scikit-learn.html). \n### Configure streaming inference using Delta Live Tables \nWhen you follow these steps to create a streaming inference notebook, the notebook is saved in your user folder under the `DLT-Inference` folder in a folder with the model\u2019s name. You can edit the notebook as needed. \n1. Click the **Streaming (Delta Live Tables)** tab.\n2. From the **Model version** drop-down, select the model version to use. The first two items in the drop-down are the current Production and Staging version of the model (if they exist). When you select one of these options, the notebook automatically uses the Production or Staging version as of the time it is run. You do not need to update the notebook as you continue to develop the model.\n3. Click the **Browse** button next to **Input table**. The **Select input data** dialog appears. 
If necessary, you can change the cluster in the **Compute** drop-down. \nNote \nFor Unity Catalog enabled workspaces, the **Select input data** dialog allows you to select from three levels, `<catalog-name>.<database-name>.<table-name>`.\n4. Select the table containing the input data for the model, and click **Select**. The generated notebook creates a data transform that uses the input table as a source and integrates the MLflow [PySpark inference UDF](https:\/\/mlflow.org\/docs\/latest\/models.html#export-a-python-function-model-as-an-apache-spark-udf) to perform model predictions. You can edit the generated notebook if the data requires any additional transformations before or after the model is applied.\n5. Provide the output Delta Live Table name. The notebook creates a live table with the given name and uses it to store the model predictions. You can modify the generated notebook to customize the target dataset as needed - for example: define a streaming live table as output, add schema information or data quality constraints.\n6. You can then either create a new Delta Live Tables pipeline with this notebook or add it to an existing pipeline as an additional notebook library. \n### Configure real-time inference \n[Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) exposes your MLflow machine learning models as scalable REST API endpoints. To create a Model Serving endpoint, see [Create custom model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html). \n### Provide feedback \nThis feature is in preview, and we would love to get your feedback. 
To provide feedback, click `Provide Feedback` in the Configure model inference dialog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Compare model versions\n\nYou can compare model versions in the Workspace Model Registry. \n1. On the [registered model page](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#registered-model-page), select two or more model versions by clicking in the checkbox to the left of the model version.\n2. Click **Compare**.\n3. The Comparing `<N>` Versions screen appears, showing a table that compares the parameters, schema, and metrics of the selected model versions. At the bottom of the screen, you can select the type of plot (scatter, contour, or parallel coordinates) and the parameters or metrics to plot.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Control notification preferences\n\nYou can configure Workspace Model Registry to notify you by email about activity on registered models and model versions that you specify. \nOn the registered model page, the **Notify me about** menu shows three options: \n![Email notifications menu](https:\/\/docs.databricks.com\/_images\/email-notifications-menu.png) \n* **All new activity**: Send email notifications about all activity on all model versions of this model. 
If you created the registered model, this setting is the default.\n* **Activity on versions I follow**: Send email notifications only about model versions you follow. With this selection, you receive notifications for all model versions that you follow; you cannot turn off notifications for a specific model version.\n* **Mute notifications**: Do not send email notifications about activity on this registered model. \nThe following events trigger an email notification: \n* Creation of a new model version\n* Request for a stage transition\n* Stage transition\n* New comments \nYou are automatically subscribed to model notifications when you do any of the following: \n* Comment on that model version\n* Transition a model version\u2019s stage\n* Make a transition request for the model\u2019s stage \nTo see if you are following a model version, look at the Follow Status field on the [model version page](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#model-version-page), or at the table of model versions on the [registered model page](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#registered-model-page). \n### Turn off all email notifications \nYou can turn off email notifications in the Workspace Model Registry Settings tab of the User Settings menu: \n1. Click your username in the upper-right corner of the Databricks workspace, and select **Settings** from the drop-down menu.\n2. In the **Settings** sidebar, select **Notifications**.\n3. Turn off **Model Registry email notifications**. \nAn account admin can turn off email notifications for the entire organization in the [admin settings page](https:\/\/docs.databricks.com\/admin\/index.html#admin-settings). \n### Maximum number of emails sent \nWorkspace Model Registry limits the number of emails sent to each user per day per activity. 
For example, if you receive 20 emails in one day about new model versions created for a registered model, Workspace Model Registry sends an email noting that the daily limit has been reached, and no additional emails about that event are sent until the next day. \nTo increase the limit of the number of emails allowed, contact your Databricks account team.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Webhooks\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \n[Webhooks](https:\/\/docs.databricks.com\/mlflow\/model-registry-webhooks.html) enable you to listen for Workspace Model Registry events so your integrations can automatically trigger actions. You can use webhooks to automate and integrate your machine learning pipeline with existing CI\/CD tools and workflows. For example, you can trigger CI builds when a new model version is created or notify your team members through Slack each time a model transition to production is requested.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Annotate a model or model version\n\nYou can provide information about a model or model version by annotating it. For example, you may want to include an overview of the problem or information about the methodology and algorithm used. 
\n### Annotate a model or model version using the UI \nThe Databricks UI provides several ways to annotate models and model versions. You can add text information using a description or comments, and you can add [searchable key-value tags](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#search-for-a-model). Descriptions and tags are available for models and model versions; comments are only available for model versions. \n* Descriptions are intended to provide information about the model.\n* Comments provide a way to maintain an ongoing discussion about activities on a model version.\n* Tags let you customize model metadata to make it easier to find specific models. \n#### Add or update the description for a model or model version \n1. From the [registered model](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#registered-model-page) or [model version](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#model-version-page) page, click **Edit** next to **Description**. An edit window appears.\n2. Enter or edit the description in the edit window.\n3. Click **Save** to save your changes or **Cancel** to close the window. \nIf you entered a description of a model version, the description appears in the **Description** column in the table on the [registered model page](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#registered-model-page). The column displays a maximum of 32 characters or one line of text, whichever is shorter. \n#### Add comments for a model version \n1. Scroll down the [model version](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#model-version-page) page and click the down arrow next to **Activities**.\n2. Type your comment in the edit window and click **Add Comment**. 
\n#### Add tags for a model or model version \n1. From the [registered model](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#registered-model-page) or [model version](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#model-version-page) page, click ![Tag icon](https:\/\/docs.databricks.com\/_images\/tags1.png) if it is not already open. The tags table appears. \n![tag table](https:\/\/docs.databricks.com\/_images\/tags-open.png)\n2. Click in the **Name** and **Value** fields and type the key and value for your tag.\n3. Click **Add**. \n![add tag](https:\/\/docs.databricks.com\/_images\/tag-add.png) \n#### Edit or delete tags for a model or model version \nTo edit or delete an existing tag, use the icons in the **Actions** column. \n![tag actions](https:\/\/docs.databricks.com\/_images\/tag-edit-or-delete.png) \n### Annotate a model version using the API \nTo update a model version description, use the MLflow Client API `update_model_version()` method: \n```\nfrom mlflow.tracking import MlflowClient\n\nclient = MlflowClient()\nclient.update_model_version(\n    name=\"<model-name>\",\n    version=<model-version>,\n    description=\"<description>\"\n)\n\n``` \nTo set or update a tag for a registered model or model version, use the MLflow Client API [`set\\_registered\\_model\\_tag()`](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.tracking.html#mlflow.tracking.MlflowClient.set_registered_model_tag) or [`set\\_model\\_version\\_tag()`](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.tracking.html#mlflow.tracking.MlflowClient.set_model_version_tag) method: \n```\nclient = MlflowClient()\n# The tag value is passed with the `value` parameter.\nclient.set_registered_model_tag(\n    name=\"<model-name>\",\n    key=\"<key-value>\",\n    value=\"<tag-value>\"\n)\n\n``` \n```\nclient = 
MlflowClient()\n# The tag value is passed with the `value` parameter.\nclient.set_model_version_tag(\n    name=\"<model-name>\",\n    version=<model-version>,\n    key=\"<key-value>\",\n    value=\"<tag-value>\"\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Rename a model (API only)\n\nTo rename a registered model, use the MLflow Client API `rename_registered_model()` method: \n```\nclient = MlflowClient()\nclient.rename_registered_model(\"<model-name>\", \"<new-model-name>\")\n\n``` \nNote \nYou can rename a registered model only if it has no versions, or all versions are in the None or Archived stage.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Search for a model\n\nYou can search for models in the Workspace Model Registry using the UI or the API. \nNote \nWhen you search for a model, only models for which you have at least CAN READ permissions are returned. \n### Search for a model using the UI \nTo display registered models, click ![Models Icon](https:\/\/docs.databricks.com\/_images\/models-icon.png) **Models** in the sidebar. \nTo search for a specific model, enter text in the search box. You can enter the name of a model or any part of the name: \n![Registered models search](https:\/\/docs.databricks.com\/_images\/registered-models.png) \nYou can also search on tags. Enter tags in this format: `tags.<key>=<value>`. To search for multiple tags, use the `AND` operator. 
\n![Tag-based search](https:\/\/docs.databricks.com\/_images\/search-with-tags.png) \nYou can search on both the model name and tags using the [MLflow search syntax](https:\/\/www.mlflow.org\/docs\/latest\/search-runs.html#syntax). For example: \n![Name and tag-based search](https:\/\/docs.databricks.com\/_images\/model-search-name-and-tag.png) \n### Search for a model using the API \nYou can search for registered models in the Workspace Model Registry with the MLflow Client API method [search\\_registered\\_models()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.client.html#mlflow.client.MlflowClient.search_registered_models). \nIf you have [set tags](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#annotate-a-model-version-using-the-api) on your models, you can also search by those tags with `search_registered_models()`. \n```\nfrom pprint import pprint\nfrom mlflow.tracking import MlflowClient\n\nclient = MlflowClient()\nprint(\"Find registered models with a specific tag value\")\nfor m in client.search_registered_models(\"tags.`<key-value>`='<tag-value>'\"):\n    pprint(dict(m), indent=4)\n\n``` \nYou can also search for a specific model name and list its version details using the MLflow Client API `search_model_versions()` method: \n```\nfrom pprint import pprint\n\nclient = MlflowClient()\nfor mv in client.search_model_versions(\"name='<model-name>'\"):\n    pprint(mv)\n\n``` \nThis outputs: \n```\n{ 'creation_timestamp': 1582671933246,\n'current_stage': 'Production',\n'description': 'A random forest model containing 100 decision trees '\n'trained in scikit-learn',\n'last_updated_timestamp': 1582671960712,\n'name': 'sk-learn-random-forest-reg-model',\n'run_id': 'ae2cc01346de45f79a44a320aab1797b',\n'source': '.\/mlruns\/0\/ae2cc01346de45f79a44a320aab1797b\/artifacts\/sklearn-model',\n'status': 'READY',\n'status_message': None,\n'user_id': None,\n'version': 1 }\n\n{ 'creation_timestamp': 1582671960628,\n'current_stage': 'None',\n'description': None,\n'last_updated_timestamp': 1582671960628,\n'name': 
'sk-learn-random-forest-reg-model',\n'run_id': 'd994f18d09c64c148e62a785052e6723',\n'source': '.\/mlruns\/0\/d994f18d09c64c148e62a785052e6723\/artifacts\/sklearn-model',\n'status': 'READY',\n'status_message': None,\n'user_id': None,\n'version': 2 }\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Delete a model or model version\n\nYou can delete a model using the UI or the API. \n### Delete a model version or model using the UI \nWarning \nYou cannot undo this action. You can transition a model version to the Archived stage rather than deleting it from the registry. When you delete a model, all model artifacts stored by the Workspace Model Registry and all the metadata associated with the registered model are deleted. \nNote \nYou can only delete models and model versions in the None or Archived stage. If a registered model has versions in the Staging or Production stage, you must transition them to either the None or Archived stage before deleting the model. \nTo delete a model version: \n1. Click ![Models Icon](https:\/\/docs.databricks.com\/_images\/models-icon.png) **Models** in the sidebar.\n2. Click a model name.\n3. Click a model version.\n4. Click ![Delete model version](https:\/\/docs.databricks.com\/_images\/three-button-icon.png) at the upper right corner of the screen and select **Delete** from the drop-down menu. \nTo delete a model: \n1. Click ![Models Icon](https:\/\/docs.databricks.com\/_images\/models-icon.png) **Models** in the sidebar.\n2. Click a model name.\n3. Click ![Delete model](https:\/\/docs.databricks.com\/_images\/three-button-icon.png) at the upper right corner of the screen and select **Delete** from the drop-down menu. 
\n### Delete a model version or model using the API \nWarning \nYou cannot undo this action. You can transition a model version to the Archived stage rather than deleting it from the registry. When you delete a model, all model artifacts stored by the Workspace Model Registry and all the metadata associated with the registered model are deleted. \nNote \nYou can only delete models and model versions in the None or Archived stage. If a registered model has versions in the Staging or Production stage, you must transition them to either the None or Archived stage before deleting the model. \n#### Delete a model version \nTo delete a model version, use the MLflow Client API `delete_model_version()` method: \n```\n# Delete versions 1, 2, and 3 of the model\nclient = MlflowClient()\nversions = [1, 2, 3]\nfor version in versions:\n    client.delete_model_version(name=\"<model-name>\", version=version)\n\n``` \n#### Delete a model \nTo delete a model, use the MLflow Client API `delete_registered_model()` method: \n```\nclient = MlflowClient()\nclient.delete_registered_model(name=\"<model-name>\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Share models across workspaces\n\nDatabricks recommends using [Models in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html) to share models across workspaces. Unity Catalog provides out-of-the-box support for cross-workspace model access, governance, and audit logging. \nHowever, if you use the Workspace Model Registry, you can also [share models across multiple workspaces](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/multiple-workspaces.html) with some setup. 
For example, you can develop and log a model in your own workspace and then access it from another workspace using a remote Workspace model registry. This is useful when multiple teams share access to models. You can create multiple workspaces and use and manage models across these environments.\n\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Copy MLflow objects between workspaces\n\nTo import or export MLflow objects to or from your Databricks workspace, you can use the community-driven open source project [MLflow Export-Import](https:\/\/github.com\/mlflow\/mlflow-export-import#why-use-mlflow-export-import) to migrate MLflow experiments, models, and runs between workspaces. \nWith these tools, you can: \n* Share and collaborate with other data scientists in the same or another tracking server. For example, you can clone an experiment from another user into your workspace.\n* Copy a model from one workspace to another, such as from a development to a production workspace.\n* Copy MLflow experiments and runs from your local tracking server to your Databricks workspace.\n* Back up mission critical experiments and models to another Databricks workspace.\n\n##### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Example\n\nThis example illustrates how to use the Workspace Model Registry to build a machine learning application. 
\n[Workspace Model Registry example](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Import and visualize CSV data from a notebook\n\nThis article walks you through using a Databricks notebook to import data from a CSV file containing baby name data from [health.data.ny.gov](https:\/\/health.data.ny.gov\/api\/views\/jxy9-yhdk\/rows.csv) into your Unity Catalog volume using Python, Scala, and R. You also learn to modify a column name, visualize the data, and save to a table.\n\n### Get started: Import and visualize CSV data from a notebook\n#### Requirements\n\nTo complete the tasks in this article, you must meet the following requirements: \n* Your workspace must have [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) enabled. For information on getting started with Unity Catalog, see [Set up and manage Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html).\n* You must have permission to use an existing compute resource or create a new compute resource. See [Get started: Account and workspace setup](https:\/\/docs.databricks.com\/getting-started\/index.html) or see your Databricks administrator. \nTip \nFor a completed notebook for this article, see [Import and visualize data notebooks](https:\/\/docs.databricks.com\/getting-started\/import-visualize-data.html#notebook).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/import-visualize-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Import and visualize CSV data from a notebook\n#### Step 1: Create a new notebook\n\nTo create a notebook in your workspace: \n1. 
Click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar, and then click **Notebook**.\n2. On the Create Notebook page: \n* Specify a unique name for your notebook.\n* Set the default language for your notebook and then click **Confirm** if prompted.\n* Click **Connect** and select a compute resource. To create a new compute resource, see [Use compute](https:\/\/docs.databricks.com\/compute\/use-compute.html). \nTo learn more about creating and managing notebooks, see [Manage notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/import-visualize-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Import and visualize CSV data from a notebook\n#### Step 2: Define variables\n\nIn this step, you define variables for use in the example notebook you create in this article. \n1. Copy and paste the following code into the new empty notebook cell. Replace `<catalog_name>`, `<schema_name>`, and `<volume_name>` with the catalog, schema, and volume names for a Unity Catalog volume. Optionally, change the table name value `baby_names` to a table name of your choice. You will save the baby name data into this table later in this article.\n2. Press `Shift+Enter` to run the cell and create a new blank cell. 
\n```\ncatalog = \"<catalog_name>\"\nschema = \"<schema_name>\"\nvolume = \"<volume_name>\"\ndownload_url = \"https:\/\/health.data.ny.gov\/api\/views\/jxy9-yhdk\/rows.csv\"\nfile_name = \"baby_names.csv\"\ntable_name = \"baby_names\"\npath_volume = \"\/Volumes\/\" + catalog + \"\/\" + schema + \"\/\" + volume\npath_table = catalog + \".\" + schema\nprint(path_table) # Show the complete path\nprint(path_volume) # Show the complete path\n\n``` \n```\nval catalog = \"<catalog_name>\"\nval schema = \"<schema_name>\"\nval volume = \"<volume_name>\"\nval downloadUrl = \"https:\/\/health.data.ny.gov\/api\/views\/jxy9-yhdk\/rows.csv\"\nval fileName = \"baby_names.csv\"\nval tableName = \"baby_names\"\nval pathVolume = s\"\/Volumes\/${catalog}\/${schema}\/${volume}\"\nval pathTable = s\"${catalog}.${schema}\"\nprint(pathVolume) \/\/ Show the complete path\nprint(pathTable) \/\/ Show the complete path\n\n``` \n```\ncatalog <- \"<catalog_name>\"\nschema <- \"<schema_name>\"\nvolume <- \"<volume_name>\"\ndownload_url <- \"https:\/\/health.data.ny.gov\/api\/views\/jxy9-yhdk\/rows.csv\"\nfile_name <- \"baby_names.csv\"\ntable_name <- \"baby_names\"\npath_volume <- paste(\"\/Volumes\/\", catalog, \"\/\", schema, \"\/\", volume, sep = \"\")\npath_table <- paste(catalog, \".\", schema, sep = \"\")\nprint(path_volume) # Show the complete path\nprint(path_table) # Show the complete path\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/import-visualize-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Import and visualize CSV data from a notebook\n#### Step 3: Import CSV file\n\nIn this step, you import a CSV file containing baby name data from [health.data.ny.gov](https:\/\/health.data.ny.gov\/api\/views\/jxy9-yhdk\/rows.csv) into your Unity Catalog volume. \n1. Copy and paste the following code into the new empty notebook cell. 
This code copies the `rows.csv` file from [health.data.ny.gov](https:\/\/health.data.ny.gov\/api\/views\/jxy9-yhdk\/rows.csv) into your Unity Catalog volume using the [Databricks dbutils](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#cp-command-dbutilsfscp) `cp` command.\n2. Press `Shift+Enter` to run the cell and then move to the next cell. \n```\ndbutils.fs.cp(download_url, f\"{path_volume}\/{file_name}\")\n\n``` \n```\ndbutils.fs.cp(downloadUrl, s\"${pathVolume}\/${fileName}\")\n\n``` \n```\ndbutils.fs.cp(download_url, paste(path_volume, \"\/\", file_name, sep = \"\"))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/import-visualize-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Import and visualize CSV data from a notebook\n#### Step 4: Load CSV data into a DataFrame\n\nIn this step, you create a DataFrame named `df` from the CSV file that you previously loaded into your Unity Catalog volume by using the [spark.read.csv](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-csv.html) method. \n1. Copy and paste the following code into the new empty notebook cell. This code loads baby name data into DataFrame `df` from the CSV file.\n2. Press `Shift+Enter` to run the cell and then move to the next cell. 
\n```\ndf = spark.read.csv(f\"{path_volume}\/{file_name}\",\nheader=True,\ninferSchema=True,\nsep=\",\")\n\n``` \n```\nval df = spark.read\n.option(\"header\", \"true\")\n.option(\"inferSchema\", \"true\")\n.option(\"delimiter\", \",\")\n.csv(s\"${pathVolume}\/${fileName}\")\n\n``` \n```\n# Load the SparkR package that is already preinstalled on the cluster.\nlibrary(SparkR)\n\ndf <- read.df(paste(path_volume, \"\/\", file_name, sep=\"\"),\nsource=\"csv\",\nheader = TRUE,\ninferSchema = TRUE,\ndelimiter = \",\")\n\n``` \nYou can load data from many [supported file formats](https:\/\/docs.databricks.com\/query\/formats\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/import-visualize-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Import and visualize CSV data from a notebook\n#### Step 5: Visualize data from notebook\n\nIn this step, you use the `display()` method to display the contents of the DataFrame in a table in the notebook, and then visualize the data in a word cloud chart in the notebook. \n1. Copy and paste the following code into the new empty notebook cell, and then click **Run cell** to display the data in a table. \n```\ndisplay(df)\n\n``` \n```\ndisplay(df)\n\n``` \n```\ndisplay(df)\n\n```\n2. Review the results in the table.\n3. Next to the **Table** tab, click **+** and then click **Visualization**.\n4. In the visualization editor, click **Visualization Type**, and verify that **Word cloud** is selected.\n5. In the **Words column**, verify that `First Name` is selected.\n6. In **Frequencies limit**, click `35`. \n![word cloud chart](https:\/\/docs.databricks.com\/_images\/word_cloud.png)\n7. 
Click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/import-visualize-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Import and visualize CSV data from a notebook\n#### Step 6: Save the DataFrame to a table\n\nImportant \nTo save your DataFrame in Unity Catalog, you must have `CREATE` table privileges on the catalog and schema. For information on permissions in Unity Catalog, see [Privileges and securable objects in Unity Catalog](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-privileges.html) and [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html). \n1. Copy and paste the following code into an empty notebook cell. This code replaces the space in the `First Name` column name with an underscore. [Special characters](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-names.html), such as spaces, are not allowed in column names. This code uses the Apache Spark `withColumnRenamed()` method. \n```\ndf = df.withColumnRenamed(\"First Name\", \"First_Name\")\ndf.printSchema()\n\n``` \n```\nval dfRenamedColumn = df.withColumnRenamed(\"First Name\", \"First_Name\")\n\/\/ df is declared with val, so assign the renamed DataFrame to a new variable\ndfRenamedColumn.printSchema()\n\n``` \n```\ndf <- withColumnRenamed(df, \"First Name\", \"First_Name\")\nprintSchema(df)\n\n```\n2. Copy and paste the following code into an empty notebook cell. This code saves the contents of the DataFrame to a table in Unity Catalog using the table name variable that you defined at the start of this article. 
\n```\ndf.write.mode(\"overwrite\").saveAsTable(f\"{path_table}.{table_name}\")\n\n``` \n```\ndfRenamedColumn.write.mode(\"overwrite\").saveAsTable(s\"${pathTable}.${tableName}\")\n\n``` \n```\n# sep = \"\" keeps paste() from inserting spaces into the table name\nsaveAsTable(df, paste(path_table, \".\", table_name, sep = \"\"), mode = \"overwrite\")\n\n```\n3. To verify that the table was saved, click **Catalog** in the left sidebar to open the Catalog Explorer UI. Open your catalog and then your schema to verify that the table appears.\n4. Click your table to view the table schema on the **Overview** tab.\n5. Click **Sample Data** to view 100 rows of data from the table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/import-visualize-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Import and visualize CSV data from a notebook\n#### Import and visualize data notebooks\n\nUse one of the following notebooks to perform the steps in this article. \n### Import data from CSV using Python \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/getting-started\/import-visualize-data-python.html) \n### Import data from CSV using Scala \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/getting-started\/import-visualize-data-scala.html) \n### Import data from CSV using R \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/getting-started\/import-visualize-data-sparkr.html)\n\n### Get started: Import and visualize CSV data from a notebook\n#### Next steps\n\n* To learn about adding more data to an existing table from a CSV file, see [Get started: Ingest and insert additional 
data](https:\/\/docs.databricks.com\/getting-started\/ingest-insert-additional-data.html).\n* To learn about cleansing and enhancing data, see [Get started: Enhance and cleanse data](https:\/\/docs.databricks.com\/getting-started\/cleanse-enhance-data.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/import-visualize-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Import and visualize CSV data from a notebook\n#### Additional resources\n\n* [Get started: Query and visualize data from a notebook](https:\/\/docs.databricks.com\/getting-started\/quick-start.html)\n* [Tutorial: Load and transform data using Apache Spark DataFrames](https:\/\/docs.databricks.com\/getting-started\/dataframes.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/import-visualize-data.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n### Unit testing for notebooks\n##### Test Databricks notebooks\n\nThis page briefly describes some techniques that are useful when testing code directly in Databricks notebooks. You can use these methods separately or together. \nFor a detailed walkthrough of how to set up and organize functions and unit tests in Databricks notebooks, see [Unit testing for notebooks](https:\/\/docs.databricks.com\/notebooks\/testing.html). \nMany unit testing libraries work directly within the notebook. For example, you can use the built-in Python [`unittest`](https:\/\/docs.python.org\/3\/library\/unittest.html) package to test notebook code. \n```\nimport unittest\n\ndef reverse(s):\n    return s[::-1]\n\nclass TestHelpers(unittest.TestCase):\n    def test_reverse(self):\n        self.assertEqual(reverse('abc'), 'cba')\n\n# argv=[''] and exit=False keep unittest from parsing notebook arguments or stopping the kernel\nr = unittest.main(argv=[''], verbosity=2, exit=False)\nassert r.result.wasSuccessful(), 'Test failed; see logs above'\n\n``` \nTest failures appear in the output area of the cell. 
\n![Unit test failure](https:\/\/docs.databricks.com\/_images\/test-failure-output.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/test-notebooks.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n### Unit testing for notebooks\n##### Test Databricks notebooks\n###### Use Databricks widgets to select notebook mode\n\nYou can use [widgets](https:\/\/docs.databricks.com\/notebooks\/widgets.html) to distinguish test invocations from normal invocations in a single notebook. The following code produces the example shown in the screenshot: \n```\ndbutils.widgets.dropdown(\"Mode\", \"Test\", [\"Test\", \"Normal\"])\n\ndef reverse(s):\n    return s[::-1]\n\nif dbutils.widgets.get('Mode') == 'Test':\n    assert reverse('abc') == 'cba'\n    print('Tests passed')\nelse:\n    print(reverse('desrever'))\n\n``` \nThe first line generates the **Mode** dropdown menu: \n![Widget customize execution](https:\/\/docs.databricks.com\/_images\/test-mode-widget.png)\n\n##### Test Databricks notebooks\n###### Hide test code and results\n\nTo hide test code and results, select **Hide Code** or **Hide Result** from the [cell actions menu](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#cell-actions). Errors are displayed even if results are hidden.\n\n##### Test Databricks notebooks\n###### Schedule tests to run automatically\n\nTo run tests periodically and automatically, you can use [scheduled notebooks](https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html). You can configure the job to send [notification emails](https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html) to an email address that you specify. 
\n![Scheduled notebook test](https:\/\/docs.databricks.com\/_images\/test-failure-notification.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/test-notebooks.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n### Unit testing for notebooks\n##### Test Databricks notebooks\n###### Separate test code from the notebook\n\nYou can keep your test code separate from your notebook using either `%run` or Databricks Git folders. When you use `%run`, test code is included in a separate notebook that you call from another notebook. When you use Databricks Git folders, you can [keep test code in non-notebook source code files](https:\/\/docs.databricks.com\/notebooks\/share-code.html#reference-source-code-files-using-git). \nThis section shows some examples of using `%run` and Databricks Git folders to separate your test code from the notebook.\n\n##### Test Databricks notebooks\n###### Use `%run`\n\nThe screenshot below shows how to use `%run` to run a notebook from another notebook. For more information about using `%run`, see [Use %run to import a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-workflows.html#run). The code used to generate the examples is shown following the screenshot. \n![Separating test code](https:\/\/docs.databricks.com\/_images\/notebook-test-separate.png) \nHere is the code used in the example. This code assumes that the notebooks **shared-code-notebook** and **shared-code-notebook-test** are in the same workspace folder. 
\n**shared-code-notebook**: \n```\ndef reverse(s):\n    return s[::-1]\n\n``` \n**shared-code-notebook-test**: \nIn one cell: \n```\n%run .\/shared-code-notebook\n\n``` \nIn a subsequent cell: \n```\nimport unittest\n\nclass TestHelpers(unittest.TestCase):\n    def test_reverse(self):\n        self.assertEqual(reverse('abc'), 'cba')\n\nr = unittest.main(argv=[''], verbosity=2, exit=False)\nassert r.result.wasSuccessful(), 'Test failed; see logs above'\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/test-notebooks.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n### Unit testing for notebooks\n##### Test Databricks notebooks\n###### Use Databricks Git folders\n\nFor code stored in a [Databricks Git folder](https:\/\/docs.databricks.com\/repos\/index.html), you can call the test and run it directly from a notebook. \n![Notebook testing invocation](https:\/\/docs.databricks.com\/_images\/sh-unittest-invocation.png) \nYou can also use [web terminal](https:\/\/docs.databricks.com\/compute\/web-terminal.html) to run tests in source code files just as you would on your local machine. \n![Git folders testing invocation](https:\/\/docs.databricks.com\/_images\/test-code-repos.png)\n\n##### Test Databricks notebooks\n###### Set up a CI\/CD-style workflow\n\nFor notebooks in a [Databricks Git folder](https:\/\/docs.databricks.com\/repos\/index.html), you can set up a CI\/CD-style workflow by configuring notebook tests to run for each commit. 
See [Databricks GitHub Actions](https:\/\/docs.databricks.com\/dev-tools\/ci-cd\/ci-cd-github.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/test-notebooks.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Software engineering best practices for notebooks\n\nThis article provides a hands-on walkthrough that demonstrates how to apply software engineering best practices to your Databricks notebooks, including version control, code sharing, testing, and optionally continuous integration and continuous delivery or deployment (CI\/CD). \nIn this walkthrough, you will: \n* Add notebooks to Databricks Git folders for version control.\n* Extract portions of code from one of the notebooks into a shareable module.\n* Test the shared code.\n* Run the notebooks from a Databricks job.\n* Optionally apply CI\/CD to the shared code.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/best-practices.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Software engineering best practices for notebooks\n##### Requirements\n\nTo complete this walkthrough, you must provide the following resources: \n* A remote repository with a [Git provider](https:\/\/docs.databricks.com\/repos\/index.html#supported-git-providers) that Databricks supports. This article\u2019s walkthrough uses GitHub. This walkthrough assumes that you have a GitHub repository named `best-notebooks` available. (You can give your repository a different name. If you do, replace `best-notebooks` with your repo\u2019s name throughout this walkthrough.) [Create a GitHub repo](https:\/\/docs.github.com\/get-started\/quickstart\/create-a-repo#create-a-repository) if you do not already have one. \nNote \nIf you create a new repo, be sure to initialize the repository with at least one file, for example a `README` file.\n* A Databricks [workspace](https:\/\/docs.databricks.com\/workspace\/index.html). 
[Create a workspace](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/workspaces.html) if you do not already have one.\n* A Databricks [all-purpose cluster](https:\/\/docs.databricks.com\/compute\/index.html) in the workspace. To run notebooks during the design phase, you [attach the notebooks to a running all-purpose cluster](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach). Later on, this walkthrough uses a Databricks [job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) to automate running the notebooks on this cluster. (You can also run jobs on [job clusters](https:\/\/docs.databricks.com\/compute\/index.html) that exist only for the jobs\u2019 lifetimes.) [Create an all-purpose cluster](https:\/\/docs.databricks.com\/compute\/configure.html) if you do not already have one.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/best-practices.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Software engineering best practices for notebooks\n##### Step 1: Set up Databricks Git folders\n\nIn this step, you connect your existing GitHub repo to Databricks Git folders in your existing Databricks workspace. \nTo enable your workspace to connect to your GitHub repo, you must first provide your workspace with your GitHub credentials, if you have not done so already. \n### Step 1.1: Provide your GitHub credentials \n1. Click your username at the top right of the workspace, and then click **Settings** in the dropdown list.\n2. In the **Settings** sidebar, under **User**, click **Linked accounts**.\n3. Under **Git integration**, for **Git provider**, select **GitHub**.\n4. Click **Personal access token**.\n5. For **Git provider username or email**, enter your GitHub username.\n6. For **Token**, enter your [GitHub personal access token (classic)](https:\/\/docs.github.com\/authentication\/keeping-your-account-and-data-secure\/creating-a-personal-access-token). 
This personal access token (classic) must have the **repo** and **workflow** permissions.\n7. Click **Save**. \n### Step 1.2: Connect to your GitHub repo \n1. On the workspace sidebar, click **Workspace**.\n2. In the **Workspace** browser, expand **Workspace > Users**.\n3. Right-click your username folder, and then click **Create > Git folder**.\n4. In the **Create Git folder** dialog: \n1. For **Git repository URL**, enter the GitHub [Clone with HTTPS](https:\/\/docs.github.com\/repositories\/creating-and-managing-repositories\/cloning-a-repository) URL for your GitHub repo. This article assumes that your URL ends with `best-notebooks.git`, for example `https:\/\/github.com\/<your-GitHub-username>\/best-notebooks.git`.\n2. For **Git provider**, select **GitHub**.\n3. Leave **Git folder name** set to the name of your repo, for example `best-notebooks`.\n4. Click **Create Git folder**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/best-practices.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Software engineering best practices for notebooks\n##### Step 2: Import and run the notebook\n\nIn this step, you import an existing external notebook into your repo. You could create your own notebooks for this walkthrough, but to speed things up we provide them for you here. \n### Step 2.1: Create a working branch in the repo \nIn this substep, you create a branch named `eda` in your repo. This branch enables you to work on files and code independently from your repo\u2019s `main` branch, which is a software engineering best practice. (You can give your branch a different name.) \nNote \nIn some repos, the `main` branch may be named `master` instead. If so, replace `main` with `master` throughout this walkthrough. \nTip \nIf you\u2019re not familiar with working in Git branches, see [Git Branches - Branches in a Nutshell](https:\/\/git-scm.com\/book\/en\/v2\/Git-Branching-Branches-in-a-Nutshell) on the Git website. 
\n1. The Git folder from Step 1.2 should be open. If not, then in the **Workspace** sidebar, expand **Workspace > Users**, then expand your username folder, and click your Git folder.\n2. Next to the folder name under the workspace navigation breadcrumb, click the **main** Git branch button.\n3. In the **best-notebooks** dialog, click the **Create branch** button. \nNote \nIf your repo has a name other than `best-notebooks`, this dialog\u2019s title will be different, here and throughout this walkthrough.\n4. Enter `eda`, and click **Create**.\n5. Close this dialog. \n### Step 2.2: Import the notebook into the repo \nIn this substep, you import an existing notebook from another repo into your repo. This notebook does the following: \n* Copies a CSV file from the [owid\/covid-19-data](https:\/\/github.com\/owid\/covid-19-data) GitHub repository onto a cluster in your workspace. This CSV file contains public data about COVID-19 hospitalizations and intensive care metrics from around the world.\n* Reads the CSV file\u2019s contents into a [pandas](https:\/\/pandas.pydata.org\/) [DataFrame](https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.html).\n* Filters the data to contain metrics from only the United States.\n* Displays a plot of the data.\n* Saves the pandas DataFrame as a [Pandas API on Spark](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/user_guide\/pandas_on_spark\/index.html) [DataFrame](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.pandas\/frame.html).\n* Performs data cleansing on the Pandas API on Spark DataFrame.\n* Writes the Pandas API on Spark DataFrame as a [Delta table](https:\/\/docs.databricks.com\/delta\/tutorial.html#create) in your workspace.\n* Displays the Delta table\u2019s contents. \nWhile you could create your own notebook in your repo here, importing an existing notebook instead helps to speed up this walkthrough. 
To create a notebook in this branch or move an existing notebook into this branch instead of importing a notebook, see [Workspace files basic usage](https:\/\/docs.databricks.com\/files\/workspace-basics.html). \n1. From the **best-notebooks** Git folder, click **Create > Folder**.\n2. In the **New folder** dialog, enter `notebooks`, and then click **Create**.\n3. From the **notebooks** folder, click the kebab, then **Import**.\n4. In the **Import** dialog: \n1. For **Import from**, select **URL**.\n2. Enter the URL to the raw contents of the `covid_eda_raw` notebook in the `databricks\/notebook-best-practices` repo in GitHub. To get this URL: \n1. Go to <https:\/\/github.com\/databricks\/notebook-best-practices>.\n2. Click the `notebooks` folder.\n3. Click the `covid_eda_raw.py` file.\n4. Click **Raw**.\n5. Copy the full URL from your web browser\u2019s address bar over into the **Import** dialog. \nNote \nThe **Import** dialog works with Git URLs for public repositories only.\n3. Click **Import**. \n### Step 2.3: Run the notebook \n1. If the notebook is not already showing, open the **notebooks** folder, and then click the **covid\\_eda\\_raw** notebook inside of the folder.\n2. [Select the cluster to attach this notebook to](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach-a-notebook-to-a-cluster). For instructions on creating a cluster, see [Create a cluster](https:\/\/docs.databricks.com\/compute\/configure.html).\n3. Click **Run All**.\n4. Wait while the notebook runs. \nAfter the notebook finishes running, in the notebook you should see a plot of the data as well as over 600 rows of raw data in the Delta table. If the cluster was not already running when you started running this notebook, it could take several minutes for the cluster to start up before displaying the results. \n### Step 2.4: Check in and merge the notebook \nIn this substep, you save your work so far to your GitHub repo. 
You then merge the notebook from your working branch into your repo\u2019s `main` branch. \n1. Next to the notebook\u2019s name, click the **eda** Git branch button.\n2. In the **best-notebooks** dialog, on the **Changes** tab, make sure the **notebooks\/covid\\_eda\\_raw.py** file is selected.\n3. For **Commit message (required)**, enter `Added raw notebook`.\n4. For **Description (optional)**, enter `This is the first version of the notebook.`\n5. Click **Commit & Push**.\n6. Click the pull request link in **Create a pull request on your git provider** in the banner.\n7. In GitHub, create the pull request, and then merge the pull request into the `main` branch.\n8. Back in your Databricks workspace, close the **best-notebooks** dialog if it is still showing.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/best-practices.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Software engineering best practices for notebooks\n##### Step 3: Move code into a shared module\n\nIn this step, you move some of the code in your notebook into a set of shared functions outside of your notebook. This enables you to use these functions with other similar notebooks, which can speed up future coding and help ensure more predictable and consistent notebook results. Sharing this code also enables you to more easily test these functions, which as a software engineering best practice can raise the overall quality of your code as you go. \n### Step 3.1: Create another working branch in the repo \n1. Next to the notebook\u2019s name, click the **eda** Git branch button.\n2. In the **best-notebooks** dialog, click the drop-down arrow next to the **eda** branch, and select **main**.\n3. Click the **Pull** button. If prompted to proceed with pulling, click **Confirm**.\n4. Click the **Create Branch** button.\n5. Enter `first_modules`, and then click **Create**. (You can give your branch a different name.)\n6. Close this dialog. 
\n### Step 3.2: Import the notebook into the repo \nTo speed up this walkthrough, in this substep you import another existing notebook into your repo. This notebook does the same things as the previous notebook, except this notebook will call shared code functions that are stored outside of the notebook. Again, you could create your own notebook in your repo here and do the actual code sharing yourself. \n1. From the **Workspace** browser, right-click the **notebooks** folder, and then click **Import**.\n2. In the **Import** dialog: \n1. For **Import from**, select **URL**.\n2. Enter the URL to the raw contents of the `covid_eda_modular` notebook in the `databricks\/notebook-best-practices` repo in GitHub. To get this URL: \n1. Go to <https:\/\/github.com\/databricks\/notebook-best-practices>.\n2. Click the `notebooks` folder.\n3. Click the `covid_eda_modular.py` file.\n4. Click **Raw**.\n5. Copy the full URL from your web browser\u2019s address bar over into the **Import Notebooks** dialog. \nNote \nThe **Import Notebooks** dialog works with Git URLs for public repositories only.\n3. Click **Import**. \n### Step 3.3: Add the notebook\u2019s supporting shared code functions \n1. From the **Workspace** browser, right-click the **best-notebooks** Git folder, and then click **Create > Folder**.\n2. In the **New folder** dialog, enter `covid_analysis`, and then click **Create**.\n3. From the **covid\\_analysis** folder click **Create > File**.\n4. In the **New File Name** dialog, enter `transforms.py`, and then click **Create File**.\n5. 
In the **transforms.py** editor window, enter the following code: \n```\nimport pandas as pd\n\n# Filter by country code.\ndef filter_country(pdf, country=\"USA\"):\n    pdf = pdf[pdf.iso_code == country]\n    return pdf\n\n# Pivot by indicator, and fill missing values.\ndef pivot_and_clean(pdf, fillna):\n    pdf[\"value\"] = pd.to_numeric(pdf[\"value\"])\n    pdf = pdf.fillna(fillna).pivot_table(\n        values=\"value\", columns=\"indicator\", index=\"date\"\n    )\n    return pdf\n\n# Create column names that are compatible with Delta tables.\ndef clean_spark_cols(pdf):\n    pdf.columns = pdf.columns.str.replace(\" \", \"_\")\n    return pdf\n\n# Convert index to column (works with pandas API on Spark, too).\ndef index_to_col(df, colname):\n    df[colname] = df.index\n    return df\n\n``` \nTip \nFor other code sharing techniques, see [Share code between Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/share-code.html). \n### Step 3.4: Add the shared code\u2019s dependencies \nThe preceding code has several Python package dependencies to enable the code to run properly. In this substep, you declare these package dependencies. Declaring dependencies improves reproducibility by using precisely defined versions of libraries. \n1. From the **Workspace** browser, right-click the **best-notebooks** Git folder, and then click **Create > File**. \nNote \nYou want the file that lists package dependencies to go into the Git folder\u2019s root, not into the **notebooks** or **covid\\_analysis** folders.\n2. In the **New File Name** dialog, enter `requirements.txt`, and then click **Create File**.\n3. In the **requirements.txt** editor window, enter the following code: \nNote \nIf the `requirements.txt` file is not visible, you may need to refresh your web browser. 
\n```\n-i https:\/\/pypi.org\/simple\nattrs==21.4.0\ncycler==0.11.0\nfonttools==4.33.3\niniconfig==1.1.1\nkiwisolver==1.4.2\nmatplotlib==3.5.1\nnumpy==1.22.3\npackaging==21.3\npandas==1.4.2\npillow==9.1.0\npluggy==1.0.0\npy==1.11.0\npy4j==0.10.9.3\npyarrow==7.0.0\npyparsing==3.0.8\npyspark==3.2.1\npytest==7.1.2\npython-dateutil==2.8.2\npytz==2022.1\nsix==1.16.0\ntomli==2.0.1\nwget==3.2\n\n``` \nNote \nThe preceding file lists specific package versions. For better compatibility, you can cross-reference these versions with the ones that are installed on your all-purpose cluster. See the \u201cSystem environment\u201d section for your cluster\u2019s Databricks Runtime version in [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). \nYour repo structure should now look like this: \n```\n\u251c\u2500\u2500 covid_analysis\n\u2502 \u2514\u2500\u2500 transforms.py\n\u251c\u2500\u2500 notebooks\n\u2502 \u251c\u2500\u2500 covid_eda_modular\n\u2502 \u2514\u2500\u2500 covid_eda_raw (optional)\n\u2514\u2500\u2500 requirements.txt\n\n``` \n### Step 3.5: Run the refactored notebook \nIn this substep, you run the `covid_eda_modular` notebook, which calls the shared code in `covid_analysis\/transforms.py`. \n1. From the **Workspace** browser, click the **covid\\_eda\\_modular** notebook inside the **notebooks** folder.\n2. [Select the cluster to attach this notebook to](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach-a-notebook-to-a-cluster).\n3. Click **Run All**.\n4. Wait while the notebook runs. \nAfter the notebook finishes running, in the notebook you should see similar results as the `covid_eda_raw` notebook: a plot of the data as well as over 600 rows of raw data in the Delta table. The main difference with this notebook is that a different filter is used (an `iso_code` of `DZA` instead of `USA`). 
If the cluster was not already running when you started running this notebook, it could take several minutes for the cluster to start up before displaying the results. \n### Step 3.6: Check in the notebook and its related code \n1. Next to the notebook\u2019s name, click the **first\\_modules** Git branch button.\n2. In the **best-notebooks** dialog, on the **Changes** tab, make sure the following are selected: \n* **requirements.txt**\n* **covid\\_analysis\/transforms.py**\n* **notebooks\/covid\\_eda\\_modular.py**\n3. For **Commit message (required)**, enter `Added refactored notebook`.\n4. For **Description (optional)**, enter `This is the second version of the notebook.`\n5. Click **Commit & Push**.\n6. Click the pull request link in **Create a pull request on your git provider** in the banner.\n7. In GitHub, create the pull request, and then merge the pull request into the `main` branch.\n8. Back in your Databricks workspace, close the **best-notebooks** dialog if it is still showing.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/best-practices.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Software engineering best practices for notebooks\n##### Step 4: Test the shared code\n\nIn this step, you test the shared code from the last step. However, you want to test this code without running the `covid_eda_modular` notebook itself. This is because if the shared code fails to run, the notebook itself would likely fail to run as well. You want to catch failures in your shared code first before having your main notebook eventually fail later. This testing technique is a software engineering best practice. \nTip \nFor additional approaches to testing for notebooks, as well as testing for R and Scala notebooks, see [Unit testing for notebooks](https:\/\/docs.databricks.com\/notebooks\/testing.html). \n### Step 4.1: Create another working branch in the repo \n1. 
Next to the notebook\u2019s name, click the **first\\_modules** Git branch button.\n2. In the **best-notebooks** dialog, click the drop-down arrow next to the **first\\_modules** branch, and select **main**.\n3. Click the **Pull** button. If prompted to proceed with pulling, click **Confirm**.\n4. Click **Create Branch**.\n5. Enter `first_tests`, and then click **Create**. (You can give your branch a different name.)\n6. Close this dialog. \n### Step 4.2: Add the tests \nIn this substep, you use the [pytest](https:\/\/docs.pytest.org\/) framework to test your shared code. In these tests, you [assert](https:\/\/docs.pytest.org\/en\/7.1.x\/getting-started.html) whether particular test results are achieved. If any test produces an unexpected result, that particular test fails the assertion and thus the test itself fails. \n1. From the **Workspace** browser, right-click your Git folder, and then click **Create > Folder**.\n2. In the **New folder** dialog, enter `tests`, and then click **Create**.\n3. From the **tests** folder, click **Create > File**.\n4. In the **New File Name** dialog, enter `testdata.csv`, and then click **Create File**.\n5. 
In the **testdata.csv** editor window, enter the following test data: \n```\nentity,iso_code,date,indicator,value\nUnited States,USA,2022-04-17,Daily ICU occupancy,\nUnited States,USA,2022-04-17,Daily ICU occupancy per million,4.1\nUnited States,USA,2022-04-17,Daily hospital occupancy,10000\nUnited States,USA,2022-04-17,Daily hospital occupancy per million,30.3\nUnited States,USA,2022-04-17,Weekly new hospital admissions,11000\nUnited States,USA,2022-04-17,Weekly new hospital admissions per million,32.8\nAlgeria,DZA,2022-04-18,Daily ICU occupancy,1010\nAlgeria,DZA,2022-04-18,Daily ICU occupancy per million,4.5\nAlgeria,DZA,2022-04-18,Daily hospital occupancy,11000\nAlgeria,DZA,2022-04-18,Daily hospital occupancy per million,30.9\nAlgeria,DZA,2022-04-18,Weekly new hospital admissions,10000\nAlgeria,DZA,2022-04-18,Weekly new hospital admissions per million,32.1\n\n```\n6. From the **tests** folder, click **Create > File**.\n7. In the **New File Name** dialog, enter `transforms_test.py`, and then click **Create File**.\n8. In the **transforms\\_test.py** editor window, enter the following test code. 
These tests use standard `pytest` [fixtures](https:\/\/docs.pytest.org\/en\/7.1.x\/explanation\/fixtures.html#about-fixtures) as well as a mocked in-memory pandas DataFrame: \n```\n# Test each of the transform functions.\nimport pytest\nfrom textwrap import fill\nimport os\nimport pandas as pd\nimport numpy as np\nfrom covid_analysis.transforms import *\nfrom pyspark.sql import SparkSession\n\n@pytest.fixture\ndef raw_input_df() -> pd.DataFrame:\n    \"\"\"\n    Create a basic version of the input dataset for testing, including NaNs.\n    \"\"\"\n    return pd.read_csv('tests\/testdata.csv')\n\n@pytest.fixture\ndef colnames_df() -> pd.DataFrame:\n    df = pd.DataFrame(\n        data=[[0,1,2,3,4,5]],\n        columns=[\n            \"Daily ICU occupancy\",\n            \"Daily ICU occupancy per million\",\n            \"Daily hospital occupancy\",\n            \"Daily hospital occupancy per million\",\n            \"Weekly new hospital admissions\",\n            \"Weekly new hospital admissions per million\"\n        ]\n    )\n    return df\n\n# Make sure the filter works as expected.\ndef test_filter(raw_input_df):\n    filtered = filter_country(raw_input_df)\n    assert filtered.iso_code.drop_duplicates()[0] == \"USA\"\n\n# The test data has NaNs for Daily ICU occupancy; this should get filled to 0.\ndef test_pivot(raw_input_df):\n    pivoted = pivot_and_clean(raw_input_df, 0)\n    assert pivoted[\"Daily ICU occupancy\"][0] == 0\n\n# Test column cleaning.\ndef test_clean_cols(colnames_df):\n    cleaned = clean_spark_cols(colnames_df)\n    cols_w_spaces = cleaned.filter(regex=(\" \"))\n    assert cols_w_spaces.empty == True\n\n# Test column creation from index.\ndef test_index_to_col(raw_input_df):\n    df = index_to_col(raw_input_df, \"col_from_index\")\n    assert (df.index == df.col_from_index).all()\n\n``` \nYour repo structure should now look like this: \n```\n\u251c\u2500\u2500 covid_analysis\n\u2502 \u2514\u2500\u2500 transforms.py\n\u251c\u2500\u2500 notebooks\n\u2502 \u251c\u2500\u2500 covid_eda_modular\n\u2502 \u2514\u2500\u2500 covid_eda_raw (optional)\n\u251c\u2500\u2500 
requirements.txt\n\u2514\u2500\u2500 tests\n    \u251c\u2500\u2500 testdata.csv\n    \u2514\u2500\u2500 transforms_test.py\n\n``` \n### Step 4.3: Run the tests \nTo speed up this walkthrough, in this substep you use an imported notebook to run the preceding tests. This notebook downloads and installs the tests\u2019 dependent Python packages into your workspace, runs the tests, and reports the tests\u2019 results. While you could run `pytest` from your cluster\u2019s [web terminal](https:\/\/docs.databricks.com\/compute\/web-terminal.html), running `pytest` from a notebook can be more convenient. \nNote \nRunning `pytest` runs all files whose names follow the form `test_*.py` or `*_test.py` in the current directory and its subdirectories. \n1. From the **Workspace** browser, right-click the **notebooks** folder, and then click **Import**.\n2. In the **Import Notebooks** dialog: \n1. For **Import from**, select **URL**.\n2. Enter the URL to the raw contents of the `run_unit_tests` notebook in the `databricks\/notebook-best-practices` repo in GitHub. To get this URL: \n1. Go to <https:\/\/github.com\/databricks\/notebook-best-practices>.\n2. Click the `notebooks` folder.\n3. Click the `run_unit_tests.py` file.\n4. Click **Raw**.\n5. Copy the full URL from your web browser\u2019s address bar over into the **Import Notebooks** dialog. \nNote \nThe **Import Notebooks** dialog works with Git URLs for public repositories only.\n3. Click **Import**.\n3. [Select the cluster to attach this notebook to](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach-a-notebook-to-a-cluster).\n4. Click **Run All**.\n5. Wait while the notebook runs. \nAfter the notebook finishes running, in the notebook you should see information about the number of passing and failed tests, along with other related details. If the cluster was not already running when you started running this notebook, it could take several minutes for the cluster to start up before displaying the results. 
\nYour repo structure should now look like this: \n```\n\u251c\u2500\u2500 covid_analysis\n\u2502 \u2514\u2500\u2500 transforms.py\n\u251c\u2500\u2500 notebooks\n\u2502 \u251c\u2500\u2500 covid_eda_modular\n\u2502 \u251c\u2500\u2500 covid_eda_raw (optional)\n\u2502 \u2514\u2500\u2500 run_unit_tests\n\u251c\u2500\u2500 requirements.txt\n\u2514\u2500\u2500 tests\n    \u251c\u2500\u2500 testdata.csv\n    \u2514\u2500\u2500 transforms_test.py\n\n``` \n### Step 4.4: Check in the notebook and related tests \n1. Next to the notebook\u2019s name, click the **first\\_tests** Git branch button.\n2. In the **best-notebooks** dialog, on the **Changes** tab, make sure the following are selected: \n* **tests\/transforms\\_test.py**\n* **notebooks\/run\\_unit\\_tests.py**\n* **tests\/testdata.csv**\n3. For **Commit message (required)**, enter `Added tests`.\n4. For **Description (optional)**, enter `These are the unit tests for the shared code.`\n5. Click **Commit & Push**.\n6. Click the pull request link in **Create a pull request on your git provider** in the banner.\n7. In GitHub, create the pull request, and then merge the pull request into the `main` branch.\n8. Back in your Databricks workspace, close the **best-notebooks** dialog if it is still showing.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/best-practices.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Software engineering best practices for notebooks\n##### Step 5: Create a job to run the notebooks\n\nIn previous steps, you tested your shared code manually and ran your notebooks manually. In this step, you use a Databricks job to test your shared code and run your notebooks automatically, either on-demand or on a regular schedule. \n### Step 5.1: Create a job task to run the testing notebook \n1. On the workspace sidebar, click **Workflows**.\n2. On the **Jobs** tab, click **Create Job**.\n3. Edit the name of the job to be `covid_report`.\n4. 
For **Task name**, enter `run_notebook_tests`.\n5. For **Type**, select **Notebook**.\n6. For **Source**, select **Git provider**.\n7. Click **Add a git reference**.\n8. In the **Git information** dialog: \n1. For **Git repository URL**, enter the GitHub [Clone with HTTPS](https:\/\/docs.github.com\/repositories\/creating-and-managing-repositories\/cloning-a-repository) URL for your GitHub repo. This article assumes that your URL ends with `best-notebooks.git`, for example `https:\/\/github.com\/<your-GitHub-username>\/best-notebooks.git`.\n2. For **Git provider**, select **GitHub**.\n3. For **Git reference (branch \/ tag \/ commit)**, enter `main`.\n4. Next to **Git reference (branch \/ tag \/ commit)**, select **branch**.\n5. Click **Confirm**.\n9. For **Path**, enter `notebooks\/run_unit_tests`. Do not add the `.py` file extension.\n10. For **Cluster**, select the cluster from the previous step.\n11. Click **Create task**. \nNote \nIn this scenario, Databricks does not recommend that you use the schedule button in the notebook as described in [Create and manage scheduled notebook jobs](https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html) to schedule a job to run this notebook periodically. This is because the schedule button creates a job by using the latest *working* copy of the notebook in the workspace repo. Instead, Databricks recommends that you follow the preceding instructions to create a job that uses the latest *committed* version of the notebook in the repo. \n### Step 5.2: Create a job task to run the main notebook \n1. Click the **+ Add task** icon.\n2. A pop-up menu appears. Select **Notebook**.\n3. For **Task name**, enter `run_main_notebook`.\n4. For **Type**, select **Notebook**.\n5. For **Path**, enter `notebooks\/covid_eda_modular`. Do not add the `.py` file extension.\n6. For **Cluster**, select the cluster from the previous step.\n7. Verify that the **Depends on** value is `run_notebook_tests`.\n8. Click **Create task**. 
\n### Step 5.3: Run the job \n1. Click **Run now**.\n2. In the pop-up, click **View run**. \nNote \nIf the pop-up disappears too quickly, then do the following: \n1. On the sidebar in the **Data Science & Engineering** or **Databricks Machine Learning** environment, click **Workflows**.\n2. On the **Job runs** tab, click the **Start time** value for the latest job with **covid\\_report** in the **Jobs** column.\n3. To see the job results, click on the **run\\_notebook\\_tests** tile, the **run\\_main\\_notebook** tile, or both. The results on each tile are the same as if you ran the notebooks yourself, one by one. \nNote \nThis job ran on-demand. To set up this job to run on a regular basis, see [Add a job schedule](https:\/\/docs.databricks.com\/workflows\/jobs\/schedule-jobs.html#job-schedule).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/best-practices.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Software engineering best practices for notebooks\n##### (Optional) Step 6: Set up the repo to test the code and run the notebook automatically whenever the code changes\n\nIn the previous step, you used a job to automatically test your shared code and run your notebooks at a point in time or on a recurring basis. However, you may prefer to trigger tests automatically when changes are merged into your GitHub repo, using a CI\/CD tool such as [GitHub Actions](https:\/\/docs.github.com\/actions). \n### Step 6.1: Set up GitHub access to your workspace \nIn this substep, you set up a GitHub Actions workflow that runs jobs in the workspace whenever changes are merged into your repository. You do this by giving GitHub a unique Databricks token for access. \nFor security reasons, Databricks discourages you from giving your Databricks workspace user\u2019s personal access token to GitHub. 
Instead, Databricks recommends that you give GitHub a Databricks access token that is associated with a Databricks service principal. For instructions, see the [AWS](https:\/\/github.com\/marketplace\/actions\/run-databricks-notebook#aws) section of the [Run Databricks Notebook GitHub Action](https:\/\/github.com\/marketplace\/actions\/run-databricks-notebook) page in the GitHub Actions Marketplace. \nImportant \nNotebooks are run with all of the workspace permissions of the identity that is associated with the token, so Databricks recommends using a service principal. If you really want to give your Databricks workspace user\u2019s personal access token to GitHub for personal exploration purposes only, and you understand that for security reasons Databricks discourages this practice, see the instructions to [create your workspace user\u2019s personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). \n### Step 6.2: Add the GitHub Actions workflow \nIn this substep, you add a GitHub Actions workflow to run the `run_unit_tests` notebook whenever there is a pull request to the repo. \nThis substep stores the GitHub Actions workflow in a file that is stored within multiple folder levels in your GitHub repo. GitHub Actions requires a specific nested folder hierarchy to exist in your repo in order to work properly. To complete this step, you must use the website for your GitHub repo, because the Databricks Git folder user interface does not support creating nested folder hierarchies. \n1. In the website for your GitHub repo, click the **Code** tab.\n2. Click the arrow next to **main** to expand the **Switch branches or tags** drop-down list.\n3. In the **Find or create a branch** box, enter `adding_github_actions`.\n4. Click **Create branch: adding\\_github\\_actions from \u2018main\u2019**.\n5. Click **Add file > Create new file**.\n6. For **Name your file**, enter `.github\/workflows\/databricks_pull_request_tests.yml`.\n7. 
In the editor window, enter the following code. This code uses the pull\\_request hook from the [Run Databricks Notebook GitHub Action](https:\/\/github.com\/marketplace\/actions\/run-databricks-notebook) to run the `run_unit_tests` notebook. \nIn the following code, replace: \n* `<your-workspace-instance-name>` with your Databricks [instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url).\n* `<your-access-token>` with the token that you generated earlier.\n* `<your-cluster-id>` with your target [cluster ID](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#cluster-url-and-id).\n```\nname: Run pre-merge Databricks tests\n\non:\n  pull_request:\n\nenv:\n  # Replace this value with your workspace instance name.\n  DATABRICKS_HOST: https:\/\/<your-workspace-instance-name>\n\njobs:\n  unit-test-notebook:\n    runs-on: ubuntu-latest\n    timeout-minutes: 15\n\n    steps:\n      - name: Checkout repo\n        uses: actions\/checkout@v2\n      - name: Run test notebook\n        uses: databricks\/run-notebook@main\n        with:\n          databricks-token: <your-access-token>\n\n          local-notebook-path: notebooks\/run_unit_tests.py\n\n          existing-cluster-id: <your-cluster-id>\n\n          git-commit: \"${{ github.event.pull_request.head.sha }}\"\n\n          # Grant all users view permission on the notebook's results, so that they can\n          # see the result of the notebook, if they have related access permissions.\n          access-control-list-json: >\n            [\n              {\n                \"group_name\": \"users\",\n                \"permission_level\": \"CAN_VIEW\"\n              }\n            ]\n          run-name: \"EDA transforms helper module unit tests\"\n\n```\n8. Click **Commit changes**.\n9. In the **Commit changes** dialog, enter `Create databricks_pull_request_tests.yml` into **Commit message**.\n10. Select **Commit directly to the adding\\_github\\_actions branch** and click **Commit changes**.\n11. On the **Code** tab, click **Compare & pull request**, and then create the pull request.\n12. 
On the pull request page, wait for the icon next to **Run pre-merge Databricks tests \/ unit-test-notebook (pull\\_request)** to display a green check mark. (It may take a few moments for the icon to appear.) If there is a red X instead of a green check mark, click **Details** to find out why. If the icon or **Details** are no longer showing, click **Show all checks**.\n13. If the green check mark appears, merge the pull request into the `main` branch.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/best-practices.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Software engineering best practices for notebooks\n##### (Optional) Step 7: Update the shared code in GitHub to trigger tests\n\nIn this step, you make a change to the shared code and then push the change into your GitHub repo, which immediately triggers the tests automatically, based on the GitHub Action from the previous step. \n### Step 7.1: Create another working branch in the repo \n1. From the **Workspace** browser, open the **best-notebooks** Git folder.\n2. Next to the folder\u2019s name, click the **first\\_tests** Git branch button.\n3. In the **best-notebooks** dialog, click the drop-down arrow next to the **first\\_tests** branch, and select **main**.\n4. Click the **Pull** button. If prompted to proceed with pulling, click **Confirm**.\n5. Click the **+** (**Create branch**) button.\n6. Enter `trigger_tests`, and then click **Create**. (You can give your branch a different name.)\n7. Close this dialog. \n### Step 7.2: Change the shared code \n1. From the **Workspace** browser, in the **best-notebooks** Git folder, click the **covid\\_analysis\/transforms.py** file.\n2. Change the third line of this file: \n```\n# Filter by country code.\n\n``` \nTo this: \n```\n# Filter by country code. If not specified, use \"USA.\"\n\n``` \n### Step 7.3: Check in the change to trigger the tests \n1. 
Next to the file\u2019s name, click the **trigger\\_tests** Git branch button.\n2. In the **best-notebooks** dialog, on the **Changes** tab, make sure **covid\\_analysis\/transforms.py** is selected.\n3. For **Commit message (required)**, enter `Updated comment`.\n4. For **Description (optional)**, enter `This updates the comment for filter_country.`\n5. Click **Commit & Push**.\n6. Click the pull request link in **Create a pull request on your git provider** in the banner, and then create the pull request in GitHub.\n7. On the pull request page, wait for the icon next to **Run pre-merge Databricks tests \/ unit-test-notebook (pull\\_request)** to display a green check mark. (It may take a few moments for the icon to appear.) If there is a red X instead of a green check mark, click **Details** to find out why. If the icon or **Details** are no longer showing, click **Show all checks**.\n8. If the green check mark appears, merge the pull request into the `main` branch.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/best-practices.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Develop a job on Databricks by using Databricks Asset Bundles\n\n*Databricks Asset Bundles*, also known simply as *bundles*, enable you to programmatically validate, deploy, and run Databricks resources such as jobs. You can also use bundles to programmatically manage Delta Live Tables pipelines and work with MLOps Stacks. See [What are Databricks Asset Bundles?](https:\/\/docs.databricks.com\/dev-tools\/bundles\/index.html). \nThis article describes steps that you can complete from a local development setup to use a bundle that programmatically manages a job. See [Introduction to Databricks Workflows](https:\/\/docs.databricks.com\/workflows\/index.html). 
\nIf you have existing jobs that were created by using the Databricks Workflows user interface or API that you want to move to bundles, then you must recreate them as bundle configuration files. To do so, Databricks recommends that you first create a bundle by using the steps below and then validate whether the bundle works. You can then add job definitions, notebooks, and other sources to the bundle. See [Add an existing job definition to a bundle](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-bundles-with-jobs.html#existing-job). \nIn addition to using the Databricks CLI to run a job deployed by a bundle, you can also view and run these jobs in the Databricks Jobs UI. See [View and run a job created with a Databricks Asset Bundle](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#view-dabs-jobs).\n\n##### Develop a job on Databricks by using Databricks Asset Bundles\n###### Requirements\n\n* Databricks CLI version 0.218 or above. To check your installed version of the Databricks CLI, run the command `databricks -v`. 
To install the Databricks CLI, see [Install or update the Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/install.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-bundles-with-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Develop a job on Databricks by using Databricks Asset Bundles\n###### Decision: Create the bundle by using a template or manually\n\nDecide whether you want to create an example bundle using a template or manually: \n* [Create the bundle by using a template](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-bundles-with-jobs.html#create-the-bundle-by-using-a-template)\n* [Create the bundle manually](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-bundles-with-jobs.html#create-the-bundle-manually)\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-bundles-with-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Develop a job on Databricks by using Databricks Asset Bundles\n###### Create the bundle by using a template\n\nIn these steps, you create the bundle by using the Databricks default bundle template for Python, which consists of a notebook or Python code, paired with the definition of a job to run it. You then validate, deploy, and run the deployed job within your Databricks workspace. The remote workspace must have workspace files enabled. See [What are workspace files?](https:\/\/docs.databricks.com\/files\/workspace.html). \n### Step 1: Set up authentication \nIn this step, you set up authentication between the Databricks CLI on your development machine and your Databricks workspace. 
This article assumes that you want to use OAuth user-to-machine (U2M) authentication and a corresponding Databricks configuration profile named `DEFAULT` for authentication. \nNote \nU2M authentication is appropriate for trying out these steps in real time. For fully automated workflows, Databricks recommends that you use OAuth machine-to-machine (M2M) authentication instead. See the M2M authentication setup instructions in [Authentication](https:\/\/docs.databricks.com\/dev-tools\/bundles\/index.html#authentication). \n1. Use the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/install.html) to initiate OAuth token management locally by running the following command for each target workspace. \nIn the following command, replace `<workspace-url>` with your Databricks [workspace instance URL](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `https:\/\/dbc-a1b2345c-d6e7.cloud.databricks.com`. \n```\ndatabricks auth login --host <workspace-url>\n\n```\n2. The Databricks CLI prompts you to save the information that you entered as a Databricks [configuration profile](https:\/\/docs.databricks.com\/dev-tools\/auth\/index.html#config-profiles). Press `Enter` to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces. \nTo get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command `databricks auth profiles`. To view a specific profile\u2019s existing settings, run the command `databricks auth env --profile <profile-name>`.\n3. In your web browser, complete the on-screen instructions to log in to your Databricks workspace.\n4. 
To view a profile\u2019s current OAuth token value and the token\u2019s upcoming expiration timestamp, run one of the following commands: \n* `databricks auth token --host <workspace-url>`\n* `databricks auth token -p <profile-name>`\n* `databricks auth token --host <workspace-url> -p <profile-name>` \nIf you have multiple profiles with the same `--host` value, you might need to specify the `--host` and `-p` options together to help the Databricks CLI find the correct matching OAuth token information. \n### Step 2: Create the bundle \nA bundle contains the artifacts you want to deploy and the settings for the resources you want to run. \n1. Use your terminal or command prompt to switch to a directory on your local development machine that will contain the template\u2019s generated bundle.\n2. Use the Databricks CLI to run the `bundle init` command: \n```\ndatabricks bundle init\n\n```\n3. For `Template to use`, leave the default value of `default-python` by pressing `Enter`.\n4. For `Unique name for this project`, leave the default value of `my_project`, or type a different value, and then press `Enter`. This determines the name of the root directory for this bundle. This root directory is created within your current working directory.\n5. For `Include a stub (sample) notebook`, select `yes` and press `Enter`.\n6. For `Include a stub (sample) DLT pipeline`, select `no` and press `Enter`. This instructs the Databricks CLI to not define a sample Delta Live Tables pipeline in your bundle.\n7. For `Include a stub (sample) Python package`, select `no` and press `Enter`. This instructs the Databricks CLI to not add sample Python wheel package files or related build instructions to your bundle. \n### Step 3: Explore the bundle \nTo view the files that the template generated, switch to the root directory of your newly created bundle and open this directory with your preferred IDE, for example Visual Studio Code. 
Files of particular interest include the following: \n* `databricks.yml`: This file specifies the bundle\u2019s programmatic name, includes a reference to the job definition, and specifies settings about the target workspace.\n* `resources\/<project-name>_job.yml`: This file specifies the job\u2019s settings, including a default notebook task.\n* `src\/notebook.ipynb`: This file is a sample notebook that, when run, simply initializes an RDD that contains the numbers 1 through 10. \nFor customizing jobs, the mappings within a job declaration correspond to the create job operation\u2019s request payload as defined in [POST \/api\/2.1\/jobs\/create](https:\/\/docs.databricks.com\/api\/workspace\/jobs\/create) in the REST API reference, expressed in YAML format. \nTip \nYou can define, combine, and override the settings for new job clusters in bundles by using the techniques described in [Override cluster settings in Databricks Asset Bundles](https:\/\/docs.databricks.com\/dev-tools\/bundles\/cluster-override.html). \n### Step 4: Validate the project\u2019s bundle configuration file \nIn this step, you check whether the bundle configuration is valid. \n1. From the root directory, use the Databricks CLI to run the `bundle validate` command, as follows: \n```\ndatabricks bundle validate\n\n```\n2. If a summary of the bundle configuration is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step. \nIf you make any changes to your bundle after this step, you should repeat this step to check whether your bundle configuration is still valid. \n### Step 5: Deploy the local project to the remote workspace \nIn this step, you deploy the local notebook to your remote Databricks workspace and create the Databricks job within your workspace. \n1. From the bundle root, use the Databricks CLI to run the `bundle deploy` command as follows: \n```\ndatabricks bundle deploy -t dev\n\n```\n2. 
Check whether the local notebook was deployed: In your Databricks workspace\u2019s sidebar, click **Workspace**.\n3. Click into the **Users > `<your-username>` > .bundle > `<project-name>` > dev > files > src** folder. The notebook should be in this folder.\n4. Check whether the job was created: In your Databricks workspace\u2019s sidebar, click **Workflows**.\n5. On the **Jobs** tab, click **[dev `<your-username>`] `<project-name>_job`**.\n6. Click the **Tasks** tab. There should be one task: **notebook\\_task**. \nIf you make any changes to your bundle after this step, you should repeat steps 4-5 to check whether your bundle configuration is still valid and then redeploy the project. \n### Step 6: Run the deployed project \nIn this step, you run the Databricks job in your workspace. \n1. From the root directory, use the Databricks CLI to run the `bundle run` command, as follows, replacing `<project-name>` with the name of your project from Step 2: \n```\ndatabricks bundle run -t dev <project-name>_job\n\n```\n2. Copy the value of `Run URL` that appears in your terminal and paste this value into your web browser to open your Databricks workspace.\n3. In your Databricks workspace, after the job task completes successfully and shows a green title bar, click the job task to see the results. \nIf you make any changes to your bundle after this step, you should repeat steps 4-6 to check whether your bundle configuration is still valid, redeploy the project, and run the redeployed project. \n### Step 7: Clean up \nIn this step, you delete the deployed notebook and the job from your workspace. \n1. From the root directory, use the Databricks CLI to run the `bundle destroy` command, as follows: \n```\ndatabricks bundle destroy\n\n```\n2. Confirm the job deletion request: When prompted to permanently destroy resources, type `y` and press `Enter`.\n3. 
Confirm the notebook deletion request: When prompted to permanently destroy the previously deployed folder and all of its files, type `y` and press `Enter`.\n4. If you also want to delete the bundle from your development machine, you can now delete the local directory from Step 2. \nYou have reached the end of the steps for creating a bundle by using a template.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-bundles-with-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Develop a job on Databricks by using Databricks Asset Bundles\n###### Create the bundle manually\n\nIn these steps, you create a bundle from scratch. This simple bundle consists of two notebooks and the definition of a Databricks job to run these notebooks. You then validate, deploy, and run the deployed notebooks from the job within your Databricks workspace. These steps automate the quickstart titled [Create your first workflow with a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-quickstart.html). \n### Step 1: Create the bundle \nA bundle contains the artifacts you want to deploy and the settings for the resources you want to run. \n1. Create or identify an empty directory on your development machine.\n2. Switch to the empty directory in your terminal, or open the empty directory in your IDE. \nTip \nYour empty directory could be associated with a cloned repository that is managed by a Git provider. This enables you to manage your bundle with external version control and to more easily collaborate with other developers and IT professionals on your project. However, to help simplify this demonstration, a cloned repo is not used here. \nIf you choose to clone a repo for this demo, Databricks recommends that the repo is empty or has only basic files in it such as `README` and `.gitignore`. 
Otherwise, any pre-existing files in the repo might be unnecessarily synchronized to your Databricks workspace. \n### Step 2: Add notebooks to the project \nIn this step, you add two notebooks to your project. The first notebook gets a list of trending baby names since 2007 from the New York State Department of Health\u2019s public data sources. See [Baby Names: Trending by Name: Beginning 2007](https:\/\/health.data.ny.gov\/Health\/Baby-Names-Beginning-2007\/jxy9-yhdk) on the department\u2019s website. The first notebook then saves this data to your Databricks Unity Catalog volume named `my-volume` in a schema named `default` within a catalog named `main`. The second notebook queries the saved data and displays aggregated counts of the baby names by first name and sex for 2014. \n1. From the directory\u2019s root, create the first notebook, a file named `retrieve-baby-names.py`.\n2. Add the following code to the `retrieve-baby-names.py` file: \n```\n# Databricks notebook source\nimport requests\n\nresponse = requests.get('http:\/\/health.data.ny.gov\/api\/views\/jxy9-yhdk\/rows.csv')\ncsvfile = response.content.decode('utf-8')\ndbutils.fs.put(\"\/Volumes\/main\/default\/my-volume\/babynames.csv\", csvfile, True)\n\n```\n3. Create the second notebook, a file named `filter-baby-names.py`, in the same directory.\n4. 
Add the following code to the `filter-baby-names.py` file: \n```\n# Databricks notebook source\nbabynames = spark.read.format(\"csv\").option(\"header\", \"true\").option(\"inferSchema\", \"true\").load(\"\/Volumes\/main\/default\/my-volume\/babynames.csv\")\nbabynames.createOrReplaceTempView(\"babynames_table\")\nyears = spark.sql(\"select distinct(Year) from babynames_table\").toPandas()['Year'].tolist()\nyears.sort()\ndbutils.widgets.dropdown(\"year\", \"2014\", [str(x) for x in years])\ndisplay(babynames.filter(babynames.Year == dbutils.widgets.get(\"year\")))\n\n``` \n### Step 3: Add a bundle configuration schema file to the project \nIf you are using an IDE such as Visual Studio Code, PyCharm Professional, or IntelliJ IDEA Ultimate that provides support for YAML files and JSON schema files, you can use your IDE to not only create the bundle configuration schema file but to check your project\u2019s bundle configuration file syntax and formatting and provide code completion hints, as follows. Note that while the bundle configuration file that you will create later in Step 5 is YAML-based, the bundle configuration schema file in this step is JSON-based. \n1. Add YAML language server support to Visual Studio Code, for example by installing the [YAML](https:\/\/marketplace.visualstudio.com\/items?itemName=redhat.vscode-yaml) extension from the Visual Studio Code Marketplace.\n2. Generate the Databricks Asset Bundle configuration JSON schema file by using the Databricks CLI to run the `bundle schema` command and redirect the output to a JSON file. For example, generate a file named `bundle_config_schema.json` within the current directory, as follows: \n```\ndatabricks bundle schema > bundle_config_schema.json\n\n```\n3. 
Note that later in Step 5, you will add the following comment to the beginning of your bundle configuration file, which associates your bundle configuration file with the specified JSON schema file: \n```\n# yaml-language-server: $schema=bundle_config_schema.json\n\n``` \nNote \nIn the preceding comment, if your Databricks Asset Bundle configuration JSON schema file is in a different path, replace `bundle_config_schema.json` with the full path to your schema file. \n1. Generate the Databricks Asset Bundle configuration JSON schema file by using the Databricks CLI to run the `bundle schema` command and redirect the output to a JSON file. For example, generate a file named `bundle_config_schema.json` within the current directory, as follows: \n```\ndatabricks bundle schema > bundle_config_schema.json\n\n```\n2. Configure PyCharm to recognize the bundle configuration JSON schema file, and then complete the JSON schema mapping, by following the instructions in [Configure a custom JSON schema](https:\/\/www.jetbrains.com\/help\/pycharm\/json.html#ws_json_schema_add_custom_procedure).\n3. Note that later in Step 5, you will use PyCharm to create or open a bundle configuration file. By convention, this file is named `databricks.yml`. \n1. Generate the Databricks Asset Bundle configuration JSON schema file by using the Databricks CLI to run the `bundle schema` command and redirect the output to a JSON file. For example, generate a file named `bundle_config_schema.json` within the current directory, as follows: \n```\ndatabricks bundle schema > bundle_config_schema.json\n\n```\n2. Configure IntelliJ IDEA to recognize the bundle configuration JSON schema file, and then complete the JSON schema mapping, by following the instructions in [Configure a custom JSON schema](https:\/\/www.jetbrains.com\/help\/idea\/json.html#ws_json_schema_add_custom_procedure).\n3. Note that later in Step 5, you will use IntelliJ IDEA to create or open a bundle configuration file. 
By convention, this file is named `databricks.yml`. \n### Step 4: Set up authentication \nIn this step, you set up authentication between the Databricks CLI on your development machine and your Databricks workspace. This article assumes that you want to use OAuth user-to-machine (U2M) authentication and a corresponding Databricks configuration profile named `DEFAULT` for authentication. \nNote \nU2M authentication is appropriate for trying out these steps in real time. For fully automated workflows, Databricks recommends that you use OAuth machine-to-machine (M2M) authentication instead. See the M2M authentication setup instructions in [Authentication](https:\/\/docs.databricks.com\/dev-tools\/bundles\/index.html#authentication). \n1. Use the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/install.html) to initiate OAuth token management locally by running the following command for each target workspace. \nIn the following command, replace `<workspace-url>` with your Databricks [workspace instance URL](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `https:\/\/dbc-a1b2345c-d6e7.cloud.databricks.com`. \n```\ndatabricks auth login --host <workspace-url>\n\n```\n2. The Databricks CLI prompts you to save the information that you entered as a Databricks [configuration profile](https:\/\/docs.databricks.com\/dev-tools\/auth\/index.html#config-profiles). Press `Enter` to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces. \nTo get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command `databricks auth profiles`. To view a specific profile\u2019s existing settings, run the command `databricks auth env --profile <profile-name>`.\n3. 
In your web browser, complete the on-screen instructions to log in to your Databricks workspace.\n4. To view a profile\u2019s current OAuth token value and the token\u2019s upcoming expiration timestamp, run one of the following commands: \n* `databricks auth token --host <workspace-url>`\n* `databricks auth token -p <profile-name>`\n* `databricks auth token --host <workspace-url> -p <profile-name>` \nIf you have multiple profiles with the same `--host` value, you might need to specify the `--host` and `-p` options together to help the Databricks CLI find the correct matching OAuth token information. \n### Step 5: Add a bundle configuration file to the project \nIn this step, you define how you want to deploy and run the two notebooks. For this demo, you want to use a Databricks job to run the first notebook and then the second notebook. Because the first notebook saves the data and the second notebook queries the saved data, you want the first notebook to finish running before the second notebook starts. You model these objectives within a bundle configuration file in your project. \n1. From the directory\u2019s root, create the bundle configuration file, a file named `databricks.yml`.\n2. Add the following code to the `databricks.yml` file, replacing `<workspace-url>` with your [workspace URL](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `https:\/\/dbc-a1b2345c-d6e7.cloud.databricks.com`. This URL must match the one in your `.databrickscfg` file: \nTip \nThe first line, starting with `# yaml-language-server`, is required only if your IDE supports it. See Step 3 earlier for details. 
\n```\n# yaml-language-server: $schema=bundle_config_schema.json\nbundle:\n  name: baby-names\n\nresources:\n  jobs:\n    retrieve-filter-baby-names-job:\n      name: retrieve-filter-baby-names-job\n      job_clusters:\n        - job_cluster_key: common-cluster\n          new_cluster:\n            spark_version: 12.2.x-scala2.12\n            node_type_id: i3.xlarge\n            num_workers: 1\n      tasks:\n        - task_key: retrieve-baby-names-task\n          job_cluster_key: common-cluster\n          notebook_task:\n            notebook_path: .\/retrieve-baby-names.py\n        - task_key: filter-baby-names-task\n          depends_on:\n            - task_key: retrieve-baby-names-task\n          job_cluster_key: common-cluster\n          notebook_task:\n            notebook_path: .\/filter-baby-names.py\n\ntargets:\n  development:\n    workspace:\n      host: <workspace-url>\n\n``` \nFor customizing jobs, the mappings within a job declaration correspond to the create job operation\u2019s request payload as defined in [POST \/api\/2.1\/jobs\/create](https:\/\/docs.databricks.com\/api\/workspace\/jobs\/create) in the REST API reference, expressed in YAML format. \nTip \nYou can define, combine, and override the settings for new job clusters in bundles by using the techniques described in [Override cluster settings in Databricks Asset Bundles](https:\/\/docs.databricks.com\/dev-tools\/bundles\/cluster-override.html). \n### Step 6: Validate the project\u2019s bundle configuration file \nIn this step, you check whether the bundle configuration is valid. \n1. Use the Databricks CLI to run the `bundle validate` command, as follows: \n```\ndatabricks bundle validate\n\n```\n2. If a summary of the bundle configuration is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step. \nIf you make any changes to your bundle after this step, you should repeat this step to check whether your bundle configuration is still valid. 
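Before deploying, it can help to confirm that every `depends_on` entry in the job definition points at a task that exists, and to see the order in which the tasks would run. The following is a minimal sketch under stated assumptions: it mirrors the `tasks` mapping from `databricks.yml` as a plain Python list instead of parsing the file, the helper name `run_order` is hypothetical, and this check is not part of the Databricks CLI or of `bundle validate`.

```python
# Task keys and dependencies copied by hand from the tasks mapping in databricks.yml.
tasks = [
    {'task_key': 'retrieve-baby-names-task', 'depends_on': []},
    {'task_key': 'filter-baby-names-task',
     'depends_on': ['retrieve-baby-names-task']},
]


def run_order(tasks):
    # Return task keys in an order that respects depends_on.
    # Raises ValueError for unknown task keys or dependency cycles.
    known = {t['task_key'] for t in tasks}
    for t in tasks:
        for dep in t['depends_on']:
            if dep not in known:
                raise ValueError('unknown dependency: ' + dep)
    ordered, remaining = [], list(tasks)
    while remaining:
        # A task is ready once all of its dependencies are already ordered.
        ready = [t for t in remaining
                 if all(dep in ordered for dep in t['depends_on'])]
        if not ready:
            raise ValueError('dependency cycle detected')
        for t in ready:
            ordered.append(t['task_key'])
            remaining.remove(t)
    return ordered


print(run_order(tasks))
```

For the two tasks in this bundle, the retrieve task is ordered before the filter task, which matches the requirement that the data is saved before it is queried.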
\n### Step 7: Deploy the local project to the remote workspace \nIn this step, you deploy the two local notebooks to your remote Databricks workspace and create the Databricks job within your workspace. \n1. Use the Databricks CLI to run the `bundle deploy` command as follows: \n```\ndatabricks bundle deploy -t development\n\n```\n2. Check whether the two local notebooks were deployed: In your Databricks workspace\u2019s sidebar, click **Workspace**.\n3. Click into the **Users > `<your-username>` > .bundle > baby-names > development > files** folder. The two notebooks should be in this folder.\n4. Check whether the job was created: In your Databricks workspace\u2019s sidebar, click **Workflows**.\n5. On the **Jobs** tab, click **retrieve-filter-baby-names-job**.\n6. Click the **Tasks** tab. There should be two tasks: **retrieve-baby-names-task** and **filter-baby-names-task**. \nIf you make any changes to your bundle after this step, you should repeat steps 6-7 to check whether your bundle configuration is still valid and then redeploy the project. \n### Step 8: Run the deployed project \nIn this step, you run the Databricks job in your workspace. \n1. Use the Databricks CLI to run the `bundle run` command, as follows: \n```\ndatabricks bundle run -t development retrieve-filter-baby-names-job\n\n```\n2. Copy the value of `Run URL` that appears in your terminal and paste this value into your web browser to open your Databricks workspace.\n3. In your Databricks workspace, after the two tasks complete successfully and show green title bars, click the **filter-baby-names-task** task to see the query results. \nIf you make any changes to your bundle after this step, you should repeat steps 6-8 to check whether your bundle configuration is still valid, redeploy the project, and run the redeployed project. \n### Step 9: Clean up \nIn this step, you delete the two deployed notebooks and the job from your workspace. \n1. 
Use the Databricks CLI to run the `bundle destroy` command, as follows: \n```\ndatabricks bundle destroy\n\n```\n2. Confirm the job deletion request: When prompted to permanently destroy resources, type `y` and press `Enter`.\n3. Confirm the notebooks deletion request: When prompted to permanently destroy the previously deployed folder and all of its files, type `y` and press `Enter`. \nRunning the `bundle destroy` command deletes only the deployed job and the folder containing the two deployed notebooks. This command does not delete any side effects, such as the `babynames.csv` file that the first notebook created. To delete the `babynames.csv` file, do the following: \n1. In the sidebar of your Databricks workspace, click **Catalog**.\n2. Click **Browse DBFS**.\n3. Click the **FileStore** folder.\n4. Click the dropdown arrow next to **babynames.csv**, and click **Delete**.\n5. If you also want to delete the bundle from your development machine, you can now delete the local directory from Step 1.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-bundles-with-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Develop a job on Databricks by using Databricks Asset Bundles\n###### Add an existing job definition to a bundle\n\nYou can use an existing job definition as a basis to define a new job in a bundle configuration file. To do this, complete the following steps. \nNote \nThe following steps create a new job that has the same settings as the existing job. However, the new job has a different job ID than the existing job. You cannot automatically import an existing job ID into a bundle. \n### Step 1: Get the existing job definition in YAML format \nIn this step, use the Databricks workspace user interface to get the YAML representation of the existing job definition. \n1. 
In your Databricks workspace\u2019s sidebar, click **Workflows**.\n2. On the **Jobs** tab, click your job\u2019s **Name** link.\n3. Next to the **Run now** button, click the ellipses, and then click **View YAML**.\n4. On the **Create** tab, copy the job definition\u2019s YAML to your local clipboard by clicking **Copy**. \n### Step 2: Add the job definition YAML to a bundle configuration file \nIn your bundle configuration file, add the YAML that you copied in the previous step at one of the locations labeled `<job-yaml-can-go-here>`, as follows: \n```\nresources:\n  jobs:\n    <some-unique-programmatic-identifier-for-this-job>:\n      <job-yaml-can-go-here>\n\ntargets:\n  <some-unique-programmatic-identifier-for-this-target>:\n    resources:\n      jobs:\n        <some-unique-programmatic-identifier-for-this-job>:\n          <job-yaml-can-go-here>\n\n``` \n### Step 3: Add notebooks, Python files, and other artifacts to the bundle \nAny Python files and notebooks that are referenced in the existing job should be moved to the bundle\u2019s sources. \nFor better compatibility with bundles, notebooks should use the IPython notebook format (`.ipynb`). If you develop the bundle locally, you can export an existing notebook from a Databricks workspace into the `.ipynb` format by clicking **File > Export > IPython Notebook** from the Databricks notebook user interface. By convention, you should then put the downloaded notebook into the `src\/` directory in your bundle. \nAfter you add your notebooks, Python files, and other artifacts to the bundle, make sure that your job definition references them. 
For example, for a notebook named `hello.ipynb` in a `src\/` directory that sits in the same folder as the bundle configuration file, the job definition might be expressed as follows: \n```\nresources:\n  jobs:\n    hello-job:\n      name: hello-job\n      tasks:\n        - task_key: hello-task\n          notebook_task:\n            notebook_path: .\/src\/hello.ipynb\n\n``` \n### Step 4: Validate, deploy, and run the new job \n1. Validate that the bundle\u2019s configuration files are syntactically correct by running the following command: \n```\ndatabricks bundle validate\n\n```\n2. Deploy the bundle by running the following command. In this command, replace `<target-identifier>` with the unique programmatic identifier for the target from the bundle configuration: \n```\ndatabricks bundle deploy -t <target-identifier>\n\n```\n3. Run the job with the following command. \n```\ndatabricks bundle run -t <target-identifier> <job-identifier>\n\n``` \n* Replace `<target-identifier>` with the unique programmatic identifier for the target from the bundle configuration.\n* Replace `<job-identifier>` with the unique programmatic identifier for the job from the bundle configuration.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-bundles-with-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Develop a job on Databricks by using Databricks Asset Bundles\n###### Configure a job that uses serverless compute\n\nPreview \nServerless compute for workflows is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). For information on eligibility and enablement, see [Enable serverless compute public preview](https:\/\/docs.databricks.com\/admin\/workspace-settings\/serverless.html). 
\nThe following examples demonstrate bundle configurations to create a job that uses serverless compute. \nTo use serverless compute to run a job that includes notebook tasks, omit the `job_clusters` configuration from the bundle configuration file. \n```\n# yaml-language-server: $schema=bundle_config_schema.json\nbundle:\n  name: baby-names\n\nresources:\n  jobs:\n    retrieve-filter-baby-names-job-serverless:\n      name: retrieve-filter-baby-names-job-serverless\n      tasks:\n        - task_key: retrieve-baby-names-task\n          notebook_task:\n            notebook_path: .\/retrieve-baby-names.py\n        - task_key: filter-baby-names-task\n          depends_on:\n            - task_key: retrieve-baby-names-task\n          notebook_task:\n            notebook_path: .\/filter-baby-names.py\n\ntargets:\n  development:\n    workspace:\n      host: <workspace-url>\n\n``` \nTo use serverless compute to run a job that includes Python tasks, include the `environments` configuration. \n```\n# yaml-language-server: $schema=bundle_config_schema.json\nbundle:\n  name: serverless-python-tasks\n\nresources:\n  jobs:\n    serverless-python-job:\n      name: serverless-job-with-python-tasks\n\n      tasks:\n        - task_key: wheel-task-1\n          python_wheel_task:\n            entry_point: main\n            package_name: wheel_package\n          environment_key: Default\n\n      environments:\n        - environment_key: Default\n          spec:\n            client: \"1\"\n            dependencies:\n              - workflows_authoring_toolkit==0.0.1\n\ntargets:\n  development:\n    workspace:\n      host: <workspace-url>\n\n``` \nSee [Run your Databricks job with serverless compute for workflows](https:\/\/docs.databricks.com\/workflows\/jobs\/run-serverless-jobs.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-bundles-with-jobs.html"} +{"content":"# Databricks data engineering\n## What are init scripts?\n#### Set and use environment variables with init scripts\n\nInit scripts have access to all environment variables present on a cluster. Databricks sets many default variables that can be useful in init script logic. 
\nEnvironment variables set in the Spark config are available to init scripts. See [Environment variables](https:\/\/docs.databricks.com\/compute\/configure.html#env-var).\n\n#### Set and use environment variables with init scripts\n##### What environment variables are exposed to the init script by default?\n\nCluster-scoped and global init scripts support the following environment variables: \n* `DB_CLUSTER_ID`: the ID of the cluster on which the script is running. See the [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters).\n* `DB_CONTAINER_IP`: the private IP address of the container in which Spark runs. The init script is run inside this container. See the [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters).\n* `DB_IS_DRIVER`: whether the script is running on a driver node.\n* `DB_DRIVER_IP`: the IP address of the driver node.\n* `DB_INSTANCE_TYPE`: the instance type of the host VM.\n* `DB_CLUSTER_NAME`: the name of the cluster the script is executing on.\n* `DB_IS_JOB_CLUSTER`: whether the cluster was created to run a job. See [Create a job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#create-a-job). \nFor example, if you want to run part of a script only on a driver node, you could write a script like: \n```\necho $DB_IS_DRIVER\nif [[ $DB_IS_DRIVER = \"TRUE\" ]]; then\n<run this part only on driver>\nelse\n<run this part only on workers>\nfi\n<run this part on both driver and workers>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/environment-variables.html"} +{"content":"# Databricks data engineering\n## What are init scripts?\n#### Set and use environment variables with init scripts\n##### Use secrets in init scripts\n\nYou can use any valid variable name when you reference a secret. Access to secrets referenced in environment variables is determined by the permissions of the user who configured the cluster. 
Secrets stored in environment variables are accessible by all users of the cluster, but are redacted from plaintext display. \nSee [Reference a secret in an environment variable](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html#reference-a-secret-in-an-environment-variable).\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/environment-variables.html"} +{"content":"# Get started: Account and workspace setup\n### Tutorial: Run an end-to-end lakehouse analytics pipeline\n\nThis tutorial shows you how to set up an end-to-end analytics pipeline for a Databricks lakehouse. \nImportant \nThis tutorial uses interactive notebooks to complete common ETL tasks in Python on Unity Catalog enabled clusters. If you are not using Unity Catalog, see [Run your first ETL workload on Databricks](https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html"} +{"content":"# Get started: Account and workspace setup\n### Tutorial: Run an end-to-end lakehouse analytics pipeline\n#### Tasks in this tutorial\n\nBy the end of this article, you will feel comfortable: \n1. [Launching a Unity Catalog enabled compute cluster](https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html#cluster).\n2. [Creating a Databricks notebook](https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html#notebook).\n3. [Writing and reading data from a Unity Catalog external location](https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html#external-location).\n4. [Configuring incremental data ingestion to a Unity Catalog table with Auto Loader](https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html#auto-loader).\n5. [Executing notebook cells to process, query, and preview data](https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html#process).\n6. 
[Scheduling a notebook as a Databricks job](https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html#schedule).\n7. [Querying Unity Catalog tables from Databricks SQL](https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html#query) \nDatabricks provides a suite of production-ready tools that allow data professionals to quickly develop and deploy extract, transform, and load (ETL) pipelines. Unity Catalog allows data stewards to configure and secure storage credentials, external locations, and database objects for users throughout an organization. Databricks SQL allows analysts to run SQL queries against the same tables used in production ETL workloads, allowing for real time business intelligence at scale. \nYou can also use Delta Live Tables to build ETL pipelines. Databricks created Delta Live Tables to reduce the complexity of building, deploying, and maintaining production ETL pipelines. See [Tutorial: Run your first Delta Live Tables pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html"} +{"content":"# Get started: Account and workspace setup\n### Tutorial: Run an end-to-end lakehouse analytics pipeline\n#### Requirements\n\n* You are logged into Databricks. \nNote \nIf you do not have cluster control privileges, you can still complete most of the steps below as long as you have [access to a cluster](https:\/\/docs.databricks.com\/compute\/use-compute.html#permissions).\n\n### Tutorial: Run an end-to-end lakehouse analytics pipeline\n#### Step 1: Create a cluster\n\nTo do exploratory data analysis and data engineering, create a cluster to provide the compute resources needed to execute commands. \n1. Click ![compute icon](https:\/\/docs.databricks.com\/_images\/clusters-icon.png) **Compute** in the sidebar.\n2. 
Click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar, then select **Cluster**. This opens the New Cluster\/Compute page.\n3. Specify a unique name for the cluster.\n4. Select the **Single node** radio button.\n5. Select **Single User** from the **Access mode** dropdown.\n6. Make sure your email address is visible in the **Single User** field.\n7. Select the desired **Databricks runtime version**. Unity Catalog requires 11.1 or above.\n8. Click **Create compute** to create the cluster. \nTo learn more about Databricks clusters, see [Compute](https:\/\/docs.databricks.com\/compute\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html"} +{"content":"# Get started: Account and workspace setup\n### Tutorial: Run an end-to-end lakehouse analytics pipeline\n#### Step 2: Create a Databricks notebook\n\nTo get started writing and executing interactive code on Databricks, create a notebook. \n1. Click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar, then click **Notebook**.\n2. On the Create Notebook page: \n* Specify a unique name for your notebook.\n* Make sure the default language is set to **Python**.\n* Use the **Connect** dropdown menu to select the cluster you created in step 1. \nThe notebook opens with one empty cell. \nTo learn more about creating and managing notebooks, see [Manage notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html"} +{"content":"# Get started: Account and workspace setup\n### Tutorial: Run an end-to-end lakehouse analytics pipeline\n#### Step 3: Write and read data from an external location managed by Unity Catalog\n\nDatabricks recommends using [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html) for incremental data ingestion. 
Auto Loader automatically detects and processes new files as they arrive in cloud object storage. \nUse Unity Catalog to manage secure access to external locations. Users or service principals with `READ FILES` permissions on an external location can use Auto Loader to ingest data. \nNormally, data will arrive in an external location due to writes from other systems. In this demo, you can simulate data arrival by writing out JSON files to an external location. \nCopy the code below into a notebook cell. Replace the string value for `catalog` with the name of a catalog with `CREATE CATALOG` and `USE CATALOG` permissions. Replace the string value for `external_location` with the path for an external location with `READ FILES`, `WRITE FILES`, and `CREATE EXTERNAL TABLE` permissions. \nExternal locations can be defined as an entire storage container, but often point to a directory nested in a container. \nThe correct format for an external location path is `\"s3:\/\/bucket-name\/path\/to\/external_location\"`. \n```\n\nexternal_location = \"<your-external-location>\"\ncatalog = \"<your-catalog>\"\n\ndbutils.fs.put(f\"{external_location}\/filename.txt\", \"Hello world!\", True)\ndisplay(dbutils.fs.head(f\"{external_location}\/filename.txt\"))\ndbutils.fs.rm(f\"{external_location}\/filename.txt\")\n\ndisplay(spark.sql(f\"SHOW SCHEMAS IN {catalog}\"))\n\n``` \nExecuting this cell should print a line reporting that 12 bytes were written, print the string \u201cHello world!\u201d, and display all the databases present in the catalog provided. If you are unable to get this cell to run, confirm that you are in a Unity Catalog enabled workspace and request proper permissions from your workspace administrator to complete this tutorial. \nThe Python code below uses your email address to create a unique database in the catalog provided and a unique storage location in the external location provided. 
Executing this cell will remove all data associated with this tutorial, allowing you to execute this example idempotently. A class is defined and instantiated that you will use to simulate batches of data arriving from a connected system to your source external location. \nCopy this code to a new cell in your notebook and execute it to configure your environment. \nNote \nThe variables defined in this code should allow you to safely execute it without risk of conflicting with existing workspace assets or other users. Restricted network or storage permissions will raise errors when executing this code; contact your workspace administrator to troubleshoot these restrictions. \n```\n\nfrom pyspark.sql.functions import col\n\n# Set parameters for isolation in workspace and reset demo\nusername = spark.sql(\"SELECT regexp_replace(current_user(), '[^a-zA-Z0-9]', '_')\").first()[0]\ndatabase = f\"{catalog}.e2e_lakehouse_{username}_db\"\nsource = f\"{external_location}\/e2e-lakehouse-source\"\ntable = f\"{database}.target_table\"\ncheckpoint_path = f\"{external_location}\/_checkpoint\/e2e-lakehouse-demo\"\n\nspark.sql(f\"SET c.username='{username}'\")\nspark.sql(f\"SET c.database={database}\")\nspark.sql(f\"SET c.source='{source}'\")\n\nspark.sql(\"DROP DATABASE IF EXISTS ${c.database} CASCADE\")\nspark.sql(\"CREATE DATABASE ${c.database}\")\nspark.sql(\"USE ${c.database}\")\n\n# Clear out data from previous demo execution\ndbutils.fs.rm(source, True)\ndbutils.fs.rm(checkpoint_path, True)\n\n# Define a class to load batches of data to source\nclass LoadData:\n\n    def __init__(self, source):\n        self.source = source\n\n    def get_date(self):\n        # Start at 2016-01-01 when no batch has landed yet\n        try:\n            df = spark.read.format(\"json\").load(self.source)\n        except Exception:\n            return \"2016-01-01\"\n        batch_date = df.selectExpr(\"max(distinct(date(tpep_pickup_datetime))) + 1 day\").first()[0]\n        if batch_date.month == 3:\n            raise Exception(\"Source data exhausted\")\n        return batch_date\n\n    def get_batch(self, batch_date):\n        return 
(\n            spark.table(\"samples.nyctaxi.trips\")\n            .filter(col(\"tpep_pickup_datetime\").cast(\"date\") == batch_date)\n        )\n\n    def write_batch(self, batch):\n        batch.write.format(\"json\").mode(\"append\").save(self.source)\n\n    def land_batch(self):\n        batch_date = self.get_date()\n        batch = self.get_batch(batch_date)\n        self.write_batch(batch)\n\nRawData = LoadData(source)\n\n``` \nYou can now land a batch of data by copying the following code into a cell and executing it. You can manually execute this cell up to 60 times to trigger new data arrival. \n```\nRawData.land_batch()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html"} +{"content":"# Get started: Account and workspace setup\n### Tutorial: Run an end-to-end lakehouse analytics pipeline\n#### Step 4: Configure Auto Loader to ingest data to Unity Catalog\n\nDatabricks recommends storing data with [Delta Lake](https:\/\/docs.databricks.com\/delta\/index.html). Delta Lake is an open source storage layer that provides ACID transactions and enables the data lakehouse. Delta Lake is the default format for tables created in Databricks. \nTo configure Auto Loader to ingest data to a Unity Catalog table, copy and paste the following code into an empty cell in your notebook: \n```\n# Import functions\nfrom pyspark.sql.functions import col, current_timestamp\n\n# Configure Auto Loader to ingest JSON data to a Delta table\n# `source`, `checkpoint_path`, and `table` are the variables defined in Step 3\n(spark.readStream\n  .format(\"cloudFiles\")\n  .option(\"cloudFiles.format\", \"json\")\n  .option(\"cloudFiles.schemaLocation\", checkpoint_path)\n  .load(source)\n  .select(\"*\", col(\"_metadata.file_path\").alias(\"source_file\"), current_timestamp().alias(\"processing_time\"))\n  .writeStream\n  .option(\"checkpointLocation\", checkpoint_path)\n  .trigger(availableNow=True)\n  .option(\"mergeSchema\", \"true\")\n  .toTable(table))\n\n``` \nTo learn more about Auto Loader, see [What is Auto Loader?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html). 
\nTo learn more about Structured Streaming with Unity Catalog, see [Using Unity Catalog with Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/unity-catalog.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html"} +{"content":"# Get started: Account and workspace setup\n### Tutorial: Run an end-to-end lakehouse analytics pipeline\n#### Step 5: Process and interact with data\n\nNotebooks execute logic cell-by-cell. Use these steps to execute the logic in your cell: \n1. To run the cell you completed in the previous step, select the cell and press **SHIFT+ENTER**.\n2. To query the table you\u2019ve just created, copy and paste the following code into an empty cell, then press **SHIFT+ENTER** to run the cell. \n```\ndf = spark.read.table(table)\n\n```\n3. To preview the data in your DataFrame, copy and paste the following code into an empty cell, then press **SHIFT+ENTER** to run the cell. \n```\ndisplay(df)\n\n``` \nTo learn more about interactive options for visualizing data, see [Visualizations in Databricks notebooks](https:\/\/docs.databricks.com\/visualizations\/index.html).\n\n### Tutorial: Run an end-to-end lakehouse analytics pipeline\n#### Step 6: Schedule a job\n\nYou can run Databricks notebooks as production scripts by adding them as a task in a Databricks job. In this step, you will create a new job that you can trigger manually. \nTo schedule your notebook as a task: \n1. Click **Schedule** on the right side of the header bar.\n2. Enter a unique name for the **Job name**.\n3. Click **Manual**.\n4. In the **Cluster** drop-down, select the cluster you created in step 1.\n5. Click **Create**.\n6. In the window that appears, click **Run now**.\n7. To see the job run results, click the ![External Link](https:\/\/docs.databricks.com\/_images\/external-link.png) icon next to the **Last run** timestamp. 
\nFor more information on jobs, see [What is Databricks Jobs?](https:\/\/docs.databricks.com\/workflows\/index.html#what-is-jobs).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html"} +{"content":"# Get started: Account and workspace setup\n### Tutorial: Run an end-to-end lakehouse analytics pipeline\n#### Step 7: Query table from Databricks SQL\n\nAnyone with the `USE CATALOG` permission on the current catalog, the `USE SCHEMA` permission on the current schema, and `SELECT` permissions on the table can query the contents of the table from their preferred Databricks API. \nYou need access to a running SQL warehouse to execute queries in Databricks SQL. \nThe table you created earlier in this tutorial has the name `target_table`. You can query it using the catalog you provided in the first cell and the database with the pattern `e2e_lakehouse_<your-username>_db`. You can use [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html) to find the data objects that you created.\n\n### Tutorial: Run an end-to-end lakehouse analytics pipeline\n#### Additional Integrations\n\nLearn more about integrations and tools for data engineering with Databricks: \n* [Connect your favorite IDE](https:\/\/docs.databricks.com\/dev-tools\/index.html)\n* [Use dbt with Databricks](https:\/\/docs.databricks.com\/partners\/prep\/dbt.html)\n* [Learn about the Databricks Command Line Interface (CLI)](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html)\n* [Learn about the Databricks Terraform Provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with the Databricks AutoML UI\n\nThis article demonstrates how to train a machine learning model using [AutoML](https:\/\/docs.databricks.com\/machine-learning\/automl\/index.html) and the 
Databricks Machine Learning UI. The AutoML UI steps you through the process of training a classification, regression or forecasting model on a dataset. \nTo access the UI: \n1. In the sidebar, select **New > AutoML Experiment**. \nYou can also create a new AutoML experiment from the [Experiments page](https:\/\/docs.databricks.com\/mlflow\/experiments.html). \nThe **Configure AutoML experiment page** displays. On this page, you configure the AutoML process, specifying the dataset, problem type, target or label column to predict, metric to use to evaluate and score the experiment runs, and stopping conditions.\n\n#### Train ML models with the Databricks AutoML UI\n##### Requirements\n\nSee [Requirements](https:\/\/docs.databricks.com\/machine-learning\/automl\/index.html#requirement) for AutoML experiments.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with the Databricks AutoML UI\n##### Set up classification or regression problems\n\nYou can set up a classification or regression problem using the AutoML UI with the following steps: \n1. In the **Compute** field, select a cluster running Databricks Runtime ML.\n2. From the **ML problem type** drop-down menu, select **Regression** or **Classification**. If you are trying to predict a continuous numeric value for each observation, such as annual income, select regression. If you are trying to assign each observation to one of a discrete set of classes, such as good credit risk or bad credit risk, select classification.\n3. Under **Dataset**, select **Browse**.\n4. Navigate to the table you want to use and click **Select**. The table schema appears. 
\nFor classification and regression problems only, you can specify which [columns to include in training](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html#column-select) and select [custom imputation methods](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html#impute-missing-values).\n5. Click in the **Prediction target** field. A drop-down appears listing the columns shown in the schema. Select the column you want the model to predict.\n6. The **Experiment name** field shows the default name. To change it, type the new name in the field. \nYou can also: \n* Specify [additional configuration options](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html#advanced-config).\n* Use [existing feature tables in Feature Store to augment the original input dataset](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html#feature-store).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with the Databricks AutoML UI\n##### Set up forecasting problems\n\nYou can set up a forecasting problem using the AutoML UI with the following steps: \n1. In the **Compute** field, select a cluster running Databricks Runtime 10.0 ML or above.\n2. From the **ML problem type** drop-down menu, select **Forecasting**.\n3. Under **Dataset**, click **Browse**. Navigate to the table you want to use and click **Select**. The table schema appears.\n4. Click in the **Prediction target** field. A dropdown menu appears listing the columns shown in the schema. Select the column you want the model to predict.\n5. Click in the **Time column** field. A drop-down appears showing the dataset columns that are of type `timestamp` or `date`. Select the column containing the time periods for the time series.\n6. 
For multi-series forecasting, select the column(s) that identify the individual time series from the **Time series identifiers** drop-down. AutoML groups the data by these columns as different time series and trains a model for each series independently. If you leave this field blank, AutoML assumes that the dataset contains a single time series.\n7. In the **Forecast horizon and frequency** fields, specify the number of time periods into the future for which AutoML should calculate forecasted values. In the left box, enter the integer number of periods to forecast. In the right box, select the units. \nNote \nTo use Auto-ARIMA, the time series must have a regular frequency (that is, the interval between any two points must be the same throughout the time series). The frequency must match the frequency unit specified in the API call or in the AutoML UI. AutoML handles missing time steps by filling in those values with the previous value.\n8. In Databricks Runtime 11.3 LTS ML and above, you can save prediction results. To do so, specify a database in the **Output Database** field. Click **Browse** and select a database from the dialog. AutoML writes the prediction results to a table in this database.\n9. The **Experiment name** field shows the default name. To change it, type the new name in the field. 
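The Auto-ARIMA note in the steps above says that missing time steps are filled with the previous value. The following plain-Python sketch illustrates that idea for a daily series. It is a conceptual illustration under stated assumptions, not AutoML code: the function name `forward_fill_daily` and the sample `series` are hypothetical, and a daily frequency is assumed.

```python
from datetime import date, timedelta


def forward_fill_daily(observations):
    # observations: dict mapping datetime.date -> value, possibly with gaps.
    # Returns (date, value) pairs for every day from the earliest to the
    # latest date; a missing day reuses the most recent earlier value.
    days = sorted(observations)
    filled = []
    current = days[0]
    last_value = observations[current]
    while current <= days[-1]:
        last_value = observations.get(current, last_value)
        filled.append((current, last_value))
        current += timedelta(days=1)
    return filled


# Hypothetical daily series in which January 3 is missing.
series = {
    date(2024, 1, 1): 10.0,
    date(2024, 1, 2): 12.0,
    date(2024, 1, 4): 15.0,
}

for day, value in forward_fill_daily(series):
    print(day.isoformat(), value)
```

In this sketch, January 3 receives the value 12.0 carried over from January 2, so the resulting series has one point per day and therefore a regular frequency.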
\nYou can also: \n* Specify [additional configuration options](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html#advanced-config).\n* Use [existing feature tables in Feature Store to augment the original input dataset](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html#feature-store).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with the Databricks AutoML UI\n##### Use existing feature tables from Databricks Feature Store\n\nIn Databricks Runtime 11.3 LTS ML and above, you can use feature tables in Databricks Feature Store to expand the input training dataset for your classification and regression problems. \nIn Databricks Runtime 12.2 LTS ML and above, you can use feature tables in Databricks Feature Store to expand the input training dataset for all of your AutoML problems: classification, regression, and forecasting. \nTo create a feature table, see [Create a feature table in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html#create-feature-table) or [Create a feature table in Databricks Feature Store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html#create-feature-table). \nAfter you finish configuring your AutoML experiment, you can select a features table with the following steps: \n1. Click **Join features (optional)**. \n![Select Join features button](https:\/\/docs.databricks.com\/_images\/automl-join-features.png)\n2. On the **Join Additional Features** page, select a feature table in the **Feature Table** field.\n3. For each **Feature table primary key**, select the corresponding lookup key. The lookup key should be a column in the training dataset you provided for your AutoML experiment.\n4. 
For [time series feature tables](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html), select the corresponding timestamp lookup key. Similarly, the timestamp lookup key should be a column in the training dataset you provided for your AutoML experiment. \n![Select primary key and lookup tables](https:\/\/docs.databricks.com\/_images\/automl-feature-store-lookup-key.png)\n5. To add more feature tables, click **Add another Table** and repeat the above steps.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with the Databricks AutoML UI\n##### Advanced configurations\n\nOpen the **Advanced Configuration (optional)** section to access these parameters. \n* The evaluation metric is the [primary metric](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html#classification-regression) used to score the runs.\n* In Databricks Runtime 10.4 LTS ML and above, you can exclude training frameworks from consideration. By default, AutoML trains models using frameworks listed under [AutoML algorithms](https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html#automl-algorithm).\n* You can edit the stopping conditions. Default stopping conditions are: \n+ For forecasting experiments, stop after 120 minutes.\n+ In Databricks Runtime 10.4 LTS ML and below, for classification and regression experiments, stop after 60 minutes or after completing 200 trials, whichever happens first. 
For Databricks Runtime 11.0 ML and above, the number of trials is not used as a stopping condition.\n+ In Databricks Runtime 10.4 LTS ML and above, for classification and regression experiments, AutoML incorporates early stopping; it stops training and tuning models if the validation metric is no longer improving.\n* In Databricks Runtime 10.4 LTS ML and above, you can select a [time column](https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html#control-automl-split) to split the data for training, validation, and testing in chronological order (applies only to classification and regression).\n* Databricks recommends not populating the **Data directory** field. Doing so triggers the default behavior, which is to securely store the dataset as an MLflow artifact. A [DBFS](https:\/\/docs.databricks.com\/dbfs\/index.html) path can be specified, but in this case, the dataset does not inherit the AutoML experiment\u2019s access permissions.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with the Databricks AutoML UI\n##### Column selection\n\nNote \nThis functionality is only available for classification and regression problems. \nIn Databricks Runtime 10.3 ML and above, you can specify which columns AutoML should use for training. To exclude a column, uncheck it in the **Include** column. \nYou cannot drop the column selected as the prediction target or as the [time column](https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html#control-automl-split) to split the data. \nBy default, all columns are included.\n\n#### Train ML models with the Databricks AutoML UI\n##### Imputation of missing values\n\nIn Databricks Runtime 10.4 LTS ML and above, you can specify how null values are imputed. 
In the UI, select a method from the drop-down in the **Impute with** column in the table schema. \nBy default, AutoML selects an imputation method based on the column type and content. \nNote \nIf you specify a non-default imputation method, AutoML does not perform [semantic type detection](https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html#semantic-detection).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with the Databricks AutoML UI\n##### Run the experiment and monitor the results\n\nTo start the AutoML experiment, click **Start AutoML**. The experiment starts to run, and the AutoML training page appears. To refresh the runs table, click ![Refresh button](https:\/\/docs.databricks.com\/_images\/automl-refresh-button.png). \nFrom this page, you can: \n* Stop the experiment at any time.\n* Open the data exploration notebook.\n* Monitor runs.\n* Navigate to the run page for any run. \nWith Databricks Runtime 10.1 ML and above, AutoML displays warnings for potential issues with the dataset, such as unsupported column types or high cardinality columns. \nNote \nDatabricks does its best to indicate potential errors or issues. However, this may not be comprehensive and may not capture issues or errors for which you may be searching. Please make sure to conduct your own reviews as well. \nTo see any warnings for the dataset, click the **Warnings** tab on the training page, or on the experiment page after the experiment has completed. 
\n![AutoML warnings](https:\/\/docs.databricks.com\/_images\/automl-alerts.png) \nWhen the experiment completes, you can: \n* [Register and deploy](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html#register-deploy-automl-ui) one of the models with MLflow.\n* Select **View notebook for best model** to review and edit the notebook that created the best model.\n* Select **View data exploration notebook** to open the data exploration notebook.\n* Search, filter, and sort the runs in the runs table.\n* See details for any run: \n+ The generated notebook containing source code for a trial run can be found by clicking into the MLflow run. The notebook is saved in the **Artifacts** section of the run page. You can download this notebook and import it into the workspace, if downloading artifacts is enabled by your workspace administrators.\n+ To view results of the run, click in the **Models** column or the **Start Time** column. The run page appears showing information about the trial run (such as parameters, metrics, and tags) and artifacts created by the run, including the model. This page also includes code snippets that you can use to make predictions with the model. \nTo return to this AutoML experiment later, find it in the table on the [Experiments page](https:\/\/docs.databricks.com\/mlflow\/experiments.html). The results of each AutoML experiment, including the data exploration and training notebooks, are stored in a `databricks_automl` folder in the [home folder](https:\/\/docs.databricks.com\/workspace\/workspace-objects.html#home-folder) of the user who ran the experiment.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with the Databricks AutoML UI\n##### Register and deploy a model\n\nYou can register and deploy your model with the AutoML UI: \n1. 
Select the link in the **Models** column for the model to register. When a run completes, the best model (based on the primary metric) is the top row.\n2. Select ![register model button](https:\/\/docs.databricks.com\/_images\/register-model-button.png) to register the model in [Model Registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html).\n3. Select ![Models Icon](https:\/\/docs.databricks.com\/_images\/models-icon.png) **Models** in the sidebar to navigate to the Model Registry.\n4. Select the name of your model in the model table.\n5. From the registered model page, you can serve the model with [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html). \n### No module named \u2018pandas.core.indexes.numeric\u2019 \nWhen serving a model built using AutoML with Model Serving, you may get the error: `No module named 'pandas.core.indexes.numeric'`. \nThis is due to an incompatible `pandas` version between AutoML and the model serving endpoint environment. You can resolve this error by running the [add-pandas-dependency.py script](https:\/\/docs.databricks.com\/_extras\/documents\/add-pandas-dependency.py). The script edits the `requirements.txt` and `conda.yaml` for your logged model to include the appropriate `pandas` dependency version: `pandas==1.5.3` \n1. Modify the script to include the `run_id` of the MLflow run where your model was logged.\n2. Re-register the model to the MLflow model registry.\n3. 
Try serving the new version of the MLflow model.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html"} +{"content":"# Model serving with Databricks\n### Model Serving limits and regions\n\nThis article summarizes the limitations and region availability for Databricks Model Serving and supported endpoint types.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html"} +{"content":"# Model serving with Databricks\n### Model Serving limits and regions\n#### Limitations\n\nDatabricks Model Serving imposes default limits to ensure reliable performance. If you have feedback on these limits, please reach out to your Databricks account team. \nThe following table summarizes resource and payload limitations for model serving endpoints. \n| Feature | Granularity | Limit |\n| --- | --- | --- |\n| Payload size | Per request | 16 MB |\n| Queries per second (QPS) | Per workspace | 200, but can be increased to 25,000 or more by reaching out to your Databricks account team. |\n| Model execution duration | Per request | 120 seconds |\n| CPU endpoint model memory usage | Per endpoint | 4 GB |\n| GPU endpoint model memory usage | Per endpoint | Greater than or equal to assigned GPU memory, depends on the GPU workload size |\n| Provisioned concurrency | Per workspace | 200 concurrency. Can be increased by reaching out to your Databricks account team. |\n| Overhead latency | Per request | Less than 50 milliseconds |\n| Foundation Model APIs (pay-per-token) rate limits | Per workspace | Reach out to your Databricks account team to increase the following limits. * The DBRX Instruct model has a limit of 1 query per second. * Other chat and completion models have a default rate limit of 2 queries per second. * Embedding models have a default rate limit of 300 embedding inputs per second. 
|\n| Foundation Model APIs (provisioned throughput) rate limits | Per workspace | 200 | \nModel Serving endpoints are protected by [access control](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#serving-endpoints) and respect networking-related ingress rules configured on the workspace, like IP allowlists and [PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html). \nAdditional limitations exist: \n* If your workspace is deployed in a region that supports model serving but is served by a [control plane](https:\/\/docs.databricks.com\/getting-started\/overview.html#architecture) in an unsupported region, the workspace does not support model serving. If you attempt to use model serving in such a workspace, you will see an error message stating that your workspace is not supported. Reach out to your Databricks account team for more information.\n* Model Serving does not support init scripts.\n* By default, Model Serving does not support PrivateLink to external endpoints. Support for this functionality is evaluated and implemented on a per-region basis. Reach out to your Databricks account team for more information. \n### Foundation Model APIs limits \nNote \nAs part of providing the Foundation Model APIs, Databricks may process your data outside of the region and cloud provider where your data originated. \nThe following are limits relevant to Foundation Model APIs workloads: \n* **Provisioned throughput** supports the HIPAA compliance profile and should be used for workloads requiring compliance certifications. **Pay-per-token** workloads are **not** HIPAA or compliance security profile compliant.\n* For Foundation Model APIs endpoints, only workspace admins can change the governance settings, like the rate limits. To change rate limits, use the following steps: \n1. Open the Serving UI in your workspace to see your serving endpoints.\n2. 
From the kebab menu on the Foundation Model APIs endpoint you want to edit, select **View details**.\n3. From the kebab menu on the upper-right side of the endpoints details page, select **Change rate limit**.\n* To use the DBRX model architecture for a provisioned throughput workload, your serving endpoint must be in `us-east-1` or `us-west-2`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html"} +{"content":"# Model serving with Databricks\n### Model Serving limits and regions\n#### Region availability\n\nNote \nIf you require an endpoint in an unsupported region, reach out to your Databricks account team. \nFor provisioned throughput workloads that use DBRX models, see [Foundation Model APIs limits](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html#fmapi-limits) for region availability. \n| Region | Location | Core Model Serving capability \\* | Foundation Model APIs (provisioned throughput) \\*\\* | Foundation Model APIs (pay-per-token) | External models |\n| --- | --- | --- | --- | --- | --- |\n| `ap-northeast-1` | Asia Pacific (Tokyo) | X | X | | X |\n| `ap-northeast-2` | Asia Pacific (Seoul) | | | | |\n| `ap-south-1` | Asia Pacific (Mumbai) | X | X | | X |\n| `ap-southeast-1` | Asia Pacific (Singapore) | X | | | X |\n| `ap-southeast-2` | Asia Pacific (Sydney) | X | X | | X |\n| `ca-central-1` | Canada (Central) | X | X | | X |\n| `eu-central-1` | EU (Frankfurt) | X | X | | X |\n| `eu-west-1` | EU (Ireland) | X | X | | X |\n| `eu-west-2` | EU (London) | | | | |\n| `eu-west-3` | EU (Paris) | | | | |\n| `sa-east-1` | South America (Sao Paulo) | | | | |\n| `us-west-1` | US West (Northern California) | | | X | X |\n| `us-west-2` | US West (Oregon) | X | X | X | X |\n| `us-east-1` | US East (Northern Virginia) | X | X | X | X |\n| `us-east-2` | US East (Ohio) | X | X | X | X | \n\\* only CPU compute \n\\*\\* includes GPU 
support\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Snowflake\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes how to set up Lakehouse Federation to run federated queries on Snowflake data that is not managed by Databricks. To learn more about Lakehouse Federation, see [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). \nTo connect to your Snowflake database using Lakehouse Federation, you must create the following in your Databricks Unity Catalog metastore: \n* A *connection* to your Snowflake database.\n* A *foreign catalog* that mirrors your Snowflake database in Unity Catalog so that you can use Unity Catalog query syntax and data governance tools to manage Databricks user access to the database.\n\n#### Run federated queries on Snowflake\n##### Before you begin\n\nWorkspace requirements: \n* Workspace enabled for Unity Catalog. \nCompute requirements: \n* Network connectivity from your Databricks Runtime cluster or SQL warehouse to the target database systems. See [Networking recommendations for Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/networking.html).\n* Databricks clusters must use Databricks Runtime 13.3 LTS or above and shared or single-user access mode.\n* SQL warehouses must be Pro or Serverless. \nPermissions required: \n* To create a connection, you must be a metastore admin or a user with the `CREATE CONNECTION` privilege on the Unity Catalog metastore attached to the workspace.\n* To create a foreign catalog, you must have the `CREATE CATALOG` permission on the metastore and be either the owner of the connection or have the `CREATE FOREIGN CATALOG` privilege on the connection. 
\nAdditional permission requirements are specified in each task-based section that follows. \n* If you plan to authenticate using single sign-on (SSO), create a security integration in the Snowflake console. See the following section for details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/snowflake.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Snowflake\n##### (Optional) Create a security integration in the Snowflake console\n\nIf you want to authenticate using SSO, follow this step before you create a Snowflake connection. To authenticate using a username and password instead, skip this section. \nNote \nOnly Snowflake\u2019s native OAuth integration is supported. External OAuth integrations like Okta or Microsoft Entra ID are not supported. \nIn the Snowflake console, run `CREATE SECURITY INTEGRATION`. Replace the following values: \n* `<integration-name>`: A unique name for your OAuth integration.\n* `<workspace-url>`: A Databricks workspace URL. You must set `OAUTH_REDIRECT_URI` to `https:\/\/<workspace-url>\/login\/oauth\/snowflake.html`, where `<workspace-url>` is the unique URL of the Databricks workspace where you will create the Snowflake connection.\n* `<duration-in-seconds>`: A time length for refresh tokens. \nImportant \n`OAUTH_REFRESH_TOKEN_VALIDITY` is a custom field that is set to 90 days by default. After the refresh token expires, you must re-authenticate the connection. Set the field to a reasonable time length. 
\n```\nCREATE SECURITY INTEGRATION <integration-name>\nTYPE = oauth\nENABLED = true\nOAUTH_CLIENT = custom\nOAUTH_CLIENT_TYPE = 'CONFIDENTIAL'\nOAUTH_REDIRECT_URI = 'https:\/\/<workspace-url>\/login\/oauth\/snowflake.html'\nOAUTH_ISSUE_REFRESH_TOKENS = TRUE\nOAUTH_REFRESH_TOKEN_VALIDITY = <duration-in-seconds>\nOAUTH_ENFORCE_PKCE = TRUE;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/snowflake.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Snowflake\n##### Create a connection\n\nA connection specifies a path and credentials for accessing an external database system. To create a connection, you can use Catalog Explorer or the `CREATE CONNECTION` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** Metastore admin or user with the `CREATE CONNECTION` privilege. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **External Data** menu and select **Connections**.\n3. Click **Create connection**.\n4. Enter a user-friendly **Connection name**.\n5. Select a **Connection type** of **Snowflake**.\n6. Enter the following connection properties for your Snowflake warehouse. 
\n* **Auth type**: `OAuth` or `Username and password`\n* **Host**: For example, `snowflake-demo.east-us-2.azure.snowflakecomputing.com`\n* **Port**: For example, `443`\n* **Snowflake warehouse**: For example, `my-snowflake-warehouse`\n* **User**: For example, `snowflake-user`\n* (OAuth) **Client ID**: In the Snowflake console, run `SELECT SYSTEM$SHOW_OAUTH_CLIENT_SECRETS('<security_integration_name>')` to retrieve the client ID for your security integration.\n* (OAuth) **Client secret**: In the Snowflake console, run `SELECT SYSTEM$SHOW_OAUTH_CLIENT_SECRETS('<security_integration_name>')` to retrieve the client secret for your security integration.\n* (OAuth) **Client scope**: `refresh_token session:role:<role-name>`. Specify the Snowflake role to use in `<role-name>`.\n* (Username and password) **Password**: For example, `password123`\n* (OAuth) You are prompted to sign in to Snowflake using your SSO credentials.\n7. (Optional) Click **Test connection** to confirm that it works.\n8. (Optional) Add a comment.\n9. Click **Create**. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nCREATE CONNECTION <connection-name> TYPE snowflake\nOPTIONS (\nhost '<hostname>',\nport '<port>',\nsfWarehouse '<warehouse-name>',\nuser '<user>',\npassword '<password>'\n);\n\n``` \nWe recommend that you use Databricks [secrets](https:\/\/docs.databricks.com\/security\/secrets\/index.html) instead of plaintext strings for sensitive values like credentials. 
For example: \n```\nCREATE CONNECTION <connection-name> TYPE snowflake\nOPTIONS (\nhost '<hostname>',\nport '<port>',\nsfWarehouse '<warehouse-name>',\nuser secret ('<secret-scope>','<secret-key-user>'),\npassword secret ('<secret-scope>','<secret-key-password>')\n)\n\n``` \nFor information about setting up secrets, see [Secret management](https:\/\/docs.databricks.com\/security\/secrets\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/snowflake.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Snowflake\n##### Create a foreign catalog\n\nA foreign catalog mirrors a database in an external data system so that you can query and manage access to data in that database using Databricks and Unity Catalog. To create a foreign catalog, you use a connection to the data source that has already been defined. \nTo create a foreign catalog, you can use Catalog Explorer or the `CREATE FOREIGN CATALOG` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** `CREATE CATALOG` permission on the metastore and either ownership of the connection or the `CREATE FOREIGN CATALOG` privilege on the connection. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Click the **Create Catalog** button.\n3. On the **Create a new catalog** dialog, enter a name for the catalog and select a **Type** of **Foreign**.\n4. Select the **Connection** that provides access to the database that you want to mirror as a Unity Catalog catalog.\n5. Enter the name of the **Database** that you want to mirror as a catalog.\n6. Click **Create.** \nRun the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional. 
Replace the placeholder values: \n* `<catalog-name>`: Name for the catalog in Databricks.\n* `<connection-name>`: The [connection object](https:\/\/docs.databricks.com\/query-federation\/index.html#connection) that specifies the data source, path, and access credentials.\n* `<database-name>`: Name of the database you want to mirror as a catalog in Databricks. \n```\nCREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>\nOPTIONS (database '<database-name>');\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/snowflake.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Snowflake\n##### Supported pushdowns\n\nThe following pushdowns are supported: \n* Filters\n* Projections\n* Limit\n* Joins\n* Aggregates (Average, Corr, CovPopulation, CovSample, Count, Max, Min, StddevPop, StddevSamp, Sum, VariancePop, VarianceSamp)\n* Functions (String functions, Mathematical functions, Date, Time, and Timestamp functions, and other miscellaneous functions, such as Alias, Cast, SortOrder)\n* Window functions (DenseRank, Rank, RowNumber)\n* Sorting\n\n#### Run federated queries on Snowflake\n##### Data type mappings\n\nWhen you read from Snowflake to Spark, data types map as follows: \n| Snowflake type | Spark type |\n| --- | --- |\n| decimal, number, numeric | DecimalType |\n| bigint, byteint, int, integer, smallint, tinyint | IntegerType |\n| float, float4, float8 | FloatType |\n| double, double precision, real | DoubleType |\n| char, character, string, text, time, varchar | StringType |\n| binary | BinaryType |\n| boolean | BooleanType |\n| date | DateType |\n| datetime, timestamp, timestamp\\_ltz, timestamp\\_ntz, timestamp\\_tz | TimestampType |\n\n#### Run federated queries on Snowflake\n##### OAuth limitations\n\nThe following are OAuth support limitations: \n* The Snowflake OAuth endpoint must be accessible from Databricks control plane IPs. 
See [Outbound from Databricks control plane](https:\/\/docs.databricks.com\/resources\/supported-regions.html#outbound). Snowflake supports configuring network policies at the security integration level, which allows for a separate network policy that enables direct connectivity from the Databricks control plane to the OAuth endpoint for authorization.\n* **Use Proxy**, **Proxy host**, **Proxy port**, and Snowflake role configuration options are not supported. Specify **Snowflake role** as part of the OAuth scope.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/snowflake.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Snowflake\n##### Additional resources\n\n* [Configure Snowflake OAuth for custom clients](https:\/\/docs.snowflake.com\/en\/user-guide\/oauth-custom) in the Snowflake documentation\n* [SQL reference: CREATE SECURITY INTEGRATION (Snowflake OAuth)](https:\/\/docs.snowflake.com\/en\/sql-reference\/sql\/create-security-integration-oauth-snowflake) in the Snowflake documentation\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/snowflake.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n#### Customer-managed keys for encryption\n\nThis article provides an overview of customer-managed keys for encryption. \nNote \nThis feature requires the Enterprise pricing tier. \nTo configure customer-managed keys for encryption, see [Configure customer-managed keys for encryption](https:\/\/docs.databricks.com\/security\/keys\/configure-customer-managed-keys.html).\n\n#### Customer-managed keys for encryption\n##### Customer-managed keys for encryption overview\n\nSome services and data support adding a customer-managed key to help protect and control access to encrypted data. You can use the key management service in your cloud to maintain a customer-managed encryption key. 
\nDatabricks has two customer-managed key use cases that involve different types of data and locations: \n* **Managed services**: Data in the [Databricks control plane](https:\/\/docs.databricks.com\/getting-started\/overview.html) (notebooks, secrets, and Databricks SQL query data).\n* **Workspace storage**: Your workspace storage bucket (which contains DBFS root) and the EBS volumes of compute resources in the classic compute plane. \nUnity Catalog also supports the ability to read from and write to S3 buckets with KMS encryption enabled. See [Create a storage credential for connecting to AWS S3](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html) \nTo configure customer-managed keys for workspace storage, see [Configure customer-managed keys for encryption](https:\/\/docs.databricks.com\/security\/keys\/configure-customer-managed-keys.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n#### Customer-managed keys for encryption\n##### Customer-managed keys for managed services\n\nManaged services data in the Databricks [control plane](https:\/\/docs.databricks.com\/getting-started\/overview.html) is encrypted at rest. You can add a customer-managed key for managed services to help protect and control access to the following types of encrypted data: \n* Notebook source in the Databricks [control plane](https:\/\/docs.databricks.com\/getting-started\/overview.html).\n* Notebook results for notebooks run interactively (not as jobs) that are stored in the control plane. By default, larger results are also stored in your workspace root bucket. 
You can configure Databricks to [store all interactive notebook results in your cloud account](https:\/\/docs.databricks.com\/admin\/workspace-settings\/storage.html).\n* Secrets stored in [Databricks secrets](https:\/\/docs.databricks.com\/security\/secrets\/index.html).\n* Databricks SQL [queries and query history](https:\/\/docs.databricks.com\/security\/keys\/sql-encryption.html).\n* Personal access tokens (PAT) or other credentials used to [set up Git integration with Databricks Git folders](https:\/\/docs.databricks.com\/repos\/repos-setup.html).\n* [Vector Search indexes and metadata](https:\/\/docs.databricks.com\/generative-ai\/vector-search.html). \nTo configure customer-managed keys for managed services, see [Configure customer-managed keys for encryption](https:\/\/docs.databricks.com\/security\/keys\/configure-customer-managed-keys.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n#### Customer-managed keys for encryption\n##### Customer-managed keys for workspace storage\n\nYou can add a customer-managed key for workspace storage to protect and control access to the following types of encrypted data: \n* **Your workspace storage bucket**: If you add a workspace storage encryption key, Databricks encrypts the data on the Amazon S3 bucket in your AWS account that you specified when you set up your workspace, which is known as the workspace storage bucket. This bucket contains [DBFS root](https:\/\/docs.databricks.com\/dbfs\/dbfs-root.html), which includes the FileStore area, MLflow Models, and Delta Live Table data in your DBFS root (not DBFS mounts). The bucket also includes workspace system data, which includes job results, Databricks SQL results, notebook revisions, and other workspace data. 
For more information, see [Create an S3 bucket for workspace deployment](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/storage.html).\n* **Your cluster\u2019s EBS volumes (optional)**: For Databricks Runtime cluster nodes and other compute resources in the [classic compute plane](https:\/\/docs.databricks.com\/getting-started\/overview.html), you can optionally use the key to encrypt the VM\u2019s remote EBS volumes. \nNote \nThis feature affects your [DBFS root](https:\/\/docs.databricks.com\/dbfs\/dbfs-root.html) but is not used for encrypting data on any additional DBFS mounts. For S3 DBFS mounts, you can use other approaches to writing encrypted data with your keys. For more information, see [Encrypt data in S3 buckets](https:\/\/docs.databricks.com\/dbfs\/mounts.html#s3-encryption). Mounts are a legacy access pattern. Databricks recommends using Unity Catalog for managing all data access. See [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n#### Customer-managed keys for encryption\n##### Compare customer-managed keys use cases\n\nThe following table lists which customer-managed key features are used for which types of data. 
\n| Type of data | Location | Which customer-managed key feature to use |\n| --- | --- | --- |\n| Notebook source and metadata | Control plane | Managed services |\n| Personal access tokens (PAT) or other credentials used for [Git integration with Databricks Git folders](https:\/\/docs.databricks.com\/repos\/repos-setup.html) | Control plane | Managed services |\n| Secrets stored by the [secret manager APIs](https:\/\/docs.databricks.com\/security\/secrets\/index.html) | Control plane | Managed services |\n| [Databricks SQL](https:\/\/docs.databricks.com\/sql\/index.html) queries and query history | Control plane | Managed services |\n| [Vector Search indexes and metadata](https:\/\/docs.databricks.com\/generative-ai\/vector-search.html) | Control plane | Managed services |\n| The [remote EBS volumes](https:\/\/aws.amazon.com\/ebs\/) for Databricks Runtime cluster nodes and other compute resources. | [Classic compute plane in your AWS account](https:\/\/docs.databricks.com\/getting-started\/overview.html). Customer-managed keys for remote EBS volumes apply only to compute resources in the classic compute plane in your AWS account. See [Serverless compute and customer-managed keys](https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html#serverless). | Workspace storage |\n| [Customer-accessible DBFS root data](https:\/\/docs.databricks.com\/dbfs\/dbfs-root.html) | [DBFS root](https:\/\/docs.databricks.com\/dbfs\/dbfs-root.html) in your workspace storage bucket in your AWS account. This also includes the FileStore area. 
| Workspace storage |\n| [Job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) results | Workspace storage bucket in your AWS account | Workspace storage |\n| [Databricks SQL](https:\/\/docs.databricks.com\/sql\/index.html) query results | Workspace storage bucket in your AWS account | Workspace storage |\n| [MLflow Models](https:\/\/docs.databricks.com\/mlflow\/models.html) | Workspace storage bucket in your AWS account | Workspace storage |\n| [Delta Live Table](https:\/\/docs.databricks.com\/delta-live-tables\/index.html) | If you use a DBFS path in your DBFS root, this is stored in the workspace storage bucket in your AWS account. This does not apply to [DBFS paths that represent mount points](https:\/\/docs.databricks.com\/dbfs\/mounts.html) to other data sources. | Workspace storage |\n| [Interactive notebook results](https:\/\/docs.databricks.com\/admin\/workspace-settings\/notebook-results.html) | By default, when you run a notebook interactively (rather than as a job) results are stored in the control plane for performance with some large results stored in your workspace storage bucket in your AWS account. You can choose to configure Databricks to [store all interactive notebook results in your AWS account](https:\/\/docs.databricks.com\/admin\/workspace-settings\/notebook-results.html#configure-the-storage-location-for-all-interactive-notebook-results). | For partial results in the control plane, use a customer-managed key for managed services. For results in the workspace storage bucket, which you can [configure](https:\/\/docs.databricks.com\/admin\/workspace-settings\/notebook-results.html#configure-the-storage-location-for-all-interactive-notebook-results) for all result storage, use a customer-managed key for workspace storage. 
|\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n#### Customer-managed keys for encryption\n##### Serverless compute and customer-managed keys\n\n[Databricks SQL Serverless](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html) supports: \n* Customer-managed keys for managed services for Databricks SQL queries and query history.\n* Customer-managed keys for your workspace storage bucket including DBFS root storage for Databricks SQL results. \nServerless SQL warehouses do not use customer-managed keys for EBS storage encryption on compute nodes, which is an optional part of configuring customer-managed keys for workspace storage. Disks for serverless compute resources are short-lived and tied to the lifecycle of the serverless workload. When compute resources are stopped or scaled down, the VMs and their storage are destroyed. \n### Model Serving \nResources for [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html), a serverless compute feature, are generally in two categories: \n* Resources that you create for the model are stored in your workspace\u2019s DBFS root in your workspace\u2019s S3 bucket. This includes the model\u2019s artifacts and version metadata. Both the workspace model registry and MLflow use this storage. You can configure this storage to use customer-managed keys.\n* Resources that Databricks creates directly on your behalf include the model image and ephemeral serverless compute storage. These are encrypted with Databricks-managed keys and do not support customer-managed keys. \nCustomer-managed keys for EBS storage, which is an optional part of the customer-managed workspace storage feature, does *not* apply to serverless compute resources. Disks for serverless compute resources are short-lived and tied to the lifecycle of the serverless workload. 
When compute resources are stopped or scaled down, the VMs and their storage are destroyed.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### `horovod.spark`: distributed deep learning with Horovod\n\nLearn how to use the `horovod.spark` package to perform distributed training of machine learning models.\n\n##### `horovod.spark`: distributed deep learning with Horovod\n###### `horovod.spark` on Databricks\n\nDatabricks supports the `horovod.spark` package, which provides an estimator API that you can use in ML pipelines with Keras and PyTorch. For details, see [Horovod on Spark](https:\/\/github.com\/horovod\/horovod\/blob\/master\/docs\/spark.rst), which includes a section on [Horovod on Databricks](https:\/\/github.com\/horovod\/horovod\/blob\/master\/docs\/spark.rst#horovod-on-databricks). \nNote \n* Databricks installs the `horovod` package with dependencies. If you upgrade or downgrade these dependencies, there might be compatibility issues.\n* When using `horovod.spark` with custom callbacks in Keras, you must save models in the TensorFlow SavedModel format. \n+ With TensorFlow 2.x, use the `.tf` suffix in the file name.\n+ With TensorFlow 1.x, set the option `save_weights_only=True`.\n\n##### `horovod.spark`: distributed deep learning with Horovod\n###### Requirements\n\nDatabricks Runtime ML 7.4 or above. \nNote \n`horovod.spark` does not support pyarrow versions 11.0 and above (see relevant [GitHub Issue](https:\/\/github.com\/horovod\/horovod\/issues\/3829)). Databricks Runtime 15.0 ML includes pyarrow version 14.0.1. 
To use `horovod.spark` with Databricks Runtime 15.0 ML or above, you must manually install pyarrow, specifying a version below 11.0.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-spark.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### `horovod.spark`: distributed deep learning with Horovod\n###### Example: Distributed training function\n\nHere is a basic example to run a distributed training function using `horovod.spark`: \n```\ndef train():\n    # Runs inside each Horovod process.\n    import horovod.tensorflow as hvd\n    hvd.init()\n\nimport horovod.spark\n# Run train() in 2 Horovod processes on Spark.\nhorovod.spark.run(train, num_proc=2)\n\n```\n\n##### `horovod.spark`: distributed deep learning with Horovod\n###### Example notebooks: Horovod Spark estimators using Keras and PyTorch\n\nThe following notebooks demonstrate how to use the Horovod Spark Estimator API with Keras and PyTorch. \n### Horovod Spark Estimator Keras notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/horovod-spark-estimator-keras.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n### Horovod Spark Estimator PyTorch notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/horovod-spark-estimator-pytorch.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-spark.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### What is predictive I\/O?\n\nPredictive I\/O is a collection of Databricks optimizations that improve performance for data interactions.
Predictive I\/O capabilities are grouped into the following categories: \n* Accelerated reads reduce the time it takes to scan and read data.\n* Accelerated updates reduce the amount of data that needs to be rewritten during updates, deletes, and merges. \nPredictive I\/O is exclusive to the Photon engine on Databricks.\n\n#### What is predictive I\/O?\n##### Use predictive I\/O to accelerate reads\n\nPredictive I\/O is used to accelerate data scanning and filtering performance for all operations on supported compute types. \nImportant \nPredictive I\/O reads are supported by the serverless and pro types of SQL warehouses, and Photon-accelerated clusters running Databricks Runtime 11.3 LTS and above. \nPredictive I\/O improves scanning performance by applying deep learning techniques to do the following: \n* Determine the most efficient access pattern to read the data and scan only the data that is actually needed.\n* Eliminate the decoding of columns and rows that are not required to generate query results.\n* Calculate the probabilities of the search criteria in selective queries matching a row. As queries run, we use these probabilities to anticipate where the next matching row would occur and only read that data from cloud storage.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/predictive-io.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### What is predictive I\/O?\n##### Use predictive I\/O to accelerate updates\n\nPredictive I\/O for updates is used automatically for all tables that have deletion vectors enabled using the following Photon-enabled compute types: \n* Serverless SQL warehouses.\n* Pro SQL warehouses.\n* Clusters running Databricks Runtime 14.0 and above. \nNote \nSupport for predictive I\/O for updates is present in Databricks Runtime 12.2 LTS and above, but Databricks recommends using 14.0 and above for best performance.
\nSee [What are deletion vectors?](https:\/\/docs.databricks.com\/delta\/deletion-vectors.html). \nImportant \nA workspace admin setting controls whether deletion vectors are auto-enabled for new Delta tables. See [Auto-enable deletion vectors](https:\/\/docs.databricks.com\/admin\/workspace-settings\/deletion-vectors.html). \nYou enable support for deletion vectors on a Delta Lake table by setting a Delta Lake table property. You enable deletion vectors during table creation or alter an existing table, as in the following examples: \n```\nCREATE TABLE <table-name> [options] TBLPROPERTIES ('delta.enableDeletionVectors' = true);\n\nALTER TABLE <table-name> SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);\n\n``` \nWarning \nWhen you enable deletion vectors, the table protocol version is upgraded. After upgrading, the table will not be readable by Delta Lake clients that do not support deletion vectors. See [How does Databricks manage Delta Lake feature compatibility?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html). \nFor a list of clients that support deletion vectors, see [Compatibility with Delta clients](https:\/\/docs.databricks.com\/delta\/deletion-vectors.html#compatibility). \nIn Databricks Runtime 14.1 and above, you can drop the deletion vectors table feature to enable compatibility with other Delta clients. See [Drop Delta table features](https:\/\/docs.databricks.com\/delta\/drop-feature.html). \nPredictive I\/O leverages deletion vectors to accelerate updates by reducing the frequency of full file rewrites during data modification on Delta tables. Predictive I\/O optimizes `DELETE`, `MERGE`, and `UPDATE` operations. \nRather than rewriting all records in a data file when any record is updated or deleted, predictive I\/O uses deletion vectors to indicate records have been removed from the target data files. Supplemental data files are used to indicate updates. 
\nSubsequent reads on the table resolve current table state by applying the noted changes to the most recent table version. \nImportant \nPredictive I\/O updates share all limitations with deletion vectors. In Databricks Runtime 12.2 LTS and greater, the following limitations exist: \n* Delta Sharing is not supported on tables with deletion vectors enabled.\n* You cannot generate a manifest file for a table with deletion vectors present. Run `REORG TABLE ... APPLY (PURGE)` and ensure no concurrent write operations are running in order to generate a manifest.\n* You cannot incrementally generate manifest files for a table with deletion vectors enabled.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/predictive-io.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n##### Configure Structured Streaming batch size on Databricks\n\nLimiting the input rate for Structured Streaming queries helps to maintain a consistent batch size and prevents large batches from leading to spill and cascading micro-batch processing delays. \nDatabricks provides the same options to control Structured Streaming batch sizes for both Delta Lake and Auto Loader.\n\n##### Configure Structured Streaming batch size on Databricks\n###### Limit input rate with maxFilesPerTrigger\n\nSetting `maxFilesPerTrigger` (or `cloudFiles.maxFilesPerTrigger` for Auto Loader) specifies an upper-bound for the number of files processed in each micro-batch. For both Delta Lake and Auto Loader the default is 1000. (Note that this option is also present in Apache Spark for other file sources, where there is no max by default.)\n\n##### Configure Structured Streaming batch size on Databricks\n###### Limit input rate with maxBytesPerTrigger\n\nSetting `maxBytesPerTrigger` (or `cloudFiles.maxBytesPerTrigger` for Auto Loader) sets a \u201csoft max\u201d for the amount of data processed in each micro-batch. 
This means that a batch processes approximately this amount of data and may process more than the limit in order to make the streaming query move forward in cases when the smallest input unit is larger than this limit. There is no default for this setting. \nFor example, if you specify a byte string such as `10g` to limit each microbatch to 10 GB of data and you have files that are 3 GB each, Databricks processes 12 GB in a microbatch.\n\n##### Configure Structured Streaming batch size on Databricks\n###### Setting multiple input rates together\n\nIf you use `maxBytesPerTrigger` in conjunction with `maxFilesPerTrigger`, the micro-batch processes data until reaching the lower limit of either `maxFilesPerTrigger` or `maxBytesPerTrigger`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/batch-size.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n##### Configure Structured Streaming batch size on Databricks\n###### Limiting input rates for other Structured Streaming sources\n\nStreaming sources such as Apache Kafka each have custom input limits, such as `maxOffsetsPerTrigger`. For more details, see [Configure streaming data sources](https:\/\/docs.databricks.com\/connect\/streaming\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/batch-size.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Manage notebooks\n\nYou can manage notebooks using the UI, the CLI, and the Workspace API. This article focuses on performing notebook tasks using the UI. 
For the other methods, see [What is the Databricks CLI?](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) and [the Workspace API reference](https:\/\/docs.databricks.com\/api\/workspace\/introduction).\n\n#### Manage notebooks\n##### Create a notebook\n\n### Use the New button in the workspace sidebar \nTo create a new notebook in your default folder, click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar and select **Notebook** from the menu. \nDatabricks creates and opens a new, blank notebook in your default folder. The default language is the language you most recently used, and the notebook is automatically attached to the compute resource that you most recently used. \n### Create a notebook in any folder \nYou can create a new notebook in any folder (for example, in the **Shared** folder) following these steps: \n1. In the sidebar, click ![Workspace Icon](https:\/\/docs.databricks.com\/_images\/workspace-icon.png) **Workspace**.\n2. Right-click on the name of any folder and select **Create > Notebook**. A blank notebook opens in the workspace.\n\n#### Manage notebooks\n##### Open a notebook\n\nIn your workspace, click a ![Notebook Icon](https:\/\/docs.databricks.com\/_images\/notebook.png). 
The notebook path displays when you hover over the notebook title.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Manage notebooks\n##### Delete a notebook\n\nSee [Folders](https:\/\/docs.databricks.com\/workspace\/workspace-objects.html#folders) and [Workspace object operations](https:\/\/docs.databricks.com\/workspace\/workspace-objects.html#objects) for information about how to access the workspace menu and delete notebooks or other items in the workspace.\n\n#### Manage notebooks\n##### Copy notebook path or URL\n\nTo get the notebook file path or URL without opening the notebook, right-click the notebook name and select **Copy > Path** or **Copy > URL**.\n\n#### Manage notebooks\n##### Rename a notebook\n\nTo change the title of an open notebook, click the title and edit inline or click **File > Rename**.\n\n#### Manage notebooks\n##### Control access to a notebook\n\nIf your Databricks account has the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons), you can use [Workspace access control](https:\/\/docs.databricks.com\/notebooks\/notebooks-collaborate.html) to control who has access to a notebook.\n\n#### Manage notebooks\n##### Configure editor settings\n\nTo configure editor settings: \n1. Click your username at the top right of the workspace and select **Settings** from the drop down.\n2. In the **Settings** sidebar, select **Developer**.\n\n#### Manage notebooks\n##### View notebooks attached to a cluster\n\nThe **Notebooks** tab on the cluster details page displays notebooks that have recently been attached to a cluster. The tab also displays the status of the notebook, along with the last time a command was run from the notebook. 
\n![Cluster details attached notebooks](https:\/\/docs.databricks.com\/_images\/notebooks.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html"} +{"content":"# \n### RAG Studio region availability\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html)\n\n### RAG Studio region availability\n#### RAG Studio regions\n\nRAG Studio is available in the following regions: \n* `aws-ap-southeast-2`\n* `aws-us-east-1`\n* `aws-us-west-2`\n* `aws-eu-west-1`\n* `aws-us-east-2`\n* `aws-eu-central-1`\n* `aws-ap-southeast-1`\n* `aws-ca-central-1`\n* `azure-eastus2`\n* `azure-eastus`\n* `azure-westus`\n* `azure-westeurope`\n* `azure-northeurope`\n* `azure-centralus`\n* `azure-northcentralus`\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/regions.html"} +{"content":"# \n### RAG Studio region availability\n#### Integrations availability\n\n### Vector Search \n* `aws-ap-southeast-2`\n* `aws-us-east-1`\n* `aws-us-west-2`\n* `aws-eu-west-1`\n* `aws-us-east-2`\n* `azure-eastus2`\n* `azure-eastus`\n* `azure-westus`\n* `azure-westeurope`\n* `azure-northeurope`\n* `azure-centralus` \n### Foundational Model APIs - [pay-per-token](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#what-are-databricks-foundation-model-apis) \n* `aws-us-east-1`\n* `aws-us-west-2`\n* `aws-us-east-2`\n* `azure-eastus2`\n* `azure-eastus`\n* `azure-westus`\n* `azure-centralus`\n* `azure-northcentralus` \n### Foundational Model APIs - [provisioned throughput](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#what-are-databricks-foundation-model-apis) \n* `aws-ap-southeast-2`\n* `aws-us-east-1`\n* `aws-us-west-2`\n* `aws-eu-west-1`\n* `aws-us-east-2`\n* `azure-eastus2`\n* 
`azure-eastus`\n* `azure-westus`\n* `azure-westeurope`\n* `azure-northeurope`\n* `azure-centralus` \n### [External Models e.g., OpenAI](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html) \n* `aws-ap-southeast-2`\n* `aws-us-east-1`\n* `aws-us-west-2`\n* `aws-eu-west-1`\n* `aws-us-east-2`\n* `aws-eu-central-1`\n* `aws-ap-southeast-1`\n* `aws-ca-central-1`\n* `azure-eastus2`\n* `azure-eastus`\n* `azure-westus`\n* `azure-westeurope`\n* `azure-northeurope`\n* `azure-centralus`\n* `azure-northcentralus`\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/regions.html"} +{"content":"# Security and compliance guide\n## Networking\n### Users to Databricks networking\n##### Configure private connectivity to Databricks\n\nThis article describes private connectivity between users and their Databricks workspaces. For information on how to configure private connectivity from the control plane to the classic compute plane, see [Classic compute plane networking](https:\/\/docs.databricks.com\/security\/network\/classic\/index.html).\n\n##### Configure private connectivity to Databricks\n###### Private connectivity to Databricks overview\n\nAWS PrivateLink provides private connectivity from AWS VPCs and on-premises networks to AWS services without exposing the traffic to the public network. Databricks supports using PrivateLink to allow users and applications to connect to Databricks over a VPC interface endpoint. This connection is supported when connecting to the web application, REST API, and the Databricks Connect API. \nYou can optionally mandate private connectivity for the workspace, which means Databricks rejects any connections over the public network. You must configure private connectivity from users to Databricks and from the control plane to the compute plane in order to mandate private connectivity for a workspace. \nYou can enable PrivateLink while creating a workspace or on an existing workspace.
To enable private connectivity to Databricks, see [Enable AWS PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/front-end\/front-end-private-connect.html"} +{"content":"# AI and Machine Learning on Databricks\n### Step-by-step: AI and Machine Learning on Databricks\n\nThis article guides you through articles that help you learn how to build AI and LLM solutions natively on Databricks. Topics include key steps of the end-to-end AI lifecycle, from data preparation and model building to deployment, monitoring and MLOps.\n\n### Step-by-step: AI and Machine Learning on Databricks\n#### Prepare your data for model training\n\nLearn how to load and process your data for AI workloads, including data preparation for fine-tuning LLMs.\n[How to prepare your data for model training](https:\/\/docs.databricks.com\/machine-learning\/data-preparation.html)\n\n### Step-by-step: AI and Machine Learning on Databricks\n#### Feature engineering\n\nWith feature engineering available in Unity Catalog, learn how to create feature tables, track the lineage of features and discover features that others have already built. \n[Feature engineering in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html)\n\n### Step-by-step: AI and Machine Learning on Databricks\n#### Train and register models\n\nLearn how to use AutoML for efficient training and tuning of your ML models, and MLflow for experiment tracking. 
\n* [Train models with AutoML](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html)\n* [Register models to Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html)\n\n### Step-by-step: AI and Machine Learning on Databricks\n#### Production real-time or batch serving\n\nGet started with using model serving for real-time workloads or deploy MLflow models for offline inference. \n* [Model serving with Databricks](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html)\n* [Deploy models for batch inference](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ml-and-ai-index.html"} +{"content":"# AI and Machine Learning on Databricks\n### Step-by-step: AI and Machine Learning on Databricks\n#### Self-hosting large language models (LLMs)\n\nLearn how to securely and cost-effectively host open source LLMs within your Databricks environment. \n* [GPU model serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html#gpu)\n\n### Step-by-step: AI and Machine Learning on Databricks\n#### Monitor deployed models\n\nLearn how to monitor your AI models in production. Continuously capture and log Model Serving endpoint inputs and predictions into a Delta Table using Inference Tables, ensuring you stay on top of model performance metrics. Lakehouse Monitoring also lets you know if you meet desired benchmarks. \n* [Inference Tables](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html)\n\n### Step-by-step: AI and Machine Learning on Databricks\n#### Bundle assets for programmatic deployment\n\nLearn how to use Databricks Asset Bundles for efficient packaging and deployment of all data and AI assets.
\n[Databricks Asset Bundles](https:\/\/docs.databricks.com\/dev-tools\/bundles\/index.html)\n\n### Step-by-step: AI and Machine Learning on Databricks\n#### End-to-end MLOps\n\nSee how you can use Databricks to combine DataOps, ModelOps and DevOps for end-to-end ML and LLM operations for your AI application. \n[MLOps on Databricks](https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ml-and-ai-index.html"} +{"content":"# AI and Machine Learning on Databricks\n### Step-by-step: AI and Machine Learning on Databricks\n#### Build LLM-powered RAG solutions\n\nLearn how to create LLM-powered applications leveraging your data. Use RAG (retrieval augmented generation) with LLMs to build Q&A chatbots that provide more accurate answers. \n[\u201cRAG\u201d Workflow (Vector Search + Model Serving)](https:\/\/www.databricks.com\/resources\/demos\/tutorials\/data-science-and-ai\/lakehouse-ai-deploy-your-llm-chatbot?itm_data=demo_center)\n\n### Step-by-step: AI and Machine Learning on Databricks\n#### Additional resources\n\nIf the outlined steps above don\u2019t cater to your needs, a wealth of information is available in the [Machine Learning documentation](https:\/\/docs.databricks.com\/machine-learning\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ml-and-ai-index.html"} +{"content":"# What is Delta Lake?\n### Use UniForm to read Delta Tables with Iceberg clients\n\nDelta Universal Format (UniForm) allows you to read Delta tables with Iceberg reader clients. This feature requires Databricks Runtime 14.3 LTS or above. \nImportant \nFor documentation for the legacy UniForm `IcebergCompatV1` table feature, see [Legacy UniForm IcebergCompatV1](https:\/\/docs.databricks.com\/archive\/legacy\/uniform.html). \nUniForm takes advantage of the fact that both Delta Lake and Iceberg consist of Parquet data files and a metadata layer. 
UniForm automatically generates Iceberg metadata asynchronously, without rewriting data, so that Iceberg clients can read Delta tables as if they were Iceberg tables. A single copy of the data files serves both formats. \nYou can configure an external connection to have Unity Catalog act as an Iceberg catalog. See [Read using the Unity Catalog Iceberg catalog endpoint](https:\/\/docs.databricks.com\/delta\/uniform.html#catalog-api). \nUniForm uses zstd instead of snappy as the compression codec for underlying Parquet data files. \nNote \nUniForm metadata generation runs asynchronously on the compute used to write data to Delta tables, which might increase the driver resource usage.\n\n### Use UniForm to read Delta Tables with Iceberg clients\n#### Requirements\n\nTo enable UniForm, you must fulfill the following requirements: \n* The Delta table must be registered to Unity Catalog. Both managed and external tables are supported.\n* The table must have column mapping enabled. See [Rename and drop columns with Delta Lake column mapping](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html).\n* The Delta table must have a `minReaderVersion` >= 2 and `minWriterVersion` >= 7. See [How does Databricks manage Delta Lake feature compatibility?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html).\n* Writes to the table must use Databricks Runtime 14.3 LTS or above. \nNote \nYou cannot enable deletion vectors on a table with UniForm enabled. When enabling UniForm on an existing table with deletion vectors enabled, UniForm disables and purges deletion vectors and rewrites data files as necessary.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/uniform.html"} +{"content":"# What is Delta Lake?\n### Use UniForm to read Delta Tables with Iceberg clients\n#### Enable Delta UniForm\n\nImportant \nEnabling Delta UniForm sets the Delta table feature `IcebergCompatV2`, a write protocol feature. 
Only clients that support this table feature can write to UniForm-enabled tables. You must use Databricks Runtime 14.3 LTS or above to write to Delta tables with this feature enabled. \nYou can turn off UniForm by unsetting the `delta.universalFormat.enabledFormats` table property. You cannot turn off column mapping after it has been enabled, and upgrades to Delta Lake reader and writer protocol versions cannot be undone. \nYou must set the following table properties to enable UniForm support for Iceberg: \n```\n'delta.enableIcebergCompatV2' = 'true'\n'delta.universalFormat.enabledFormats' = 'iceberg'\n\n``` \nYou must also enable [column mapping](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html) to use UniForm. This is enabled automatically if you enable UniForm during table creation, as in the following example: \n```\nCREATE TABLE T(c1 INT) TBLPROPERTIES(\n'delta.enableIcebergCompatV2' = 'true',\n'delta.universalFormat.enabledFormats' = 'iceberg');\n\n``` \nYou can enable UniForm on an existing table using the following syntax: \n```\nREORG TABLE table_name APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2));\n\n``` \nNote \nThis syntax also works to upgrade from the Public Preview version of UniForm, which used the table feature `IcebergCompatV1`. \nThis syntax automatically disables and purges deletion vectors from the table. Existing files are rewritten as necessary to make them Iceberg compatible. \nWhen you first enable UniForm, asynchronous metadata generation begins. This task must complete before external clients can query the table using Iceberg. See [Check Iceberg metadata generation status](https:\/\/docs.databricks.com\/delta\/uniform.html#status). \nNote \nIf you plan to use BigQuery as your Iceberg reader client, you must set `spark.databricks.delta.write.dataFilesToSubdir` to `true` on Databricks to accommodate a BigQuery requirement for data layout. 
\nSee [Limitations](https:\/\/docs.databricks.com\/delta\/uniform.html#limitations).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/uniform.html"} +{"content":"# What is Delta Lake?\n### Use UniForm to read Delta Tables with Iceberg clients\n#### When does UniForm generate Iceberg metadata?\n\nDatabricks triggers Iceberg metadata generation asynchronously after a Delta Lake write transaction completes using the same compute that completed the Delta transaction. You can also manually trigger Iceberg metadata generation. See [Manually trigger Iceberg metadata conversion](https:\/\/docs.databricks.com\/delta\/uniform.html#manual-trigger). \nTo avoid write latencies associated with Iceberg metadata generation, Delta tables with frequent commits might bundle multiple Delta commits into a single Iceberg commit. \nDelta Lake ensures that only one Iceberg metadata generation process is in progress at any time. Commits that would trigger a second concurrent Iceberg metadata generation process will successfully commit to Delta, but they won\u2019t trigger asynchronous Iceberg metadata generation. This prevents cascading latency for metadata generation for workloads with frequent commits (seconds to minutes between commits). \nSee [Delta and Iceberg table versions](https:\/\/docs.databricks.com\/delta\/uniform.html#versions).\n\n### Use UniForm to read Delta Tables with Iceberg clients\n#### Check Iceberg metadata generation status\n\nUniForm adds the following fields to Unity Catalog and Iceberg table metadata to track metadata generation status: \n| Metadata field | Description |\n| --- | --- |\n| `converted_delta_version` | The latest version of the Delta table for which Iceberg metadata was successfully generated. |\n| `converted_delta_timestamp` | The timestamp of the latest Delta commit for which Iceberg metadata was successfully generated. 
| \nOn Databricks, you can review these metadata fields by doing one of the following: \n* Reviewing the `Delta Uniform Iceberg` section returned by `DESCRIBE EXTENDED table_name`.\n* Reviewing table metadata with Catalog Explorer.\n* Using the [REST API to get a table](https:\/\/docs.databricks.com\/api\/workspace\/tables\/get). \nSee documentation for your Iceberg reader client for how to review table properties outside Databricks. For OSS Apache Spark, you can see these properties using the following syntax: \n```\nSHOW TBLPROPERTIES <table-name>;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/uniform.html"} +{"content":"# What is Delta Lake?\n### Use UniForm to read Delta Tables with Iceberg clients\n#### Manually trigger Iceberg metadata conversion\n\nYou can manually trigger Iceberg metadata generation for the latest version of the Delta table. This operation runs synchronously, meaning that when it completes, the table contents available in Iceberg reflect the latest version of the Delta table available when the conversion process started. \nThis operation should not be necessary under normal conditions, but can help if you encounter the following: \n* A cluster terminates before automatic metadata generation succeeds.\n* An error or job failure interrupts metadata generation.\n* A client that does not support UniForm Iceberg metadata generation writes to the Delta table. \nUse the following syntax to manually trigger Iceberg metadata generation: \n```\nMSCK REPAIR TABLE <table-name> SYNC METADATA\n\n``` \nSee [REPAIR TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-repair-table.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/uniform.html"} +{"content":"# What is Delta Lake?\n### Use UniForm to read Delta Tables with Iceberg clients\n#### Read using a metadata JSON path\n\nSome Iceberg clients require you to provide a path to versioned metadata files to register external Iceberg tables. 
Each time UniForm converts a new version of the Delta table to Iceberg, it creates a new metadata JSON file. \nClients that use metadata JSON paths for configuring Iceberg include BigQuery. Refer to documentation for the Iceberg reader client for configuration details. \nDelta Lake stores Iceberg metadata under the table directory, using the following pattern: \n```\n<table-path>\/metadata\/<version-number>-<uuid>.metadata.json\n\n``` \nOn Databricks, you can review this metadata location by doing one of the following: \n* Reviewing the `Delta Uniform Iceberg` section returned by `DESCRIBE EXTENDED table_name`.\n* Reviewing table metadata with Catalog Explorer.\n* Using the following command with the REST API: \n```\nGET api\/2.1\/unity-catalog\/tables\/<catalog-name>.<schema-name>.<table-name>\n\n``` \nThe response includes the following information: \n```\n{\n...\n\"delta_uniform_iceberg\": {\n\"metadata_location\": \"<cloud-storage-uri>\/metadata\/v<version-number>-<uuid>.metadata.json\"\n}\n}\n\n``` \nImportant \nPath-based Iceberg reader clients might require manually updating and refreshing metadata JSON paths to read current table versions. Users might encounter errors when querying Iceberg tables using out-of-date versions as Parquet data files are removed from the Delta table with `VACUUM`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/uniform.html"} +{"content":"# What is Delta Lake?\n### Use UniForm to read Delta Tables with Iceberg clients\n#### Read using the Unity Catalog Iceberg catalog endpoint\n\nSome Iceberg clients can connect to an Iceberg REST catalog. Unity Catalog provides a read-only implementation of the Iceberg REST catalog API for Delta tables with UniForm enabled using the endpoint `\/api\/2.1\/unity-catalog\/iceberg`. See the [Iceberg REST API spec](https:\/\/github.com\/apache\/iceberg\/blob\/master\/open-api\/rest-catalog-open-api.yaml) for details on using this REST API. 
\nClients known to support the Iceberg catalog API include Apache Spark, Flink, and Trino. You must configure access to the underlying cloud object storage containing the Delta table with UniForm enabled. Refer to documentation for the Iceberg reader client for configuration details. \nYou must generate and configure a Databricks personal access token to allow other services to connect to Unity Catalog. See [Authentication for Databricks automation - overview](https:\/\/docs.databricks.com\/dev-tools\/auth\/index.html). \nThe following is an example of the settings to configure OSS Apache Spark to read UniForm as Iceberg: \n```\n\"spark.sql.extensions\": \"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions\",\n\"spark.sql.catalog.unity\": \"org.apache.iceberg.spark.SparkCatalog\",\n\"spark.sql.catalog.unity.catalog-impl\": \"org.apache.iceberg.rest.RESTCatalog\",\n\"spark.sql.catalog.unity.uri\": \"<api-root>\/api\/2.1\/unity-catalog\/iceberg\",\n\"spark.sql.catalog.unity.token\": \"<your_personal_access_token>\",\n\"spark.sql.catalog.unity.io-impl\": \"org.apache.iceberg.aws.s3.S3FileIO\"\n\n``` \nSubstitute the full URL of the workspace in which you generated the personal access token for `<api-root>`. \nNote \nWhen querying tables in Unity Catalog using this method, object identifiers use the following pattern: \n```\nunity.<catalog-name>.<schema-name>.<table-name>\n\n``` \nThis pattern uses the same three-tier namespacing present in Unity Catalog, but adds an additional prefix `unity`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/uniform.html"} +{"content":"# What is Delta Lake?\n### Use UniForm to read Delta Tables with Iceberg clients\n#### Delta and Iceberg table versions\n\nBoth Delta Lake and Iceberg allow time travel queries using table versions or timestamps stored in table metadata. \nIn general, Iceberg and Delta table versions do not align by either the commit timestamp or the version ID. 
If you wish to verify which version of a Delta table a given version of an Iceberg table corresponds to, you can use the corresponding table properties set on the Iceberg table. See [Check Iceberg metadata generation status](https:\/\/docs.databricks.com\/delta\/uniform.html#status).\n\n### Use UniForm to read Delta Tables with Iceberg clients\n#### Limitations\n\nThe following limitations exist: \n* UniForm does not work on tables with deletion vectors enabled. See [What are deletion vectors?](https:\/\/docs.databricks.com\/delta\/deletion-vectors.html).\n* Delta tables with UniForm enabled do not support `VOID` types.\n* Iceberg clients can only read from UniForm. Writes are not supported.\n* Iceberg reader clients might have individual limitations, regardless of UniForm. See documentation for your chosen client.\n* The recipients of Delta Sharing can only read the table as Delta, even when UniForm is enabled. \nChange Data Feed works for Delta clients when UniForm is enabled, but does not have support in Iceberg. \nSome Delta Lake table features used by UniForm are not supported by some Delta Sharing reader clients. See [Share data and AI assets securely using Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/uniform.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n#### Track ML and deep learning training runs\n###### Organize training runs with MLflow experiments\n\nExperiments are units of organization for your [model training runs](https:\/\/docs.databricks.com\/mlflow\/runs.html). There are two types of experiments: workspace and notebook. \n* You can create a workspace experiment from the Databricks Machine Learning UI or the MLflow API. 
Workspace experiments are not associated with any notebook, and any notebook can log a run to these experiments by using the experiment ID or the experiment name.\n* A notebook experiment is associated with a specific notebook. Databricks automatically creates a notebook experiment if there is no active experiment when you start a run using [mlflow.start\\_run()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.start_run). \nTo see all of the experiments in a workspace that you have access to, select **Machine Learning > Experiments** in the sidebar. \n![Experiments page](https:\/\/docs.databricks.com\/_images\/experiments-page.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/experiments.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n#### Track ML and deep learning training runs\n###### Organize training runs with MLflow experiments\n####### Create workspace experiment\n\nThis section describes how to create a workspace experiment using the Databricks UI. You can create a workspace experiment directly [from the workspace](https:\/\/docs.databricks.com\/mlflow\/experiments.html#create-expt-from-workspace) or [from the Experiments page](https:\/\/docs.databricks.com\/mlflow\/experiments.html#create-expt-from-expts-page). \nYou can also use the [MLflow API](https:\/\/mlflow.org\/docs\/latest\/index.html), or the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) with [databricks\\_mlflow\\_experiment](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/mlflow_experiment). \nFor instructions on logging runs to workspace experiments, see [Logging example notebook](https:\/\/docs.databricks.com\/mlflow\/tracking.html#mlflow-recording-runs). \n1. Click ![Workspace Icon](https:\/\/docs.databricks.com\/_images\/workspace-icon.png) **Workspace** in the sidebar.\n2. 
Navigate to the folder in which you want to create the experiment.\n3. Right-click on the folder and select **Create > MLflow experiment**.\n4. In the Create MLflow Experiment dialog, enter a name for the experiment and an optional artifact location. If you do not specify an artifact location, artifacts are stored in `dbfs:\/databricks\/mlflow-tracking\/<experiment-id>`. \nDatabricks supports [DBFS](https:\/\/docs.databricks.com\/dbfs\/index.html), S3, and Azure Blob storage artifact locations. \nTo store artifacts in S3, specify a URI of the form `s3:\/\/<bucket>\/<path>`. MLflow obtains credentials to access S3 from your cluster\u2019s [instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html). Artifacts stored in S3 do not appear in the MLflow UI; you must download them using an object storage client. \nNote \nFor MLflow version 2.3.0 and above, the maximum size for an MLflow artifact uploaded to DBFS on AWS is 5 TiB. For MLflow version 2.3.0 and lower, the maximum size for an MLflow artifact uploaded to DBFS on AWS is 5 GiB. \nNote \nWhen you store an artifact in a location other than DBFS, the artifact does not appear in the MLflow UI. Models stored in locations other than DBFS cannot be registered in Model Registry.\n5. Click **Create**. An empty experiment appears. \nYou can also create a new workspace experiment from the Experiments page. To create a new experiment, use the ![create experiment drop-down](https:\/\/docs.databricks.com\/_images\/create-expt-dropdown.png) drop-down menu. From the drop-down menu, you can select either an [AutoML experiment](https:\/\/docs.databricks.com\/machine-learning\/automl\/index.html) or a blank (empty) experiment. \n* AutoML experiment. The **Configure AutoML experiment** page appears. 
For information about using AutoML, see [Train ML models with the Databricks AutoML UI](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html).\n* Blank experiment. The **Create MLflow Experiment** dialog appears. Enter a name and optional artifact location in the dialog to create a new workspace experiment. The default artifact location is `dbfs:\/databricks\/mlflow-tracking\/<experiment-id>`. \nTo log runs to this experiment, call `mlflow.set_experiment()` with the experiment path. The experiment path appears at the top of the experiment page. See [Logging example notebook](https:\/\/docs.databricks.com\/mlflow\/tracking.html#mlflow-recording-runs) for details and an example notebook.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/experiments.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n#### Track ML and deep learning training runs\n###### Organize training runs with MLflow experiments\n####### Create notebook experiment\n\nWhen you use the [mlflow.start\\_run() command](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.start_run) in a notebook, the run logs metrics and parameters to the active experiment. If no experiment is active, Databricks creates a notebook experiment. A notebook experiment shares the same name and ID as its corresponding notebook. The notebook ID is the numerical identifier at the end of a [Notebook URL and ID](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-notebook-url). \nAlternatively, you can pass a Databricks workspace path to an existing notebook in [mlflow.set\\_experiment()](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.html?highlight=set_experiment#mlflow.set_experiment) to create a notebook experiment for it. 
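The `mlflow.set_experiment()` and `mlflow.start_run()` calls described above can be combined roughly as follows. This is a minimal sketch, not the documented example notebook: the experiment path, the parameter name, and the metric value are placeholders chosen for illustration, and the helper name `log_example_run` is hypothetical.

```python
# Hypothetical workspace path; replace it with the path shown at the top
# of your experiment page (or with the path to an existing notebook).
EXPERIMENT_PATH = "/Users/someone@example.com/my-experiment"


def log_example_run(experiment_path: str = EXPERIMENT_PATH) -> str:
    """Set the active experiment, log one run, and return the run ID."""
    # Imported inside the function so the module still loads where MLflow
    # is not installed.
    import mlflow

    # Makes the experiment at this path the active experiment.
    mlflow.set_experiment(experiment_path)

    # Runs started after set_experiment() log to the active experiment.
    with mlflow.start_run() as run:
        mlflow.log_param("learning_rate", 0.01)  # placeholder parameter
        mlflow.log_metric("rmse", 0.42)          # placeholder metric
        return run.info.run_id
```

Calling `log_example_run()` in a notebook attached to compute with MLflow available should create one run under the chosen experiment.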
\nFor instructions on logging runs to notebook experiments, see [Logging example notebook](https:\/\/docs.databricks.com\/mlflow\/tracking.html#mlflow-recording-runs). \nNote \nIf you delete a notebook experiment using the API (for example, `mlflow.tracking.MlflowClient.delete_experiment()` in Python), the notebook itself is moved into the Trash folder.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/experiments.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n#### Track ML and deep learning training runs\n###### Organize training runs with MLflow experiments\n####### View experiments\n\nEach experiment that you have access to appears on the experiments page. From this page, you can view any experiment. Click on an experiment name to display the experiment page. \nAdditional ways to access the experiment page: \n* You can access the experiment page for a workspace experiment from the workspace menu.\n* You can access the experiment page for a notebook experiment from the notebook. \nTo search for experiments, type text in the **Filter experiments** field and press **Enter** or click the magnifying glass icon. The experiment list changes to show only those experiments that contain the search text in the **Name**, **Created by**, **Location**, or **Description** column. \nClick the name of any experiment in the table to display its experiment page: \n![View experiment](https:\/\/docs.databricks.com\/_images\/quick-start-nb-experiment.png) \nThe experiment page lists all runs associated with the experiment. From the table, you can open the run page for any run associated with the experiment by clicking its **Run Name**. The **Source** column gives you access to the notebook version that created the run. You can also search and [filter](https:\/\/docs.databricks.com\/mlflow\/runs.html#filter-runs) runs by metrics or parameter settings. \n### View workspace experiment \n1. 
Click ![Workspace Icon](https:\/\/docs.databricks.com\/_images\/workspace-icon.png) **Workspace** in the sidebar.\n2. Go to the folder containing the experiment.\n3. Click the experiment name. \n### View notebook experiment \nIn the notebook\u2019s right sidebar, click the **Experiment** icon ![Experiment icon](https:\/\/docs.databricks.com\/_images\/experiment1.png). \nThe Experiment Runs sidebar appears and shows a summary of each run associated with the notebook experiment, including run parameters and metrics. At the top of the sidebar is the name of the experiment that the notebook most recently logged runs to (either a notebook experiment or a workspace experiment). \n![View run parameters and metrics](https:\/\/docs.databricks.com\/_images\/mlflow-notebook-revision.png) \nFrom the sidebar, you can navigate to the experiment page or directly to a run. \n* To view the experiment, click ![External Link](https:\/\/docs.databricks.com\/_images\/external-link.png) at the far right, next to **Experiment Runs**.\n* To display a [run](https:\/\/docs.databricks.com\/mlflow\/runs.html), click the name of the run.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/experiments.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n#### Track ML and deep learning training runs\n###### Organize training runs with MLflow experiments\n####### Manage experiments\n\nYou can rename, delete, or manage permissions for an experiment you own from the experiments page, the [experiment page](https:\/\/docs.databricks.com\/mlflow\/experiments.html#experiment-page), or the workspace menu. \nNote \nYou cannot directly rename, delete, or manage permissions on an MLflow experiment that was created by a notebook in a Databricks Git folder. You must perform these actions at the Git folder level. 
\n### Rename experiment from the experiments page or the experiment page \nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nTo rename an experiment from the experiments page or the experiment page, click ![three button icon](https:\/\/docs.databricks.com\/_images\/three-button-icon.png) and select **Rename**. \n### Rename experiment from the workspace menu \n1. Click ![Workspace Icon](https:\/\/docs.databricks.com\/_images\/workspace-icon.png) **Workspace** in the sidebar.\n2. Go to the folder containing the experiment.\n3. Right-click on the experiment name and select **Rename**. \n### Copy experiment name \nTo copy the experiment name, click ![Copy Icon](https:\/\/docs.databricks.com\/_images\/copy-icon.png) at the top of the experiment page. You can use this name in the MLflow command `set_experiment` to set the active MLflow experiment. \n![Experiment name icon](https:\/\/docs.databricks.com\/_images\/get-experiment-name.png) \nYou can also copy the experiment name from the [experiment sidebar in a notebook](https:\/\/docs.databricks.com\/mlflow\/experiments.html#view-notebook-experiment). \n### Delete notebook experiment \nNotebook experiments are part of the notebook and cannot be deleted separately. When you [delete a notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#delete-a-notebook), the associated notebook experiment is deleted. When you delete a notebook experiment using the UI, the notebook is also deleted. \nTo delete notebook experiments using the API, use the [Workspace API](https:\/\/docs.databricks.com\/api\/workspace\/introduction) to ensure both the notebook and experiment are deleted from the workspace. \n### Delete workspace experiment from the workspace menu \n1. Click ![Workspace Icon](https:\/\/docs.databricks.com\/_images\/workspace-icon.png) **Workspace** in the sidebar.\n2. Go to the folder containing the experiment.\n3. 
Right-click on the experiment name and select **Move to Trash**. \n### Delete workspace or notebook experiment from the experiments page or the experiment page \nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nTo delete an experiment from the experiments page or the [experiment page](https:\/\/docs.databricks.com\/mlflow\/experiments.html#experiment-page), click ![three button icon](https:\/\/docs.databricks.com\/_images\/three-button-icon.png) and select **Delete**. \nWhen you delete a notebook experiment, the notebook is also deleted. \n### Change permissions for experiment \nTo change permissions for an experiment from the [experiment page](https:\/\/docs.databricks.com\/mlflow\/experiments.html#experiment-page), click **Share**. \n![Experiment page permissions button](https:\/\/docs.databricks.com\/_images\/expt-permission.png) \nYou can change permissions for an experiment that you own from the [experiments page](https:\/\/docs.databricks.com\/mlflow\/experiments.html#experiment-page). Click ![three button icon](https:\/\/docs.databricks.com\/_images\/three-button-icon.png) in the **Actions** column and select **Permission**. \nFor information on experiment permission levels, see [MLFlow experiment ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#experiments).\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/experiments.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n#### Track ML and deep learning training runs\n###### Organize training runs with MLflow experiments\n####### Copy experiments between workspaces\n\nTo migrate MLflow experiments between workspaces, you can use the community-driven open source project [MLflow Export-Import](https:\/\/github.com\/mlflow\/mlflow-export-import#why-use-mlflow-export-import). 
\nWith these tools, you can: \n* Share and collaborate with other data scientists in the same or another tracking server. For example, you can clone an experiment from another user into your workspace.\n* Copy MLflow experiments and runs from your local tracking server to your Databricks workspace.\n* Back up mission critical experiments and models to another Databricks workspace.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/experiments.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to MicroStrategy\n\nThis article describes how to use MicroStrategy Workstation with a Databricks cluster or a Databricks SQL warehouse (formerly Databricks SQL endpoint).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/microstrategy.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to MicroStrategy\n##### Requirements\n\nBefore you connect to MicroStrategy manually, you need the following: \n* A cluster or SQL warehouse in your Databricks workspace. \n+ [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n+ [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* The connection details for your cluster or SQL warehouse, specifically the **Server Hostname**, **Port**, and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). To create a personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. 
(Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**.\nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. See your workspace administrator or the following: \n+ [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n+ [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. 
To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/microstrategy.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to MicroStrategy\n##### Connect to MicroStrategy manually\n\nTo connect to MicroStrategy manually, do the following: \n1. Start MicroStrategy Workstation.\n2. In the navigation bar, in the **Analysis** area, next to **Dossiers**, click the plus (**Create a new dossier**) icon.\n3. In the **Untitled Dossier** window, in the **Datasets** panel, click **New Data**. If the **Datasets** pane is not visible, in the navigation bar, click the dataset (**Click to Open Datasets Panel**) icon.\n4. In the **Data Sources** dialog, click the **Databricks** icon. If the **Databricks** icon is not visible, scroll to view it, or in the **Search** box, enter **Databricks**.\n5. In the **Select Import Options**, click to **Select Tables**, **Build a Query**, or **Type a Query**, and click **Next**.\n6. In the **Import from Table - Select** or **Import from Table - SQL Editor** dialog, next to **Data Sources**, click the plus (**New Data Source**) icon.\n7. In the **Connections** dialog, enter a **Connection Name**.\n8. For **[JDBC URL]**, enter the JDBC URL from Step 1. If the URL ends with `;UID=token;PWD=<personal-access-token>`, do not include that portion of the URL, as you will provide this information next.\n9. If a **Token** box displays, enter your personal access token from Step 1.\n10. If a **User** box displays, enter the word `token`.\n11. If a **Password** box displays, enter your personal access token from Step 1.\n12. 
Click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/microstrategy.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to MicroStrategy\n##### Next steps\n\nTo continue using MicroStrategy Workstation, see the following resources on the MicroStrategy website: \n* [Introduction to Dossiers](https:\/\/www2.microstrategy.com\/producthelp\/Current\/Workstation\/en-us\/Content\/Intro_to_Dossiers.htm)\n* [How to Connect to Databricks](https:\/\/www2.microstrategy.com\/producthelp\/Current\/Gateway_Connections\/WebHelp\/Lang_1033\/Content\/databricks.htm)\n* [MicroStrategy Workstation Help](https:\/\/www2.microstrategy.com\/producthelp\/Current\/Workstation\/en-us\/Content\/home_workstation.htm)\n\n#### Connect to MicroStrategy\n##### Additional resources\n\n[Support](https:\/\/www.microstrategy.com\/support)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/microstrategy.html"} +{"content":"# Databricks data engineering\n## What are init scripts?\n#### Init script logging\n\nInit script start and finish events are captured in cluster event logs. Details are captured in cluster logs. Global init script create, edit, and delete events are also captured in account-level audit logs.\n\n#### Init script logging\n##### Init script events\n\n[Cluster event logs](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#event-log) capture two init script events: `INIT_SCRIPTS_STARTED` and `INIT_SCRIPTS_FINISHED`, indicating which scripts are scheduled for execution and which have completed successfully. `INIT_SCRIPTS_FINISHED` also captures execution duration. \nGlobal init scripts are indicated in the log event details by the key `\"global\"` and cluster-scoped init scripts are indicated by the key `\"cluster\"`. 
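To retrieve only the two init script events named above, one option is to filter cluster events by type. The sketch below only builds a request body and does not send it. It assumes the Clusters API events endpoint at `/api/2.0/clusters/events` with `cluster_id`, `event_types`, and `limit` fields. None of that comes from this article, so confirm it against the API reference before relying on it. The helper name is hypothetical.

```python
import json

# Event type names as given in this article.
INIT_SCRIPT_EVENT_TYPES = ["INIT_SCRIPTS_STARTED", "INIT_SCRIPTS_FINISHED"]

# Assumed endpoint path; verify against the Clusters API reference.
EVENTS_ENDPOINT = "/api/2.0/clusters/events"


def build_init_script_events_request(cluster_id: str, limit: int = 50) -> str:
    """Return a JSON body that restricts cluster events to init script events."""
    payload = {
        "cluster_id": cluster_id,
        "event_types": INIT_SCRIPT_EVENT_TYPES,
        "limit": limit,
    }
    return json.dumps(payload)
```

The resulting string would be sent as the body of a POST request to `EVENTS_ENDPOINT` on your workspace URL, authenticated with your own credentials.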
\nNote \nCluster event logs do not log init script events for each cluster node; only one node is selected to represent them all.\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/logs.html"} +{"content":"# Databricks data engineering\n## What are init scripts?\n#### Init script logging\n##### Where are init script logs written?\n\nIf [cluster log delivery](https:\/\/docs.databricks.com\/compute\/configure.html#cluster-log-delivery) is configured for a cluster, the init script logs are written to `\/<cluster-log-path>\/<cluster-id>\/init_scripts`. \nLogs for each container in the cluster are written to a subdirectory called `init_scripts\/<cluster-id>_<container-ip>`. \nFor example, if `cluster-log-path` is set to `cluster-logs`, the path to the logs for a specific container would be: `dbfs:\/cluster-logs\/<cluster-id>\/init_scripts\/<cluster-id>_<container-ip>`. \nIf the cluster is configured to write logs to DBFS, you can view the logs using the [File system utility (dbutils.fs)](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#dbutils-fs) or the [DBFS CLI (legacy)](https:\/\/docs.databricks.com\/archive\/dev-tools\/cli\/dbfs-cli.html). For example, if the cluster ID is `1001-234039-abcde739`: \n```\ndbfs ls dbfs:\/cluster-logs\/1001-234039-abcde739\/init_scripts\n\n``` \n```\n1001-234039-abcde739_10_97_225_166\n1001-234039-abcde739_10_97_231_88\n1001-234039-abcde739_10_97_244_199\n\n``` \n```\ndbfs ls dbfs:\/cluster-logs\/1001-234039-abcde739\/init_scripts\/1001-234039-abcde739_10_97_225_166\n\n``` \n```\n<timestamp>_<log-id>_<init-script-name>.sh.stderr.log\n<timestamp>_<log-id>_<init-script-name>.sh.stdout.log\n\n``` \nWhen cluster log delivery is not configured, logs are written to `\/databricks\/init_scripts`. 
You can use standard shell commands in a notebook to list and view the logs: \n```\n%sh\nls \/databricks\/init_scripts\/\ncat \/databricks\/init_scripts\/<timestamp>_<log-id>_<init-script-name>.sh.stdout.log\n\n``` \nEvery time a cluster launches, it writes a log to the init script log folder. \nImportant \nAny user who creates a cluster and enables cluster log delivery can view the `stderr` and `stdout` output from global init scripts. You should ensure that your global init scripts do not output any sensitive information.\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/logs.html"} +{"content":"# Databricks data engineering\n## What are init scripts?\n#### Init script logging\n##### Init script events in audit logs\n\nDatabricks audit logs capture global init script create, edit, and delete events under the event type `globalInitScripts`. See [Global init scripts events](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html#init-scripts).\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/logs.html"} +{"content":"# Compute\n## Use compute\n#### Customize containers with Databricks Container Service\n\nDatabricks Container Services lets you specify a Docker image when you create compute. Some example use cases include: \n* Library customization: you have full control over the system libraries you want installed.\n* Golden container environment: your Docker image is a locked down environment that will never change.\n* Docker CI\/CD integration: you can integrate Databricks with your Docker CI\/CD pipelines. \nYou can also use Docker images to create custom deep learning environments on compute with GPU devices. For additional information about using GPU compute with Databricks Container Services, see [Databricks Container Services on GPU compute](https:\/\/docs.databricks.com\/compute\/gpu.html#databricks-container-services-on-gpu). 
\nFor tasks to be executed each time the container starts, use an [init script](https:\/\/docs.databricks.com\/compute\/custom-containers.html#containers-init-script).\n\n#### Customize containers with Databricks Container Service\n##### Requirements\n\n* Your Databricks workspace must have Databricks Container Services [enabled](https:\/\/docs.databricks.com\/compute\/custom-containers.html#enable).\n* Your machine must be running a recent Docker daemon (one that is tested and works with Client\/Server Version 18.03.0-ce) and the `docker` command must be available on your `PATH`.\n\n#### Customize containers with Databricks Container Service\n##### Limitations\n\n* Databricks Container Services is not supported on compute using shared access mode.\n* Databricks Runtime for Machine Learning does not support Databricks Container Services.\n* To access Volumes on Databricks Container Services, add the following configuration to the compute\u2019s **Spark config** field: `spark.databricks.unityCatalog.volumes.enabled true`. \n* Databricks Container Services is not supported on AWS Graviton instance types.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/custom-containers.html"} +{"content":"# Compute\n## Use compute\n#### Customize containers with Databricks Container Service\n##### Step 1: Build your base\n\nDatabricks recommends that you build your Docker base from a base that Databricks has built and tested. It is also possible to build your Docker base from scratch. This section describes the two options. \n### Option 1. Use a base built by Databricks \nThis example uses the `9.x` tag for an image that will target a compute with runtime version Databricks Runtime 9.1 LTS and above: \n```\nFROM databricksruntime\/standard:9.x\n...\n\n``` \nTo specify additional Python libraries, such as the latest versions of pandas and urllib3, use the container-specific version of `pip`. 
For the `databricksruntime\/standard:9.x` container, include the following: \n```\nRUN \/databricks\/python3\/bin\/pip install pandas\nRUN \/databricks\/python3\/bin\/pip install urllib3\n\n``` \nFor the `databricksruntime\/standard:8.x` container or lower, include the following: \n```\nRUN \/databricks\/conda\/envs\/dcs-minimal\/bin\/pip install pandas\nRUN \/databricks\/conda\/envs\/dcs-minimal\/bin\/pip install urllib3\n\n``` \nBase images are hosted on Docker Hub at <https:\/\/hub.docker.com\/u\/databricksruntime>. The Dockerfiles used to generate these bases are at <https:\/\/github.com\/databricks\/containers>. \nNote \nDocker Hub hosted images with Tags with \u201c-LTS\u201d suffix will be patched. All other images are examples and are not patched regularly. \nNote \nThe base images `databricksruntime\/standard` and `databricksruntime\/minimal` are not to be confused with the unrelated `databricks-standard` and `databricks-minimal` environments included in the no longer available Databricks Runtime with Conda (Beta). \n### Option 2. Build your own Docker base \nYou can also build your Docker base from scratch. The Docker image must meet these requirements: \n* JDK 8u191 as Java on the system `PATH`\n* bash\n* iproute2 ([ubuntu iproute](https:\/\/packages.ubuntu.com\/search?keywords=iproute2))\n* coreutils ([ubuntu coreutils](https:\/\/packages.ubuntu.com\/search?keywords=coreutils))\n* procps ([ubuntu procps](https:\/\/packages.ubuntu.com\/search?keywords=procps))\n* sudo ([ubuntu sudo](https:\/\/packages.ubuntu.com\/search?keywords=sudo))\n* Ubuntu Linux \nTo build your own image from scratch, you must create the virtual environment. You must also include packages that are built into Databricks compute, such as Python and R. 
To get started, you can use the appropriate base image: \n* For R: `databricksruntime\/rbase`\n* For Python: `databricksruntime\/python`\n* For the minimal image built by Databricks: `databricksruntime\/minimal` \nYou can also refer to the example [Dockerfiles in GitHub](https:\/\/github.com\/databricks\/containers). \nNote \nDatabricks recommends using Ubuntu Linux; however, it is possible to use Alpine Linux. To use Alpine Linux, you must include these packages: \n* [alpine coreutils](https:\/\/www.gnu.org\/software\/coreutils\/)\n* [alpine procps](https:\/\/pkgs.alpinelinux.org\/packages?name=procps-ng)\n* [alpine sudo](https:\/\/pkgs.alpinelinux.org\/packages?name=sudo) \nIn addition, you must set up Python, as shown in this [example Dockerfile](https:\/\/github.com\/databricks\/containers\/blob\/master\/experimental\/alpine\/minimal\/Dockerfile). \nWarning \nTest your custom container image thoroughly on a Databricks compute. Your container may work on a local or build machine, but when your container is launched on Databricks the compute launch may fail, certain features may become disabled, or your container may stop working, even silently. In worst-case scenarios, it could corrupt your data or accidentally expose your data to external parties. \nAs a reminder, your use of Databricks Container Services is subject to the [Service Specific Terms](https:\/\/databricks.com\/product-specific-terms).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/custom-containers.html"} +{"content":"# Compute\n## Use compute\n#### Customize containers with Databricks Container Service\n##### Step 2: Push your base image\n\nPush your custom base image to a Docker registry. 
This process is supported with the following registries: \n* [Docker Hub](https:\/\/hub.docker.com\/) with no auth or basic auth.\n* [Amazon Elastic Container Registry (Amazon ECR)](https:\/\/aws.amazon.com\/ecr\/) with IAM (with the exception of Commercial Cloud Services (C2S)).\n* [Azure Container Registry](https:\/\/learn.microsoft.com\/azure\/container-registry\/) with basic auth. \nOther Docker registries that support no auth or basic auth are also expected to work. \nNote \nIf you use Docker Hub for your Docker registry, be sure to check that rate limits accommodate the amount of compute that you expect to launch in a six-hour period. These rate limits are different for anonymous users, authenticated users without a paid subscription, and paid subscriptions. See [the Docker documentation](https:\/\/docs.docker.com\/docker-hub\/download-rate-limit\/) for details. If this limit is exceeded, you will get a \u201c429 Too Many Requests\u201d response.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/custom-containers.html"} +{"content":"# Compute\n## Use compute\n#### Customize containers with Databricks Container Service\n##### Step 3: Launch your compute\n\nYou can launch your compute using the UI or the API. \n### Launch your compute using the UI \n1. On the [Create compute page](https:\/\/docs.databricks.com\/compute\/configure.html), specify a Databricks Runtime Version that supports Databricks Container Services.\n2. Under **Advanced options**, select the **Docker** tab.\n3. Select **Use your own Docker container**.\n4. In the **Docker Image URL** field, enter your custom Docker image. \nDocker image URL examples: \n| Registry | Tag format |\n| --- | --- |\n| Docker Hub | `<organization>\/<repository>:<tag>` (for example: `databricksruntime\/standard:latest`) |\n| Amazon ECR | `<aws-account-id>.dkr.ecr.<region>.amazonaws.com\/<repository>:<tag>` |\n| Azure Container Registry | `<your-registry-name>.azurecr.io\/<repository-name>:<tag>` |\n5. 
Select the authentication type. You can use secrets to store username and password authentication values. See [Use secrets for authentication](https:\/\/docs.databricks.com\/compute\/custom-containers.html#secrets). \n### Launch your compute using the API \n1. [Generate an API token](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement).\n2. Use the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) to launch a compute with your custom Docker base. \n```\ndatabricks clusters create \\\n--cluster-name <cluster-name> \\\n--node-type-id i3.xlarge \\\n--json '{\n\"num_workers\": 0,\n\"docker_image\": {\n\"url\": \"databricksruntime\/standard:latest\",\n\"basic_auth\": {\n\"username\": \"<docker-registry-username>\",\n\"password\": \"<docker-registry-password>\"\n}\n},\n\"spark_version\": \"14.3.x-scala2.12\",\n\"aws_attributes\": {\n\"availability\": \"ON_DEMAND\",\n\"instance_profile_arn\": \"arn:aws:iam::<aws-account-number>:instance-profile\/<iam-role-name>\"\n}\n}'\n\n``` \n`basic_auth` requirements depend on your Docker image type: \n* For public Docker images, *do not* include the `basic_auth` field.\n* For private Docker images, you must include the `basic_auth` field, using a service principal ID and password as the username and password.\n* For Azure Container Registry, you must set the `basic_auth` field to the ID and password for a service principal. See [Azure Container Registry service principal authentication documentation](https:\/\/learn.microsoft.com\/azure\/container-registry\/container-registry-auth-service-principal) for information about creating the service principal.\n* For Amazon ECR images, do not include the `basic_auth` field. You must launch your compute with an [instance profile](https:\/\/docs.databricks.com\/api\/workspace\/clusters) that includes permissions to pull Docker images from the Docker repository where the image resides. 
To do this, follow [steps 3 and 4 of the process for setting up secure access to S3 buckets using instance profiles](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html).\n* You can also use a secret to store authentication information. See [Use secrets for authentication](https:\/\/docs.databricks.com\/compute\/custom-containers.html#secrets). \nHere is an example of an IAM policy with permission to pull any image. The repository is specified by `<arn-of-repository>`. \n```\n{\n\"Version\": \"2012-10-17\",\n\"Statement\": [\n{\n\"Effect\": \"Allow\",\n\"Action\": [\n\"ecr:GetAuthorizationToken\"\n],\n\"Resource\": \"*\"\n},\n{\n\"Effect\": \"Allow\",\n\"Action\": [\n\"ecr:BatchCheckLayerAvailability\",\n\"ecr:GetDownloadUrlForLayer\",\n\"ecr:GetRepositoryPolicy\",\n\"ecr:DescribeRepositories\",\n\"ecr:ListImages\",\n\"ecr:DescribeImages\",\n\"ecr:BatchGetImage\"\n],\n\"Resource\": [ \"<arn-of-repository>\" ]\n}\n]\n}\n\n``` \nIf the Amazon ECR image resides in a different AWS account than the Databricks compute, use an [ECR repository policy](https:\/\/docs.aws.amazon.com\/AmazonECR\/latest\/userguide\/repository-policies.html) in addition to the compute instance profile to grant the compute access. Here is an example of an ECR repository policy. The IAM role assumed by the compute\u2019s instance profile is specified by `<arn-of-IAM-role>`. 
\n```\n{\n\"Version\": \"2012-10-17\",\n\"Statement\": [{\n\"Sid\": \"AllowCrossAccountPush\",\n\"Effect\": \"Allow\",\n\"Principal\": {\n\"AWS\": \"<arn-of-IAM-role>\"\n},\n\"Action\": [\n\"ecr:BatchCheckLayerAvailability\",\n\"ecr:BatchGetImage\",\n\"ecr:DescribeImages\",\n\"ecr:DescribeRepositories\",\n\"ecr:GetDownloadUrlForLayer\",\n\"ecr:GetRepositoryPolicy\",\n\"ecr:ListImages\"\n]\n}]\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/custom-containers.html"} +{"content":"# Compute\n## Use compute\n#### Customize containers with Databricks Container Service\n##### Use an init script\n\nDatabricks Container Services enables customers to include init scripts in the Docker container. In most cases, you should avoid init scripts and instead make customizations through Docker directly (using the Dockerfile). However, certain tasks must be executed when the container starts, instead of when the container is built. Use an init script for these tasks. \nFor example, suppose you want to run a security daemon inside a custom container. Install and build the daemon in the Docker image through your image building pipeline. Then, add an init script that starts the daemon. In this example, the init script would include a line like `systemctl start my-daemon`. \nIn the API, you can specify init scripts as part of the compute spec as follows. For more information, see the [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters). \n```\n\"init_scripts\": [\n{\n\"file\": {\n\"destination\": \"file:\/my\/local\/file.sh\"\n}\n}\n]\n\n``` \nFor Databricks Container Services images, you can also store init scripts in cloud storage. \nThe following steps take place when you launch a compute that uses Databricks Container Services: \n1. VMs are acquired from the cloud provider.\n2. The custom Docker image is downloaded from your repo.\n3. Databricks creates a Docker container from the image.\n4. 
Databricks Runtime code is copied into the Docker container.\n5. The init scripts are executed. See [What are init scripts?](https:\/\/docs.databricks.com\/init-scripts\/index.html). \nDatabricks ignores the Docker `CMD` and `ENTRYPOINT` primitives.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/custom-containers.html"} +{"content":"# Compute\n## Use compute\n#### Customize containers with Databricks Container Service\n##### Use secrets for authentication\n\nDatabricks Container Services supports using secrets for authentication. When creating your compute resource, instead of entering your plain text username or password, enter your secret using the `{{secrets\/<scope-name>\/<dcs-secret>}}` format. For information on creating secrets, see [Secrets](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html).\n\n#### Customize containers with Databricks Container Service\n##### Enable Container Services\n\nTo use custom containers on your compute, a workspace admin must enable Databricks Container Services. \nWorkspace admins can enable Databricks Container Services using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). In a JSON request body, set `enableDcs` to `true`, as in the following example: \n```\ndatabricks workspace-conf set-status \\\n--json '{\"enableDcs\": \"true\"}'\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/custom-containers.html"} +{"content":"# What is Delta Lake?\n### Shallow clone for Unity Catalog tables\n\nImportant \nShallow clone support for Unity Catalog managed tables is in Public Preview in Databricks Runtime 13.3 and above. Shallow clone support for Unity Catalog external tables is in Public Preview in Databricks Runtime 14.2 and above. \nYou can use shallow clone to create new Unity Catalog tables from existing Unity Catalog tables. 
Shallow clone support for Unity Catalog allows you to create tables with access control privileges independent from their parent tables without needing to copy underlying data files. \nImportant \nYou can only clone Unity Catalog managed tables to Unity Catalog managed tables and Unity Catalog external tables to Unity Catalog external tables. `VACUUM` behavior differs between managed and external tables. See [Vacuum and Unity Catalog shallow clones](https:\/\/docs.databricks.com\/delta\/clone-unity-catalog.html#vacuum). \nFor more on Delta clone, see [Clone a table on Databricks](https:\/\/docs.databricks.com\/delta\/clone.html). \nFor more on Unity Catalog tables, see [Create tables in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clone-unity-catalog.html"} +{"content":"# What is Delta Lake?\n### Shallow clone for Unity Catalog tables\n#### Create a shallow clone on Unity Catalog\n\nYou can create a shallow clone in Unity Catalog using the same syntax available for shallow clones throughout the product, as shown in the following syntax example: \n```\nCREATE TABLE <catalog-name>.<schema-name>.<target-table-name> SHALLOW CLONE <catalog-name>.<schema-name>.<source-table-name>\n\n``` \nTo create a shallow clone on Unity Catalog, you must have sufficient privileges on both the source and target resources, as detailed in the following table: \n| Resource | Permissions required |\n| --- | --- |\n| Source table | `SELECT` |\n| Source schema | `USE SCHEMA` |\n| Source catalog | `USE CATALOG` |\n| Target schema | `USE SCHEMA`, `CREATE TABLE` |\n| Target catalog | `USE CATALOG` |\n| Target external location (external tables only) | `CREATE EXTERNAL TABLE` | \nLike other create table statements, the user who creates a shallow clone is the owner of the target table. 
The owner of a target cloned table can control the access rights for that table independently of the source table. \nNote \nThe owner of a cloned table might be different than the owner of a source table.\n\n### Shallow clone for Unity Catalog tables\n#### Query or modify a shallow cloned table on Unity Catalog\n\nImportant \nThe instructions in this section describe privileges needed for compute configured with shared access mode. For Single User access mode, see [Work with shallow cloned tables in Single User access mode](https:\/\/docs.databricks.com\/delta\/clone-unity-catalog.html#single-user). \nTo query a shallow clone on Unity Catalog, you must have sufficient privileges on the table and containing resources, as detailed in the following table: \n| Resource | Permissions required |\n| --- | --- |\n| Catalog | `USE CATALOG` |\n| Schema | `USE SCHEMA` |\n| Table | `SELECT` | \nYou must also have `MODIFY` permissions on the target of the clone operation to complete the following actions: \n* Insert records\n* Delete records\n* Update records\n* `MERGE`\n* `CREATE OR REPLACE TABLE`\n* `DROP TABLE`\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clone-unity-catalog.html"} +{"content":"# What is Delta Lake?\n### Shallow clone for Unity Catalog tables\n#### Vacuum and Unity Catalog shallow clones\n\nImportant \nThis behavior is in Public Preview in Databricks Runtime 13.3 LTS and above for managed tables and Databricks Runtime 14.2 and above for external tables. \nWhen you use Unity Catalog tables for the source and target of a shallow clone operation, Unity Catalog manages the underlying data files to improve reliability for the source and target of the clone operation. Running `VACUUM` on the source of a shallow clone does not break the cloned table. \nNormally, when `VACUUM` identifies valid files for a given retention threshold, only the metadata for the current table is considered. 
Shallow clone support for Unity Catalog tracks the relationships between all cloned tables and the source data files, so valid files are expanded to include data files necessary for returning queries for any shallow cloned table as well as the source table. \nThis means that for Unity Catalog shallow clone `VACUUM` semantics, a valid data file is any file within the specified retention threshold for the source table or any cloned table. Managed tables and external tables have slightly different semantics. \nThis enhanced tracking of metadata changes how `VACUUM` operations impact data files backing the Delta tables, with the following semantics: \n* For managed tables, `VACUUM` operations against either the source or target of a shallow clone operation might delete data files from the source table.\n* For external tables, `VACUUM` operations only remove data files from the source table when run against the source table.\n* Only data files not considered valid for the source table or any shallow clone against the source are removed.\n* If multiple shallow clones are defined against a single source table, running `VACUUM` on any of the cloned tables does not remove valid data files for other cloned tables. \nNote \nDatabricks recommends never running `VACUUM` with a retention setting of less than 7 days to avoid corrupting ongoing long-running transactions. If you need to run `VACUUM` with a lower retention threshold, make sure you understand how `VACUUM` on shallow clones in Unity Catalog differs from how `VACUUM` interacts with other cloned tables on Databricks. 
See [Clone a table on Databricks](https:\/\/docs.databricks.com\/delta\/clone.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clone-unity-catalog.html"} +{"content":"# What is Delta Lake?\n### Shallow clone for Unity Catalog tables\n#### Work with shallow cloned tables in Single User access mode\n\nWhen working with Unity Catalog shallow clones in Single User access mode, you must have permissions on the resources for the cloned table source as well as the target table. \nThis means that for simple queries in addition to the [required permissions on the target table](https:\/\/docs.databricks.com\/delta\/clone-unity-catalog.html#query-permissions), you must have `USE` permissions on the source catalog and schema and `SELECT` permissions on the source table. For any queries that would update or insert records to the target table, you must also have `MODIFY` permissions on the source table. \nDatabricks recommends working with Unity Catalog clones on compute with shared access mode as this allows independent evolution of permissions for Unity Catalog shallow clone targets and their source tables.\n\n### Shallow clone for Unity Catalog tables\n#### Limitations\n\n* Shallow clones on external tables must be external tables. Shallow clones on managed tables must be managed tables.\n* You cannot share shallow clones using Delta Sharing.\n* You cannot nest shallow clones, meaning you cannot make a shallow clone from a shallow clone.\n* For managed tables, dropping the source table breaks the target table for shallow clones. Data files backing external tables are not removed by `DROP TABLE` operations, and so shallow clones of external tables are not impacted by dropping the source.\n* Unity Catalog allows users to `UNDROP` managed tables for around 7 days after a `DROP TABLE` command. In Databricks Runtime 13.3 LTS and above, managed shallow clones based on a dropped managed table continue to work during this 7 day period. 
If you do not `UNDROP` the source table in this window, the shallow clone stops functioning once the source table\u2019s data files are garbage collected.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clone-unity-catalog.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Set up Databricks Git folders (Repos)\n##### Set up private Git connectivity for Databricks Git folders (Repos)\n\nLearn about and configure Git server proxy for Git folders, which enables you to proxy Git commands from Databricks Git folders to your on-premises repos served by GitHub Enterprise Server, Bitbucket Server, and GitLab self-managed. \nNote \nUsers with a Databricks Git server proxy configured during preview should upgrade cluster permissions for best performance. See [Remove global CAN\\_ATTACH\\_TO permissions](https:\/\/docs.databricks.com\/repos\/git-proxy.html#remove-preview-perms).\n\n##### Set up private Git connectivity for Databricks Git folders (Repos)\n###### What is Git server proxy for Databricks Git folders?\n\nDatabricks Git server proxy for Git folders is a feature that allows you to proxy Git commands from your Databricks workspace to an on-premises Git server. \n[Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html) (formerly Repos) represents your connected Git repos as folders. The contents of these folders are version-controlled by syncing them to the connected Git repository. By default, Git folders can synchronize only with public Git providers (like public GitHub, GitLab, Azure DevOps, and others). However, if you host your own on-premises Git server (such as GitHub Enterprise Server, Bitbucket Server, or GitLab self-managed), you must use Git server proxy with Git folders to provide Databricks access to your Git server. Your Git server must be accessible from your Databricks data plane (driver node). 
\nNote \nCurrently, Databricks Git folders can contain only Databricks notebooks and sub-folders, along with a specific set of other asset types. For a current list of supported asset types, see [Limits & FAQ for Git integration with Databricks Git folders](https:\/\/docs.databricks.com\/repos\/limits.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-proxy.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Set up Databricks Git folders (Repos)\n##### Set up private Git connectivity for Databricks Git folders (Repos)\n###### How does Git Server Proxy for Databricks Git folders work?\n\nGit Server Proxy for Databricks Git folders proxies Git commands from the Databricks control plane to a \u201cproxy cluster\u201d running in your Databricks workspace\u2019s compute plane. In this context, the proxy cluster is a cluster configured to run a proxy service for Git commands from Databricks Git folders to your self-hosted Git repo. This proxy service receives Git commands from the Databricks control plane and forwards them to your Git server instance. \nThe diagram below illustrates the overall system architecture: \n![Diagram that shows how Git Server Proxy for Databricks Git folders is configured to run from a customer's compute plane](https:\/\/docs.databricks.com\/_images\/git-proxy-server1.png) \nImportant \nDatabricks provides an enablement notebook [you can run to configure your Git server instance to proxy commands for Databricks Git folders](https:\/\/docs.databricks.com\/repos\/git-proxy.html#enablement-notebook). [Get the enablement notebook on GitHub](https:\/\/github.com\/databricks\/databricks-repos-proxy\/blob\/main\/enable_git_proxy_jupyter.ipynb). \nA Git server proxy no longer requires `CAN_ATTACH_TO` permission for all users. Admins with existing proxy clusters can now modify the cluster ACL permission to enable this feature. To enable it: \n1. 
Select **Compute** from the sidebar, and then click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu next to the Compute entry for the Git Server Proxy you\u2019re running: \n![Select Compute from the sidebar, select the kebab to the right of your Git proxy server compute resource](https:\/\/docs.databricks.com\/_images\/git-proxy-perms1.png)\n2. From the dialog, remove the **Can Attach To** entry for **All Users**: \n![In the modal dialog box that pops up, click X to the right of All Users, Can Attach To](https:\/\/docs.databricks.com\/_images\/git-proxy-perms2.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-proxy.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Set up Databricks Git folders (Repos)\n##### Set up private Git connectivity for Databricks Git folders (Repos)\n###### How do I set up Git Server Proxy for Databricks Git folders?\n\nThis section describes how to prepare your Git server instance for Git Server Proxy for Databricks Git folders, create the proxy, and validate your configuration. \n### Before you begin \nBefore enabling the proxy, consider the following prerequisites and planning tasks: \n* Your workspace has the Databricks Git folders feature enabled.\n* Your Git server instance is accessible from your Databricks workspace\u2019s compute plane VPC, and has both HTTPS and personal access tokens (PATs) enabled. \nNote \nGit server proxy for Databricks works in all regions supported by your VPC. \n### Step 1: Prepare your Git server instance \nTo configure your Git server instance: \n1. Give the proxy cluster\u2019s driver node access to your Git server. \nYour enterprise Git server can have an `allowlist` of IP addresses from which access is permitted. \n1. Associate a static outbound IP address for traffic that originates from your proxy cluster. You can do this by proxying traffic through a NAT gateway.\n2. 
Add the IP address from the previous step to your Git server\u2019s allowlist. \n2. Set your Git server instance to allow HTTPS transport. \n* For GitHub Enterprise, see [Which remote URL should I use](https:\/\/help.github.com\/en\/enterprise\/2.20\/user\/github\/using-git\/which-remote-url-should-i-use) in the GitHub Enterprise help.\n* For Bitbucket, go to the Bitbucket server administration page and select server settings. In the HTTP(S) SCM hosting section, enable the **HTTP(S) enabled** checkbox. \n### Step 2: Run the enablement notebook \nTo enable the proxy: \n1. Log into your Databricks workspace as a workspace admin with access rights to create a cluster.\n2. Import this notebook: \n[Enable Git server proxy for Databricks Git folders for private Git server connectivity in Git folders](https:\/\/github.com\/databricks\/databricks-repos-proxy\/blob\/main\/enable_git_proxy_jupyter.ipynb).\n3. Select \u201cRun All\u201d to perform the following tasks: \n* Create [a single node cluster](https:\/\/docs.databricks.com\/compute\/configure.html#single-node) named \u201cDatabricks Git Proxy\u201d, which does not auto-terminate. This is the \u201cproxy cluster\u201d that will process and forward Git commands from your Databricks workspace to your on-premises Git server.\n* Enable a feature flag that controls whether Git requests in Databricks Git folders are proxied via the cluster. \nImportant \nYou must be an admin on the workspace with access rights to create a cluster. \nNote \nYou should be aware of the following: \n* Running an additional long-running cluster to host the proxy software incurs extra DBUs. To minimize costs, the notebook configures the proxy to use a single node cluster with an inexpensive node type. However, you might want to modify the cluster options to suit your needs. \n### Step 3: Validate your Git server configuration \nTo validate your Git server configuration, try to clone a repo hosted on your private Git server via the proxy cluster. 
A successful clone means that you have successfully enabled Git server proxy for your workspace. \n### Step 4: Create proxy-enabled repos \nAfter users configure their Git credentials, no further steps are required to create or synchronize your repos. To configure credentials and create a repo in Databricks Git folders, see [Configure Git credentials & connect a remote repo to Databricks](https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-proxy.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Set up Databricks Git folders (Repos)\n##### Set up private Git connectivity for Databricks Git folders (Repos)\n###### Remove global CAN\\_ATTACH\\_TO permissions\n\nAdmins with existing proxy clusters can now modify the cluster ACL permission to leverage generally available Git server proxy behavior. \nIf you previously configured Databricks Git server proxy with `CAN_ATTACH_TO` privileges, use the following steps to remove these permissions: \n1. Select **Compute** from the sidebar, and then click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu next to the Compute entry for the Git server proxy you\u2019re running: \n![Select Compute from the sidebar, select the kebab to the right of your Git proxy server compute resource](https:\/\/docs.databricks.com\/_images\/git-proxy-perms1.png)\n2. 
From the dialog, remove the **Can Attach To** entry for **All Users**: \n![In the modal dialog box that pops up, click X to the right of All Users, Can Attach To](https:\/\/docs.databricks.com\/_images\/git-proxy-perms2.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-proxy.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Set up Databricks Git folders (Repos)\n##### Set up private Git connectivity for Databricks Git folders (Repos)\n###### Troubleshooting\n\nDid you encounter an error while configuring Git server proxy for Databricks Git folders? Here are some common issues and ways to diagnose them more effectively. \n### Checklist for common problems \nBefore you start diagnosing an error, confirm that you\u2019ve completed the following steps: \n* Confirm that your proxy cluster is running.\n* Confirm that your Databricks Git folders users have \u201cattach to\u201d permissions on the proxy cluster.\n* Run the enablement notebook again and capture the results, if you haven\u2019t already. If you are unable to debug the issue, Databricks Support can review the results. You can export and send the enablement notebook as a DBC archive. \n### Inspect logs on the proxy cluster \nThe file at `\/databricks\/git-proxy\/git-proxy.log` on the proxy cluster contains logs that are useful for debugging purposes. \nThe log file should start with the line `Data-plane proxy server binding to ('', 8000)\u2026` If it does not, this means that the proxy server did not start properly. Try restarting the cluster, or delete the cluster you created and run the enablement notebook again. \nIf the log file does start with this line, review the log statements that follow it for each Git request initiated by a Git operation in Databricks Git folders. 
\nFor example: \n```\ndo_GET: https:\/\/server-address\/path\/to\/repo\/info\/refs?service=git-upload-pack 10.139.0.25 - - [09\/Jun\/2021 06:53:02] \/\n\"GET \/server-address\/path\/to\/repo\/info\/refs?service=git-upload-pack HTTP\/1.1\" 200\n\n``` \nError logs written to this file can be useful to help you or Databricks Support debug issues. \n### Common error messages and their resolution \n* **Secure connection could not be established because of SSL problems** \nYou might see the following error: \n```\nhttps:\/\/git.consult-prodigy.com\/Prodigy\/databricks_test: Secure connection to https:\/\/git.consult-prodigy.com\/Prodigy\/databricks_test could not be established because of SLL problems\n\n``` \nOften this means that you are using a repository that requires special SSL certificates. Check the content of the `\/databricks\/git-proxy\/git-proxy.log` file on the proxy cluster. If it says that certificate validation failed, then you must add the certificate authority (CA) root certificate to the system certificate chain. First, extract the root certificate (using the [browser](https:\/\/daniel.haxx.se\/blog\/2018\/11\/07\/get-the-ca-cert-for-curl\/) or other option) and upload it to DBFS. Then, edit the **Git folders Git Proxy** cluster to use the `GIT_PROXY_CA_CERT_PATH` environment variable to point to the root certificate file. For more information about editing cluster environment variables, see [Environment variables](https:\/\/docs.databricks.com\/compute\/configure.html#env-var). \nAfter you have completed that step, restart the cluster. \n![The Databricks modal dialog where you set environment variables for a Git proxy](https:\/\/docs.databricks.com\/_images\/git-proxy-set-env-vars.png)\n* **Failure to clone repository with error \u201cMissing\/Invalid Git credentials\u201d** \nFirst, check that you have [configured your Git credentials in User Settings](https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html). 
\nYou might encounter this error: \n```\nError: Invalid Git credentials. Go to User Settings -> Git Integration and check that your personal access token or app password has the correct repo access.\n\n``` \nIf your organization is using SAML SSO, make sure the token has been authorized (this can be done from your Git server\u2019s Personal Access Token (PAT) management page).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-proxy.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Set up Databricks Git folders (Repos)\n##### Set up private Git connectivity for Databricks Git folders (Repos)\n###### Frequently asked questions\n\n### What are the security implications of the Git server proxy? \nThe most important things to know are: \n* Proxying does not affect the security architecture of your Databricks control plane.\n* You can only have one Git proxy server cluster per workspace. \n### Is all Databricks Git folders-related Git traffic routed through the proxy cluster, even for public Git repos? \nYes. In the current release, your Databricks workspace does not differentiate between proxied and non-proxied repos. \n### Does the Git proxy feature work with other Git enterprise server providers? \nDatabricks Git folders supports GitHub Enterprise, Bitbucket Server, Azure DevOps Server, and GitLab self-managed. Other enterprise Git server providers should work as well if they conform to common Git specifications. \n### Do Databricks Git folders support GPG signing of commits? \nNo. \n### Do Databricks Git folders support SSH transport for Git operations? \nNo. Only HTTPS is supported. \n### Is the use of a non-default HTTPS port on the Git server supported? \nCurrently, the enablement notebook assumes that your Git server uses the default HTTPS port 443. You can set the environment variable `GIT_PROXY_CUSTOM_HTTP_PORT` to overwrite the port value with a preferred one. 
\n### Can you share one proxy for multiple workspaces or do you need one proxy cluster per workspace? \nYou need one proxy cluster per Databricks workspace. \n### Does the proxy work with legacy single-notebook versioning? \nNo, the proxy does not work with legacy single-notebook versioning. Users must migrate to Databricks Git folders versioning. \n### Can Databricks hide Git server URLs that are proxied? Could users enter the original Git server URLs rather than proxied URLs? \nYes to both questions. Users do not need to adjust their behavior for the proxy. With the current proxy implementation, all Git traffic for Databricks Git folders is routed through the proxy. Users enter the normal Git repo URL such as `https:\/\/git.company.com\/org\/repo-name.git`. \n### How often will users work with the Git URLs? \nTypically a user would just add the Git URL when they create a new repo or check out an existing repo that they have not already checked out. \n### Does the feature transparently proxy authentication data to the Git server? \nYes, the proxy uses the user account\u2019s Git server token to authenticate to the Git server. \n### Is there Databricks access to Git server code? \nThe Databricks proxy service accesses the Git repository on the Git server using the user-provided credential and synchronizes any code files in the repository with the repo. Access is restricted by the permissions specified in the user-provided personal access token (PAT).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-proxy.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Networking recommendations for Lakehouse Federation\n\nThis article provides guidance for setting up a viable network path between your Databricks clusters or SQL warehouses and the external database system that you are connecting to using Lakehouse Federation. 
\nKeep the following in mind: \n* All network traffic flows directly between Databricks clusters (or SQL warehouses) and the external database system. Neither Unity Catalog nor the Databricks control plane is on the network path.\n* Databricks compute (that is, clusters and SQL warehouses) always deploys in the cloud, but the external database system can be on-premises or hosted on any cloud provider, as long as there\u2019s a viable network path between your Databricks compute and the external database.\n* If you have inbound or outbound network restrictions on either Databricks compute or the external database system, refer to the following sections for general guidance to help you create a viable network path. \nFor more information on networking in Databricks workspaces, see [Networking](https:\/\/docs.databricks.com\/security\/network\/index.html).\n\n#### Networking recommendations for Lakehouse Federation\n##### Database system and Databricks compute both accessible from internet\n\nThe connection should work without any configuration.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/networking.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Networking recommendations for Lakehouse Federation\n##### Database system has network access restrictions\n\nIf the external database system has inbound or outbound network access restrictions and the Databricks cluster or SQL warehouse is accessible from the internet, then perform the following configurations, depending on the type of compute: \n**Classic compute resources**: \nConfigure one of the following network solutions: \n* Stable egress IP on Databricks compute. \nSet up a stable IP address alongside a load balancer, NAT gateway, internet gateway or equivalent, and connect it with the subnet where Databricks compute is deployed. 
This allows the compute to share a stable public IP address that can be allowlisted on the external database side. \nThe external database system should allowlist the Databricks compute stable IP for both ingress and egress traffic. \n* PrivateLink (only when the external database is on the same cloud as Databricks compute) \nConfigure a PrivateLink connection between the network where the database is deployed and the network where Databricks compute is deployed. \n**Serverless compute resources**: \nContact your Databricks account team to learn about plans to support secure network access to external databases from serverless compute.\n\n#### Networking recommendations for Lakehouse Federation\n##### Databricks compute has network access restrictions\n\nIf the external database system is accessible from the Internet and the Databricks compute has inbound or outbound network access restrictions (which is only possible if you are on a customer-managed network), then perform one of the following configurations: \n* Allowlist the hostname of the external database in the firewall rules of the subnet where Databricks compute is deployed. 
\nIf you choose to allowlist the external database IP address rather than the hostname, make sure that the external database has a stable IP address.\n* PrivateLink (only when the external database is on the same cloud as Databricks compute) \nConfigure a PrivateLink connection between the network where the database is deployed and the network where Databricks compute is deployed.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/networking.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Networking recommendations for Lakehouse Federation\n##### Databricks compute has a custom DNS server\n\nIf the external database system is accessible from the Internet and the Databricks compute has a custom DNS server (which is only possible if you are on a customer-managed network), add the database system\u2019s hostname to your custom DNS server so that it can be resolved.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/networking.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### Distributed training with TensorFlow 2\n\n[spark-tensorflow-distributor](https:\/\/github.com\/tensorflow\/ecosystem\/tree\/master\/spark\/spark-tensorflow-distributor) is an open-source native package in TensorFlow that helps users do distributed training with TensorFlow on their Spark clusters. It is built on top of `tensorflow.distribute.Strategy`, which is one of the major features in TensorFlow 2. For detailed API documentation, see [docstrings](https:\/\/github.com\/tensorflow\/ecosystem\/blob\/master\/spark\/spark-tensorflow-distributor\/spark_tensorflow_distributor\/mirrored_strategy_runner.py#L40). 
For general documentation about distributed TensorFlow, see [Distributed training with TensorFlow](https:\/\/www.tensorflow.org\/guide\/distributed_training).\n\n##### Distributed training with TensorFlow 2\n###### Example notebook\n\n### Distributed Training with TensorFlow 2 \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/spark-tensorflow-distributor.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/spark-tf-distributor.html"} +{"content":"# Databricks data engineering\n## Apache Spark on Databricks\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/dataframes.html"} +{"content":"# Databricks data engineering\n## Apache Spark on Databricks\n#### Tutorial: Load and transform data using Apache Spark DataFrames\n\nThis tutorial shows you how to load and transform data using the Apache Spark Python (PySpark) DataFrame API, the Apache Spark Scala DataFrame API, and the SparkR SparkDataFrame API in Databricks. 
\nBy the end of this tutorial, you will understand what a DataFrame is and be familiar with the following tasks: \n* [Define variables and copy public data into a Unity Catalog volume](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#define-variables)\n* [Create a DataFrame with Python](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#create-dataframe)\n* [Load data into a DataFrame from CSV file](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#create-dataframe-from-csv)\n* [View and interact with a DataFrame](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#interact-with-dataframe)\n* [Save the DataFrame](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#save-dataframe)\n* [Run SQL queries in PySpark](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#run-sql) \nSee also [Apache Spark PySpark API reference](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.html#pyspark-sql-dataframe). \n* [Define variables and copy public data into a Unity Catalog volume](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#define-variables)\n* [Create a DataFrame with Scala](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#create-dataframe)\n* [Load data into a DataFrame from CSV file](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#create-dataframe-from-csv)\n* [View and interact with a DataFrame](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#interact-with-dataframe)\n* [Save the DataFrame](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#save-dataframe)\n* [Run SQL queries in Apache Spark](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#run-sql) \nSee also [Apache Spark Scala API reference](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/index.html). 
\n* [Define variables and copy public data into a Unity Catalog volume](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#define-variables)\n* [Create a SparkR SparkDataFrame](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#create-dataframe)\n* [Load data into a DataFrame from CSV file](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#create-dataframe-from-csv)\n* [View and interact with a DataFrame](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#interact-with-dataframe)\n* [Save the DataFrame](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#save-dataframe)\n* [Run SQL queries in SparkR](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#run-sql) \nSee also [Apache SparkR API reference](https:\/\/spark.apache.org\/docs\/latest\/sparkr.html#sparkdataframe).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/dataframes.html"} +{"content":"# Databricks data engineering\n## Apache Spark on Databricks\n#### Tutorial: Load and transform data using Apache Spark DataFrames\n##### What is a DataFrame?\n\nA DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of a DataFrame like a spreadsheet, a SQL table, or a dictionary of series objects. Apache Spark DataFrames provide a rich set of functions (select columns, filter, join, aggregate) that allow you to solve common data analysis problems efficiently. \nApache Spark DataFrames are an abstraction built on top of Resilient Distributed Datasets (RDDs). 
Spark DataFrames and Spark SQL use a unified planning and optimization engine, allowing you to get nearly identical performance across all supported languages on Databricks (Python, SQL, Scala, and R).\n\n#### Tutorial: Load and transform data using Apache Spark DataFrames\n##### Requirements\n\nTo complete the following tutorial, you must meet the following requirements: \n* To use the examples in this tutorial, your workspace must have [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) enabled.\n* The examples in this tutorial use a Unity Catalog [volume](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html) to store sample data. To use these examples, create a volume and use that volume\u2019s catalog, schema, and volume names to set the volume path used by the examples.\n* You must have the following permissions in Unity Catalog: \n+ `READ VOLUME` and `WRITE VOLUME`, or `ALL PRIVILEGES` for the volume used for this tutorial.\n+ `USE SCHEMA` or `ALL PRIVILEGES` for the schema used for this tutorial.\n+ `USE CATALOG` or `ALL PRIVILEGES` for the catalog used for this tutorial. \nTo set these permissions, see your Databricks administrator or [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html). 
\nTip \nFor a completed notebook for this article, see [DataFrame tutorial notebooks](https:\/\/docs.databricks.com\/getting-started\/dataframes.html#notebook).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/dataframes.html"} +{"content":"# Databricks data engineering\n## Apache Spark on Databricks\n#### Tutorial: Load and transform data using Apache Spark DataFrames\n##### Step 1: Define variables and load CSV file\n\nThis step defines variables for use in this tutorial and then loads a CSV file containing baby name data from [health.data.ny.gov](https:\/\/health.data.ny.gov\/api\/views\/jxy9-yhdk\/rows.csv) into your Unity Catalog volume. \n1. Open a new notebook by clicking the ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) icon. To learn how to navigate Databricks notebooks, see [Databricks notebook interface and controls](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html).\n2. Copy and paste the following code into the new empty notebook cell. Replace `<catalog_name>`, `<schema_name>`, and `<volume_name>` with the catalog, schema, and volume names for a Unity Catalog volume. Replace `<table_name>` with a table name of your choice. You will load baby name data into this table later in this tutorial. 
\n```\ncatalog = \"<catalog_name>\"\nschema = \"<schema_name>\"\nvolume = \"<volume_name>\"\ndownload_url = \"https:\/\/health.data.ny.gov\/api\/views\/jxy9-yhdk\/rows.csv\"\nfile_name = \"rows.csv\"\ntable_name = \"<table_name>\"\npath_volume = \"\/Volumes\/\" + catalog + \"\/\" + schema + \"\/\" + volume\npath_table = catalog + \".\" + schema\nprint(path_table) # Show the complete path\nprint(path_volume) # Show the complete path\n\n``` \n```\nval catalog = \"<catalog_name>\"\nval schema = \"<schema_name>\"\nval volume = \"<volume_name>\"\nval downloadUrl = \"https:\/\/health.data.ny.gov\/api\/views\/jxy9-yhdk\/rows.csv\"\nval fileName = \"rows.csv\"\nval tableName = \"<table_name>\"\nval pathVolume = s\"\/Volumes\/$catalog\/$schema\/$volume\"\nval pathTable = s\"$catalog.$schema\"\nprint(pathVolume) \/\/ Show the complete path\nprint(pathTable) \/\/ Show the complete path\n\n``` \n```\ncatalog <- \"<catalog_name>\"\nschema <- \"<schema_name>\"\nvolume <- \"<volume_name>\"\ndownload_url <- \"https:\/\/health.data.ny.gov\/api\/views\/jxy9-yhdk\/rows.csv\"\nfile_name <- \"rows.csv\"\ntable_name <- \"<table_name>\"\npath_volume <- paste(\"\/Volumes\/\", catalog, \"\/\", schema, \"\/\", volume, sep = \"\")\npath_table <- paste(catalog, \".\", schema, sep = \"\")\nprint(path_volume) # Show the complete path\nprint(path_table) # Show the complete path\n\n```\n3. Press `Shift+Enter` to run the cell and create a new blank cell.\n4. Copy and paste the following code into the new empty notebook cell. This code copies the `rows.csv` file from [health.data.ny.gov](https:\/\/health.data.ny.gov\/api\/views\/jxy9-yhdk\/rows.csv) into your Unity Catalog volume using the [Databricks dbutils](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#cp-command-dbutilsfscp) command. 
\n```\ndbutils.fs.cp(f\"{download_url}\", f\"{path_volume}\/{file_name}\")\n\n``` \n```\ndbutils.fs.cp(downloadUrl, s\"$pathVolume\/$fileName\")\n\n``` \n```\ndbutils.fs.cp(download_url, paste(path_volume, \"\/\", file_name, sep = \"\"))\n\n```\n5. Press `Shift+Enter` to run the cell and then move to the next cell.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/dataframes.html"} +{"content":"# Databricks data engineering\n## Apache Spark on Databricks\n#### Tutorial: Load and transform data using Apache Spark DataFrames\n##### Step 2: Create a DataFrame\n\nThis step creates a DataFrame named `df1` with test data and then displays its contents. \n1. Copy and paste the following code into the new empty notebook cell. This code creates the DataFrame with test data, and then displays the contents and the schema of the DataFrame. \n```\ndata = [[2021, \"test\", \"Albany\", \"M\", 42]]\ncolumns = [\"Year\", \"First_Name\", \"County\", \"Sex\", \"Count\"]\n\ndf1 = spark.createDataFrame(data, schema=\"Year int, First_Name STRING, County STRING, Sex STRING, Count int\")\ndisplay(df1) # The display() method is specific to Databricks notebooks and provides a richer visualization.\n# df1.show() The show() method is a part of the Apache Spark DataFrame API and provides basic visualization.\n\n``` \n```\nval data = Seq((2021, \"test\", \"Albany\", \"M\", 42))\nval columns = Seq(\"Year\", \"First_Name\", \"County\", \"Sex\", \"Count\")\n\nval df1 = data.toDF(columns: _*)\ndisplay(df1) \/\/ The display() method is specific to Databricks notebooks and provides a richer visualization.\n\/\/ df1.show() The show() method is a part of the Apache Spark DataFrame API and provides basic visualization.\n\n``` \n```\n# Load the SparkR package that is already preinstalled on the cluster.\nlibrary(SparkR)\n\ndata <- data.frame(\nYear = as.integer(c(2021)),\nFirst_Name = c(\"test\"),\nCounty = c(\"Albany\"),\nSex = c(\"M\"),\nCount = as.integer(c(42))\n)\n\ndf1 <- 
createDataFrame(data)\ndisplay(df1) # The display() method is specific to Databricks notebooks and provides a richer visualization.\n# head(df1) The head() method is a part of the Apache SparkR DataFrame API and provides basic visualization.\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/dataframes.html"} +{"content":"# Databricks data engineering\n## Apache Spark on Databricks\n#### Tutorial: Load and transform data using Apache Spark DataFrames\n##### Step 3: Load data into a DataFrame from CSV file\n\nThis step creates a DataFrame named `df_csv` from the CSV file that you previously loaded into your Unity Catalog volume. See [spark.read.csv](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-csv.html). \n1. Copy and paste the following code into the new empty notebook cell. This code loads baby name data into DataFrame `df_csv` from the CSV file and then displays the contents of the DataFrame. \n```\ndf_csv = spark.read.csv(f\"{path_volume}\/{file_name}\",\nheader=True,\ninferSchema=True,\nsep=\",\")\ndisplay(df_csv)\n\n``` \n```\nval dfCsv = spark.read\n.option(\"header\", \"true\")\n.option(\"inferSchema\", \"true\")\n.option(\"delimiter\", \",\")\n.csv(s\"$pathVolume\/$fileName\")\n\ndisplay(dfCsv)\n\n``` \n```\ndf_csv <- read.df(paste(path_volume, \"\/\", file_name, sep=\"\"),\nsource=\"csv\",\nheader = TRUE,\ninferSchema = TRUE,\ndelimiter = \",\")\n\ndisplay(df_csv)\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell. 
\nYou can load data from many [supported file formats](https:\/\/docs.databricks.com\/query\/formats\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/dataframes.html"} +{"content":"# Databricks data engineering\n## Apache Spark on Databricks\n#### Tutorial: Load and transform data using Apache Spark DataFrames\n##### Step 4: View and interact with your DataFrame\n\nView and interact with your baby names DataFrames using the following methods. \n### Print the DataFrame schema \nLearn how to display the schema of an Apache Spark DataFrame. Apache Spark uses the term *schema* to refer to the names and data types of the columns in the DataFrame. \nNote \nDatabricks also uses the term schema to describe a collection of tables registered to a catalog. \n1. Copy and paste the following code into an empty notebook cell. This code uses the `.printSchema()` method to show the schemas of the two DataFrames, in preparation for combining them with a union. \n```\ndf_csv.printSchema()\ndf1.printSchema()\n\n``` \n```\ndfCsv.printSchema()\ndf1.printSchema()\n\n``` \n```\nprintSchema(df_csv)\nprintSchema(df1)\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell. \n### Rename column in the DataFrame \nLearn how to rename a column in a DataFrame. \n1. Copy and paste the following code into an empty notebook cell. This code renames a column in the `df_csv` DataFrame to match the respective column in the `df1` DataFrame. This code uses the Apache Spark `withColumnRenamed()` method. \n```\ndf_csv = df_csv.withColumnRenamed(\"First Name\", \"First_Name\")\ndf_csv.printSchema()\n\n``` \n```\nval dfCsvRenamed = dfCsv.withColumnRenamed(\"First Name\", \"First_Name\")\n\/\/ when modifying a DataFrame in Scala, you must assign it to a new variable\ndfCsvRenamed.printSchema()\n\n``` \n```\ndf_csv <- withColumnRenamed(df_csv, \"First Name\", \"First_Name\")\nprintSchema(df_csv)\n\n```\n2. 
Press `Shift+Enter` to run the cell and then move to the next cell. \n### Combine DataFrames \nLearn how to create a new DataFrame that adds the rows of one DataFrame to another. \n1. Copy and paste the following code into an empty notebook cell. This code uses the Apache Spark `union()` method to combine the contents of your first DataFrame `df1` with DataFrame `df_csv` containing the baby names data loaded from the CSV file. \n```\ndf = df1.union(df_csv)\ndisplay(df)\n\n``` \n```\nval df = df1.union(dfCsvRenamed)\ndisplay(df)\n\n``` \n```\ndisplay(df <- union(df1, df_csv))\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell. \n### Filter rows in a DataFrame \nDiscover the most popular baby names in your data set by filtering rows, using the Apache Spark `.filter()` or `.where()` methods. Use filtering to select a subset of rows to return or modify in a DataFrame. There is no difference in performance or syntax, as seen in the following examples. \n#### Using .filter() method \n1. Copy and paste the following code into an empty notebook cell. This code uses the Apache Spark `.filter()` method to display those rows in the DataFrame with a count of more than 50. \n```\ndisplay(df.filter(df[\"Count\"] > 50))\n\n``` \n```\ndisplay(df.filter(df(\"Count\") > 50))\n\n``` \n```\ndisplay(filteredDF <- filter(df, df$Count > 50))\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell. \n#### Using .where() method \n1. Copy and paste the following code into an empty notebook cell. This code uses the Apache Spark `.where()` method to display those rows in the DataFrame with a count of more than 50. \n```\ndisplay(df.where(df[\"Count\"] > 50))\n\n``` \n```\ndisplay(df.where(df(\"Count\") > 50))\n\n``` \n```\ndisplay(filtered_df <- where(df, df$Count > 50))\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell. 
\n### Select columns from a DataFrame and order by frequency \nFind the most frequent baby names by using the `select()` method to specify which columns of the DataFrame to return. Use the Apache Spark `orderBy()` and `desc()` functions to order the results. \nThe [pyspark.sql](https:\/\/spark.apache.org\/docs\/2.4.0\/api\/python\/pyspark.sql.html) module for Apache Spark provides support for SQL functions. The functions used in this tutorial include the Apache Spark `orderBy()`, `desc()`, and `expr()` functions. You enable the use of these functions by importing them into your session as needed. \n1. Copy and paste the following code into an empty notebook cell. This code imports the `desc()` function and then uses the Apache Spark `select()` method and Apache Spark `orderBy()` and `desc()` functions to display the most common names and their counts in descending order. \n```\nfrom pyspark.sql.functions import desc\ndisplay(df.select(\"First_Name\", \"Count\").orderBy(desc(\"Count\")))\n\n``` \n```\nimport org.apache.spark.sql.functions.desc\ndisplay(df.select(\"First_Name\", \"Count\").orderBy(desc(\"Count\")))\n\n``` \n```\ndisplay(arrange(select(df, df$First_Name, df$Count), desc(df$Count)))\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell. \n### Create a subset DataFrame \nLearn how to create a subset DataFrame from an existing DataFrame. \n1. Copy and paste the following code into an empty notebook cell. This code uses the Apache Spark `filter` method to create a new DataFrame restricting the data by year, count, and sex. It uses the Apache Spark `select()` method to limit the columns. It also uses the Apache Spark `orderBy()` and `desc()` functions to sort the new DataFrame by count. 
\n```\nsubsetDF = df.filter((df[\"Year\"] == 2009) & (df[\"Count\"] > 100) & (df[\"Sex\"] == \"F\")).select(\"First_Name\", \"County\", \"Count\").orderBy(desc(\"Count\"))\ndisplay(subsetDF)\n\n``` \n```\nval subsetDF = df.filter((df(\"Year\") === 2009) && (df(\"Count\") > 100) && (df(\"Sex\") === \"F\")).select(\"First_Name\", \"County\", \"Count\").orderBy(desc(\"Count\"))\n\ndisplay(subsetDF)\n\n``` \n```\nsubsetDF <- select(filter(df, (df$Count > 100) & (df$Year == 2009) & (df$Sex == \"F\")), \"First_Name\", \"County\", \"Count\")\ndisplay(subsetDF)\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/dataframes.html"} +{"content":"# Databricks data engineering\n## Apache Spark on Databricks\n#### Tutorial: Load and transform data using Apache Spark DataFrames\n##### Step 5: Save the DataFrame\n\nLearn how to save a DataFrame. You can either save your DataFrame to a table or write the DataFrame to a file or multiple files. \n### Save the DataFrame to a table \nDatabricks uses the Delta Lake format for all tables by default. To save your DataFrame, you must have `CREATE` table privileges on the catalog and schema. \n1. Copy and paste the following code into an empty notebook cell. This code saves the contents of the DataFrame to a table using the variable you defined at the start of this tutorial. \n```\ndf.write.mode(\"overwrite\").saveAsTable(f\"{path_table}.{table_name}\")\n\n``` \n```\ndf.write.mode(\"overwrite\").saveAsTable(s\"$pathTable\" + \".\" + s\"$tableName\")\n\n``` \n```\nsaveAsTable(df, paste(path_table, \".\", table_name, sep = \"\"), mode = \"overwrite\")\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell. \nMost Apache Spark applications work on large data sets and in a distributed fashion. Apache Spark writes out a directory of files rather than a single file. Delta Lake splits the Parquet folders and files. 
Many data systems can read these directories of files. Databricks recommends using tables over file paths for most applications. \n### Save the DataFrame to JSON files \n1. Copy and paste the following code into an empty notebook cell. This code saves the DataFrame to a directory of JSON files. \n```\ndf.write.format(\"json\").mode(\"overwrite\").save(\"\/tmp\/json_data\")\n\n``` \n```\ndf.write.format(\"json\").mode(\"overwrite\").save(\"\/tmp\/json_data\")\n\n``` \n```\nwrite.df(df, path = \"\/tmp\/json_data\", source = \"json\", mode = \"overwrite\")\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell. \n### Read the DataFrame from a JSON file \nLearn how to use the Apache Spark `spark.read.format()` method to read JSON data from a directory into a DataFrame. \n1. Copy and paste the following code into an empty notebook cell. This code displays the JSON files you saved in the previous example. \n```\ndisplay(spark.read.format(\"json\").json(\"\/tmp\/json_data\"))\n\n``` \n```\ndisplay(spark.read.format(\"json\").json(\"\/tmp\/json_data\"))\n\n``` \n```\ndisplay(read.json(\"\/tmp\/json_data\"))\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/dataframes.html"} +{"content":"# Databricks data engineering\n## Apache Spark on Databricks\n#### Tutorial: Load and transform data using Apache Spark DataFrames\n##### Additional tasks: Run SQL queries in PySpark, Scala, and R\n\nApache Spark DataFrames provide the following options to combine SQL with PySpark, Scala, and R. You can run the following code in the same notebook that you created for this tutorial. \n### Specify a column as a SQL query \nLearn how to use the Apache Spark `selectExpr()` method. This is a variant of the `select()` method that accepts SQL expressions and returns an updated DataFrame. This method allows you to use a SQL expression, such as `upper`. \n1. 
Copy and paste the following code into an empty notebook cell. This code uses the Apache Spark `selectExpr()` method and the SQL `upper` expression to convert a string column to upper case (and rename the column). \n```\ndisplay(df.selectExpr(\"Count\", \"upper(County) as big_name\"))\n\n``` \n```\ndisplay(df.selectExpr(\"Count\", \"upper(County) as big_name\"))\n\n``` \n```\ndisplay(df_selected <- selectExpr(df, \"Count\", \"upper(County) as big_name\"))\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell. \n### Use `expr()` to use SQL syntax for a column \nLearn how to import and use the Apache Spark `expr()` function to use SQL syntax anywhere a column would be specified. \n1. Copy and paste the following code into an empty notebook cell. This code imports the `expr()` function and then uses the Apache Spark `expr()` function and the SQL `lower` expression to convert a string column to lower case (and rename the column). \n```\nfrom pyspark.sql.functions import expr\ndisplay(df.select(\"Count\", expr(\"lower(County) as little_name\")))\n\n``` \n```\nimport org.apache.spark.sql.functions.{col, expr}\n\/\/ Scala requires us to import the col() function as well as the expr() function\n\ndisplay(df.select(col(\"Count\"), expr(\"lower(County) as little_name\")))\n\n``` \n```\ndisplay(df_selected <- selectExpr(df, \"Count\", \"lower(County) as little_name\"))\n# expr() function is not supported in R, selectExpr in SparkR replicates this functionality\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell. \n### Run an arbitrary SQL query using spark.sql() function \nLearn how to use the Apache Spark `spark.sql()` function to run arbitrary SQL queries. \n1. Copy and paste the following code into an empty notebook cell. This code uses the Apache Spark `spark.sql()` function to query a SQL table using SQL syntax. 
 \n```\ndisplay(spark.sql(f\"SELECT * FROM {path_table}.{table_name}\"))\n\n``` \n```\ndisplay(spark.sql(s\"SELECT * FROM $pathTable.$tableName\"))\n\n``` \n```\ndisplay(sql(paste(\"SELECT * FROM\", path_table, \".\", table_name)))\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/dataframes.html"} +{"content":"# Databricks data engineering\n## Apache Spark on Databricks\n#### Tutorial: Load and transform data using Apache Spark DataFrames\n##### DataFrame tutorial notebooks\n\nThe following notebooks include the example queries from this tutorial. \n### DataFrames tutorial using Python \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/getting-started\/tutorial-uc-spark-dataframe-python.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n### DataFrames tutorial using Scala \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/getting-started\/tutorial-uc-spark-dataframe-scala.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n### DataFrames tutorial using R \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/getting-started\/tutorial-uc-spark-dataframe-sparkr.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n#### Tutorial: Load and transform data using Apache Spark DataFrames\n##### Additional resources\n\n* [Reference for Apache Spark APIs](https:\/\/docs.databricks.com\/reference\/spark.html)\n* [Convert between PySpark and pandas DataFrames](https:\/\/docs.databricks.com\/pandas\/pyspark-pandas-conversion.html)\n* [Pandas API on Spark](https:\/\/docs.databricks.com\/pandas\/pandas-on-spark.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/dataframes.html"} +{"content":"# What is data 
warehousing on Databricks?\n## Write queries and explore data in the SQL Editor\n#### What are Databricks SQL alerts?\n\nDatabricks SQL alerts periodically run queries, evaluate defined conditions, and send notifications if a condition is met. You can set up alerts to monitor your business and send notifications when reported data falls outside of expected limits. Scheduling an alert executes its underlying query and checks the alert criteria. This is independent of any schedule that might exist on the underlying query. \nImportant \n* Alerts leveraging queries with [parameters](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html) use the default value specified in the SQL editor for each parameter.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/alerts\/index.html"} +{"content":"# What is data warehousing on Databricks?\n## Write queries and explore data in the SQL Editor\n#### What are Databricks SQL alerts?\n##### View and organize alerts\n\nUse one of the following options to access alerts: \n* Click the ![Workspace Icon](https:\/\/docs.databricks.com\/_images\/workspace-icon.png) **Workspace** in the sidebar to view alerts in the **Home** folder, where they are stored by default. Users can organize alerts into folders in the workspace browser along with other Databricks objects.\n* Click the ![Alerts Icon](https:\/\/docs.databricks.com\/_images\/alerts-icon.png) **Alerts** in the sidebar to view the alerts listing page. \nBy default, objects are sorted in reverse chronological order. You can reorder the list by clicking the column headings. Click the **All alerts** tab near the top of the screen to view all alerts in the workspace. Click the **My alerts** tab to view alerts where you are the owner. 
\n* **Name** shows the string name of each alert.\n* **State** shows whether the alert status is `TRIGGERED`, `OK`, or `UNKNOWN`.\n* **Last Updated** shows the last updated time or date.\n* **Created at** shows the date and time the alert was created. \n+ `TRIGGERED` means that on the most recent execution, the Value column in your target query met the Condition and Threshold you configured. If your alert checks whether \u201ccats\u201d is above 1500, your alert will be triggered as long as \u201ccats\u201d is above 1500.\n+ `OK` means that on the most recent query execution, the Value column did not meet the Condition and Threshold you configured. This doesn\u2019t mean that the Alert was not previously triggered. If your \u201ccats\u201d value is now 1470, your alert will show as `OK`.\n+ `UNKNOWN` means Databricks SQL does not have enough data to evaluate the alert\ncriteria. You will see this status immediately after creating your Alert and until the query has executed. You will also see this status if there was no data in the query result or if the most recent query result doesn\u2019t include the *Value Column* you configured.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/alerts\/index.html"} +{"content":"# What is data warehousing on Databricks?\n## Write queries and explore data in the SQL Editor\n#### What are Databricks SQL alerts?\n##### Create an alert\n\nFollow these steps to create an alert on a single column of a query. \n1. 
Do one of the following: \n* Click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar and select **Alert**.\n* Click ![Alerts Icon](https:\/\/docs.databricks.com\/_images\/alerts-icon.png) **Alerts** in the sidebar and click the **+ New Alert** button.\n* Click ![Workspace Icon](https:\/\/docs.databricks.com\/_images\/workspace-icon.png) **Workspace** in the sidebar and click **+ Create Alert**.\n* Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu in the upper-right corner of a saved query and click **+ Create Alert**.\n2. In the **Query** field, search for a target query. \n![Target query](https:\/\/docs.databricks.com\/_images\/new-alert-query-search.png) \nTo alert on multiple columns, you need to modify your query. See [Alert aggregations](https:\/\/docs.databricks.com\/sql\/user\/alerts\/index.html#alert-multiple-columns).\n3. In the **Trigger condition** field, configure the alert. \n* The **Value column** drop-down controls which field of your query result is evaluated. Alert conditions can be set on the first value of a column in the query result, or you can choose to set an aggregation across all the rows of a single column, such as SUM or AVERAGE. \n![Alert aggregations](https:\/\/docs.databricks.com\/_images\/alert-aggregation.png)\n* The **Operator** drop-down controls the logical operation to be applied.\n* The **Threshold value** text input is compared against the Value column using the Condition you specify.\n![Trigger conditions](https:\/\/docs.databricks.com\/_images\/trigger-condition.png)\n4. Click **Preview alert** to preview the alert and test whether the alert would trigger with the current data.\n5. 
In the **When alert is triggered, send notification** field, select how many notifications are sent when your alert is triggered: \n* **Just once**: Send a notification when the [alert status](https:\/\/docs.databricks.com\/sql\/user\/alerts\/index.html#view-alerts) changes from `OK` to `TRIGGERED`.\n* **Each time alert is evaluated**: Send a notification whenever the alert status is `TRIGGERED` regardless of its status at the previous evaluation.\n* **At most every**: Send a notification whenever the alert status is `TRIGGERED` at a specific interval. This choice lets you avoid notification spam for alerts that trigger often. Regardless of which notification setting you choose, you receive a notification whenever the status goes from `OK` to `TRIGGERED` or from `TRIGGERED` to `OK`. The schedule settings affect how many notifications you will receive if the status remains `TRIGGERED` from one execution to the next. For details, see [Notification frequency](https:\/\/docs.databricks.com\/sql\/user\/alerts\/index.html#notification-frequency).\n6. In the **Template** drop-down, choose a template: \n* **Use default template**: Alert notification is a message with links to the Alert configuration screen and the Query screen.\n* **Use custom template**: Alert notification includes more specific information about the alert. \n1. A box displays, consisting of input fields for subject and body. 
Any static content is valid, and you can incorporate built-in template variables: \n+ `ALERT_STATUS`: The evaluated alert status (string).\n+ `ALERT_CONDITION`: The alert condition operator (string).\n+ `ALERT_THRESHOLD`: The alert threshold (string or number).\n+ `ALERT_COLUMN`: The alert column name (string).\n+ `ALERT_NAME`: The alert name (string).\n+ `ALERT_URL`: The alert page URL (string).\n+ `QUERY_NAME`: The associated query name (string).\n+ `QUERY_URL`: The associated query page URL (string).\n+ `QUERY_RESULT_TABLE`: The query result HTML table (string).\n+ `QUERY_RESULT_VALUE`: The query result value (string or number).\n+ `QUERY_RESULT_ROWS`: The query result rows (value array).\n+ `QUERY_RESULT_COLS`: The query result columns (string array). An example subject, for instance, could be: `Alert \"{{ALERT_NAME}}\" changed status to {{ALERT_STATUS}}`.\n2. You can use HTML to format messages in a custom template. The following tags and attributes are allowed in templates: \n+ Tags: `<a>`, `<abbr>`, `<acronym>`, `<b>`, `<blockquote>`, `<body>`, `<br>`, `<code>`, `<div>`, `<em>`, `<h1>`, `<h2>`, `<h3>`, `<h4>`, `<h5>`, `<h6>`, `<head>`, `<hr>`, `<html>`, `<i>`, `<li>`, `<ol>`, `<p>`, `<span>`, `<strong>`, `<table>`, `<tbody>`, `<td>`, `<th>`, `<tr>`, `<ul>`\n+ Attributes: href (for `<a>`), title (for `<a>`, `<abbr>`, `<acronym>`)\n3. Click the **Preview** toggle button to preview the rendered result. \nImportant \nThe preview is useful for verifying that template variables are rendered correctly. It is not an accurate representation of the eventual notification content, as each notification destination can display notifications differently.\n4. Click the **Save Changes** button.\n7. Click **Create Alert**.\n8. Click **Add Schedule**. \n* Use the dropdown pickers to specify the frequency, period, starting time, and time zone. 
Optionally, select the **Show cron syntax** checkbox to edit the schedule in [Quartz Cron Syntax](http:\/\/www.quartz-scheduler.org\/documentation\/quartz-2.3.0\/tutorials\/crontrigger.html).\n* Choose **More options** to show optional settings. You can also choose: \n+ A name for the schedule.\n+ A SQL warehouse to power the query. By default, the SQL warehouse used for ad hoc query execution is also used for a scheduled job. Use this optional setting to select a different warehouse to run the scheduled query.\n9. Click the **Destinations** tab in the **Add schedule** dialog. \n![Destinations tab in settings dialog](https:\/\/docs.databricks.com\/_images\/add-new-dest.png) \n* Use the drop-down to select an available [notification destination](https:\/\/docs.databricks.com\/admin\/workspace-settings\/notification-destinations.html). Or, start typing a username to add individuals.\nImportant \nIf you skip this step, you *will not* be notified when the alert is triggered.\n10. Click **Create**. Your saved alert and notification details appear on the screen. \n![Saved alert](https:\/\/docs.databricks.com\/_images\/saved-alert.png)\n11. Share the schedule. \n* To the right of the listed schedule, choose the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu and select **Edit schedule permissions**.\n* Choose a user or group from the drop-down menu in the dialog.\n* Choose from the following schedule permissions: \n+ NO PERMISSIONS: No permissions have been granted. Users with no permissions cannot see that the schedule exists, even if they are subscribers or included in listed notification destinations.\n+ CAN VIEW: Grants permission to view scheduled run results.\n+ CAN MANAGE RUN: Grants permission to view scheduled run results.\n+ CAN MANAGE: Grants permission to view, modify, and delete schedules. 
This permission is required in order to make changes to the run interval, update the subscriber list, and pause or unpause the schedule.\n+ IS OWNER: Grants all permissions of CAN MANAGE. Additionally, the credentials of the schedule owner will be used to run dashboard queries. Only a workspace admin can change the owner.\nImportant \nPermissions for alerts and schedules are separate. Grant access to users and groups in the notifications destinations list so they can view scheduled run results.\n12. Share the alert. \n* Click ![Share Button](https:\/\/docs.databricks.com\/_images\/share-button.png) near the top-right of the page.\n* Add users or groups who should have access to the alert.\n* Choose the appropriate permission level, then click **Add**. \nImportant \nCAN MANAGE grants permission to view, modify, and delete schedules. This permission is required in order to make changes to the run interval, update the notification destination list, and pause or unpause the schedule. \nFor more information on alert permission levels, see [Alerts ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#alerts).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/alerts\/index.html"} +{"content":"# What is data warehousing on Databricks?\n## Write queries and explore data in the SQL Editor\n#### What are Databricks SQL alerts?\n##### Alert aggregations\n\nAn aggregation on an alert works by modifying the original SQL of the Databricks SQL query attached to the alert. The alert wraps the original query text in a common table expression (CTE) and performs a wrapping aggregation query on it to aggregate the query result. \nAs an example, a `SUM` aggregation on an alert attached to a query with text `SELECT 1 AS column_name` means that whenever the alert is refreshed, the modified SQL that runs would be: `WITH q AS (SELECT 1 AS column_name) SELECT SUM(column_name) FROM q`. 
 \nThis means that the original query result (pre-aggregated) cannot be shown in an alert custom body (with parameters such as `QUERY_RESULT_ROWS` and `QUERY_RESULT_COLS`) whenever there is an aggregation on an alert. Instead, those variables will only display the final, post-aggregation query result. \nNote \nTrigger conditions related to aggregations are not supported by the API.\n\n#### What are Databricks SQL alerts?\n##### Alert on multiple columns\n\nTo set an alert based on multiple columns of a query, your query can implement the alert logic and\nreturn a boolean value for the alert to trigger on. For example: \n```\nSELECT CASE WHEN drafts_count > 10000 AND archived_count > 5000 THEN 1 ELSE 0 END\nFROM (\nSELECT sum(CASE WHEN is_archived THEN 1 ELSE 0 END) AS archived_count,\nsum(CASE WHEN is_draft THEN 1 ELSE 0 END) AS drafts_count\nFROM queries) data\n\n``` \nThis query returns `1` when `drafts_count > 10000 and archived_count > 5000`.\nThen you can configure the alert to trigger when the value is `1`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/alerts\/index.html"} +{"content":"# What is data warehousing on Databricks?\n## Write queries and explore data in the SQL Editor\n#### What are Databricks SQL alerts?\n##### Notification frequency\n\nDatabricks SQL sends notifications to your chosen notification destinations whenever it detects\nthat the Alert status has changed from `OK` to `TRIGGERED` or vice versa.\nConsider this example where an Alert is configured on a query that is scheduled\nto run once daily. The daily status of the Alert appears in the following table.\nPrior to Monday the alert status was `OK`. 
 \n| Day | Alert Status |\n| --- | --- |\n| Monday | OK |\n| Tuesday | OK |\n| Wednesday | TRIGGERED |\n| Thursday | TRIGGERED |\n| Friday | TRIGGERED |\n| Saturday | TRIGGERED |\n| Sunday | OK | \nIf the notification frequency is set to `Just Once`, Databricks SQL sends a\nnotification on Wednesday when the status changed from `OK` to `TRIGGERED` and\nagain on Sunday when it switches back. It does not send alerts on Thursday,\nFriday, or Saturday unless you specifically configure it to do so because the\nAlert status did not change between executions on those days.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/alerts\/index.html"} +{"content":"# What is data warehousing on Databricks?\n## Write queries and explore data in the SQL Editor\n#### What are Databricks SQL alerts?\n##### Configure alert permissions and transfer alert ownership\n\nYou must have at least CAN MANAGE permission on a query to share queries. For alert permission levels, see [Alerts ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#alerts). \n1. In the sidebar, click **Alerts**.\n2. Click an alert.\n3. Click the ![Share Button](https:\/\/docs.databricks.com\/_images\/share-button.png) button at the top right to open the **Sharing** dialog. \n![Manage alert permissions](https:\/\/docs.databricks.com\/_images\/alert-sharing.png)\n4. Search for and select the groups and users, and assign the permission level.\n5. Click **Add**. \n### Transfer ownership of an alert \nWhen you save an alert, you become the alert\u2019s owner. If an alert\u2019s owner is removed from a workspace, the alert no longer has an owner. A workspace admin user can transfer ownership of an alert to a different user. Service principals and groups cannot be assigned ownership of an alert. You can also transfer ownership using the [Permissions API](https:\/\/docs.databricks.com\/api\/workspace\/permissions). \n1. As a workspace admin, log in to your Databricks workspace.\n2. 
In the sidebar, click **Alerts**.\n3. Click an alert.\n4. Click the **Share** button at the top right to open the **Sharing** dialog.\n5. Click on the gear icon at the top right and click **Assign new owner**. \n![Assign new owner](https:\/\/docs.databricks.com\/_images\/assign-new-owner.png)\n6. Select the user to assign ownership to.\n7. Click **Confirm**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/alerts\/index.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Implement data processing and analysis workflows with Jobs\n\nYou can use a Databricks job to orchestrate your data processing, machine learning, or data analytics pipelines on the Databricks platform. Databricks Jobs support a number of workload types, including notebooks, scripts, Delta Live Tables pipelines, Databricks SQL queries, and [dbt](https:\/\/docs.databricks.com\/partners\/prep\/dbt.html) projects. The following articles guide you in using the features and options of Databricks Jobs to implement your data pipelines.\n\n#### Implement data processing and analysis workflows with Jobs\n##### Transform, analyze, and visualize your data with a Databricks job\n\nYou can use a job to create a data pipeline that ingests, transforms, analyzes, and visualizes data. The example in [Use Databricks SQL in a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html) builds a pipeline that: \n1. Uses a Python script to fetch data using a REST API.\n2. Uses [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html) to ingest and transform the fetched data and save the transformed data to Delta Lake.\n3. 
Uses the Jobs integration with Databricks SQL to analyze the transformed data and create graphs to visualize the results.\n\n#### Implement data processing and analysis workflows with Jobs\n##### Use dbt transformations in a job\n\nUse the `dbt` task type if you are doing data transformation with a dbt core project and want to integrate that project into a Databricks job, or you want to create new dbt transformations and run those transformations in a job. See [Use dbt transformations in a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbt-in-workflows.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/index.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Implement data processing and analysis workflows with Jobs\n##### Use a Python package in a job\n\nPython wheel files are a standard way to package and distribute the files required to run a Python application. You can easily create a job that uses Python code packaged as a Python wheel file with the `Python wheel` task type. See [Use a Python wheel file in a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-python-wheels-in-workflows.html).\n\n#### Implement data processing and analysis workflows with Jobs\n##### Use code packaged in a JAR\n\nLibraries and applications implemented in a JVM language such as Java and Scala are commonly packaged in a Java archive (JAR) file. Databricks Jobs supports code packaged in a JAR with the `JAR` task type. See [Use a JAR in a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-jars-in-workflows.html).\n\n#### Implement data processing and analysis workflows with Jobs\n##### Use notebooks or Python code maintained in a central repository\n\nA common way to manage version control and collaboration for production artifacts is to use a central repository such as GitHub. 
Databricks Jobs supports creating and running jobs using notebooks or Python code imported from a repository, including GitHub or Databricks Git folders. See [Use version-controlled source code in a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-repos.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/index.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Implement data processing and analysis workflows with Jobs\n##### Orchestrate your jobs with Apache Airflow\n\nDatabricks recommends using Databricks Jobs to orchestrate your workflows. However, Apache Airflow is commonly used as a workflow orchestration system and provides native support for Databricks Jobs. While Databricks Jobs provides a visual UI to create your workflows, Airflow uses Python files to define and deploy your data pipelines. For an example of creating and running a job with Airflow, see [Orchestrate Databricks jobs with Apache Airflow](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-airflow-with-jobs.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/index.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Heatmap options\n\nThis section covers the configuration options for heatmap chart visualizations. 
For an example, see [heat map example](https:\/\/docs.databricks.com\/visualizations\/visualization-types.html#heatmap).\n\n##### Heatmap options\n###### General\n\nTo configure general options, click **General** and configure each of the following required settings: \n* **X Column**: The part of the query result to use to divide the data into columns, such as days of the week.\n* **Y Columns**: The part of the query result to use to divide the data into rows, such as trip distance.\n* **Color Column**: The part of the query result that determines the color for each grid area, such as trip fare.\n\n##### Heatmap options\n###### X axis\n\nTo configure formatting options for the X axis, click **X axis** and configure the following optional settings: \n* **Scale**: **Automatic (Categorical)** or **Categorical**.\n* **Name**: Override the column name with a different display name.\n* **Sort values**: Whether to sort the X axis values, even if they are not sorted in the query.\n* **Reverse Order**: Whether to reverse the sorting order.\n* **Show labels**: Whether to show the X axis values as labels.\n* **Hide axis**: If enabled, hides the X axis labels and scale markers.\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/heatmap.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Heatmap options\n###### Y axis\n\nTo configure formatting options for the Y axis, click **Y axis** and configure the following optional settings: \n* **Scale**: Categorical is the only option.\n* **Name**: Override the column name with a different display name.\n* **Start Value**: Show only values higher than a given value, regardless of the query result.\n* **End Value**: Show only values lower than a given value, regardless of the query result.\n* **Hide axis**: If enabled, hides the Y axis labels and scale markers.\n* **Sort values**: Whether to sort the Y axis values, even if they are not sorted in the 
query.\n* **Reverse Order**: Whether to reverse the sorting order.\n\n##### Heatmap options\n###### Colors\n\nTo configure colors, click **Colors** and optionally override automatic colors and configure custom colors: \n* **Name**: Override the legend name with a different display name.\n* **Color scheme**: Specify a custom color scheme.\n\n##### Heatmap options\n###### Data labels\n\nTo configure labels for each data point in the visualization, click **Data labels** and configure the following optional settings: \n* **Show data labels**: Whether to show data labels. Data labels can add visual clutter and are usually disabled for heatmaps.\n* **Number values format**: The format to use for labels for numeric values.\n* **Percent values format**: The format to use for labels for percentages.\n* **Date\/time values format**: The format to use for labels for date\/time values.\n* **Data labels**: The format to use for labels for other types of values.\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/heatmap.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deploy models for batch inference and prediction\n#### Deep learning model inference performance tuning guide\n\nThis section provides some tips for debugging and performance tuning for model inference on Databricks. For an overview, see the [deep learning inference workflow](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/dl-model-inference.html). \nTypically there are two main parts in model inference: data input pipeline and model inference. The data input pipeline is heavy on data I\/O input and model inference is heavy on computation. Determining the bottleneck of the workflow is simple. Here are some approaches: \n* Reduce the model to a trivial model and measure the examples per second. 
If the difference in end-to-end time between the full model and the trivial model is minimal, then the data input pipeline is likely the bottleneck; otherwise, model inference is the bottleneck.\n* If running model inference with GPU, check the GPU utilization [metrics](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-performance). If GPU utilization is not continuously high, then the data input pipeline may be the bottleneck.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-inference\/model-inference-performance.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deploy models for batch inference and prediction\n#### Deep learning model inference performance tuning guide\n##### Optimize data input pipeline\n\nUsing GPUs can efficiently optimize the running speed for model inference. As GPUs and other accelerators become faster, it is important that the data input pipeline keep up with demand. The data input pipeline reads the data into Spark DataFrames, transforms it, and loads it as the input for model inference. If data input is the bottleneck, here are some tips to increase I\/O throughput: \n* Set the max records per batch. A larger maximum number of records per batch can reduce the I\/O overhead of calling the UDF, as long as the records fit in memory. To set the batch size, set the following config: \n```\nspark.conf.set(\"spark.sql.execution.arrow.maxRecordsPerBatch\", \"5000\")\n\n```\n* Load the data in batches and prefetch it when preprocessing the input data in the pandas UDF. \nFor TensorFlow, Databricks recommends using the [tf.data API](https:\/\/www.tensorflow.org\/guide\/data). You can parse the map in parallel by setting `num_parallel_calls` in a `map` function and call `prefetch` and `batch` for prefetching and batching. 
\n```\ndataset.map(parse_example, num_parallel_calls=num_process).prefetch(prefetch_size).batch(batch_size)\n\n``` \nFor PyTorch, Databricks recommends using the [DataLoader class](https:\/\/pytorch.org\/tutorials\/beginner\/data_loading_tutorial.html). You can set `batch_size` for batching and `num_workers` for parallel data loading. \n```\ntorch.utils.data.DataLoader(images, batch_size=batch_size, num_workers=num_process)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-inference\/model-inference-performance.html"} +{"content":"# Databricks data engineering\n## What are init scripts?\n#### What files can I reference in an init script?\n\nThe support for referencing other files in an init script depends on where the referenced files are stored. This article outlines this behavior and provides recommendations. \nDatabricks recommends managing all init scripts as cluster-scoped init scripts.\n\n#### What files can I reference in an init script?\n##### What identity is used to run init scripts?\n\nIn single user access mode, the identity of the assigned principal (a user or service principal) is used. \nIn shared access mode or no-isolation shared access mode, init scripts use the identity of the cluster owner. \nNot all locations for storing init scripts are supported on all Databricks Runtime versions and access modes. See [Where can init scripts be installed?](https:\/\/docs.databricks.com\/init-scripts\/index.html#compatibility).\n\n#### What files can I reference in an init script?\n##### Can I reference files in Unity Catalog volumes from init scripts?\n\nYou can reference libraries and init scripts stored in Unity Catalog volumes from init scripts stored in Unity Catalog volumes. \nImportant \nCredentials required to access other files stored in Unity Catalog volumes are only made available within init scripts stored in Unity Catalog volumes. 
You cannot reference any files in Unity Catalog volumes from init scripts configured from other locations. \nFor clusters with shared access mode, only the configured init script needs to be added to the allowlist. Access to other files referenced in the init script is governed by Unity Catalog.\n\n#### What files can I reference in an init script?\n##### Can I reference workspace files from init scripts?\n\nIn Databricks Runtime 11.3 LTS and above, you can reference other workspace files such as libraries, configuration files, or shell scripts from init scripts stored with workspace files.\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/referencing-files.html"} +{"content":"# Databricks data engineering\n## What are init scripts?\n#### What files can I reference in an init script?\n##### Can I reference files in cloud object storage from init scripts?\n\nYou can reference libraries and init scripts stored in cloud object storage from init scripts. \nFor clusters with shared access mode, only the configured init script needs to be added to the allowlist. Access to other files referenced in the init script is determined by the access configured for cloud object storage. \nDatabricks recommends using instance profiles to manage access to libraries and init scripts stored in S3. Use the documentation linked in the following steps to complete this setup: \n1. Create an IAM role with read and list permissions on your desired buckets. See [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html).\n2. Launch a cluster with the instance profile.
See [Instance profiles](https:\/\/docs.databricks.com\/compute\/configure.html#instance-profiles).\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/referencing-files.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables SQL language reference\n\nThis article provides details for the Delta Live Tables SQL programming interface. \n* For information on the Python API, see the [Delta Live Tables Python language reference](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html).\n* For more information about SQL commands, see [SQL language reference](https:\/\/docs.databricks.com\/sql\/language-manual\/index.html). \nYou can use Python user-defined functions (UDFs) in your SQL queries, but you must define these UDFs in Python files before calling them in SQL source files. See [User-defined scalar functions - Python](https:\/\/docs.databricks.com\/udf\/python.html).\n\n##### Delta Live Tables SQL language reference\n###### Limitations\n\nThe `PIVOT` clause is not supported. The `pivot` operation in Spark requires eager loading of input data to compute the schema of the output. This capability is not supported in Delta Live Tables.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables SQL language reference\n###### Create a Delta Live Tables materialized view or streaming table\n\nYou use the same basic SQL syntax when declaring either a streaming table or a materialized view (also referred to as a `LIVE TABLE`). \nYou can only declare streaming tables using queries that read against a streaming source. Databricks recommends using Auto Loader for streaming ingestion of files from cloud object storage. 
See [Auto Loader SQL syntax](https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html#auto-loader-sql). \nYou must include the `STREAM()` function around a dataset name when specifying other tables or views in your pipeline as a streaming source. \nThe following describes the syntax for declaring materialized views and streaming tables with SQL: \n```\nCREATE OR REFRESH [TEMPORARY] { STREAMING TABLE | LIVE TABLE } table_name\n[(\n[\ncol_name1 col_type1 [ GENERATED ALWAYS AS generation_expression1 ] [ COMMENT col_comment1 ] [ column_constraint ],\ncol_name2 col_type2 [ GENERATED ALWAYS AS generation_expression2 ] [ COMMENT col_comment2 ] [ column_constraint ],\n...\n]\n[\nCONSTRAINT expectation_name_1 EXPECT (expectation_expr1) [ON VIOLATION { FAIL UPDATE | DROP ROW }],\nCONSTRAINT expectation_name_2 EXPECT (expectation_expr2) [ON VIOLATION { FAIL UPDATE | DROP ROW }],\n...\n]\n[ table_constraint ] [, ...]\n)]\n[USING DELTA]\n[PARTITIONED BY (col_name1, col_name2, ... )]\n[LOCATION path]\n[COMMENT table_comment]\n[TBLPROPERTIES (key1 [ = ] val1, key2 [ = ] val2, ... 
)]\nAS select_statement\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables SQL language reference\n###### Create a Delta Live Tables view\n\nThe following describes the syntax for declaring views with SQL: \n```\nCREATE TEMPORARY [STREAMING] LIVE VIEW view_name\n[(\n[\ncol_name1 [ COMMENT col_comment1 ],\ncol_name2 [ COMMENT col_comment2 ],\n...\n]\n[\nCONSTRAINT expectation_name_1 EXPECT (expectation_expr1) [ON VIOLATION { FAIL UPDATE | DROP ROW }],\nCONSTRAINT expectation_name_2 EXPECT (expectation_expr2) [ON VIOLATION { FAIL UPDATE | DROP ROW }],\n...\n]\n)]\n[COMMENT view_comment]\nAS select_statement\n\n```\n\n##### Delta Live Tables SQL language reference\n###### Auto Loader SQL syntax\n\nThe following describes the syntax for working with Auto Loader in SQL: \n```\nCREATE OR REFRESH STREAMING TABLE table_name\nAS SELECT *\nFROM cloud_files(\n\"<file-path>\",\n\"<file-format>\",\nmap(\n\"<option-key>\", \"<option_value>\",\n\"<option-key>\", \"<option_value>\",\n...\n)\n)\n\n``` \nYou can use supported format options with Auto Loader. Using the `map()` function, you can pass any number of options to the `cloud_files()` method. Options are key-value pairs, where the keys and values are strings. For details on supported formats and options, see [File format options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#format-options).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables SQL language reference\n###### Example: Define tables\n\nYou can create a dataset by reading from an external data source or from datasets defined in a pipeline.
To read from an internal dataset, prepend the `LIVE` keyword to the dataset name. The following example defines two different datasets: a table called `taxi_raw` that takes a JSON file as the input source and a table called `filtered_data` that takes the `taxi_raw` table as input: \n```\nCREATE OR REFRESH LIVE TABLE taxi_raw\nAS SELECT * FROM json.`\/databricks-datasets\/nyctaxi\/sample\/json\/`\n\nCREATE OR REFRESH LIVE TABLE filtered_data\nAS SELECT\n...\nFROM LIVE.taxi_raw\n\n```\n\n##### Delta Live Tables SQL language reference\n###### Example: Read from a streaming source\n\nTo read data from a streaming source, for example, Auto Loader or an internal data set, define a `STREAMING` table: \n```\nCREATE OR REFRESH STREAMING TABLE customers_bronze\nAS SELECT * FROM cloud_files(\"\/databricks-datasets\/retail-org\/customers\/\", \"csv\")\n\nCREATE OR REFRESH STREAMING TABLE customers_silver\nAS SELECT * FROM STREAM(LIVE.customers_bronze)\n\n``` \nFor more information on streaming data, see [Transform data with Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/transform.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables SQL language reference\n###### Control how tables are materialized\n\nTables also offer additional control of their materialization: \n* Specify how tables are [partitioned](https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html#schema-partition-example) using `PARTITIONED BY`. You can use partitioning to speed up queries.\n* You can set table properties using `TBLPROPERTIES`. See [Delta Live Tables table properties](https:\/\/docs.databricks.com\/delta-live-tables\/properties.html#table-properties).\n* Set a storage location using the `LOCATION` setting. 
By default, table data is stored in the pipeline storage location if `LOCATION` isn\u2019t set.\n* You can use [generated columns](https:\/\/docs.databricks.com\/delta\/generated-columns.html) in your schema definition. See [Example: Specify a schema and partition columns](https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html#schema-partition-example). \nNote \nFor tables less than 1 TB in size, Databricks recommends letting Delta Live Tables control data organization. Unless you expect your table to grow beyond a terabyte, you should generally not specify partition columns.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables SQL language reference\n###### Example: Specify a schema and partition columns\n\nYou can optionally specify a schema when you define a table. The following example specifies the schema for the target table, including using Delta Lake [generated columns](https:\/\/docs.databricks.com\/delta\/generated-columns.html) and defining partition columns for the table: \n```\nCREATE OR REFRESH LIVE TABLE sales\n(customer_id STRING,\ncustomer_name STRING,\nnumber_of_line_items STRING,\norder_datetime STRING,\norder_number LONG,\norder_day_of_week STRING GENERATED ALWAYS AS (dayofweek(order_datetime))\n) PARTITIONED BY (order_day_of_week)\nCOMMENT \"Raw data on sales\"\nAS SELECT * FROM ...\n\n``` \nBy default, Delta Live Tables infers the schema from the `table` definition if you don\u2019t specify a schema.\n\n##### Delta Live Tables SQL language reference\n###### Example: Define table constraints\n\nNote \nDelta Live Tables support for table constraints is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To define table constraints, your pipeline must be a Unity Catalog-enabled pipeline and configured to use the `preview` channel. 
\nWhen specifying a schema, you can define primary and foreign keys. The constraints are informational and are not enforced. The following example defines a table with a primary and foreign key constraint: \n```\nCREATE OR REFRESH LIVE TABLE sales\n(customer_id STRING NOT NULL PRIMARY KEY,\ncustomer_name STRING,\nnumber_of_line_items STRING,\norder_datetime STRING,\norder_number LONG,\norder_day_of_week STRING GENERATED ALWAYS AS (dayofweek(order_datetime)),\nCONSTRAINT fk_customer_id FOREIGN KEY (customer_id) REFERENCES main.default.customers(customer_id)\n)\nCOMMENT \"Raw data on sales\"\nAS SELECT * FROM ...\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables SQL language reference\n###### Set configuration values for a table or view\n\nUse `SET` to specify a configuration value for a table or view, including Spark configurations. Any table or view you define in a notebook after the `SET` statement has access to the defined value. Any Spark configurations specified using the `SET` statement are used when executing the Spark query for any table or view following the SET statement. To read a configuration value in a query, use the string interpolation syntax `${}`. 
The following example sets a Spark configuration value named `startDate` and uses that value in a query: \n```\nSET startDate='2020-01-01';\n\nCREATE OR REFRESH LIVE TABLE filtered\nAS SELECT * FROM src\nWHERE date > ${startDate}\n\n``` \nTo specify multiple configuration values, use a separate `SET` statement for each value.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables SQL language reference\n###### SQL properties\n\n| CREATE TABLE or VIEW |\n| --- |\n| **`TEMPORARY`** Create a table but do not publish metadata for the table. The `TEMPORARY` clause instructs Delta Live Tables to create a table that is available to the pipeline but should not be accessed outside the pipeline. To reduce processing time, a temporary table persists for the lifetime of the pipeline that creates it, and not just a single update. |\n| **`STREAMING`** Create a table that reads an input dataset as a stream. The input dataset must be a streaming data source, for example, Auto Loader or a `STREAMING` table. |\n| **`PARTITIONED BY`** An optional list of one or more columns to use for partitioning the table. |\n| **`LOCATION`** An optional storage location for table data. If not set, the system will default to the pipeline storage location. |\n| **`COMMENT`** An optional description for the table. |\n| **`column_constraint`** An optional informational primary key or foreign key [constraint](https:\/\/docs.databricks.com\/azure\/databricks\/sql\/language-manual\/sql-ref-syntax-ddl-create-table-constraint) on the column. |\n| **`table_constraint`** An optional informational primary key or foreign key [constraint](https:\/\/docs.databricks.com\/azure\/databricks\/sql\/language-manual\/sql-ref-syntax-ddl-create-table-constraint) on the table. 
|\n| **`TBLPROPERTIES`** An optional list of [table properties](https:\/\/docs.databricks.com\/delta-live-tables\/properties.html) for the table. |\n| **`select_statement`** A Delta Live Tables query that defines the dataset for the table. | \n| CONSTRAINT clause |\n| --- |\n| **`EXPECT expectation_name`** Define data quality constraint `expectation_name`. If `ON VIOLATION` constraint is not defined, add rows that violate the constraint to the target dataset. |\n| **`ON VIOLATION`** Optional action to take for failed rows:* `FAIL UPDATE`: Immediately stop pipeline execution. * `DROP ROW`: Drop the record and continue processing. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Delta Live Tables language references\n##### Delta Live Tables SQL language reference\n###### Change data capture with SQL in Delta Live Tables\n\nUse the `APPLY CHANGES INTO` statement to use Delta Live Tables CDC functionality, as described in the following: \n```\nCREATE OR REFRESH STREAMING TABLE table_name;\n\nAPPLY CHANGES INTO LIVE.table_name\nFROM source\nKEYS (keys)\n[IGNORE NULL UPDATES]\n[APPLY AS DELETE WHEN condition]\n[APPLY AS TRUNCATE WHEN condition]\nSEQUENCE BY orderByColumn\n[COLUMNS {columnList | * EXCEPT (exceptColumnList)}]\n[STORED AS {SCD TYPE 1 | SCD TYPE 2}]\n[TRACK HISTORY ON {columnList | * EXCEPT (exceptColumnList)}]\n\n``` \nYou define data quality constraints for an `APPLY CHANGES` target using the same `CONSTRAINT` clause as non-`APPLY CHANGES` queries. See [Manage data quality with Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html). \nNote \nThe default behavior for `INSERT` and `UPDATE` events is to *upsert* CDC events from the source: update any rows in the target table that match the specified key(s) or insert a new row when a matching record does not exist in the target table. 
Handling for `DELETE` events can be specified with the `APPLY AS DELETE WHEN` condition. \nImportant \nYou must declare a target streaming table to apply changes into. You can optionally specify the schema for your target table. When specifying the schema of the `APPLY CHANGES` target table, you must also include the `__START_AT` and `__END_AT` columns with the same data type as the `sequence_by` field. \nSee [APPLY CHANGES API: Simplify change data capture in Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html). \n| Clauses |\n| --- |\n| **`KEYS`** The column or combination of columns that uniquely identify a row in the source data. This is used to identify which CDC events apply to specific records in the target table. This clause is required. |\n| **`IGNORE NULL UPDATES`** Allow ingesting updates containing a subset of the target columns. When a CDC event matches an existing row and IGNORE NULL UPDATES is specified, columns with a `null` will retain their existing values in the target. This also applies to nested columns with a value of `null`. This clause is optional. The default is to overwrite existing columns with `null` values. |\n| **`APPLY AS DELETE WHEN`** Specifies when a CDC event should be treated as a `DELETE` rather than an upsert. To handle out-of-order data, the deleted row is temporarily retained as a tombstone in the underlying Delta table, and a view is created in the metastore that filters out these tombstones. The retention interval can be configured with the `pipelines.cdc.tombstoneGCThresholdInSeconds` [table property](https:\/\/docs.databricks.com\/delta-live-tables\/properties.html#table-properties). This clause is optional. |\n| **`APPLY AS TRUNCATE WHEN`** Specifies when a CDC event should be treated as a full table `TRUNCATE`. Because this clause triggers a full truncate of the target table, it should be used only for specific use cases requiring this functionality. 
The `APPLY AS TRUNCATE WHEN` clause is supported only for SCD type 1. SCD type 2 does not support truncate. This clause is optional. |\n| **`SEQUENCE BY`** The column name specifying the logical order of CDC events in the source data. Delta Live Tables uses this sequencing to handle change events that arrive out of order. This clause is required. |\n| **`COLUMNS`** Specifies a subset of columns to include in the target table. You can either: * Specify the complete list of columns to include: `COLUMNS (userId, name, city)`. * Specify a list of columns to exclude: `COLUMNS * EXCEPT (operation, sequenceNum)`. This clause is optional. The default is to include all columns in the target table when the `COLUMNS` clause is not specified. |\n| **`STORED AS`** Whether to store records as SCD type 1 or SCD type 2. This clause is optional. The default is SCD type 1. |\n| **`TRACK HISTORY ON`** Specifies a subset of output columns for which history records are generated when any of those columns change. You can either: * Specify the complete list of columns to track: `TRACK HISTORY ON (userId, name, city)`. * Specify a list of columns to be excluded from tracking: `TRACK HISTORY ON * EXCEPT (operation, sequenceNum)`. This clause is optional. The default is to track history for all the output columns when there are any changes, equivalent to `TRACK HISTORY ON *`. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Download and install the Databricks ODBC Driver\n\nThis article describes how to download and install the [Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/index.html). \nReview the [JDBC ODBC driver license](https:\/\/databricks.com\/jdbc-odbc-driver-license) before you download and install the ODBC driver.
\nSome tools and clients require you to install the ODBC driver before you can set up a connection to Databricks, while others embed the ODBC driver and do not require separate installation. For example, to use [Tableau Desktop](https:\/\/docs.databricks.com\/partners\/bi\/tableau.html), the ODBC driver must be installed, while recent [Power BI Desktop](https:\/\/docs.databricks.com\/partners\/bi\/power-bi.html) releases include the ODBC driver preinstalled and no further action is needed. If you do not need to download or install the ODBC driver, skip ahead to [Next steps](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html#next-steps). \nTo download and install the ODBC driver, complete the following instructions, depending on your operating system: \n* [Download and install the ODBC driver for Windows](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html#odbc-windows)\n* [Download and install the ODBC driver for macOS](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html#odbc-mac)\n* [Download and install the ODBC driver for Linux](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html#odbc-linux)\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/download.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Download and install the Databricks ODBC Driver\n########## Download and install the ODBC driver for Windows\n\n1. Go to the [All ODBC Driver Versions - Windows](https:\/\/www.databricks.com\/spark\/odbc-drivers-archive#windows) download page.\n2. Click the **32-Bit** or **64-Bit** link, depending on your operating system\u2019s architecture, for the latest version of the ODBC driver.\n3. Extract the contents of the downloaded `.zip` file. For extraction instructions, see your operating system\u2019s documentation.\n4. 
Double-click the extracted `.msi` file, and then follow the on-screen directions to install the driver in `C:\\Program Files\\Simba Spark ODBC Driver`.\n5. Go to [Next steps](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html#next-steps). \n### Download and install the ODBC driver for macOS \n1. Go to the [All ODBC Driver Versions - Mac OS](https:\/\/www.databricks.com\/spark\/odbc-drivers-archive#mac) download page.\n2. Click the **32-Bit** link for the latest version of the ODBC driver.\n3. Extract the contents of the downloaded `.zip` file. For extraction instructions, see your operating system\u2019s documentation.\n4. Double-click the extracted `.dmg` file, and then follow the on-screen directions to install the driver in `\/Library\/simba\/spark`.\n5. Go to [Next steps](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html#next-steps). \n### Download and install the ODBC driver for Linux \n1. Go to one of the following download pages, depending on your operating system\u2019s package manager type: \n* [All ODBC Driver Versions - Linux (rpm)](https:\/\/www.databricks.com\/spark\/odbc-drivers-archive#linux)\n* [All ODBC Driver Versions - Linux (deb)](https:\/\/www.databricks.com\/spark\/odbc-drivers-archive#deb)\n2. Click the **32-Bit** or **64-Bit** link, depending on your operating system\u2019s architecture, for the latest version of the ODBC driver.\n3. Extract the contents of the downloaded `.zip` file. For extraction instructions, see your operating system\u2019s documentation.\n4. Install the ODBC driver, depending on your operating system\u2019s package manager type: \n* For Linux (rpm): \n```\nsudo yum --nogpgcheck localinstall simbaspark_<version>.rpm\n\n```\n* For Linux (deb): \n```\nsudo dpkg -i simbaspark_<version>.deb\n\n```\nThe installation directory is `\/opt\/simba\/spark`.\n5.
Go to [Next steps](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html#next-steps).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/download.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Download and install the Databricks ODBC Driver\n########## Next steps\n\nTo configure a Databricks connection for the Databricks ODBC Driver, see the following articles: \n* [Compute settings for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/compute.html)\n* [Authentication settings for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/authentication.html)\n* [Driver capability settings for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html)\n* [Create an ODBC DSN for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/dsn.html)\n* [Create an ODBC DSN-less connection string for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/dsn-less.html) \nFor more information, see the [Databricks ODBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Simba-Apache-Spark-ODBC-Connector-Install-and-Configuration-Guide.pdf).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/download.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### VPC peering\n\nVPC peering allows your Databricks clusters to connect to your other AWS infrastructure (RDS, Redshift, Kafka, Cassandra, and so on) using private IP addresses within the internal AWS network. \nThe VPC hosting the other infrastructure must have a CIDR range distinct from the Databricks VPC and any other CIDR range included as a destination in the Databricks VPC main route table. 
If you have a conflict, you can contact Databricks support to inquire about moving your Databricks VPC to a new CIDR range of your choice. You can view the destinations in the main route table by searching for the Databricks VPC in your AWS Console, clicking on the main route table associated with it, and then examining the **Route Tables** tab. Here is an example of a main route table for a Databricks deployment that is already peered with several other VPCs: \n![Databricks VPC Route Table](https:\/\/docs.databricks.com\/_images\/dbc-vpc-route-table.png) \nFor information on VPC peering, see the [AWS VPC Peering guide](https:\/\/docs.aws.amazon.com\/vpc\/latest\/userguide\/vpc-peering.html). \nThis guide walks you through an example of peering an AWS Aurora RDS to your Databricks VPC using the AWS Console. If you prefer a programmatic solution, go to [Programmatic VPC peering](https:\/\/docs.databricks.com\/security\/network\/classic\/vpc-peering.html#programmatic-peering) for a notebook that performs all of the steps for you. Finally, there is a [troubleshooting](https:\/\/docs.databricks.com\/security\/network\/classic\/vpc-peering.html#troubleshooting) section for common problems and resolutions. \nImportant \nConsult your AWS\/DevOps team before trying to set up VPC peering. Some familiarity with AWS, as well as sufficient permissions, helps ensure this process goes smoothly. The notebook can help you make this transition; however, depending on your environment, you might need to modify it to ensure there is no impact to your existing infrastructure.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/vpc-peering.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### VPC peering\n###### AWS Console example\n\nThe following diagram illustrates all of the different components that are involved in peering your Databricks deployment to your other AWS infrastructure.
In the example, Databricks is deployed in one AWS account and the Aurora RDS is deployed into another. A peering connection is established to link the two VPCs across both AWS accounts. \n![VPC Peering Connection Across AWS Accounts](https:\/\/docs.databricks.com\/_images\/aws-vpc-peer-diagram.png) \nAs you move through this process within your own AWS Console, it helps to keep a table of information to refer back to. Record the following: \n1. ID and CIDR Range of your Databricks VPC.\n2. ID and CIDR Range of your other infrastructure (Aurora RDS).\n3. ID of the main route table of your Databricks VPC. \n| AWS Service | Name | ID | CIDR Range |\n| --- | --- | --- | --- |\n| VPC | Databricks VPC | vpc-dbcb3fbc | 10.126.0.0\/16 |\n| VPC | Aurora RDS VPC | vpc-7b52471c | 172.78.0.0\/16 |\n| Route Table | Databricks Main Route Table | rtb-3775c750 | | \n### Step 1: Create a peering connection \n1. Navigate to the **VPC Dashboard**.\n2. Select **Peering Connections**.\n3. Click **Create Peering Connection**\n4. Set the **VPC Requester** to the Databricks VPC ID.\n5. Set the **VPC Acceptor** to the Aurora VPC ID.\n6. Click **Create Peering Connection**. \n![Create Peering Connection](https:\/\/docs.databricks.com\/_images\/aws-peering-connection.png) \n### Step 2: Record the ID of the peering connection \n| AWS Service | Name | ID | CIDR Range |\n| --- | --- | --- | --- |\n| VPC | Databricks VPC | vpc-dbcb3fbc | 10.126.0.0\/16 |\n| VPC | Aurora RDS VPC | vpc-7b52471c | 172.78.0.0\/16 |\n| Route Table | Databricks Main Route Table | rtb-3775c750 | |\n| Peering Connection | Databricks VPC <> Aurora VPC | pcx-4d148024 | | \n### Step 3: Accept the peering connection request \nThe VPC with the Aurora RDS will need to have its owner approve the request. The status on Peering Connections indicates **Pending Acceptance** until this is done. 
\n![Peering Connection Pending Acceptance](https:\/\/docs.databricks.com\/_images\/aws-vpc-peering-connection-pending.png) \nSelect **Actions > Accept Request**. \n![Peering Connection Accept Request](https:\/\/docs.databricks.com\/_images\/aws-vpc-peering-connection-accept.png) \n### Step 4: Add DNS resolution to peering connection \n1. Log into the AWS Account that hosts the Databricks VPC.\n2. Navigate to the **VPC Dashboard**.\n3. Select **Peering Connections**.\n4. From the Actions menu, select **Edit DNS Settings**.\n5. Click to enable **DNS resolution**.\n6. Log into the AWS Account that hosts the Aurora VPC and repeat steps 2 - 4. \n![Enable DNS Resolution](https:\/\/docs.databricks.com\/_images\/aws-vpc-peering-connection-enable-dns.png) \n### Step 5: Add destination to Databricks VPC main route table \n1. Select **Route Tables** in the VPC Dashboard.\n2. Search for the Databricks VPC ID.\n3. Click the **Edit** button under the **Routes** tab.\n4. Click **Add another route**.\n5. Enter the CIDR range of the Aurora VPC for the **Destination**.\n6. Enter the ID of the peering connection for the **Target**. \n![Databricks VPC Route Destinations](https:\/\/docs.databricks.com\/_images\/dbc-vpc-route-table-route.png) \n### Step 6: Add destination to Aurora VPC main route table \n1. Select **Route Tables** in the VPC Dashboard.\n2. Search for the Aurora VPC ID.\n3. Click the **Edit** button under the **Routes** tab.\n4. Click **Add another route**.\n5. Enter the CIDR range of the Databricks VPC for the **Destination**.\n6. Enter the ID of the peering connection for the **Target**. \n![Aurora VPC Route Destinations](https:\/\/docs.databricks.com\/_images\/aurora-vpc-route-table-route.png) \n### Step 7: Find the Databricks unmanaged security group \n1. Select **Security Groups** in the VPC Dashboard.\n2. Search for the ID of the Databricks VPC.\n3. Find and Record the ID of the security group with **Unmanaged** in the name. 
Do *not* select the Managed security group. \n| AWS Service | Name | ID | CIDR Range |\n| --- | --- | --- | --- |\n| VPC | Databricks VPC | vpc-dbcb3fbc | 10.126.0.0\/16 |\n| VPC | Aurora RDS VPC | vpc-7b52471c | 172.78.0.0\/16 |\n| Route Table | Databricks Main Route Table | rtb-3775c750 | |\n| Peering Connection | Databricks VPC <> Aurora VPC | pcx-4d148024 | |\n| Security Group | Databricks Unmanaged Group | sg-96016bef | | \n### Step 8: Add rule to unmanaged security group \n1. Select **Security Groups** in the VPC Dashboard.\n2. Search for the ID of the Aurora VPC.\n3. Add an **Inbound Rule** by clicking **Edit** and then **Add Another Rule**.\n4. Select **Custom TCP Rule** or the service that relates to your RDS.\n5. Set the **Port Range** to correspond to your RDS service. The default for Aurora [MySQL] is 3306.\n6. Set the **Source** to be the security group ID of the **Unmanaged** Databricks security group. \n![Aurora Security Group Rule](https:\/\/docs.databricks.com\/_images\/aurora-security-group-rule.png) \n### Step 9: Test connectivity \n1. Create a Databricks cluster.\n2. 
Check to see if you can connect to the database with the following `netcat` command: \n```\n%sh nc -zv <hostname> <port>\n\n``` \n![Validate Connectivity](https:\/\/docs.databricks.com\/_images\/aws-vpc-peering-validate.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/vpc-peering.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### VPC peering\n###### Programmatic VPC peering\n\nThis notebook supports two scenarios: \n* Establishing VPC peering between Databricks VPC and another VPC in the same AWS account\n* Establishing VPC peering between Databricks VPC and another VPC in a different AWS account \n### VPC peering notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/vpc-peering.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/vpc-peering.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### VPC peering\n###### Troubleshooting\n\n**Can\u2019t establish connectivity with `netcat`** \nIf you can\u2019t establish connectivity with `netcat`, check that the hostname is resolving via DNS by using the `host` Linux command. If the hostname does not resolve, verify that you have enabled DNS resolution in your peering connection. \n```\n%sh host -t a <hostname>\n\n``` \n![Validate DNS Resolution](https:\/\/docs.databricks.com\/_images\/aws-validate-dns.png) \n**Can\u2019t establish connectivity with the hostname or the IP address** \nIf you aren\u2019t able to establish connectivity with either the hostname or the IP address, verify that the VPC of your Aurora RDS has 3 subnets associated with its main route table. \n1. Select **Subnets** from the **VPC Dashboard** and search for the ID of the Aurora VPC. There should be a subnet for each availability zone. 
\n![Aurora VPC Subnets](https:\/\/docs.databricks.com\/_images\/aws-vpc-aurora-subnets.png)\n2. Make sure that each of those subnets is associated with the main route table. \n1. Select **Route Tables** from the VPC Dashboard and search for the main route table associated with the Aurora RDS.\n2. Click the **Subnet Associations** tab and then **Edit**. You should see all 3 subnets in the list, but none of them should have **Associate** selected. \n![Aurora Subnet Associations](https:\/\/docs.databricks.com\/_images\/aurora-subnet-associations.png) \n**DNS is not working** \nCheck in Route 53 and confirm that the Databricks VPC is associated with the private hosted zones used within your VPC.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/vpc-peering.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n#### TensorFlow\n\nTensorFlow is an open-source framework for machine learning created by Google. It supports deep-learning and general numerical computations on CPUs, GPUs, and clusters of GPUs.\nIt is subject to the terms and conditions of the [Apache License 2.0](https:\/\/github.com\/tensorflow\/tensorflow\/blob\/master\/LICENSE). \nDatabricks Runtime ML includes TensorFlow and TensorBoard, so you can use these libraries without installing any packages. For the version of TensorFlow installed in the Databricks Runtime ML version that you are using, see the [release notes](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). \nNote \nThis guide is not a comprehensive guide on TensorFlow. See the [TensorFlow website](https:\/\/www.tensorflow.org\/).\n\n#### TensorFlow\n##### Single node and distributed training\n\nTo test and migrate single-machine workflows, use a [Single Node cluster](https:\/\/docs.databricks.com\/compute\/configure.html#single-node).
\nFor distributed training options for deep learning, see [Distributed training](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/index.html).\n\n#### TensorFlow\n##### TensorFlow example notebook\n\nThe following notebook shows how you can run TensorFlow (1.x and 2.x) with TensorBoard monitoring on a Single Node cluster. \n### TensorFlow 1.15\/2.x notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/tensorflow-single-node.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/tensorflow.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n#### TensorFlow\n##### TensorFlow Keras example notebook\n\n[TensorFlow Keras](https:\/\/keras.io\/about\/) is a deep learning API written in Python that runs on top of the machine learning platform TensorFlow. The 10-minute tutorial notebook shows an example of training machine learning models on tabular data with TensorFlow Keras, including using inline [TensorBoard](https:\/\/docs.databricks.com\/machine-learning\/train-model\/tensorboard.html).
\n### Get started with TensorFlow Keras notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/getting-started\/get-started-keras-dbr7ml.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/tensorflow.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/run-serverless-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Run your Databricks job with serverless compute for workflows\n\nPreview \nServerless compute for workflows is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). For information on eligibility and enablement, see [Enable serverless compute public preview](https:\/\/docs.databricks.com\/admin\/workspace-settings\/serverless.html). \nImportant \nBecause the public preview of serverless compute for workflows does not support controlling egress traffic, your jobs have full access to the internet. \nServerless compute for workflows allows you to run your Databricks job without configuring and deploying infrastructure. With serverless compute, you focus on implementing your data processing and analysis pipelines, and Databricks efficiently manages compute resources, including optimizing and scaling compute for your workloads. Autoscaling and [Photon](https:\/\/docs.databricks.com\/compute\/photon.html) are automatically enabled for the compute resources that run your job. \nServerless compute for workflows auto-optimization automatically optimizes compute by selecting appropriate resources such as instance types, memory, and processing engines based on your workload. Auto-optimization also automatically retries failed jobs. 
\nDatabricks automatically upgrades the Databricks Runtime version to support enhancements and upgrades to the platform while ensuring the stability of your Databricks jobs. To see the current Databricks Runtime version used by serverless compute for workflows, see [Serverless compute release notes](https:\/\/docs.databricks.com\/release-notes\/serverless.html). \nBecause cluster creation permission is not required, all workspace users can use serverless compute to run their workflows. \nThis article describes using the Databricks Jobs UI to create and run jobs that use serverless compute. You can also automate creating and running jobs that use serverless compute with the Jobs API, Databricks Asset Bundles, and the Databricks SDK for Python. \n* To learn about using the Jobs API to create and run jobs that use serverless compute, see [Jobs](https:\/\/docs.databricks.com\/api\/workspace\/jobs) in the REST API reference.\n* To learn about using Databricks Asset Bundles to create and run jobs that use serverless compute, see [Develop a job on Databricks by using Databricks Asset Bundles](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-bundles-with-jobs.html).\n* To learn about using the Databricks SDK for Python to create and run jobs that use serverless compute, see [Databricks SDK for Python](https:\/\/docs.databricks.com\/dev-tools\/sdk-python.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/run-serverless-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Run your Databricks job with serverless compute for workflows\n##### Requirements\n\n* Your Databricks workspace must have Unity Catalog enabled.\n* Because serverless compute for workflows uses [shared access mode](https:\/\/docs.databricks.com\/compute\/configure.html#access-modes), your workloads must support this access mode.\n* Your Databricks workspace must be in a supported region. 
See [Databricks clouds and regions](https:\/\/docs.databricks.com\/resources\/supported-regions.html).\n\n#### Run your Databricks job with serverless compute for workflows\n##### Create a job using serverless compute\n\nServerless compute is supported with the notebook, Python script, dbt, and Python wheel [task types](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#task-types). By default, serverless compute is selected as the compute type when you [create a new job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#job-create) and add one of these supported task types. \n![Create serverless task](https:\/\/docs.databricks.com\/_images\/create-serverless-job-ui.png) \nDatabricks recommends using serverless compute for all job tasks. You can also specify different compute types for tasks in a job, which might be required if a task type is not supported by serverless compute for workflows.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/run-serverless-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Run your Databricks job with serverless compute for workflows\n##### Configure an existing job to use serverless compute\n\nYou can switch an existing job to use serverless compute for supported task types when you [edit the job](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#job-edit). To switch to serverless compute, either: \n* In the **Job details** side panel click **Swap** under **Compute**, click **New**, enter or update any settings, and click **Update**.\n* Click ![Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret.png) in the **Compute** drop-down menu and select **Serverless**. 
\n![Switch task to serverless compute](https:\/\/docs.databricks.com\/_images\/swap-existing-to-serverless.png)\n\n#### Run your Databricks job with serverless compute for workflows\n##### Schedule a notebook using serverless compute\n\nIn addition to using the Jobs UI to create and schedule a job using serverless compute, you can create and run a job that uses serverless compute directly from a Databricks notebook. See [Create and manage scheduled notebook jobs](https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/run-serverless-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Run your Databricks job with serverless compute for workflows\n##### Set Spark configuration parameters\n\nTo automate the configuration of Spark on serverless compute, Databricks allows setting only specific Spark configuration parameters. For the list of allowable parameters, see [Supported Spark configuration parameters](https:\/\/docs.databricks.com\/release-notes\/serverless.html#supported-spark-config). \nYou can set Spark configuration parameters at the session level only. To do this, set them in a notebook and add the notebook to a task included in the same job that uses the parameters. See [Get and set Apache Spark configuration properties in a notebook](https:\/\/kb.databricks.com\/data\/get-and-set-spark-config).\n\n#### Run your Databricks job with serverless compute for workflows\n##### Configure notebook environments and dependencies\n\nTo manage library dependencies and environment configuration for a notebook task, add the configuration to a cell in the notebook. 
The following example installs Python libraries using `pip install` from workspace files and with a `requirements.txt` file and sets a `spark.sql.session.timeZone` session variable: \n```\n%pip install -r .\/requirements.txt\n%pip install simplejson\n%pip install \/Volumes\/my\/python.whl\n%pip install \/Workspace\/my\/python.whl\n%pip install https:\/\/some-distro.net\/popular.whl\nspark.conf.set('spark.sql.session.timeZone', 'Europe\/Amsterdam')\n\n``` \nTo set the same environment across multiple notebooks, you can use a single notebook to configure the environment and then use the `%run` magic command to run that notebook from any notebook that requires the environment configuration. See [Use %run to import a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-workflows.html#run).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/run-serverless-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Run your Databricks job with serverless compute for workflows\n##### Configure environments and dependencies for non-notebook tasks\n\nFor other supported task types, such as Python script, Python wheel, or dbt tasks, a default environment includes installed Python libraries. To see the list of installed libraries, see the **Installed Python libraries** section in the release notes for the Databricks Runtime version on which your serverless compute for workflows deployment is based. To see the current Databricks Runtime version used by serverless compute for workflows, see [Serverless compute release notes](https:\/\/docs.databricks.com\/release-notes\/serverless.html). You can also install Python libraries if a task requires a library that is not installed. You can install Python libraries from [workspace files](https:\/\/docs.databricks.com\/files\/workspace.html), Unity Catalog [volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html), or public package repositories.
To add a library when you create or edit a task: \n1. In the **Environment and Libraries** dropdown menu, click ![Edit Icon](https:\/\/docs.databricks.com\/_images\/edit-icon.png) next to the **Default** environment or click **+ Add new environment**. \n![Edit default environment](https:\/\/docs.databricks.com\/_images\/edit-serverless-environment.png)\n2. In the **Configure environment** dialog, click **+ Add library**.\n3. Select the type of dependency from the dropdown menu under **Libraries**.\n4. In the **File Path** text box, enter the path to the library. \n* For a Python Wheel in a workspace file, the path should be absolute and start with `\/Workspace\/`.\n* For a Python Wheel in a Unity Catalog volume, the path should be `\/Volumes\/<catalog>\/<schema>\/<volume>\/<path>.whl`.\n* For a `requirements.txt` file, select PyPi and enter `-r \/path\/to\/requirements.txt`. \n![Add task libraries](https:\/\/docs.databricks.com\/_images\/add-serverless-libraries.png) \n1. Click **Confirm** or **+ Add library** to add another library.\n2. If you\u2019re adding a task, click **Create task**. If you\u2019re editing a task, click **Save task**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/run-serverless-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Run your Databricks job with serverless compute for workflows\n##### Configure serverless compute auto-optimization to disallow retries\n\nServerless compute for workflows auto-optimization automatically optimizes the compute used to run your jobs and retries failed jobs. Auto-optimization is enabled by default, and Databricks recommends leaving it enabled to ensure critical workloads run successfully at least once. However, if you have workloads that must be executed at most once, for example, jobs that are not idempotent, you can turn off auto-optimization when adding or editing a task: \n1. 
Next to **Retries**, click **Add** (or ![Edit Icon](https:\/\/docs.databricks.com\/_images\/edit-icon.png) if a retry policy already exists).\n2. In the **Retry Policy** dialog, uncheck **Enable serverless auto-optimization (may include additional retries)**.\n3. Click **Confirm**.\n4. If you\u2019re adding a task, click **Create task**. If you\u2019re editing a task, click **Save task**.\n\n#### Run your Databricks job with serverless compute for workflows\n##### Monitor the cost of jobs that use serverless compute for workflows\n\nYou can monitor the cost of jobs that use serverless compute for workflows by querying the [billable usage system table](https:\/\/docs.databricks.com\/admin\/system-tables\/billing.html). This table is updated to include user and workload attributes about serverless costs. See [Billable usage system table reference](https:\/\/docs.databricks.com\/admin\/system-tables\/billing.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/run-serverless-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Run your Databricks job with serverless compute for workflows\n##### View details for your Spark queries\n\nServerless compute for workflows has a new interface for viewing detailed runtime information for your Spark statements, such as metrics and query plans. To view query insights for Spark statements included in your jobs run on serverless compute: \n1. Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar.\n2. In the **Name** column, click the job name you want to view insights for.\n3. Click the specific run you want to view insights for.\n4. In the **Compute** section of the **Task run** side panel, click **Query history**.\n5. You are redirected to the Query History, prefiltered based on the task run ID of the task you were in. 
\nFor information on using query history, see [Query history](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html).\n\n#### Run your Databricks job with serverless compute for workflows\n##### Limitations\n\nFor a list of serverless compute for workflows limitations, see [Serverless compute limitations](https:\/\/docs.databricks.com\/release-notes\/serverless.html#limitations) in the serverless compute release notes.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/run-serverless-jobs.html"} +{"content":"# Transform data\n### Model semi-structured data\n\nThis article recommends patterns for storing semi-structured data depending on how your organization uses the data. Databricks provides functions, native data types, and query syntax to work with semi-structured, nested, and complex data. \nThe following considerations impact which pattern you should use: \n* Do the fields or types in the data source change frequently?\n* How many total unique fields are contained in the data source?\n* Do you need to optimize your workloads for writes or reads? \nDatabricks recommends storing data as Delta tables for downstream queries.\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/semi-structured.html"} +{"content":"# Transform data\n### Model semi-structured data\n#### Use JSON strings\n\nYou can store data in a single string column using standard JSON formatting and then query fields in the JSON using `:` notation. \nMany systems output records as string or byte-encoded JSON records. Ingesting and storing these records as strings has very low processing overhead. You can also use the `to_json` function to turn any struct of data into a JSON string. 
\nConsider the following strengths and weaknesses when choosing to store data as JSON strings: \n* All values are stored as strings without type information.\n* JSON supports all data types that can be represented using text.\n* JSON supports strings of arbitrary length.\n* There are no limits on the number of fields that can be represented in a single JSON data column.\n* Data requires no pre-processing before writing to the table.\n* You can resolve type issues present in the data in downstream workloads.\n* JSON provides the worst performance on read, as you must parse the entire string for every query. \nJSON strings provide great flexibility and an easy-to-implement solution for getting raw data into a lakehouse table. You might choose to use JSON strings for many applications, but they are especially useful when the most important outcome of a workload is storing a complete and accurate representation of a data source for downstream processing. Some use cases might include: \n* Ingesting streaming data from a queue service such as Kafka.\n* Recording responses from REST API queries.\n* Storing raw records from an upstream data source not controlled by your team. \nAssuming your ingestion logic is flexible, storing data as a JSON string should be resilient even if you encounter new fields, changes in data structure, or type changes in the data source. While downstream workloads might fail due to these changes, your table contains a full history of the source data, meaning that you can remediate issues without needing to go back to the data source.\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/semi-structured.html"} +{"content":"# Transform data\n### Model semi-structured data\n#### Use structs\n\nYou can store semi-structured data with structs and enable all native functionality of columns while maintaining the nested structure of the data source.
\nDelta Lake treats data stored as structs the same as any other columns, meaning that there is no functional difference between structs and columns. The Parquet data files used by Delta Lake create a column for each field in a struct. You can use struct fields as clustering columns or partitioning columns, and you can collect statistics on structs for data skipping. \nStructs generally provide the best performance on read, as they support all data skipping optimizations and store individual fields as columns. Performance can begin to suffer when the number of columns present gets into the hundreds. \nEach field in a struct has a data type, which is enforced on write the same as columns. As such, structs require full pre-processing of data. This can be beneficial when you only want validated data committed to a table, but can lead to dropped data or failing jobs when processing malformed records from upstream systems. \nStructs are less flexible than JSON strings for schema evolution, whether this is for evolving data types or adding new fields.\n\n### Model semi-structured data\n#### Use maps and arrays\n\nYou can use a combination of maps and arrays to replicate semi-structured data formats natively in Delta Lake. Statistics cannot be collected on fields defined with these types, but they provide balanced performance on both read and write for semi-structured datasets that have around 500 fields. \nBoth the key and value of maps are typed, so data is pre-processed and schema is enforced on write. \nTo accelerate queries, Databricks recommends storing fields that are often used to filter data as separate columns.\n\n### Model semi-structured data\n#### Do I need to flatten my data?\n\nIf you are storing your data using JSON or maps, consider storing fields frequently used for filtering queries as columns. Stats collection, partitioning, and clustering are not available for fields within JSON strings or maps.
You do not need to do this for data stored as structs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/semi-structured.html"} +{"content":"# Transform data\n### Model semi-structured data\n#### Syntax for working with nested data\n\nReview the following resources for information on working with nested data: \n* [Transform complex data types](https:\/\/docs.databricks.com\/optimizations\/complex-types.html)\n* [Query semi-structured data in Databricks](https:\/\/docs.databricks.com\/optimizations\/semi-structured.html)\n* [Higher-order functions](https:\/\/docs.databricks.com\/optimizations\/higher-order-lambda-functions.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/semi-structured.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/index.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n######## Databricks ODBC Driver\n\nDatabricks provides an [ODBC driver](https:\/\/www.databricks.com\/spark\/odbc-drivers-download) that enables you to connect participating apps, tools, clients, SDKs, and APIs to Databricks through Open Database Connectivity (ODBC), an industry-standard specification for accessing database management systems. \nThis article and its related articles supplement the information in the [Databricks ODBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Simba-Apache-Spark-ODBC-Connector-Install-and-Configuration-Guide.pdf), available online in PDF format and in your ODBC driver download\u2019s `docs` directory. \nNote \nDatabricks also provides a JDBC driver. See [Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/index.html). \nThe process for using the ODBC driver is as follows: \n1. 
Download and install the ODBC driver, depending on your target operating system. See [Download and install the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html).\n2. Gather configuration settings to connect to your target Databricks compute resource (a Databricks cluster or a Databricks SQL warehouse), using your target Databricks authentication type and any special or advanced driver capabilities. See: \n* [Compute settings for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/compute.html)\n* [Authentication settings for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/authentication.html)\n* [Driver capability settings for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html)\n3. Store your gathered configuration settings as an ODBC Data Source Name (DSN) or as a DSN-less connection string, as follows: \n* To create a DSN, see [Create an ODBC DSN for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/dsn.html).\n* To create a DSN-less connection string, see [Create an ODBC DSN-less connection string for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/dsn-less.html). Whether you use a DSN or DSN-less connection string will depend on the requirements for your target app, tool, client, SDK, or API.\n4. To use your DSN or DSN-less connection string with your target app, tool, client, SDK, or API, see [Technology partners](https:\/\/docs.databricks.com\/integrations\/index.html) or your provider\u2019s documentation. \nFor more information, view the [Databricks ODBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Simba-Apache-Spark-ODBC-Connector-Install-and-Configuration-Guide.pdf) in PDF format.
This guide is also included as a PDF file named `Simba Apache Spark ODBC Connector Install and Configuration Guide.pdf` in your ODBC driver download\u2019s `docs` directory.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/index.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n######## Databricks ODBC Driver\n######### Additional resources\n\n* [Databricks ODBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Simba-Apache-Spark-ODBC-Connector-Install-and-Configuration-Guide.pdf)\n* [Troubleshooting connections](https:\/\/kb.databricks.com\/bi\/jdbc-odbc-troubleshooting.html) \n* [Databricks SQL Connector for Python](https:\/\/docs.databricks.com\/dev-tools\/python-sql-connector.html)\n* [Connect Python and pyodbc to Databricks](https:\/\/docs.databricks.com\/dev-tools\/pyodbc.html)\n* [Connect Tableau to Databricks](https:\/\/docs.databricks.com\/partners\/bi\/tableau.html)\n* [Connect Power BI to Databricks](https:\/\/docs.databricks.com\/partners\/bi\/power-bi.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/index.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Security\n#### compliance\n##### and privacy for the data lakehouse\n####### Best practices for security, compliance & privacy\n\nThe security best practices can be found in the Databricks Security and Trust Center under [Security Features](https:\/\/www.databricks.com\/trust#security-features). \nFor details, see this PDF: [Databricks AWS Security Best Practices and Threat Model](https:\/\/cms.databricks.com\/sites\/default\/files\/2023-03\/Databricks-AWS-Security-Best-Practices-and-Threat-Model.pdf). 
\nThe following sections list the best practices that can be found in the PDF along the principles of this pillar.\n\n####### Best practices for security, compliance & privacy\n######## 1. Manage identity and access using least privilege\n\n* Authenticate via single sign-on.\n* Use multifactor authentication.\n* Disable local passwords.\n* Set complex local passwords.\n* Separate admin accounts from normal user accounts.\n* Use token management.\n* SCIM synchronization of users and groups.\n* Limit cluster creation rights.\n* Store and use secrets securely.\n* Cross-account IAM role configuration.\n* Customer-approved workspace login.\n* Use clusters that support user isolation.\n* Use service principals to run production jobs. \nDetails are in the PDF referenced near the beginning of this article.\n\n####### Best practices for security, compliance & privacy\n######## 2. Protect data in transit and at rest\n\n* Avoid storing production data in DBFS.\n* Secure access to cloud storage.\n* Use data exfiltration settings within the admin console.\n* Use bucket versioning.\n* Encrypt storage and restrict access.\n* Add a customer-managed key for managed services.\n* Add a customer-managed key for workspace storage. \nDetails are in the PDF referenced near the beginning of this article.\n\n####### Best practices for security, compliance & privacy\n######## 3. Secure your network, and identify and protect endpoints\n\n* Deploy with a customer-managed VPC or VNet.\n* Use IP access lists.\n* Implement network exfiltration protections.\n* Apply VPC service controls.\n* Use VPC endpoint policies.\n* Configure PrivateLink. 
\nDetails are in the PDF referenced near the beginning of this article.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/security-compliance-and-privacy\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Security\n#### compliance\n##### and privacy for the data lakehouse\n####### Best practices for security, compliance & privacy\n######## 4. Review the shared responsibility model\n\n* Review the Shared Responsibility Model. \nDetails are in the PDF referenced near the beginning of this article.\n\n####### Best practices for security, compliance & privacy\n######## 5. Meet compliance and data privacy requirements\n\n* Review the Databricks compliance standards. \nDetails are in the PDF referenced near the beginning of this article.\n\n####### Best practices for security, compliance & privacy\n######## 6. Monitor system security\n\n* Use Databricks audit log delivery.\n* Configure tagging to monitor usage and enable charge-back.\n* Monitor workspace using Overwatch.\n* Monitor provisioning activities.\n* Use Enhanced Security Monitoring or Compliance Security Profile. \nDetails are in the PDF referenced near the beginning of this article.\n\n####### Best practices for security, compliance & privacy\n######## Generic controls\n\n* Service quotas.\n* Controlling libraries.\n* Isolate sensitive workloads into different workspaces.\n* Use CI\/CD processes to scan code for hard-coded secrets.\n* Use AWS Nitro instances. 
\nDetails are in the PDF referenced near the beginning of this article.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/security-compliance-and-privacy\/best-practices.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Inference tables for monitoring and debugging models\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes inference tables for monitoring served models. The following diagram shows a typical workflow with inference tables. The inference table automatically captures incoming requests and outgoing responses for a model serving endpoint and logs them as a Unity Catalog Delta table. You can use the data in this table to monitor, debug, and improve ML models. \n![Inference tables workflow](https:\/\/docs.databricks.com\/_images\/inference-tables-diagram.png)\n\n#### Inference tables for monitoring and debugging models\n##### What are inference tables?\n\nMonitoring the performance of models in production workflows is an important aspect of the AI and ML model lifecycle. Inference Tables simplify monitoring and diagnostics for models by continuously logging serving request inputs and responses (predictions) from Databricks Model Serving endpoints and saving them into a Delta table in Unity Catalog. You can then use all of the capabilities of the Databricks platform, such as DBSQL queries, notebooks, and Lakehouse Monitoring to monitor, debug, and optimize your models. \nYou can enable inference tables on any existing or newly created model serving endpoint, and requests to that endpoint are then automatically logged to a table in UC. \nSome common applications for inference tables are the following: \n* Monitor data and model quality. You can continuously monitor your model performance and data drift using Lakehouse Monitoring. 
Lakehouse Monitoring automatically generates data and model quality dashboards that you can share with stakeholders. Additionally, you can enable alerts to know when you need to retrain your model based on shifts in incoming data or reductions in model performance.\n* Debug production issues. Inference Tables log data like HTTP status codes, model execution times, and request and response JSON code. You can use this performance data for debugging purposes. You can also use the historical data in Inference Tables to compare model performance on historical requests.\n* Create a training corpus. By joining Inference Tables with ground truth labels, you can create a training corpus that you can use to re-train or fine-tune and improve your model. Using Databricks Workflows, you can set up a continuous feedback loop and automate re-training.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Inference tables for monitoring and debugging models\n##### Requirements\n\n* Your workspace must have Unity Catalog enabled.\n* To enable inference tables on an endpoint both the creator of the endpoint and the modifier need the following permissions: \n+ CAN MANAGE permission on the endpoint.\n+ `USE CATALOG` permissions on the specified catalog.\n+ `USE SCHEMA` permissions on the specified schema.\n+ `CREATE TABLE` permissions in the schema.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Inference tables for monitoring and debugging models\n##### Enable and disable inference tables\n\nThis section shows you how to enable or disable inference tables using the Databricks UI. 
You can also use the API; see [Enable inference tables on model serving endpoints using the API](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/enable-model-serving-inference-tables.html) for instructions. \nThe owner of the inference tables is the user who created the endpoint. All access control lists (ACLs) on the table follow the standard Unity Catalog permissions and can be modified by the table owner. \nWarning \nThe inference table could become corrupted if you do any of the following: \n* Change the table schema.\n* Change the table name.\n* Delete the table.\n* Lose permissions to the Unity Catalog catalog or schema. \nIn this case, the `auto_capture_config` of the endpoint status shows a `FAILED` state for the payload table. If this happens, you must create a new endpoint to continue using inference tables. \nTo enable inference tables during endpoint creation use the following steps: \n1. Click **Serving** in the Databricks Machine Learning UI.\n2. Click **Create serving endpoint**.\n3. Select **Enable inference tables**.\n4. In the drop-down menus, select the desired catalog and schema where you would like the table to be located. \n![catalog and schema for inference table](https:\/\/docs.databricks.com\/_images\/inference-table-location.png)\n5. The default table name is `<catalog>.<schema>.<endpoint-name>_payload`. If desired, you can enter a custom table prefix.\n6. Click **Create serving endpoint**. \nYou can also enable inference tables on an existing endpoint. To edit an existing endpoint configuration do the following: \n1. Navigate to your endpoint page.\n2. Click **Edit configuration**.\n3. Follow the previous instructions, starting with step 3.\n4. When you are done, click **Update serving endpoint**. \nFollow these instructions to disable inference tables: \nImportant \nWhen you disable inference tables on an endpoint, you cannot re-enable them. 
To continue using inference tables, you must create a new endpoint and enable inference tables on it. \n1. Navigate to your endpoint page.\n2. Click **Edit configuration**.\n3. Click **Enable inference table** to remove the checkmark.\n4. Once you are satisfied with the endpoint specifications, click **Update**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Inference tables for monitoring and debugging models\n##### Workflow: Monitor model performance using inference tables\n\nTo monitor model performance using inference tables, follow these steps: \n1. Enable [inference tables](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/enable-model-serving-inference-tables.html) on your endpoint, either during endpoint creation or by updating it afterwards.\n2. Schedule a workflow to process the JSON payloads in the inference table by unpacking them according to the schema of the endpoint.\n3. (Optional) Join the unpacked requests and responses with ground-truth labels to allow model quality metrics to be calculated.\n4. Create a monitor over the resulting Delta table and refresh the metrics. \nThe starter notebooks implement this workflow.\n\n#### Inference tables for monitoring and debugging models\n##### Starter notebook for monitoring an inference table\n\nThe following notebook implements the steps outlined above to unpack requests from a Lakehouse Monitoring inference table. The notebook can be run on demand, or on a recurring schedule using [Databricks Workflows](https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html). 
\n### Inference table Lakehouse Monitoring starter notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/monitoring\/inference-table-monitor.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Inference tables for monitoring and debugging models\n##### Starter notebook for monitoring text quality from endpoints serving LLMs\n\nThe following notebook unpacks requests from an inference table, computes a set of text evaluation metrics (such as readability and toxicity), and enables monitoring over these metrics. The notebook can be run on demand, or on a recurring schedule using [Databricks Workflows](https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html). \n### LLM inference table Lakehouse Monitoring starter notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/monitoring\/llm-inference-table-monitor.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Inference tables for monitoring and debugging models\n##### Query and analyze results in the inference table\n\nAfter your served models are ready, all requests made to your models are logged automatically to the inference table, along with the responses. You can view the table in the UI, query the table from DBSQL or a notebook, or query the table using the REST API. \n**To view the table in the UI:**\nOn the endpoint page, click the name of the inference table to open the table in Catalog Explorer. 
\n![link to inference table name on endpoint page](https:\/\/docs.databricks.com\/_images\/inference-table-name.png) \n**To query the table from DBSQL or a Databricks notebook:**\nYou can run code similar to the following to query the inference table. \n```\nSELECT * FROM <catalog>.<schema>.<payload_table>\n\n``` \nIf you enabled inference tables using the UI, `payload_table` is the table name you assigned when you created the endpoint. If you enabled inference tables using the API, `payload_table` is reported in the `state` section of the `auto_capture_config` response. For an example, see [Enable inference tables on model serving endpoints using the API](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/enable-model-serving-inference-tables.html). \n### Performance note \nAfter you invoke the endpoint, the invocation is logged to your inference table within 10 minutes of the scoring request. In addition, Databricks guarantees that log delivery happens at least once, so it is possible, though unlikely, that duplicate logs are sent.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Inference tables for monitoring and debugging models\n##### Unity Catalog inference table schema\n\nEach request and response that gets logged to an inference table is written to a Delta table with the following schema: \nNote \nIf you invoke the endpoint with a batch of inputs, the whole batch is logged as one row. \n| Column name | Description | Type |\n| --- | --- | --- |\n| `databricks_request_id` | A Databricks-generated request identifier attached to all model serving requests. | STRING |\n| `client_request_id` | An optional client-generated request identifier that can be specified in the model serving request body. 
See [Specify `client_request_id`](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html#client-id) for more information. | STRING |\n| `date` | The UTC date on which the model serving request was received. | DATE |\n| `timestamp_ms` | The timestamp, in epoch milliseconds, at which the model serving request was received. | LONG |\n| `status_code` | The HTTP status code that was returned from the model. | INT |\n| `sampling_fraction` | The sampling fraction used in the event that the request was down-sampled. This value is between 0 and 1, where 1 represents that 100% of incoming requests were included. | DOUBLE |\n| `execution_time_ms` | The time, in milliseconds, that the model took to perform inference. This does not include overhead network latencies and only represents the time it took for the model to generate predictions. | LONG |\n| `request` | The raw request JSON body that was sent to the model serving endpoint. | STRING |\n| `response` | The raw response JSON body that was returned by the model serving endpoint. | STRING |\n| `request_metadata` | A map of metadata related to the model serving endpoint associated with the request. This map contains the endpoint name, model name, and model version used for your endpoint. | MAP<STRING, STRING> |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Inference tables for monitoring and debugging models\n##### Specify `client_request_id`\n\nThe `client_request_id` field is an optional value the user can provide in the model serving request body. This lets you attach your own identifier to a request. The identifier appears in the `client_request_id` column of the inference table and can be used to join requests with other tables that use the same `client_request_id`, for example when joining ground truth labels. 
To specify a `client_request_id`, include it as a top-level key of the request payload. If no `client_request_id` is specified, the value appears as null in the row corresponding to the request. \n```\n{\n  \"client_request_id\": \"<user-provided-id>\",\n  \"dataframe_records\": [\n    {\n      \"sepal length (cm)\": 5.1,\n      \"sepal width (cm)\": 3.5,\n      \"petal length (cm)\": 1.4,\n      \"petal width (cm)\": 0.2\n    },\n    {\n      \"sepal length (cm)\": 4.9,\n      \"sepal width (cm)\": 3,\n      \"petal length (cm)\": 1.4,\n      \"petal width (cm)\": 0.2\n    },\n    {\n      \"sepal length (cm)\": 4.7,\n      \"sepal width (cm)\": 3.2,\n      \"petal length (cm)\": 1.3,\n      \"petal width (cm)\": 0.2\n    }\n  ]\n}\n\n``` \nThe `client_request_id` can later be used for ground truth label joins if there are other tables that have labels associated with the `client_request_id`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Inference tables for monitoring and debugging models\n##### Limitations\n\n* Customer-managed keys are not supported.\n* For endpoints that host [foundation models](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html), inference tables are only supported on [provisioned throughput](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#throughput) workloads. \n+ Inference tables on provisioned throughput endpoints do not support logging streaming requests.\n* Inference tables are not supported on endpoints that host [external models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html).\n* AWS PrivateLink is not supported by default. Reach out to your Databricks account team to enable it.\n* When inference tables are enabled, the limit for the total max concurrency across all served models in a single endpoint is 128. 
Reach out to your Databricks account team to request an increase to this limit.\n* If an inference table contains more than 500K files, no additional data is logged. To avoid exceeding this limit, run [OPTIMIZE](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-optimize.html) or set up retention on your table by deleting older data. To check the number of files in your table, run `DESCRIBE DETAIL <catalog>.<schema>.<payload_table>`. \nFor general model serving endpoint limitations, see [Model Serving limits and regions](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Compute features on demand using Python user-defined functions\n\nThis article describes how to create and use on-demand features in Databricks. \nMachine learning models for real-time applications often require the most recent feature values. In the example shown in the diagram, one feature for a restaurant recommendation model is the user\u2019s current distance from a restaurant. This feature must be calculated \u201con demand\u201d\u2014that is, at the time of the scoring request. Upon receiving a scoring request, the model looks up the restaurant\u2019s location, and then applies a pre-defined function to calculate the distance between the user\u2019s current location and the restaurant. That distance is passed as an input to the model, along with other precomputed features from the feature store. 
\n![compute features on demand workflow](https:\/\/docs.databricks.com\/_images\/on-demand-feature.png) \nTo use on-demand features, your workspace must be enabled for [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) and you must use Databricks Runtime 13.3 LTS ML or above.\n\n#### Compute features on demand using Python user-defined functions\n##### What are on-demand features?\n\n\u201cOn-demand\u201d refers to features whose values are not known ahead of time, but are calculated at the time of inference. In Databricks, you use [Python user-defined functions (UDFs)](https:\/\/docs.databricks.com\/udf\/unity-catalog.html#register-a-python-udf-to-unity-catalog) to specify how to calculate on-demand features. These functions are governed by Unity Catalog and discoverable through [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/on-demand-features.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Compute features on demand using Python user-defined functions\n##### Workflow\n\nTo compute features on-demand, you specify a Python user-defined function (UDF) that describes how to calculate the feature values. \n* During training, you provide this function and its input bindings in the `feature_lookups` parameter of the `create_training_set` API.\n* You must log the trained model using the Feature Store method `log_model`. 
This ensures that the model automatically evaluates on-demand features when it is used for inference.\n* For batch scoring, the `score_batch` API automatically calculates and returns all feature values, including on-demand features.\n* When you serve a model with [Databricks Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html), the model automatically uses the Python UDF to compute on-demand features for each scoring request.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/on-demand-features.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Compute features on demand using Python user-defined functions\n##### Create a Python UDF\n\nYou can create a Python UDF in a notebook or in Databricks SQL. \nFor example, running the following code in a notebook cell creates the Python UDF `example_feature` in the catalog `main` and schema `default`. \n```\n%sql\nCREATE FUNCTION main.default.example_feature(x INT, y INT)\nRETURNS INT\nLANGUAGE PYTHON\nCOMMENT 'add two numbers'\nAS $$\ndef add_numbers(n1: int, n2: int) -> int:\n    return n1 + n2\n\nreturn add_numbers(x, y)\n$$\n\n``` \nAfter running the code, you can navigate through the three-level namespace in [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html) to view the function definition: \n![function in Catalog Explorer](https:\/\/docs.databricks.com\/_images\/example_feature_func.png) \nFor more details about creating Python UDFs, see [Register a Python UDF to Unity Catalog](https:\/\/docs.databricks.com\/udf\/unity-catalog.html#register-a-python-udf-to-unity-catalog) and [the SQL language manual](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-sql-function.html). 
\n### How to handle missing feature values \nWhen a Python UDF depends on the result of a FeatureLookup, the value returned if the requested lookup key is not found depends on the environment. When using `score_batch`, the value returned is `None`. When using online serving, the value returned is `float(\"nan\")`. \nThe following code is an example of how to handle both cases. \n```\n%sql\nCREATE OR REPLACE FUNCTION square(x INT)\nRETURNS INT\nLANGUAGE PYTHON AS\n$$\nimport numpy as np\nif x is None or np.isnan(x):\n    return 0\nreturn x * x\n$$\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/on-demand-features.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Compute features on demand using Python user-defined functions\n##### Train a model using on-demand features\n\nTo train the model, you use a `FeatureFunction`, which is passed to the `create_training_set` API in the `feature_lookups` parameter. \nThe following example code uses the Python UDF `main.default.example_feature` that was defined in the previous section. 
\n```\n# Install databricks-feature-engineering first with:\n# %pip install databricks-feature-engineering\n# dbutils.library.restartPython()\n\nfrom databricks.feature_engineering import FeatureEngineeringClient\nfrom databricks.feature_engineering import FeatureFunction, FeatureLookup\nfrom sklearn import linear_model\n\nfe = FeatureEngineeringClient()\n\nfeatures = [\n    # The feature 'on_demand_feature' is computed as the sum of the input value 'new_source_input'\n    # and the pre-materialized feature 'materialized_feature_value'.\n    # - 'new_source_input' must be included in base_df and also provided at inference time.\n    #   - For batch inference, it must be included in the DataFrame passed to 'FeatureEngineeringClient.score_batch'.\n    #   - For real-time inference, it must be included in the request.\n    # - 'materialized_feature_value' is looked up from a feature table.\n\n    FeatureFunction(\n        udf_name=\"main.default.example_feature\",  # UDF must be in Unity Catalog so uses a three-level namespace\n        input_bindings={\n            \"x\": \"new_source_input\",\n            \"y\": \"materialized_feature_value\"\n        },\n        output_name=\"on_demand_feature\",\n    ),\n    # retrieve the prematerialized feature\n    FeatureLookup(\n        table_name = 'main.default.table',\n        feature_names = ['materialized_feature_value'],\n        lookup_key = 'id'\n    )\n]\n\n# base_df includes the columns 'id', 'new_source_input', and 'label'\ntraining_set = fe.create_training_set(\n    df=base_df,\n    feature_lookups=features,\n    label='label',\n    exclude_columns=['id', 'new_source_input', 'materialized_feature_value']  # drop the columns not used for training\n)\n\n# The training set contains the columns 'on_demand_feature' and 'label'.\ntraining_df = training_set.load_df().toPandas()\n\n# training_df columns ['on_demand_feature', 'label']\nX_train = training_df.drop(['label'], axis=1)\ny_train = training_df.label\n\nmodel = linear_model.LinearRegression().fit(X_train, 
y_train)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/on-demand-features.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Compute features on demand using Python user-defined functions\n##### Log the model and register it to Unity Catalog\n\nModels packaged with feature metadata can be [registered to Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html). The feature tables used to create the model must be stored in Unity Catalog. \nTo ensure that the model automatically evaluates on-demand features when it is used for inference, you must set the registry URI and then log the model, as follows: \n```\nimport mlflow\nmlflow.set_registry_uri(\"databricks-uc\")\n\nfe.log_model(\nmodel=model,\nartifact_path=\"main.default.model\",\nflavor=mlflow.sklearn,\ntraining_set=training_set,\nregistered_model_name=\"main.default.recommender_model\"\n)\n\n``` \nIf the Python UDF that defines the on-demand features imports any Python packages, you must specify these packages using the argument `extra_pip_requirements`. 
For example: \n```\nimport mlflow\nmlflow.set_registry_uri(\"databricks-uc\")\n\nfe.log_model(\nmodel=model,\nartifact_path=\"model\",\nflavor=mlflow.sklearn,\ntraining_set=training_set,\nregistered_model_name=\"main.default.recommender_model\",\nextra_pip_requirements=[\"numpy==1.20.3\"]\n)\n\n```\n\n#### Compute features on demand using Python user-defined functions\n##### Limitation\n\nOn-demand features can output all [data types supported by Feature Store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html#supported-data-types) except MapType and ArrayType.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/on-demand-features.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Compute features on demand using Python user-defined functions\n##### Notebook examples: On-demand features\n\nThe following notebook shows an example of how to train and score a model that uses an on-demand feature. \n### Basic on-demand features demo notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/on-demand-basic-demo.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nThe following notebook shows an example of a restaurant recommendation model. The restaurant\u2019s location is looked up from a Databricks online table. The user\u2019s current location is sent as part of the scoring request. The model uses an on-demand feature to compute the real-time distance from the user to the restaurant. That distance is then used as an input to the model. 
\n### Restaurant recommendation on-demand features using online tables demo notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/on-demand-restaurant-recommendation-demo-online-tables.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/on-demand-features.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Dashboards in notebooks\n\nThis page describes how to create dashboards based on notebook cell outputs. Dashboards allow you to publish graphs and visualizations, and then share them in a presentation format with your organization.\n\n#### Dashboards in notebooks\n##### Create or add to a dashboard\n\nYou create a dashboard by adding an item to it. \n* To add a visualization or output results table to a dashboard, click the down arrow next to the tab name and select **Add to dashboard >**.\n* To add a Markdown cell to a dashboard, click the dashboard icon in the cell actions menu. Markdown cells are useful as labels on your dashboard. \n![Add visualization to dashboard](https:\/\/docs.databricks.com\/_images\/viz-and-md-dashboard-menus.png) \nYou can select to create a new dashboard or add the plot to an existing dashboard. \n* If you select **Add to new dashboard**, the new dashboard is automatically displayed.\n* To add the plot to an existing dashboard, click the name of the dashboard. A checkmark appears to indicate that the plot is now on that dashboard. The dashboard is not automatically displayed. 
To go to the dashboard, click ![go to dashboard icon](https:\/\/docs.databricks.com\/_images\/go-to-dashboard-icon.png) to the right of the dashboard name.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/dashboards.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Dashboards in notebooks\n##### Control size and placement of items on a dashboard\n\n* To resize an item, click ![corner resize icon on dashboard](https:\/\/docs.databricks.com\/_images\/corner-icon.png) at the lower-right corner and move your cursor until the item is the size you want.\n* To move an item, click in the item and hold while you move your cursor.\n* There are two layout options in the right panel, **Stack** and **Float**. **Stack** keeps the items neatly lined up on the dashboard. To move items around more freely, select **Float**.\n* To add a name to a plot, move your cursor over the plot. A control panel ![dashboard item control panel](https:\/\/docs.databricks.com\/_images\/control-panel.png) appears in the upper-right corner. \n+ Select the Settings icon ![dashboard item settings icon](https:\/\/docs.databricks.com\/_images\/dashboard-settings-icon.png). The **Configure Dashboard Element** dialog appears.\n+ In the dialog, click **Show Title**, enter a title for the plot, and click **Save**.\n* To remove the item from the dashboard, move your cursor over the plot to display the control panel in the upper-right corner, and click ![dashboard item remove icon](https:\/\/docs.databricks.com\/_images\/dashboard-remove-icon.png).\n\n#### Dashboards in notebooks\n##### Add a title to a dashboard\n\nIn the box, enter the title. To change the title, enter the new title in the box. 
\n![dashboard title box](https:\/\/docs.databricks.com\/_images\/dashboard-title.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/dashboards.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Dashboards in notebooks\n##### Navigate between a dashboard and a notebook\n\nOn the dashboard menu, click ![go to dashboard icon](https:\/\/docs.databricks.com\/_images\/go-to-dashboard-icon.png) to the right of the dashboard name to go to the dashboard. To return to the notebook, click the notebook\u2019s name underneath the title of the dashboard. \nTo go from a dashboard directly to the notebook cell that created a plot, move your cursor over the plot. A control panel appears in the upper-right corner of the cell. Click ![go to notebook cell icon](https:\/\/docs.databricks.com\/_images\/dashboard-go-to-cell-icon.png).\n\n#### Dashboards in notebooks\n##### Present a dashboard\n\nTo present a dashboard, click ![present dashboard button](https:\/\/docs.databricks.com\/_images\/present-dashboard-button.png). To return to the interactive dashboard, click **Exit** in the upper-left corner.\n\n#### Dashboards in notebooks\n##### Delete a dashboard\n\nTo delete a dashboard, click ![delete dashboard button](https:\/\/docs.databricks.com\/_images\/delete-dashboard-button.png).\n\n#### Dashboards in notebooks\n##### Create a scheduled job to refresh a dashboard\n\nTo schedule a dashboard to refresh at a specified interval, click ![notebook header schedule button](https:\/\/docs.databricks.com\/_images\/nb-header-job.png) to create a scheduled job for the notebook that generates the plots on the dashboard. 
For details about scheduled jobs, see [Create and manage scheduled notebook jobs](https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/dashboards.html"} +{"content":"# \n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/6-deploy-rag-app-to-production.html"} +{"content":"# \n### Deploy a RAG application to production\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html) \nRAG Studio includes multiple environments to help you manage the lifecycle of your application. Up until now, these tutorials have worked in the RAG Studio development and Reviewers `Environment`. \nIn this tutorial, you will deploy a version of your application to the `End Users` environment. See [Environments](https:\/\/docs.databricks.com\/rag-studio\/details\/environments.html) for more details about how and why environments work. \n1. If you did not already run this command in [Initialize a RAG Application](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1-create-sample-app.html), run the following command to initialize these `Environments`. This command takes about 10 minutes to run. \n```\n.\/rag setup-prod-env\n\n``` \nNote \nSee [Infrastructure and Unity Catalog assets created by RAG Studio](https:\/\/docs.databricks.com\/rag-studio\/details\/created-infra.html) for details of what is created in your Workspace and Unity Catalog schema.\n2. Run the following command to deploy the version to the End Users `Environment`. This command takes about 10 minutes to run. \n```\n.\/rag deploy-chain -v 1 -e end_users\n\n```\n3. In the console, you will see output similar to the following. Open the URL in your web browser to open the `\ud83d\udcac Review UI`. 
You can share this URL with your `\ud83e\udde0 Expert Users`. \n```\n...truncated for clarity of docs...\n=======\nTask deploy_chain_task:\nYour Review UI is now available. Open the Review UI here: https:\/\/<workspace-url>\/ml\/review\/model\/catalog.schema.rag_studio_databricks-docs-bot\/version\/1\/environment\/end_users\n\n```\n4. If you want `\ud83d\udc64 End Users` to use the `\ud83d\udcac Review UI`, add permissions to the deployed version. \n* Give the Databricks user you wish to grant access `read` permissions to \n+ the MLflow Experiment\n+ the Model Serving endpoint\n+ the Unity Catalog Model\nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** Support for adding any corporate SSO to access the `\ud83d\udcac Review UI` e.g., no requirements for a Databricks account.\n5. Now, every time a `\ud83d\udc64 End Users` chats with your RAG Application, the `\ud83d\uddc2\ufe0f Request Log` and `\ud83d\udc4d Assessment & Evaluation Results Log` will be populated.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/6-deploy-rag-app-to-production.html"} +{"content":"# \n### Deploy a RAG application to production\n#### Data flow\n\n![legend](https:\/\/docs.databricks.com\/_images\/data-flow-prod.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/6-deploy-rag-app-to-production.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Evaluate large language models with MLflow\n\nThis article introduces MLflow LLM Evaluate, MLflow\u2019s large language model (LLM) evaluation functionality packaged in [mlflow.evaluate](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.evaluate). 
This article also describes what is needed to evaluate your LLM and what evaluation metrics are supported.\n\n### Evaluate large language models with MLflow\n#### What is MLflow LLM Evaluate?\n\nEvaluating LLM performance is slightly different from evaluating traditional ML models, because there is often no single ground truth to compare against. MLflow provides an API, `mlflow.evaluate()`, to help evaluate your LLMs. \nMLflow\u2019s LLM evaluation functionality consists of three main components: \n* **A model to evaluate**: It can be an MLflow `pyfunc` model, a DataFrame with a predictions column, a URI that points to one registered MLflow model, or any Python callable that represents your model, such as a HuggingFace text summarization pipeline.\n* **Metrics**: The metrics to compute. LLM Evaluate uses LLM metrics.\n* **Evaluation data**: The data your model is evaluated on. It can be a Pandas DataFrame, a Python list, a `numpy` array, or an `mlflow.data.dataset.Dataset` instance.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/llm-evaluate.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Evaluate large language models with MLflow\n#### Requirements\n\n* MLflow 2.8 and above.\n* To evaluate your LLM with `mlflow.evaluate()`, your LLM must be one of the following: \n+ A `mlflow.pyfunc.PyFuncModel` instance or a URI pointing to a logged `mlflow.pyfunc.PyFuncModel` model.\n+ A custom Python function that takes in string inputs and outputs a single string. Your callable must match the signature of `mlflow.pyfunc.PyFuncModel.predict` without a `params` argument. 
The function should: \n- Have `data` as the only argument, which can be a `pandas.DataFrame`, `numpy.ndarray`, Python list, dictionary, or scipy matrix.\n- Return one of the following: `pandas.DataFrame`, `pandas.Series`, `numpy.ndarray`, or list.\n+ A static dataset.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/llm-evaluate.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Evaluate large language models with MLflow\n#### Evaluate with an MLflow model\n\nYou can evaluate your LLM as an MLflow model. For detailed instructions on how to convert your model into a `mlflow.pyfunc.PyFuncModel` instance, see how to [Create a custom pyfunc model](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.pyfunc.html#creating-custom-pyfunc-models). \nTo evaluate your model as an MLflow model, Databricks recommends following these steps: \nNote \nTo successfully log a model targeting Azure OpenAI Service, you must specify the following environment variables for authentication and functionality. See the [OpenAI with MLflow](https:\/\/mlflow.org\/docs\/latest\/llms\/openai\/guide\/index.html?highlight=azure%20openai#azure-openai-service-integration) documentation for more details. \n```\nimport os\n\nos.environ[\"OPENAI_API_TYPE\"] = \"azure\"\nos.environ[\"OPENAI_API_VERSION\"] = \"2023-05-15\"\nos.environ[\"OPENAI_API_BASE\"] = \"https:\/\/<>.<>.<>.com\/\"\nos.environ[\"OPENAI_DEPLOYMENT_NAME\"] = \"deployment-name\"\n\n``` \n1. Package your LLM as an MLflow model and log it to the MLflow server using `log_model`. 
Each flavor (`openai`, `pytorch`, \u2026) has its own `log_model` API, such as `mlflow.openai.log_model()`: \n```\nimport mlflow\nimport openai\n\nwith mlflow.start_run():\n    system_prompt = \"Answer the following question in two sentences\"\n    # Wrap \"gpt-3.5-turbo\" as an MLflow model.\n    logged_model_info = mlflow.openai.log_model(\n        model=\"gpt-3.5-turbo\",\n        task=openai.ChatCompletion,\n        artifact_path=\"model\",\n        messages=[\n            {\"role\": \"system\", \"content\": system_prompt},\n            {\"role\": \"user\", \"content\": \"{question}\"},\n        ],\n    )\n\n```\n2. Use the URI of the logged model as the model instance in `mlflow.evaluate()`: \n```\nresults = mlflow.evaluate(\n    logged_model_info.model_uri,\n    eval_data,\n    targets=\"ground_truth\",\n    model_type=\"question-answering\",\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/llm-evaluate.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Evaluate large language models with MLflow\n#### Evaluate with a custom function\n\nIn MLflow 2.8.0 and above, `mlflow.evaluate()` supports evaluating a Python function without requiring the model to be logged to MLflow. This is useful when you don\u2019t want to log the model and just want to evaluate it. The following example uses `mlflow.evaluate()` to evaluate a function. \nYou also need to set up OpenAI authentication to run the following code: \n```\nimport mlflow\nimport openai\nimport pandas as pd\n\neval_data = pd.DataFrame(\n    {\n        \"inputs\": [\n            \"What is MLflow?\",\n            \"What is Spark?\",\n        ],\n        \"ground_truth\": [\n            \"MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.\",\n            \"Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. 
It was developed in response to limitations of the Hadoop MapReduce computing model, offering improvements in speed and ease of use. Spark provides libraries for various tasks such as data ingestion, processing, and analysis through its components like Spark SQL for structured data, Spark Streaming for real-time data processing, and MLlib for machine learning tasks\",\n        ],\n    }\n)\n\ndef openai_qa(inputs):\n    answers = []\n    system_prompt = \"Please answer the following question in formal language.\"\n    for index, row in inputs.iterrows():\n        completion = openai.ChatCompletion.create(\n            model=\"gpt-3.5-turbo\",\n            messages=[\n                {\"role\": \"system\", \"content\": system_prompt},\n                # Pass the question text from the inputs column.\n                {\"role\": \"user\", \"content\": row[\"inputs\"]},\n            ],\n        )\n        answers.append(completion.choices[0].message.content)\n\n    return answers\n\nwith mlflow.start_run() as run:\n    results = mlflow.evaluate(\n        openai_qa,\n        eval_data,\n        model_type=\"question-answering\",\n    )\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/llm-evaluate.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Evaluate large language models with MLflow\n#### Evaluate with a static dataset\n\nFor MLflow 2.8.0 and above, `mlflow.evaluate()` supports evaluating a static dataset without specifying a model. This is useful when you save the model output to a column in a Pandas DataFrame or an MLflow PandasDataset, and want to evaluate the static dataset without re-running the model. \nSet `model=None`, and put model outputs in the `data` argument. This configuration is only applicable when the data is a Pandas DataFrame. 
\nIf you are using a Pandas DataFrame, you must specify the column name that contains the model output using the top-level `predictions` parameter in `mlflow.evaluate()`: \n```\nimport mlflow\nimport pandas as pd\n\neval_data = pd.DataFrame(\n    {\n        \"inputs\": [\n            \"What is MLflow?\",\n            \"What is Spark?\",\n        ],\n        \"ground_truth\": [\n            \"MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. \"\n            \"It was developed by Databricks, a company that specializes in big data and machine learning solutions. \"\n            \"MLflow is designed to address the challenges that data scientists and machine learning engineers \"\n            \"face when developing, training, and deploying machine learning models.\",\n            \"Apache Spark is an open-source, distributed computing system designed for big data processing and \"\n            \"analytics. It was developed in response to limitations of the Hadoop MapReduce computing model, \"\n            \"offering improvements in speed and ease of use. Spark provides libraries for various tasks such as \"\n            \"data ingestion, processing, and analysis through its components like Spark SQL for structured data, \"\n            \"Spark Streaming for real-time data processing, and MLlib for machine learning tasks\",\n        ],\n        \"predictions\": [\n            \"MLflow is an open-source platform that provides handy tools to manage Machine Learning workflow \"\n            \"lifecycle in a simple way\",\n            \"Spark is a popular open-source distributed computing system designed for big data processing and analytics.\",\n        ],\n    }\n)\n\nwith mlflow.start_run() as run:\n    results = mlflow.evaluate(\n        data=eval_data,\n        targets=\"ground_truth\",\n        predictions=\"predictions\",\n        extra_metrics=[mlflow.metrics.genai.answer_similarity()],\n        evaluators=\"default\",\n    )\n    print(f\"See aggregated evaluation results below: \\n{results.metrics}\")\n\n    eval_table = results.tables[\"eval_results_table\"]\n    print(f\"See evaluation table below: \\n{eval_table}\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/llm-evaluate.html"} 
+{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Evaluate large language models with MLflow\n#### LLM evaluation metric types\n\nThere are two types of LLM evaluation metrics in MLflow: \n* Metrics that rely on SaaS models, like OpenAI, for scoring such as `mlflow.metrics.genai.answer_relevance`. These metrics are created using `mlflow.metrics.genai.make_genai_metric()`. For each data record, these metrics send one prompt consisting of the following information to the SaaS model, and extract the score from the model response. \n+ Metrics definition.\n+ Metrics grading criteria.\n+ Reference examples.\n+ Input data or context.\n+ Model output.\n+ [optional] Ground truth.\n* Function-based per-row metrics. These metrics calculate a score for each data record (row in terms of Pandas or Spark DataFrame), based on certain functions, like Rouge, `mlflow.metrics.rougeL`, or Flesch Kincaid,`mlflow.metrics.flesch_kincaid_grade_level`. These metrics are similar to traditional metrics.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/llm-evaluate.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Evaluate large language models with MLflow\n#### Select metrics to evaluate your LLM\n\nYou can select which metrics to evaluate your model. The full reference for supported evaluation metrics can be found in the [MLflow evaluate documentation](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.evaluate). \nYou can either: \n* Use the **default** metrics that are pre-defined for your model type.\n* Use a **custom** list of metrics. \nTo use defaults metrics for pre-selected tasks, specify the `model_type` argument in `mlflow.evaluate`, as shown by the example below: \n```\nresults = mlflow.evaluate(\nmodel,\neval_data,\ntargets=\"ground_truth\",\nmodel_type=\"question-answering\",\n)\n\n``` \nThe table summarizes the supported LLM model types and associated default metrics. 
\n| `question-answering` | `text-summarization` | `text` |\n| --- | --- | --- |\n| exact-match | [ROUGE](https:\/\/huggingface.co\/spaces\/evaluate-metric\/rouge)\u2020 | [toxicity](https:\/\/huggingface.co\/spaces\/evaluate-measurement\/toxicity)\\* |\n| [toxicity](https:\/\/huggingface.co\/spaces\/evaluate-measurement\/toxicity)\\* | [toxicity](https:\/\/huggingface.co\/spaces\/evaluate-measurement\/toxicity)\\* | [ari\\_grade\\_level](https:\/\/en.wikipedia.org\/wiki\/Automated_readability_index)\\*\\* |\n| [ari\\_grade\\_level](https:\/\/en.wikipedia.org\/wiki\/Automated_readability_index)\\*\\* | [ari\\_grade\\_level](https:\/\/en.wikipedia.org\/wiki\/Automated_readability_index)\\*\\* | [flesch\\_kincaid\\_grade\\_level](https:\/\/en.wikipedia.org\/wiki\/Flesch%E2%80%93Kincaid_readability_tests#Flesch%E2%80%93Kincaid_grade_level)\\*\\* |\n| [flesch\\_kincaid\\_grade\\_level](https:\/\/en.wikipedia.org\/wiki\/Flesch%E2%80%93Kincaid_readability_tests#Flesch%E2%80%93Kincaid_grade_level)\\*\\* | [flesch\\_kincaid\\_grade\\_level](https:\/\/en.wikipedia.org\/wiki\/Flesch%E2%80%93Kincaid_readability_tests#Flesch%E2%80%93Kincaid_grade_level)\\*\\* | | \n`*` Requires package [evaluate](https:\/\/pypi.org\/project\/evaluate), [torch](https:\/\/pytorch.org\/get-started\/locally\/), and [transformers](https:\/\/huggingface.co\/docs\/transformers\/installation). \n`**` Requires package [textstat](https:\/\/pypi.org\/project\/textstat). \n`\u2020` Requires package [evaluate](https:\/\/pypi.org\/project\/evaluate), [nltk](https:\/\/pypi.org\/project\/nltk), and [rouge-score](https:\/\/pypi.org\/project\/rouge-score). \n### Use a custom list of metrics \nYou can specify a custom list of metrics in the `extra_metrics` argument in `mlflow.evaluate`. \nTo add additional metrics to the default metrics list of pre-defined model type, keep the `model_type` parameter and add your metrics to `extra_metrics`. 
The following evaluates your model using all metrics for the `question-answering` model and `mlflow.metrics.latency()`. \n```\nresults = mlflow.evaluate(\nmodel,\neval_data,\ntargets=\"ground_truth\",\nmodel_type=\"question-answering\",\nextra_metrics=[mlflow.metrics.latency()],\n)\n\n``` \nTo disable default metric calculation and only calculate your selected metrics, remove the `model_type` argument and define the desired metrics. \n```\nresults = mlflow.evaluate(model,\neval_data,\ntargets=\"ground_truth\",\nextra_metrics=[mlflow.metrics.toxicity(), mlflow.metrics.latency()],\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/llm-evaluate.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Evaluate large language models with MLflow\n#### Metrics with LLM as a judge\n\nYou can also add pre-canned metrics that use LLM as the judge to the `extra_metrics` argument in `mlflow.evaluate()`. For a list of these LLM as the judge metrics, see [Metrics with LLM as the judge](https:\/\/mlflow.org\/docs\/latest\/llms\/llm-evaluate\/index.html#metrics-with-llm-as-the-judge). \nNote \nYou can also [Create custom LLM as the judge and heuristic based evaluation metrics](https:\/\/mlflow.org\/docs\/latest\/llms\/llm-evaluate\/index.html#creating-custom-llm-evaluation-metrics). 
\n```\nfrom mlflow.metrics.genai import answer_relevance\n\nanswer_relevance_metric = answer_relevance(model=\"openai:\/gpt-4\")\n\neval_df = pd.DataFrame() # Index(['inputs', 'predictions', 'context'], dtype='object')\n\neval_results = mlflow.evaluate(\ndata = eval_df, # evaluation data\nmodel_type=\"question-answering\",\npredictions=\"predictions\", # prediction column_name from eval_df\nextra_metrics=[answer_relevance_metric]\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/llm-evaluate.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Evaluate large language models with MLflow\n#### View evaluation results\n\n`mlflow.evaluate()` returns the evaluation results as an `mlflow.models.EvaluationResult` instance. \nTo see the score on selected metrics, you can check the following attributes of the evaluation result: \n* `metrics`: This stores the aggregated results, like average or variance across the evaluation dataset. The following takes a second pass on the code example above and focuses on printing out the aggregated results. \n```\nwith mlflow.start_run() as run:\nresults = mlflow.evaluate(\ndata=eval_data,\ntargets=\"ground_truth\",\npredictions=\"predictions\",\nextra_metrics=[mlflow.metrics.genai.answer_similarity()],\nevaluators=\"default\",\n)\nprint(f\"See aggregated evaluation results below: \\n{results.metrics}\")\n\n```\n* `tables['eval_results_table']`: This stores the per-row evaluation results. 
\n```\nwith mlflow.start_run() as run:\n    results = mlflow.evaluate(\n        data=eval_data,\n        targets=\"ground_truth\",\n        predictions=\"predictions\",\n        extra_metrics=[mlflow.metrics.genai.answer_similarity()],\n        evaluators=\"default\",\n    )\n    print(\n        f\"See per-data evaluation results below: \\n{results.tables['eval_results_table']}\"\n    )\n\n```\n\n### Evaluate large language models with MLflow\n#### LLM evaluation with MLflow example notebook\n\nThe following LLM evaluation with MLflow example notebook is a use-case oriented example. \n### LLM evaluation with MLflow example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/question-answering-evaluation.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/llm-evaluate.html"} +{"content":"# Discover data\n### Sample datasets\n\nThere are a variety of sample datasets provided by Databricks and made available by third parties that you can use in your Databricks [workspace](https:\/\/docs.databricks.com\/admin\/workspace\/index.html).\n\n### Sample datasets\n#### Unity Catalog datasets\n\n[Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) provides access to a number of sample datasets in the `samples` catalog. You can review these datasets in the [Catalog Explorer UI](https:\/\/docs.databricks.com\/catalog-explorer\/index.html) and reference them directly in a [notebook](https:\/\/docs.databricks.com\/notebooks\/index.html) or in the [SQL editor](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html) by using the `<catalog-name>.<schema-name>.<table-name>` pattern. \nThe `nyctaxi` schema (also known as a database) contains the table `trips`, which has details about taxi rides in New York City. 
The following statement returns the first 10 records in this table: \n```\nSELECT * FROM samples.nyctaxi.trips LIMIT 10\n\n``` \nThe `tpch` schema contains data from the [TPC-H Benchmark](https:\/\/www.tpc.org\/tpch\/). To list the tables in this schema, run: \n```\nSHOW TABLES IN samples.tpch\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/databricks-datasets.html"} +{"content":"# Discover data\n### Sample datasets\n#### Databricks datasets (databricks-datasets)\n\nDatabricks includes a variety of sample datasets mounted to [DBFS](https:\/\/docs.databricks.com\/dbfs\/index.html). \nNote \nThe availability and location of Databricks datasets are subject to change without notice. \n### Browse Databricks datasets \nTo browse these files from a Python, Scala, or R [notebook](https:\/\/docs.databricks.com\/notebooks\/index.html), you can use [Databricks Utilities (dbutils) reference](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html). The following code lists all of the available Databricks datasets. \n```\ndisplay(dbutils.fs.ls('\/databricks-datasets'))\n\n``` \n```\ndisplay(dbutils.fs.ls(\"\/databricks-datasets\"))\n\n``` \n```\n%fs ls \"\/databricks-datasets\"\n\n``` \n### Get information about Databricks datasets \nTo get more information about a Databricks dataset, you can use a [local file API](https:\/\/docs.databricks.com\/files\/index.html) to print out the dataset `README` (if one is available) by using a Python, R, or Scala [notebook](https:\/\/docs.databricks.com\/notebooks\/index.html), as shown in this code example. 
\n```\nf = open('\/dbfs\/databricks-datasets\/README.md', 'r')\nprint(f.read())\n\n``` \n```\nscala.io.Source.fromFile(\"\/dbfs\/databricks-datasets\/README.md\").foreach {\n  print\n}\n\n``` \n```\nlibrary(readr)\n\nf = read_lines(\"\/dbfs\/databricks-datasets\/README.md\", skip = 0, n_max = -1L)\nprint(f)\n\n``` \n### Create a table based on a Databricks dataset \nThis code example demonstrates how to use SQL in the [SQL editor](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html), or how to use SQL, Python, Scala, or R [notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html), to create a table based on a Databricks dataset: \n```\nCREATE TABLE default.people10m OPTIONS (PATH 'dbfs:\/databricks-datasets\/learning-spark-v2\/people\/people-10m.delta')\n\n``` \n```\nspark.sql(\"CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:\/databricks-datasets\/learning-spark-v2\/people\/people-10m.delta')\")\n\n``` \n```\nspark.sql(\"CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:\/databricks-datasets\/learning-spark-v2\/people\/people-10m.delta')\")\n\n``` \n```\nlibrary(SparkR)\nsparkR.session()\n\nsql(\"CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:\/databricks-datasets\/learning-spark-v2\/people\/people-10m.delta')\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/databricks-datasets.html"} +{"content":"# Discover data\n### Sample datasets\n#### Third-party sample datasets in CSV format\n\nDatabricks has built-in tools to quickly upload third-party sample datasets as comma-separated values (CSV) files into Databricks workspaces. Some popular third-party sample datasets available in CSV format: \n| Sample dataset | | To download the sample dataset as a CSV file\u2026 |\n| --- | --- | --- |\n| [The Squirrel Census](https:\/\/www.thesquirrelcensus.com\/data) | | On the **Data** webpage, click **Park Data**, **Squirrel Data**, or **Stories**. 
|\n| [OWID Dataset Collection](https:\/\/github.com\/owid\/owid-datasets) | | In the GitHub repository, click the **datasets** folder. Click the subfolder that contains the target dataset, and then click the dataset\u2019s CSV file. |\n| [Data.gov CSV datasets](https:\/\/catalog.data.gov\/dataset\/?res_format=CSV) | | On the search results webpage, click the target search result, and next to the **CSV** icon, click **Download**. |\n| [Diamonds](https:\/\/www.kaggle.com\/datasets\/shivam2503\/diamonds) (Requires a [Kaggle](https:\/\/www.kaggle.com\/account\/login) account) | | On the dataset\u2019s webpage, on the **Data** tab, next to **diamonds.csv**, click the **Download** icon. |\n| [NYC Taxi Trip Duration](https:\/\/www.kaggle.com\/c\/nyc-taxi-trip-duration) (Requires a [Kaggle](https:\/\/www.kaggle.com\/account\/login) account) | | On the dataset\u2019s webpage, on the **Data** tab, next to **sample\\_submission.zip**, click the **Download** icon. To find the dataset\u2019s CSV files, extract the contents of the downloaded ZIP file. |\n| [UFO Sightings](https:\/\/data.world\/timothyrenner\/ufo-sightings) (Requires a [data.world](https:\/\/data.world\/login) account) | | On the dataset\u2019s webpage, next to **nuforc\\_reports.csv**, click the **Download** icon. | \nTo use third-party sample datasets in your Databricks workspace, do the following: \n1. Follow the third party\u2019s instructions to download the dataset as a CSV file to your local machine.\n2. [Upload the CSV file](https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-data.html) from your local machine into your Databricks workspace.\n3. To work with the imported data, use Databricks SQL to [query the data](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html#create-a-query). 
Or you can use a [notebook](https:\/\/docs.databricks.com\/notebooks\/index.html) to [load the data as a DataFrame](https:\/\/docs.databricks.com\/delta\/tutorial.html#read).\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/databricks-datasets.html"} +{"content":"# Discover data\n### Sample datasets\n#### Third-party sample datasets within libraries\n\nSome third parties include sample datasets within [libraries](https:\/\/docs.databricks.com\/libraries\/index.html), such as [Python Package Index (PyPI)](https:\/\/pypi.org\/) packages or [Comprehensive R Archive Network (CRAN)](https:\/\/cran.r-project.org\/) packages. For more information, see the library provider\u2019s documentation. \n* To install a library on a Databricks [cluster](https:\/\/docs.databricks.com\/compute\/index.html) by using the cluster user interface, see [Cluster libraries](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html).\n* To install a Python library by using a Databricks [notebook](https:\/\/docs.databricks.com\/notebooks\/index.html), see [Notebook-scoped Python libraries](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html).\n* To install an R library by using a Databricks notebook, see [Notebook-scoped R libraries](https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/databricks-datasets.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Use XGBoost on Databricks\n##### Distributed training of XGBoost models using `xgboost.spark`\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThe Python package xgboost>=1.7 contains a new module `xgboost.spark`. This module includes the xgboost PySpark estimators `xgboost.spark.SparkXGBRegressor`, `xgboost.spark.SparkXGBClassifier`, and `xgboost.spark.SparkXGBRanker`. 
These new classes support the inclusion of XGBoost estimators in SparkML Pipelines. For API details, see the [XGBoost python spark API doc](https:\/\/xgboost.readthedocs.io\/en\/stable\/python\/python_api.html#module-xgboost.spark).\n\n##### Distributed training of XGBoost models using `xgboost.spark`\n###### Requirements\n\nDatabricks Runtime 12.0 ML and above.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost-spark.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Use XGBoost on Databricks\n##### Distributed training of XGBoost models using `xgboost.spark`\n###### `xgboost.spark` parameters\n\nThe estimators defined in the `xgboost.spark` module support most of the same parameters and arguments used in standard XGBoost. \n* The parameters for the class constructor, `fit` method, and `predict` method are largely identical to those in the `xgboost.sklearn` module.\n* Naming, values, and defaults are mostly identical to those described in [XGBoost parameters](https:\/\/xgboost.readthedocs.io\/en\/stable\/parameter.html).\n* Exceptions are a few unsupported parameters (such as `gpu_id`, `nthread`, `sample_weight`, `eval_set`), and the `pyspark` estimator specific parameters that have been added (such as `featuresCol`, `labelCol`, `use_gpu`, `validationIndicatorCol`). For details, see [XGBoost Python Spark API documentation](https:\/\/xgboost.readthedocs.io\/en\/stable\/python\/python_api.html#module-xgboost.spark).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost-spark.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Use XGBoost on Databricks\n##### Distributed training of XGBoost models using `xgboost.spark`\n###### Distributed training\n\nPySpark estimators defined in the `xgboost.spark` module support distributed XGBoost training using the `num_workers` parameter. 
To use distributed training, create a classifier or regressor and set `num_workers` to the number of concurrently running Spark tasks during distributed training. To use all Spark task slots, set `num_workers=sc.defaultParallelism`. \nFor example: \n```\nfrom xgboost.spark import SparkXGBClassifier\nclassifier = SparkXGBClassifier(num_workers=sc.defaultParallelism)\n\n``` \nNote \n* You cannot use `mlflow.xgboost.autolog` with distributed XGBoost. To log an xgboost Spark model using MLflow, use `mlflow.spark.log_model(spark_xgb_model, artifact_path)`.\n* You cannot use distributed XGBoost on a cluster that has autoscaling enabled. New worker nodes that start in this elastic scaling paradigm cannot receive new sets of tasks and remain idle. For instructions to disable autoscaling, see [Enable autoscaling](https:\/\/docs.databricks.com\/compute\/configure.html#autoscaling).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost-spark.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Use XGBoost on Databricks\n##### Distributed training of XGBoost models using `xgboost.spark`\n###### Enable optimization for training on sparse features dataset\n\nPySpark estimators defined in the `xgboost.spark` module support optimization for training on datasets with sparse features.\nTo enable optimization of sparse feature sets, you need to provide a dataset to the `fit` method that contains a features column consisting of values of type `pyspark.ml.linalg.SparseVector` and set the estimator parameter `enable_sparse_data_optim` to `True`. Additionally, you need to set the `missing` parameter to `0.0`. 
\nFor example: \n```\nfrom xgboost.spark import SparkXGBClassifier\nclassifier = SparkXGBClassifier(enable_sparse_data_optim=True, missing=0.0)\nclassifier.fit(dataset_with_sparse_features_col)\n\n```\n\n##### Distributed training of XGBoost models using `xgboost.spark`\n###### GPU training\n\nPySpark estimators defined in the `xgboost.spark` module support training on GPUs. Set the parameter `use_gpu` to `True` to enable GPU training. \nNote \nFor each Spark task used in XGBoost distributed training, only one GPU is used in training when the `use_gpu` argument is set to `True`. Databricks recommends using the default value of `1` for the Spark cluster configuration `spark.task.resource.gpu.amount`. Otherwise, the additional GPUs allocated to this Spark task are idle. \nFor example: \n```\nfrom xgboost.spark import SparkXGBClassifier\nclassifier = SparkXGBClassifier(num_workers=sc.defaultParallelism, use_gpu=True)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost-spark.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Use XGBoost on Databricks\n##### Distributed training of XGBoost models using `xgboost.spark`\n###### Example notebook\n\nThis notebook shows the use of the Python package `xgboost.spark` with Spark MLlib. 
\n### PySpark-XGBoost notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/xgboost-pyspark-new.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost-spark.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Use XGBoost on Databricks\n##### Distributed training of XGBoost models using `xgboost.spark`\n###### Migration guide for the deprecated `sparkdl.xgboost` module\n\n* Replace `from sparkdl.xgboost import XgboostRegressor` with `from xgboost.spark import SparkXGBRegressor` and replace `from sparkdl.xgboost import XgboostClassifier` with `from xgboost.spark import SparkXGBClassifier`.\n* Change all parameter names in the estimator constructor from camelCase style to snake\\_case style. For example, change `XgboostRegressor(featuresCol=XXX)` to `SparkXGBRegressor(features_col=XXX)`.\n* The parameters `use_external_storage` and `external_storage_precision` have been removed. `xgboost.spark` estimators use the DMatrix data iteration API to use memory more efficiently. There is no longer a need to use the inefficient external storage mode. For extremely large datasets, Databricks recommends that you increase the `num_workers` parameter, which makes each training task partition the data into smaller, more manageable data partitions. Consider setting `num_workers = sc.defaultParallelism`, which sets `num_workers` to the total number of Spark task slots in the cluster.\n* For estimators defined in `xgboost.spark`, setting `num_workers=1` executes model training using a single Spark task. This utilizes the number of CPU cores specified by the Spark cluster configuration setting `spark.task.cpus`, which is 1 by default. To use more CPU cores to train the model, increase `num_workers` or `spark.task.cpus`. 
You cannot set the `nthread` or `n_jobs` parameter for estimators defined in `xgboost.spark`. This behavior is different from the previous behavior of estimators defined in the deprecated `sparkdl.xgboost` package. \n### Convert `sparkdl.xgboost` model into `xgboost.spark` model \n`sparkdl.xgboost` models are saved in a different format than `xgboost.spark` models and have\n[different parameter settings](https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost-spark.html#xgboost-spark-parameters). Use the following\nutility function to convert the model: \n```\ndef convert_sparkdl_model_to_xgboost_spark_model(\n    xgboost_spark_estimator_cls,\n    sparkdl_xgboost_model,\n):\n    \"\"\"\n    :param xgboost_spark_estimator_cls:\n        `xgboost.spark` estimator class, e.g. `xgboost.spark.SparkXGBRegressor`\n    :param sparkdl_xgboost_model:\n        `sparkdl.xgboost` model instance e.g. the instance of\n        `sparkdl.xgboost.XgboostRegressorModel` type.\n\n    :return\n        A `xgboost.spark` model instance\n    \"\"\"\n\n    def convert_param_key(key):\n        from xgboost.spark.core import _inverse_pyspark_param_alias_map\n        if key == \"baseMarginCol\":\n            return \"base_margin_col\"\n        if key in _inverse_pyspark_param_alias_map:\n            return _inverse_pyspark_param_alias_map[key]\n        if key in ['use_external_storage', 'external_storage_precision', 'nthread', 'n_jobs', 'base_margin_eval_set']:\n            return None\n        return key\n\n    xgboost_spark_params_dict = {}\n    for param in sparkdl_xgboost_model.params:\n        if param.name == \"arbitraryParamsDict\":\n            continue\n        if sparkdl_xgboost_model.isDefined(param):\n            xgboost_spark_params_dict[param.name] = sparkdl_xgboost_model.getOrDefault(param)\n\n    xgboost_spark_params_dict.update(sparkdl_xgboost_model.getOrDefault(\"arbitraryParamsDict\"))\n\n    xgboost_spark_params_dict = {\n        convert_param_key(k): v\n        for k, v in xgboost_spark_params_dict.items()\n        if convert_param_key(k) is not None\n    }\n\n    booster = sparkdl_xgboost_model.get_booster()\n    booster_bytes = booster.save_raw(\"json\")\n    booster_config = 
booster.save_config()\n    estimator = xgboost_spark_estimator_cls(**xgboost_spark_params_dict)\n    sklearn_model = estimator._convert_to_sklearn_model(booster_bytes, booster_config)\n    return estimator._copyValues(estimator._create_pyspark_model(sklearn_model))\n\n# Example\nfrom xgboost.spark import SparkXGBRegressor\n\nnew_model = convert_sparkdl_model_to_xgboost_spark_model(\n    xgboost_spark_estimator_cls=SparkXGBRegressor,\n    sparkdl_xgboost_model=model,\n)\n\n``` \nIf you have a `pyspark.ml.PipelineModel` model containing a `sparkdl.xgboost` model as the\nlast stage, you can replace the stage of `sparkdl.xgboost` model with\nthe converted `xgboost.spark` model. \n```\npipeline_model.stages[-1] = convert_sparkdl_model_to_xgboost_spark_model(\n    xgboost_spark_estimator_cls=SparkXGBRegressor,\n    sparkdl_xgboost_model=pipeline_model.stages[-1],\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost-spark.html"} +{"content":"# What is data warehousing on Databricks?\n## Get started with data warehousing using Databricks SQL\n#### Visualize queries and create a legacy dashboard\n\nThis tutorial uses the New York City taxi dataset in Samples. It shows you how to use SQL editor in Databricks SQL to create a visualization for each of several queries and then create a dashboard using these visualizations. It also shows you how to create a dashboard parameter for each of the visualizations in the dashboard. \nNote \nDashboards (formerly Lakeview dashboards) are now generally available. \n* Databricks recommends authoring new dashboards using the latest tooling. See [Dashboards](https:\/\/docs.databricks.com\/dashboards\/index.html).\n* Original Databricks SQL dashboards are now called **legacy dashboards**. They will continue to be supported and updated with critical bug fixes, but new functionality will be limited. 
You can continue to use legacy dashboards for both authoring and consumption.\n* Convert legacy dashboards using the migration tool or REST API. See [Clone a legacy dashboard to a Lakeview dashboard](https:\/\/docs.databricks.com\/dashboards\/clone-legacy-to-lakeview.html) for instructions on using the built-in migration tool. See [Use Databricks APIs to manage dashboards](https:\/\/docs.databricks.com\/dashboards\/tutorials\/index.html#apis) for tutorials on creating and managing dashboards using the REST API.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/get-started\/visualize-data-tutorial.html"} +{"content":"# What is data warehousing on Databricks?\n## Get started with data warehousing using Databricks SQL\n#### Visualize queries and create a legacy dashboard\n##### Connect to Databricks SQL with SQL editor\n\n1. Click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar and select **Query**. \nThe SQL editor opens.\n2. Select a warehouse. \nThe first time you create a query the list of available SQL warehouses displays in alphabetical order. The next time you create a query, the last used warehouse is selected. \n3. Click **Serverless Starter Warehouse**. This warehouse is created for you automatically to help you get started quickly. If serverless is not enabled for your workspace, choose **Starter Warehouse**. For information on creating SQL warehouses, see [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/get-started\/visualize-data-tutorial.html"} +{"content":"# What is data warehousing on Databricks?\n## Get started with data warehousing using Databricks SQL\n#### Visualize queries and create a legacy dashboard\n##### Query for pickup hour distribution\n\n1. In SQL editor, paste the following query in the new query window to return the distribution of taxi pickups by hour. 
\n```\nSELECT\ndate_format(tpep_pickup_datetime, \"HH\") AS `Pickup Hour`,\ncount(*) AS `Number of Rides`\nFROM\nsamples.nyctaxi.trips\nGROUP BY 1\n\n```\n2. Press **Ctrl\/Cmd + Enter** or click **Run (1000)**. After a few seconds, the query results are shown below the query in the results pane. \n**Limit 1000** is selected by default for all queries to ensure that the query returns at most 1000 rows. If a query is saved with the **Limit 1000** setting, this setting applies to all executions of the query (including within dashboards). If you want to return all rows for this query, you can unselect **LIMIT 1000** by clicking the **Run (1000)** drop-down. If you want to specify a different limit on the number of rows, you can add a `LIMIT` clause in your query with a value of your choice. \nThe query result displays in the Results tab.\n3. Click **Save** and save the query as `Pickup hour`. \n![Results of your first query nyc taxi query](https:\/\/docs.databricks.com\/_images\/first-nyc-query-results.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/get-started\/visualize-data-tutorial.html"} +{"content":"# What is data warehousing on Databricks?\n## Get started with data warehousing using Databricks SQL\n#### Visualize queries and create a legacy dashboard\n##### Create a visualization for the distribution of taxi pickups by hour.\n\n1. Next to the Results tab, click **+** and then click **Visualization**. \nThe visualization editor displays.\n2. In the **Visualization Type** drop-down, verify that **Bar** is selected.\n3. Change the visualization name to `Bar chart`.\n4. Verify that `Pickup Hour` is specified for the **Y column** drop down.\n5. Verify that `Number of Rides` and `Sum` are specified for the **X column** drop down. \n![Pickup hour distribution](https:\/\/docs.databricks.com\/_images\/pickup_distribution.png)\n6. Click **Save**. 
\nThe saved chart displays in the SQL editor.\n\n#### Visualize queries and create a legacy dashboard\n##### Query for daily fare trends\n\n1. In SQL editor, click **+** and then click **Create new query**.\n2. In the new query window, paste the following query to return the daily fare trends. \n```\nSELECT\nT.weekday,\nCASE\nWHEN T.weekday = 1 THEN 'Sunday'\nWHEN T.weekday = 2 THEN 'Monday'\nWHEN T.weekday = 3 THEN 'Tuesday'\nWHEN T.weekday = 4 THEN 'Wednesday'\nWHEN T.weekday = 5 THEN 'Thursday'\nWHEN T.weekday = 6 THEN 'Friday'\nWHEN T.weekday = 7 THEN 'Saturday'\nELSE 'N\/A'\nEND AS day_of_week,\nT.fare_amount,\nT.trip_distance\nFROM\n(\nSELECT\ndayofweek(tpep_pickup_datetime) as weekday,\n*\nFROM\n`samples`.`nyctaxi`.`trips`\n) T\n\n```\n3. Click **Save** and save the query as `Daily fare to distance analysis`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/get-started\/visualize-data-tutorial.html"} +{"content":"# What is data warehousing on Databricks?\n## Get started with data warehousing using Databricks SQL\n#### Visualize queries and create a legacy dashboard\n##### Create a visualization for daily fare trends\n\n1. Next to the **Results** tab, click **+** and then click **Visualization**. \nThe visualization editor displays.\n2. In the **Visualization Type** drop-down, select **Scatter**.\n3. Change the visualization name to `Fare by distance`.\n4. On the **General** tab, set the value for the **X column** to `trip_distance` and set the value for the **Y columns** to `fare_amount`.\n5. In the **Group by** drop-down, set the value to `day_of_week`.\n6. On the **X axis** tab, set the **Name** value to `Trip distance (miles)`.\n7. On the **Y axis** tab, set the **Name** value to `Fare Amount (USD)`.\n8. Click **Save** \nThe saved chart displays in the SQL editor. 
\n![Daily fare trend](https:\/\/docs.databricks.com\/_images\/daily_fare_trend.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/get-started\/visualize-data-tutorial.html"} +{"content":"# What is data warehousing on Databricks?\n## Get started with data warehousing using Databricks SQL\n#### Visualize queries and create a legacy dashboard\n##### Create a dashboard using these visualizations\n\n1. Click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar and select **Legacy dashboard**.\n2. Set the dashboard name to `NYC taxi trip analysis`.\n3. Click **Save**. \n4. In the **Choose warehouse** drop-down list, select **Serverless Starter Warehouse**. If serverless is not enabled for your workspace, choose **Starter Warehouse**. \n5. In the **Add** drop-down list, click **Visualization**.\n6. In the **Add visualization widget** window, select the **Daily fare to distance analysis** query.\n7. In the **Select existing visualization** list, select **Fare by distance**.\n8. In the **Title** text box, enter `Daily fare trends`. \n![Add visualization widget](https:\/\/docs.databricks.com\/_images\/add-visualization-widget.png)\n9. Click **Add to legacy dashboard**. \nThe Daily fare trends visualization appears on the dashboard design surface.\n10. To add a second widget to the dashboard, click **Visualization** in the **Add** drop-down list.\n11. In the **Add visualization widget** window, select the **Pickup hour** query.\n12. In the **Select existing visualization** list, select **Bar chart**.\n13. In the **Title** text box, enter `Pickup hour distribution`.\n14. Click **Add to legacy dashboard**.\n15. Resize this visualization to match the width of the first visualization in the dashboard.\n16. Click **Done Editing**. 
\n![Initial dashboard](https:\/\/docs.databricks.com\/_images\/dashboard.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/get-started\/visualize-data-tutorial.html"} +{"content":"# What is data warehousing on Databricks?\n## Get started with data warehousing using Databricks SQL\n#### Visualize queries and create a legacy dashboard\n##### Add a pickup zip code parameter to each query\n\n1. In SQL editor, open the **Daily fare to distance analysis** query.\n2. Add the following `WHERE` clause to the **Daily fare to distance analysis** query to filter the query by pickup zip code. \n```\nWHERE\npickup_zip IN ({{ pickupzip }})\n\n```\n3. In the **pickupzip** text box, enter `10018` and then click **Apply changes** to execute the query with the pickup zip code parameter.\n4. Click **Save**.\n5. Open the **Pickup hour** query.\n6. Add the following `WHERE` clause to the **Pickup hour** query to filter the query by the pickup zip code. Add this clause before the `GROUP BY` clause. \n```\nWHERE\npickup_zip IN ({{ pickupzip }})\n\n```\n7. In the **pickupzip** text box, enter `10018` and then click **Apply changes** to execute the query with the pickup zip code filter.\n8. Click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/get-started\/visualize-data-tutorial.html"} +{"content":"# What is data warehousing on Databricks?\n## Get started with data warehousing using Databricks SQL\n#### Visualize queries and create a legacy dashboard\n##### Update the dashboard to use a dashboard parameter\n\n1. Open the **NYC taxi trip analysis** dashboard. \nEach of the visualizations now includes a parameter for the pickup zip code. \n![Widget - parameters](https:\/\/docs.databricks.com\/_images\/widget_parameters.png)\n2. Click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) for this dashboard and then click **Edit**.\n3. 
Click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) for the **Daily fare trends** visualization and then click **Change widget settings**.\n4. In the **Parameters** section, click the pencil icon ![Edit icon](https:\/\/docs.databricks.com\/_images\/pencil-edit-icon.png) for the **Widget parameter** in the **Value** field. \n![View widget parameters](https:\/\/docs.databricks.com\/_images\/widget_parameters.png)\n5. In the **Edit source and Value** window, change the **Source** to **New dashboard parameter**. \n![Change widget parameters to new dashboard parameters](https:\/\/docs.databricks.com\/_images\/new_dashboard_parameter.png)\n6. Click **OK** and then click **Save**. \nThe **pickupzip** dashboard parameter appears and the widget parameter for the **Daily fare trends** visualization no longer appears.\n7. Click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) for the **Pickup hour distribution** visualization and then click **Change widget settings**.\n8. In the **Parameters** section, click the pencil icon ![Edit icon](https:\/\/docs.databricks.com\/_images\/pencil-edit-icon.png) for the **Widget parameter** in the **Value** field.\n9. In the **Edit source and Value** window, change the **Source** to **Existing dashboard parameter**.\n10. Verify that **pickupzip** is selected as the **Key** value.\n11. Click **OK** and then click **Save**. \nThe widget parameter for the **Pickup hour distribution** visualization no longer appears.\n12. Click **Done editing**.\n13. Change the value of the **pickupzip** dashboard parameter to `10017` and then click **Apply changes**. \nEach of the visualizations now displays the data for pickups in the 10017 zip code. 
\n![Change widget parameters to new dashboard parameters](https:\/\/docs.databricks.com\/_images\/dashboard_parameters.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/get-started\/visualize-data-tutorial.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n#### Pandas API on Spark\n\nNote \nThis feature is available on clusters that run [Databricks Runtime 10.0 (unsupported)](https:\/\/docs.databricks.com\/archive\/runtime-release-notes\/10.0.html) and above. For clusters that run [Databricks Runtime 9.1 LTS](https:\/\/docs.databricks.com\/release-notes\/runtime\/9.1lts.html) and below, use [Koalas](https:\/\/docs.databricks.com\/archive\/legacy\/koalas.html) instead. \nCommonly used by data scientists, [pandas](https:\/\/pandas.pydata.org) is a Python package that provides easy-to-use data structures and data analysis tools for the Python programming language. However, pandas does not scale out to big data. Pandas API on Spark fills this gap by providing pandas equivalent APIs that work on Apache Spark. Pandas API on Spark is useful not only for pandas users but also PySpark users, because pandas API on Spark supports many tasks that are difficult to do with PySpark, for example plotting data directly from a PySpark DataFrame.\n\n#### Pandas API on Spark\n##### Requirements\n\nPandas API on Spark is available beginning in Apache Spark 3.2 (which is included beginning in [Databricks Runtime 10.0 (unsupported)](https:\/\/docs.databricks.com\/archive\/runtime-release-notes\/10.0.html)) by using the following `import` statement: \n```\nimport pyspark.pandas as ps\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/pandas\/pandas-on-spark.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n#### Pandas API on Spark\n##### Notebook\n\nThe following notebook shows how to migrate from pandas to pandas API on Spark. 
\n### pandas to pandas API on Spark notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/pandas-to-pandas-api-on-spark-in-10-minutes.html)\n\n#### Pandas API on Spark\n##### Resources\n\n* [Pandas API on Spark overview](https:\/\/spark.apache.org\/pandas-on-spark\/)\n* [Pandas API on Spark user guide](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/user_guide\/pandas_on_spark\/index.html)\n* [Migrating from Koalas to pandas API on Spark](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/migration_guide\/koalas_to_pyspark.html)\n* [Pandas API on Spark reference](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.pandas\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/pandas\/pandas-on-spark.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Enable AWS PrivateLink\n\nThis article explains how to use AWS PrivateLink to enable private connectivity between users and their Databricks workspaces and between clusters on the classic compute plane and core services on the control plane within the Databricks workspace infrastructure.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Enable AWS PrivateLink\n###### Overview\n\nAWS PrivateLink provides private connectivity from AWS VPCs and on-premises networks to AWS services without exposing the traffic to the public network. 
Databricks workspaces support PrivateLink connections for two connection types: \n* **Front-end (user to workspace)**: A front-end PrivateLink connection allows users to connect to the Databricks web application, REST API, and Databricks Connect API over a VPC interface endpoint.\n* **Back-end (compute plane to control plane)**: Databricks Runtime clusters in a customer-managed VPC (the [compute plane](https:\/\/docs.databricks.com\/getting-started\/overview.html)) connect to a Databricks workspace\u2019s core services (the [control plane](https:\/\/docs.databricks.com\/getting-started\/overview.html)) in the Databricks cloud account. Clusters connect to the control plane for two destinations: REST APIs (such as the Secrets API) and the [secure cluster connectivity](https:\/\/docs.databricks.com\/security\/network\/classic\/secure-cluster-connectivity.html) relay. This PrivateLink connection type involves two different VPC interface endpoints because of the two different destination services. \nYou can implement both front-end and back-end PrivateLink connections or just one of them. This article discusses how to configure either one or both PrivateLink connection types. If you implement PrivateLink for both the front-end and back-end connections, you can optionally mandate private connectivity for the workspace, which means Databricks rejects any connections over the public network. If you decline to implement any one of these connection types, you cannot enforce this requirement. \nTo enable PrivateLink connections, you must create Databricks configuration objects and add new fields to existing configuration objects. \nTo create configuration objects and create (or update) a workspace, this article describes how to [use the account console](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#account-console) or [use the Account API](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#account-api). 
\nThe following table describes important terminology. \n| Terminology | Description |\n| --- | --- |\n| AWS PrivateLink | An AWS technology that provides private connectivity from AWS VPCs and on-premises networks to AWS services without exposing the traffic to the public network. |\n| Front-end PrivateLink | The PrivateLink connection for users to connect to the Databricks web application, REST API, and Databricks Connect API. |\n| Back-end PrivateLink | The PrivateLink connection for the [compute plane](https:\/\/docs.databricks.com\/getting-started\/overview.html) in your AWS account to connect to the Databricks [control plane](https:\/\/docs.databricks.com\/getting-started\/overview.html). |\n| AWS VPC endpoint service | An AWS VPC endpoint service is a PrivateLink-powered service. Each Databricks control plane (typically one per region) publishes two AWS VPC endpoint services for PrivateLink. The workspace VPC endpoint service applies to both the Databricks front-end PrivateLink connection and the Databricks back-end PrivateLink connection for REST APIs. Databricks publishes another VPC endpoint service for its [secure cluster connectivity](https:\/\/docs.databricks.com\/security\/network\/classic\/secure-cluster-connectivity.html) relay. |\n| AWS VPC endpoint | An [AWS VPC interface endpoint](https:\/\/docs.aws.amazon.com\/vpc\/latest\/privatelink\/vpce-interface.html) enables private connections between your VPC and VPC endpoint services powered by AWS PrivateLink. You must create AWS VPC interface endpoints and then register them with Databricks. Registering a VPC endpoint creates a Databricks-specific object called a VPC endpoint registration that references the AWS VPC endpoint. |\n| Databricks network configuration | A Databricks object that describes the important information about a [customer-managed VPC](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html). 
If you implement any PrivateLink connection (front-end or back-end), your workspace must use a customer-managed VPC. For PrivateLink back-end support only, your network configuration needs an extra property that identifies the VPC endpoints for the back-end connection. |\n| Databricks private access settings object | A Databricks object that describes a workspace\u2019s PrivateLink connectivity. You must attach a private access settings object to the workspace during workspace creation, whether using front-end, back-end, or both. It expresses your intent to use AWS PrivateLink with your workspace. It controls your settings for the front-end use case of AWS PrivateLink for public network access. It controls which VPC endpoints are permitted to access your workspace. |\n| Databricks workspace configuration object | A Databricks object that describes a workspace. To enable PrivateLink, this object must reference Databricks private access settings object. For back-end PrivateLink, the workspace must also have a Databricks network configuration object with two extra fields that specify which VPC endpoint registrations to use, one for control plane\u2019s secure cluster connectivity relay and the other connects to the workspace to access REST APIs. | \n### Updates of existing PrivateLink configuration objects \nThis article focuses on the main two use cases of creating a new workspace or enabling PrivateLink on a workspace. You also can make other configuration changes to related objects using the UI or API: \n* You can enable PrivateLink support for front-end, back-end, or both types of connectivity on a new or existing workspace. Add a private access settings object ([UI](https:\/\/docs.databricks.com\/admin\/workspace\/update-workspace.html) or [API](https:\/\/docs.databricks.com\/api\/account\/introduction)). 
To do so, create a new network configuration with new settings, for example for a new VPC or different PrivateLink support settings, and then update the workspace to use the new network configuration. Note that you cannot remove (downgrade) any existing front-end or back-end PrivateLink support on a workspace.\n* Add or update a workspace\u2019s registered VPC endpoints by creating a new network configuration object with registered VPC endpoints and then update the workspace\u2019s network configuration ([UI](https:\/\/docs.databricks.com\/admin\/workspace\/update-workspace.html) or [API](https:\/\/docs.databricks.com\/api\/account\/introduction)).\n* For more information about what kinds of workspace fields can be changed on failed or running workspaces, see information about this task by using the [UI](https:\/\/docs.databricks.com\/admin\/workspace\/update-workspace.html) or [API](https:\/\/docs.databricks.com\/api\/account\/introduction). \nNote that not all related objects can be updated. Where update is not possible, create new objects and set their parent objects to reference the new objects. The following rules apply both to the account console UI and the Account API: \n| Object | Can be created | Can be updated |\n| --- | --- | --- |\n| Workspace configurations | Yes | Yes |\n| Private access settings | Yes | Yes |\n| Network configurations | Yes | **No** |\n| VPC endpoint registrations | Yes | **No** | \nTo update CIDR ranges on an existing VPC, see [Updating CIDRs](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#update-cidr). \n### Network flow \nThe following diagram shows the network flow in a typical implementation. 
\n![PrivateLink network architecture](https:\/\/docs.databricks.com\/_images\/privatelink-network.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Enable AWS PrivateLink\n###### Requirements\n\n**Databricks account** \n* Your Databricks account is on the [Enterprise pricing tier](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n* You have your Databricks account ID. Get your account ID from the [account console](https:\/\/docs.databricks.com\/admin\/account-settings\/index.html#account-id). \n**Databricks workspace** \n* Your Databricks workspace must use [Configure a customer-managed VPC](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html) to add any PrivateLink connection (even a front-end-only connection). Note that you cannot update an existing workspace with a Databricks-managed VPC and change it to use a customer-managed VPC.\n* If you implement the back-end PrivateLink connection, your Databricks workspace must use [Secure cluster connectivity](https:\/\/docs.databricks.com\/security\/network\/classic\/secure-cluster-connectivity.html). To add back-end PrivateLink to an older existing workspace that does not use secure cluster connectivity, contact your Databricks account team. \nNote \nThe `us-west-1` region does not support PrivateLink. \n**AWS account permissions** \n* If you are the user who sets up PrivateLink, you must have all necessary AWS permissions to provision a Databricks workspace and to provision new VPC endpoints for your workspace. 
\n**Network architecture** \n* To implement the front-end PrivateLink connection to access the workspace from your on-premises network, add private connectivity from the on-premises network to an AWS VPC using either Direct Connect or VPN.\n* For guidance for other network objects, see [Step 1: Configure AWS network objects](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#create-vpc).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Enable AWS PrivateLink\n###### Step 1: Configure AWS network objects\n\nYou can use the AWS Management Console to create these objects or automate the process with tools such as the [Terraform provider for networks](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/mws_networks). \nTo configure a VPC, subnets, and security groups: \n1. Set up a VPC for your workspace if you haven\u2019t already done so. You may re-use a VPC from another workspace, but you must create separate subnets for each workspace. Every workspace requires at least two private subnets. \n1. To create a VPC, see [Configure a customer-managed VPC](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html). If you are updating a workspace for PrivateLink rather than creating a new workspace, note that the workspace must already be using a customer-managed VPC.\n2. On your VPC, ensure that you enable both of the settings **DNS Hostnames** and **DNS resolution**.\n3. 
Ensure that the network ACLs for the subnets have **bidirectional** (outbound and inbound) rules that allow TCP access to 0.0.0.0\/0 for these ports: \n* 443: for Databricks infrastructure, cloud data sources, and library repositories\n* 3306: for the metastore\n* 6666: for PrivateLink\n* 2443: only for use with compliance security profile\n* 8443 through 8451: Future extendability. Ensure these [ports are open by January 31, 2024](https:\/\/docs.databricks.com\/release-notes\/product\/2023\/august.html#aws-new-egress-ports).\nImportant \nIf your workspace uses the [compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html), you must also allow **bidirectional** (outbound and inbound) access to port 2443 to support FIPS endpoints for the secure cluster connectivity relay.\n2. For back-end PrivateLink: \n1. Create and configure an extra VPC subnet (optional): \n* For your VPC endpoints, including back-end PrivateLink VPC endpoints and also any [optional VPC endpoints to other AWS services](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#optional-vpce), you can create them in any of your workspace subnets as long as the network can route to the VPC endpoints.\n* Attach a separate route table to your VPC endpoints subnet, which would be different from the route table attached to your workspace subnets. The route table for your VPC endpoints subnet needs only a single default route for the local VPC.\n2. Create and configure an extra security group (recommended but optional): \n* In addition to the security group that is normally required for a workspace, create a separate security group that allows HTTPS\/443 and TCP\/6666 **bidirectional** (outbound and inbound) access to both the workspace subnets as well as the separate VPC endpoints subnet if you created one. This configuration allows access for both the workspace for REST APIs (port 443) and for secure cluster connectivity (6666). 
This makes it easy to share the security group for both purposes. \nImportant \nIf your workspace uses the [compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html), you must also allow **bidirectional** (outbound and inbound) access to port 2443 to support FIPS endpoints for the secure cluster connectivity relay.\n3. For front-end PrivateLink: \n* For your transit VPC and its subnets, ensure they are reachable from the user environment. Create a transit VPC that terminates your AWS Direct Connect or VPN gateway connection or one that is routable from your transit VPC. \nIf you enable both front-end and back-end PrivateLink, you can optionally share the front-end workspace (web application) VPC endpoint with the back-end workspace (REST API) VPC endpoint if the VPC endpoint is network accessible from the workspace subnets.\n* Create a new security group for the front-end endpoint. The security group must allow HTTPS (port 443) **bidirectional** (outbound and inbound) access for both the source network and the endpoint subnet itself.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Enable AWS PrivateLink\n###### Step 2: Create VPC endpoints\n\n### Back-end VPC endpoints \nFor back-end PrivateLink, you create two VPC endpoints. One is for the [secure cluster connectivity](https:\/\/docs.databricks.com\/security\/network\/classic\/secure-cluster-connectivity.html) relay. One is for the workspace, which allows compute plane calls to Databricks REST APIs. For general documentation on VPC endpoint management with the AWS Management Console, see the AWS article [Create VPC endpoints in the AWS Management Console](https:\/\/docs.aws.amazon.com\/vpc\/latest\/privatelink\/vpce-interface.html). 
When you create the VPC endpoints, it\u2019s important to enable the field under **Additional settings** that the AWS Management Console page for creating VPC endpoints calls **Enable DNS name**. As a terminology note, AWS in some places refers to this same field as **Enable Private DNS** or **Enable Private DNS on this endpoint** when you view or edit a VPC endpoint. \nFor tools that help you automate creating and managing VPC endpoints, see the AWS articles [CloudFormation: Creating VPC Endpoint](https:\/\/docs.aws.amazon.com\/AWSCloudFormation\/latest\/UserGuide\/aws-resource-ec2-vpcendpoint.html) and [AWS CLI: create-vpc-endpoint](https:\/\/docs.aws.amazon.com\/cli\/latest\/reference\/ec2\/create-vpc-endpoint.html). \nYou can share the back-end VPC endpoints across multiple workspaces that use the same customer-managed VPC. Whether you share the back-end VPC endpoints across multiple workspaces depends on your organization\u2019s AWS architecture best practices and your overall throughput requirements across all workloads. \n* If you decide to share them across workspaces, you must create the back-end VPC endpoints in a separate subnet that is routable from the subnets of all the workspaces. For guidance, contact your Databricks account team.\n* You can also share VPC endpoints across workspaces from multiple Databricks accounts as long as the workspaces share the same customer-managed VPC, in which case you need to register the VPC endpoints in each Databricks account. \nThe following procedure uses the AWS Management Console. You can also automate this step using the [Terraform provider for VPC endpoints](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/mws_vpc_endpoint). \nTo create the back-end VPC endpoints in the AWS Management Console: \n1. Go to the [VPC endpoints](https:\/\/console.aws.amazon.com\/vpc\/home#Endpoints:) section of the AWS Management Console.\n2. 
Use the region picker in the upper right next to your account name picker and confirm you are using the region that matches the region you will use for your workspace. If needed, change the region using the region picker.\n3. Create the VPC endpoint: \n1. Click **Create Endpoint**.\n2. Give the endpoint a name that indicates the region and the purpose of the VPC endpoint. For the workspace VPC endpoint, Databricks recommends that you include the region and the word `workspace`, such as `databricks-us-west-2-workspace-vpce`.\n3. Under **Service Category**, choose **Other endpoint services**.\n4. In the service name field, paste in the service name. Use the table in [Regional endpoint reference](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#regional-endpoints) to get the two regional service names for your region. \nFor your first VPC endpoint that you create, copy the regional service name for the workspace (REST API).\n5. Click **Verify service**. Confirm the page reports in a green box **Service name verified**. If you see an error \u201cService name could not be verified\u201d, check whether you have correctly matched the regions of your VPC, subnets, and your new VPC endpoint.\n6. In the **VPC** field, select your VPC. Choose your workspace VPC.\n7. In the **Subnets** section, choose exactly one of your Databricks workspace subnets. For related discussion, see [Step 1: Configure AWS network objects](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#create-vpc).\n8. In the **Security groups** section, choose the security group you created for back-end connections in [Step 1: Configure AWS network objects](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#create-vpc).\n9. Click to expand the **Additional settings** section.\n10. Ensure that the endpoint has the **Enable DNS name** field enabled. 
As a terminology note, this is the same field that AWS in some places refers to as **Enable Private DNS** or **Enable Private DNS on this endpoint** when viewing or editing a VPC endpoint.\n11. Click **Create endpoint**.\n4. Repeat the above procedure and use the table in [Regional endpoint reference](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#regional-endpoints) to get the regional service name for the secure cluster connectivity relay. Give the endpoint a name that indicates the region and the purpose of the VPC endpoint. Databricks recommends that you include the region and the word `scc`, such as `databricks-us-west-2-scc-vpce`. \n### Front-end VPC endpoints \nA front-end endpoint originates in your transit VPC, which is usually the source of user web application access. Typically, it is a VPC that is connected to an on-premises network. This is generally a separate VPC from the workspace\u2019s compute plane VPC. Although the Databricks VPC endpoint service is the same shared service for the front-end connection and the back-end REST API connection, in typical implementations, connections originate from two separate VPCs and thus need separate AWS VPC endpoints that originate in each VPC. \nIf you have multiple Databricks accounts, you can share a front-end VPC endpoint across Databricks accounts. Register the endpoint in each relevant Databricks account. \nThe following procedure uses the AWS Management Console. You can also automate this step using the [Terraform provider for VPC endpoints](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/mws_vpc_endpoint). \nTo create the front-end VPC endpoints in the AWS Management Console: \n1. Go to the [VPC endpoints](https:\/\/console.aws.amazon.com\/vpc\/home#Endpoints:) section of the AWS Management Console.\n2. 
Use the region picker in the upper right next to your account name picker and confirm you are using the region that matches the **transit VPC region**, which in some cases might be different than your workspace region. If needed, change the region using the region picker.\n3. Create the VPC endpoint: \n1. Click **Create Endpoint**.\n2. Give the endpoint a name that indicates the region and the purpose of the VPC endpoint. For the workspace VPC endpoint, Databricks recommends that you include the region and the word `workspace` or `frontend`, such as `databricks-us-west-2-workspace-vpce`.\n3. Under **Service Category**, choose **Other endpoint services**.\n4. In the service name field, paste in the service name. Use the table in [Regional endpoint reference](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#regional-endpoints) to find the regional service names. Copy the one labelled **Workspace (including REST API)**.\n5. Click **Verify service**. Confirm the page reports in a green box **Service name verified**. If you see an error \u201cService name could not be verified\u201d, check whether you have correctly matched the regions of your VPC, subnets, and your new VPC endpoint.\n6. In the **VPC** menu, click your transit VPC.\n7. In the **Subnets** section, choose a subnet. For related discussion, see [Step 1: Configure AWS network objects](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#create-vpc).\n8. In the **Security groups** section, choose the security group you created for front-end connections in [Step 1: Configure AWS network objects](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#create-vpc).\n9. Click **Create endpoint**. \n### Regional endpoint reference \nGet your region\u2019s VPC endpoint service domains from the table at [PrivateLink VPC endpoint services](https:\/\/docs.databricks.com\/resources\/supported-regions.html#privatelink). 
\nNote \nIf you use the account console to create your network configuration, the UI refers to the workspace VPC endpoint as the VPC endpoint for REST APIs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Enable AWS PrivateLink\n###### Step 3: Register PrivateLink objects and attach them to a workspace\n\nYou can do this step the following ways: \n* [Use the account console](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#use-the-account-console)\n* [Use the Account API](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#use-the-account-api)\n* [Use Terraform](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#use-terraform) \n### [Use the account console](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#id1) \nYou can use the [account console](https:\/\/docs.databricks.com\/admin\/account-settings\/index.html#account-console) to register your VPC endpoints, create and register other required workspace resources, and finally create a new workspace with PrivateLink. \nWithin the account console, several types of objects are relevant for PrivateLink configuration: \n* **VPC endpoint registrations (required for front-end, back-end, or both)**: After creating VPC endpoints in the AWS Management Console (see the [previous step](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#regional-endpoints)), register them in Databricks to create VPC endpoint registrations. See the [account console\u2019s page for VPC endpoints](https:\/\/accounts.cloud.databricks.com\/cloud-resources\/network\/vpc-endpoints).\n* **Network configurations (required for back-end VPC endpoints)**: Network configurations represent information about a customer-managed VPC. 
They also contain two back-end PrivateLink configuration fields. Add these two fields in the network configuration object. They must reference the two back-end VPC endpoints that you created in AWS. See the [account console\u2019s page for network configurations](https:\/\/accounts.cloud.databricks.com\/cloud-resources\/network\/network-configurations). If you have an existing network configuration and you want to add fields for PrivateLink, you must create a new network configuration.\n* **Private access configurations (required for front-end, back-end, or both)**: A workspace\u2019s private access configuration object encapsulates a few settings about AWS PrivateLink connectivity. Create a new private access settings object just for this workspace, or share one among multiple workspaces in the same AWS region. This object serves several purposes. It expresses your intent to use AWS PrivateLink with your workspace. It controls your settings for the front-end use case of AWS PrivateLink for public network access. It controls which VPC endpoints are permitted to access your workspace. \nThere are two ways to use the account console to define cloud resources for a workspace. \n* **Create resources in advance**: You can create relevant cloud resources first before you create your workspace in the [cloud resources area of the account console](https:\/\/accounts.cloud.databricks.com\/cloud-resources). This is useful if you might not be able to do all the steps at the same time or if different teams perform network setup and create workspaces.\n* **Within the workspace creation page, add configurations as needed**: On the page that creates (or updates) a workspace, there are pickers for different cloud resources. In most cases, there are picker items that let you create that resource immediately in a pop-up view. For example, a network configuration picker has an option **Add a new network configuration**. 
\nThis article describes how to create resources in advance and then reference them. You can use the other approach if that works better for you. See the editors for [VPC endpoints](https:\/\/docs.databricks.com\/security\/network\/classic\/vpc-endpoints.html), [network configurations](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/networks.html), and [private access settings](https:\/\/docs.databricks.com\/security\/network\/classic\/private-access-settings.html). \n#### Step 3a: Register your VPC endpoints (for front-end, back-end, or both) \nFollow the instructions in [Manage VPC endpoint registrations](https:\/\/docs.databricks.com\/security\/network\/classic\/vpc-endpoints.html). \n* For back-end PrivateLink, register the back-end VPC endpoints you created and name the configurations for their purposes. For example, add `-scc` for secure cluster connectivity and `-workspace` for the workspace (REST API) VPC endpoint registration. For back-end VPC endpoints, the region field must match your workspace region and the region of the AWS VPC endpoints that you are registering. However, Databricks validates this only during workspace creation (or when updating a workspace with PrivateLink), so it is critical that you carefully set the region in this step.\n* For front-end PrivateLink, register the front-end VPC endpoint you created in the transit VPC. For front-end PrivateLink, the region field must match your transit VPC region and the region of the AWS VPC endpoint for the workspace for the front-end connection. \n#### Step 3b: Create a network configuration (for back-end) \nFollow the instructions in [Create network configurations for custom VPC deployment](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/networks.html). 
For detailed requirements for customer-managed VPC along with its associated subnets and security groups, see [Configure a customer-managed VPC](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html). The most important fields for PrivateLink are under the heading **Back-end private connectivity**. There are two fields where you choose your back-end VPC endpoint registrations that you [created in the previous step](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#create-vpce). For the first one, select the VPC endpoint registration for the *secure cluster connectivity relay*. For the other choose the VPC endpoint registration for the *workspace (REST APIs)*. \n#### Step 3c: Create a PAS object (for front-end, back-end, or both) \nCreating a private access settings (PAS) object is an important step for PrivateLink configuration. Follow the instructions in [Manage private access settings](https:\/\/docs.databricks.com\/security\/network\/classic\/private-access-settings.html). \n* For the region, be sure it matches the region of your workspace as this is not validated immediately but workspace deployment fails if it does not match.\n* Set the **Public access enabled** field, which configures public access to the front-end connection (the web application and REST APIs) for your workspace. \n+ If set to **False** (the default), the front-end connection can be accessed only using PrivateLink connectivity and not from the public internet. Because access from the public network is disallowed in this case, the [Configure IP access lists for workspaces](https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-workspace.html) feature is not supported for the workspace.\n+ If set to **True**, the front-end connection can be accessed either from PrivateLink connectivity or from the public internet. 
You can optionally configure an IP access list for the workspace to restrict the source networks that could access the web application and REST APIs from the public internet (but not the PrivateLink connection).\n* Set the **Private Access Level** field to the value that best represents which VPC endpoints to allow for your workspace. \n+ Set to **Account** to limit connections to those VPC endpoints that are registered in your Databricks account.\n+ Set to **Endpoint** to limit connections to an explicit set of VPC endpoints, which you can enter in a field that appears. It lets you select VPC endpoint registrations that you\u2019ve already created. Be sure to include your *front-end* VPC endpoint registration if you created one. \n#### Step 3d: Create or update the workspace (front-end, back-end, or both) \nThe workspace must already use a customer-managed VPC and secure cluster connectivity must be enabled, which is the case for most E2 workspaces. \nThe following instructions describe creating a workspace using [the account console\u2019s workspaces page](https:\/\/accounts.cloud.databricks.com\/workspaces). \n1. Follow the instructions in [Manually create a workspace (existing Databricks accounts)](https:\/\/docs.databricks.com\/admin\/workspace\/create-workspace.html) to create a workspace. See that article for guidance on workspace fields such as workspace URL, region, Unity Catalog, credential configurations, and storage configurations. Do not yet click the **Save** button.\n2. Click **Advanced configurations** to view additional fields.\n3. For back-end PrivateLink, choose the network configuration. Under **Virtual Private Cloud**, in the menu choose the Databricks network configuration you created.\n4. For any PrivateLink usage, select the private access settings object. Look below the **Private Link** heading. Click the menu and choose the name of the private access settings object that you created.\n5. Click **Save**.\n6. 
After creating (or updating) a workspace, wait until it\u2019s available for using or creating clusters. The workspace status stays at `RUNNING` and the VPC change happens immediately. However, you cannot use or create clusters for another 20 minutes. If you create or use clusters before this time interval elapses, clusters might not launch successfully, might fail, or might cause other unexpected behavior.\n7. Continue on to [Step 4: Configure internal DNS to redirect user requests to the web application (for front-end)](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#create-dns-name). \n### [Use the Account API](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#id2) \n#### Step 3a: Register VPC endpoints (front-end, back-end, or both) \nUsing the [Account API](https:\/\/docs.databricks.com\/api\/account\/introduction), register the AWS VPC endpoint IDs for your VPC endpoints (front-end, back-end, or both). For each one, this creates a Databricks VPC endpoint registration. \nFor back-end VPC endpoints, if you have multiple workspaces in the same region that share the same customer-managed VPC, you can choose to share the AWS VPC endpoints. You can also share these VPC endpoints among multiple Databricks accounts, in which case register the AWS VPC endpoint in each Databricks account. \nFor front-end VPC endpoints, if you have multiple Databricks accounts, you can share a front-end VPC endpoint across Databricks accounts. Register the endpoint in each relevant Databricks account. \nTo register a VPC endpoint in Databricks, make a `POST` request to the `\/accounts\/<account-id>\/vpc-endpoints` REST API endpoint and pass the following fields in the request body: \n* `vpc_endpoint_name`: User-visible name for the VPC endpoint registration within Databricks.\n* `region`: AWS region name.\n* `aws_vpc_endpoint_id`: Your VPC endpoint\u2019s ID within AWS. It starts with prefix `vpce-`. 
\nFor example: \n```\ncurl -X POST -n \\\n'https:\/\/accounts.cloud.databricks.com\/api\/2.0\/accounts\/<account-id>\/vpc-endpoints' \\\n-d '{\n\"vpc_endpoint_name\": \"Databricks front-end endpoint\",\n\"region\": \"us-west-2\",\n\"aws_vpc_endpoint_id\": \"<vpce-id>\"\n}'\n\n``` \nThe response JSON includes a `vpc_endpoint_id` field. If you are adding a back-end PrivateLink connection, save this value. This ID is specific to this configuration within Databricks. You need this ID when you create the network configuration in a later step ([Step 3b: Create a network configuration (back-end)](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#create-network-config)). \nRelated Account API operations that may be useful: \n* [Check the state of a VPC endpoint registration](https:\/\/docs.databricks.com\/api\/account\/introduction): The `state` field in the response JSON indicates the state within AWS.\n* [Get all VPC endpoint registrations in your account](https:\/\/docs.databricks.com\/api\/account\/introduction)\n* [Delete VPC endpoint registration](https:\/\/docs.databricks.com\/api\/account\/introduction) \n#### Step 3b: Create a network configuration (back-end) \nNote \nIf you implement only the front-end connection, skip this step. Although you must create a network configuration because a customer-managed VPC is required, there are no PrivateLink changes to this object if you only implement a front-end PrivateLink connection. \nFor any PrivateLink support, you must use a customer-managed VPC. See [Configure a customer-managed VPC](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html). This feature requires you to create a network configuration object that encapsulates the ID for the VPC, the subnets, and the security groups. \nFor back-end PrivateLink support, your network configuration must have an extra field that is specific to PrivateLink. 
The network configuration `vpc_endpoints` field references your Databricks-specific VPC endpoint IDs that were returned when you registered your VPC endpoints. See [Step 3a: Register VPC endpoints (front-end, back-end, or both)](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#register-vpce-ids). \nAdd both of these fields in that object: \n* `rest_api`: Set this to a JSON array that includes exactly one element: the Databricks-specific ID for the back-end REST API VPC endpoint that you registered in [Step 3a: Register VPC endpoints (front-end, back-end, or both)](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#register-vpce-ids). \nImportant \nBe careful to use the Databricks-specific ID that was created when you registered the regional endpoint based on the table in [Regional endpoint reference](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#regional-endpoints). It is a common configuration error to set the wrong ID on this field.\n* `dataplane_relay`: Set this to a JSON array that includes exactly one element: the Databricks-specific ID for the back-end SCC VPC endpoint that you registered in [Step 3a: Register VPC endpoints (front-end, back-end, or both)](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#register-vpce-ids). \nImportant \nBe careful to use the Databricks-specific ID that was created when you registered the regional endpoint based on the table in [Regional endpoint reference](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#regional-endpoints). It is a common configuration error to set the wrong ID on this field. 
\nYou get these Databricks-specific VPC endpoint IDs from the JSON responses of the requests you made in [Step 3a: Register VPC endpoints (front-end, back-end, or both)](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#register-vpce-ids), within the `vpc_endpoint_id` response field. \nThe following example creates a new network configuration that references the VPC endpoint IDs. Replace `<databricks-vpce-id-for-scc>` with your Databricks-specific VPC endpoint ID for the secure cluster connectivity relay. Replace `<databricks-vpce-id-for-rest-apis>` with your Databricks-specific VPC endpoint ID for the REST APIs. \n```\ncurl -X POST -n \\\n'https:\/\/accounts.cloud.databricks.com\/api\/2.0\/accounts\/<account-id>\/networks' \\\n-d '{\n\"network_name\": \"Provide name for the Network configuration\",\n\"vpc_id\": \"<aws-vpc-id>\",\n\"subnet_ids\": [\n\"<aws-subnet-1-id>\",\n\"<aws-subnet-2-id>\"\n],\n\"security_group_ids\": [\n\"<aws-sg-id>\"\n],\n\"vpc_endpoints\": {\n\"dataplane_relay\": [\n\"<databricks-vpce-id-for-scc>\"\n],\n\"rest_api\": [\n\"<databricks-vpce-id-for-rest-apis>\"\n]\n}\n}'\n\n``` \n#### Step 3c: Create a PAS configuration (front-end, back-end, or both) \nUse the Databricks [Account API](https:\/\/docs.databricks.com\/api\/account\/introduction) to create or attach a private access settings (PAS) object. \nThe private access settings object supports the following scenarios: \n* Implement only a front-end VPC endpoint\n* Implement only a back-end VPC endpoint\n* Implement both front-end and back-end VPC endpoints \nFor a workspace to support any of these PrivateLink connectivity scenarios, the workspace must be created with an attached *private access settings object*. This can be a new private access settings object intended only for this workspace, or an existing private access settings object that is shared across multiple workspaces in the same AWS region. \nThis object serves two purposes: \n1. 
Expresses your intent to use AWS PrivateLink with your workspace. If you intend to connect to your workspace using front-end or back-end PrivateLink, you must attach one of these objects to your workspace during workspace creation.\n2. Controls your settings for the front-end use case of AWS PrivateLink. If you wish to use back-end PrivateLink only, you can choose to set the object\u2019s `public_access_enabled` field to `true`. \nIn the private access settings object definition, the `public_access_enabled` field configures public access to the front-end connection (the web application and REST APIs) for your workspace: \n* If set to `false` (the default), the front-end connection can be accessed only using PrivateLink connectivity and not from the public internet. Because access from the public network is disallowed in this case, the [Configure IP access lists for workspaces](https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-workspace.html) feature is not supported for the workspace.\n* If set to `true`, the front-end connection can be accessed either from PrivateLink connectivity or from the public internet. You can optionally configure an IP access list for the workspace to restrict the source networks that can access the web application and REST APIs from the public internet (but not the PrivateLink connection). \nTo create a private access settings object, make a `POST` request to the `\/accounts\/<account-id>\/private-access-settings` REST API endpoint. The request body must include the following properties: \n* `private_access_settings_name`: Human-readable name for the private access settings object.\n* `region`: AWS region name.\n* `public_access_enabled`: Specifies whether to enable public access for the front-end connection. If `true`, public access is possible for the front-end connection in addition to PrivateLink connections. 
See the descriptions of `false` and `true` above for the value your implementation requires.\n* `private_access_level`: Specify which VPC endpoints can connect to this workspace: \n+ `ACCOUNT` (the default): Limit connections to those VPC endpoints that are registered in your Databricks account.\n+ `ENDPOINT`: Limit connections to an explicit set of VPC endpoints. See the related `allowed_vpc_endpoint_ids` property.\nNote \nThe private access level `ANY` is deprecated. The level is unavailable for new or existing private access settings objects.\n* `allowed_vpc_endpoint_ids`: Use only if `private_access_level` is set to `ENDPOINT`. This property specifies the set of VPC endpoints that can connect to this workspace. Specify as a JSON array of VPC endpoint IDs. Use the Databricks IDs that were returned during [endpoint registration](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#register-vpce-ids), not the AWS IDs. \nFor example: \n```\ncurl -X POST -n \\\n'https:\/\/accounts.cloud.databricks.com\/api\/2.0\/accounts\/<account-id>\/private-access-settings' \\\n-d '{\n\"private_access_settings_name\": \"Default PAS for us-west-2\",\n\"region\": \"us-west-2\",\n\"public_access_enabled\": true\n}'\n\n``` \nThe response JSON includes a `private_access_settings_id` field. This ID is specific to this configuration within Databricks. Save this value because you need it when you create the workspace. \nRelated APIs: \n* [Get a private access settings object by its ID](https:\/\/docs.databricks.com\/api\/account\/introduction)\n* [Get all private access settings objects](https:\/\/docs.databricks.com\/api\/account\/introduction)\n* [Update a private access settings object](https:\/\/docs.databricks.com\/api\/account\/introduction)\n* [Delete private access settings object](https:\/\/docs.databricks.com\/api\/account\/introduction) \n#### Step 3d: Create or update a workspace \nThe workspace must already use a customer-managed VPC. 
\nThe important fields for creating a workspace with PrivateLink connectivity are `private_access_settings_id` (the ID of your new [private access settings object](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#add-pas-config)) and `network_id` (the ID of your new [network configuration](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#create-network-config)). \nTo create a workspace with PrivateLink connectivity: \n1. Read the Databricks [Account API](https:\/\/docs.databricks.com\/api\/account\/introduction) documentation for guidance on the fields for a new workspace. For complete instructions on all fields such as storage configurations, credential configurations, and customer-managed keys, see [Create a workspace using the Account API](https:\/\/docs.databricks.com\/admin\/workspace\/create-workspace-api.html).\n2. Call the [Create a new workspace API](https:\/\/docs.databricks.com\/api\/account\/introduction) (`POST \/accounts\/{account_id}\/workspaces`) and be sure to include `private_access_settings_id` and `network_id`, for example: \n```\ncurl -X POST -n \\\n'https:\/\/accounts.cloud.databricks.com\/api\/2.0\/accounts\/<databricks-account-id>\/workspaces' \\\n-d '{\n\"workspace_name\": \"my-company-example\",\n\"deployment_name\": \"my-company-example\",\n\"aws_region\": \"us-west-2\",\n\"credentials_id\": \"<aws-credentials-id>\",\n\"storage_configuration_id\": \"<databricks-storage-config-id>\",\n\"network_id\": \"<databricks-network-config-id>\",\n\"managed_services_customer_managed_key_id\": \"<aws-kms-managed-services-key-id>\",\n\"storage_customer_managed_key_id\": \"<aws-kms-notebook-workspace-storage-config-id>\",\n\"private_access_settings_id\": \"<private-access-settings-id>\"\n}'\n\n```\n3. 
After creating a workspace or updating an existing workspace with PrivateLink, wait until the workspace is available for using or creating clusters. The workspace status stays at `RUNNING` and the VPC change happens immediately. However, you cannot use or create clusters for another 20 minutes. If you create or use clusters before this time interval elapses, clusters might not launch successfully, might fail, or might cause other unexpected behavior. \n### [Use Terraform](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#id3) \nTo use Terraform to create underlying AWS network objects and the related Databricks PrivateLink objects, see these Terraform providers: \n* [Terraform provider that registers VPC endpoints](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/mws_vpc_endpoint). Before using this resource, you must have already created the necessary AWS VPC endpoints.\n* [Terraform provider that creates an AWS VPC and a Databricks network configuration](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/mws_networks).\n* [Terraform provider that creates a Databricks private access settings object](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/mws_private_access_settings). \nTo use Terraform to deploy a workspace, see this Terraform provider: \n* [Terraform provider for workspaces](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/mws_workspaces).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Enable AWS PrivateLink\n###### Step 4: Configure internal DNS to redirect user requests to the web application (for front-end)\n\nTo use your front-end PrivateLink connection, redirect user requests to the web application. 
This requires changing private DNS for the network that your users use or connect to. If users need to access the Databricks workspace from an on-premises network that is under the scope of your internal or custom DNS, perform the following configuration after the workspace is created or updated to ensure that your workspace URL maps to the VPC endpoint private IP for your workspace VPC endpoint. \nConfigure your internal DNS such that it maps the web application workspace URL to your front-end VPC endpoint. \nUse the `nslookup` Unix command line tool to test the DNS resolution using your workspace deploy domain name, for example: \n```\nnslookup my-workspace-name-here.cloud.databricks.com\n\n``` \nExample response: \n```\nNon-authoritative answer:\n\nmy-workspace-name-here.cloud.databricks.com canonical name = oregon.cloud.databricks.com.\n\noregon.cloud.databricks.com canonical name = a89b3c627d423471389d6ada5c3311b4-f09b129745548506.elb.us-west-2.amazonaws.com.\n\nName: a89b3c627d423471389d6ada5c3311b4-f09b129745548506.elb.us-west-2.amazonaws.com\n\nAddress: 44.234.192.47\n\n``` \nExample DNS mapping for a workspace with front-end VPC endpoint in AWS region `us-east-1`: \n* By default the DNS mapping is: \n+ `myworkspace.cloud.databricks.com` maps to `nvirginia.privatelink.cloud.databricks.com`. In this case `nvirginia` is the control plane instance short name in that region.\n+ `nvirginia.privatelink.cloud.databricks.com` maps to `nvirginia.cloud.databricks.com`.\n+ `nvirginia.cloud.databricks.com` maps to the AWS public IPs.\n* After your DNS changes, from your transit VPC (where your front-end VPC endpoint is), the DNS mapping would be: \n+ `myworkspace.cloud.databricks.com` maps to `nvirginia.privatelink.cloud.databricks.com`.\n+ `nvirginia.privatelink.cloud.databricks.com` maps to the private IP of your VPC endpoint for front-end connectivity. 
\nFor the workspace URL to map to the VPC endpoint private IP from the on-premises network, you must do one of the following: \n* Configure conditional forwarding for the workspace URL to use [AmazonDNS](https:\/\/docs.aws.amazon.com\/vpc\/latest\/userguide\/VPC_DHCP_Options.html#AmazonDNS).\n* Create an A-record for the workspace URL in your on-premises or internal DNS that maps to the VPC endpoint private IP.\n* Complete steps similar to what you would do to enable access to other similar PrivateLink-enabled services. \nYou can choose to map the workspace URL directly to the front-end (workspace) VPC endpoint private IP by creating an A-record in your internal DNS, such that the DNS mapping looks like this: \n* `myworkspace.cloud.databricks.com` maps to the VPC endpoint private IP \nAfter you make changes to your internal DNS configuration, test the configuration by accessing the Databricks workspace web application and REST API from your transit VPC. Create a VPC endpoint in the transit VPC if necessary to test the configuration. \nIf you didn\u2019t configure your DNS record in your private DNS domain, you may receive an error. You can fix this by creating the following records on your DNS server. Then, you can access the workspace, Spark interface, and web terminal services. 
\n| Record type | Record name | Value |\n| --- | --- | --- |\n| A | <deployment-name>.cloud.databricks.com | PrivateLink interface IP |\n| CNAME | dbc-dp-<workspace-id>.cloud.databricks.com | <deployment-name>.cloud.databricks.com | \nIf you have questions about how this applies to your network architecture, contact your Databricks account team.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Enable AWS PrivateLink\n###### Step 5: (Optional) Configure front-end PrivateLink with unified login\n\nPreview \nUnified login with front-end PrivateLink is in Private Preview. You must contact your Databricks account team to request access to this preview. \nUnified login allows you to manage one SSO configuration in your account that is used for the account and Databricks workspaces. See [Enable unified login](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/single-sign-on\/index.html#unified-login). To use unified login with front-end PrivateLink, you must configure the following: \n### Step 5a: Allow the PrivateLink redirect URI in your identity provider \n1. As an account admin, log in to the [account console](https:\/\/docs.databricks.com\/admin\/account-settings\/index.html#account-console).\n2. In the sidebar, click **Settings**.\n3. Click the **Single sign-on** tab.\n4. Copy the value in the **Databricks Redirect URI** field.\n5. Replace the `accounts` with `accounts-pl-auth` to get the Databricks PrivateLink Redirect URI.\n6. Go to your identity provider.\n7. Add the Databricks PrivateLink Redirect URI as an additional redirect URL. If you configure SSO using SAML, also add the Databricks PrivateLink Redirect URI as an additional entity ID. 
\nIf you have both PrivateLink and non-PrivateLink workspaces in your account, do not remove the original **Databricks Redirect URI** (the one containing `accounts`) from your identity provider redirect URLs. \n### Step 5b: Configure a private hosted zone for your transit VPC \nPerform the following configuration in your transit VPC to ensure that the Databricks PrivateLink Redirect URI maps to the VPC endpoint private IP address for your workspace VPC endpoint. \n1. From your transit VPC, use the `nslookup` Unix command line tool to get the DNS resolution using your workspace URL. See the example in [Step 4: Configure internal DNS to redirect user requests to the web application (for front-end)](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#create-dns-name).\n2. Copy the control plane instance URL of your PrivateLink workspace. The control plane instance URL is in the format `<region>.privatelink.cloud.databricks.com`.\n3. In your transit VPC, create a private hosted zone with domain name `privatelink.cloud.databricks.com`.\n4. Add a CNAME record that resolves `accounts-pl-auth.privatelink.cloud.databricks.com` to your control plane instance URL.\n5. Test the configuration by accessing the Databricks PrivateLink Redirect URI from your transit VPC.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Enable AWS PrivateLink\n###### Step 6: Add VPC endpoints for other AWS services\n\nFor typical use cases, the following VPC endpoints are **required** so that clusters and other compute resources in the classic compute plane can connect to AWS native services: \n* **S3 VPC gateway endpoint**: Attach this only to the route table that\u2019s attached to your workspace subnets. 
If you\u2019re using the recommended separate subnet with its own route table for back-end VPC endpoints, then the S3 VPC endpoint doesn\u2019t need to be attached to that particular route table. See [this AWS article about S3 gateway endpoints](https:\/\/docs.aws.amazon.com\/vpc\/latest\/privatelink\/vpc-endpoints-s3.html).\n* **STS VPC interface endpoint**: Create this in all the workspace subnets and attach it to the workspace security group. Do not create this in the subnet for back-end VPC endpoints. See [this AWS section about STS interface endpoints](https:\/\/docs.aws.amazon.com\/IAM\/latest\/UserGuide\/reference_interface_vpc_endpoints.html#reference_sts_vpc_endpoint_create) and this [general article about interface endpoints](https:\/\/docs.aws.amazon.com\/vpc\/latest\/privatelink\/create-interface-endpoint.html#create-interface-endpoint).\n* **Kinesis VPC interface endpoint**: Just like the STS VPC interface endpoint, create the Kinesis VPC interface endpoint in all the workspace subnets and attach it to the workspace security group. See [this AWS article about Kinesis interface endpoints](https:\/\/docs.aws.amazon.com\/streams\/latest\/dev\/vpc.html) and this [general article about interface endpoints](https:\/\/docs.aws.amazon.com\/vpc\/latest\/privatelink\/create-interface-endpoint.html#create-interface-endpoint). \nIf you want to lock down a workspace VPC so that no other outbound connections are supported, the workspace won\u2019t have access to the Databricks-provided default legacy Hive metastore because AWS does not yet support PrivateLink for JDBC traffic to RDS. One option is that you could configure the regional Databricks-provided metastore FQDN or IP in an egress firewall, or a public route table to Internet Gateway, or a Network ACL for the public subnet hosting a NAT Gateway. In such a case, the traffic to the Databricks-provided metastore would go over the public network. 
However, if you do not want to access the Databricks-managed metastore over the public network: \n* You could deploy an external metastore in your own VPC. See [External Apache Hive metastore (legacy)](https:\/\/docs.databricks.com\/archive\/external-metastores\/external-hive-metastore.html).\n* You could use AWS Glue for your metastore. Glue supports PrivateLink. See [Use AWS Glue Data Catalog as a metastore (legacy)](https:\/\/docs.databricks.com\/archive\/external-metastores\/aws-glue-metastore.html). \nYou may also need access to public library repositories such as PyPI (for Python) or CRAN (for R). To access those, either reconsider deploying in a fully-locked-down outbound mode, or use an egress firewall in your architecture to allow the required repositories. The overall architecture of your deployment depends on your overall requirements. If you have questions, contact your Databricks account team. \nTo create the AWS VPC endpoints using the AWS Management Console, see the AWS article for [creating VPC endpoints in the AWS Management Console](https:\/\/docs.aws.amazon.com\/vpc\/latest\/privatelink\/vpce-interface.html). 
\nFor tools that can help automate VPC endpoint creation and management, see: \n* The article [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html)\n* The Terraform resources [databricks\\_mws\\_vpc\\_endpoint](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/mws_vpc_endpoint) and [databricks\\_mws\\_private\\_access\\_settings](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/mws_private_access_settings).\n* The Terraform guide [Deploying prerequisite resources and enabling PrivateLink connections](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/guides\/aws-private-link-workspace).\n* The AWS article [CloudFormation: Creating VPC Endpoint](https:\/\/docs.aws.amazon.com\/AWSCloudFormation\/latest\/UserGuide\/aws-resource-ec2-vpcendpoint.html)\n* The AWS article [AWS CLI : create-vpc-endpoint](https:\/\/docs.aws.amazon.com\/cli\/latest\/reference\/ec2\/create-vpc-endpoint.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html"} +{"content":"# \n### Ingest or connect raw data\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html) \nThe following guide walks you through ingesting data for your RAG Studio application. \nImportant \nThe default `\ud83d\udce5 Data Ingestor` downloads the Databricks documentation. \nYou can modify the code in `src\/notebooks\/ingest_data.py` to ingest from another source or adjust `config\/rag-config.yml` to use data that already exists in a Unity Catalog Volume. \nThe default `\ud83d\uddc3\ufe0f Data Processor` that ships with RAG Studio only supports HTML files. 
If you have other file types in your Unity Catalog Volume, follow the steps in [Creating a \ud83d\uddc3\ufe0f Data Processor version](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-data-processor.html) to adjust the `\ud83d\uddc3\ufe0f Data Processor` code. \n1. Run the following command to start the data ingestion process. This step will take approximately 10 minutes. \n```\n.\/rag ingest-data -e dev\n\n```\n2. You will see the following message in your console when the ingestion completes. \n```\n-------------------------\nRun URL: <URL to the deployment Databricks Job>\n\n<timestamp> \"[dev e] [databricks-docs-bot][dev] ingest_data\" RUNNING\n<timestamp> \"[dev e] [databricks-docs-bot][dev] ingest_data\" TERMINATED SUCCESS\nSuccessfully downloaded and uploaded Databricks documentation articles to UC Volume '`catalog`.`schema`.`raw_databricks_docs`'\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1b-ingest-data.html"} +{"content":"# \n### Ingest or connect raw data\n#### Follow the next tutorial!\n\n[Deploy a version of a RAG Application](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1c-deploy-version.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1b-ingest-data.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query data in Azure Synapse Analytics\n\nYou can access Azure Synapse from Databricks using the Azure Synapse connector, which uses the `COPY` statement in Azure Synapse to transfer large volumes of data efficiently between a Databricks cluster and an Azure Synapse instance using an Azure Data Lake Storage Gen2 storage account for temporary staging. \nNote \nYou may prefer Lakehouse Federation for managing queries on Azure Synapse or Azure Data Warehouse data. See [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). 
\n[Azure Synapse Analytics](https:\/\/azure.microsoft.com\/services\/synapse-analytics\/) is a cloud-based enterprise data warehouse that leverages massively parallel processing (MPP) to quickly run complex queries across petabytes of data. \nImportant \nThis connector is for use with Synapse Dedicated Pool instances only and is not compatible with other Synapse components. \nNote \n`COPY` is available only on Azure Data Lake Storage Gen2 instances. If you\u2019re looking for details on working with Polybase, see [Connecting Databricks and Azure Synapse with PolyBase (legacy)](https:\/\/docs.databricks.com\/archive\/azure\/synapse-polybase.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query data in Azure Synapse Analytics\n##### Example syntax for Synapse\n\nYou can query Synapse in Scala, Python, SQL, and R. The following code examples use storage account keys and forward the storage credentials from Databricks to Synapse. \nNote \nUse the connection string provided by Azure portal, which enables Secure Sockets Layer (SSL) encryption for all data sent between the Spark driver and the Azure Synapse instance through the JDBC connection. To verify that the SSL encryption is enabled, you can search for `encrypt=true` in the connection string. \nImportant \n[External locations defined in Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html) are not supported as `tempDir` locations. \n```\n\/\/ Set up the storage account access key in the notebook session conf.\nspark.conf.set(\n\"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net\",\n\"<your-storage-account-access-key>\")\n\n\/\/ Get some data from an Azure Synapse table. 
The following example applies to Databricks Runtime 11.3 LTS and above.\nval df: DataFrame = spark.read\n.format(\"sqldw\")\n.option(\"host\", \"hostname\")\n.option(\"port\", \"port\") \/* Optional - will use default port 1433 if not specified. *\/\n.option(\"user\", \"username\")\n.option(\"password\", \"password\")\n.option(\"database\", \"database-name\")\n.option(\"dbtable\", \"schema-name.table-name\") \/* If schemaName not provided, default to \"dbo\". *\/\n.option(\"tempDir\", \"abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>\")\n.option(\"forwardSparkAzureStorageCredentials\", \"true\")\n.load()\n\n\/\/ Get some data from an Azure Synapse table. The following example applies to Databricks Runtime 10.4 LTS and below.\nval df: DataFrame = spark.read\n.format(\"com.databricks.spark.sqldw\")\n.option(\"url\", \"jdbc:sqlserver:\/\/<the-rest-of-the-connection-string>\")\n.option(\"tempDir\", \"abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>\")\n.option(\"forwardSparkAzureStorageCredentials\", \"true\")\n.option(\"dbTable\", \"<your-table-name>\")\n.load()\n\n\/\/ Load data from an Azure Synapse query.\nval df: DataFrame = spark.read\n.format(\"com.databricks.spark.sqldw\")\n.option(\"url\", \"jdbc:sqlserver:\/\/<the-rest-of-the-connection-string>\")\n.option(\"tempDir\", \"abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>\")\n.option(\"forwardSparkAzureStorageCredentials\", \"true\")\n.option(\"query\", \"select x, count(*) as cnt from table group by x\")\n.load()\n\n\/\/ Apply some transformations to the data, then use the\n\/\/ Data Source API to write the data back to another table in Azure Synapse.\n\ndf.write\n.format(\"com.databricks.spark.sqldw\")\n.option(\"url\", \"jdbc:sqlserver:\/\/<the-rest-of-the-connection-string>\")\n.option(\"forwardSparkAzureStorageCredentials\", 
\"true\")\n.option(\"dbTable\", \"<your-table-name>\")\n.option(\"tempDir\", \"abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>\")\n.save()\n\n``` \n```\n\n# Set up the storage account access key in the notebook session conf.\nspark.conf.set(\n\"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net\",\n\"<your-storage-account-access-key>\")\n\n# Get some data from an Azure Synapse table. The following example applies to Databricks Runtime 11.3 LTS and above.\n# The chain is wrapped in parentheses so it can span lines and keep inline comments.\ndf = (spark.read\n.format(\"sqldw\")\n.option(\"host\", \"hostname\")\n.option(\"port\", \"port\") # Optional - will use default port 1433 if not specified.\n.option(\"user\", \"username\")\n.option(\"password\", \"password\")\n.option(\"database\", \"database-name\")\n.option(\"dbtable\", \"schema-name.table-name\") # If schemaName not provided, default to \"dbo\".\n.option(\"tempDir\", \"abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>\")\n.option(\"forwardSparkAzureStorageCredentials\", \"true\")\n.load()\n)\n\n# Get some data from an Azure Synapse table. 
The following example applies to Databricks Runtime 10.4 LTS and below.\ndf = spark.read \\\n.format(\"com.databricks.spark.sqldw\") \\\n.option(\"url\", \"jdbc:sqlserver:\/\/<the-rest-of-the-connection-string>\") \\\n.option(\"tempDir\", \"abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>\") \\\n.option(\"forwardSparkAzureStorageCredentials\", \"true\") \\\n.option(\"dbTable\", \"<your-table-name>\") \\\n.load()\n\n# Load data from an Azure Synapse query.\ndf = spark.read \\\n.format(\"com.databricks.spark.sqldw\") \\\n.option(\"url\", \"jdbc:sqlserver:\/\/<the-rest-of-the-connection-string>\") \\\n.option(\"tempDir\", \"abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>\") \\\n.option(\"forwardSparkAzureStorageCredentials\", \"true\") \\\n.option(\"query\", \"select x, count(*) as cnt from table group by x\") \\\n.load()\n\n# Apply some transformations to the data, then use the\n# Data Source API to write the data back to another table in Azure Synapse.\n\ndf.write \\\n.format(\"com.databricks.spark.sqldw\") \\\n.option(\"url\", \"jdbc:sqlserver:\/\/<the-rest-of-the-connection-string>\") \\\n.option(\"forwardSparkAzureStorageCredentials\", \"true\") \\\n.option(\"dbTable\", \"<your-table-name>\") \\\n.option(\"tempDir\", \"abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>\") \\\n.save()\n\n``` \n```\n-- Set up the storage account access key in the notebook session conf.\nSET fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net=<your-storage-account-access-key>;\n\n-- Read data using SQL. The following example applies to Databricks Runtime 11.3 LTS and above.\nCREATE TABLE example_table_in_spark_read\nUSING sqldw\nOPTIONS (\nhost '<hostname>',\nport '<port>' \/* Optional - will use default port 1433 if not specified. 
*\/,\nuser '<username>',\npassword '<password>',\ndatabase '<database-name>',\ndbtable '<schema-name>.<table-name>', \/* If schemaName not provided, default to \"dbo\". *\/\nforwardSparkAzureStorageCredentials 'true',\ntempDir 'abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>'\n);\n\n-- Read data using SQL. The following example applies to Databricks Runtime 10.4 LTS and below.\nCREATE TABLE example_table_in_spark_read\nUSING com.databricks.spark.sqldw\nOPTIONS (\nurl 'jdbc:sqlserver:\/\/<the-rest-of-the-connection-string>',\nforwardSparkAzureStorageCredentials 'true',\ndbtable '<your-table-name>',\ntempDir 'abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>'\n);\n\n-- Write data using SQL.\n-- Create a new table, throwing an error if a table with the same name already exists:\n\nCREATE TABLE example_table_in_spark_write\nUSING com.databricks.spark.sqldw\nOPTIONS (\nurl 'jdbc:sqlserver:\/\/<the-rest-of-the-connection-string>',\nforwardSparkAzureStorageCredentials 'true',\ndbTable '<your-table-name>',\ntempDir 'abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>'\n)\nAS SELECT * FROM table_to_save_in_spark;\n\n``` \n```\n# Load SparkR\nlibrary(SparkR)\n\n# Set up the storage account access key in the notebook session conf.\nconf <- sparkR.callJMethod(sparkR.session(), \"conf\")\nsparkR.callJMethod(conf, \"set\", \"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net\", \"<your-storage-account-access-key>\")\n\n# Get some data from an Azure Synapse table.\ndf <- read.df(\nsource = \"com.databricks.spark.sqldw\",\nurl = \"jdbc:sqlserver:\/\/<the-rest-of-the-connection-string>\",\nforward_spark_azure_storage_credentials = \"true\",\ndbTable = \"<your-table-name>\",\ntempDir = \"abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>\")\n\n# Load data 
from an Azure Synapse query.\ndf <- read.df(\nsource = \"com.databricks.spark.sqldw\",\nurl = \"jdbc:sqlserver:\/\/<the-rest-of-the-connection-string>\",\nforward_spark_azure_storage_credentials = \"true\",\nquery = \"select x, count(*) as cnt from table group by x\",\ntempDir = \"abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>\")\n\n# Apply some transformations to the data, then use the\n# Data Source API to write the data back to another table in Azure Synapse.\n\nwrite.df(\ndf,\nsource = \"com.databricks.spark.sqldw\",\nurl = \"jdbc:sqlserver:\/\/<the-rest-of-the-connection-string>\",\nforward_spark_azure_storage_credentials = \"true\",\ndbTable = \"<your-table-name>\",\ntempDir = \"abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query data in Azure Synapse Analytics\n##### How does authentication between Databricks and Synapse work?\n\nThe Azure Synapse connector uses three types of network connections: \n* Spark driver to Azure Synapse\n* Spark cluster to Azure storage account\n* Azure Synapse to Azure storage account\n\n#### Query data in Azure Synapse Analytics\n##### Configuring access to Azure storage\n\nBoth Databricks and Synapse need privileged access to an Azure storage account to be used for temporary data storage. \nAzure Synapse does not support using SAS for storage account access. You can configure access for both services by doing one of the following: \n* Use the account key and secret for the storage account and set `forwardSparkAzureStorageCredentials` to `true`. 
See [Set Spark properties to configure Azure credentials to access Azure storage](https:\/\/docs.databricks.com\/connect\/storage\/azure-storage.html#set-spark-properties-to-configure-azure-credentials-to-access-azure-storage).\n* Use Azure Data Lake Storage Gen2 with OAuth 2.0 authentication and set `enableServicePrincipalAuth` to `true`. See [Configure connection from Databricks to Synapse with OAuth 2.0 with a service principal](https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html#service-principal).\n* Configure your Azure Synapse instance to have a Managed Service Identity and set `useAzureMSI` to `true`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query data in Azure Synapse Analytics\n##### Required Azure Synapse permissions\n\nBecause it uses `COPY` in the background, the Azure Synapse connector requires the JDBC connection user to have permission to run the following commands in the connected Azure Synapse instance: \n* [COPY INTO](https:\/\/learn.microsoft.com\/sql\/t-sql\/statements\/copy-into-transact-sql) \nIf the destination table does not exist in Azure Synapse, permission to run the following command is required in addition to the command above: \n* [CREATE TABLE](https:\/\/learn.microsoft.com\/sql\/t-sql\/statements\/create-table-azure-sql-data-warehouse) \nThe following table summarizes the permissions required for writes with `COPY`: \n| Permissions (insert into an existing table) | Permissions (insert into a new table) |\n| --- | --- |\n| ADMINISTER DATABASE BULK OPERATIONS INSERT | ADMINISTER DATABASE BULK OPERATIONS INSERT CREATE TABLE ALTER ON SCHEMA :: dbo |\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query data in Azure Synapse Analytics\n##### Configure 
connection from Databricks to Synapse with OAuth 2.0 with a service principal\n\nYou can authenticate to Azure Synapse Analytics using a service principal with access to the underlying storage account. For more information on using service principal credentials to access an Azure storage account, see [Connect to Azure Data Lake Storage Gen2 and Blob Storage](https:\/\/docs.databricks.com\/connect\/storage\/azure-storage.html). To enable the connector to authenticate with a service principal, you must set the `enableServicePrincipalAuth` option to `true` in the connection configuration. See [Databricks Synapse connector options reference](https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html#parameters). \nYou can optionally use a different service principal for the Azure Synapse Analytics connection. The following example configures service principal credentials for the storage account and optional service principal credentials for Synapse: \n```\n; Defining the Service Principal credentials for the Azure storage account\nfs.azure.account.auth.type OAuth\nfs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\nfs.azure.account.oauth2.client.id <application-id>\nfs.azure.account.oauth2.client.secret <service-credential>\nfs.azure.account.oauth2.client.endpoint https:\/\/login.microsoftonline.com\/<directory-id>\/oauth2\/token\n\n; Defining a separate set of service principal credentials for Azure Synapse Analytics (If not defined, the connector will use the Azure storage account credentials)\nspark.databricks.sqldw.jdbc.service.principal.client.id <application-id>\nspark.databricks.sqldw.jdbc.service.principal.client.secret <service-credential>\n\n``` \n```\n\/\/ Defining the Service Principal credentials for the Azure storage account\nspark.conf.set(\"fs.azure.account.auth.type\", \"OAuth\")\nspark.conf.set(\"fs.azure.account.oauth.provider.type\", 
\"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\")\nspark.conf.set(\"fs.azure.account.oauth2.client.id\", \"<application-id>\")\nspark.conf.set(\"fs.azure.account.oauth2.client.secret\", \"<service-credential>\")\nspark.conf.set(\"fs.azure.account.oauth2.client.endpoint\", \"https:\/\/login.microsoftonline.com\/<directory-id>\/oauth2\/token\")\n\n\/\/ Defining a separate set of service principal credentials for Azure Synapse Analytics (If not defined, the connector will use the Azure storage account credentials)\nspark.conf.set(\"spark.databricks.sqldw.jdbc.service.principal.client.id\", \"<application-id>\")\nspark.conf.set(\"spark.databricks.sqldw.jdbc.service.principal.client.secret\", \"<service-credential>\")\n\n``` \n```\n# Defining the service principal credentials for the Azure storage account\nspark.conf.set(\"fs.azure.account.auth.type\", \"OAuth\")\nspark.conf.set(\"fs.azure.account.oauth.provider.type\", \"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\")\nspark.conf.set(\"fs.azure.account.oauth2.client.id\", \"<application-id>\")\nspark.conf.set(\"fs.azure.account.oauth2.client.secret\", \"<service-credential>\")\nspark.conf.set(\"fs.azure.account.oauth2.client.endpoint\", \"https:\/\/login.microsoftonline.com\/<directory-id>\/oauth2\/token\")\n\n# Defining a separate set of service principal credentials for Azure Synapse Analytics (If not defined, the connector will use the Azure storage account credentials)\nspark.conf.set(\"spark.databricks.sqldw.jdbc.service.principal.client.id\", \"<application-id>\")\nspark.conf.set(\"spark.databricks.sqldw.jdbc.service.principal.client.secret\", \"<service-credential>\")\n\n``` \n```\n# Load SparkR\nlibrary(SparkR)\nconf <- sparkR.callJMethod(sparkR.session(), \"conf\")\n\n# Defining the service principal credentials for the Azure storage account\nsparkR.callJMethod(conf, \"set\", \"fs.azure.account.auth.type\", \"OAuth\")\nsparkR.callJMethod(conf, \"set\", 
\"fs.azure.account.oauth.provider.type\", \"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\")\nsparkR.callJMethod(conf, \"set\", \"fs.azure.account.oauth2.client.id\", \"<application-id>\")\nsparkR.callJMethod(conf, \"set\", \"fs.azure.account.oauth2.client.secret\", \"<service-credential>\")\nsparkR.callJMethod(conf, \"set\", \"fs.azure.account.oauth2.client.endpoint\", \"https:\/\/login.microsoftonline.com\/<directory-id>\/oauth2\/token\")\n\n# Defining a separate set of service principal credentials for Azure Synapse Analytics (If not defined, the connector will use the Azure storage account credentials)\nsparkR.callJMethod(conf, \"set\", \"spark.databricks.sqldw.jdbc.service.principal.client.id\", \"<application-id>\")\nsparkR.callJMethod(conf, \"set\", \"spark.databricks.sqldw.jdbc.service.principal.client.secret\", \"<service-credential>\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query data in Azure Synapse Analytics\n##### Supported save modes for batch writes\n\nThe Azure Synapse connector supports `ErrorIfExists`, `Ignore`, `Append`, and `Overwrite` save modes with the default mode being `ErrorIfExists`. 
For more information on supported save modes in Apache Spark, see [Spark SQL documentation on Save Modes](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-load-save-functions.html#save-modes).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query data in Azure Synapse Analytics\n##### Databricks Synapse connector options reference\n\nThe `OPTIONS` provided in Spark SQL support the following settings: \n| Parameter | Required | Default | Notes |\n| --- | --- | --- | --- |\n| `dbTable` | Yes, unless `query` is specified | No default | The table to create or read from in Azure Synapse. This parameter is required when saving data back to Azure Synapse. You can also use `{SCHEMA NAME}.{TABLE NAME}` to access a table in a given schema. If schema name is not provided, the default schema associated with the JDBC user is used. The previously supported `dbtable` variant is deprecated and will be ignored in future releases. Use the \u201ccamel case\u201d name instead. |\n| `query` | Yes, unless `dbTable` is specified | No default | The query to read from in Azure Synapse. For tables referred in the query, you can also use `{SCHEMA NAME}.{TABLE NAME}` to access a table in a given schema. If schema name is not provided, the default schema associated with the JDBC user is used. |\n| `user` | No | No default | The Azure Synapse username. Must be used in tandem with `password` option. Can only be used if the user and password are not passed in the URL. Passing both will result in an error. |\n| `password` | No | No default | The Azure Synapse password. Must be used in tandem with `user` option. Can only be used if the user and password are not passed in the URL. Passing both will result in an error. |\n| `url` | Yes | No default | A JDBC URL with `sqlserver` set as the subprotocol. 
It is recommended to use the connection string provided by the Azure portal. Setting `encrypt=true` is strongly recommended, because it enables SSL encryption of the JDBC connection. If `user` and `password` are set separately, you do not need to include them in the URL. |\n| `jdbcDriver` | No | Determined by the JDBC URL\u2019s subprotocol | The class name of the JDBC driver to use. This class must be on the classpath. In most cases, it should not be necessary to specify this option, as the appropriate driver classname should automatically be determined by the JDBC URL\u2019s subprotocol. The previously supported `jdbc_driver` variant is deprecated and will be ignored in future releases. Use the \u201ccamel case\u201d name instead. |\n| `tempDir` | Yes | No default | An `abfss` URI. We recommend you use a dedicated Blob storage container for Azure Synapse. The previously supported `tempdir` variant is deprecated and will be ignored in future releases. Use the \u201ccamel case\u201d name instead. You cannot use an [External location defined in Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html) as a `tempDir` location. |\n| `tempCompression` | No | `SNAPPY` | The compression algorithm to be used to encode\/decode temporary data by both Spark and Azure Synapse. Currently supported values are: `UNCOMPRESSED`, `SNAPPY` and `GZIP`. |\n| `forwardSparkAzureStorageCredentials` | No | false | If `true`, the library automatically discovers the storage account access key credentials that Spark is using to connect to the Blob storage container and forwards those credentials to Azure Synapse over JDBC. These credentials are sent as part of the JDBC query. Therefore it is strongly recommended that you enable SSL encryption of the JDBC connection when you use this option. When configuring storage authentication, you must set exactly one of `useAzureMSI` and `forwardSparkAzureStorageCredentials` to `true`. 
Alternatively, you can set `enableServicePrincipalAuth` to `true` and use a service principal for both JDBC and storage authentication. The `forwardSparkAzureStorageCredentials` option does not support authentication to storage using either a managed service identity or service principal. Only a storage account access key is supported. The previously supported `forward_spark_azure_storage_credentials` variant is deprecated and will be ignored in future releases. Use the \u201ccamel case\u201d name instead. |\n| `useAzureMSI` | No | false | If `true`, the library will specify `IDENTITY = 'Managed Service Identity'` and no `SECRET` for the database scoped credentials it creates. When configuring storage authentication, you must set exactly one of `useAzureMSI` and `forwardSparkAzureStorageCredentials` to `true`. Alternatively, you can set `enableServicePrincipalAuth` to `true` and use a service principal for both JDBC and storage authentication. |\n| `enableServicePrincipalAuth` | No | false | If `true`, the library will use the provided service principal credentials to connect to the Azure storage account and Azure Synapse Analytics over JDBC. If either `forwardSparkAzureStorageCredentials` or `useAzureMSI` is set to `true`, that option takes precedence over the service principal for storage authentication. |\n| `tableOptions` | No | `CLUSTERED COLUMNSTORE INDEX`, `DISTRIBUTION = ROUND_ROBIN` | A string used to specify [table options](https:\/\/learn.microsoft.com\/sql\/t-sql\/statements\/create-table-azure-sql-data-warehouse) when creating the Azure Synapse table set through `dbTable`. This string is passed literally to the `WITH` clause of the `CREATE TABLE` SQL statement that is issued against Azure Synapse. The previously supported `table_options` variant is deprecated and will be ignored in future releases. Use the \u201ccamel case\u201d name instead. 
|\n| `preActions` | No | No default (empty string) | A `;`-separated list of SQL commands to be executed in Azure Synapse before writing data to the Azure Synapse instance. These SQL commands are required to be valid commands accepted by Azure Synapse. If any of these commands fails, it is treated as an error and the write operation is not executed. |\n| `postActions` | No | No default (empty string) | A `;`-separated list of SQL commands to be executed in Azure Synapse after the connector successfully writes data to the Azure Synapse instance. These SQL commands are required to be valid commands accepted by Azure Synapse. If any of these commands fails, it is treated as an error and you\u2019ll get an exception after the data is successfully written to the Azure Synapse instance. |\n| `maxStrLength` | No | 256 | `StringType` in Spark is mapped to the `NVARCHAR(maxStrLength)` type in Azure Synapse. You can use `maxStrLength` to set the string length for all `NVARCHAR(maxStrLength)` type columns that are in the table with name `dbTable` in Azure Synapse. The previously supported `maxstrlength` variant is deprecated and will be ignored in future releases. Use the \u201ccamel case\u201d name instead. |\n| `applicationName` | No | `Databricks-User-Query` | The tag of the connection for each query. If not specified or the value is an empty string, the default value of the tag is added to the JDBC URL. The default value prevents the Azure DB Monitoring tool from raising spurious SQL injection alerts against queries. |\n| `maxbinlength` | No | No default | Controls the column length of `BinaryType` columns. This parameter is translated as `VARBINARY(maxbinlength)`. |\n| `identityInsert` | No | false | Setting to `true` enables `IDENTITY_INSERT` mode, which inserts a DataFrame-provided value in the identity column of the Azure Synapse table. 
See [Explicitly inserting values into an IDENTITY column](https:\/\/learn.microsoft.com\/azure\/synapse-analytics\/sql-data-warehouse\/sql-data-warehouse-tables-identity#explicitly-inserting-values-into-an-identity-column). |\n| `externalDataSource` | No | No default | A pre-provisioned external data source to read data from Azure Synapse. An external data source can only be used with PolyBase and removes the CONTROL permission requirement because the connector does not need to create a scoped credential and an external data source to load data. For example usage and the list of permissions required when using an external data source, see [Required Azure Synapse permissions for PolyBase with the external data source option](https:\/\/docs.databricks.com\/archive\/azure\/synapse-polybase.html#dw-polybase-external-data-source-permissions). |\n| `maxErrors` | No | 0 | The maximum number of rows that can be rejected during reads and writes before the loading operation is cancelled. The rejected rows will be ignored. For example, if two out of ten records have errors, only eight records will be processed. See [REJECT\\_VALUE documentation in CREATE EXTERNAL TABLE](https:\/\/learn.microsoft.com\/sql\/t-sql\/statements\/create-external-table-transact-sql?view=azure-sqldw-latest&tabs=dedicated#reject_value--reject_value-1) and [MAXERRORS documentation in COPY](https:\/\/learn.microsoft.com\/sql\/t-sql\/statements\/copy-into-transact-sql?view=azure-sqldw-latest#maxerrors--max_errors). |\n| `inferTimestampNTZType` | No | false | If `true`, values of type Azure Synapse `TIMESTAMP` are interpreted as `TimestampNTZType` (timestamp without time zone) during reads. Otherwise, all timestamps are interpreted as `TimestampType` regardless of the type in the underlying Azure Synapse table. 
| \nNote \n* `tableOptions`, `preActions`, `postActions`, and `maxStrLength` are relevant only when writing data from Databricks to a new table in Azure Synapse.\n* Even though all data source option names are case-insensitive, we recommend that you specify them in \u201ccamel case\u201d for clarity.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query data in Azure Synapse Analytics\n##### Query pushdown into Azure Synapse\n\nThe Azure Synapse connector implements a set of optimization rules\nto push the following operators down into Azure Synapse: \n* `Filter`\n* `Project`\n* `Limit` \nThe `Project` and `Filter` operators support the following expressions: \n* Most boolean logic operators\n* Comparisons\n* Basic arithmetic operations\n* Numeric and string casts \nFor the `Limit` operator, pushdown is supported only when there is no ordering specified. For example: \n`SELECT TOP(10) * FROM table`, but not `SELECT TOP(10) * FROM table ORDER BY col`. \nNote \nThe Azure Synapse connector does not push down expressions operating on strings, dates, or timestamps. \nQuery pushdown built with the Azure Synapse connector is enabled by default. You can disable it by setting `spark.databricks.sqldw.pushdown` to `false`.\n\n#### Query data in Azure Synapse Analytics\n##### Temporary data management\n\nThe Azure Synapse connector *does not* delete the temporary files that it creates in the Azure storage container. Databricks recommends that you periodically delete temporary files under the user-supplied `tempDir` location. \nTo facilitate data cleanup, the Azure Synapse connector does not store data files directly under `tempDir`, but instead creates a subdirectory of the form: `<tempDir>\/<yyyy-MM-dd>\/<HH-mm-ss-SSS>\/<randomUUID>\/`. 
You can set up periodic jobs (using the Databricks [jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) feature or otherwise) to recursively delete any subdirectories that are older than a given threshold (for example, 2 days), with the assumption that there cannot be Spark jobs running longer than that threshold. \nA simpler alternative is to periodically drop the whole container and create a new one with the same name. This requires that you use a dedicated container for the temporary data produced by the Azure Synapse connector and that you can find a time window in which you can guarantee that no queries involving the connector are running.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query data in Azure Synapse Analytics\n##### Temporary object management\n\nThe Azure Synapse connector automates data transfer between a Databricks cluster and an Azure Synapse instance. For reading data from an Azure Synapse table or query or writing data to an Azure Synapse table, the Azure Synapse connector creates temporary objects, including `DATABASE SCOPED CREDENTIAL`, `EXTERNAL DATA SOURCE`, `EXTERNAL FILE FORMAT`, and `EXTERNAL TABLE` behind the scenes. These objects live only throughout the duration of the corresponding Spark job and are automatically dropped. \nWhen a cluster is running a query using the Azure Synapse connector, if the Spark driver process crashes or is forcefully restarted, or if the cluster is forcefully terminated or restarted, temporary objects might not be dropped. To facilitate identification and manual deletion of these objects, the Azure Synapse connector prefixes the names of all intermediate temporary objects created in the Azure Synapse instance with a tag of the form: `tmp_databricks_<yyyy_MM_dd_HH_mm_ss_SSS>_<randomUUID>_<internalObject>`. 
\nWe recommend that you periodically look for leaked objects using queries such as the following: \n* `SELECT * FROM sys.database_scoped_credentials WHERE name LIKE 'tmp_databricks_%'`\n* `SELECT * FROM sys.external_data_sources WHERE name LIKE 'tmp_databricks_%'`\n* `SELECT * FROM sys.external_file_formats WHERE name LIKE 'tmp_databricks_%'`\n* `SELECT * FROM sys.external_tables WHERE name LIKE 'tmp_databricks_%'`\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html"} +{"content":"# What is Delta Lake?\n### Use liquid clustering for Delta tables\n\nDelta Lake liquid clustering replaces table partitioning and `ZORDER` to simplify data layout decisions and optimize query performance. Liquid clustering provides flexibility to redefine clustering keys without rewriting existing data, allowing data layout to evolve alongside analytic needs over time. \nImportant \nDatabricks recommends using Databricks Runtime 15.2 and above for all tables with liquid clustering enabled. Public preview support with limitations is available in Databricks Runtime 13.3 LTS and above. \nNote \nTables with liquid clustering enabled support row-level concurrency in Databricks Runtime 13.3 LTS and above. Row-level concurrency is generally available in Databricks Runtime 14.2 and above for all tables with deletion vectors enabled. See [Isolation levels and write conflicts on Databricks](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html).\n\n### Use liquid clustering for Delta tables\n#### What is liquid clustering used for?\n\nDatabricks recommends liquid clustering for all new Delta tables. 
The following are examples of scenarios that benefit from clustering: \n* Tables often filtered by high cardinality columns.\n* Tables with significant skew in data distribution.\n* Tables that grow quickly and require maintenance and tuning effort.\n* Tables with concurrent write requirements.\n* Tables with access patterns that change over time.\n* Tables where a typical partition key could leave the table with too many or too few partitions.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clustering.html"} +{"content":"# What is Delta Lake?\n### Use liquid clustering for Delta tables\n#### Enable liquid clustering\n\nYou can enable liquid clustering on an existing table or during table creation. Clustering is not compatible with partitioning or `ZORDER`, and requires that you use Databricks to manage all layout and optimization operations for data in your table. After liquid clustering is enabled, run `OPTIMIZE` jobs as usual to incrementally cluster data. See [How to trigger clustering](https:\/\/docs.databricks.com\/delta\/clustering.html#optimize). \nTo enable liquid clustering, add the `CLUSTER BY` phrase to a table creation statement, as in the examples below: \nNote \nIn Databricks Runtime 14.2 and above, you can use DataFrame APIs and DeltaTable API in Python or Scala to enable liquid clustering. 
\n```\n-- Create an empty table\nCREATE TABLE table1(col0 int, col1 string) USING DELTA CLUSTER BY (col0);\n\n-- Using a CTAS statement\nCREATE EXTERNAL TABLE table2 CLUSTER BY (col0) -- specify clustering after table name, not in subquery\nLOCATION 'table_location'\nAS SELECT * FROM table1;\n\n-- Using a LIKE statement to copy configurations\nCREATE TABLE table3 LIKE table1;\n\n``` \n```\nfrom delta.tables import DeltaTable\n\n# Create an empty table\n(DeltaTable.create()\n.tableName(\"table1\")\n.addColumn(\"col0\", dataType = \"INT\")\n.addColumn(\"col1\", dataType = \"STRING\")\n.clusterBy(\"col0\")\n.execute())\n\n# Using a CTAS statement\ndf = spark.read.table(\"table1\")\ndf.write.format(\"delta\").clusterBy(\"col0\").saveAsTable(\"table2\")\n\n# CTAS using DataFrameWriterV2 (create() fails if the target table already exists, so write to a new name)\ndf = spark.read.table(\"table1\")\ndf.writeTo(\"table3\").using(\"delta\").clusterBy(\"col0\").create()\n\n``` \n```\nimport io.delta.tables.DeltaTable\n\n\/\/ Create an empty table\nDeltaTable.create()\n.tableName(\"table1\")\n.addColumn(\"col0\", dataType = \"INT\")\n.addColumn(\"col1\", dataType = \"STRING\")\n.clusterBy(\"col0\")\n.execute()\n\n\/\/ Using a CTAS statement\nval df = spark.read.table(\"table1\")\ndf.write.format(\"delta\").clusterBy(\"col0\").saveAsTable(\"table2\")\n\n\/\/ CTAS using DataFrameWriterV2 (create() fails if the target table already exists, so write to a new name)\nval dfV2 = spark.read.table(\"table1\")\ndfV2.writeTo(\"table3\").using(\"delta\").clusterBy(\"col0\").create()\n\n``` \nWarning \nTables created with liquid clustering enabled have numerous Delta table features enabled at creation and use Delta writer version 7 and reader version 3. You can override the enablement of some of these features. See [Override default feature enablement (optional)](https:\/\/docs.databricks.com\/delta\/clustering.html#override). \nTable protocol versions cannot be downgraded, and tables with clustering enabled are not readable by Delta Lake clients that do not support all enabled Delta reader protocol table features. 
See [How does Databricks manage Delta Lake feature compatibility?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html). \nYou can enable liquid clustering on an existing unpartitioned Delta table using the following syntax: \n```\nALTER TABLE <table_name>\nCLUSTER BY (<clustering_columns>)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clustering.html"} +{"content":"# What is Delta Lake?\n### Use liquid clustering for Delta tables\n#### Override default feature enablement (optional)\n\nYou can override the default behavior that enables Delta table features during liquid clustering enablement. This prevents the reader and writer protocols associated with those table features from being upgraded. You must have an existing table to complete the following steps: \n1. Use `ALTER TABLE` to set the table property that disables one or more features. For example, to disable deletion vectors, run the following: \n```\nALTER TABLE table_name SET TBLPROPERTIES ('delta.enableDeletionVectors' = false);\n\n```\n2. Enable liquid clustering on the table by running the following: \n```\nALTER TABLE <table_name>\nCLUSTER BY (<clustering_columns>)\n\n``` \nThe following table provides information on the Delta features you can override and how enablement impacts compatibility with Databricks Runtime versions. \n| Delta feature | Runtime compatibility | Property to override enablement | Impact of disablement on liquid clustering |\n| --- | --- | --- | --- |\n| Deletion vectors | Reads and writes require Databricks Runtime 12.2 LTS and above. | `'delta.enableDeletionVectors' = false` | Row-level concurrency is disabled, making transactions and clustering operations more likely to conflict. See [Write conflicts with row-level concurrency](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html#rlc-conflicts). `DELETE`, `MERGE`, and `UPDATE` commands might run slower. |\n| Row tracking | Writes require Databricks Runtime 13.3 LTS and above. 
Can be read from any Databricks Runtime version. | `'delta.enableRowTracking' = false` | Row-level concurrency is disabled, making transactions and clustering operations more likely to conflict. See [Write conflicts with row-level concurrency](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html#rlc-conflicts). |\n| Checkpoints V2 | Reads and writes require Databricks Runtime 13.3 LTS and above. | `'delta.checkpointPolicy' = 'classic'` | No impact on liquid clustering behavior. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clustering.html"} +{"content":"# What is Delta Lake?\n### Use liquid clustering for Delta tables\n#### Choose clustering keys\n\nDatabricks recommends choosing clustering keys based on commonly used query filters. Clustering keys can be defined in any order. If two columns are correlated, you only need to add one of them as a clustering key. \nYou can specify up to 4 columns as clustering keys. You can only specify columns with statistics collected for clustering keys. By default, the first 32 columns in a Delta table have statistics collected. See [Specify Delta statistics columns](https:\/\/docs.databricks.com\/delta\/data-skipping.html#stats-cols). \nClustering supports the following data types for clustering keys: \n* Date\n* Timestamp\n* TimestampNTZ (requires Databricks Runtime 14.3 LTS or above)\n* String\n* Integer\n* Long\n* Short\n* Float\n* Double\n* Decimal\n* Byte\n* Boolean \nIf you\u2019re converting an existing table, consider the following recommendations: \n| Current data optimization technique | Recommendation for clustering keys |\n| --- | --- |\n| Hive-style partitioning | Use partition columns as clustering keys. |\n| Z-order indexing | Use the `ZORDER BY` columns as clustering keys. |\n| Hive-style partitioning and Z-order | Use both partition columns and `ZORDER BY` columns as clustering keys. 
|\n| Generated columns to reduce cardinality (for example, date for a timestamp) | Use the original column as a clustering key, and don\u2019t create a generated column. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clustering.html"} +{"content":"# What is Delta Lake?\n### Use liquid clustering for Delta tables\n#### Write data to a clustered table\n\nYou must use a Delta writer client that supports all Delta write protocol table features used by liquid clustering. On Databricks, you must use Databricks Runtime 13.3 LTS and above. \nOperations that cluster on write include the following: \n* `INSERT INTO` operations\n* `CTAS` and `RTAS` statements\n* `COPY INTO` from Parquet format\n* `spark.write.format(\"delta\").mode(\"append\")` \nStructured Streaming writes never trigger clustering on write. Additional limitations apply. See [Limitations](https:\/\/docs.databricks.com\/delta\/clustering.html#limitations). \nClustering on write only triggers when data in the transaction meets a size threshold. These thresholds vary by the number of clustering columns and are lower for Unity Catalog managed tables than other Delta tables. \n| Number of clustering columns | Threshold size for Unity Catalog managed tables | Threshold size for other Delta tables |\n| --- | --- | --- |\n| 1 | 64 MB | 256 MB |\n| 2 | 256 MB | 1 GB |\n| 3 | 512 MB | 2 GB |\n| 4 | 1 GB | 4 GB | \nBecause not all operations apply liquid clustering, Databricks recommends frequently running `OPTIMIZE` to ensure that all data is efficiently clustered.\n\n### Use liquid clustering for Delta tables\n#### How to trigger clustering\n\nTo trigger clustering, you must use Databricks Runtime 13.3 LTS or above. Use the `OPTIMIZE` command on your table, as in the following example: \n```\nOPTIMIZE table_name;\n\n``` \nLiquid clustering is incremental, meaning that data is only rewritten as necessary to accommodate data that needs to be clustered. 
Data files with clustering keys that do not match data to be clustered are not rewritten. \nFor best performance, Databricks recommends scheduling regular `OPTIMIZE` jobs to cluster data. For tables experiencing many updates or inserts, Databricks recommends scheduling an `OPTIMIZE` job every one or two hours. Because liquid clustering is incremental, most `OPTIMIZE` jobs for clustered tables run quickly.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clustering.html"} +{"content":"# What is Delta Lake?\n### Use liquid clustering for Delta tables\n#### Read data from a clustered table\n\nYou can read data in a clustered table using any Delta Lake client that supports reading deletion vectors. For best query results, include clustering keys in your query filters, as in the following example: \n```\nSELECT * FROM table_name WHERE cluster_key_column_name = \"some_value\";\n\n```\n\n### Use liquid clustering for Delta tables\n#### Change clustering keys\n\nYou can change clustering keys for a table at any time by running an `ALTER TABLE` command, as in the following example: \n```\nALTER TABLE table_name CLUSTER BY (new_column1, new_column2);\n\n``` \nWhen you change clustering keys, subsequent `OPTIMIZE` and write operations use the new clustering approach, but existing data is not rewritten. 
\nYou can also turn off clustering by setting the keys to `NONE`, as in the following example: \n```\nALTER TABLE table_name CLUSTER BY NONE;\n\n``` \nSetting cluster keys to `NONE` does not rewrite data that has already been clustered, but prevents future `OPTIMIZE` operations from using clustering keys.\n\n### Use liquid clustering for Delta tables\n#### See how table is clustered\n\nYou can use `DESCRIBE` commands to see the clustering keys for a table, as in the following examples: \n```\nDESCRIBE TABLE table_name;\n\nDESCRIBE DETAIL table_name;\n\n```\n\n### Use liquid clustering for Delta tables\n#### Compatibility for tables with liquid clustering\n\nTables created with liquid clustering in Databricks Runtime 14.1 and above use v2 checkpoints by default. You can read and write tables with v2 checkpoints in Databricks Runtime 13.3 LTS and above. \nYou can disable v2 checkpoints and downgrade table protocols to read tables with liquid clustering in Databricks Runtime 12.2 LTS and above. See [Drop Delta table features](https:\/\/docs.databricks.com\/delta\/drop-feature.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clustering.html"} +{"content":"# What is Delta Lake?\n### Use liquid clustering for Delta tables\n#### Limitations\n\nThe following limitations exist: \n* In Databricks Runtime 15.1 and below, clustering on write does not support source queries that include filters, joins, or aggregations.\n* Structured Streaming workloads do not support clustering-on-write.\n* You cannot create a table with liquid clustering enabled using a Structured Streaming write. 
You can use Structured Streaming to write data to an existing table with liquid clustering enabled.\n* Delta Sharing does not support sharing tables with liquid clustering enabled.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clustering.html"} +{"content":"# AI and Machine Learning on Databricks\n## Reference solutions for machine learning\n#### Natural language processing\n\nYou can perform natural language processing tasks on Databricks using popular open source libraries such as Spark ML and spark-nlp or proprietary libraries through the Databricks partnership with John Snow Labs. \nFor examples of NLP with Hugging Face, see [Additional resources](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/index.html#hugging-face-db)\n\n#### Natural language processing\n##### Feature creation from text using Spark ML\n\nSpark ML contains a range of text processing tools to create features from text columns. You can\ncreate input features from text for model training algorithms directly in your\n[Spark ML pipelines](https:\/\/spark.apache.org\/docs\/latest\/ml-pipeline.html) using Spark ML. Spark ML\nsupports a range of [text processors](https:\/\/spark.apache.org\/docs\/latest\/ml-features.html),\nincluding tokenization, stop-word processing, word2vec, and feature hashing.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/natural-language-processing.html"} +{"content":"# AI and Machine Learning on Databricks\n## Reference solutions for machine learning\n#### Natural language processing\n##### Training and inference using Spark NLP\n\nYou can scale out many deep learning methods for natural language processing on Spark using the\nopen-source Spark NLP library. This library supports standard natural language processing\noperations such as tokenizing, named entity recognition, and vectorization using the included\n[annotators](https:\/\/nlp.johnsnowlabs.com\/docs\/en\/annotators). 
You can also summarize, perform\nnamed entity recognition, translate, and generate text using many pre-trained deep learning\nmodels based on [Spark NLP\u2019s transformers](https:\/\/nlp.johnsnowlabs.com\/docs\/en\/transformers)\nsuch as BERT, T5, and Marian. \n### Perform inference in batch using Spark NLP on CPUs \nSpark NLP provides many pre-trained models you can use with minimal code. This section contains\nan example of using the Marian Transformer for machine translation. For the full set of examples, see\nthe [Spark NLP documentation](https:\/\/nlp.johnsnowlabs.com\/docs\/en\/quickstart). \n#### Requirements \n* Install Spark NLP on the cluster using the latest Maven coordinates for Spark NLP, such as `com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0`. Your cluster must be started with the appropriate Spark configuration options set in order for this library to work.\n* To use Spark NLP, your cluster must have the correct `.jar` file downloaded from John Snow Labs. You can create or use a cluster running [any compatible runtime](https:\/\/nlp.johnsnowlabs.com\/docs\/en\/install#databricks-support). 
\n#### Example code for Machine Translation \nIn a notebook cell, install `sparknlp` python libraries: \n```\n%pip install sparknlp\n\n``` \nConstruct a pipeline for translation and run it on some sample text: \n```\nfrom sparknlp.base import DocumentAssembler\nfrom sparknlp.annotator import SentenceDetectorDLModel, MarianTransformer\nfrom pyspark.ml import Pipeline\n\ndocument_assembler = DocumentAssembler().setInputCol(\"text\").setOutputCol(\"document\")\n\nsentence_detector = SentenceDetectorDLModel.pretrained(\"sentence_detector_dl\", \"xx\") \\\n.setInputCols(\"document\").setOutputCol(\"sentence\")\n\nmarian_transformer = MarianTransformer.pretrained() \\\n.setInputCols(\"sentence\").setOutputCol(\"translation\")\n\npipeline = Pipeline().setStages([document_assembler, sentence_detector, marian_transformer])\n\ndata = spark.createDataFrame([[\"You can use Spark NLP to translate text. \" + \\\n\"This example pipeline translates English to French\"]]).toDF(\"text\")\n\n# Create a pipeline model that can be reused across multiple data frames\nmodel = pipeline.fit(data)\n\n# You can use the model on any data frame that has a \u201ctext\u201d column\nresult = model.transform(data)\n\ndisplay(result.select(\"text\", \"translation.result\"))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/natural-language-processing.html"} +{"content":"# AI and Machine Learning on Databricks\n## Reference solutions for machine learning\n#### Natural language processing\n##### Example: Named-entity recognition model using Spark NLP and MLflow\n\nThe example notebook illustrates how to train a named entity recognition model using Spark NLP,\nsave the model to MLflow, and use the model for inference on text. Refer to the\n[John Snow Labs documentation for Spark NLP](https:\/\/nlp.johnsnowlabs.com\/docs\/en\/training)\nto learn how to train additional natural language processing models. 
\n### Spark NLP model training and inference notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/spark-nlp-training-and-inference-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n#### Natural language processing\n##### Healthcare NLP with John Snow Labs partnership\n\nJohn Snow Labs Spark NLP for Healthcare is a proprietary library for clinical and biomedical\ntext mining. This library provides pre-trained models for recognizing and working with clinical\nentities, drugs, risk factors, anatomy, demographics, and sensitive data. You can try\nSpark NLP for Healthcare using the [Partner Connect integration with John Snow Labs](https:\/\/docs.databricks.com\/partners\/ml\/john-snow-labs.html). You need a trial or paid account with John Snow Labs to try out the commands demonstrated in this guide. \nRead more about the full capabilities of John Snow Labs Spark NLP for Healthcare and documentation for use at their [website](https:\/\/nlp.johnsnowlabs.com\/docs\/en\/license_getting_started).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/natural-language-processing.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Manage connections for Lakehouse Federation\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes how to list all Lakehouse Federation connections defined in a Unity Catalog metastore, get connection details, grant connection permissions, and drop connections using Catalog Explorer and SQL statements in notebooks or the Databricks SQL query editor. 
\nSee also [Create a connection](https:\/\/docs.databricks.com\/query-federation\/index.html#connection).\n\n#### Manage connections for Lakehouse Federation\n##### List connections\n\n**Permissions required**: The list of connections returned depends on your role and permissions. Users with the `USE CONNECTION` privilege on the metastore see all connections. Otherwise, you can view only the connections that you own or on which you have some privilege. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **External Data** menu and select **Connections**. \nThe connections you have permission to see are listed, along with the URL, creation date, owner, and comment. \nRun the following command in a notebook or the Databricks SQL query editor. Optionally, replace `<pattern>` with a [`LIKE` predicate](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/like.html). \n```\nSHOW CONNECTIONS [LIKE <pattern>];\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/connections.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Manage connections for Lakehouse Federation\n##### Get connection details\n\n**Permissions required**: Connection owner, `USE CONNECTION` privilege on the metastore, or some privilege on the connection. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **External Data** menu and select **Connections**.\n3. Find the connection and select it to view details. \nRun the following command in a notebook or the Databricks SQL query editor. 
\n```\nDESCRIBE CONNECTION <connection-name>;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/connections.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Manage connections for Lakehouse Federation\n##### Grant and revoke permissions on connections\n\nYou can grant permission to use a connection to create foreign catalogs or to view details about a connection: \n* `CREATE FOREIGN CATALOG` grants the ability to create a foreign catalog as a read-only mirror of a database in the data source described by the connection.\n* `USE CONNECTION` grants the ability to view details about the connection. \n**Permissions required**: Metastore admin or connection owner. \nTo grant permission to use a connection: \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **External Data** menu and select **Connections**.\n3. Find the connection and select it.\n4. On the **Permissions** tab, click **Grant**.\n5. On the **Grant on `<connection-name>`** dialog, start typing the user or group name or click the user menu to browse and select users and groups.\n6. Select the privileges you want to grant. \nSee the privilege descriptions in the section introduction.\n7. Click **Grant**. \nTo revoke a connection privilege: \n1. Follow the preceding steps to get to the **Permissions** tab for the connection.\n2. Select the user or group whose privilege you want to revoke.\n3. Click **Revoke** both on the tab and on the confirmation dialog. \nTo grant the ability to create a foreign catalog using a connection, run the following command in a notebook or the Databricks SQL query editor. 
\n```\nGRANT CREATE FOREIGN CATALOG ON CONNECTION <connection-name> TO <user-name>;\n\n``` \nTo grant the ability to view the connection, run the following: \n```\nGRANT USE CONNECTION ON CONNECTION <connection-name> TO <user-name>;\n\n``` \nTo revoke a privilege, run the following, where `<privilege>` is one of the privileges on the connection granted to the user: \n```\nREVOKE <privilege> ON CONNECTION <connection-name> FROM <user-name>;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/connections.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Manage connections for Lakehouse Federation\n##### Drop connections\n\n**Permissions required**: Connection owner \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **External Data** menu and select **Connections**.\n3. Find the connection and select it.\n4. Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu (also known as the three-dot menu) and select **Delete**.\n5. On the confirmation dialog, click **Delete**. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nDROP CONNECTION [IF EXISTS] <connection-name>;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/connections.html"} +{"content":"# AI and Machine Learning on Databricks\n### Reference solutions for machine learning\n\nDatabricks provides a rich collection of tools to help you develop machine learning applications. However, finding these tools and understanding how best to apply them to your machine learning problems can be difficult. 
The reference solutions provided on this page show detailed examples of how you can use Databricks for some common machine learning applications.\n\n### Reference solutions for machine learning\n#### Image processing and computer vision\n\nThis article contains a reference solution for distributed image model inference based on a common setup shared by many real-world image applications. \n* [Reference solution for image applications](https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/images-etl-inference.html)\n\n### Reference solutions for machine learning\n#### Recommender system\n\nThis article contains reference solutions for deep-learning-based recommendation models. \n* [Train recommender models](https:\/\/docs.databricks.com\/machine-learning\/train-recommender-models.html)\n\n### Reference solutions for machine learning\n#### Natural language processing\n\nThis article describes performing distributed training and inference for natural language processing applications. \n* [Natural language processing](https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/natural-language-processing.html)\n\n### Reference solutions for machine learning\n#### Data labeling\n\nThis article describes options for labeling data. \n* [Data labeling](https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/data-labeling.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/index.html"} +{"content":"# Get started: Account and workspace setup\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/free-training.html"} +{"content":"# Get started: Account and workspace setup\n### Get free Databricks training\n\nAs a customer, you have access to all Databricks free customer training offerings. These offerings include courses, recorded webinars, and quarterly product roadmap webinars. You can access the material from your Databricks Academy account. Follow these steps to get started: \n1. 
Go to [Databricks Academy](https:\/\/academy.databricks.com\/) and click the red Academy login button in the top navigation. \n* If you\u2019ve logged into Databricks Academy before, use your existing credentials.\n* If you\u2019ve never logged into Databricks Academy, a customer account has been created for you, using your Databricks username, usually your work email address. You must reset your password. It may take up to 24 hours for the training pathway to appear in your account.\n2. After you log into your Databricks Academy account, click ![hamburger menu](https:\/\/docs.databricks.com\/_images\/academy-hamburger.png) in the top left corner. \n* Click **Course Catalog**.\n* The catalogs available to you appear. Databricks Academy organizes groupings of learning content into catalogs, which include courses and learning paths. \nIf you\u2019ve followed the steps above and do not see the pathways in your account, please file a [training support ticket](https:\/\/help.databricks.com\/s\/contact-us?ReqType=training&_gl=1*15jslbo*_gcl_aw*R0NMLjE2NDY0MjQxMjguQ2p3S0NBaUFqb2VSQmhBSkVpd0FZWTNuREtMY0hoVVR4SzRrR2RXLVB1cVV6RTBPMk9DRldYT1N5bXVkYW1FazMxYjd1eHk2MnFNYzNob0MtQndRQXZEX0J3RQ..&_ga=2.215756816.647757455.1645463336-1009786773.1643394912&_gac=1.27354830.1646424129.CjwKCAiAjoeRBhAJEiwAYY3nDKLcHhUTxK4kGdW-PuqUzE0O2OCFWXOSymudamEk31b7uxy62qMc3hoC-BwQAvD_BwE). 
\n**The Databricks documentation** also provides many tutorials and quickstarts that can help you get up to speed on the platform, both here in the Getting Started section and in other sections: \n* [Quickstart](https:\/\/docs.databricks.com\/get-started-index.html)\n* [Apache Spark](https:\/\/docs.databricks.com\/spark\/index.html)\n* [Ingest data into a Databricks lakehouse](https:\/\/docs.databricks.com\/ingestion\/index.html)\n* [Sample datasets](https:\/\/docs.databricks.com\/discover\/databricks-datasets.html)\n* [DataFrames](https:\/\/docs.databricks.com\/getting-started\/dataframes.html)\n* [Delta Lake](https:\/\/docs.databricks.com\/delta\/tutorial.html)\n* [Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/examples.html)\n* [Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/ml-tutorials.html) \n**The [Knowledge Base](https:\/\/kb.databricks.com)** provides troubleshooting tips and answers to frequently asked questions.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/free-training.html"} +{"content":"# Query data\n## Data format options\n#### Text files\n\nYou can process files with the `text` format option to parse each line in any text-based file as a row in a DataFrame. This can be useful for a number of operations, including log parsing. It can also be useful if you need to ingest CSV or JSON data as raw strings. \nFor more information, see [text files](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-text.html).\n\n#### Text files\n##### Options\n\nSee the following Apache Spark reference articles for supported read and write options. 
\n* Read \n+ [Python](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrameReader.text.html?highlight=text#pyspark.sql.DataFrameReader.text)\n+ [Scala](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/sql\/DataFrameReader.html#text(paths:String*):org.apache.spark.sql.DataFrame)\n* Write \n+ [Python](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrameWriter.text.html?highlight=text#pyspark.sql.DataFrameWriter.text)\n+ [Scala](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/sql\/DataFrameWriter.html#text(path:String):Unit)\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/text.html"} +{"content":"# AI and Machine Learning on Databricks\n## MLOps workflows on Databricks\n#### What is MLOps Stacks?\n\nMLOps Stacks automates the creation of infrastructure for an ML project workflow. It sets up the elements required to implement and operate ML for continuous deployment across development, staging, and production environments, including: \n* ML pipelines for model training, deployment, and inference.\n* Feature tables.\n* Release pipeline for production. \nMLOps Stacks is fully integrated into the Databricks CLI and Databricks Asset Bundles, providing a single toolchain for developing, testing, and deploying both data and ML assets on Databricks. The environment created by MLOps Stacks implements the [MLOps workflow recommended by Databricks](https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html). You can customize the code to create stacks to match your organization\u2019s processes or requirements. 
\nThis article explains how MLOps Stacks works and describes the structure of the project created by MLOps Stacks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-stacks.html"} +{"content":"# AI and Machine Learning on Databricks\n## MLOps workflows on Databricks\n#### What is MLOps Stacks?\n##### MLOps Stacks components\n\nA \u201cstack\u201d refers to the set of tools used in a development process. The default MLOps Stack takes advantage of the unified Databricks platform and uses the following tools: \n* ML model development code: [Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html), [MLflow](https:\/\/docs.databricks.com\/mlflow\/index.html)\n* [Databricks Feature Store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html)\n* ML model repository: [MLflow Models in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html)\n* ML model serving: [Databricks Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html)\n* Infrastructure-as-code: [Databricks Asset Bundles](https:\/\/docs.databricks.com\/dev-tools\/bundles\/index.html)\n* Orchestrator: [Databricks Workflows](https:\/\/docs.databricks.com\/workflows\/index.html)\n* CI\/CD: [GitHub Actions](https:\/\/docs.github.com\/actions), [Azure DevOps](https:\/\/learn.microsoft.com\/azure\/devops\/?view=azure-devops)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-stacks.html"} +{"content":"# AI and Machine Learning on Databricks\n## MLOps workflows on Databricks\n#### What is MLOps Stacks?\n##### How does MLOps Stacks work?\n\nYou use the Databricks CLI to create an MLOps Stack. For step-by-step instructions, see [Databricks Asset Bundles for MLOps Stacks](https:\/\/docs.databricks.com\/dev-tools\/bundles\/mlops-stacks.html). 
\nWhen you initiate an MLOps Stacks project, the software steps you through entering the configuration details and then creates a directory containing the files that compose your project. This directory, or stack, implements the production MLOps workflow recommended by Databricks. The components shown in the diagram are created for you, and you need only edit the files to add your custom code. \n![MLOps Stacks component diagram](https:\/\/docs.databricks.com\/_images\/mlops-stacks-components.png) \nYour organization can use the default stack, or customize it as needed to add, remove, or revise components to fit your organization\u2019s practices. See the [GitHub repository readme](https:\/\/github.com\/databricks\/mlops-stacks\/blob\/main\/stack-customization.md) for details. \nMLOps Stacks is designed with a modular structure to allow the different ML teams to work independently on a project while following software engineering best practices and maintaining production-grade CI\/CD. Production engineers configure ML infrastructure that allows data scientists to develop, test, and deploy ML pipelines and models to production. \nAs shown in the green boxes in the diagram, the three Databricks code components of an MLOps Stack are the following: \n* ML code. A data scientist can create code in a Databricks notebook or in a local IDE. You can use GitHub or Azure DevOps for source control. As shown in the diagram, when the project is created, it is in a runnable state with example code. You edit or replace this code with your own code.\n* Resource configurations. These .yml files define individual workflows that comprise the project, such as training and batch inference jobs. They are configured and deployed using Databricks CLI bundles.\nBy defining these resources in .yml files, you can govern, audit, and deploy changes using pull requests instead of untrackable changes made using the UI.\n* CI\/CD workflows. 
Implemented using GitHub Actions or Azure DevOps in conjunction with Databricks Workflows, these workflows test and deploy the ML code (for model training, batch inference, and so on) and the Databricks ML resource configurations across your development, staging, and production workspaces. As with the resource files, these workflows automate all production changes and ensure that only tested code is deployed to production.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-stacks.html"} +{"content":"# AI and Machine Learning on Databricks\n## MLOps workflows on Databricks\n#### What is MLOps Stacks?\n##### MLOps Stacks project structure\n\nAn MLOps Stack uses [Databricks Asset Bundles](https:\/\/docs.databricks.com\/dev-tools\/bundles\/index.html) \u2013 a collection of source files that serves as the end-to-end definition of a project. These source files include information about how they are to be tested and deployed. Collecting the files as a bundle makes it easy to co-version changes and use software engineering best practices such as source control, code review, testing, and CI\/CD. \nThe diagram shows the files created for the default MLOps Stack. For details about the files included in the stack, see the documentation on the [GitHub repository](https:\/\/github.com\/databricks\/mlops-stacks) or [Databricks Asset Bundles for MLOps Stacks](https:\/\/docs.databricks.com\/dev-tools\/bundles\/mlops-stacks.html). 
\n![MLOps Stacks directory structure](https:\/\/docs.databricks.com\/_images\/mlops-directory-structure.png)\n\n#### What is MLOps Stacks?\n##### Resources\n\n[Databricks MLOps Stacks repository on GitHub](https:\/\/github.com\/databricks\/mlops-stacks)\n\n#### What is MLOps Stacks?\n##### Next steps\n\nTo get started, see [Databricks Asset Bundles for MLOps Stacks](https:\/\/docs.databricks.com\/dev-tools\/bundles\/mlops-stacks.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-stacks.html"} +{"content":"# Connect to data sources\n### Connect to cloud object storage using Unity Catalog\n\nThis article gives an overview of the cloud storage connection configurations that are required to work with data using Unity Catalog. \nDatabricks recommends using Unity Catalog to manage access to all data stored in cloud object storage. Unity Catalog provides a suite of tools to configure secure connections to cloud object storage. These connections provide access to complete the following actions: \n* Ingest raw data into a lakehouse.\n* Create and read managed tables in secure cloud storage.\n* Register or create external tables containing tabular data.\n* Read and write unstructured data. \nWarning \n**Do not give end users storage-level access to Unity Catalog managed tables or volumes. This compromises data security and governance.** \nAvoid granting users direct access to Amazon S3 or Cloudflare R2 buckets that are used as Unity Catalog managed storage. The only identity that should have access to data managed by Unity Catalog is the identity used by Unity Catalog. Ignoring this guidance creates the following issues in your environment: \n* Access controls established in Unity Catalog can be circumvented by users who have direct access to S3 or R2 buckets.\n* Auditing, lineage, and other Unity Catalog monitoring features will not capture direct access.\n* The lifecycle of data is broken. 
That is, modifying, deleting, or evolving tables in Databricks will break the consumers that have direct access to storage, and writes outside of Databricks could result in data corruption. \nNote \nIf your workspace was created before November 8, 2023, it might not be enabled for Unity Catalog. An account admin must enable Unity Catalog for your workspace. See [Enable a workspace for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html"} +{"content":"# Connect to data sources\n### Connect to cloud object storage using Unity Catalog\n#### How does Unity Catalog connect object storage to Databricks?\n\nDatabricks on AWS supports both AWS S3 and Cloudflare R2 buckets (Public Preview) as cloud storage locations for data assets registered in Unity Catalog. R2 is intended primarily for use cases in which you want to avoid data egress fees, such as Delta Sharing across clouds and regions. For more information, see [Use Cloudflare R2 replicas or migrate storage to R2](https:\/\/docs.databricks.com\/data-sharing\/manage-egress.html#r2). \nTo manage access to the underlying cloud storage that holds tables and volumes, Unity Catalog uses the following object types: \n* A *storage credential* represents an authentication and authorization mechanism for accessing data stored on your cloud tenant, using an IAM role for S3 buckets or an R2 API token for Cloudflare R2 buckets. Each storage credential is subject to Unity Catalog access-control policies that control which users and groups can access the credential. If a user does not have access to a storage credential in Unity Catalog, the request fails and Unity Catalog does not attempt to authenticate to your cloud tenant on the user\u2019s behalf. Permission to create storage credentials should only be granted to users who need to define external locations. 
See [Create a storage credential for connecting to AWS S3](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html) and [Create a storage credential for connecting to Cloudflare R2](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials-r2.html).\n* An *external location* is an object that combines a cloud storage path with a storage credential that authorizes access to the cloud storage path. Each external location is subject to Unity Catalog access-control policies that control which users and groups can access the location. If a user does not have access to an external location in Unity Catalog, the request fails and Unity Catalog does not attempt to authenticate to your cloud tenant on the user\u2019s behalf. Permission to create and use external locations should only be granted to users who need to create external tables, external volumes, or managed storage locations. See [Create an external location to connect cloud storage to Databricks](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html). \nExternal locations are used both for external data assets, like *external tables* and *external volumes*, and for *managed* data assets, like *managed tables* and *managed volumes*. For more information about the difference, see [Tables](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#table) and [Volumes](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#volume). \nWhen an external location is used for storing *managed tables* and *managed volumes*, it is called a *managed storage location*. Managed storage locations can exist at the metastore, catalog, or schema level. Databricks recommends configuring managed storage locations at the catalog level. If you need more granular isolation, you can specify managed storage locations at the schema level. 
Workspaces that are automatically enabled for Unity Catalog have no metastore-level storage by default, but you can specify a managed storage location at the metastore level to provide a default location when no catalog-level storage is defined. Workspaces that are manually enabled for Unity Catalog receive a metastore-level managed storage location by default. See [Specify a managed storage location in Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/managed-storage.html) and [Unity Catalog best practices](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html). \n*Volumes* are the securable object that most Databricks users should use to interact directly with non-tabular data in cloud object storage. See [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html). \nNote \nWhile Unity Catalog supports path-based access to external tables and external volumes using cloud storage URIs, Databricks recommends that users read and write all Unity Catalog tables using table names and access data in volumes using `\/Volumes` paths.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html"} +{"content":"# Connect to data sources\n### Connect to cloud object storage using Unity Catalog\n#### Next steps\n\nIf you\u2019re just getting started with Unity Catalog as an admin, see [Set up and manage Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html). 
\nIf you\u2019re a new user and your workspace is already enabled for Unity Catalog, see [Tutorial: Create your first table and grant privileges](https:\/\/docs.databricks.com\/getting-started\/create-table.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html"} +{"content":"# DatabricksIQ-powered features\n### Use Databricks Assistant\n\nPreview \nThis feature is currently in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nDatabricks Assistant is a context-aware AI assistant that can help you with Databricks notebooks, SQL editor, jobs, Lakeview dashboards, and file editor. Databricks Assistant assists you with data and code, when you ask for help using a conversational interface.\n\n### Use Databricks Assistant\n#### What you can do with Databricks Assistant\n\nDatabricks Assistant can help with the following tasks: \n* Generate, debug, optimize, and explain code.\n* Create visualizations from data. See [Create visualizations with Databricks Assistant](https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-w-db-assistant.html).\n* Debug jobs.\n* Code and edit SQL queries.\n* Find relevant help in Databricks documentation. \nThe assistant uses Unity Catalog metadata to understand your tables, columns, descriptions, and popular data assets across your company to provide personalized responses.\n\n### Use Databricks Assistant\n#### Enable Databricks Assistant\n\nDatabricks Assistant is enabled by default. See [Enable or disable Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html#enable-or-disable).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/use-databricks-assistant.html"} +{"content":"# DatabricksIQ-powered features\n### Use Databricks Assistant\n#### Use Databricks Assistant in a notebook cell\n\nIn a notebook, Databricks Assistant is available in the Assistant pane or inline in a code cell. 
\nTo use Databricks Assistant directly in a code cell, press **Cmd + i** on macOS or **Ctrl + i** on Windows. A text box appears in the cell. You can type a question or comment in English and then press **Enter** (not **Shift+Enter**, as you would to run a cell) to have the assistant generate a response. \n![Inline assistant helps you locate and add enrichment data.](https:\/\/docs.databricks.com\/_images\/inline-assistant-example-weather.png) \n### Cell action prompts \nPrompt shortcuts help you create common prompts. \n| Prompt text | What the Assistant does |\n| --- | --- |\n| `\/` | Displays common commands |\n| `\/doc` | Comments the code in a diff view |\n| `\/explain` | Provides an explanation of the code in a cell |\n| `\/fix` | Proposes a fix to any code errors in a diff view | \nWhen you use `\/fix` or `\/doc`, select **Accept** in the diff window to accept the proposed changes or **Reject** to keep the original code. If you accept the proposed code, the code does not automatically run. You can review the code before running it. If the generated code is not what you wanted, try again by adding more details or information to your comment. See [Tips for using Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/use-databricks-assistant.html#tips). \nFor code autocomplete, performance may be better using the Assistant pane than in a notebook cell. \nThe Assistant closes automatically if you **Accept** or **Reject** the code it generated.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/use-databricks-assistant.html"} +{"content":"# DatabricksIQ-powered features\n### Use Databricks Assistant\n#### Use the Assistant pane\n\nThis section describes the default experience of the Assistant pane. If you enabled the **New Assistant** experience that tracks query threads and history throughout editor contexts, see [Threads and query history](https:\/\/docs.databricks.com\/notebooks\/use-databricks-assistant.html#new-assistant). 
\nTo open the Assistant pane, click ![Databricks assistant icon.](https:\/\/docs.databricks.com\/_images\/assistant-icon.png) in the left sidebar. \n![Screenshot of assistant pane in use.](https:\/\/docs.databricks.com\/_images\/assistant-pane-example-weather.png) \nType questions in the text box at the bottom of the Assistant pane and press **Enter** or click ![Enter assistant text.](https:\/\/docs.databricks.com\/_images\/assistant-arrow.png) at the right of the text box. The Assistant displays its answer. The following screenshot shows actions you can take after the Assistant has generated code in the Assistant pane. \n![Icons at the top of the code box in the assistant pane.](https:\/\/docs.databricks.com\/_images\/assistant-pane-codebox-header.png) \nYou can run the same query again to generate another answer. To do so, hover your cursor over the answer and click ![Regenerate query icon.](https:\/\/docs.databricks.com\/_images\/regenerate.png). \nTo close the pane, click the icon again or click ![Close assistant icon.](https:\/\/docs.databricks.com\/_images\/close-assistant-cell.png) in the upper-right corner of the cell. You can expand the pane to full width by clicking ![The open full width icon.](https:\/\/docs.databricks.com\/_images\/open-full-width.png); click ![close full width icon](https:\/\/docs.databricks.com\/_images\/exit-focus-icon.png) to return the pane to default width. \n![Icons at the top of the assistant pane.](https:\/\/docs.databricks.com\/_images\/assistant-pane-header.png) \nThe Assistant pane keeps track of your conversations even if you close the pane or notebook. To clear previous conversations, click ![Clear assistant icon.](https:\/\/docs.databricks.com\/_images\/clear-assistant-icon.png) at the upper-right of the Assistant pane. 
\n### Threads and query history \nIf you [enabled the **New Assistant** experience](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html#enable-or-disable), conversation threads persist across the different contexts where Databricks Assistant is available. From the Assistant pane, you can create new conversation threads, view query history, and manage your Databricks Assistant experience. \n![Icons at the top of the assistant pane.](https:\/\/docs.databricks.com\/_images\/assistant-preview-buttons.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/use-databricks-assistant.html"} +{"content":"# DatabricksIQ-powered features\n### Use Databricks Assistant\n#### AI-based autocomplete\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nAI-based autocomplete provides inline code suggestions as you type in Databricks notebooks, the SQL editor, and the file editor. Inline code suggestions are available for Python and SQL. \n### Enable and disable inline code suggestions \nEnable or disable the feature for yourself: \n1. Click your profile icon in the top bar, and then click **Settings**.\n2. From the list on the left, click **User** > **Developer**.\n3. Under **Experimental features**, turn **Databricks Assistant autocomplete** on or off. \nNote \nInline code suggestions are not available in AWS GovCloud regions or workspaces with FedRAMP compliance. \n### Get inline code suggestions: Python and SQL examples \nAs you type, suggestions automatically appear. Press **Tab** to accept a suggestion. To manually trigger a suggestion, press **Option + Shift + Space** (on macOS) or **Control + Shift + Space** (on Windows). 
\n![Animated GIF of code completion for SQL.](https:\/\/docs.databricks.com\/_images\/code-complete-sql.gif) \n![Animated GIF of code completion for Python.](https:\/\/docs.databricks.com\/_images\/code-complete-python.gif) \nAI-based autocomplete can also generate code from comments: \n![Animated GIF of code completion from a comment.](https:\/\/docs.databricks.com\/_images\/code-complete-from-comment.gif)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/use-databricks-assistant.html"} +{"content":"# DatabricksIQ-powered features\n### Use Databricks Assistant\n#### Debug code: Python and SQL examples\n\nTo use Databricks Assistant to fix code, do any of the following: \n* Ask a question in the Assistant pane.\n* Click the **Diagnose Error** button that appears in the cell results when an error occurs.\n* Click **Debug** to interactively step through the code line-by-line, set breakpoints, inspect variables, and analyze the program\u2019s execution. \nThe tabs below show examples in Python and SQL code: \n![Assistant debugging example in Python.](https:\/\/docs.databricks.com\/_images\/diagnose-error.png) \n![Assistant debugging example in SQL.](https:\/\/docs.databricks.com\/_images\/diagnose-error-sql.png)\n\n### Use Databricks Assistant\n#### Explain code\n\nDatabricks Assistant can provide detailed explanations of code snippets. Use the `\/explain` prompt and include terms like \u201cbe concise\u201d or \u201cexplain code line-by-line\u201d to request the level of detail that you want. You can also ask Databricks Assistant to add comments to the code.\n\n### Use Databricks Assistant\n#### Get information from Databricks documentation\n\nDatabricks Assistant can help answer questions based on Databricks documentation directly from the notebook editor. 
\n![Answer a question based on Databricks documentation.](https:\/\/docs.databricks.com\/_images\/documentation-question.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/use-databricks-assistant.html"} +{"content":"# DatabricksIQ-powered features\n### Use Databricks Assistant\n#### Tips for using Databricks Assistant\n\nThis section includes some general tips and best practices when using Databricks Assistant. \n### Databricks Assistant uses context to provide better answers \nDatabricks Assistant has access to table and column schemas and metadata. This allows you to use natural language and generate more accurate queries. For example, if a table has a column called **State**, you can ask Databricks Assistant to generate a list of users who live in Michigan. \nDatabricks Assistant uses the following context: \n* Code or queries in the current notebook cell or Databricks SQL editor tab.\n* Table and Column names and descriptions.\n* Previous prompt questions.\n* Favorite and active tables.\n* For the **diagnose error** feature, the stack trace from the error output. \nWhen selecting columns from a DataFrame, you can get more accurate results by providing a starting query. For example, provide a statement like `SELECT * FROM <table_name>`. This allows Databricks Assistant to get the column names and not have to guess. \nBecause Databricks Assistant uses your conversation history to provide better and more accurate answers, you can ask Databricks Assistant to alter the output of a previous response without having to rewrite the entire prompt. You can use the Assistant\u2019s chat history to iteratively clean, explore, filter, and slice DataFrames in the Assistant pane. \n### Be specific \nThe structure and detail that Databricks Assistant provides varies from time to time, even for the same prompt. Try to provide the assistant as much guidance as you can to help it return the information you want in the desired format, level of detail, and so on. 
For example: \n* \u201cExplain this code in a couple sentences\u201d or \u201cExplain this code line-by-line\u201d.\n* \u201cCreate a visualization using MatPlotLib\u201d or \u201cCreate a visualization using Seaborn\u201d. \n### Give examples of row-level data values \nBecause Databricks Assistant does not use row-level data, you might need to provide more detail to prompts to get the most accurate answer. Use table or column comments in Catalog Explorer to add a line of sample data. For example, suppose your height column is in the format `feet`-`inches`. To help the assistant interpret the data, add a comment such as \u201cThe height column is in string format and is separated by a hyphen. Example: \u20186-2\u2019.\u201d For information about table and column comments, see [Document data in Catalog Explorer using markdown comments](https:\/\/docs.databricks.com\/catalog-explorer\/markdown-data-comments.html). \nIf you need to use column data type conversions to run an operation, you might need to provide details. For example: \u201cconvert this code from pandas to PySpark, including the code needed to convert the pandas DataFrame to a PySpark DataFrame and changing the data type of column churn from boolean to integer\u201d. \n### Use Shift+Enter to add a new line in the chat text box \nYou can use **Shift+Enter** to add a new line in the Assistant chat text box. This makes it easy to format and organize your messages to Databricks Assistant. \n### Edit and run code in Databricks Assistant chat pane \nYou can run code in the Assistant pane to validate it or use it as a scratchpad. To run code, click ![run code icon](https:\/\/docs.databricks.com\/_images\/run-code.png) in the upper-left corner of the code box in the Assistant pane. 
\nThe tabs below show examples for Python and SQL code: \n![run code in assistant pane](https:\/\/docs.databricks.com\/_images\/run-code.gif) \n![run code in assistant pane](https:\/\/docs.databricks.com\/_images\/run-code-sql.gif) \nWhen you run code in the Assistant pane, output is displayed and the variables become usable in the notebook. \nYou can also edit the code that Databricks Assistant generates directly in the Assistant chat box before moving the code to the notebook.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/use-databricks-assistant.html"} +{"content":"# DatabricksIQ-powered features\n### Use Databricks Assistant\n#### Additional information\n\nThe following articles contain additional information about using Databricks Assistant: \n* [What is Databricks Assistant?](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html)\n* [DatabricksIQ trust and safety](https:\/\/docs.databricks.com\/databricksiq\/databricksiq-trust.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/use-databricks-assistant.html"} +{"content":"# Security and compliance guide\n### Authentication and access control\n\nThis article introduces authentication and access control in Databricks. For information about securing access to your data, see [Data governance with Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/index.html). \nFor more information on how to best configure users and groups in Databricks, see [Identity best practices](https:\/\/docs.databricks.com\/admin\/users-groups\/best-practices.html).\n\n### Authentication and access control\n#### Single sign-on\n\nSingle sign-on enables you to authenticate your users using your organization\u2019s identity provider. Databricks recommends configuring SSO for greater security and improved usability. Once SSO is configured, you can enable fine-grained access control, such as multi-factor authentication, via your identity provider. 
Unified login allows you to manage one SSO configuration in your account that is used for the account and Databricks workspaces. If your account was created before June 21, 2023, you can also manage SSO individually on your account and workspaces. See [SSO in your Databricks account console](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/single-sign-on\/index.html) and [Set up SSO for your workspace](https:\/\/docs.databricks.com\/admin\/users-groups\/single-sign-on\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/index.html"} +{"content":"# Security and compliance guide\n### Authentication and access control\n#### Sync users and groups from your identity provider using SCIM provisioning\n\nYou can use [SCIM](http:\/\/www.simplecloud.info\/), or System for Cross-domain Identity Management, an open standard that allows you to automate user provisioning, to sync users and groups automatically from your identity provider to your Databricks account. SCIM streamlines onboarding a new employee or team by using your identity provider to create users and groups in Databricks and give them the proper level of access. When a user leaves your organization or no longer needs access to Databricks, admins can terminate the user in your identity provider, and that user\u2019s account is also removed from Databricks. This ensures a consistent offboarding process and prevents unauthorized users from accessing sensitive data. For more information, see [Sync users and groups from your identity provider](https:\/\/docs.databricks.com\/admin\/users-groups\/scim\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/index.html"} +{"content":"# Security and compliance guide\n### Authentication and access control\n#### Secure API authentication\n\nDatabricks personal access tokens are one of the most well-supported types of credentials for resources and operations at the Databricks workspace level. 
In order to secure API authentication, workspace admins can control which users, service principals, and groups can create and use Databricks personal access tokens. \nDatabricks users can also access REST APIs using their Databricks username and password (basic authentication). In accounts where unified login is disabled, workspace admins can use password access control to grant and revoke the ability for specific users to use basic authentication. \nFor more information, see [Manage access to Databricks automation](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html). \nWorkspace admins can also review Databricks personal access tokens, delete tokens, and set the maximum lifetime of new tokens for their workspace. See [Monitor and manage personal access tokens](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html). \nFor more information on authenticating to Databricks automation, see [Authentication for Databricks automation - overview](https:\/\/docs.databricks.com\/dev-tools\/auth\/index.html).\n\n### Authentication and access control\n#### Access control overview\n\nIn Databricks, there are different access control systems for different securable objects. The table below shows which access control system governs which type of securable object. \n| Securable object | Access control system |\n| --- | --- |\n| Workspace-level securable objects | Access control lists |\n| Account-level securable objects | Account role based access control |\n| Data securable objects | Unity Catalog | \nDatabricks also provides admin roles and entitlements that are assigned directly to users, service principals, and groups. 
\nFor information about securing data, see [Data governance with Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/index.html"} +{"content":"# Security and compliance guide\n### Authentication and access control\n#### Access control lists\n\nIn Databricks, you can use access control lists (ACLs) to configure permission to access workspace objects such as notebooks and SQL Warehouses. All workspace admin users can manage access control lists, as can users who have been given delegated permissions to manage access control lists. For more information on access control lists, see [Access control lists](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html).\n\n### Authentication and access control\n#### Account role based access control\n\nYou can use account role based access control to configure permission to use account-level objects such as service principals and groups. Account roles are defined once, in your account, and apply across all workspaces. All account admin users can manage account roles, as can users who have been given delegated permissions to manage them, such as group managers and service principal managers. \nFollow these articles for more information on account roles on specific account-level objects: \n* [Roles for managing service principals](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/service-principal-acl.html)\n* [Manage roles on a group using the account console](https:\/\/docs.databricks.com\/admin\/users-groups\/groups.html#manage-group-roles-account)\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/index.html"} +{"content":"# Security and compliance guide\n### Authentication and access control\n#### Databricks admin roles\n\nIn addition to access control on securable objects, there are built-in roles on the Databricks platform. 
Users, service principals, and groups can be assigned roles. \nThere are two main levels of admin privileges available on the Databricks platform: \n* [Account admins](https:\/\/docs.databricks.com\/admin\/index.html#account-admins): Manage the Databricks account, including workspace creation, user management, cloud resources, and account usage monitoring. \n* [Workspace admins](https:\/\/docs.databricks.com\/admin\/index.html#workspace-admins): Manage workspace identities, access control, settings, and features for individual workspaces in the account. \nAdditionally, users can be assigned these feature-specific admin roles, which have narrower sets of privileges: \n* [Marketplace admins](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#marketplace-admin): Manage their account\u2019s Databricks Marketplace provider profile, including creating and managing Marketplace listings.\n* [Metastore admins](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#admin-roles): Manage privileges and ownership for all securable objects within a Unity Catalog metastore, such as who can create catalogs or query a table. \nUsers can also be assigned to be workspace users. A workspace user has the ability to log in to a workspace, where they can be granted workspace-level permissions. \nFor more information, see [Assigning admin roles](https:\/\/docs.databricks.com\/admin\/users-groups\/index.html#assigning-admin-roles). \n### Workspace entitlements \nAn entitlement is a property that allows a user, service principal, or group to interact with Databricks in a specified way. Workspace admins assign entitlements to users, service principals, and groups at the workspace level. 
For more information, see [Manage entitlements](https:\/\/docs.databricks.com\/security\/auth-authz\/entitlements.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/index.html"} +{"content":"# What is Databricks Marketplace?\n### Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)\n\nThis article describes how to access data products in Databricks Marketplace if you have a Databricks workspace that is enabled for Unity Catalog. \nNote \nIf you don\u2019t have a Databricks workspace that is enabled for Unity Catalog, you can access shared Marketplace data products using Delta Sharing open sharing connectors. See [Access data products in Databricks Marketplace using external platforms](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer-open.html).\n\n### Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)\n#### Overview\n\nDatabricks Marketplace gives you, as a data consumer, a secure platform for discovering data products that your organization needs to be successful. Databricks Marketplace uses Delta Sharing to provide security and control over shared data. Consumers can access public data, free sample data, and commercialized data offerings. Consumers who use a Unity Catalog-enabled Databricks workspace are not limited to accessing tabular data, but can also access volumes (non-tabular data), AI models, Databricks notebooks, and Databricks Solution Accelerators. 
\nWhen you consume Marketplace data products using a Databricks workspace that is enabled for Unity Catalog, you can take advantage of the deep integration between Delta Sharing and Unity Catalog, along with Unity Catalog governance, auditing, and convenient interfaces.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html"} +{"content":"# What is Databricks Marketplace?\n### Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)\n#### Before you begin\n\nTo browse data product listings on Databricks Marketplace, you can use either of the following: \n* The [Open Marketplace](https:\/\/marketplace.databricks.com).\n* A Databricks workspace. \nTo consume data products using a Databricks workspace that is enabled for Unity Catalog, you must have the following: \n* A Databricks account on the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n* A Databricks workspace that is enabled for Unity Catalog (of course). See [Enable a workspace for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html). \nIf you don\u2019t have one, you can get a free trial. Click **Try for free** on the [Open Marketplace](https:\/\/marketplace.databricks.com) and follow the prompts to start your trial. \nImportant \nYou don\u2019t need to enable all of your workspaces for Unity Catalog. You can create a new one and enable it for Unity Catalog, using that workspace to receive Marketplace data products. If this option is not available, use the Marketplace on external platforms option. See [Access data products in Databricks Marketplace using external platforms](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer-open.html). 
\nTo learn how to enable a workspace for Unity Catalog, see [Set up and manage Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html).\n* `USE MARKETPLACE ASSETS` privilege on the Unity Catalog metastore attached to the workspace. See [Privilege types that apply only to Delta Sharing or Databricks Marketplace](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html#delta-sharing). This privilege is enabled for all users on all Unity Catalog metastores by default. \nIf your admin has disabled this privilege, you can request that they grant it to you or that they grant you either of these: \n+ `CREATE CATALOG` and `USE PROVIDER` permissions on the Unity Catalog metastore.\n+ [Metastore admin role](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html). \nIf you do not have any of these privileges, you can still view Marketplace listings but cannot access data products using Unity Catalog. \nFor more information, see [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html) and [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html). See also [Disable Marketplace access](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html#revoke-use-assets).\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html"} +{"content":"# What is Databricks Marketplace?\n### Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)\n#### Browse Databricks Marketplace listings\n\nTo find a data product you want, simply browse or search the data product listings in Databricks Marketplace. 
\nNote \nAs an alternative to the instructions that follow, you can search for Marketplace listings using the global search bar at the top of your Databricks workspace. See [Search for workspace objects](https:\/\/docs.databricks.com\/search\/index.html). You can also view and request free sample data on the **Add data** page. In the workspace sidebar, click **Data ingestion** and scroll down to **Free sample data from Databricks Marketplace**. \n1. Go to [marketplace.databricks.com](https:\/\/marketplace.databricks.com) or log into your Databricks workspace and click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**.\n2. Browse or search for the data product that you want. \nYou can filter listings by product type (dataset, [solution accelerator](https:\/\/www.databricks.com\/solutions\/accelerators), or ML model), provider name, category, cost (free or paid), or keyword search. \nIf you are logged into a Databricks workspace, you can also choose to view only the private listings available to you as part of a private exchange. See [Participate in private exchanges](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html#private-exchange).\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html"} +{"content":"# What is Databricks Marketplace?\n### Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)\n#### Request access to data products in the Marketplace\n\nTo request access to data products, you must be logged into a Databricks workspace. Some data products are available immediately, and others require provider approval and transaction completion using provider interfaces. \n### Requirements \nSee [Before you begin](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html#before). 
To access data products in the Marketplace, you must have at least the `USE MARKETPLACE ASSETS` privilege on the Unity Catalog metastore that is attached to the workspace that you are using. \n### Get access to data products that are instantly available \nSome data products are available instantly, requiring only that you request them and agree to terms. These are listed under the **Free and instantly available** heading on the Marketplace landing page, are identified on the listing tile as **Free**, and are identified as **Instantly available** on the listing detail page. \n1. When you\u2019ve found a listing you\u2019re interested in on the Marketplace landing page, click the listing to open the listing detail page.\n2. Click the **Get instant access** button and accept the Databricks terms and conditions. \nAccessing Databricks Solution Accelerators works a little differently. See [Get access to Databricks Solution Accelerators](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html#sa).\n3. (Optional) Under **More options**, modify the suggested catalog name. \nThe catalog name is displayed in Catalog Explorer in your Databricks workspace, and it is used in the three-part namespace (catalog.schema.table|volume|view) in queries. You can change the default name.\n4. Click the **Get instant access** button.\n5. Click the **Open** button to view the data product, which appears as a read-only catalog in Catalog Explorer. \nFor information about managing access to catalogs in Databricks, see [Access the shared data using Unity Catalog](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html#access). \nIf sample notebooks are available, they appear under the **Sample notebook** heading in the listing. \nTo view a notebook, click the **Preview notebook** button. To import a notebook into your Databricks workspace so that you can run it, click **Preview notebook** and then click **Import Notebook**. 
See [Introduction to Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html). \n### Request data products that require provider approval \nSome data products require provider approval, typically because a commercial transaction is involved, or the provider might prefer to customize data products for you. These listings are identified on the listing detail page as **By request** and include a **Request access** button. \n1. When you\u2019ve found a listing you\u2019re interested in on the Marketplace landing page, click the listing to open the listing detail page.\n2. Click the **Request access** button.\n3. Enter your name, company, and a brief description of your intended use for the data product.\n4. Accept the Databricks terms and conditions and click **Request access**.\n5. You will be notified by email when the provider has completed their review of your request. \nYou can also monitor the progress of your request on the My Requests page in Marketplace. See [Manage shared Databricks Marketplace data products](https:\/\/docs.databricks.com\/marketplace\/manage-requests-consumer.html). However, any transactions that follow will use provider communications and payment platforms. No commercial transactions are handled directly on Databricks Marketplace.\n6. When your transaction is complete, the data provider will make the data product available to you as a read-only catalog in your workspace. \nFor information about managing access to catalogs in Databricks, see [Access the shared data using Unity Catalog](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html#access). \n### Get access to Databricks Solution Accelerators \nUnlike other data assets, Databricks [Solution Accelerators](https:\/\/www.databricks.com\/solutions\/accelerators) are shared by cloning Git repositories and making them available in [Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html). To access a Solution Accelerator: \n1. 
In Marketplace, click the Solution Accelerator listing to open the listing detail page.\n2. Click the **Get instant access** button.\n3. On the **Add Git folder** dialog, enter a name for the repository. This name will appear in Databricks Git folders UIs.\n4. Accept the Databricks terms and conditions and click **Create Git folder**.\n5. Click the **Open** button and select **Git folder** to view the repo in the workspace file browser.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html"} +{"content":"# What is Databricks Marketplace?\n### Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)\n#### Access the shared data using Unity Catalog\n\nAccess to the data in your Databricks workspace depends on the data product type: \n* Marketplace datasets and models are shared with you in a read-only catalog in Databricks. Catalogs are the top-level container for data assets that are managed by Unity Catalog. For more information about the data object hierarchy in Unity Catalog, see [The Unity Catalog object model](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#object-model). \nOnce the provider has shared the data with you, you have a few ways of accessing the catalog. The sections that follow describe some of these access options.\n* Notebooks are shared directly in the Marketplace listing, and you can import them from the listing into your workspace.\n* Solution Accelerators are shared as Databricks Git folders. See [Get access to Databricks Solution Accelerators](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html#sa). \n### Access shared datasets in the Marketplace \nTo access shared data from the Marketplace: \n1. In the sidebar, click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**.\n2. On the Marketplace landing page, click **My requests** in the upper right-hand corner.\n3. 
On the **Installed data products** tab, find the data product, click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu (also known as the three-dot menu) at the end of the data product row, and select **View data**. \nYou can also click the data product name to open the data product details page, where you can click the **Open** button to view the data. \nCatalog Explorer opens to the catalog that contains the data set, where you can access the data or manage access for other team members. See [Grant access to other team members](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html#grant-access). \nTo learn more about accessing shared data in Databricks, see [Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html). \n### Access shared datasets in Catalog Explorer \nTo access shared data directly from Catalog Explorer: \n1. In the sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. If you know the name of the catalog that holds the shared data, search for it and select it. \nIf you don\u2019t know the catalog name but you do know the provider name, you can find the catalog in Catalog Explorer by doing the following: \n1. In the sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane of Catalog Explorer, click **Delta Sharing** and then **Shared with me**.\n3. On the **Providers** tab, click the provider name.\n4. On the **Shares** tab, find the catalog name and click it to open it. \n### Access shared datasets using the Databricks CLI or SQL statements \nYou can also find and access the catalog that contains the shared data using the Databricks CLI or SQL statements in a Databricks notebook or Databricks SQL editor query. 
For details, see [Access data in a shared table or volume](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html#access-data). You can skip the sections that describe how to create a catalog, since Databricks Marketplace does this for you.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html"} +{"content":"# What is Databricks Marketplace?\n### Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)\n#### Grant access to other team members\n\nIf you are the user who requested the shared data, you are the owner of the catalog that contains that data in your workspace. As such, you can grant your team members access to the catalog and refine access at the schema, table, view, row, and column level, just as you do any data in Unity Catalog. See [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html). That said, table and view data under a shared catalog is read-only, which means that you can only grant your team read operations like `DESCRIBE`, `SHOW`, and `SELECT`. \nYou can also transfer ownership of the catalog or the objects inside it.\n\n### Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)\n#### View sample notebooks\n\nSome listings include sample notebooks on the listing details page. To access these notebooks for instantly available listings that have been shared with you: \n1. In the sidebar, click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**.\n2. On the Marketplace landing page, click **My requests** in the upper right-hand corner.\n3. On the **Installed data products** tab, find the data product and click the data product name to open the listing details page. If there is a sample notebook, it appears under the **Sample notebook** heading in the listing.\n4. 
Click **Preview notebook** to view the notebook, and click **Import notebook** to import it to your Databricks workspace. \nNote \nThe **Sample notebooks** display and preview in the listings UI does not work in Chrome Incognito mode.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html"} +{"content":"# What is Databricks Marketplace?\n### Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)\n#### Participate in private exchanges\n\nSome data providers might want to share certain data with a limited set of consumers who are invited to be part of a *private exchange*. You can find private exchange listings by selecting the **Private exchange** checkbox on the Marketplace home page. Just as you can with public listings, you can access free listings instantly or request access to those that are marked **By request**. \nTo join a private exchange, a data provider needs to invite your organization. When they do, they will request a shared identifier for your Unity Catalog metastore. To learn how to get your metastore\u2019s sharing identifier, see step 1 in [Get access in the Databricks-to-Databricks model](https:\/\/docs.databricks.com\/data-sharing\/recipient.html#get-access-db-to-db). \nTo learn more about private exchanges, see [Create and manage private exchanges in Databricks Marketplace](https:\/\/docs.databricks.com\/marketplace\/private-exchange.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html"} +{"content":"# What is Databricks Marketplace?\n### Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)\n#### Disable Marketplace access\n\nBy default, all users in any Unity Catalog-enabled workspace have the ability to request data products in Databricks Marketplace. In other words, the `account users` group has the `USE MARKETPLACE ASSETS` privilege on all Unity Catalog metastores unless a metastore admin revokes that privilege. 
\nThis privilege does not grant the ability to participate in financial transactions with data providers. All financial transactions take place outside Databricks. This privilege does grant the ability to access data products that are labelled **Free and instantly available** and data products that have already been purchased. Accessing such data products creates new read-only catalogs in Databricks that are owned by the requestor, who can grant read-only access to other users. \nA metastore admin can disable all users\u2019 ability to request data products in Databricks Marketplace by revoking the `USE MARKETPLACE ASSETS` privilege from the `account users` group on the Unity Catalog metastore. If you do revoke this privilege, users can continue to browse the Databricks Marketplace in their workspace but cannot request data products. \n**Permission required**: Metastore admin \nNote \nIf your workspace was enabled for Unity Catalog automatically, you might not have a metastore admin. For more information, see [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement). \nTo revoke the `USE MARKETPLACE ASSETS` privilege for the `account users` group: \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Click the link icon next to the metastore name at the top left.\n3. Find the row that grants `account users` the `USE MARKETPLACE ASSETS` privilege.\n4. Click the checkbox next to the row and click the **Revoke** button.\n5. Confirm the revoke action.\n6. Grant the privilege to any specific users and groups you like by clicking the **Grant** button. \nTo revoke the `USE MARKETPLACE ASSETS` privilege for the `account users` group, run the following command in a notebook or the Databricks SQL query editor. 
\n```\nREVOKE USE MARKETPLACE ASSETS ON METASTORE FROM `account users`;\n\n``` \nTo grant the `USE MARKETPLACE ASSETS` privilege to a specific user or group, run the following command in a notebook or the Databricks SQL query editor. \n```\nGRANT USE MARKETPLACE ASSETS ON METASTORE TO `<user-or-group>`;\n\n``` \nIf you don\u2019t want your users to be able to view the Marketplace home page at all, contact your Databricks account team.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html"} +{"content":"# What is Databricks Marketplace?\n### Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)\n#### Known issues\n\nIf a request for access is rejected by the data provider, you cannot request the same data product again. If you encounter this issue, contact your provider or Databricks account team.\n\n### Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)\n#### Share your feedback\n\nWe\u2019d love to hear what you think about Databricks Marketplace. \n* Have feedback? Go to **Learn > Provide feedback** on the Marketplace home page.\n* Want to see additional datasets in the marketplace? Click **Suggest a product** on the Marketplace home page banner.\n\n### Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)\n#### Next steps\n\n* [Manage Marketplace requests and data that has been shared with you](https:\/\/docs.databricks.com\/marketplace\/manage-requests-consumer.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html"} +{"content":"# Data governance with Unity Catalog\n## What is Catalog Explorer?\n#### Explore models\n\nIn Catalog Explorer you can view model schema details, preview sample data, model type, model location, model history, frequent queries and users, and other details. 
\nFor information about using Catalog Explorer to set model ownership and permissions, see [Manage Unity Catalog object ownership](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/ownership.html) and [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html).\n\n#### Explore models\n##### View model information\n\n1. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** in the sidebar.\n2. Select a compute resource from the drop-down list at the top right.\n3. In the Catalog Explorer tree at the left, open a catalog and select a schema.\n4. In the right side of the screen, click the **Models** tab.\nYou can filter models by typing text in the **Filter models** field. \n![Filter models](https:\/\/docs.databricks.com\/_images\/filter-models.png)\n5. Click a model to see more information. The model details page shows a list of model versions with additional information. From this page you can [set model aliases](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#uc-model-aliases). \n![model details page](https:\/\/docs.databricks.com\/_images\/registered-model.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/catalog-explorer\/explore-models.html"} +{"content":"# Data governance with Unity Catalog\n## What is Catalog Explorer?\n#### Explore models\n##### View model version information and model lineage\n\nTo view more information about a model version, click its name in the list of models. Details of the model version appear, including a link to the MLflow run that created the model. \n![model version page](https:\/\/docs.databricks.com\/_images\/model-version.png) \nFrom this page, you can view the lineage of the model as follows: \n1. Select the **Lineage** tab. The left sidebar shows components that were logged with the model. 
\n![Lineage tab on model page in Catalog Explorer](https:\/\/docs.databricks.com\/_images\/model-page-lineage-tab.png)\n2. Click **See lineage graph**. The lineage graph appears. For details about exploring the lineage graph, see [Capture and explore lineage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html#capture-and-explore-lineage). \n![lineage screen](https:\/\/docs.databricks.com\/_images\/lineage-graph.png)\n3. To close the lineage graph, click ![close button for lineage graph](https:\/\/docs.databricks.com\/_images\/close-lineage-graph.png) in the upper-right corner.\n\n#### Explore models\n##### Serve a model behind an endpoint\n\nFrom the model details page, click ![Catalog Explorer serve model button](https:\/\/docs.databricks.com\/_images\/serve-model-button.png) to serve the model behind a [model serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/catalog-explorer\/explore-models.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Use query-based parameters\n\nThe article guides you through the steps to create an interactive dashboard that uses query-based parameters. It assumes a basic familiarity with building dashboards on Databricks. See [Get started](https:\/\/docs.databricks.com\/dashboards\/tutorials\/index.html#get-started) for foundational instruction on creating dashboards.\n\n##### Use query-based parameters\n###### Requirements\n\n* You are logged into a Databricks workspace.\n* You have the SQL entitlement in that workspace.\n* You have at least CAN USE access to one or more SQL warehouses.\n\n##### Use query-based parameters\n###### Create a dashboard dataset\n\nThis tutorial uses generated data from the **samples** catalog on Databricks. \n1. 
Click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar and select **Dashboard** from the menu.\n2. Click the **Data** tab.\n3. Click **Create from SQL** and paste the following query into the editor. Then click **Run** to return the results. \n```\nSELECT\n*\nFROM\nsamples.tpch.customer\n\n```\n4. Your newly defined dataset is autosaved with the name **Untitled Dataset**. Double-click the title then rename it **Marketing segment**.\n\n##### Use query-based parameters\n###### Add a parameter\n\nYou can add a parameter to this dataset to filter the returned values. The parameter in this example is `:segment`. See [What are dashboard parameters?](https:\/\/docs.databricks.com\/dashboards\/parameters.html) to learn more about parameter syntax. \n1. Paste the following `WHERE` clause at the bottom of your query. A text field with the parameter name `segment` appears below your query. \n```\nWHERE\nc_mktsegment = :segment\n\n```\n2. Type `BUILDING` into the text field below your query to set the default value for the parameter.\n3. Rerun the query to inspect the results.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/query-based-params.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Use query-based parameters\n###### Configure a visualization widget\n\nAdd a visualization for your dataset on the canvas by completing the following steps: \n1. Click the **Canvas** tab.\n2. Click ![Create Icon](https:\/\/docs.databricks.com\/_images\/lakeview-create.png) **Add a visualization** to add a visualization widget and use your mouse to place it in the canvas. \n### Setup the X-axis \n1. If necessary, select **Bar** from the **Visualization** dropdown menu.\n2. Click ![add field icon](https:\/\/docs.databricks.com\/_images\/lakeview-add-vis-field.png) to choose the data presented along the **X-axis**. 
You can use the search bar to search for a field by name. Select **c\\_nationkey**.\n3. Click the field name you selected to view additional configuration options. \n* As the **Scale Type**, select **Categorical**.\n* For the **Transform** selection, choose **None**. \n### Setup the Y-axis \n1. Click ![add field icon](https:\/\/docs.databricks.com\/_images\/lakeview-add-vis-field.png) next to the **Y-axis**, then select **c\\_acctbal**.\n2. Click the field name you selected to view additional configuration options. \n* As the **Scale Type**, select **Quantitative**.\n* For the **Transform** selection, choose **SUM**. \nThe visualization is automatically updated as you configure it. The data shown includes only records where the `segment` is `BUILDING`. \n![Visualization widget configured as described in previous steps.](https:\/\/docs.databricks.com\/_images\/segment-vis.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/query-based-params.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Use query-based parameters\n###### Add a filter\n\nSet up a filter so that dashboard viewers can control which marketing segment to focus on. \n1. Click ![Filter Icon](https:\/\/docs.databricks.com\/_images\/lakeview-filter.png) **Add a filter (field\/parameter)** to add a filter widget. Place it on the canvas.\n2. From the **Filter** drop-down menu in the configuration panel, select **Single value**.\n3. Select the **Title** checkbox to show a title field on your filter widget.\n4. Click the placeholder title on the widget and type **Segment** to retitle your filter.\n5. Click ![add field icon](https:\/\/docs.databricks.com\/_images\/lakeview-add-vis-field.png) next to **Parameters** in the configuration panel.\n6. Choose **segment** from the **Marketing segment** dataset. \nYour configured filter widget shows the default parameter value for the dataset. 
\n![Filter widget configured with a parameter, as described.](https:\/\/docs.databricks.com\/_images\/query-based-param.png)\n\n##### Use query-based parameters\n###### Define a selection of values\n\nThe filter you created is functional, but it requires the viewer to know the available range of choices before they can type a selection. It also requires that users match the case and spelling when entering the desired parameter value. \nTo create a drop-down list so that the viewer can select a parameter from a list of available options, create a new dataset to define the list of possible values. \n1. Click the **Data** tab.\n2. Click **Create from SQL** to create a new dataset.\n3. Copy and paste the following into the editor: \n```\nSELECT\nDISTINCT c_mktsegment\nFROM\nsamples.tpch.customer\n\n```\n4. Run your query and inspect the results. The five marketing segments from the table appear in the results.\n5. Double-click the automatically generated title, then rename this dataset **Segment choice**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/query-based-params.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Use query-based parameters\n###### Update the filter\n\nUpdate your existing filter to use the dataset you just created to populate a drop-down list of values users can select from. \n1. Click **Canvas**. Then, click the filter widget you created in a previous step.\n2. Click ![add field icon](https:\/\/docs.databricks.com\/_images\/lakeview-add-vis-field.png) next to **Fields**.\n3. Click **Segment choice**, then click the field name `c_mktsegment`. \nYour filter widget updates as you change the configuration. Click the field in the filter widget to see the available choices in the drop-down menu. \nNote \nThis tutorial contains a simplified use case meant to demonstrate how to use query-based parameters. 
An alternate approach to creating this dashboard is to apply a filter to the `c_mktsegment` field. \n![Filter widget configured with a field, as described.](https:\/\/docs.databricks.com\/_images\/query-based-param-field.png)\n\n##### Use query-based parameters\n###### Next steps\n\nKeep learning about how to work with dashboards with the following articles: \n* Learn more about applying filters. See [Filters](https:\/\/docs.databricks.com\/dashboards\/index.html#filters).\n* Learn more about dashboard parameters. See [What are dashboard parameters?](https:\/\/docs.databricks.com\/dashboards\/parameters.html).\n* Publish and share your dashboard. See [Publish a dashboard](https:\/\/docs.databricks.com\/dashboards\/index.html#publish-a-dashboard).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/query-based-params.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Deploy Python code with Model Serving\n\nThis article describes how to deploy Python code with [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html). \nMLflow\u2019s Python function, `pyfunc`, provides flexibility to deploy any piece of Python code or any Python model. The following are example scenarios where you might want to use the guide. \n* Your model requires preprocessing before inputs can be passed to the model\u2019s predict function.\n* Your model framework is not natively supported by MLflow.\n* Your application requires the model\u2019s raw outputs to be post-processed for consumption.\n* The model itself has per-request branching logic.\n* You are looking to deploy fully custom code as a model.\n\n#### Deploy Python code with Model Serving\n##### Construct a custom MLflow Python function model\n\nMLflow offers the ability to log Python code with the [custom Python models format](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.pyfunc.html#creating-custom-pyfunc-models). 
\nThere are two required functions when packaging arbitrary Python code with MLflow: \n* `load_context` - anything that needs to be loaded just one time for the model to operate should be defined in this function. This is critical so that the system minimizes the number of artifacts loaded during the `predict` function, which speeds up inference.\n* `predict` - this function houses all the logic that is run every time an input request is made.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/deploy-custom-models.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Deploy Python code with Model Serving\n##### Log your Python function model\n\nEven though you are writing your model with custom code, it is possible to use shared modules of code from your organization. With the `code_path` parameter, authors of models can log full code references that load into the path and are usable from other custom `pyfunc` models. \nFor example, if a model is logged with: \n```\nmlflow.pyfunc.log_model(CustomModel(), \"model\", code_path = [\"preprocessing_utils\/\"])\n\n``` \nCode from the `preprocessing_utils` is available in the loaded context of the model. The following is an example model that uses this code. 
\n```\nimport mlflow\nimport torch\n\nclass CustomModel(mlflow.pyfunc.PythonModel):\n    def load_context(self, context):\n        # Runs once when the model is loaded: load the weights and the shared tokenizer\n        self.model = torch.load(context.artifacts[\"model-weights\"])\n        from preprocessing_utils.my_custom_tokenizer import CustomTokenizer\n        self.tokenizer = CustomTokenizer(context.artifacts[\"tokenizer_cache\"])\n\n    def format_inputs(self, model_input):\n        # insert some code that formats your inputs; this placeholder passes them through unchanged\n        return model_input\n\n    def format_outputs(self, outputs):\n        predictions = (torch.sigmoid(outputs)).data.numpy()\n        return predictions\n\n    def predict(self, context, model_input):\n        # Runs on every request\n        model_input = self.format_inputs(model_input)\n        outputs = self.model.predict(model_input)\n        return self.format_outputs(outputs)\n\n```\n\n#### Deploy Python code with Model Serving\n##### Serve your model\n\nAfter you log your custom `pyfunc` model, you can register it to Unity Catalog or the Workspace Registry and serve your model to a [Model Serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/deploy-custom-models.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Deploy Python code with Model Serving\n##### Notebook example\n\nThe following notebook example demonstrates how to customize model output when the raw output of the queried model needs to be post-processed for consumption. 
\n### Customize model serving output with MLflow PyFunc notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/customize-model-serving-output.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/deploy-custom-models.html"} +{"content":"# Data governance with Unity Catalog\n## Hive metastore table access control (legacy)\n#### Enable Hive metastore table access control on a cluster (legacy)\n\nThis article describes how to enable table access control for the built-in Hive metastore on a cluster. \nFor information about how to set privileges on Hive metastore securable objects once table access control has been enabled on a cluster, see [Hive metastore privileges and securable objects (legacy)](https:\/\/docs.databricks.com\/data-governance\/table-acls\/object-privileges.html). \nNote \nHive metastore table access control is a legacy data governance model. Databricks recommends that you use Unity Catalog instead for its simplicity and account-centered governance model. 
You can [upgrade the tables managed by the Hive metastore to the Unity Catalog metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/table-acls\/table-acl.html"} +{"content":"# Data governance with Unity Catalog\n## Hive metastore table access control (legacy)\n#### Enable Hive metastore table access control on a cluster (legacy)\n##### Enable table access control for a cluster\n\nTable access control is available in two versions: \n* [SQL-only table access control](https:\/\/docs.databricks.com\/data-governance\/table-acls\/table-acl.html#sql-only-table-access-control), which restricts users to SQL commands.\n* [Python and SQL table access control](https:\/\/docs.databricks.com\/data-governance\/table-acls\/table-acl.html#python-and-sql-table-access-control), which allows users to run SQL, Python, and PySpark commands. \nTable access control is not supported with [Machine Learning Runtime](https:\/\/docs.databricks.com\/machine-learning\/index.html). \nImportant \nEven if table access control is enabled for a cluster, Databricks workspace administrators have access to file-level data. \n### SQL-only table access control \nThis version of table access control restricts users to SQL commands only. \nTo enable SQL-only table access control on a cluster and restrict that cluster to use only SQL commands, set\nthe following flag in the cluster\u2019s [Spark conf](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration): \n```\nspark.databricks.acl.sqlOnly true\n\n``` \nNote \nAccess to SQL-only table access control is not affected by the [Enable Table Access Control](https:\/\/docs.databricks.com\/data-governance\/table-acls\/table-acl.html#enable-table-acl-workspace) setting in the admin settings page. That setting controls only the workspace-wide enablement of Python and SQL table access control. 
\n### Python and SQL table access control \nThis version of table access control lets users run Python commands that use the DataFrame API as well as SQL. When\nit is enabled on a cluster, users on that cluster: \n* Can access Spark only using the Spark SQL API or DataFrame API. In both cases, access to tables and views is restricted by administrators according to the Databricks [Privileges you can grant on Hive metastore objects](https:\/\/docs.databricks.com\/data-governance\/table-acls\/object-privileges.html#privilege-types).\n* Must run their commands on cluster nodes as a low-privilege user forbidden from accessing sensitive parts of the filesystem or creating network connections to ports other than 80 and 443. \n+ Only built-in Spark functions can create network connections on ports other than 80 and 443.\n+ Only workspace admin users or users with [ANY FILE](https:\/\/docs.databricks.com\/data-governance\/table-acls\/object-privileges.html#privilege-types) privilege can read data from external databases through the [PySpark JDBC connector](https:\/\/docs.databricks.com\/connect\/external-systems\/jdbc.html).\n+ If you want Python processes to be able to access additional outbound ports, you can set the [Spark config](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration) `spark.databricks.pyspark.iptable.outbound.whitelisted.ports` to the ports you want to allow access. The supported format of the configuration value is `[port[:port][,port[:port]]...]`, for example: `21,22,9000:9999`. The port must be within the valid range, that is, `0-65535`. \nAttempts to get around these restrictions will fail with an exception. 
These restrictions are in place so that users can never access data they are not privileged to access through the cluster.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/table-acls\/table-acl.html"} +{"content":"# Data governance with Unity Catalog\n## Hive metastore table access control (legacy)\n#### Enable Hive metastore table access control on a cluster (legacy)\n##### Enable table access control for your workspace\n\nBefore users can configure Python and SQL table access control, a Databricks workspace admin must enable table access control for the workspace and deny users access to clusters that are not enabled for table access control. \n1. Go to the [settings page](https:\/\/docs.databricks.com\/admin\/index.html#admin-settings).\n2. Click the **Security** tab.\n3. Turn on the **Table Access Control** option. \n### Enforce table access control \nTo ensure that your users access only the data that you want them to, you must restrict your users to clusters with table access control enabled. In particular, you should ensure that: \n* Users do not have permission to create clusters. If they create a cluster without table access control, they can access any data from that cluster.\n* Users do not have CAN ATTACH TO permission for any cluster that is not enabled for table access control. \nSee [Compute permissions](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions) for more information.\n\n#### Enable Hive metastore table access control on a cluster (legacy)\n##### Create a cluster enabled for table access control\n\nTable access control is enabled by default in clusters with [Shared access mode](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode). 
\nTo create the cluster using the REST API, see [Create new cluster](https:\/\/docs.databricks.com\/api\/workspace\/clusters\/create).\n\n#### Enable Hive metastore table access control on a cluster (legacy)\n##### Set privileges on a data object\n\nSee [Hive metastore privileges and securable objects (legacy)](https:\/\/docs.databricks.com\/data-governance\/table-acls\/object-privileges.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/table-acls\/table-acl.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n### Pandas API on Spark\n##### Convert between PySpark and pandas DataFrames\n\nLearn how to convert Apache Spark DataFrames to and from pandas DataFrames using Apache Arrow in Databricks.\n\n##### Convert between PySpark and pandas DataFrames\n###### Apache Arrow and PyArrow\n\n[Apache Arrow](https:\/\/arrow.apache.org\/) is an in-memory columnar data format used in Apache Spark to efficiently transfer data between JVM and Python processes. This is beneficial to Python developers who work with pandas and NumPy data. However, its usage requires some minor configuration or code changes to ensure compatibility and gain the most benefit. \nPyArrow is a Python binding for Apache Arrow and is installed in Databricks Runtime. For information on the version of PyArrow available in each Databricks Runtime version, see the [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html).\n\n##### Convert between PySpark and pandas DataFrames\n###### Supported SQL types\n\nAll Spark SQL data types are supported by Arrow-based conversion except `ArrayType` of `TimestampType`. `MapType` and `ArrayType` of nested `StructType` are only supported when using PyArrow 2.0.0 and above. 
`StructType` is represented as a `pandas.DataFrame` instead of `pandas.Series`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/pandas\/pyspark-pandas-conversion.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n### Pandas API on Spark\n##### Convert between PySpark and pandas DataFrames\n###### Convert PySpark DataFrames to and from pandas DataFrames\n\nArrow is available as an optimization when converting a PySpark DataFrame to a pandas DataFrame with `toPandas()` and when creating a PySpark DataFrame from a pandas DataFrame with `createDataFrame(pandas_df)`. \nTo use Arrow for these methods, set the [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration) `spark.sql.execution.arrow.pyspark.enabled` to `true`. This configuration is enabled by default except for High Concurrency clusters as well as user isolation clusters in workspaces that are Unity Catalog enabled. \nIn addition, optimizations enabled by `spark.sql.execution.arrow.pyspark.enabled` could fall back to a non-Arrow implementation if an error occurs before the computation within Spark. You can control this behavior using the Spark configuration `spark.sql.execution.arrow.pyspark.fallback.enabled`. \n### Example \n```\nimport numpy as np\nimport pandas as pd\n\n# Enable Arrow-based columnar data transfers\nspark.conf.set(\"spark.sql.execution.arrow.pyspark.enabled\", \"true\")\n\n# Generate a pandas DataFrame\npdf = pd.DataFrame(np.random.rand(100, 3))\n\n# Create a Spark DataFrame from a pandas DataFrame using Arrow\ndf = spark.createDataFrame(pdf)\n\n# Convert the Spark DataFrame back to a pandas DataFrame using Arrow\nresult_pdf = df.select(\"*\").toPandas()\n\n``` \nUsing the Arrow optimizations produces the same results as when Arrow is not enabled. Even with Arrow, `toPandas()` results in the collection of all records in the DataFrame to the driver program and should be done on a small subset of the data. 
\nIn addition, not all Spark data types are supported and an error can be raised if a column has an unsupported type. If an error occurs during `createDataFrame()`, Spark creates the DataFrame without Arrow.\n\n","doc_uri":"https:\/\/docs.databricks.com\/pandas\/pyspark-pandas-conversion.html"} +{"content":"# \n### Directory structure\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \nThis document explains the directory structure of the RAG Application. This directory is a unified code base that works in both development and production environments \u2013 just like you would expect from a typical full-stack software application\u2019s code base. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html) \n```\nrag-app\/\n\u2502\n\u251c\u2500\u2500 config\/\n\u2502 \u2502\n\u2502 \u2514\u2500\u2500 rag_studio.yml <- Main configuration file for RAG Studio.\n|\n\u251c\u2500\u2500 resources\/ <- Internal resources used by RAG Studio - do not modify.\n\u2502\n\u2514\u2500\u2500 src\/ <- Source code directory for the application.\n\u2502\n\u251c\u2500\u2500 my_rag_builder\/ <- Store all application code here.\n\u2502 \u2514\u2500\u2500 chain.py <- Code for the chain\n\u2502 \u2514\u2500\u2500 document_processor.py <- Code for the data processor\n\u2502\n|\u2500\u2500 notebooks\/ <- Internal RAG Studio notebooks for running workflows - do not modify.\n\u2502 \u2514\u2500\u2500 ingest_data.py <- Code for the data ingestor - this is the only file in this folder you can modify.\n|\n\u2514\u2500\u2500 review\/ <- Configuration for the Review UI\n\u2502 \u2514\u2500\u2500 instructions.md <- Instructions shown to the end user in the Review UI\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/directory-structure.html"} +{"content":"# Databricks data 
engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Load and process data incrementally with Delta Live Tables flows\n\nThis article explains what flows are and how you can use flows in Delta Live Tables pipelines to incrementally process data from a source to a target streaming table. In Delta Live Tables, flows are defined in two ways: \n1. A flow is defined automatically when you create a query that updates a streaming table.\n2. Delta Live Tables also provides functionality to explicitly define flows for more complex processing such as appending to a streaming table from multiple streaming sources. \nThis article discusses the implicit flows that are created when you define a query to update a streaming table, and then provides details on the syntax to define more complex flows.\n\n##### Load and process data incrementally with Delta Live Tables flows\n###### What is a flow?\n\nIn Delta Live Tables, a *flow* is a streaming query that processes source data incrementally to update a target streaming table. Most Delta Live Tables datasets you create in a pipeline define the flow as part of the query and do not require explicitly defining the flow. For example, you create a streaming table in Delta Live Tables in a single DDL command instead of using separate table and flow statements to create the streaming table: \nNote \nThis `CREATE FLOW` example is provided for illustrative purposes only and includes keywords that are not valid Delta Live Tables syntax. \n```\nCREATE STREAMING TABLE raw_data\nAS SELECT * FROM source_data(\"\/path\/to\/source\/data\")\n\n-- The above query is equivalent to the following statements:\nCREATE STREAMING TABLE raw_data;\n\nCREATE FLOW raw_data\nAS INSERT INTO raw_data BY NAME\nSELECT * FROM source_data(\"\/path\/to\/source\/data\");\n\n``` \nIn addition to the default flow defined by a query, the Delta Live Tables Python and SQL interfaces provide *append flow* functionality. 
Append flow supports processing that requires reading data from multiple streaming sources to update a single streaming table. For example, you can use append flow functionality when you have an existing streaming table and flow and want to add a new streaming source that writes to this existing streaming table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/flows.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Load and process data incrementally with Delta Live Tables flows\n###### Use append flow to write to a streaming table from multiple source streams\n\nNote \nTo use append flow processing, your pipeline must be configured to use the [preview channel](https:\/\/docs.databricks.com\/delta-live-tables\/properties.html#config-settings). \nUse the `@append_flow` decorator in the Python interface or the `CREATE FLOW` clause in the SQL interface to write to a streaming table from multiple streaming sources. Use append flow for processing tasks such as the following: \n* Add streaming sources that append data to an existing streaming table without requiring a full refresh. For example, you might have a table combining regional data from every region you operate in. As new regions are rolled out, you can add the new region data to the table without performing a full refresh. See [Example: Write to a streaming table from multiple Kafka topics](https:\/\/docs.databricks.com\/delta-live-tables\/flows.html#multiple-sources).\n* Update a streaming table by appending missing historical data (backfilling). For example, you have an existing streaming table that is written to by an Apache Kafka topic. You also have historical data stored in a table that you need inserted exactly once into the streaming table, and you cannot stream the data because your processing includes performing a complex aggregation before inserting the data. 
See [Example: Run a one-time data backfill](https:\/\/docs.databricks.com\/delta-live-tables\/flows.html#backfill).\n* Combine data from multiple sources and write to a single streaming table instead of using the `UNION` clause in a query. Using append flow processing instead of `UNION` allows you to update the target table incrementally without running a [full refresh update](https:\/\/docs.databricks.com\/delta-live-tables\/updates.html#how-dlt-updates). See [Example: Use append flow processing instead of UNION](https:\/\/docs.databricks.com\/delta-live-tables\/flows.html#replace-union). \nThe target for the records output by the append flow processing can be an existing table or a new table. For Python queries, use the [create\\_streaming\\_table()](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html#create-target-fn) function to create a target table. \nImportant \n* If you need to define data quality constraints with [expectations](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html), define the expectations on the target table as part of the `create_streaming_table()` function or on an existing table definition. You cannot define expectations in the `@append_flow` definition.\n* Flows are identified by a *flow name*, and this name is used to identify streaming checkpoints. The use of the flow name to identify the checkpoint means the following: \n+ If an existing flow in a pipeline is renamed, the checkpoint does not carry over, and the renamed flow is effectively an entirely new flow.\n+ You cannot reuse a flow name in a pipeline, because the existing checkpoint won\u2019t match the new flow definition. 
\nThe following is the syntax for `@append_flow`: \n```\nimport dlt\n\ndlt.create_streaming_table(\"<target-table-name>\") # Required only if the target table doesn't exist.\n\n@dlt.append_flow(\n  target = \"<target-table-name>\",\n  name = \"<flow-name>\", # optional, defaults to function name\n  spark_conf = {\"<key>\" : \"<value>\", \"<key>\" : \"<value>\"}, # optional\n  comment = \"<comment>\") # optional\ndef <function-name>():\n  return (<streaming query>)\n\n``` \n```\nCREATE OR REFRESH STREAMING TABLE append_target; -- Required only if the target table doesn't exist.\n\nCREATE FLOW\nflow_name\nAS INSERT INTO\ntarget_table BY NAME\nSELECT * FROM\nsource;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/flows.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Load and process data incrementally with Delta Live Tables flows\n###### Example: Write to a streaming table from multiple Kafka topics\n\nThe following example creates a streaming table named `kafka_target` and writes to that streaming table from two Kafka topics: \n```\nimport dlt\n\ndlt.create_streaming_table(\"kafka_target\")\n\n# Kafka stream from multiple topics\n@dlt.append_flow(target = \"kafka_target\")\ndef topic1():\n  return (\n    spark.readStream\n      .format(\"kafka\")\n      .option(\"kafka.bootstrap.servers\", \"host1:port1,...\")\n      .option(\"subscribe\", \"topic1\")\n      .load()\n  )\n\n@dlt.append_flow(target = \"kafka_target\")\ndef topic2():\n  return (\n    spark.readStream\n      .format(\"kafka\")\n      .option(\"kafka.bootstrap.servers\", \"host1:port1,...\")\n      .option(\"subscribe\", \"topic2\")\n      .load()\n  )\n\n``` \n```\nCREATE OR REFRESH STREAMING TABLE kafka_target;\n\nCREATE FLOW\ntopic1\nAS INSERT INTO\nkafka_target BY NAME\nSELECT * 
FROM\nread_kafka(bootstrapServers => 'host1:port1,...', subscribe => 'topic2');\n\n``` \nTo learn more about the `read_kafka()` table-valued function used in the SQL queries, see [read\\_kafka](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/read_kafka.html) in the SQL language reference.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/flows.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Load and process data incrementally with Delta Live Tables flows\n###### Example: Run a one-time data backfill\n\nThe following examples run a query to append historical data to a streaming table: \nNote \nTo ensure a true one-time backfill when the backfill query is part of a pipeline that runs on a scheduled basis or continuously, remove the query after running the pipeline once. To append new data if it arrives in the backfill directory, leave the query in place. \n```\nimport dlt\n\n@dlt.table()\ndef csv_target():\n  # Parentheses let the chained reader calls span multiple lines.\n  return (\n    spark.readStream\n      .format(\"cloudFiles\")\n      .option(\"cloudFiles.format\",\"csv\")\n      .load(\"path\/to\/sourceDir\")\n  )\n\n@dlt.append_flow(target = \"csv_target\")\ndef backfill():\n  return (\n    spark.readStream\n      .format(\"cloudFiles\")\n      .option(\"cloudFiles.format\",\"csv\")\n      .load(\"path\/to\/backfill\/data\/dir\")\n  )\n\n``` \n```\nCREATE OR REFRESH STREAMING TABLE csv_target\nAS SELECT * FROM\ncloud_files(\n\"path\/to\/sourceDir\",\n\"csv\"\n);\n\nCREATE FLOW\nbackfill\nAS INSERT INTO\ncsv_target BY NAME\nSELECT * FROM\ncloud_files(\n\"path\/to\/backfill\/data\/dir\",\n\"csv\"\n);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/flows.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Load and process data incrementally with Delta Live Tables flows\n###### Example: Use append flow processing instead of `UNION`\n\nInstead of using a 
query with a `UNION` clause, you can use append flow queries to combine multiple sources and write to a single streaming table. Using append flow queries instead of `UNION` allows you to append to a streaming table from multiple sources without running a [full refresh](https:\/\/docs.databricks.com\/delta-live-tables\/updates.html#how-dlt-updates). \nThe following Python example includes a query that combines multiple data sources with a `UNION` clause: \n```\nimport dlt\n\n@dlt.create_table(name=\"raw_orders\")\ndef unioned_raw_orders():\n  # Parentheses let each chained reader expression span multiple lines.\n  raw_orders_us = (\n    spark.readStream\n      .format(\"cloudFiles\")\n      .option(\"cloudFiles.format\", \"csv\")\n      .load(\"\/path\/to\/orders\/us\")\n  )\n\n  raw_orders_eu = (\n    spark.readStream\n      .format(\"cloudFiles\")\n      .option(\"cloudFiles.format\", \"csv\")\n      .load(\"\/path\/to\/orders\/eu\")\n  )\n\n  return raw_orders_us.union(raw_orders_eu)\n\n``` \nThe following examples replace the `UNION` query with append flow queries: \n```\nimport dlt\n\ndlt.create_streaming_table(\"raw_orders\")\n\n@dlt.append_flow(target=\"raw_orders\")\ndef raw_orders_us():\n  return (\n    spark.readStream\n      .format(\"cloudFiles\")\n      .option(\"cloudFiles.format\", \"csv\")\n      .load(\"\/path\/to\/orders\/us\")\n  )\n\n@dlt.append_flow(target=\"raw_orders\")\ndef raw_orders_eu():\n  return (\n    spark.readStream\n      .format(\"cloudFiles\")\n      .option(\"cloudFiles.format\", \"csv\")\n      .load(\"\/path\/to\/orders\/eu\")\n  )\n\n# Additional flows can be added without the full refresh that a UNION query would require:\n@dlt.append_flow(target=\"raw_orders\")\ndef raw_orders_apac():\n  return (\n    spark.readStream\n      .format(\"cloudFiles\")\n      .option(\"cloudFiles.format\", \"csv\")\n      .load(\"\/path\/to\/orders\/apac\")\n  )\n\n``` \n```\nCREATE OR REFRESH STREAMING TABLE raw_orders;\n\nCREATE FLOW\nraw_orders_us\nAS INSERT INTO\nraw_orders BY NAME\nSELECT * FROM\ncloud_files(\n\"\/path\/to\/orders\/us\",\n\"csv\"\n);\n\nCREATE FLOW\nraw_orders_eu\nAS INSERT INTO\nraw_orders BY NAME\nSELECT * FROM\ncloud_files(\n\"\/path\/to\/orders\/eu\",\n\"csv\"\n);\n\n-- 
Additional flows can be added without the full refresh that a UNION query would require:\nCREATE FLOW\nraw_orders_apac\nAS INSERT INTO\nraw_orders BY NAME\nSELECT * FROM\ncloud_files(\n\"\/path\/to\/orders\/apac\",\n\"csv\"\n);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/flows.html"} +{"content":"# Compute\n## What is a SQL warehouse?\n#### SQL warehouse sizing, scaling, and queuing behavior\n\nThis article explains the cluster sizing, queuing, and autoscaling behavior of SQL warehouses.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/sql-warehouse\/warehouse-behavior.html"} +{"content":"# Compute\n## What is a SQL warehouse?\n#### SQL warehouse sizing, scaling, and queuing behavior\n##### Sizing a serverless SQL warehouse\n\nAlways start with a larger t-shirt size for your serverless SQL warehouse than you think you will need and size down as you test. Don\u2019t start with a small t-shirt size for your serverless SQL warehouse and go up. In general, start with a single serverless SQL warehouse and rely on Databricks to right-size with serverless clusters, prioritizing workloads, and fast data reads. See [Serverless autoscaling and query queuing](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/warehouse-behavior.html#serverless-autoscaling). \n* To decrease query latency for a given serverless SQL warehouse: \n+ If queries are spilling to disk, increase the t-shirt size.\n+ If the queries are highly parallelizable, increase the t-shirt size.\n+ If you are running multiple queries at a time, add more clusters for autoscaling.\n* To reduce costs, try to step down in t-shirt size without spilling to disk or significantly increasing latency.\n* To help right-size your serverless SQL warehouse, use the following tools: \n+ Monitoring page: look at the peak query count. If the peak queued is commonly above one, add clusters. The maximum number of queries in a queue for all SQL warehouse types is 1000. 
See [Monitor a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/monitor.html).\n+ Query history. See [Query history](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html).\n+ Query profiles (look for **Bytes spilled to disk** above 1). See [Query profile](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-profile.html). \nNote \nFor serverless SQL warehouses, the cluster sizes may in some cases use different instance types than the ones listed in the documentation for pro and classic SQL warehouses for an equivalent cluster size. In general, the price\/performance ratio of the cluster sizes for serverless SQL warehouses is similar to those for pro and classic SQL warehouses.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/sql-warehouse\/warehouse-behavior.html"} +{"content":"# Compute\n## What is a SQL warehouse?\n#### SQL warehouse sizing, scaling, and queuing behavior\n##### Serverless autoscaling and query queuing\n\nIntelligent Workload Management (IWM) is a set of features that enhances the ability of serverless SQL warehouses to process large numbers of queries quickly and cost-effectively. Using AI-powered prediction capabilities to analyze incoming queries and determine the fastest and most efficient way to run them (Predictive IO), IWM works to ensure that workloads have the right amount of resources quickly. The key difference lies in the AI capabilities in Databricks SQL to respond dynamically to workload demands rather than using static thresholds. \nThis responsiveness ensures: \n* Rapid upscaling to acquire more compute when needed for maintaining low latency.\n* Query admittance closer to the hardware\u2019s limitation.\n* Quick downscaling to minimize costs when demand is low, providing consistent performance with optimized costs and resources. \nWhen a query arrives at the warehouse, IWM predicts the cost of the query. At the same time, IWM monitors the available compute capacity of the warehouse in real time. 
Next, using machine learning models, IWM predicts whether the existing compute has the capacity that the incoming query needs. If it doesn\u2019t have the compute needed, then the query is added to the queue. If it does have the compute needed, the query begins executing immediately. \nIWM monitors the queue approximately every 10 seconds. If the queue is not decreasing quickly enough, autoscaling kicks in to rapidly procure more compute. Once new capacity is added, queued queries are admitted to the new clusters. With serverless SQL warehouses, new clusters can be added rapidly, and more than one cluster at a time can be created. The maximum number of queries in a queue for all SQL warehouse types is 1000.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/sql-warehouse\/warehouse-behavior.html"} +{"content":"# Compute\n## What is a SQL warehouse?\n#### SQL warehouse sizing, scaling, and queuing behavior\n##### Cluster sizes for pro and classic SQL warehouses\n\nThe table in this section maps SQL warehouse cluster sizes to Databricks cluster driver size and worker counts. The driver size only applies to pro and classic SQL warehouses. \n| Cluster size | Instance type for driver (applies only to pro and classic SQL warehouses) | Worker count |\n| --- | --- | --- |\n| 2X-Small | i3.2xlarge | 1 x i3.2xlarge |\n| X-Small | i3.2xlarge | 2 x i3.2xlarge |\n| Small | i3.4xlarge | 4 x i3.2xlarge |\n| Medium | i3.8xlarge | 8 x i3.2xlarge |\n| Large | i3.8xlarge | 16 x i3.2xlarge |\n| X-Large | i3.16xlarge | 32 x i3.2xlarge |\n| 2X-Large | i3.16xlarge | 64 x i3.2xlarge |\n| 3X-Large | i3.16xlarge | 128 x i3.2xlarge |\n| 4X-Large | i3.16xlarge | 256 x i3.2xlarge | \nThe instance size of all workers is i3.2xlarge. If your workspace has the [compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html) enabled, warehouses will use `i3en` instance types instead of `i3`. 
\nNote \nThe information in this table can vary based on product or region availability and workspace type.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/sql-warehouse\/warehouse-behavior.html"} +{"content":"# Compute\n## What is a SQL warehouse?\n#### SQL warehouse sizing, scaling, and queuing behavior\n##### Availability zones (AZ) for pro and classic SQL warehouses\n\nFor SQL warehouses, AWS availability zones are set to **auto** (Auto-AZ), where the AZ is automatically selected based on available IPs in the workspace subnets. Auto-AZ retries in other availability zones if AWS returns insufficient capacity errors. For more about availability zones, see the [AWS documentation](https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/using-regions-availability-zones.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/sql-warehouse\/warehouse-behavior.html"} +{"content":"# Compute\n## What is a SQL warehouse?\n#### SQL warehouse sizing, scaling, and queuing behavior\n##### Queueing and autoscaling for pro and classic SQL warehouses\n\nDatabricks limits the number of queries on a cluster assigned to a SQL warehouse based on the cost to compute their results. Upscaling of clusters per warehouse is based on query throughput, the rate of incoming queries, and the queue size. Databricks recommends a cluster for every 10 concurrent queries. The maximum number of queries in a queue for all SQL warehouse types is 1000. \nDatabricks adds clusters based on the time it would take to process all currently running queries, all queued queries, and the incoming queries expected in the next two minutes. \n* If less than 2 minutes, don\u2019t upscale.\n* If 2 to 6 minutes, add 1 cluster.\n* If 6 to 12 minutes, add 2 clusters.\n* If 12 to 22 minutes, add 3 clusters. \nOtherwise, Databricks adds 3 clusters plus 1 cluster for every additional 15 minutes of expected query load. 
\nIn addition, a warehouse is always upscaled if a query waits for 5 minutes in the queue. \nIf the load is low for 15 minutes, Databricks downscales the SQL warehouse. It keeps enough clusters to handle the peak load over the last 15 minutes. For example, if the peak load was 25 concurrent queries, Databricks keeps 3 clusters. \n### Query queuing for pro and classic SQL warehouses \nDatabricks queues queries when all clusters assigned to the warehouse are executing queries at full capacity or when the warehouse is in the `STARTING` state. The maximum number of queries in a queue for all SQL warehouse types is 1000. \nMetadata queries (for example, `DESCRIBE <table>`) and state modifying queries (for example `SET`) are never queued, unless the warehouse is in the `STARTING` state.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/sql-warehouse\/warehouse-behavior.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n###### Compliance security profile\n\nThis article describes the compliance security profile and its compliance controls.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n###### Compliance security profile\n####### Compliance security profile overview\n\nThe compliance security profile enables additional monitoring, enforced instance types for inter-node encryption, a hardened compute image, and other features and controls on Databricks workspaces. 
Enabling the compliance security profile is required to use Databricks to process data that is regulated under the following compliance standards: \n* [HIPAA](https:\/\/docs.databricks.com\/security\/privacy\/hipaa.html)\n* [Infosec Registered Assessors Program (IRAP)](https:\/\/docs.databricks.com\/security\/privacy\/irap.html)\n* [PCI-DSS](https:\/\/docs.databricks.com\/security\/privacy\/pci.html)\n* [FedRAMP High](https:\/\/docs.databricks.com\/security\/privacy\/gov-cloud.html)\n* [FedRAMP Moderate](https:\/\/docs.databricks.com\/security\/privacy\/fedramp.html) \nYou can also choose to enable the compliance security profile for its enhanced security features without the need to conform to a compliance standard. \nImportant \n* You are solely responsible for ensuring your own compliance with all applicable laws and regulations.\n* You are solely responsible for ensuring that the compliance security profile and the appropriate compliance standards are configured before processing regulated data.\n* If you add [HIPAA](https:\/\/docs.databricks.com\/security\/privacy\/hipaa.html), it is your responsibility before you process PHI data to have a BAA agreement with Databricks. \n### Which compute resources get enhanced security \nThe compliance security profile enhancements apply to compute resources in the [classic compute plane](https:\/\/docs.databricks.com\/getting-started\/overview.html) in all regions. \nServerless SQL warehouse support for the compliance security profile varies by region. See [Serverless SQL warehouses support the compliance security profile in some regions](https:\/\/docs.databricks.com\/admin\/sql\/serverless.html#security-profile). \n### Compliance security profile features and technical controls \nSecurity enhancements include: \n* An enhanced hardened operating system image based on [Ubuntu Advantage](https:\/\/ubuntu.com\/advantage). 
\nUbuntu Advantage is a package of enterprise security and support for open source infrastructure and applications that includes the following: \n+ A [CIS Level 1](https:\/\/www.cisecurity.org\/cis-hardened-images) hardened image.\n+ [FIPS 140-2 Level 1](https:\/\/csrc.nist.gov\/publications\/detail\/fips\/140\/2\/final) validated encryption modules.\n* Automatic cluster update is automatically enabled. \nClusters are restarted to get the latest updates periodically during a maintenance window that you can configure. See [Automatic cluster update](https:\/\/docs.databricks.com\/admin\/clusters\/automatic-cluster-update.html).\n* Enhanced security monitoring is automatically enabled. \nSecurity monitoring agents generate logs that you can review. For more information on the monitoring agents, see [Monitoring agents in Databricks compute plane images](https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-monitoring.html#monitors).\n* Enforced use of [AWS Nitro](https:\/\/aws.amazon.com\/ec2\/nitro\/) instance types in clusters and Databricks SQL warehouses.\n* Communications for egress use TLS 1.2 or higher, including connecting to the metastore.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n###### Compliance security profile\n####### Requirements\n\n* Your Databricks account must include the Enhanced Security and Compliance add-on. For details, see the [pricing page](https:\/\/databricks.com\/product\/aws-pricing).\n* Your Databricks workspace is on the [Enterprise pricing tier](https:\/\/databricks.com\/product\/aws-pricing).\n* [Single sign-on (SSO)](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/single-sign-on\/index.html) authentication is configured for the workspace.\n* Your Databricks workspace\u2019s root S3 bucket cannot have a period character (`.`) in its name, such as `my-bucket-1.0`. 
If an existing workspace\u2019s root S3 bucket has a period character in the name, contact your Databricks account team before enabling the compliance security profile.\n* Instance types are limited to those that provide both hardware-implemented network encryption between cluster nodes and encryption at rest for local disks. The supported instance types are: \n+ **General purpose:** `M-fleet`, `Md-fleet`, `M5dn`, `M5n`, `M5zn`, `M7g`, `M7gd`, `M6i`, `M7i`, `M6id`, `M6in`, `M6idn`, `M6a`, `M7a`\n+ **Compute optimized:** `C5a`, `C5ad`, `C5n`, `C6gn`, `C7g`, `C7gd`, `C7gn`, `C6i`, `C6id`, `C7i`, `C6in`, `C6a`, `C7a`\n+ **Memory optimized:** `R-fleet`, `Rd-fleet`, `R7g`, `R7gd`, `R6i`, `R7i`, `R7iz`, `R6id`, `R6in`, `R6idn`, `R6a`, `R7a`\n+ **Storage optimized:** `D3`, `D3en`, `P3dn`, `R5dn`, `R5n`, `I4i`, `I4g`, `I3en`, `Im4gn`, `Is4gen`\n+ **Accelerated computing:** `G4dn`, `G5`, `P4d`, `P4de`, `P5` \nNote \nFleet instances are not available in AWS Gov Cloud.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n###### Compliance security profile\n####### Step 1: Prepare a workspace for the compliance security profile\n\nFollow these steps when you create a new workspace with the security profile enabled or enable it on an existing workspace. \n1. Check your workspace for long-running clusters before you enable the compliance security profile. When you enable the compliance security profile, long-running clusters are automatically restarted during the configured frequency and window of automatic cluster update. See [Automatic cluster update](https:\/\/docs.databricks.com\/admin\/clusters\/automatic-cluster-update.html).\n2. Ensure that single sign-on (SSO) authentication is configured. See [SSO in your Databricks account console](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/single-sign-on\/index.html).\n3. 
Add required network ports. The required network ports depend on whether [PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html) back-end connection for private connectivity for the classic compute plane is enabled or not. \n* **PrivateLink back-end connectivity enabled**: You must update your network security group to allow bidirectional access to port 2443 for FIPS encryption connections. For more information, see [Step 1: Configure AWS network objects](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#create-vpc).\n* **No PrivateLink back-end connectivity**: You must update your network security group to allow outbound access to port 2443 to support FIPS encryption endpoints. See [Security groups](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#security-groups).\n4. If your workspace is in the US East, the US West, or the Canada (Central) region, and is configured to restrict outbound network access, you must allow traffic to additional endpoints to support FIPS endpoints for the S3 service. This applies to the S3 service but not to STS and Kinesis endpoints. AWS does not yet provide FIPS endpoints for STS and Kinesis. \n* For S3, allow outgoing traffic to the endpoint `s3.<region>.amazonaws.com` and `s3-fips.<region>.amazonaws.com`. For example `s3.us-east-1.amazonaws.com` and `s3-fips.us-east-1.amazonaws.com`.\n5. Run the following tests to verify that the changes were correctly applied: \n1. Launch a Databricks cluster with 1 driver and 1 worker, any DBR version, and any instance type.\n2. Create a notebook attached to the cluster. Use this cluster for the following tests.\n3. In the notebook, validate DBFS connectivity by running: \n```\n%fs ls \/\n%sh ls \/dbfs\n\n``` \nConfirm that a file listing appears without errors.\n4. Confirm access to the control plane instance for your region. 
Get the address from the table [IP addresses and domains](https:\/\/docs.databricks.com\/resources\/supported-regions.html#ip-domain-aws) and look for the Webapp endpoint for your VPC region. \n```\n%sh nc -zv <webapp-domain-name> 443\n\n``` \nFor example, for VPC region `us-west-2`: \n```\n%sh nc -zv oregon.cloud.databricks.com 443\n\n``` \nConfirm the result says it succeeded.\n5. Confirm access to the SCC relay for your region. Get the address from the table [IP addresses and domains](https:\/\/docs.databricks.com\/resources\/supported-regions.html#ip-domain-aws) and look for the SCC relay endpoint for your VPC region. \n```\n%sh nc -zv <scc-relay-domain-name> 2443\n\n``` \nFor example, for VPC region `us-west-1`: \n```\n%sh nc -zv tunnel.cloud.databricks.com 2443\n\n``` \nConfirm that the result says it succeeded.\n6. If your workspace is in the US East region, the US West region, or Canada (Central) region, confirm access to the S3 endpoints for your region. \n```\n%sh nc -zv <bucket-name>.s3-fips.<region>.amazonaws.com 443\n\n``` \nFor example, for VPC region `us-west-1`: \n```\n%sh nc -zv acme-company-bucket.s3-fips.us-west-1.amazonaws.com 443\n\n``` \nConfirm the results for all three commands indicate success.\n7. In the same notebook, validate that the cluster Spark config points to the desired endpoints. For example: \n```\n>>> spark.conf.get(\"fs.s3a.stsAssumeRole.stsEndpoint\")\n\"sts.us-west-1.amazonaws.com\"\n\n>>> spark.conf.get(\"fs.s3a.endpoint\")\n\"s3-fips.us-west-2.amazonaws.com\"\n\n```\n6. Confirm that all existing compute in all affected workspaces use only the instance types that are supported by the compliance security profile, listed in [Requirements](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html#requirements) above. 
\nAny workload with an instance type outside of the list above results in compute failing to start up with an `invalid_parameter_exception`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n###### Compliance security profile\n####### Step 2: Enable the compliance security profile on a workspace\n\nNote \n[Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html) is disabled by default on workspaces that have enabled the compliance security profile. Workspace admins can enable it by following the instructions [Enable or disable Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html#enable-or-disable). \n1. Enable the compliance security profile. \nTo directly enable the compliance security profile on a workspace and optionally add compliance standards, see [Enable enhanced security and compliance features on a workspace](https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-compliance.html#aws-workspace-config). \nYou can also set an account-level default for new workspaces to enable the security profile and optionally choose to add compliance standards on new workspaces. See [Set account-level defaults for new workspaces](https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-compliance.html#aws-account-level-defaults). \nUpdates might take up to six hours to propagate to all environments. Workloads that are actively running continue with the settings that were active at the time of starting the compute resource, and new settings apply the next time these workloads are started.\n2. 
Restart all running compute.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n###### Compliance security profile\n####### Step 3: Confirm that the compliance security profile is enabled for a workspace\n\nTo confirm that a workspace is using the compliance security profile, check that it has the **yellow shield logo** displayed in the user interface. \n* A shield logo appears in the top-right of the page, to the left of the workspace name: \n![Shield logo small.](https:\/\/docs.databricks.com\/_images\/shield-profile-logo-small.png)\n* Click the workspace name to see a list of the workspaces that you have access to. The workspaces that enable the compliance security profile have a shield icon followed by the text \u201cCompliance security profile\u201d. \n![Shield logo large.](https:\/\/docs.databricks.com\/_images\/shield-profile-logo-large.png) \nYou can also confirm a workspace is using the compliance security profile from the **Security and compliance** tab on the workspace page in the account console. \n![Shield account.](https:\/\/docs.databricks.com\/_images\/shield-account.png) \nIf the shield icons are missing for a workspace with the compliance security profile enabled, contact your Databricks account team.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/encrypt-otw.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n#### Encrypt traffic between cluster worker nodes\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
\nImportant \nThe [example init script](https:\/\/docs.databricks.com\/security\/keys\/encrypt-otw.html#example-init-script) that is referenced in this article derives its shared encryption secret from the hash of the keystore stored in DBFS. If you rotate the secret by updating the keystore file in DBFS, all running clusters must be restarted. Otherwise, Spark workers may fail to authenticate with the Spark driver due to an inconsistent shared secret, causing jobs to slow down. Furthermore, since the shared secret is stored in DBFS, any user with DBFS access can retrieve the secret using a notebook. \nAs an alternative, you can use one of the following AWS instance types, which automatically encrypt data between worker nodes with no extra configuration required: \n* **General purpose:** `M-fleet`, `Md-fleet`, `M5dn`, `M5n`, `M5zn`, `M7g`, `M7gd`, `M6i`, `M7i`, `M6id`, `M6in`, `M6idn`, `M6a`, `M7a`\n* **Compute optimized:** `C5a`, `C5ad`, `C5n`, `C6gn`, `C7g`, `C7gd`, `C7gn`, `C6i`, `C6id`, `C7i`, `C6in`, `C6a`, `C7a`\n* **Memory optimized:** `R-fleet`, `Rd-fleet`, `R7g`, `R7gd`, `R6i`, `R7i`, `R7iz`, `R6id`, `R6in`, `R6idn`, `R6a`, `R7a`\n* **Storage optimized:** `D3`, `D3en`, `P3dn`, `R5dn`, `R5n`, `I4i`, `I4g`, `I3en`, `Im4gn`, `Is4gen`\n* **Accelerated computing:** `G4dn`, `G5`, `P4d`, `P4de`, `P5`\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/encrypt-otw.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n#### Encrypt traffic between cluster worker nodes\n##### Requirements\n\nThis feature requires the [Enterprise plan](https:\/\/databricks.com\/product\/pricing\/platform-addons). 
Contact your Databricks account team for more information.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/encrypt-otw.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n#### Encrypt traffic between cluster worker nodes\n##### How the init script works\n\nImportant \nThe [example init script](https:\/\/docs.databricks.com\/security\/keys\/encrypt-otw.html#example-init-script) that is referenced in this article derives its shared encryption secret from the hash of the keystore stored in DBFS. If you rotate the secret by updating the keystore file in DBFS, all running clusters must be restarted. Otherwise, Spark workers may fail to authenticate with the Spark driver due to an inconsistent shared secret, causing jobs to slow down. Furthermore, since the shared secret is stored in DBFS, any user with DBFS access can retrieve the secret using a notebook. \nUser queries and transformations are typically sent to your clusters over an encrypted channel. By default, however, the data exchanged between worker nodes in a cluster is not encrypted. If your environment requires that data be encrypted at all times, whether at rest or in transit, you can create an init script that configures your clusters to encrypt traffic between worker nodes, using AES 256-bit encryption over a TLS 1.3 connection. \nNote \nAlthough AES enables cryptographic routines to take advantage of hardware acceleration, there\u2019s a performance penalty compared to unencrypted traffic. This penalty can result in queries taking longer on an encrypted cluster, depending on the amount of data shuffled between nodes. \nEnabling encryption of traffic between worker nodes requires setting Spark configuration parameters through an init script. 
You can use a [cluster-scoped init script](https:\/\/docs.databricks.com\/init-scripts\/cluster-scoped.html) for a single cluster or add a cluster-scoped init script to your cluster policies if you want all clusters in your workspace to use worker-to-worker encryption. \nCopy the keystore file to a directory in DBFS once. Then create the init script that applies the encryption settings. \nThe init script must perform the following tasks: \n1. Get the JKS keystore file and password.\n2. Set the Spark executor configuration.\n3. Set the Spark driver configuration. \nNote \nThe JKS keystore file used for enabling SSL\/HTTPS is dynamically generated for each workspace. The JKS keystore file\u2019s password is hardcoded and not intended to protect the confidentiality of the keystore. \nThe following is an example init script that implements these three tasks to generate the cluster encryption configuration.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/encrypt-otw.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n#### Encrypt traffic between cluster worker nodes\n##### Example init script\n\n```\n#!\/bin\/bash\n\nset -euo pipefail\n\nkeystore_dbfs_file=\"\/dbfs\/<keystore-directory>\/jetty_ssl_driver_keystore.jks\"\n\n## Wait till keystore file is available via Fuse\n\nmax_attempts=30\nwhile [ ! -f \"${keystore_dbfs_file}\" ];\ndo\nif [ \"$max_attempts\" == 0 ]; then\necho \"ERROR: Unable to find the file: $keystore_dbfs_file. Failing the script.\"\nexit 1\nfi\nsleep 2s\n((max_attempts--))\ndone\n## Derive shared internode encryption secret from the hash of the keystore file\nsasl_secret=$(sha256sum \"$keystore_dbfs_file\" | cut -d' ' -f1)\n\nif [ -z \"${sasl_secret}\" ]; then\necho \"ERROR: Unable to derive the secret. Failing the script.\"\nexit 1\nfi\n\n# The JKS keystore file used for enabling SSL\/HTTPS\nlocal_keystore_file=\"$DB_HOME\/keys\/jetty_ssl_driver_keystore.jks\"\n# Password of the JKS keystore file. 
This jks password is hardcoded and is not intended to protect the confidentiality\n# of the keystore. Do not assume the keystore file itself is protected.\nlocal_keystore_password=\"gb1gQqZ9ZIHS\"\n\n## Updating spark-branch.conf is only needed for driver\n\nif [[ $DB_IS_DRIVER = \"TRUE\" ]]; then\ndriver_conf=${DB_HOME}\/driver\/conf\/spark-branch.conf\necho \"Configuring driver conf at $driver_conf\"\n\nif [ ! -e $driver_conf ] ; then\ntouch $driver_conf\nfi\n\ncat << EOF >> $driver_conf\n[driver] {\n\/\/ Configure inter-node authentication\n\"spark.authenticate\" = true\n\"spark.authenticate.secret\" = \"$sasl_secret\"\n\/\/ Configure AES encryption\n\"spark.network.crypto.enabled\" = true\n\"spark.network.crypto.saslFallback\" = false\n\/\/ Configure SSL\n\"spark.ssl.enabled\" = true\n\"spark.ssl.keyPassword\" = \"$local_keystore_password\"\n\"spark.ssl.keyStore\" = \"$local_keystore_file\"\n\"spark.ssl.keyStorePassword\" = \"$local_keystore_password\"\n\"spark.ssl.protocol\" =\"TLSv1.3\"\n\"spark.ssl.standalone.enabled\" = true\n\"spark.ssl.ui.enabled\" = true\n}\nEOF\necho \"Successfully configured driver conf at $driver_conf\"\nfi\n\n# Setting configs in spark-defaults.conf for the spark master and worker\n\nspark_defaults_conf=\"$DB_HOME\/spark\/conf\/spark-defaults.conf\"\necho \"Configuring spark defaults conf at $spark_defaults_conf\"\nif [ ! 
-e $spark_defaults_conf ] ; then\ntouch $spark_defaults_conf\nfi\n\ncat << EOF >> $spark_defaults_conf\nspark.authenticate true\nspark.authenticate.secret $sasl_secret\nspark.network.crypto.enabled true\nspark.network.crypto.saslFallback false\n\nspark.ssl.enabled true\nspark.ssl.keyPassword $local_keystore_password\nspark.ssl.keyStore $local_keystore_file\nspark.ssl.keyStorePassword $local_keystore_password\nspark.ssl.protocol TLSv1.3\nspark.ssl.standalone.enabled true\nspark.ssl.ui.enabled true\nEOF\n\necho \"Successfully configured spark defaults conf at $spark_defaults_conf\"\n\n``` \nOnce the initialization of the driver and worker nodes is complete, all traffic between these nodes is encrypted using the keystore file.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/encrypt-otw.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n#### Encrypt traffic between cluster worker nodes\n##### Notebook example: Install an encryption init script\n\nThe following notebook copies the keystore file and generates the init script in DBFS. You can use the init script to create new clusters with encryption enabled. \n### Install an encryption init script notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/cluster-encryption-init-script.html)\n\n#### Encrypt traffic between cluster worker nodes\n##### Disable encryption between worker nodes\n\nTo disable encryption between worker nodes, remove the init script from the cluster configuration, then restart the cluster.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/encrypt-otw.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Track model development using MLflow\n\nThis article contains examples of tracking model development in Databricks. 
Log and track ML and deep learning models automatically with MLflow or manually with the MLflow API.\n\n#### Track model development using MLflow\n##### Model tracking & MLflow\n\nThe model development process is iterative, and it can be challenging to keep track of your work as you develop and optimize a model. In Databricks, you can use [MLflow tracking](https:\/\/mlflow.org\/docs\/latest\/tracking.html) to help you keep track of the model development process, including parameter settings or combinations you have tried and how they affected the model\u2019s performance. \nMLflow tracking uses *experiments* and *runs* to log and track your ML and deep learning model development. A run is a single execution of model code. During an MLflow run, you can log model parameters and results. An experiment is a collection of related runs. In an experiment, you can compare and filter runs to understand how your model performs and how its performance depends on the parameter settings, input data, and so on. \nThe notebooks in this article provide simple examples that can help you quickly get started using MLflow to track your model development. For more details on using MLflow tracking in Databricks, see [Track ML and deep learning training runs](https:\/\/docs.databricks.com\/mlflow\/tracking.html). \nNote \nMLflow tracking does not support jobs submitted with [spark\\_submit\\_task](https:\/\/docs.databricks.com\/api\/workspace\/jobs) in the Jobs API. Instead, you can use [MLflow Projects](https:\/\/docs.databricks.com\/mlflow\/projects.html) to run Spark code.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/track-model-development\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Track model development using MLflow\n##### Use autologging to track model development\n\nMLflow can automatically log training code written in many ML and deep learning frameworks. 
This is the easiest way to get started using MLflow tracking. \nThis example notebook shows how to use autologging with [scikit-learn](https:\/\/scikit-learn.org\/stable\/index.html). For information about autologging with other Python libraries, see [Automatically log training runs to MLflow](https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html#automatically-log-training-runs-to-mlflow). \n### MLflow autologging Python notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-quick-start-python.html)\n\n#### Track model development using MLflow\n##### Use the logging API to track model development\n\nThis notebook illustrates how to use the MLflow logging API. Using the logging API gives you more control over the metrics logged and lets you log additional artifacts such as tables or plots. \nThis example notebook shows how to use the [Python logging API](https:\/\/mlflow.org\/docs\/latest\/python_api\/index.html). MLflow also has [REST, R, and Java APIs](https:\/\/mlflow.org\/docs\/latest\/tracking.html). 
\n### MLflow logging API Python notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-logging-api-quick-start-python.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/track-model-development\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Track model development using MLflow\n##### End-to-end example\n\nThis tutorial notebook presents an end-to-end example of training a model in Databricks, including loading data, visualizing the data, setting up a parallel hyperparameter optimization, and using MLflow to review the results, register the model, and perform inference on new data using the registered model in a Spark UDF. \n### Requirements \nDatabricks Runtime ML \n### Example notebook \nIf your workspace is enabled for Unity Catalog, use this version of the notebook: \n#### Use scikit-learn with MLflow integration on Databricks (Unity Catalog) \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-end-to-end-example-uc.html) \nIf your workspace is not enabled for Unity Catalog, use this version of the notebook: \n#### Use scikit-learn with MLflow integration on Databricks \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-end-to-end-example.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/track-model-development\/index.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### User-defined scalar functions - Scala\n\nThis article contains Scala user-defined function (UDF) examples. 
It shows how to register UDFs, how to invoke UDFs, and caveats regarding evaluation order of subexpressions in Spark SQL. See [External user-defined scalar functions (UDFs)](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-functions-udf-scalar.html) for more details. \nPreview \nSupport for Scala UDFs on Unity Catalog-enabled clusters with shared access mode is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html) and requires Databricks Runtime 14.2 and above. \nNote \nGraviton instances do not support Scala UDFs on Unity Catalog-enabled clusters.\n\n#### User-defined scalar functions - Scala\n##### Register a function as a UDF\n\n```\nval squared = (s: Long) => {\ns * s\n}\nspark.udf.register(\"square\", squared)\n\n```\n\n#### User-defined scalar functions - Scala\n##### Call the UDF in Spark SQL\n\n```\nspark.range(1, 20).createOrReplaceTempView(\"test\")\n\n``` \n```\n%sql select id, square(id) as id_squared from test\n\n```\n\n#### User-defined scalar functions - Scala\n##### Use UDF with DataFrames\n\n```\nimport org.apache.spark.sql.functions.{col, udf}\nval squared = udf((s: Long) => s * s)\ndisplay(spark.range(1, 20).select(squared(col(\"id\")) as \"id_squared\"))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/scala.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### User-defined scalar functions - Scala\n##### Evaluation order and null checking\n\nSpark SQL (including SQL and the DataFrame and Dataset APIs) does not guarantee the order of evaluation of subexpressions. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order. For example, logical `AND` and `OR` expressions do not have left-to-right \u201cshort-circuiting\u201d semantics. 
\nTherefore, it is dangerous to rely on the side effects or order of evaluation of Boolean expressions, and the order of `WHERE` and `HAVING` clauses, since such expressions and clauses can be reordered during query optimization and planning. Specifically, if a UDF relies on short-circuiting semantics in SQL for null checking, there\u2019s no guarantee that the null check will happen before invoking the UDF. For example: \n```\nspark.udf.register(\"strlen\", (s: String) => s.length)\nspark.sql(\"select s from test1 where s is not null and strlen(s) > 1\") \/\/ no guarantee\n\n``` \nThis `WHERE` clause does not guarantee that the `strlen` UDF is invoked only after nulls are filtered out. \nTo perform proper null checking, we recommend that you do either of the following: \n* Make the UDF itself null-aware and do null checking inside the UDF itself\n* Use `IF` or `CASE WHEN` expressions to do the null check and invoke the UDF in a conditional branch \n```\nspark.udf.register(\"strlen_nullsafe\", (s: String) => if (s != null) s.length else -1)\nspark.sql(\"select s from test1 where s is not null and strlen_nullsafe(s) > 1\") \/\/ ok\nspark.sql(\"select s from test1 where if(s is not null, strlen(s), null) > 1\") \/\/ ok\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/scala.html"} +{"content":"# What is Delta Lake?\n### Rename and drop columns with Delta Lake column mapping\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nDatabricks supports column mapping for Delta Lake tables, which enables metadata-only changes to mark columns as deleted or renamed without rewriting data files. It also allows users to name Delta table columns using characters that are not allowed by Parquet, such as spaces, so that users can directly ingest CSV or JSON data into Delta without the need to rename columns due to previous character constraints. 
\nImportant \nEnabling column mapping also enables random file prefixes, which removes the ability to explore data using Hive-style partitioning. See [Do Delta Lake and Parquet share partitioning strategies?](https:\/\/docs.databricks.com\/tables\/partitions.html#delta-hive-partitions). \nEnabling column mapping on tables might break downstream operations that rely on Delta change data feed. See [Change data feed limitations for tables with column mapping enabled](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html#column-mapping-limitations). \nEnabling column mapping on tables might break streaming read from the Delta table as a source, including in Delta Live Tables. See [Streaming with column mapping and schema changes](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html#schema-tracking).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html"} +{"content":"# What is Delta Lake?\n### Rename and drop columns with Delta Lake column mapping\n#### How to enable Delta Lake column mapping\n\nImportant \nEnabling column mapping for a table upgrades the Delta [table version](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html). This protocol upgrade is irreversible. Tables with column mapping enabled can only be read in Databricks Runtime 10.4 LTS and above. \nColumn mapping requires the following Delta protocols: \n* Reader version 2 or above.\n* Writer version 5 or above. \nFor a Delta table with the required protocol versions, you can enable column mapping by setting `delta.columnMapping.mode` to `name`. \nYou can use the following command to upgrade the table version and enable column mapping: \n```\nALTER TABLE <table-name> SET TBLPROPERTIES (\n'delta.minReaderVersion' = '2',\n'delta.minWriterVersion' = '5',\n'delta.columnMapping.mode' = 'name'\n)\n\n``` \nNote \nYou cannot turn off column mapping after you enable it. 
If you try to set `'delta.columnMapping.mode' = 'none'`, you\u2019ll get an error.\n\n### Rename and drop columns with Delta Lake column mapping\n#### Rename a column\n\nNote \nAvailable in Databricks Runtime 10.4 LTS and above. \nWhen column mapping is enabled for a Delta table, you can rename a column: \n```\nALTER TABLE <table-name> RENAME COLUMN old_col_name TO new_col_name\n\n``` \nFor more examples, see [Update Delta Lake table schema](https:\/\/docs.databricks.com\/delta\/update-schema.html).\n\n### Rename and drop columns with Delta Lake column mapping\n#### Drop columns\n\nNote \nAvailable in Databricks Runtime 11.3 LTS and above. \nWhen column mapping is enabled for a Delta table, you can drop one or more columns: \n```\nALTER TABLE table_name DROP COLUMN col_name\nALTER TABLE table_name DROP COLUMNS (col_name_1, col_name_2, ...)\n\n``` \nFor more details, see [Update Delta Lake table schema](https:\/\/docs.databricks.com\/delta\/update-schema.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html"} +{"content":"# What is Delta Lake?\n### Rename and drop columns with Delta Lake column mapping\n#### Supported characters in column names\n\nWhen column mapping is enabled for a Delta table, you can include spaces and any of these characters in the table\u2019s column names: `,;{}()\\n\\t=`.\n\n### Rename and drop columns with Delta Lake column mapping\n#### Streaming with column mapping and schema changes\n\nImportant \nThis feature is in Public Preview in Databricks Runtime 13.3 LTS and above. \nYou can provide a schema tracking location to enable streaming from Delta tables with column mapping enabled. This overcomes an issue in which non-additive schema changes could result in broken streams. \nEach streaming read against a data source must have its own `schemaTrackingLocation` specified. 
The specified `schemaTrackingLocation` must be contained within the directory specified for the `checkpointLocation` of the target table for streaming write. \nNote \nFor streaming workloads that combine data from multiple source Delta tables, you need to specify unique directories within the `checkpointLocation` for each source table. \nThe option `schemaTrackingLocation` is used to specify the path for schema tracking, as shown in the following code example: \n```\ncheckpoint_path = \"\/path\/to\/checkpointLocation\"\n\n(spark.readStream\n.option(\"schemaTrackingLocation\", checkpoint_path)\n.table(\"delta_source_table\")\n.writeStream\n.option(\"checkpointLocation\", checkpoint_path)\n.toTable(\"output_table\")\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html"} +{"content":"# DatabricksIQ-powered features\n### What is Databricks Assistant?\n\nPreview \nThis feature is currently in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nDatabricks Assistant works as an AI-based companion pair-programmer to make you more efficient as you create notebooks, queries, and files. It can help you rapidly answer questions by generating, optimizing, completing, explaining, and fixing code and queries. \nThis page provides general information about the Assistant in the form of frequently asked questions. For questions about privacy and security, see [Privacy and security](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html#privacy-security).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html"} +{"content":"# DatabricksIQ-powered features\n### What is Databricks Assistant?\n#### Enable or disable Databricks Assistant\n\nDatabricks Assistant is enabled by default. You can manage enablement for all workspaces in an account or individual workspaces. 
\nEnablement of the Databricks Assistant for your account is captured as an account event in your audit logs. See [Account events](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html#accounts). \nNote \nAn improved Databricks Assistant experience that tracks query threads and history throughout editor contexts is also available on an opt-in basis. \nTo enable this experience, change the **New Assistant** toggle to **On** when you open the Assistant and reload the page. \n![Enable improved DB Assistant experience that tracks query threads and history.](https:\/\/docs.databricks.com\/_images\/enable-single-assistant-preview.png) \n### Manage the account setting \nTo enable or disable Databricks Assistant for all workspaces in an account, follow these instructions: \n1. As an account admin, log in to the [account console](https:\/\/accounts.cloud.databricks.com).\n2. Click **Settings**.\n3. Click the **Advanced** tab.\n4. From the **Other** > **Partner-powered AI assistive features** section, select **Enabled** or **Disabled**, and then click **Save**. \n### Manage the workspace setting \nIf the account setting permits workspace setting overrides, workspace admins can enable or disable specific workspaces. To do this, use a Workspace Setting to override the default setting in the Account Console as follows: \n1. Go to the workspace [admin settings page](https:\/\/docs.databricks.com\/admin\/index.html#admin-settings).\n2. Click the **Advanced** tab.\n3. Use the **Partner-powered AI assistive features** drop-down menu to make your selection.\n4. 
Click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html"} +{"content":"# DatabricksIQ-powered features\n### What is Databricks Assistant?\n#### Get coding help from Databricks Assistant\n\nTo access Databricks Assistant, click the Assistant icon ![Databricks assistant icon](https:\/\/docs.databricks.com\/_images\/assistant-icon.png) in the left sidebar of the notebook, the file editor, the SQL Editor, or the dashboard **Data** tab. \n![Databricks assistant icon location](https:\/\/docs.databricks.com\/_images\/assistant-icon-in-sidebar.png) \nThe Assistant pane can open on the left or right side of the screen. \n![Databricks assistant pane](https:\/\/docs.databricks.com\/_images\/assistant-panel.png) \nSome capabilities of Databricks Assistant are the following: \n* Generate: Use natural language to generate a SQL query.\n* Explain: Highlight a query or a block of code and have Databricks Assistant walk through the logic in clear, concise English.\n* Fix: Explain and fix syntax and runtime errors with a single click.\n* Transform and optimize: Convert Pandas code to PySpark for faster execution. \nAny code generated by the Databricks Assistant is intended to run in a Databricks compute environment. It is optimized to create code in Databricks-supported programming languages, frameworks, and dialects. It is not intended to be a general-purpose programming assistant. The Assistant often uses information from Databricks resources, such as the Databricks Documentation website or Knowledge Base, to better answer user queries. It performs best when the user question is related to questions that can be answered with knowledge from Databricks documentation, Unity Catalog, and user code in the Workspace. 
\nUsers should always review any code generated by the Assistant before running it because it can sometimes make mistakes.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html"} +{"content":"# DatabricksIQ-powered features\n### What is Databricks Assistant?\n#### Create data visualizations using the Databricks Assistant\n\nYou can use the Databricks Assistant when drafting dashboards. As you create visualizations on an existing dashboard dataset, prompt the Assistant with questions to receive responses in the form of generated charts. To use the Assistant in a dashboard, first create one or more datasets, then add a visualization widget to the Canvas. The visualization widget includes a prompt to describe your new chart. Type a description of the chart you want to see, and the assistant will generate it. You can approve or reject the chart, or modify the description to generate something new. \nFor details and examples of using the Assistant with dashboards, see [Create visualizations with Databricks Assistant](https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-w-db-assistant.html).\n\n### What is Databricks Assistant?\n#### Services used by Databricks Assistant\n\nDatabricks Assistant might use third-party services to provide responses, including [Azure OpenAI](https:\/\/azure.microsoft.com\/products\/cognitive-services\/openai-service\/) operated by Microsoft. \nThese services are subject to their respective data management policies. Data sent to these services is not used for any model training. For details, see [Azure data management policy](https:\/\/learn.microsoft.com\/legal\/cognitive-services\/openai\/data-privacy). 
\nFor Azure OpenAI, Databricks has opted out of [Abuse Monitoring](https:\/\/learn.microsoft.com\/legal\/cognitive-services\/openai\/data-privacy?context=%2Fazure%2Fai-services%2Fopenai%2Fcontext%2Fcontext#preventing-abuse-and-harmful-content-generation) so no prompts or responses are stored with Azure OpenAI.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html"} +{"content":"# DatabricksIQ-powered features\n### What is Databricks Assistant?\n#### Tips for improving the accuracy of results\n\n* **Use the prompt \u201cFind Tables\u201d for better responses.** Before you ask questions about data in a table, ask the Assistant to find related tables by subject matter or other characteristics. Example: `Find tables related to NFL games`.\n* **Specify the structure of the response you want.** The structure and detail that Databricks Assistant provides varies, even for the same prompt. Databricks Assistant knows about your table and column schema and metadata, so you can use natural language to ask your question. Example: `List active and retired NFL quarterbacks' passing completion rate, for those who had over 500 attempts in a season.` Assistant answers using data from columns such as `s.player_id` and `s.attempts`.\n* **Provide examples of your row-level data values.** Databricks Assistant doesn\u2019t have access to row-level data, thus for more accurate answers provide examples of the data. Example: `List the average height for each position in inches`. This returns an error because the data set shows height in feet and inches, as in `6-2`.\n* **Test code snippets by running them in the Assistant pane.** Use the Assistant pane as a scratchpad that saves iterations of your queries and assistant answers. You can run code and edit it in the pane until you are ready to add it to a notebook. 
\n![Testing code snippets by running them in the Assistant pane.](https:\/\/docs.databricks.com\/_images\/run-code-in-assistant.gif)\n* **Use cell actions in a notebook.** Cell actions include shortcuts to common tasks, such as documenting (commenting), fixing, and explaining code. \n![`\/doc` cell action prompts Assistant to comment the code.](https:\/\/docs.databricks.com\/_images\/cell-action-doc.gif) \nFor fully illustrated examples, see [5 tips for Databricks Assistant](https:\/\/www.databricks.com\/blog\/5-tips-get-most-out-your-databricks-assistant). \nDatabricks Assistant considers the history of the conversation so you can refine your questions as you go.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html"} +{"content":"# DatabricksIQ-powered features\n### What is Databricks Assistant?\n#### Give feedback\n\nThe best way to send feedback is to use the **Provide Feedback** links in the notebook and SQL editor. You can also send an email to [assistant-feedback@databricks.com](mailto:assistant-feedback%40databricks.com) or to your account team. \nShare product improvement suggestions and user experience issues rather than feedback about prompt accuracy. If you receive an unhelpful suggestion from the Assistant, click the \u201cNot useful\u201d ![Thumb down icon](https:\/\/docs.databricks.com\/_images\/assistant-thumb-down.png) button.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html"} +{"content":"# DatabricksIQ-powered features\n### What is Databricks Assistant?\n#### Privacy and security\n\n### Q: What data is being sent to the models? \nDatabricks Assistant sends code and metadata to the models on each API request. This helps return more relevant results for your data. 
Examples include: \n* Code\/queries in the current notebook cell or SQL Editor tab\n* Table and column names and descriptions\n* Previous questions\n* Favorite tables \n### Q: Does the metadata sent to the models respect the user\u2019s Unity Catalog permissions? \nYes, all of the data sent to the model respects the user\u2019s Unity Catalog permissions, so the Assistant does not send metadata relating to tables that the user does not have permission to see. \n### Q: If I execute a query with results, and then ask a question, do the results of my query get sent to the model? \nNo, only the code contents in cells, metadata about tables, and the user-entered text are shared with the model. For the \u201cfix error\u201d feature, Databricks also shares the stack trace from the error output. \n### Q: Will Databricks Assistant execute dangerous code? \nNo. Databricks Assistant does not automatically run code on your behalf. AI models can make mistakes, misunderstand intent, and hallucinate or give incorrect answers. Review and test AI-generated code before you run it. \n### Q: Has Databricks done any assessment to evaluate the accuracy and appropriateness of the Assistant responses? \nYes. Databricks has mitigations to prevent the Assistant from generating harmful responses such as hate speech, insecure code, prompt jailbreaks, and third-party copyright content. Databricks has done extensive testing of all our AI assistive features with thousands of simulated user inputs to assess the robustness of mitigations. These assessments focused on the expected use cases for the Assistant such as code generation in the Python, Databricks SQL, R, and Scala languages. \n### Q: Can I use Databricks Assistant with tables that process regulated data (PHI, PCI, IRAP, FedRAMP)? \nYes. 
To do so, you must comply with requirements, such as enabling the [compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html), and add the relevant compliance standard as part of the compliance security profile configuration.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html"} +{"content":"# \n### RAG Studio\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \nImportant \nTo navigate RAG Studio documentation, return to this index page. Private preview documentation doesn\u2019t include navigation links in the side panel.\n\n### RAG Studio\n#### Overview\n\nRAG Studio provides tools and an opinionated workflow for developing, evaluating, and iterating on Retrieval-Augmented Generation (RAG) applications in order to build apps that deliver consistent, accurate answers. RAG Studio is built on top of MLflow and is tightly integrated with Databricks tools and infrastructure. \nRead more about RAG Studio\u2019s [product philosophy](https:\/\/docs.databricks.com\/rag-studio\/approach-overview.html) for developing RAG Applications.\n\n### RAG Studio\n#### Development workflow\n\nThe RAG Studio approach to improving quality is to make it easy for developers to quickly: \n1. Adjust various knobs throughout the RAG application\u2019s `\ud83d\udce5 Data Ingestor`, `\ud83d\uddc3\ufe0f Data Processor`, `\ud83d\udd0d Retriever`, and `\ud83d\udd17 Chain` to create a new `Version`\n2. Test the `Version` offline with a `\ud83d\udcd6 Evaluation Set` and `\ud83e\udd16 LLM Judge`s\n3. Deploy the `Version` in the `\ud83d\udcac Review UI` to collect feedback from `\ud83e\udde0 Expert Users`\n4. Review `\ud83d\udcc8 Evaluation Results` to determine if the changes had a positive impact on quality, cost, and\/or latency\n5. 
Investigate the details in the `\ud83d\uddc2\ufe0f Request Log` and `\ud83d\udc4d Assessment & Evaluation Results Log` to identify hypotheses for how to improve quality, cost, and\/or latency\n6. If needed, collect additional feedback on specific `\ud83d\uddc2\ufe0f Request Log`s from `\ud83e\udde0 Expert Users` using the `\ud83d\udcac Review UI`\n7. *Repeat until you reach your quality\/cost\/latency targets!*\n8. Deploy the application to production \nNote \nImportantly, the same development workflow above applies to production traffic! The RAG Studio data model for logs, assessments, and metrics is fully unified between development and production.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/index.html"} +{"content":"# \n### RAG Studio\n#### Tutorials\n\nTutorials demonstrate how to do the key developer workflows mentioned above, based on the fully featured sample RAG Application included with RAG Studio - a Documentation Q&A bot on the Databricks documentation. \nImportant \nDatabricks suggests getting started by going through these tutorials. Following tutorials [#1](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1-create-sample-app.html) and [#2](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1c-deploy-version.html) will deploy a fully functioning chat UI for the sample application. While you can do these tutorials in any order, they are designed to be done sequentially. \n1. [Initialize a RAG Application](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1-create-sample-app.html)\n2. [Ingest or connect raw data](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1b-ingest-data.html)\n3. [Deploy a version of a RAG Application](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1c-deploy-version.html)\n4. [View logs & assessments](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/2-view-logs.html)\n5. 
[Run offline evaluation with a \ud83d\udcd6 Evaluation Set](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/3-run-offline-eval.html)\n6. [Collect feedback from \ud83e\udde0 Expert Users](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/4-collect-feedback.html)\n7. [Create an \ud83d\udcd6 Evaluation Set](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/5-create-eval-set.html)\n8. [Create versions of your RAG application to iterate on the app\u2019s quality](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-app-versions.html) \n* Create a [version](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-data-processor.html) of the `\ud83d\uddc3\ufe0f Data Processor`\n* Create a [version](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-retriever.html) of the `\ud83d\udd0d Retriever`\n* Create a [version](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-chain.html) of the `\ud83d\udd17 Chain`\n9. [Collect feedback on \ud83d\uddc2\ufe0f Request Logs from expert users](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/8-eval-review.html)\n10. [Deploy a RAG application to production](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/6-deploy-rag-app-to-production.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/index.html"} +{"content":"# \n### RAG Studio\n#### Concept guides\n\nFor a deep dive of RAG Studio concepts and architecture, review these guides. 
\n* [Rag Studio user personas](https:\/\/docs.databricks.com\/rag-studio\/concepts\/user-personas.html)\n* [Key concepts & terminology](https:\/\/docs.databricks.com\/rag-studio\/concepts\/terminology-key-concepts.html)\n* [Supported metrics](https:\/\/docs.databricks.com\/rag-studio\/details\/metrics.html)\n* [Development & production environments](https:\/\/docs.databricks.com\/rag-studio\/details\/environments.html)\n\n### RAG Studio\n#### Additional reference\n\nThese documents provide additional reference material that is linked from the above guides. \n* Initial setup \n+ [Set up your development environment](https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-dev.html)\n+ [Provision required infrastructure](https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-infra.html)\n* [Infrastructure and Unity Catalog assets created by RAG Studio](https:\/\/docs.databricks.com\/rag-studio\/details\/created-infra.html)\n* [Supported regions](https:\/\/docs.databricks.com\/rag-studio\/regions.html)\n* [Directory structure](https:\/\/docs.databricks.com\/rag-studio\/details\/directory-structure.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/index.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query Amazon Redshift using Databricks\n\nYou can read and write tables from Amazon Redshift with Databricks. \nNote \nYou may prefer Lakehouse Federation for managing queries to Redshift. See [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). \nThe Databricks Redshift data source uses Amazon S3 to efficiently transfer data in and out of Redshift and uses JDBC to automatically trigger the appropriate `COPY` and `UNLOAD` commands on Redshift. \nNote \nIn Databricks Runtime 11.3 LTS and above, Databricks Runtime includes the Redshift JDBC driver, accessible using the `redshift` keyword for the format option. 
See [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html) for driver versions included in each Databricks Runtime. User-provided drivers are still supported and take precedence over the bundled JDBC driver. \nIn Databricks Runtime 10.4 LTS and below, manual installation of the Redshift JDBC driver is required, and queries should use the driver (`com.databricks.spark.redshift`) for the format. See [Redshift driver installation](https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html#installation).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query Amazon Redshift using Databricks\n##### Usage\n\nThe following examples demonstrate connecting with the Redshift driver. Replace the `url` parameter values if you\u2019re using the PostgreSQL JDBC driver. \nOnce you have [configured your AWS credentials](https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html#redshift-aws-credentials), you can use the data source with the Spark data source API in Python, SQL, R, or Scala. \nImportant \n[External locations defined in Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html) are not supported as `tempdir` locations. 
\n```\n# Read data from a table using Databricks Runtime 10.4 LTS and below\ndf = (spark.read\n.format(\"redshift\")\n.option(\"dbtable\", table_name)\n.option(\"tempdir\", \"s3a:\/\/<bucket>\/<directory-path>\")\n.option(\"url\", \"jdbc:redshift:\/\/<database-host-url>\")\n.option(\"user\", username)\n.option(\"password\", password)\n.option(\"forward_spark_s3_credentials\", True)\n.load()\n)\n\n# Read data from a table using Databricks Runtime 11.3 LTS and above\ndf = (spark.read\n.format(\"redshift\")\n.option(\"host\", \"hostname\")\n.option(\"port\", \"port\") # Optional - will use default port 5439 if not specified.\n.option(\"user\", \"username\")\n.option(\"password\", \"password\")\n.option(\"database\", \"database-name\")\n.option(\"dbtable\", \"schema-name.table-name\") # if schema-name is not specified, default to \"public\".\n.option(\"tempdir\", \"s3a:\/\/<bucket>\/<directory-path>\")\n.option(\"forward_spark_s3_credentials\", True)\n.load()\n)\n\n# Read data from a query\ndf = (spark.read\n.format(\"redshift\")\n.option(\"query\", \"select x, count(*) from <your-table-name> group by x\")\n.option(\"tempdir\", \"s3a:\/\/<bucket>\/<directory-path>\")\n.option(\"url\", \"jdbc:redshift:\/\/<database-host-url>\")\n.option(\"user\", username)\n.option(\"password\", password)\n.option(\"forward_spark_s3_credentials\", True)\n.load()\n)\n\n# After you have applied transformations to the data, you can use\n# the data source API to write the data back to another table\n\n# Write back to a table\n(df.write\n.format(\"redshift\")\n.option(\"dbtable\", table_name)\n.option(\"tempdir\", \"s3a:\/\/<bucket>\/<directory-path>\")\n.option(\"url\", \"jdbc:redshift:\/\/<database-host-url>\")\n.option(\"user\", username)\n.option(\"password\", password)\n.mode(\"error\")\n.save()\n)\n\n# Write back to a table using IAM Role based authentication\n(df.write\n.format(\"redshift\")\n.option(\"dbtable\", table_name)\n.option(\"tempdir\", 
\"s3a:\/\/<bucket>\/<directory-path>\")\n.option(\"url\", \"jdbc:redshift:\/\/<database-host-url>\")\n.option(\"user\", username)\n.option(\"password\", password)\n.option(\"aws_iam_role\", \"arn:aws:iam::123456789000:role\/redshift_iam_role\")\n.mode(\"error\")\n.save()\n)\n\n``` \nRead data using SQL on Databricks Runtime 10.4 LTS and below: \n```\nDROP TABLE IF EXISTS redshift_table;\nCREATE TABLE redshift_table\nUSING redshift\nOPTIONS (\ndbtable '<table-name>',\ntempdir 's3a:\/\/<bucket>\/<directory-path>',\nurl 'jdbc:redshift:\/\/<database-host-url>',\nuser '<username>',\npassword '<password>',\nforward_spark_s3_credentials 'true'\n);\nSELECT * FROM redshift_table;\n\n``` \nRead data using SQL on Databricks Runtime 11.3 LTS and above: \n```\nDROP TABLE IF EXISTS redshift_table;\nCREATE TABLE redshift_table\nUSING redshift\nOPTIONS (\nhost '<hostname>',\nport '<port>', \/* Optional - will use default port 5439 if not specified. *\/\nuser '<username>',\npassword '<password>',\ndatabase '<database-name>',\ndbtable '<schema-name>.<table-name>', \/* if schema-name not provided, default to \"public\". *\/\ntempdir 's3a:\/\/<bucket>\/<directory-path>',\nforward_spark_s3_credentials 'true'\n);\nSELECT * FROM redshift_table;\n\n``` \nWrite data using SQL: \n```\nDROP TABLE IF EXISTS redshift_table_new;\nCREATE TABLE redshift_table_new\nUSING redshift\nOPTIONS (\ndbtable '<new-table-name>',\ntempdir 's3a:\/\/<bucket>\/<directory-path>',\nurl 'jdbc:redshift:\/\/<database-host-url>',\nuser '<username>',\npassword '<password>',\nforward_spark_s3_credentials 'true'\n) AS\nSELECT * FROM table_name;\n\n``` \nThe SQL API supports only the creation of new tables and not overwriting or appending. 
\nRead data using R on Databricks Runtime 10.4 LTS and below: \n```\ndf <- read.df(\nNULL,\n\"com.databricks.spark.redshift\",\ntempdir = \"s3a:\/\/<your-bucket>\/<your-directory-path>\",\ndbtable = \"<your-table-name>\",\nurl = \"jdbc:redshift:\/\/<the-rest-of-the-connection-string>\")\n\n``` \nRead data using R on Databricks Runtime 11.3 LTS and above: \n```\ndf <- read.df(\nNULL,\n\"redshift\",\nhost = \"hostname\",\nport = \"port\",\nuser = \"username\",\npassword = \"password\",\ndatabase = \"database-name\",\ndbtable = \"schema-name.table-name\",\ntempdir = \"s3a:\/\/<your-bucket>\/<your-directory-path>\",\nforward_spark_s3_credentials = \"true\")\n\n``` \n```\n\/\/ Read data from a table using Databricks Runtime 10.4 LTS and below\nval df = spark.read\n.format(\"redshift\")\n.option(\"dbtable\", table_name)\n.option(\"tempdir\", \"s3a:\/\/<bucket>\/<directory-path>\")\n.option(\"url\", \"jdbc:redshift:\/\/<database-host-url>\")\n.option(\"user\", username)\n.option(\"password\", password)\n.option(\"forward_spark_s3_credentials\", true)\n.load()\n\n\/\/ Read data from a table using Databricks Runtime 11.3 LTS and above\nval df = spark.read\n.format(\"redshift\")\n.option(\"host\", \"hostname\")\n.option(\"port\", \"port\") \/* Optional - will use default port 5439 if not specified. *\/\n.option(\"user\", \"username\")\n.option(\"password\", \"password\")\n.option(\"database\", \"database-name\")\n.option(\"dbtable\", \"schema-name.table-name\") \/* if schema-name is not specified, default to \"public\". 
*\/\n.option(\"tempdir\", \"s3a:\/\/<bucket>\/<directory-path>\")\n.option(\"forward_spark_s3_credentials\", true)\n.load()\n\n\/\/ Read data from a query\nval df = spark.read\n.format(\"redshift\")\n.option(\"query\", \"select x, count(*) from <your-table-name> group by x\")\n.option(\"tempdir\", \"s3a:\/\/<bucket>\/<directory-path>\")\n.option(\"url\", \"jdbc:redshift:\/\/<database-host-url>\")\n.option(\"user\", username)\n.option(\"password\", password)\n.option(\"forward_spark_s3_credentials\", true)\n.load()\n\n\/\/ After you have applied transformations to the data, you can use\n\/\/ the data source API to write the data back to another table\n\n\/\/ Write back to a table\ndf.write\n.format(\"redshift\")\n.option(\"dbtable\", table_name)\n.option(\"tempdir\", \"s3a:\/\/<bucket>\/<directory-path>\")\n.option(\"url\", \"jdbc:redshift:\/\/<database-host-url>\")\n.option(\"user\", username)\n.option(\"password\", password)\n.mode(\"error\")\n.save()\n\n\/\/ Write back to a table using IAM Role based authentication\ndf.write\n.format(\"redshift\")\n.option(\"dbtable\", table_name)\n.option(\"tempdir\", \"s3a:\/\/<bucket>\/<directory-path>\")\n.option(\"url\", \"jdbc:redshift:\/\/<database-host-url>\")\n.option(\"user\", username)\n.option(\"password\", password)\n.option(\"aws_iam_role\", \"arn:aws:iam::123456789000:role\/redshift_iam_role\")\n.mode(\"error\")\n.save()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query Amazon Redshift using Databricks\n##### Recommendations for working with Redshift\n\nQuery execution may extract large amounts of data to S3. If you plan to perform several queries against the same data in Redshift, Databricks recommends saving the extracted data using [Delta Lake](https:\/\/docs.databricks.com\/delta\/index.html). 
\nNote \nYou should not create a Redshift cluster inside the Databricks managed VPC as it can lead to permissions issues due to the security model in the Databricks VPC. You should create your own VPC and then perform [VPC peering](https:\/\/docs.databricks.com\/security\/network\/classic\/vpc-peering.html) to connect Databricks to your Redshift instance.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query Amazon Redshift using Databricks\n##### Configuration\n\n### Authenticating to S3 and Redshift \nThe data source involves several network connections, illustrated in the following diagram: \n```\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500>\u2502 S3 \u2502<\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 IAM or keys \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 IAM or keys \u2502\n\u2502 ^ \u2502\n\u2502 \u2502 IAM or keys \u2502\nv v \u250c\u2500\u2500\u2500\u2500\u2500\u2500v\u2500\u2500\u2500\u2500\u2510\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2510\n\u2502 Redshift \u2502 \u2502 Spark \u2502 \u2502\u2502 Spark \u2502\n\u2502 \u2502<\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500>\u2502 Driver \u2502<\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500>| Executors \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\nJDBC 
with Configured\nusername \/ in\npassword Spark\n(SSL enabled by default)\n\n``` \nThe data source reads and writes data to S3 when transferring data to\/from Redshift. As a result, it requires AWS credentials with read and write access to an S3 bucket (specified using the `tempdir` configuration parameter). \nNote \nThe data source does not clean up the temporary files that it creates in S3. As a result, we recommend that you use a dedicated temporary S3 bucket with an [object lifecycle configuration](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/dev\/object-lifecycle-mgmt.html) to ensure that temporary files are automatically deleted after a specified expiration period. See the [Encryption](https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html#redshift-encryption) section of this document for a discussion of how to encrypt these files. You cannot use an [External location defined in Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html) as a `tempdir` location. \nThe following sections describe each connection\u2019s authentication configuration options: \n* [Spark driver to Redshift](https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html#spark-driver-to-redshift)\n* [Spark to S3](https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html#spark-to-s3)\n* [Redshift to S3](https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html#redshift-to-s3) \n#### [Spark driver to Redshift](https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html#id1) \nThe Spark driver connects to Redshift via JDBC using a username and password. Redshift does not support the use of IAM roles to authenticate this connection. By default, this connection uses SSL encryption; for more details, see [Encryption](https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html#redshift-encryption). 
\n#### [Spark to S3](https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html#id2) \nS3 acts as an intermediary to store bulk data when reading from or writing to Redshift. Spark connects to S3 using both the Hadoop FileSystem interfaces and directly using the Amazon Java SDK\u2019s S3 client. \nNote \nYou cannot use DBFS mounts to configure access to S3 for Redshift. \n* **Default Credential Provider Chain (best option for most users):** AWS credentials are automatically retrieved through the [DefaultAWSCredentialsProviderChain](https:\/\/docs.aws.amazon.com\/sdk-for-java\/v1\/developer-guide\/credentials.html). If you use [instance profiles](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html) to authenticate to S3 then you should probably use this method. \nThe following methods of providing credentials take precedence over this default.\n* **By assuming an IAM role**: You can use an IAM role that the instance profile can assume. To specify the role ARN, you must [attach an instance profile to the cluster](https:\/\/docs.databricks.com\/compute\/configure.html#instance-profiles), and provide the following configuration keys: \n```\nsc.hadoopConfiguration.set(\"fs.s3a.credentialsType\", \"AssumeRole\")\nsc.hadoopConfiguration.set(\"fs.s3a.stsAssumeRole.arn\", <iam-role-arn-to-be-assumed>)\n\/\/ An optional duration, expressed as a quantity and a unit of\n\/\/ time, such as \"15m\" or \"1h\"\nsc.hadoopConfiguration.set(\"fs.s3a.assumed.role.session.duration\", <duration>)\n\n``` \n```\nsc._jsc.hadoopConfiguration().set(\"fs.s3a.credentialsType\", \"AssumeRole\")\nsc._jsc.hadoopConfiguration().set(\"fs.s3a.stsAssumeRole.arn\", <iam-role-arn-to-be-assumed>)\n# An optional duration, expressed as a quantity and a unit of\n# time, such as \"15m\" or \"1h\"\nsc._jsc.hadoopConfiguration().set(\"fs.s3a.assumed.role.session.duration\", <duration>)\n\n``` \n* **Set keys in Hadoop conf:** You can specify AWS keys using 
[Hadoop configuration properties](https:\/\/github.com\/apache\/hadoop\/blob\/trunk\/hadoop-tools\/hadoop-aws\/src\/site\/markdown\/tools\/hadoop-aws\/index.md). If your `tempdir` configuration points to an `s3a:\/\/` filesystem, you can set the `fs.s3a.access.key` and `fs.s3a.secret.key` properties in a Hadoop XML configuration file or call `sc.hadoopConfiguration.set()` to configure Spark\u2019s global Hadoop configuration. If you use an `s3n:\/\/` filesystem, you can provide the legacy configuration keys as shown in the following example. \nFor example, if you are using the `s3a` filesystem, add: \n```\nsc.hadoopConfiguration.set(\"fs.s3a.access.key\", \"<your-access-key-id>\")\nsc.hadoopConfiguration.set(\"fs.s3a.secret.key\", \"<your-secret-key>\")\n\n``` \nFor the legacy `s3n` filesystem, add: \n```\nsc.hadoopConfiguration.set(\"fs.s3n.awsAccessKeyId\", \"<your-access-key-id>\")\nsc.hadoopConfiguration.set(\"fs.s3n.awsSecretAccessKey\", \"<your-secret-key>\")\n\n``` \nThe following command relies on some Spark internals, but should work with all PySpark versions and is unlikely to change in the future: \n```\nsc._jsc.hadoopConfiguration().set(\"fs.s3a.access.key\", \"<your-access-key-id>\")\nsc._jsc.hadoopConfiguration().set(\"fs.s3a.secret.key\", \"<your-secret-key>\")\n\n``` \n#### [Redshift to S3](https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html#id3) \nRedshift also connects to S3 during `COPY` and `UNLOAD` queries. There are three methods of authenticating this connection: \n* **Have Redshift assume an IAM role (most secure)**: You can grant Redshift permission to assume an IAM role during `COPY` or `UNLOAD` operations and then configure the data source to instruct Redshift to use that role: \n1. Create an IAM role granting appropriate S3 permissions to your bucket.\n2. 
Follow the guide [Authorizing Amazon Redshift to Access Other AWS Services On Your Behalf](https:\/\/docs.aws.amazon.com\/redshift\/latest\/mgmt\/authorizing-redshift-service.html) to configure this role\u2019s trust policy to allow Redshift to assume this role.\n3. Follow the steps in the [Authorizing COPY and UNLOAD Operations Using IAM Roles](https:\/\/docs.aws.amazon.com\/redshift\/latest\/mgmt\/copy-unload-iam-role.html) guide to associate that IAM role with your Redshift cluster.\n4. Set the data source\u2019s `aws_iam_role` option to the role\u2019s ARN.\n* **Forward Spark\u2019s S3 credentials to Redshift**: If the `forward_spark_s3_credentials` option is set to `true`, the data source automatically discovers the credentials that Spark is using to connect to S3 and forwards those credentials to Redshift over JDBC. If Spark is authenticating to S3 using an instance profile, a set of temporary STS credentials is forwarded to Redshift; otherwise, AWS keys are forwarded. The JDBC query embeds these credentials, so Databricks strongly recommends that you enable SSL encryption of the JDBC connection when using this authentication method.\n* **Use Security Token Service (STS) credentials**: You may configure the `temporary_aws_access_key_id`, `temporary_aws_secret_access_key`, and `temporary_aws_session_token` configuration properties to point to temporary keys created via the AWS [Security Token Service](https:\/\/docs.aws.amazon.com\/IAM\/latest\/UserGuide\/id_credentials_temp.html). The JDBC query embeds these credentials, so Databricks **strongly recommends** that you enable SSL encryption of the JDBC connection when using this authentication method. If you choose this option, be aware of the risk that the credentials expire before the read or write operation succeeds. \nThese three options are mutually exclusive and you must explicitly choose which one to use. 
\n### Encryption \n* **Securing JDBC**: Unless any SSL-related settings are present in the JDBC URL, the data source by default enables SSL encryption and also verifies that the Redshift server is trustworthy (that is, `sslmode=verify-full`). For that, a server certificate is automatically downloaded from the Amazon servers the first time it is needed. If that fails, a pre-bundled certificate file is used as a fallback. This holds for both the Redshift and the PostgreSQL JDBC drivers. \nIf there are any issues with this feature, or you simply want to disable SSL, you can call `.option(\"autoenablessl\", \"false\")` on your `DataFrameReader` or `DataFrameWriter`. \nIf you want to specify custom SSL-related settings, you can follow the instructions in the Redshift documentation: [Using SSL and Server Certificates in Java](https:\/\/docs.aws.amazon.com\/redshift\/latest\/mgmt\/connecting-ssl-support.html#connecting-ssl-support-java)\nand [JDBC Driver Configuration Options](https:\/\/docs.aws.amazon.com\/redshift\/latest\/mgmt\/configure-jdbc-options.html). Any SSL-related options present in the JDBC `url` used with the data source take precedence (that is, the auto-configuration will not trigger).\n* **Encrypting UNLOAD data stored in S3 (data stored when reading from Redshift)**: According to the Redshift documentation on [Unloading Data to S3](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/t_Unloading_tables.html), \u201cUNLOAD automatically encrypts data files using Amazon S3 server-side encryption (SSE-S3).\u201d \nRedshift also supports client-side encryption with a custom key (see: [Unloading Encrypted Data Files](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/t_unloading_encrypted_files.html)) but the data source lacks the capability to specify the required symmetric key.\n* **Encrypting COPY data stored in S3 (data stored when writing to Redshift)**: According to the Redshift documentation on [Loading Encrypted Data Files from Amazon 
S3](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/c_loading-encrypted-files.html): \nYou can use the `COPY` command to load data files that were uploaded to Amazon S3 using server-side encryption with AWS-managed encryption keys (SSE-S3 or SSE-KMS), client-side encryption, or both. COPY does not support Amazon S3 server-side encryption with a customer-supplied key (SSE-C). \nTo use this capability, configure your Hadoop S3 filesystem to use [Amazon S3 encryption](https:\/\/docs.databricks.com\/dbfs\/mounts.html#s3-encryption). This will not encrypt the `MANIFEST` file that contains a list of all files written. \n### Parameters \nThe parameter map or OPTIONS provided in Spark SQL support the following settings: \n| Parameter | Required | Default | Description |\n| --- | --- | --- | --- |\n| dbtable | Yes, unless query is specified. | None | The table to create or read from in Redshift. This parameter is required when saving data back to Redshift. |\n| query | Yes, unless dbtable is specified. | None | The query to read from in Redshift. |\n| user | No | None | The Redshift username. Must be used in tandem with password option. Can be used only if the user and password are not passed in the URL, passing both will result in an error. Use this parameter when the username contains special characters that need to be escaped. |\n| password | No | None | The Redshift password. Must be used in tandem with `user` option. Can be used only if the user and password are not passed in the URL; passing both will result in an error. Use this parameter when the password contains special characters that need to be escaped. |\n| url | Yes | None | A JDBC URL, of the format ``` jdbc:subprotocol:\/\/<host>:<port>\/database?user=<username>&password=<password> ``` `subprotocol` can be `postgresql` or `redshift`, depending on which JDBC driver you have loaded. One Redshift-compatible driver must be on the classpath and match this URL. 
`host` and `port` should point to the Redshift master node, so security groups and\/or VPC must be configured to allow access from your driver application. `database` identifies a Redshift database name. `user` and `password` are credentials to access the database, which must be embedded in this URL for JDBC, and your user account should have necessary privileges for the table being referenced. |\n| search\\_path | No | None | Set schema search path in Redshift. Will be set using the `SET search_path to` command. Should be a comma separated list of schema names to search for tables in. See [Redshift documentation of search\\_path](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/r_search_path.html). |\n| aws\\_iam\\_role | Only if using IAM roles to authorize. | None | Fully specified ARN of the [IAM Redshift COPY\/UNLOAD operations Role](https:\/\/docs.aws.amazon.com\/redshift\/latest\/mgmt\/copy-unload-iam-role.html) attached to the Redshift cluster. For example, `arn:aws:iam::123456789000:role\/<redshift-iam-role>`. |\n| forward\\_spark\\_s3\\_credentials | No | `false` | If `true`, the data source automatically discovers the credentials that Spark is using to connect to S3 and forwards those credentials to Redshift over JDBC. These credentials are sent as part of the JDBC query, so it is strongly recommended to enable SSL encryption of the JDBC connection when using this option. |\n| temporary\\_aws\\_access\\_key\\_id | No | None | AWS access key, must have write permissions to the S3 bucket. |\n| temporary\\_aws\\_secret\\_access\\_key | No | None | AWS secret access key corresponding to provided access key. |\n| temporary\\_aws\\_session\\_token | No | None | AWS session token corresponding to provided access key. |\n| tempdir | Yes | None | A writable location in Amazon S3, to be used for unloaded data when reading and Avro data to be loaded into Redshift when writing. 
If you\u2019re using Redshift data source for Spark as part of a regular ETL pipeline, it can be useful to set a [Lifecycle Policy](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/dev\/object-lifecycle-mgmt.html) on a bucket and use that as a temp location for this data. You cannot use [External locations defined in Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html) as `tempdir` locations. |\n| jdbcdriver | No | Determined by the JDBC URL\u2019s subprotocol. | The class name of the JDBC driver to use. This class must be on the classpath. In most cases, it should not be necessary to specify this option, as the appropriate driver class name should automatically be determined by the JDBC URL\u2019s subprotocol. |\n| diststyle | No | `EVEN` | The Redshift [Distribution Style](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/c_choosing_dist_sort.html) to be used when creating a table. Can be one of `EVEN`, `KEY` or `ALL` (see Redshift docs). When using `KEY`, you must also set a distribution key with the distkey option. |\n| distkey | No, unless using `DISTSTYLE KEY` | None | The name of a column in the table to use as the distribution key when creating a table. |\n| sortkeyspec | No | None | A full Redshift [Sort Key](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/t_Sorting_data.html) definition. Examples include:* `SORTKEY(my_sort_column)` * `COMPOUND SORTKEY(sort_col_1, sort_col_2)` * `INTERLEAVED SORTKEY(sort_col_1, sort_col_2)` |\n| usestagingtable (Deprecated) | No | `true` | Setting this deprecated option to `false` causes an overwrite operation\u2019s destination table to be dropped immediately at the beginning of the write, making the overwrite operation non-atomic and reducing the availability of the destination table. This may reduce the temporary disk space requirements for overwrites. 
Since setting `usestagingtable=false` operation risks data loss or unavailability, it is deprecated in favor of requiring you to manually drop the destination table. |\n| description | No | None | A description for the table. Will be set using the SQL COMMENT command, and should show up in most query tools. See also the `description` metadata to set descriptions on individual columns. |\n| preactions | No | None | A `;` separated list of SQL commands to be executed before loading `COPY` command. It may be useful to have some `DELETE` commands or similar run here before loading new data. If the command contains `%s`, the table name is formatted in before execution (in case you\u2019re using a staging table). Be warned that if these commands fail, it is treated as an error and an exception is thrown. If using a staging table, the changes are reverted and the backup table restored if pre actions fail. |\n| postactions | No | None | A `;` separated list of SQL commands to be executed after a successful `COPY` when loading data. It may be useful to have some `GRANT` commands or similar run here when loading new data. If the command contains `%s`, the table name is formatted in before execution (in case you\u2019re using a staging table). Be warned that if these commands fail, it is treated as an error and an exception is thrown. If using a staging table, the changes are reverted and the backup table restored if post actions fail. |\n| extracopyoptions | No | None | A list of extra options to append to the Redshift `COPY` command when loading data, for example, `TRUNCATECOLUMNS` or `MAXERROR n` (see the [Redshift docs](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/r_COPY.html#r_COPY-syntax-overview-optional-parameters) for other options). Since these options are appended to the end of the `COPY` command, only options that make sense at the end of the command can be used, but that should cover most possible use cases. 
|\n| tempformat | No | `AVRO` | The format in which to save temporary files in S3 when writing to Redshift. Defaults to `AVRO`; the other allowed values are `CSV` and `CSV GZIP` for CSV and gzipped CSV, respectively. Redshift is significantly faster when loading CSV than when loading Avro files, so using that tempformat may provide a large performance boost when writing to Redshift. |\n| csvnullstring | No | `@NULL@` | The String value to write for nulls when using the CSV tempformat. This should be a value that does not appear in your actual data. |\n| csvseparator | No | `,` | Separator to use when writing temporary files with tempformat set to `CSV` or `CSV GZIP`. This must be a valid ASCII character, for example, \u201c`,`\u201d or \u201c`|`\u201d. |\n| csvignoreleadingwhitespace | No | `true` | When set to true, removes leading whitespace from values during writes when `tempformat` is set to `CSV` or `CSV GZIP`. Otherwise, whitespace is retained. |\n| csvignoretrailingwhitespace | No | `true` | When set to true, removes trailing whitespace from values during writes when `tempformat` is set to `CSV` or `CSV GZIP`. Otherwise, the whitespace is retained. |\n| infer\\_timestamp\\_ntz\\_type | No | `false` | If `true`, values of type Redshift `TIMESTAMP` are interpreted as `TimestampNTZType` (timestamp without time zone) during reads. Otherwise, all timestamps are interpreted as `TimestampType` regardless of the type in the underlying Redshift table. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query Amazon Redshift using Databricks\n##### Additional configuration options\n\n### Configuring the maximum size of string columns \nWhen creating Redshift tables, the default behavior is to create `TEXT` columns for string columns. 
Redshift stores `TEXT` columns as `VARCHAR(256)`, so these columns have a maximum size of 256 characters ([source](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/r_Character_types.html)). \nTo support larger columns, you can use the `maxlength` column metadata field to specify the maximum length of individual string columns. This is also useful for implementing space-saving performance optimizations by declaring columns with a smaller maximum length than the default. \nNote \nDue to limitations in Spark, the SQL and R language APIs do not support column metadata modification. \n```\ndf = ... # the dataframe you'll want to write to Redshift\n\n# Specify the custom width of each column\ncolumnLengthMap = {\n\"language_code\": 2,\n\"country_code\": 2,\n\"url\": 2083,\n}\n\n# Apply each column metadata customization\nfor colName, length in columnLengthMap.items():\nmetadata = {'maxlength': length}\ndf = df.withColumn(colName, df[colName].alias(colName, metadata=metadata))\n\ndf.write \\\n.format(\"com.databricks.spark.redshift\") \\\n.option(\"url\", jdbcURL) \\\n.option(\"tempdir\", s3TempDirectory) \\\n.option(\"dbtable\", sessionTable) \\\n.save()\n\n``` \nHere is an example of updating multiple columns\u2019 metadata fields using Spark\u2019s Scala API: \n```\nimport org.apache.spark.sql.types.MetadataBuilder\n\n\/\/ Specify the custom width of each column\nval columnLengthMap = Map(\n\"language_code\" -> 2,\n\"country_code\" -> 2,\n\"url\" -> 2083\n)\n\nvar df = ... 
\/\/ the dataframe you'll want to write to Redshift\n\n\/\/ Apply each column metadata customization\ncolumnLengthMap.foreach { case (colName, length) =>\nval metadata = new MetadataBuilder().putLong(\"maxlength\", length).build()\ndf = df.withColumn(colName, df(colName).as(colName, metadata))\n}\n\ndf.write\n.format(\"com.databricks.spark.redshift\")\n.option(\"url\", jdbcURL)\n.option(\"tempdir\", s3TempDirectory)\n.option(\"dbtable\", sessionTable)\n.save()\n\n``` \n### Set a custom column type \nIf you need to manually set a column type, you can use the `redshift_type` column metadata. For example, if you desire to override the `Spark SQL Schema -> Redshift SQL` type matcher to assign a user-defined column type, you can do the following: \n```\n# Specify the custom type of each column\ncolumnTypeMap = {\n\"language_code\": \"CHAR(2)\",\n\"country_code\": \"CHAR(2)\",\n\"url\": \"BPCHAR(111)\",\n}\n\ndf = ... # the dataframe you'll want to write to Redshift\n\n# Apply each column metadata customization\nfor colName, colType in columnTypeMap.items():\nmetadata = {'redshift_type': colType}\ndf = df.withColumn(colName, df[colName].alias(colName, metadata=metadata))\n\n``` \n```\nimport org.apache.spark.sql.types.MetadataBuilder\n\n\/\/ Specify the custom type of each column\nval columnTypeMap = Map(\n\"language_code\" -> \"CHAR(2)\",\n\"country_code\" -> \"CHAR(2)\",\n\"url\" -> \"BPCHAR(111)\"\n)\n\nvar df = ... 
\/\/ the dataframe you'll want to write to Redshift\n\n\/\/ Apply each column metadata customization\ncolumnTypeMap.foreach { case (colName, colType) =>\nval metadata = new MetadataBuilder().putString(\"redshift_type\", colType).build()\ndf = df.withColumn(colName, df(colName).as(colName, metadata))\n}\n\n``` \n### Configure column encoding \nWhen creating a table, use the `encoding` column metadata field to specify a compression encoding for each column (see [Amazon docs](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/c_Compression_encodings.html) for available encodings). \n### Setting descriptions on columns \nRedshift allows columns to have descriptions attached that should show up in most query tools (using the `COMMENT` command). You can set the `description` column metadata field to specify a description for\nindividual columns. \n### Query pushdown into Redshift \nThe Spark optimizer pushes the following operators down into Redshift: \n* `Filter`\n* `Project`\n* `Sort`\n* `Limit`\n* `Aggregation`\n* `Join` \nWithin `Project` and `Filter`, it supports the following expressions: \n* Most Boolean logic operators\n* Comparisons\n* Basic arithmetic operations\n* Numeric and string casts\n* Most string functions\n* Scalar subqueries, if they can be pushed down entirely into Redshift. \nNote \nThis pushdown does not support expressions operating on dates and timestamps. \nWithin `Aggregation`, it supports the following aggregation functions: \n* `AVG`\n* `COUNT`\n* `MAX`\n* `MIN`\n* `SUM`\n* `STDDEV_SAMP`\n* `STDDEV_POP`\n* `VAR_SAMP`\n* `VAR_POP` \ncombined with the `DISTINCT` clause, where applicable. \nWithin `Join`, it supports the following types of joins: \n* `INNER JOIN`\n* `LEFT OUTER JOIN`\n* `RIGHT OUTER JOIN`\n* `LEFT SEMI JOIN`\n* `LEFT ANTI JOIN`\n* Subqueries that are rewritten into `Join` by the optimizer e.g. `WHERE EXISTS`, `WHERE NOT EXISTS` \nNote \nJoin pushdown does not support `FULL OUTER JOIN`. 
\nThe pushdown might be most beneficial in queries with `LIMIT`. A query such as `SELECT * FROM large_redshift_table LIMIT 10` could take very long, as the whole table would first be UNLOADed to S3 as an intermediate result. With pushdown, the `LIMIT` is executed in Redshift. In queries with aggregations, pushing the aggregation down into Redshift also helps to reduce the amount of data that needs to be transferred. \nQuery pushdown into Redshift is enabled by default. It can be disabled by setting `spark.databricks.redshift.pushdown` to `false`. Even when disabled, Spark still pushes down filters and performs column elimination into Redshift.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query Amazon Redshift using Databricks\n##### Redshift driver installation\n\nThe Redshift data source also requires a Redshift-compatible JDBC driver. Because Redshift is based on the PostgreSQL database system, you can use the PostgreSQL JDBC driver included with Databricks Runtime or the Amazon recommended Redshift JDBC driver. No installation is required to use the PostgreSQL JDBC driver. The version of the PostgreSQL JDBC driver included in each Databricks Runtime release is listed in the Databricks Runtime [release notes](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). \nTo manually install the Redshift JDBC driver: \n1. [Download](https:\/\/docs.aws.amazon.com\/redshift\/latest\/mgmt\/configure-jdbc-connection.html) the driver from Amazon.\n2. Upload the driver to your Databricks workspace. See [Libraries](https:\/\/docs.databricks.com\/libraries\/index.html).\n3. [Install](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html) the library on your cluster. \nNote \nDatabricks recommends using the latest version of the Redshift JDBC driver. 
Versions of the Redshift JDBC driver below 1.2.41 have the following limitations: \n* Version 1.2.16 of the driver returns empty data when using a `where` clause in an SQL query.\n* Versions of the driver below 1.2.41 may return invalid results because a column\u2019s nullability is incorrectly reported as \u201cNot Nullable\u201d instead of \u201cUnknown\u201d.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query Amazon Redshift using Databricks\n##### Transactional guarantees\n\nThis section describes the transactional guarantees of the Redshift data source for Spark. \n### General background on Redshift and S3 properties \nFor general information on Redshift transactional guarantees, see the [Managing Concurrent Write Operations](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/c_Concurrent_writes.html)\nchapter in the Redshift documentation. In a nutshell, Redshift provides [serializable isolation](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/c_serial_isolation.html) according to the documentation for the Redshift [BEGIN](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/r_BEGIN.html) command: \n> [although] you can use any of the four transaction isolation levels, Amazon Redshift processes all isolation levels as serializable. \nAccording to the [Redshift documentation](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/c_serial_isolation.html): \n> Amazon Redshift supports a default *automatic commit* behavior in which each separately-executed SQL command commits individually. \nThus, individual commands like `COPY` and `UNLOAD` are atomic and transactional, while explicit `BEGIN` and `END` should only be necessary to enforce the atomicity of multiple commands or queries. \nWhen reading from and writing to Redshift, the data source reads and writes data in S3. 
Both Spark and Redshift produce partitioned output and store it in multiple files in S3. According to the [Amazon S3 Data Consistency Model](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/dev\/Introduction.html#ConsistencyModel) documentation, S3 bucket listing operations are eventually-consistent, so the data source must go to special lengths to avoid missing or incomplete data caused by this eventual consistency. \n### Guarantees of the Redshift data source for Spark \n#### Append to an existing table \nWhen inserting rows into Redshift, the data source uses the [COPY](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/r_COPY.html)\ncommand and specifies [manifests](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/loading-data-files-using-manifest.html) to guard against certain eventually-consistent S3 operations. As a result, `spark-redshift` appends to existing tables have the same atomic and transactional properties as regular Redshift `COPY` commands. \n#### Create a new table (`SaveMode.CreateIfNotExists`) \nCreating a new table is a two-step process, consisting of a `CREATE TABLE` command followed by a [COPY](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/r_COPY.html) command to append the initial set of rows. Both operations are performed in the same transaction. \n#### Overwrite an existing table \nBy default, the data source uses transactions to perform overwrites, which are implemented by deleting the destination table, creating a new empty table, and appending rows to it. \nIf the deprecated `usestagingtable` setting is set to `false`, the data source commits the `DELETE TABLE` command before appending rows to the new table, sacrificing the atomicity of the overwrite operation but reducing the amount of staging space that Redshift needs during the overwrite. 
\n#### Query Redshift table \nQueries use the Redshift [UNLOAD](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/r_UNLOAD.html) command to execute a query and save its results to S3 and use [manifests](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/loading-data-files-using-manifest.html) to guard against certain eventually-consistent S3 operations. As a result, queries from Redshift data source for Spark should have the same consistency properties as regular Redshift queries.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query Amazon Redshift using Databricks\n##### Common problems and solutions\n\n### S3 bucket and Redshift cluster are in different AWS regions \nBy default, S3 <-> Redshift copies do not work if the S3 bucket and Redshift cluster are in different AWS regions. \nIf you attempt to read a Redshift table when the S3 bucket is in a different region, you may see an error such as: \n```\nERROR: S3ServiceException:The S3 bucket addressed by the query is in a different region from this cluster.,Status 301,Error PermanentRedirect.\n\n``` \nSimilarly, attempting to write to Redshift using a S3 bucket in a different region may cause the following error: \n```\nerror: Problem reading manifest file - S3ServiceException:The S3 bucket addressed by the query is in a different region from this cluster.,Status 301,Error PermanentRedirect\n\n``` \n* **Writes:** The Redshift [COPY](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/r_COPY.html) command supports explicit specification of the S3 bucket region, so you can make writes to Redshift work properly in these cases by adding `region 'the-region-name'` to the `extracopyoptions` setting. 
For example, with a bucket in the US East (Virginia) region and the Scala API, use: \n```\n.option(\"extracopyoptions\", \"region 'us-east-1'\")\n\n``` \nYou can alternatively use the `awsregion` setting: \n```\n.option(\"awsregion\", \"us-east-1\")\n\n```\n* **Reads:** The Redshift [UNLOAD](https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/r_UNLOAD.html) command also supports explicit specification of the S3 bucket region. You can make reads work properly by adding the region to the `awsregion` setting: \n```\n.option(\"awsregion\", \"us-east-1\")\n\n``` \n### Unexpected S3ServiceException credentials error when you use instance profiles to authenticate to S3 \nIf you are using [instance profiles](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html) to authenticate to S3 and receive an unexpected `S3ServiceException` error, check whether AWS access keys are specified in the `tempdir` S3 URI, in Hadoop configurations, or in any of the sources checked by the [DefaultAWSCredentialsProviderChain](https:\/\/docs.aws.amazon.com\/sdk-for-java\/v1\/developer-guide\/credentials.html): those sources take precedence over instance profile credentials. \nHere is a sample error message that can be a symptom of keys accidentally taking precedence over instance profiles: \n```\ncom.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. 
(Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId;\n\n``` \n### Authentication error when using a password with special characters in the JDBC url \nIf you are providing the username and password as part of the JDBC url and the password contains special characters such as `;`, `?`, or `&`, you might see the following exception: \n```\njava.sql.SQLException: [Amazon](500310) Invalid operation: password authentication failed for user 'xyz'\n\n``` \nThis is caused by special characters in the username or password not being escaped correctly by the JDBC driver. Make sure to specify the username and password using the corresponding DataFrame options `user` and `password`. For more information, see [Parameters](https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html#parameters). \n### Long-running Spark query hangs indefinitely even though the corresponding Redshift operation is done \nIf you are reading or writing large amounts of data from and to Redshift, your Spark query may hang indefinitely, even though the AWS Redshift Monitoring page shows that the corresponding `LOAD` or `UNLOAD` operation has completed and that the cluster is idle. This is caused by the connection between Redshift and Spark timing out. To avoid this, make sure the `tcpKeepAlive` JDBC flag is enabled and `TCPKeepAliveMinutes` is set to a low value (for example, 1). \nFor additional information, see [Amazon Redshift JDBC Driver Configuration](https:\/\/docs.aws.amazon.com\/redshift\/latest\/mgmt\/configure-jdbc-options.html). \n### Timestamp with timezone semantics \nWhen reading data, both Redshift `TIMESTAMP` and `TIMESTAMPTZ` data types are mapped to Spark `TimestampType`, and a value is converted to Coordinated Universal Time (UTC) and is stored as the UTC timestamp. For a Redshift `TIMESTAMP`, the local timezone is assumed as the value does not have any timezone information. 
When writing data to a Redshift table, a Spark `TimestampType` is mapped to the Redshift `TIMESTAMP` data type.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query Amazon Redshift using Databricks\n##### Migration guide\n\nThe data source now requires you to explicitly set `forward_spark_s3_credentials` before Spark S3 credentials are forwarded to Redshift. This change has no impact if you use the `aws_iam_role` or `temporary_aws_*` authentication mechanisms. However, if you relied on the old default behavior you must now explicitly set `forward_spark_s3_credentials` to `true` to continue using your previous Redshift to S3 authentication mechanism. For a discussion of the three authentication mechanisms and their security trade-offs, see the [Authenticating to S3 and Redshift](https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html#redshift-aws-credentials) section of this document.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html"} +{"content":"# Develop on Databricks\n### What are user-defined functions (UDFs)?\n\nA user-defined function (UDF) is a function defined by a user, allowing custom logic to be reused in the user environment. Databricks has support for many different types of UDFs to allow for distributing extensible logic. This article introduces some of the general strengths and limitations of UDFs. \nNote \nNot all forms of UDFs are available in all execution environments on Databricks. If you are working with Unity Catalog, see [User-defined functions (UDFs) in Unity Catalog](https:\/\/docs.databricks.com\/udf\/unity-catalog.html). 
\nSee the following articles for more information on UDFs: \n* [User-defined functions (UDFs) in Unity Catalog](https:\/\/docs.databricks.com\/udf\/unity-catalog.html)\n* [pandas user-defined functions](https:\/\/docs.databricks.com\/udf\/pandas.html)\n* [User-defined scalar functions - Python](https:\/\/docs.databricks.com\/udf\/python.html)\n* [What are Python user-defined table functions?](https:\/\/docs.databricks.com\/udf\/python-udtf.html)\n* [User-defined scalar functions - Scala](https:\/\/docs.databricks.com\/udf\/scala.html)\n* [User-defined aggregate functions - Scala](https:\/\/docs.databricks.com\/udf\/aggregate-scala.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/index.html"} +{"content":"# Develop on Databricks\n### What are user-defined functions (UDFs)?\n#### Defining custom logic without serialization penalties\n\nDatabricks inherits much of its UDF behaviors from Apache Spark, including the efficiency limitations around many types of UDFs. See [Which UDFs are most efficient?](https:\/\/docs.databricks.com\/udf\/index.html#udf-efficiency). \nYou can safely modularize your code without worrying about potential efficiency tradeoffs associated with UDFs. To do so, you must define your logic as a series of Spark built-in methods using SQL or Spark DataFrames. 
For example, the following SQL and Python functions combine Spark built-in methods to define a unit conversion as a reusable function: \n```\nCREATE FUNCTION convert_f_to_c(unit STRING, temp DOUBLE)\nRETURNS DOUBLE\nRETURN CASE\nWHEN unit = \"F\" THEN (temp - 32) * (5\/9)\nELSE temp\nEND;\n\nSELECT convert_f_to_c(unit, temp) AS c_temp\nFROM tv_temp;\n\n``` \n```\ndef convertFtoC(unitCol, tempCol):\nfrom pyspark.sql.functions import when\nreturn when(unitCol == \"F\", (tempCol - 32) * (5\/9)).otherwise(tempCol)\n\nfrom pyspark.sql.functions import col\n\ndf_query = df.select(convertFtoC(col(\"unit\"), col(\"temp\"))).toDF(\"c_temp\")\ndisplay(df_query)\n\n``` \nTo run the above UDFs, you can create [example data](https:\/\/docs.databricks.com\/udf\/index.html#example-data).\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/index.html"} +{"content":"# Develop on Databricks\n### What are user-defined functions (UDFs)?\n#### Which UDFs are most efficient?\n\nUDFs might introduce significant processing bottlenecks into code execution. Databricks uses a number of different optimizers automatically for code written with included Apache Spark, SQL, and Delta Lake syntax. When custom logic is introduced by UDFs, these optimizers do not have the ability to efficiently plan tasks around this custom logic. In addition, logic that executes outside the JVM has additional costs around data serialization. \nNote \nDatabricks optimizes many functions using Photon if you use Photon-enabled compute. Only functions that chain together Spark SQL or DataFrame commands can be optimized by Photon. \nSome UDFs are more efficient than others. 
In terms of performance: \n* Built-in functions are fastest because of Databricks optimizers.\n* Code that executes in the JVM (Scala, Java, Hive UDFs) is faster than Python UDFs.\n* Pandas UDFs use Arrow to reduce serialization costs associated with Python UDFs.\n* Python UDFs work well for procedural logic, but should be avoided for production ETL workloads on large datasets. \nNote \nIn Databricks Runtime 12.2 LTS and below, Python scalar UDFs and Pandas UDFs are not supported in Unity Catalog on clusters that use shared access mode. These UDFs are supported in Databricks Runtime 13.3 LTS and above for all access modes. \nIn Databricks Runtime 14.1 and below, Scala scalar UDFs are not supported in Unity Catalog on clusters that use shared access mode. These UDFs are supported for all access modes in Databricks Runtime 14.2 and above. \nIn Databricks Runtime 13.3 LTS and above, you can register scalar Python UDFs to Unity Catalog using SQL syntax. See [User-defined functions (UDFs) in Unity Catalog](https:\/\/docs.databricks.com\/udf\/unity-catalog.html). \n| Type | Optimized | Execution environment |\n| --- | --- | --- |\n| Hive UDF | No | JVM |\n| Python UDF | No | Python |\n| Pandas UDF | No | Python (Arrow) |\n| Scala UDF | No | JVM |\n| Spark SQL | Yes | JVM (Photon) |\n| Spark DataFrame | Yes | JVM (Photon) |\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/index.html"} +{"content":"# Develop on Databricks\n### What are user-defined functions (UDFs)?\n#### When should you use a UDF?\n\nA major benefit of UDFs is that they allow users to express logic in familiar languages, reducing the human cost associated with refactoring code. For ad hoc queries, manual data cleansing, exploratory data analysis, and most operations on small or medium-sized datasets, latency overhead costs associated with UDFs are unlikely to outweigh costs associated with refactoring code. 
\nFor ETL jobs, streaming operations, operations on very large datasets, or other workloads that are executed regularly or continuously, refactoring logic to use native Apache Spark methods quickly pays dividends.\n\n### What are user-defined functions (UDFs)?\n#### Example data for example UDFs\n\nThe code examples in this article use UDFs to convert temperatures between Celsius and Fahrenheit. If you wish to execute these functions, you can create a sample dataset with the following Python code: \n```\nimport numpy as np\nimport pandas as pd\n\nFdf = pd.DataFrame(np.random.normal(55, 25, 10000000), columns=[\"temp\"])\nFdf[\"unit\"] = \"F\"\n\nCdf = pd.DataFrame(np.random.normal(10, 10, 10000000), columns=[\"temp\"])\nCdf[\"unit\"] = \"C\"\n\ndf = spark.createDataFrame(pd.concat([Fdf, Cdf]).sample(frac=1))\n\ndf.cache().count()\ndf.createOrReplaceTempView(\"tv_temp\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use features to train models\n\nThis article describes how you can train models using Feature Engineering in Unity Catalog or the local Workspace Feature Store. You must first create a training dataset, which defines the features to use and how to join them. Then, when you train a model, the model retains references to the features. \nWhen you train a model using Feature Engineering in Unity Catalog, you can view the model\u2019s lineage in Catalog Explorer. Tables and functions that were used to create the model are automatically tracked and displayed. See [View feature store lineage](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/lineage.html). \nWhen you use the model for inference, you can choose to have it retrieve feature values from the feature store. 
You can also serve the model with [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html), which automatically looks up features [published to online stores](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-feature-stores.html). Feature store models are also compatible with the [MLflow pyfunc interface](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.pyfunc.html), so you can use MLflow to perform batch inference with feature tables. \nIf your model uses environment variables, learn more about how to use them when serving the model online at [Configure access to resources from model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/store-env-variable-model-serving.html). \nA model can use at most 50 tables and 100 functions for training.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/train-models-with-feature-store.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use features to train models\n##### Create a training dataset\n\nTo select specific features from a feature table for model training, you create a training dataset using the `FeatureEngineeringClient.create_training_set` (for Feature Engineering in Unity Catalog) or `FeatureStoreClient.create_training_set` (for Workspace Feature Store) API and an object called a `FeatureLookup`. A `FeatureLookup` specifies each feature to use in the training set, including the name of the feature table, the name(s) of the features, and the key(s) to use when joining the feature table with the DataFrame passed to `create_training_set`. See [Feature Lookup](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/concepts.html#feature-lookup) for more information. 
\nUse the `feature_names` parameter when you create a `FeatureLookup`.\n`feature_names` takes a single feature name, a list of feature names, or None to look up all features (excluding primary keys) in the feature table at the time that the training set is created. \nNote \nThe type and order of `lookup_key` columns in the DataFrame must match the type and order of the primary keys (excluding timestamp keys) of the reference feature table. \nThis article includes code examples for both versions of the syntax. \nIn this example, the DataFrame returned by `trainingSet.load_df` contains a column for each feature in `feature_lookups`. It preserves all columns of the DataFrame provided to `create_training_set` except those excluded using `exclude_columns`. \n```\nfrom databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup\n\n# The model training uses two features from the 'customer_features' feature table and\n# a single feature from 'product_features'\nfeature_lookups = [\nFeatureLookup(\ntable_name='ml.recommender_system.customer_features',\nfeature_names=['total_purchases_30d', 'total_purchases_7d'],\nlookup_key='customer_id'\n),\nFeatureLookup(\ntable_name='ml.recommender_system.product_features',\nfeature_names=['category'],\nlookup_key='product_id'\n)\n]\n\nfe = FeatureEngineeringClient()\n\n# Create a training set using training DataFrame and features from Feature Store\n# The training DataFrame must contain all lookup keys from the set of feature lookups,\n# in this case 'customer_id' and 'product_id'. 
It must also contain all labels used\n# for training, in this case 'rating'.\ntraining_set = fe.create_training_set(\ndf=training_df,\nfeature_lookups=feature_lookups,\nlabel='rating',\nexclude_columns=['customer_id', 'product_id']\n)\n\ntraining_df = training_set.load_df()\n\n``` \n```\nfrom databricks.feature_store import FeatureLookup, FeatureStoreClient\n\n# The model training uses two features from the 'customer_features' feature table and\n# a single feature from 'product_features'\nfeature_lookups = [\nFeatureLookup(\ntable_name='recommender_system.customer_features',\nfeature_names=['total_purchases_30d', 'total_purchases_7d'],\nlookup_key='customer_id'\n),\nFeatureLookup(\ntable_name='recommender_system.product_features',\nfeature_names=['category'],\nlookup_key='product_id'\n)\n]\n\nfs = FeatureStoreClient()\n\n# Create a training set using training DataFrame and features from Feature Store\n# The training DataFrame must contain all lookup keys from the set of feature lookups,\n# in this case 'customer_id' and 'product_id'. It must also contain all labels used\n# for training, in this case 'rating'.\ntraining_set = fs.create_training_set(\ndf=training_df,\nfeature_lookups=feature_lookups,\nlabel='rating',\nexclude_columns=['customer_id', 'product_id']\n)\n\ntraining_df = training_set.load_df()\n\n``` \n### Create a TrainingSet when lookup keys do not match the primary keys \nUse the argument `lookup_key` in the `FeatureLookup` for the column name in the training set. `create_training_set` performs an ordered join between the columns from the training set specified in the `lookup_key` argument using the order in which the primary keys were specified when the feature table was created. \nIn this example, `recommender_system.customer_features` has the following primary keys: `customer_id`, `dt`. \nThe `recommender_system.product_features` feature table has primary key `product_id`. 
\nIf the `training_df` has the following columns: \n* `cid`\n* `transaction_dt`\n* `product_id`\n* `rating` \nthe following code will create the correct feature lookups for the `TrainingSet`: \n```\nfeature_lookups = [\nFeatureLookup(\ntable_name='ml.recommender_system.customer_features',\nfeature_names=['total_purchases_30d', 'total_purchases_7d'],\nlookup_key=['cid', 'transaction_dt']\n),\nFeatureLookup(\ntable_name='ml.recommender_system.product_features',\nfeature_names=['category'],\nlookup_key='product_id'\n)\n]\n\n``` \n```\nfeature_lookups = [\nFeatureLookup(\ntable_name='recommender_system.customer_features',\nfeature_names=['total_purchases_30d', 'total_purchases_7d'],\nlookup_key=['cid', 'transaction_dt']\n),\nFeatureLookup(\ntable_name='recommender_system.product_features',\nfeature_names=['category'],\nlookup_key='product_id'\n)\n]\n\n``` \nWhen `create_training_set` is called, it creates a training dataset by performing a left join, joining the tables `recommender_system.customer_features` and `training_df` using the keys (`customer_id`,`dt`) corresponding to (`cid`,`transaction_dt`), as shown in the following code: \n```\ncustomer_features_df = spark.sql(\"SELECT * FROM ml.recommender_system.customer_features\")\nproduct_features_df = spark.sql(\"SELECT * FROM ml.recommender_system.product_features\")\n\ntraining_df.join(\ncustomer_features_df,\non=[training_df.cid == customer_features_df.customer_id,\ntraining_df.transaction_dt == customer_features_df.dt],\nhow=\"left\"\n).join(\nproduct_features_df,\non=\"product_id\",\nhow=\"left\"\n)\n\n``` \n```\ncustomer_features_df = spark.sql(\"SELECT * FROM recommender_system.customer_features\")\nproduct_features_df = spark.sql(\"SELECT * FROM recommender_system.product_features\")\n\ntraining_df.join(\ncustomer_features_df,\non=[training_df.cid == customer_features_df.customer_id,\ntraining_df.transaction_dt == 
customer_features_df.dt],\nhow=\"left\"\n).join(\nproduct_features_df,\non=\"product_id\",\nhow=\"left\"\n)\n\n``` \n### Create a TrainingSet containing two features with the same name from different feature tables \nUse the optional argument `output_name` in the `FeatureLookup`. The name provided is used in place of the feature name in the DataFrame returned by `TrainingSet.load_df`. For example, with the following code, the DataFrame returned by `training_set.load_df` includes columns `customer_height` and `product_height`. \n```\nfeature_lookups = [\nFeatureLookup(\ntable_name='ml.recommender_system.customer_features',\nfeature_names=['height'],\nlookup_key='customer_id',\noutput_name='customer_height',\n),\nFeatureLookup(\ntable_name='ml.recommender_system.product_features',\nfeature_names=['height'],\nlookup_key='product_id',\noutput_name='product_height'\n),\n]\n\nfe = FeatureEngineeringClient()\n\nwith mlflow.start_run():\ntraining_set = fe.create_training_set(\ndf=df,\nfeature_lookups=feature_lookups,\nlabel='rating',\nexclude_columns=['customer_id']\n)\ntraining_df = training_set.load_df()\n\n``` \n```\nfeature_lookups = [\nFeatureLookup(\ntable_name='recommender_system.customer_features',\nfeature_names=['height'],\nlookup_key='customer_id',\noutput_name='customer_height',\n),\nFeatureLookup(\ntable_name='recommender_system.product_features',\nfeature_names=['height'],\nlookup_key='product_id',\noutput_name='product_height'\n),\n]\n\nfs = FeatureStoreClient()\n\nwith mlflow.start_run():\ntraining_set = fs.create_training_set(\ndf=df,\nfeature_lookups=feature_lookups,\nlabel='rating',\nexclude_columns=['customer_id']\n)\ntraining_df = training_set.load_df()\n\n``` \n### Create a TrainingSet using the same feature multiple times \nTo create a TrainingSet using the same feature joined by different lookup keys, use multiple FeatureLookups.\nUse a unique `output_name` for each FeatureLookup output. 
\n```\nfeature_lookups = [\nFeatureLookup(\ntable_name='ml.taxi_data.zip_features',\nfeature_names=['temperature'],\nlookup_key=['pickup_zip'],\noutput_name='pickup_temp'\n),\nFeatureLookup(\ntable_name='ml.taxi_data.zip_features',\nfeature_names=['temperature'],\nlookup_key=['dropoff_zip'],\noutput_name='dropoff_temp'\n)\n]\n\n``` \n```\nfeature_lookups = [\nFeatureLookup(\ntable_name='taxi_data.zip_features',\nfeature_names=['temperature'],\nlookup_key=['pickup_zip'],\noutput_name='pickup_temp'\n),\nFeatureLookup(\ntable_name='taxi_data.zip_features',\nfeature_names=['temperature'],\nlookup_key=['dropoff_zip'],\noutput_name='dropoff_temp'\n)\n]\n\n``` \n### Create a TrainingSet for unsupervised machine learning models \nSet `label=None` when creating a TrainingSet for unsupervised learning models. For example, the following TrainingSet can\nbe used to cluster different customers into groups based on their interests: \n```\nfeature_lookups = [\nFeatureLookup(\ntable_name='ml.recommender_system.customer_features',\nfeature_names=['interests'],\nlookup_key='customer_id',\n),\n]\n\nfe = FeatureEngineeringClient()\nwith mlflow.start_run():\ntraining_set = fe.create_training_set(\ndf=df,\nfeature_lookups=feature_lookups,\nlabel=None,\nexclude_columns=['customer_id']\n)\n\ntraining_df = training_set.load_df()\n\n``` \n```\nfeature_lookups = [\nFeatureLookup(\ntable_name='recommender_system.customer_features',\nfeature_names=['interests'],\nlookup_key='customer_id',\n),\n]\n\nfs = FeatureStoreClient()\nwith mlflow.start_run():\ntraining_set = fs.create_training_set(\ndf=df,\nfeature_lookups=feature_lookups,\nlabel=None,\nexclude_columns=['customer_id']\n)\n\ntraining_df = training_set.load_df()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/train-models-with-feature-store.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use features to train models\n##### Train models and perform batch 
inference with feature tables\n\nWhen you train a model using features from Feature Store, the model retains references to the features. When you use the model for inference, you can choose to have it retrieve feature values from Feature Store. You must provide the primary key(s) of the features used in the model. The model retrieves the features it requires from Feature Store in your workspace. It then joins the feature values as needed during scoring. \nTo support feature lookup at inference time: \n* You must log the model using the `log_model` method of `FeatureEngineeringClient` (for Feature Engineering in Unity Catalog) or `FeatureStoreClient` (for Workspace Feature Store).\n* You must use the DataFrame returned by `TrainingSet.load_df` to train the model. If you modify this DataFrame in any way before using it to train the model, the modifications are not applied when you use the model for inference. This decreases the performance of the model.\n* The model type must have a corresponding `python_flavor` in MLflow. 
MLflow supports most Python model training frameworks, including: \n+ scikit-learn\n+ keras\n+ PyTorch\n+ SparkML\n+ LightGBM\n+ XGBoost\n+ TensorFlow Keras (using the `python_flavor` `mlflow.keras`)\n* Custom MLflow pyfunc models \n```\n# Train model\nimport mlflow\nfrom databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup\nfrom sklearn import linear_model\n\nfeature_lookups = [\n    FeatureLookup(\n        table_name='ml.recommender_system.customer_features',\n        feature_names=['total_purchases_30d'],\n        lookup_key='customer_id',\n    ),\n    FeatureLookup(\n        table_name='ml.recommender_system.product_features',\n        feature_names=['category'],\n        lookup_key='product_id'\n    )\n]\n\nfe = FeatureEngineeringClient()\n\nwith mlflow.start_run():\n\n    # df has columns ['customer_id', 'product_id', 'rating']\n    training_set = fe.create_training_set(\n        df=df,\n        feature_lookups=feature_lookups,\n        label='rating',\n        exclude_columns=['customer_id', 'product_id']\n    )\n\n    training_df = training_set.load_df().toPandas()\n\n    # \"training_df\" columns ['total_purchases_30d', 'category', 'rating']\n    X_train = training_df.drop(['rating'], axis=1)\n    y_train = training_df.rating\n\n    model = linear_model.LinearRegression().fit(X_train, y_train)\n\n    fe.log_model(\n        model=model,\n        artifact_path=\"recommendation_model\",\n        flavor=mlflow.sklearn,\n        training_set=training_set,\n        registered_model_name=\"recommendation_model\"\n    )\n\n# Batch inference\n\n# If the model at model_uri is packaged with the features, the FeatureEngineeringClient.score_batch()\n# call automatically retrieves the required features from Feature Store before scoring the model.\n# The DataFrame returned by score_batch() augments batch_df with\n# columns containing the feature values and a column containing model predictions.\n\nfe = FeatureEngineeringClient()\n\n# batch_df has columns \u2018customer_id\u2019 and \u2018product_id\u2019\npredictions = fe.score_batch(\n    model_uri=model_uri,\n    df=batch_df\n)\n\n# The \u2018predictions\u2019 DataFrame has these columns:\n# \u2018customer_id\u2019, \u2018product_id\u2019, 
\u2018total_purchases_30d\u2019, \u2018category\u2019, \u2018prediction\u2019\n\n``` \n```\n# Train model\nimport mlflow\nfrom sklearn import linear_model\n\nfeature_lookups = [\nFeatureLookup(\ntable_name='recommender_system.customer_features',\nfeature_names=['total_purchases_30d'],\nlookup_key='customer_id',\n),\nFeatureLookup(\ntable_name='recommender_system.product_features',\nfeature_names=['category'],\nlookup_key='product_id'\n)\n]\n\nfs = FeatureStoreClient()\n\nwith mlflow.start_run():\n\n# df has columns ['customer_id', 'product_id', 'rating']\ntraining_set = fs.create_training_set(\ndf=df,\nfeature_lookups=feature_lookups,\nlabel='rating',\nexclude_columns=['customer_id', 'product_id']\n)\n\ntraining_df = training_set.load_df().toPandas()\n\n# \"training_df\" columns ['total_purchases_30d', 'category', 'rating']\nX_train = training_df.drop(['rating'], axis=1)\ny_train = training_df.rating\n\nmodel = linear_model.LinearRegression().fit(X_train, y_train)\n\nfs.log_model(\nmodel=model,\nartifact_path=\"recommendation_model\",\nflavor=mlflow.sklearn,\ntraining_set=training_set,\nregistered_model_name=\"recommendation_model\"\n)\n\n# Batch inference\n\n# If the model at model_uri is packaged with the features, the FeatureStoreClient.score_batch()\n# call automatically retrieves the required features from Feature Store before scoring the model.\n# The DataFrame returned by score_batch() augments batch_df with\n# columns containing the feature values and a column containing model predictions.\n\nfs = FeatureStoreClient()\n\n# batch_df has columns \u2018customer_id\u2019 and \u2018product_id\u2019\npredictions = fs.score_batch(\nmodel_uri=model_uri,\ndf=batch_df\n)\n\n# The \u2018predictions\u2019 DataFrame has these columns:\n# \u2018customer_id\u2019, \u2018product_id\u2019, \u2018total_purchases_30d\u2019, \u2018category\u2019, \u2018prediction\u2019\n\n``` \n### Use custom feature values when scoring a model packaged with feature metadata \nBy default, a 
model packaged with feature metadata looks up features from feature tables at inference. To use custom feature values for scoring, include them in the DataFrame passed to `FeatureEngineeringClient.score_batch` (for Feature Engineering in Unity Catalog) or `FeatureStoreClient.score_batch` (for Workspace Feature Store). \nFor example, suppose you package a model with these two features: \n```\nfeature_lookups = [\nFeatureLookup(\ntable_name='ml.recommender_system.customer_features',\nfeature_names=['account_creation_date', 'num_lifetime_purchases'],\nlookup_key='customer_id',\n),\n]\n\n``` \n```\nfeature_lookups = [\nFeatureLookup(\ntable_name='recommender_system.customer_features',\nfeature_names=['account_creation_date', 'num_lifetime_purchases'],\nlookup_key='customer_id',\n),\n]\n\n``` \nAt inference, you can provide custom values for the feature `account_creation_date` by calling `score_batch` on a DataFrame that includes a column named `account_creation_date`. In this case the API looks up only the `num_lifetime_purchases` feature from Feature Store and uses the provided custom `account_creation_date` column values for model scoring. \n```\n# batch_df has columns ['customer_id', 'account_creation_date']\npredictions = fe.score_batch(\nmodel_uri='models:\/ban_prediction_model\/1',\ndf=batch_df\n)\n\n``` \n```\n# batch_df has columns ['customer_id', 'account_creation_date']\npredictions = fs.score_batch(\nmodel_uri='models:\/ban_prediction_model\/1',\ndf=batch_df\n)\n\n``` \n### Train and score a model using a combination of Feature Store features and data residing outside Feature Store \nYou can train a model using a combination of Feature Store features and data from outside Feature Store. When you package the model with feature metadata, the model retrieves feature values from Feature Store for inference. 
\nTo train a model, include the extra data as columns in the DataFrame passed to `FeatureEngineeringClient.create_training_set` (for Feature Engineering in Unity Catalog) or `FeatureStoreClient.create_training_set` (for Workspace Feature Store). This example uses the feature `total_purchases_30d` from Feature Store and the external column `browser`. \n```\nfeature_lookups = [\nFeatureLookup(\ntable_name='ml.recommender_system.customer_features',\nfeature_names=['total_purchases_30d'],\nlookup_key='customer_id',\n),\n]\n\nfe = FeatureEngineeringClient()\n\n# df has columns ['customer_id', 'browser', 'rating']\ntraining_set = fe.create_training_set(\ndf=df,\nfeature_lookups=feature_lookups,\nlabel='rating',\nexclude_columns=['customer_id'] # 'browser' is not excluded\n)\n\n``` \n```\nfeature_lookups = [\nFeatureLookup(\ntable_name='recommender_system.customer_features',\nfeature_names=['total_purchases_30d'],\nlookup_key='customer_id',\n),\n]\n\nfs = FeatureStoreClient()\n\n# df has columns ['customer_id', 'browser', 'rating']\ntraining_set = fs.create_training_set(\ndf=df,\nfeature_lookups=feature_lookups,\nlabel='rating',\nexclude_columns=['customer_id'] # 'browser' is not excluded\n)\n\n``` \nAt inference, the DataFrame used in `FeatureStoreClient.score_batch` must include the `browser` column. \n```\n# At inference, 'browser' must be provided\n# batch_df has columns ['customer_id', 'browser']\npredictions = fe.score_batch(\nmodel_uri=model_uri,\ndf=batch_df\n)\n\n``` \n```\n# At inference, 'browser' must be provided\n# batch_df has columns ['customer_id', 'browser']\npredictions = fs.score_batch(\nmodel_uri=model_uri,\ndf=batch_df\n)\n\n``` \n### Load models and perform batch inference using MLflow \nAfter a model has been logged using the `log_model` method of `FeatureEngineeringClient` (for Feature Engineering in Unity Catalog) or `FeatureStoreClient` (for Workspace Feature Store), MLflow can be used at inference. 
The `predict` method of the loaded `mlflow.pyfunc` model retrieves feature values from Feature Store and also joins any values provided at inference time. You must provide the primary key(s) of the features used in the model. \nNote \nBatch inference with MLflow requires MLflow version 2.11 or above. \n```\n# Train model\nimport mlflow\nfrom databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup\nfrom sklearn import linear_model\n\nfeature_lookups = [\n    FeatureLookup(\n        table_name='ml.recommender_system.customer_features',\n        feature_names=['total_purchases_30d'],\n        lookup_key='customer_id',\n    ),\n    FeatureLookup(\n        table_name='ml.recommender_system.product_features',\n        feature_names=['category'],\n        lookup_key='product_id'\n    )\n]\n\nfe = FeatureEngineeringClient()\n\nwith mlflow.start_run():\n\n    # df has columns ['customer_id', 'product_id', 'rating']\n    training_set = fe.create_training_set(\n        df=df,\n        feature_lookups=feature_lookups,\n        label='rating',\n        exclude_columns=['customer_id', 'product_id']\n    )\n\n    training_df = training_set.load_df().toPandas()\n\n    # \"training_df\" columns ['total_purchases_30d', 'category', 'rating']\n    X_train = training_df.drop(['rating'], axis=1)\n    y_train = training_df.rating\n\n    model = linear_model.LinearRegression().fit(X_train, y_train)\n\n    fe.log_model(\n        model=model,\n        artifact_path=\"recommendation_model\",\n        flavor=mlflow.sklearn,\n        training_set=training_set,\n        registered_model_name=\"recommendation_model\",\n        # Default value of \"result_type\", used if it is not provided at inference\n        params={\"result_type\":\"double\"},\n    )\n\n# Batch inference with MLflow\n\n# NOTE: the result_type parameter can only be used if a default value\n# is provided in log_model. 
This is automatically done for all models\n# logged using Databricks Runtime for ML 15.0 or above.\n# For earlier Databricks Runtime versions, use set_result_type as shown below.\n\n# batch_df has columns \u2018customer_id\u2019 and \u2018product_id\u2019\nmodel = mlflow.pyfunc.load_model(model_version_uri)\n\n# If result_type parameter is provided in log_model\npredictions = model.predict(batch_df, {\"result_type\":\"double\"})\n\n# If result_type parameter is NOT provided in log_model\nmodel._model_impl.set_result_type(\"double\")\npredictions = model.predict(batch_df)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/train-models-with-feature-store.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Enhance and cleanse data\n\nThis get started article walks you through using a Databricks notebook with Python, Scala, or R to cleanse and enhance the New York State baby name data that was previously loaded into a table in Unity Catalog. In this article, you change column names, change capitalization, and spell out the sex of each baby name from the raw data table, and then save the DataFrame into a silver table. Then you filter the data to only include data for 2021, group the data at the state level, and then sort the data by count. Finally, you save this DataFrame into a gold table and visualize the data in a bar chart. For more information on silver and gold tables, see [medallion architecture](https:\/\/docs.databricks.com\/lakehouse\/medallion.html). \nImportant \nThis get started article builds on [Get started: Ingest and insert additional data](https:\/\/docs.databricks.com\/getting-started\/ingest-insert-additional-data.html). You must complete the steps in that article before you start this article. 
For the complete notebook for that getting started article, see [Ingest additional data notebooks](https:\/\/docs.databricks.com\/getting-started\/ingest-insert-additional-data.html#notebook).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/cleanse-enhance-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Enhance and cleanse data\n#### Requirements\n\nTo complete the tasks in this article, you must meet the following requirements: \n* Your workspace must have [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) enabled. For information on getting started with Unity Catalog, see [Set up and manage Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html).\n* You must have permission to use an existing compute resource or create a new compute resource. See [Get started: Account and workspace setup](https:\/\/docs.databricks.com\/getting-started\/index.html) or see your Databricks administrator. \nTip \nFor a completed notebook for this article, see [Cleanse and enhance data notebooks](https:\/\/docs.databricks.com\/getting-started\/cleanse-enhance-data.html#notebook).\n\n### Get started: Enhance and cleanse data\n#### Step 1: Create a new notebook\n\nTo create a notebook in your workspace: \n1. Click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar, and then click **Notebook**.\n2. On the Create Notebook page: \n* Specify a unique name for your notebook.\n* Set the default language for your notebook and then click **Confirm** if prompted.\n* Click **Connect** and select a compute resource. To create a new compute resource, see [Use compute](https:\/\/docs.databricks.com\/compute\/use-compute.html). 
\nTo learn more about creating and managing notebooks, see [Manage notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/cleanse-enhance-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Enhance and cleanse data\n#### Step 2: Define variables\n\nIn this step, you define variables for use in the example notebook you create in this article. \n1. Copy and paste the following code into the new empty notebook cell. Replace `<catalog_name>`, `<schema_name>`, and (in the R example) `<volume_name>` with the catalog, schema, and volume names for a Unity Catalog volume. The `table_name` variable names the table that holds the raw baby name data. You save the cleansed and aggregated data into the silver and gold tables later in this article.\n2. Press `Shift+Enter` to run the cell and create a new blank cell. \n```\ncatalog = \"<catalog_name>\"\nschema = \"<schema_name>\"\ntable_name = \"baby_names\"\nsilver_table_name = \"baby_names_prepared\"\ngold_table_name = \"top_baby_names_2021\"\npath_table = catalog + \".\" + schema\nprint(path_table) # Show the complete path\n\n``` \n```\nval catalog = \"<catalog_name>\"\nval schema = \"<schema_name>\"\nval tableName = \"baby_names\"\nval silverTableName = \"baby_names_prepared\"\nval goldTableName = \"top_baby_names_2021\"\nval pathTable = s\"${catalog}.${schema}\"\nprint(pathTable) \/\/ Show the complete path\n\n``` \n```\ncatalog <- \"<catalog_name>\"\nschema <- \"<schema_name>\"\nvolume <- \"<volume_name>\"\ntable_name <- \"baby_names\"\nsilver_table_name <- \"baby_names_prepared\"\ngold_table_name <- \"top_baby_names_2021\"\npath_table <- paste(catalog, \".\", schema, sep = \"\")\nprint(path_table) # Show the complete path\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/cleanse-enhance-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Enhance and cleanse data\n#### Step 3: Load the raw data into a new 
DataFrame\n\nThis step loads the raw data previously saved into a Delta table into a new DataFrame in preparation for cleansing and enhancing this data for further analysis. \n1. Copy and paste the following code into the new empty notebook cell. \n```\ndf_raw = spark.read.table(f\"{path_table}.{table_name}\")\ndisplay(df_raw)\n\n``` \n```\nval dfRaw = spark.read.table(s\"${pathTable}.${tableName}\")\ndisplay(dfRaw)\n\n``` \n```\n# Load the SparkR package that is already preinstalled on the cluster.\nlibrary(SparkR)\ndf_raw = sql(paste0(\"SELECT * FROM \", path_table, \".\", table_name))\ndisplay(df_raw)\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/cleanse-enhance-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Enhance and cleanse data\n#### Step 4: Cleanse and enhance raw data and save\n\nIn this step, you change the name of the `Year` column, change the data in the `First_Name` column to initial capitals, spell out the values in the `Sex` column, and then save the DataFrame to a new table. \n1. Copy and paste the following code into an empty notebook cell. 
\n```\nfrom pyspark.sql.functions import col, initcap, when\n\n# Rename \"Year\" column to \"Year_Of_Birth\"\ndf_rename_year = df_raw.withColumnRenamed(\"Year\", \"Year_Of_Birth\")\n\n# Change the case of \"First_Name\" data to initial caps\ndf_init_caps = df_rename_year.withColumn(\"First_Name\", initcap(col(\"First_Name\").cast(\"string\")))\n\n# Update column values from \"M\" to \"Male\" and \"F\" to \"Female\"\ndf_baby_names_sex = df_init_caps.withColumn(\n    \"Sex\",\n    when(col(\"Sex\") == \"M\", \"Male\")\n    .when(col(\"Sex\") == \"F\", \"Female\")\n)\n\n# Display the data\ndisplay(df_baby_names_sex)\n\n# Save DataFrame to a table\ndf_baby_names_sex.write.mode(\"overwrite\").saveAsTable(f\"{path_table}.{silver_table_name}\")\n\n``` \n```\nimport org.apache.spark.sql.functions.{col, initcap, when}\n\n\/\/ Rename \"Year\" column to \"Year_Of_Birth\"\nval dfRenameYear = dfRaw.withColumnRenamed(\"Year\", \"Year_Of_Birth\")\n\n\/\/ Change the case of \"First_Name\" data to initial caps\nval dfNameInitCaps = dfRenameYear.withColumn(\"First_Name\", initcap(col(\"First_Name\").cast(\"string\")))\n\n\/\/ Update column values from \"M\" to \"Male\" and \"F\" to \"Female\"\nval dfBabyNamesSex = dfNameInitCaps.withColumn(\"Sex\",\nwhen(col(\"Sex\") equalTo \"M\", \"Male\")\n.when(col(\"Sex\") equalTo \"F\", \"Female\"))\n\n\/\/ Display the data\ndisplay(dfBabyNamesSex)\n\n\/\/ Save DataFrame to a table\ndfBabyNamesSex.write.mode(\"overwrite\").saveAsTable(s\"${pathTable}.${silverTableName}\")\n\n``` \n```\n# Rename \"Year\" column to \"Year_Of_Birth\"\ndf_rename_year <- withColumnRenamed(df_raw, \"Year\", \"Year_Of_Birth\")\n\n# Change the case of \"First_Name\" data to initial caps\ndf_init_caps <- withColumn(df_rename_year, \"First_Name\", initcap(df_rename_year$First_Name))\n\n# Update column values from \"M\" to \"Male\" and \"F\" to \"Female\"\ndf_baby_names_sex <- withColumn(df_init_caps, \"Sex\",\nifelse(df_init_caps$Sex == \"M\", \"Male\",\nifelse(df_init_caps$Sex == \"F\", 
\"Female\", df_init_caps$Sex)))\n# Display the data\ndisplay(df_baby_names_sex)\n\n# Save DataFrame to a table\nsaveAsTable(df_baby_names_sex, paste(path_table, \".\", silver_table_name), mode = \"overwrite\")\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/cleanse-enhance-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Enhance and cleanse data\n#### Step 5: Group and visualize data\n\nIn this step, you filter the data to only the year 2021, group the data by sex and name, aggregate by count, and order by count. You then save the DataFrame to a table and then visualize the data in a bar chart. \n1. Copy and paste the following code into an empty notebook cell. \n```\nfrom pyspark.sql.functions import expr, sum, desc\nfrom pyspark.sql import Window\n\n# Count of names for entire state of New York by sex\ndf_baby_names_2021_grouped=(df_baby_names_sex\n.filter(expr(\"Year_Of_Birth == 2021\"))\n.groupBy(\"Sex\", \"First_Name\")\n.agg(sum(\"Count\").alias(\"Total_Count\"))\n.sort(desc(\"Total_Count\")))\n\n# Display data\ndisplay(df_baby_names_2021_grouped)\n\n# Save DataFrame to a table\ndf_baby_names_2021_grouped.write.mode(\"overwrite\").saveAsTable(f\"{path_table}.{gold_table_name}\")\n\n``` \n```\nimport org.apache.spark.sql.functions.{expr, sum, desc}\nimport org.apache.spark.sql.expressions.Window\n\n\/\/ Count of male and female names for entire state of New York by sex\nval dfBabyNames2021Grouped = dfBabyNamesSex\n.filter(expr(\"Year_Of_Birth == 2021\"))\n.groupBy(\"Sex\", \"First_Name\")\n.agg(sum(\"Count\").alias(\"Total_Count\"))\n.sort(desc(\"Total_Count\"))\n\n\/\/ Display data\ndisplay(dfBabyNames2021Grouped)\n\n\/\/ Save DataFrame to a table\ndfBabyNames2021Grouped.write.mode(\"overwrite\").saveAsTable(s\"${pathTable}.${goldTableName}\")\n\n``` \n```\n# Filter to only 2021 data\ndf_baby_names_2021 <- 
filter(df_baby_names_sex, df_baby_names_sex$Year_Of_Birth == 2021)\n\n# Count of names for entire state of New York by sex\ndf_baby_names_grouped <- agg(\ngroupBy(df_baby_names_2021, df_baby_names_2021$Sex, df_baby_names_2021$First_Name),\nTotal_Count = sum(df_baby_names_2021$Count)\n)\n# Display data\ndisplay(arrange(select(df_baby_names_grouped, df_baby_names_grouped$Sex, df_baby_names_grouped$First_Name, df_baby_names_grouped$Total_Count), desc(df_baby_names_grouped$Total_Count)))\n\n# Save DataFrame to a table\nsaveAsTable(df_baby_names_grouped, paste0(path_table, \".\", gold_table_name), mode = \"overwrite\")\n\n```\n2. Press `Ctrl+Enter` to run the cell.\n3. Next to the **Table** tab, click **+** and then click **Visualization**.\n4. In the visualization editor, click **Visualization Type**, and verify that **Bar** is selected.\n5. In the **X column**, select `First_Name`.\n6. Click **Add column** under **Y columns** and then select **Total\\_Count**.\n7. In **Group by**, select **Sex**. \n![gold table](https:\/\/docs.databricks.com\/_images\/gold_table.png)\n8. Click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/cleanse-enhance-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Enhance and cleanse data\n#### Cleanse and enhance data notebooks\n\nUse one of the following notebooks to perform the steps in this article. 
\n### Cleanse and enhance data using Python \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/getting-started\/cleanse-enhance-data-python.html) \n### Cleanse and enhance data using Scala \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/getting-started\/cleanse-enhance-data-scala.html) \n### Cleanse and enhance data using R \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/getting-started\/cleanse-enhance-data-sparkr.html)\n\n### Get started: Enhance and cleanse data\n#### Additional resources\n\n* [Get started: Query and visualize data from a notebook](https:\/\/docs.databricks.com\/getting-started\/quick-start.html)\n* [Get started: Import and visualize CSV data from a notebook](https:\/\/docs.databricks.com\/getting-started\/import-visualize-data.html)\n* [Get started: Ingest and insert additional data](https:\/\/docs.databricks.com\/getting-started\/ingest-insert-additional-data.html)\n* [Tutorial: Load and transform data using Apache Spark DataFrames](https:\/\/docs.databricks.com\/getting-started\/dataframes.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/cleanse-enhance-data.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Configure Auto Loader for production workloads\n\nDatabricks recommends that you follow [the streaming best practices](https:\/\/docs.databricks.com\/structured-streaming\/production.html) for running Auto Loader in production. \nDatabricks recommends using Auto Loader in Delta Live Tables for incremental data ingestion. 
Delta Live Tables extends functionality in Apache Spark Structured Streaming and allows you to write just a few lines of declarative Python or SQL to deploy a production-quality data pipeline with: \n* Autoscaling compute infrastructure for cost savings\n* Data quality checks with [expectations](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html)\n* Automatic [schema evolution](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html) handling\n* Monitoring via metrics in the [event log](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html#event-log)\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/production.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Configure Auto Loader for production workloads\n##### Monitoring Auto Loader\n\n### Querying files discovered by Auto Loader \nNote \nThe `cloud_files_state` function is available in Databricks Runtime 11.3 LTS and above. \nAuto Loader provides a SQL API for inspecting the state of a stream. Using the `cloud_files_state` function, you can find metadata about files that have been discovered by an Auto Loader stream. Simply query from `cloud_files_state`, providing the checkpoint location associated with an Auto Loader stream. \n```\nSELECT * FROM cloud_files_state('path\/to\/checkpoint');\n\n``` \n### Listen to stream updates \nTo further monitor Auto Loader streams, Databricks recommends using Apache Spark\u2019s [Streaming Query Listener interface](https:\/\/docs.databricks.com\/structured-streaming\/stream-monitoring.html). \nAuto Loader reports metrics to the Streaming Query Listener at every batch. 
You can view how many files exist in the backlog and how large the backlog is in the `numFilesOutstanding` and `numBytesOutstanding` metrics under the **Raw Data** tab in the streaming query progress dashboard: \n```\n{\n\"sources\" : [\n{\n\"description\" : \"CloudFilesSource[\/path\/to\/source]\",\n\"metrics\" : {\n\"numFilesOutstanding\" : \"238\",\n\"numBytesOutstanding\" : \"163939124006\"\n}\n}\n]\n}\n\n``` \nIn Databricks Runtime 10.4 LTS and above, when using file notification mode, the metrics will also include the approximate number of file events that are in the cloud queue as `approximateQueueSize` for AWS and Azure.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/production.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Configure Auto Loader for production workloads\n##### Cost considerations\n\nWhen running Auto Loader, your main source of costs would be the cost of compute resources and file discovery. \nTo reduce compute costs, Databricks recommends using Databricks Jobs to schedule Auto Loader as batch jobs using `Trigger.AvailableNow` instead of running it continuously as long as you don\u2019t have low latency requirements. See [Configure Structured Streaming trigger intervals](https:\/\/docs.databricks.com\/structured-streaming\/triggers.html). \nFile discovery costs can come in the form of LIST operations on your storage accounts in directory listing mode and API requests on the subscription service, and queue service in file notification mode. 
To reduce file discovery costs, Databricks recommends: \n* Providing a `ProcessingTime` trigger when running Auto Loader continuously in directory listing mode\n* Architecting file uploads to your storage account in lexical ordering to leverage [Incremental Listing (deprecated)](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/directory-listing-mode.html#incremental-listing) when possible\n* [Leveraging file notifications](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html) when incremental listing is not possible\n* Using [resource tags](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#file-notification-options) to tag resources created by Auto Loader to track your costs\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/production.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Configure Auto Loader for production workloads\n##### Using Trigger.AvailableNow and rate limiting\n\nNote \nAvailable in Databricks Runtime 10.4 LTS and above. \nAuto Loader can be scheduled to run in Databricks Jobs as a batch job by using `Trigger.AvailableNow`. The `AvailableNow` trigger will instruct Auto Loader to process all files that arrived **before** the query start time. New files that are uploaded after the stream has started are ignored until the next trigger. \nWith `Trigger.AvailableNow`, file discovery happens asynchronously with data processing and data can be processed across multiple micro-batches with rate limiting. Auto Loader by default processes a maximum of 1000 files every micro-batch. You can configure `cloudFiles.maxFilesPerTrigger` and `cloudFiles.maxBytesPerTrigger` to configure how many files or how many bytes should be processed in a micro-batch. The file limit is a hard limit but the byte limit is a soft limit, meaning that more bytes can be processed than the provided `maxBytesPerTrigger`. 
When the options are both provided together, Auto Loader processes as many files that are needed to hit one of the limits.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/production.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Configure Auto Loader for production workloads\n##### Event retention\n\nAuto Loader keeps track of discovered files in the checkpoint location using RocksDB to provide exactly-once ingestion guarantees. Databricks strongly recommends using the `cloudFiles.maxFileAge` option for all high-volume or long-lived ingestion streams. This option expires events from the checkpoint location, which accelerates Auto Loader startup time. Startup time can grow into the minutes per Auto Loader run, which adds unnecessary cost when you have an upper bound on the maximal age of files that will be stored in the source directory. The minimum value that you can set for `cloudFiles.maxFileAge` is `\"14 days\"`. Deletes in RocksDB appear as tombstone entries, therefore you should expect the storage usage to increase temporarily as events expire before it starts to level off. \nWarning \n`cloudFiles.maxFileAge` is provided as a cost control mechanism for high volume datasets. Tuning `cloudFiles.maxFileAge` too aggressively can cause data quality issues such as duplicate ingestion or missing files. Therefore, Databricks recommends a conservative setting for `cloudFiles.maxFileAge`, such as 90 days, which is similar to what comparable data ingestion solutions recommend. \nTrying to tune the `cloudFiles.maxFileAge` option can lead to unprocessed files being ignored by Auto Loader or already processed files expiring and then being re-processed causing duplicate data. Here are some things to consider when choosing a `cloudFiles.maxFileAge`: \n* If your stream restarts after a long time, file notification events that are pulled from the queue that are older than `cloudFiles.maxFileAge` are ignored. 
Similarly, if you use directory listing, files that might have appeared during the down time that are older than `cloudFiles.maxFileAge` are ignored.\n* If you use directory listing mode and use `cloudFiles.maxFileAge`, for example set to `\"1 month\"`, you stop your stream and restart the stream with `cloudFiles.maxFileAge` set to `\"2 months\"`, files that are older than 1 month, but more recent than 2 months are reprocessed. \nIf you set this option the first time you start the stream, you will not ingest data older than `cloudFiles.maxFileAge`, therefore, if you want to ingest old data you should not set this option as you start your stream for the first time. However, you should set this option on subsequent runs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/production.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Configure Auto Loader for production workloads\n##### Trigger regular backfills using cloudFiles.backfillInterval\n\nAuto Loader can trigger asynchronous backfills at a given interval, for example one day to backfill once a day, or one week to backfill once a week. File event notification systems do not guarantee 100% delivery of all files that have been uploaded and do not provide strict SLAs on the latency of the file events. Databricks recommends that you trigger regular backfills with Auto Loader by using the `cloudFiles.backfillInterval` option to guarantee that all files are discovered within a given SLA if data completeness is a requirement. Triggering regular backfills does not cause duplicates.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/production.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n### What are workspace files?\n##### Programmatically interact with workspace files\n\nYou can interact with workspace files stored in Databricks programmatically. 
This enables tasks such as: \n* Storing small data files alongside notebooks and code.\n* Writing log files to directories synced with Git.\n* Importing modules using relative paths.\n* Creating or modifying an environment specification file.\n* Writing output from notebooks.\n* Writing output from execution of libraries such as TensorBoard. \nYou can programmatically create, edit, and delete workspace files in Databricks Runtime 11.3 LTS and above. \nNote \nTo disable writing to workspace files, set the cluster environment variable `WSFS_ENABLE_WRITE_SUPPORT=false`. For more information, see [Environment variables](https:\/\/docs.databricks.com\/compute\/configure.html#env-var). \nNote \nIn Databricks Runtime 14.0 and above, the default current working directory (CWD) for code executed locally is the directory containing the notebook or script being run. This is a change in behavior from Databricks Runtime 13.3 LTS and below. See [What is the default current working directory?](https:\/\/docs.databricks.com\/files\/cwd-dbr-14.html).\n\n##### Programmatically interact with workspace files\n###### Read the locations of files\n\nUse shell commands to read the locations of files, for example, in a repo or in the local filesystem. \nTo determine the location of files, enter the following: \n```\n%sh ls\n\n``` \n* **Files aren\u2019t in a repo:** The command returns the filesystem `\/databricks\/driver`.\n* **Files are in a repo:** The command returns a virtualized repo such as `\/Workspace\/Repos\/name@domain.com\/public_repo_2\/repos_file_system`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/workspace-interact.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n### What are workspace files?\n##### Programmatically interact with workspace files\n###### Read data workspace files\n\nYou can programmatically read small data files such as `.csv` or `.json` files from code in your notebooks. 
The following example uses Pandas to query files stored in a `\/data` directory relative to the root of the project repo: \n```\nimport pandas as pd\ndf = pd.read_csv(\".\/data\/winequality-red.csv\")\ndf\n\n``` \nYou can use Spark to read data files. You must provide Spark with the fully qualified path. \n* Workspace files in Git folders use the path `file:\/Workspace\/Repos\/<user-folder>\/<repo-name>\/path\/to\/file`.\n* Workspace files in your personal directory use the path: `file:\/Workspace\/Users\/<user-folder>\/path\/to\/file`. \nYou can copy the absolute or relative path to a file from the dropdown menu next to the file: \n![file drop down menu](https:\/\/docs.databricks.com\/_images\/file-drop-down.png) \nThe example below shows the use of `{os.getcwd()}` to get the full path. \n```\nimport os\nspark.read.format(\"csv\").load(f\"file:{os.getcwd()}\/my_data.csv\")\n\n``` \nTo learn more about files on Databricks, see [Work with files on Databricks](https:\/\/docs.databricks.com\/files\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/workspace-interact.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n### What are workspace files?\n##### Programmatically interact with workspace files\n###### Programmatically create, update, and delete files and directories\n\nIn Databricks Runtime 11.3 LTS and above, you can directly manipulate workspace files in Databricks. The following examples use standard Python packages and functionality to create and manipulate files and directories. 
\n```\nimport os\n\n# Create a new directory\n\nos.mkdir('dir1')\n\n# Create a new file and write to it\n\nwith open('dir1\/new_file.txt', \"w\") as f:\n    f.write(\"new content\")\n\n# Append to a file\n\nwith open('dir1\/new_file.txt', \"a\") as f:\n    f.write(\" continued\")\n\n# Delete a file\n\nos.remove('dir1\/new_file.txt')\n\n# Delete a directory\n\nos.rmdir('dir1')\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/workspace-interact.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Limits & FAQ for Git integration with Databricks Git folders\n\nDatabricks Git folders and Git integration have limits specified in the following sections. For general information, see [Databricks limits](https:\/\/docs.databricks.com\/resources\/limits.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/limits.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Limits & FAQ for Git integration with Databricks Git folders\n##### File and repo size limits\n\nDatabricks doesn\u2019t enforce a limit on the size of a repo. However: \n* Working branches are limited to 200 MB.\n* Individual workspace files are subject to a separate size limit. For more details, read [Limitations](https:\/\/docs.databricks.com\/files\/workspace.html#limits).\n* Files larger than 10 MB can\u2019t be viewed in the Databricks UI. \nDatabricks recommends that in a repo: \n* The total number of all files not exceed 10,000.\n* The total number of notebooks not exceed 5,000. \nFor any Git operation, memory usage is limited to 2 GB, and disk writes are limited to 4 GB. Since the limit is per-operation, you get a failure if you attempt to clone a Git repo that is 5 GB in current size. However, if you clone a Git repo that is 3 GB in size in one operation and then add 2 GB to it later, the next pull operation will succeed. \nYou might receive an error message if your repo exceeds these limits. 
You might also receive a timeout error when you clone the repo, but the operation might complete in the background. \nTo work with a repo larger than the size limits, try [sparse checkout](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html#sparse). \nIf you must write temporary files that you do not want to keep after the cluster is shut down, writing the temporary files to `$TEMPDIR` avoids exceeding branch size limits and yields better performance than writing to the current working directory (CWD) if the CWD is in the workspace filesystem. For more information, see [Where should I write temporary files on Databricks?](https:\/\/docs.databricks.com\/files\/write-data.html#write-temp-files).\n\n#### Limits & FAQ for Git integration with Databricks Git folders\n##### Maximum number of Git folders per workspace\n\nYou can have a maximum of 10,000 Git folders per workspace. If you require more, contact Databricks support.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/limits.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Limits & FAQ for Git integration with Databricks Git folders\n##### Monorepo support\n\nDatabricks recommends that you do not create Git folders backed by monorepos, where a *monorepo* is a large, single-organization Git repository with many thousands of files across many projects.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/limits.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Limits & FAQ for Git integration with Databricks Git folders\n##### Git folder configuration\n\n### Where is Databricks repo content stored? \nThe contents of a repo are temporarily cloned onto disk in the control plane. Databricks notebook files are stored in the control plane database just like notebooks in the main workspace. Non-notebook files are stored on disk for up to 30 days. 
\n### Do Git folders support on-premises or self-hosted Git servers? \nDatabricks Git folders supports GitHub Enterprise, Bitbucket Server, Azure DevOps Server, and GitLab Self-managed integration, if the server is internet accessible. For details on integrating Git folders with an on-prem Git server, read [Git Proxy Server for Git folders](https:\/\/docs.databricks.com\/repos\/git-proxy.html). \nTo integrate with a Bitbucket Server, GitHub Enterprise Server, or a GitLab self-managed subscription instance that is not internet-accessible, get in touch with your Databricks account team. \n### What Databricks asset types are supported by Git folders? \nFor details on supported asset types, read [Manage file assets in Databricks Git folders](https:\/\/docs.databricks.com\/repos\/manage-assets.html). \n### Do Git folders support `.gitignore` files? \nYes. If you add a file to your repo and do not want it to be tracked by Git, create a `.gitignore` file or use one cloned from your remote repository and add the filename, including the extension. \n`.gitignore` works only for files that are not already tracked by Git. If you add a file that is already tracked by Git to a `.gitignore` file, the file is still tracked by Git. \n### Can I create top-level folders that are not user folders? \nYes, admins can create top-level folders to a single depth. Git folders do not support additional folder levels. \n### Do Git folders support Git submodules? \nNo. You can clone a repo that contains Git submodules, but the submodule is not cloned.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/limits.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Limits & FAQ for Git integration with Databricks Git folders\n##### Source management\n\n### Why do notebook dashboards disappear when I pull or checkout a different branch? 
\nThis is currently a limitation because Databricks notebook source files don\u2019t store notebook dashboard information. \nIf you want to preserve dashboards in the Git repository, change the notebook format to `.ipynb` (the Jupyter notebook format). By default, `.ipynb` supports dashboard and visualization definitions. If you want to preserve graph data (data points), you must commit the notebook with outputs. \nTo learn about committing `.ipynb` notebook outputs, see [Allow committing `.ipynb` notebook output](https:\/\/docs.databricks.com\/repos\/manage-assets.html#ipynb-repos). \n### Do Git folders support branch merging? \nYes. You can also create a pull request and merge through your Git provider. \n### Can I delete a branch from a Databricks repo? \nNo. To delete a branch, you must work in your Git provider. \n### If a library is installed on a cluster, and a library with the same name is included in a folder within a repo, which library is imported? \nThe library in the repo is imported. For more information about library precedence in Python, see [Python library precedence](https:\/\/docs.databricks.com\/libraries\/index.html#precedence). \n### Can I pull the latest version of a repository from Git before running a job without relying on an external orchestration tool? \nNo. Typically you can integrate this as a pre-commit on the Git server so that every push to a branch (main\/prod) updates the Production repo. \n### Can I export a repo? \nYou can export notebooks, folders, or an entire repo. You cannot export non-notebook files. If you export an entire repo, non-notebook files are not included. 
To export, use the `workspace export` command in the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) or use the [Workspace API](https:\/\/docs.databricks.com\/api\/workspace\/workspace).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/limits.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Limits & FAQ for Git integration with Databricks Git folders\n##### Security, authentication, and tokens\n\n### Issue with a conditional access policy (CAP) for Microsoft Entra ID (formerly Azure Active Directory) \nWhen you try to clone a repo, you might get a \u201cdenied access\u201d error message when: \n* Databricks is configured to use Azure DevOps with Microsoft Entra ID authentication.\n* You have enabled a conditional access policy in Azure DevOps and a Microsoft Entra ID conditional access policy. \nTo resolve this, add an exclusion to the conditional access policy (CAP) for the IP address or users of Databricks. \nFor more information, see [Conditional access policies](https:\/\/learn.microsoft.com\/azure\/active-directory\/conditional-access\/concept-conditional-access-policies). \n### Are the contents of Databricks Git folders encrypted? \nThe contents of Databricks Git folders are encrypted by Databricks using a default key. Encryption using customer-managed keys is not supported except when encrypting your Git credentials. \n### How and where are the GitHub tokens stored in Databricks? Who would have access from Databricks? \n* The authentication tokens are stored in the Databricks control plane, and a Databricks employee can only gain access through a temporary credential that is audited.\n* Databricks logs the creation and deletion of these tokens, but not their usage. Databricks has logging that tracks Git operations that can be used to audit the usage of the tokens by the Databricks application.\n* GitHub Enterprise audits token usage. 
Other Git services might also have Git server auditing. \n### Do Git folders support GPG signing of commits? \nNo. \n### Do Git folders support SSH? \nNo, only `HTTPS`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/limits.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Limits & FAQ for Git integration with Databricks Git folders\n##### CI\/CD and MLOps\n\n### Incoming changes clear the notebook state \nGit operations that alter the notebook source code result in the loss of the notebook state, including cell outputs, comments, version history, and widgets. For example, `git pull` can change the source code of a notebook. In this case, Databricks Git folders must overwrite the existing notebook to import the changes. `git commit` and `push` or creating a new branch do not affect the notebook source code, so the notebook state is preserved in these operations. \nImportant \nMLflow experiments don\u2019t work in Git folders with DBR 14.x or lower versions. \n### Can I create an MLflow experiment in a repo? \nThere are two types of MLflow experiments: **workspace** and **notebook**. For details on the two types of MLflow experiments, see [Organize training runs with MLflow experiments](https:\/\/docs.databricks.com\/mlflow\/experiments.html). \nIn Git folders, you can call `mlflow.set_experiment(\"\/path\/to\/experiment\")` for an MLflow experiment of either type and log runs to it, but that experiment and the associated runs will **not** be checked into source control. \n#### Workspace MLflow experiments \nYou cannot [create workspace MLflow experiments](https:\/\/docs.databricks.com\/mlflow\/experiments.html#create-workspace-experiment) in a Databricks Git folder (Git folder). If multiple users use separate Git folders to collaborate on the same ML code, log MLflow runs to an MLflow experiment created in a regular workspace folder. 
\n#### Notebook MLflow experiments \nYou can create notebook experiments in a Databricks Git folder. If you check your notebook into source control as an `.ipynb` file, you can log MLflow runs to an automatically created and associated MLflow experiment. For more details, read about [creating notebook experiments](https:\/\/docs.databricks.com\/mlflow\/experiments.html#create-notebook-experiment). \n### Prevent data loss in MLflow experiments \nWarning \nAny time you switch to a branch that does not contain the notebook, you risk losing the associated MLflow experiment data. This loss becomes permanent if the prior branch is not accessed within 30 days. \nTo recover missing experiment data before the 30-day expiry, **rename the notebook back to the original name**, open the notebook, click the \u201cexperiment\u201d icon on the right side pane (this also effectively calls the `mlflow.get_experiment_by_name()` API), and you will be able to see the recovered experiment and runs. After 30 days, any orphaned MLflow experiments will be purged to meet GDPR compliance policies. \nTo prevent this situation, Databricks recommends you either avoid renaming notebooks in repos altogether, or if you do rename a notebook, click the \u201cexperiment\u201d icon on the right side pane immediately after renaming a notebook. \n### What happens if a notebook job is running in a workspace while a Git operation is in progress? \nAt any point while a Git operation is in progress, some notebooks in the repo might have been updated while others have not. This can cause unpredictable behavior. \nFor example, suppose `notebook A` calls `notebook Z` using a `%run` command. 
If a job running\nduring a Git operation starts the most recent version of `notebook A`, but `notebook Z` has not\nyet been updated, the `%run` command in notebook A might start the older version of `notebook Z`.\nDuring the Git operation, the notebook states are not predictable and the job might fail or run\n`notebook A` and `notebook Z` from different commits. \nTo avoid this situation, use Git-based jobs (where the source is a Git provider and not a workspace path) instead. For more details, read [Use version-controlled source code in a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-repos.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/limits.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Limits & FAQ for Git integration with Databricks Git folders\n##### Resources\n\nFor details on Databricks workspace files, see [What are workspace files?](https:\/\/docs.databricks.com\/files\/workspace.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/limits.html"} +{"content":"# Query data\n## Data format options\n#### JSON file\n\nYou can read JSON files in [single-line](https:\/\/docs.databricks.com\/query\/formats\/json.html#single-line-mode) or [multi-line](https:\/\/docs.databricks.com\/query\/formats\/json.html#multi-line-mode) mode. In single-line mode, a file can be split into many parts and read in parallel. In multi-line mode, a file is loaded as a whole entity and *cannot* be split. \nFor further information, see [JSON Files](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-json.html).\n\n#### JSON file\n##### Options\n\nSee the following Apache Spark reference articles for supported read and write options. 
\n* Read \n+ [Python](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrameReader.json.html?highlight=json#pyspark.sql.DataFrameReader.json)\n+ [Scala](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/sql\/DataFrameReader.html#json(paths:String*):org.apache.spark.sql.DataFrame)\n* Write \n+ [Python](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrameWriter.json.html?highlight=json#pyspark.sql.DataFrameWriter.json)\n+ [Scala](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/sql\/DataFrameWriter.html#json(path:String):Unit)\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/json.html"} +{"content":"# Query data\n## Data format options\n#### JSON file\n##### Rescued data column\n\nNote \nThis feature is supported in [Databricks Runtime 8.2 (unsupported)](https:\/\/docs.databricks.com\/archive\/runtime-release-notes\/8.2.html) and above. \nThe rescued data column ensures that you never lose or miss out on data during ETL. The rescued data column contains any data that wasn\u2019t parsed, either because it was missing from the given schema, or because there was a type mismatch, or because the casing of the column in the record or file didn\u2019t match with that in the schema. The rescued data column is returned as a JSON blob containing the columns that were rescued, and the source file path of the record. To remove the source file path from the rescued data column, you can set the SQL configuration `spark.conf.set(\"spark.databricks.sql.rescuedDataColumn.filePath.enabled\", \"false\")`. You can enable the rescued data column by setting the option `rescuedDataColumn` to a column name, such as `_rescued_data` with `spark.read.option(\"rescuedDataColumn\", \"_rescued_data\").format(\"json\").load(<path>)`. \nThe JSON parser supports three modes when parsing records: `PERMISSIVE`, `DROPMALFORMED`, and `FAILFAST`. 
When used together with `rescuedDataColumn`, data type mismatches do not cause records to be dropped in `DROPMALFORMED` mode or throw an error in `FAILFAST` mode. Only corrupt records\u2014that is, incomplete or malformed JSON\u2014are dropped or throw errors. If you use the option `badRecordsPath` when parsing JSON, data type mismatches are not considered as bad records when using the `rescuedDataColumn`. Only incomplete and malformed JSON records are stored in `badRecordsPath`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/json.html"} +{"content":"# Query data\n## Data format options\n#### JSON file\n##### Examples\n\n### Single-line mode \nIn this example, there is one JSON object per line: \n```\n{\"string\":\"string1\",\"int\":1,\"array\":[1,2,3],\"dict\": {\"key\": \"value1\"}}\n{\"string\":\"string2\",\"int\":2,\"array\":[2,4,6],\"dict\": {\"key\": \"value2\"}}\n{\"string\":\"string3\",\"int\":3,\"array\":[3,6,9],\"dict\": {\"key\": \"value3\", \"extra_key\": \"extra_value3\"}}\n\n``` \nTo read the JSON data, use: \n```\nval df = spark.read.format(\"json\").load(\"example.json\")\n\n``` \nSpark infers the schema automatically. 
\n```\ndf.printSchema\n\n``` \n```\nroot\n|-- array: array (nullable = true)\n| |-- element: long (containsNull = true)\n|-- dict: struct (nullable = true)\n| |-- extra_key: string (nullable = true)\n| |-- key: string (nullable = true)\n|-- int: long (nullable = true)\n|-- string: string (nullable = true)\n\n``` \n### Multi-line mode \nThis JSON object occupies multiple lines: \n```\n[\n{\"string\":\"string1\",\"int\":1,\"array\":[1,2,3],\"dict\": {\"key\": \"value1\"}},\n{\"string\":\"string2\",\"int\":2,\"array\":[2,4,6],\"dict\": {\"key\": \"value2\"}},\n{\n\"string\": \"string3\",\n\"int\": 3,\n\"array\": [\n3,\n6,\n9\n],\n\"dict\": {\n\"key\": \"value3\",\n\"extra_key\": \"extra_value3\"\n}\n}\n]\n\n``` \nTo read this object, enable multi-line mode: \n```\nCREATE TEMPORARY VIEW multiLineJsonTable\nUSING json\nOPTIONS (path=\"\/tmp\/multi-line.json\",multiline=true)\n\n``` \n```\nval mdf = spark.read.option(\"multiline\", \"true\").format(\"json\").load(\"\/tmp\/multi-line.json\")\nmdf.show(false)\n\n``` \n### Charset auto-detection \nBy default, the charset of input files is detected automatically. You can specify the charset explicitly using the `charset` option: \n```\nspark.read.option(\"charset\", \"UTF-16BE\").format(\"json\").load(\"fileInUTF16.json\")\n\n``` \nSome supported charsets include: `UTF-8`, `UTF-16BE`, `UTF-16LE`, `UTF-16`, `UTF-32BE`, `UTF-32LE`, `UTF-32`. For the full list of charsets supported by Oracle Java SE, see [Supported Encodings](https:\/\/docs.oracle.com\/javase\/8\/docs\/technotes\/guides\/intl\/encoding.doc.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/json.html"} +{"content":"# Query data\n## Data format options\n#### JSON file\n##### Notebook example: Read JSON files\n\nThe following notebook demonstrates single-line mode and multi-line mode. 
\n### Read JSON files notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/read-json-files.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/json.html"} +{"content":"# Technology partners\n## Connect to data governance partners using Partner Connect\n#### Connect to Anomalo\n\nAnomalo is a data quality validation platform that ensures accurate, complete, and consistent data that is in line with your expectations. By connecting to Databricks, Anomalo brings a unifying layer that ensures you can trust the quality of your data before it is consumed by various business intelligence and analytics tools or modeling and machine learning frameworks. \nYou can integrate your Databricks clusters and Databricks SQL warehouses (formerly Databricks SQL endpoints) with Anomalo.\n\n#### Connect to Anomalo\n##### Connect to Anomalo using Partner Connect\n\nTo connect your Databricks workspace to Anomalo using Partner Connect, see [Connect to data governance partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/data-governance.html). \nNote \nPartner Connect only supports Databricks SQL warehouses for Anomalo. To connect a cluster in your Databricks workspace to Anomalo, connect to Anomalo manually.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/data-governance\/anomalo.html"} +{"content":"# Technology partners\n## Connect to data governance partners using Partner Connect\n#### Connect to Anomalo\n##### Connect to Anomalo manually\n\nThis section describes how to connect an existing SQL warehouse or cluster to Anomalo manually. \n### Requirements \nBefore you connect to Anomalo manually, you must have the following: \n* A cluster or SQL warehouse in your Databricks workspace. 
\n+ [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n+ [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* The connection details for your cluster or SQL warehouse, specifically the **Server Hostname**, **Port**, and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). To create a personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**.\nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. 
See your workspace administrator or the following: \n+ [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n+ [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \n### Steps to connect \nTo connect to Anomalo manually, do the following: \n1. [Sign up](https:\/\/app.anomalo.com\/signup?next=\/dashboard\/home) for a new Anomalo account, or [sign in](https:\/\/app.anomalo.com\/login?next=\/dashboard\/home) to your existing Anomalo account.\n2. In the sidebar of your Anomalo home page, click the **Support** icon, then click **Anomalo Documentation**.\n3. 
Follow the steps in the article titled *Connecting your data*.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/data-governance\/anomalo.html"} +{"content":"# Technology partners\n## Connect to data governance partners using Partner Connect\n#### Connect to Anomalo\n##### Next steps\n\n* Visit the [Anomalo website](https:\/\/www.anomalo.com\/).\n* Read [What is Anomalo](https:\/\/docs.anomalo.com\/introduction\/readme).\n* To learn more about how to use Anomalo, click the **Support** icon in the sidebar of your Anomalo home page, then click **Anomalo Documentation**.\n* For additional help, [email Anomalo support](mailto:support%40anomalo.com).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/data-governance\/anomalo.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Copy MLflow objects between workspaces\n\nTo import or export MLflow objects to or from your Databricks workspace, you can use the community-driven open source project [MLflow Export-Import](https:\/\/github.com\/mlflow\/mlflow-export-import#why-use-mlflow-export-import) to migrate MLflow experiments, models, and runs between workspaces. \nWith these tools, you can: \n* Share and collaborate with other data scientists in the same or another tracking server. For example, you can clone an experiment from another user into your workspace.\n* Copy a model from one workspace to another, such as from a development to a production workspace.\n* Copy MLflow experiments and runs from your local tracking server to your Databricks workspace.\n* Back up mission critical experiments and models to another Databricks workspace.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/migrate-mlflow-objects.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### User-defined scalar functions - Python\n\nThis article contains Python user-defined function (UDF) examples. 
It shows how to register and invoke UDFs, and provides caveats about the evaluation order of subexpressions in Spark SQL. \nIn Databricks Runtime 14.0 and above, you can use Python user-defined table functions (UDTFs) to register functions that return entire relations instead of scalar values. See [What are Python user-defined table functions?](https:\/\/docs.databricks.com\/udf\/python-udtf.html). \nNote \nIn Databricks Runtime 12.2 LTS and below, Python UDFs and Pandas UDFs are not supported in Unity Catalog on compute that uses shared access mode. Scalar Python UDFs and scalar Pandas UDFs are supported in Databricks Runtime 13.3 LTS and above for all access modes. \nIn Databricks Runtime 13.3 LTS and above, you can register scalar Python UDFs to Unity Catalog using SQL syntax. See [User-defined functions (UDFs) in Unity Catalog](https:\/\/docs.databricks.com\/udf\/unity-catalog.html). \nGraviton instances do not support Python UDFs on Unity Catalog-enabled clusters. \nImportant \nUDFs and UDAFs are not supported on Graviton clusters configured with shared access mode and Unity Catalog.\n\n#### User-defined scalar functions - Python\n##### Register a function as a UDF\n\n```\ndef squared(s):\n    return s * s\nspark.udf.register(\"squaredWithPython\", squared)\n\n``` \nYou can optionally set the return type of your UDF. The default return type is `StringType`. 
\n```\nfrom pyspark.sql.types import LongType\ndef squared_typed(s):\n    return s * s\nspark.udf.register(\"squaredWithPython\", squared_typed, LongType())\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/python.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### User-defined scalar functions - Python\n##### Call the UDF in Spark SQL\n\n```\nspark.range(1, 20).createOrReplaceTempView(\"test\")\n\n``` \n```\n%sql select id, squaredWithPython(id) as id_squared from test\n\n```\n\n#### User-defined scalar functions - Python\n##### Use UDF with DataFrames\n\n```\nfrom pyspark.sql.functions import udf\nfrom pyspark.sql.types import LongType\nsquared_udf = udf(squared, LongType())\ndf = spark.table(\"test\")\ndisplay(df.select(\"id\", squared_udf(\"id\").alias(\"id_squared\")))\n\n``` \nAlternatively, you can declare the same UDF using annotation syntax: \n```\nfrom pyspark.sql.functions import udf\n@udf(\"long\")\ndef squared_udf(s):\n    return s * s\ndf = spark.table(\"test\")\ndisplay(df.select(\"id\", squared_udf(\"id\").alias(\"id_squared\")))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/python.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### User-defined scalar functions - Python\n##### Evaluation order and null checking\n\nSpark SQL (including SQL and the DataFrame and Dataset API) does not guarantee the order of\nevaluation of subexpressions. In particular, the inputs of an operator or function are not\nnecessarily evaluated left-to-right or in any other fixed order. For example, logical `AND`\nand `OR` expressions do not have left-to-right \u201cshort-circuiting\u201d semantics. \nTherefore, it is dangerous to rely on the side effects or order of evaluation of Boolean\nexpressions, and the order of `WHERE` and `HAVING` clauses, since such expressions and clauses can be\nreordered during query optimization and planning. 
Specifically, if a UDF relies on short-circuiting semantics in SQL for null checking, there\u2019s no\nguarantee that the null check will happen before invoking the UDF. For example, \n```\nspark.udf.register(\"strlen\", lambda s: len(s), \"int\")\nspark.sql(\"select s from test1 where s is not null and strlen(s) > 1\") # no guarantee\n\n``` \nThis `WHERE` clause does not guarantee that the `strlen` UDF is invoked only after nulls have been filtered out. \nTo perform proper null checking, we recommend that you do either of the following: \n* Make the UDF itself null-aware and do null checking inside the UDF itself\n* Use `IF` or `CASE WHEN` expressions to do the null check and invoke the UDF in a conditional branch \n```\nspark.udf.register(\"strlen_nullsafe\", lambda s: len(s) if s is not None else -1, \"int\")\nspark.sql(\"select s from test1 where s is not null and strlen_nullsafe(s) > 1\") # ok\nspark.sql(\"select s from test1 where if(s is not null, strlen(s), null) > 1\") # ok\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/python.html"} +{"content":"# \n### Connect to data sources\n\nThis article provides opinionated recommendations for how administrators and other power users can configure connections between Databricks and data sources. If you are trying to determine whether you have access to read data from an external system, start by reviewing the data that you have access to in your workspace. See [Discover data](https:\/\/docs.databricks.com\/discover\/index.html). \nYou can connect your Databricks account to data sources such as cloud object storage, relational database management systems, streaming data services, and enterprise platforms such as CRMs. The specific privileges required to configure connections depend on the data source, how permissions in your Databricks workspace are configured, the required permissions for interacting with data in the source, your data governance model, and your preferred method for connecting. 
\nMost methods require elevated privileges on both the data source and the Databricks workspace to configure the necessary permissions to integrate systems. Users without these permissions should request help. See [Request access to data sources](https:\/\/docs.databricks.com\/connect\/index.html#request-access).\n\n### Connect to data sources\n#### Configure object storage connections\n\nCloud object storage provides the basis for storing most data on Databricks. To learn more about cloud object storage and where Databricks stores data, see [Where does Databricks write data?](https:\/\/docs.databricks.com\/files\/write-data.html). \nDatabricks recommends using Unity Catalog to configure access to cloud object storage. Unity Catalog provides data governance for both structured and unstructured data in cloud object storage. See [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html). \nCustomers who don\u2019t use Unity Catalog must configure connections using legacy methods. See [Configure access to cloud object storage for Databricks](https:\/\/docs.databricks.com\/connect\/storage\/index.html). \nTo configure networking to cloud object storage, see [Networking](https:\/\/docs.databricks.com\/security\/network\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/index.html"} +{"content":"# \n### Connect to data sources\n#### Configure connections to external data systems\n\nDatabricks recommends several options for configuring connections to external data systems depending on your needs. The following table provides a high-level overview of these options: \n| Option | Description |\n| --- | --- |\n| Lakehouse Federation | Provides read-only access to data in enterprise data systems. Connections are configured through Unity Catalog at the catalog or schema level, syncing multiple tables with a single configuration. 
See [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). |\n| Partner Connect | Leverages technology partner solutions to connect to external data sources and automate ingesting data to the lakehouse. Some solutions also include reverse ETL and direct access to lakehouse data from external systems. See [What is Databricks Partner Connect?](https:\/\/docs.databricks.com\/partner-connect\/index.html) |\n| Drivers | Databricks includes drivers for external data systems in each Databricks Runtime. You can optionally install third-party drivers to access data in other systems. You must configure connections for each table. Some drivers include write access. See [Connect to external systems](https:\/\/docs.databricks.com\/connect\/external-systems\/index.html). |\n| JDBC | Several included drivers for external systems build upon native JDBC support, and the JDBC option provides extensible options for configuring connections to other systems. You must configure connections for each table. See [Query databases using JDBC](https:\/\/docs.databricks.com\/connect\/external-systems\/jdbc.html). |\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/index.html"} +{"content":"# \n### Connect to data sources\n#### Connect to streaming data sources\n\nDatabricks provides optimized connectors for many streaming data systems. \nFor all streaming data sources, you must generate credentials that provide access and load these credentials into Databricks. Databricks recommends storing credentials using secrets, because you can use secrets for all configuration options and in all access modes. \nAll data connectors for streaming sources support passing credentials using options when you define streaming queries. 
See [Configure streaming data sources](https:\/\/docs.databricks.com\/connect\/streaming\/index.html).\n\n### Connect to data sources\n#### Request access to data sources\n\nIn many organizations, most users do not have sufficient privileges on either Databricks or external data sources to configure data connections. \nYour organization might have already configured access to a data source using one of the patterns described in the articles linked from this page. If your organization has a well-defined process for requesting access to data, Databricks recommends following that process. \nIf you\u2019re uncertain how to gain access to a data source, this procedure might help you: \n1. Use Catalog Explorer to view the tables and volumes that you can access. See [What is Catalog Explorer?](https:\/\/docs.databricks.com\/catalog-explorer\/index.html).\n2. Ask your teammates or managers about the data sources that they can access. \n* Most organizations use groups synced from their identity provider (for example: Okta or Microsoft Entra ID (formerly Azure Active Directory)) to manage permissions for workspace users. If other members of your team can access data sources that you need access to, have a workspace admin add you to the correct group to grant you access.\n* If a particular table, volume, or data source was configured by a co-worker, that individual should have permissions to grant you access to the data.\n3. Some organizations configure data access permissions through settings on compute clusters and SQL warehouses. \n* Access to data sources can vary by compute.\n* You can view the compute creator on the **Compute** tab. Reach out to the creator to ask about data sources that should be accessible.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/index.html"} +{"content":"# Get started: Account and workspace setup\n### Best practice articles\n\nThis article provides a reference of best practice articles you can use to optimize your Databricks activity. 
\nThe Databricks documentation includes a number of best practices articles to help you get the best performance at the lowest cost when using and administering Databricks.\n\n### Best practice articles\n#### Cheat sheets\n\nCheat sheets provide you with a high-level view of practices you should be implementing in your Databricks account and workflows. Each cheat sheet includes a table of best practices, their impact, and helpful resources. Available cheat sheets include the following: \n* [Platform administration cheat sheet](https:\/\/docs.databricks.com\/cheat-sheet\/administration.html)\n* [Compute creation cheat sheet](https:\/\/docs.databricks.com\/cheat-sheet\/compute.html)\n* [Production job scheduling cheat sheet](https:\/\/docs.databricks.com\/cheat-sheet\/jobs.html)\n\n### Best practice articles\n#### Best practice articles\n\nThe following articles provide you with best practice guidance for various Databricks features. \n* [Delta Lake best practices](https:\/\/docs.databricks.com\/delta\/best-practices.html)\n* [Hyperparameter tuning with Hyperopt](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-best-practices.html)\n* [Deep learning in Databricks](https:\/\/docs.databricks.com\/machine-learning\/train-model\/dl-best-practices.html)\n* [Recommendations for MLOps](https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html)\n* [Unity Catalog best practices](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html)\n* [Cluster configuration best practices](https:\/\/docs.databricks.com\/compute\/cluster-config-best-practices.html)\n* [Instance pool configuration best practices](https:\/\/docs.databricks.com\/compute\/pool-best-practices.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/best-practices.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Read data shared using Delta Sharing open sharing (for recipients)\n\nThis 
article describes how to read data that has been shared with you using the Delta Sharing *open sharing* protocol. It includes instructions for reading shared data using Databricks, Apache Spark, pandas, PowerBI, and Tableau. \nIn open sharing, you use a credential file that was shared with a member of your team by the data provider to gain secure read access to shared data. Access persists as long as the credential is valid and the provider continues to share the data. Providers manage credential expiration and rotation. Updates to the data are available to you in near real time. You can read and make copies of the shared data, but you can\u2019t modify the source data. \nNote \nIf data has been shared with you using Databricks-to-Databricks Delta Sharing, you don\u2019t need a credential file to access data, and this article doesn\u2019t apply to you. For instructions, see [Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html). \nThe sections that follow describe how to use Databricks, Apache Spark, pandas, and Power BI to access and read shared data using the credential file. For a full list of Delta Sharing connectors and information about how to use them, see the [Delta Sharing open source documentation](https:\/\/delta.io\/sharing). If you run into trouble accessing the shared data, contact the data provider. \nNote \nPartner integrations are, unless otherwise noted, provided by the third parties and you must have an account with the appropriate provider for the use of their products and services. While Databricks does its best to keep this content up to date, we make no representation regarding the integrations or the accuracy of the content on the partner integration pages. 
Reach out to the appropriate providers regarding the integrations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Read data shared using Delta Sharing open sharing (for recipients)\n#### Before you begin\n\nA member of your team must download the credential file shared by the data provider. See [Get access in the open sharing model](https:\/\/docs.databricks.com\/data-sharing\/recipient.html#get-access-open). \nThey should use a secure channel to share that file or file location with you.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Read data shared using Delta Sharing open sharing (for recipients)\n#### Databricks: Read shared data using open sharing connectors\n\nThis section describes how to use an open sharing connector to access shared data using a notebook in your Databricks workspace. You or another member of your team store the credential file in DBFS, then you use it to authenticate to the data provider\u2019s Databricks account and read the data that the data provider shared with you. \nNote \nIf the data provider is using Databricks-to-Databricks sharing and did not share a credential file with you, you must access the data using [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). For instructions, see [Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html). \nIn this example, you create a notebook with multiple cells that you can run independently. You could instead add the notebook commands to the same cell and run them in sequence. 
\n### Step 1: Store the credential file in DBFS (Python instructions) \nIn this step, you use a Python notebook in Databricks to store the credential file so that users on your team can access shared data. \nSkip to the next step if you or someone on your team has already stored the credential file in DBFS. \n1. In a text editor, open the credential file.\n2. In your Databricks workspace, click **New > Notebook**. \n* Enter a name.\n* Set the default language for the notebook to Python.\n* Select a cluster to attach to the notebook.\n* Click **Create**. The notebook opens in the notebook editor.\n3. To use Python or pandas to access the shared data, install the [delta-sharing Python connector](https:\/\/delta.io\/connectors\/). In the notebook editor, paste the following command: \n```\n%sh pip install delta-sharing\n\n```\n4. Run the cell. \nThe `delta-sharing` Python library is installed on the cluster if it isn\u2019t already installed.\n5. In a new cell, paste the following command, which uploads the contents of the credential file to a folder in DBFS. Replace the variables as follows: \n* `<dbfs-path>`: the path to the folder where you want to save the credential file.\n* `<credential-file-contents>`: the contents of the credential file. This is not a path to the file, but the copied contents of the file. \nThe credential file contains JSON that defines three fields: `shareCredentialsVersion`, `endpoint`, and `bearerToken`. \n```\n%scala\ndbutils.fs.put(\"<dbfs-path>\/config.share\",\"\"\"\n<credential-file-contents>\n\"\"\")\n\n```\n6. Run the cell. \nAfter the credential file is uploaded, you can delete this cell. All workspace users can read the credential file from DBFS, and the credential file is available in DBFS on all clusters and SQL warehouses in your workspace. To delete the cell, click **x** in the cell actions menu ![Cell actions](https:\/\/docs.databricks.com\/_images\/cell-actions.png) at the far right. 
\n### Step 2: Use a notebook to list and read shared tables \nIn this step, you list the tables in the *share*, or set of shared tables and partitions, and you query a table. \n1. Using Python, list the tables in the share. \nIn a new cell, paste the following command. Replace `<dbfs-path>` with the path that was created in [Step 1: Store the credential file in DBFS (Python instructions)](https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html#store-creds). \nWhen the code runs, Python reads the credential file from DBFS on the cluster. Access data stored in DBFS at the path `\/dbfs\/`. \n```\nimport delta_sharing\n\nclient = delta_sharing.SharingClient(f\"\/dbfs\/<dbfs-path>\/config.share\")\n\nclient.list_all_tables()\n\n```\n2. Run the cell. \nThe result is an array of tables, along with metadata for each table. The following output shows two tables: \n```\nOut[10]: [Table(name='example_table', share='example_share_0', schema='default'), Table(name='other_example_table', share='example_share_0', schema='default')]\n\n``` \nIf the output is empty or doesn\u2019t contain the tables you expect, contact the data provider.\n3. Query a shared table. \n* **Using Scala**: \nIn a new cell, paste the following command. When the code runs, the credential file is read from DBFS through the JVM. \nReplace the variables as follows: \n+ `<profile-path>`: the DBFS path of the credential file. For example, `\/<dbfs-path>\/config.share`.\n+ `<share-name>`: the value of `share=` for the table.\n+ `<schema-name>`: the value of `schema=` for the table.\n+ `<table-name>`: the value of `name=` for the table.\n```\n%scala\nspark.read.format(\"deltaSharing\")\n.load(\"<profile-path>#<share-name>.<schema-name>.<table-name>\").limit(10);\n\n``` \nRun the cell. Each time you load the shared table, you see fresh data from the source.\n* **Using SQL**: \nTo query the data using SQL, you create a local table in the workspace from the shared table, then query the local table. 
The shared data is not stored or cached in the local table. Each time you query the local table, you see the current state of the shared data. \nIn a new cell, paste the following command. \nReplace the variables as follows: \n+ `<local-table-name>`: the name of the local table.\n+ `<profile-path>`: the location of the credential file.\n+ `<share-name>`: the value of `share=` for the table.\n+ `<schema-name>`: the value of `schema=` for the table.\n+ `<table-name>`: the value of `name=` for the table.\n```\n%sql\nDROP TABLE IF EXISTS <local-table-name>;\n\nCREATE TABLE <local-table-name> USING deltaSharing LOCATION \"<profile-path>#<share-name>.<schema-name>.<table-name>\";\n\nSELECT * FROM <local-table-name> LIMIT 10;\n\n``` \nWhen you run the command, the shared data is queried directly. As a test, the table is queried and the first 10 results are returned. If the output is empty or doesn\u2019t contain the data you expect, contact the data provider.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Read data shared using Delta Sharing open sharing (for recipients)\n#### Apache Spark: Read shared data\n\nFollow these steps to access shared data using Spark 3.x or above. \nThese instructions assume that you have access to the credential file that was shared by the data provider. See [Get access in the open sharing model](https:\/\/docs.databricks.com\/data-sharing\/recipient.html#get-access-open). \n### Install the Delta Sharing Python and Spark connectors \nTo access metadata related to the shared data, such as the list of tables shared with you, do the following. This example uses Python. \n1. Install the [delta-sharing Python connector](https:\/\/delta.io\/connectors\/): \n```\npip install delta-sharing\n\n```\n2. Install the [Apache Spark connector](https:\/\/github.com\/delta-io\/delta-sharing#apache-spark-connector). 
\n### List shared tables using Spark \nList the tables in the share. In the following example, replace `<profile-path>` with the location of the credential file. \n```\nimport delta_sharing\n\nclient = delta_sharing.SharingClient(f\"<profile-path>\/config.share\")\n\nclient.list_all_tables()\n\n``` \nThe result is an array of tables, along with metadata for each table. The following output shows two tables: \n```\nOut[10]: [Table(name='example_table', share='example_share_0', schema='default'), Table(name='other_example_table', share='example_share_0', schema='default')]\n\n``` \nIf the output is empty or doesn\u2019t contain the tables you expect, contact the data provider. \n### Access shared data using Spark \nRun the following, replacing these variables: \n* `<profile-path>`: the location of the credential file.\n* `<share-name>`: the value of `share=` for the table.\n* `<schema-name>`: the value of `schema=` for the table.\n* `<table-name>`: the value of `name=` for the table.\n* `<version-as-of>`: optional. The version of the table to load the data. Only works if the data provider shares the history of the table. Requires `delta-sharing-spark` 0.5.0 or above.\n* `<timestamp-as-of>`: optional. Load the data at the version before or at the given timestamp. Only works if the data provider shares the history of the table. Requires `delta-sharing-spark` 0.6.0 or above. 
\n```\ndelta_sharing.load_as_spark(f\"<profile-path>#<share-name>.<schema-name>.<table-name>\", version=<version-as-of>)\n\nspark.read.format(\"deltaSharing\")\\\n.option(\"versionAsOf\", <version-as-of>)\\\n.load(\"<profile-path>#<share-name>.<schema-name>.<table-name>\")\\\n.limit(10)\n\ndelta_sharing.load_as_spark(f\"<profile-path>#<share-name>.<schema-name>.<table-name>\", timestamp=<timestamp-as-of>)\n\nspark.read.format(\"deltaSharing\")\\\n.option(\"timestampAsOf\", <timestamp-as-of>)\\\n.load(\"<profile-path>#<share-name>.<schema-name>.<table-name>\")\\\n.limit(10)\n\n``` \nThe following Scala code performs the same reads, using the same variables: \n```\nspark.read.format(\"deltaSharing\")\n.option(\"versionAsOf\", <version-as-of>)\n.load(\"<profile-path>#<share-name>.<schema-name>.<table-name>\")\n.limit(10)\n\nspark.read.format(\"deltaSharing\")\n.option(\"timestampAsOf\", <timestamp-as-of>)\n.load(\"<profile-path>#<share-name>.<schema-name>.<table-name>\")\n.limit(10)\n\n``` \n### Access shared change data feed using Spark \nIf the table history has been shared with you and change data feed (CDF) is enabled on the source table, you can access the change data feed by running the following, replacing these variables. Requires `delta-sharing-spark` 0.5.0 or above. \nOne and only one start parameter must be provided. 
\n* `<profile-path>`: the location of the credential file.\n* `<share-name>`: the value of `share=` for the table.\n* `<schema-name>`: the value of `schema=` for the table.\n* `<table-name>`: the value of `name=` for the table.\n* `<starting-version>`: optional. The starting version of the query, inclusive. Specify as a Long.\n* `<ending-version>`: optional. The ending version of the query, inclusive. If the ending version is not provided, the API uses the latest table version.\n* `<starting-timestamp>`: optional. The starting timestamp of the query. This is converted to a version created at or after this timestamp. Specify as a string in the format `yyyy-mm-dd hh:mm:ss[.fffffffff]`.\n* `<ending-timestamp>`: optional. The ending timestamp of the query. This is converted to a version created at or before this timestamp. Specify as a string in the format `yyyy-mm-dd hh:mm:ss[.fffffffff]`. \n```\ndelta_sharing.load_table_changes_as_spark(f\"<profile-path>#<share-name>.<schema-name>.<table-name>\",\nstarting_version=<starting-version>,\nending_version=<ending-version>)\n\ndelta_sharing.load_table_changes_as_spark(f\"<profile-path>#<share-name>.<schema-name>.<table-name>\",\nstarting_timestamp=<starting-timestamp>,\nending_timestamp=<ending-timestamp>)\n\nspark.read.format(\"deltaSharing\").option(\"readChangeFeed\", \"true\")\\\n.option(\"startingVersion\", <starting-version>)\\\n.option(\"endingVersion\", <ending-version>)\\\n.load(\"<profile-path>#<share-name>.<schema-name>.<table-name>\")\n\nspark.read.format(\"deltaSharing\").option(\"readChangeFeed\", \"true\")\\\n.option(\"startingTimestamp\", <starting-timestamp>)\\\n.option(\"endingTimestamp\", <ending-timestamp>)\\\n.load(\"<profile-path>#<share-name>.<schema-name>.<table-name>\")\n\n``` \n```\nspark.read.format(\"deltaSharing\").option(\"readChangeFeed\", \"true\")\n.option(\"startingVersion\", <starting-version>)\n.option(\"endingVersion\", 
<ending-version>)\n.load(\"<profile-path>#<share-name>.<schema-name>.<table-name>\")\n\nspark.read.format(\"deltaSharing\").option(\"readChangeFeed\", \"true\")\n.option(\"startingTimestamp\", <starting-timestamp>)\n.option(\"endingTimestamp\", <ending-timestamp>)\n.load(\"<profile-path>#<share-name>.<schema-name>.<table-name>\")\n\n``` \nIf the output is empty or doesn\u2019t contain the data you expect, contact the data provider. \n### Access a shared table using Spark Structured Streaming \nIf the table history is shared with you, you can stream read the shared data. Requires `delta-sharing-spark` 0.6.0 or above. \nSupported options: \n* `ignoreDeletes`: Ignore transactions that delete data.\n* `ignoreChanges`: Re-process updates if files were rewritten in the source table due to a data changing operation such as `UPDATE`, `MERGE INTO`, `DELETE` (within partitions), or `OVERWRITE`. Unchanged rows can still be emitted. Therefore your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. `ignoreChanges` subsumes `ignoreDeletes`. Therefore if you use `ignoreChanges`, your stream will not be disrupted by either deletions or updates to the source table.\n* `startingVersion`: The shared table version to start from. All table changes starting from this version (inclusive) will be read by the streaming source.\n* `startingTimestamp`: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. Example: `\"2023-01-01 00:00:00.0\"`.\n* `maxFilesPerTrigger`: The number of new files to be considered in every micro-batch.\n* `maxBytesPerTrigger`: The amount of data that gets processed in each micro-batch. 
This option sets a \u201csoft max\u201d, meaning that a batch processes approximately this amount of data and may process more than the limit in order to make the streaming query move forward in cases when the smallest input unit is larger than this limit.\n* `readChangeFeed`: Stream read the change data feed of the shared table. \nUnsupported options: \n* `Trigger.availableNow` \n#### Sample Structured Streaming queries \n```\nspark.readStream.format(\"deltaSharing\")\n.option(\"startingVersion\", 0)\n.option(\"ignoreChanges\", true)\n.option(\"maxFilesPerTrigger\", 10)\n.load(\"<profile-path>#<share-name>.<schema-name>.<table-name>\")\n\n``` \n```\nspark.readStream.format(\"deltaSharing\")\\\n.option(\"startingVersion\", 0)\\\n.option(\"ignoreDeletes\", true)\\\n.option(\"maxBytesPerTrigger\", 10000)\\\n.load(\"<profile-path>#<share-name>.<schema-name>.<table-name>\")\n\n``` \nSee also [Streaming on Databricks](https:\/\/docs.databricks.com\/structured-streaming\/index.html). \n### Read tables with deletion vectors or column mapping enabled \nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nDeletion vectors are a storage optimization feature that your provider can enable on shared Delta tables. See [What are deletion vectors?](https:\/\/docs.databricks.com\/delta\/deletion-vectors.html). \nDatabricks also supports column mapping for Delta tables. See [Rename and drop columns with Delta Lake column mapping](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html). \nIf your provider shared a table with deletion vectors or column mapping enabled, you can read the table using compute that is running `delta-sharing-spark` 3.1 or above. If you are using Databricks clusters, you can perform batch reads using a cluster running Databricks Runtime 14.1 or above. CDF and streaming queries require Databricks Runtime 14.2 or above. 
\nYou can perform batch queries as-is, because they can automatically resolve `responseFormat` based on the table features of the shared table. \nTo read a change data feed (CDF) or to perform streaming queries on shared tables with deletion vectors or column mapping enabled, you must set the additional option `responseFormat=delta`. \nThe following examples show batch, CDF, and streaming queries: \n```\nimport org.apache.spark.sql.SparkSession\n\nval spark = SparkSession\n.builder()\n.appName(\"...\")\n.master(\"...\")\n.config(\"spark.sql.extensions\", \"io.delta.sql.DeltaSparkSessionExtension\")\n.config(\"spark.sql.catalog.spark_catalog\", \"org.apache.spark.sql.delta.catalog.DeltaCatalog\")\n.getOrCreate()\n\nval tablePath = \"<profile-file-path>#<share-name>.<schema-name>.<table-name>\"\n\n\/\/ Batch query\nspark.read.format(\"deltaSharing\").load(tablePath)\n\n\/\/ CDF query\nspark.read.format(\"deltaSharing\")\n.option(\"readChangeFeed\", \"true\")\n.option(\"responseFormat\", \"delta\")\n.option(\"startingVersion\", 1)\n.load(tablePath)\n\n\/\/ Streaming query\nspark.readStream.format(\"deltaSharing\").option(\"responseFormat\", \"delta\").load(tablePath)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Read data shared using Delta Sharing open sharing (for recipients)\n#### Pandas: Read shared data\n\nFollow these steps to access shared data in pandas 0.25.3 or above. \nThese instructions assume that you have access to the credential file that was shared by the data provider. See [Get access in the open sharing model](https:\/\/docs.databricks.com\/data-sharing\/recipient.html#get-access-open). \n### Install the Delta Sharing Python connector \nTo access metadata related to the shared data, such as the list of tables shared with you, you must install the [delta-sharing Python connector](https:\/\/delta.io\/connectors\/). 
\n```\npip install delta-sharing\n\n``` \n### List shared tables using pandas \nTo list the tables in the share, run the following, replacing `<profile-path>\/config.share` with the location of the credential file. \n```\nimport delta_sharing\n\nclient = delta_sharing.SharingClient(f\"<profile-path>\/config.share\")\n\nclient.list_all_tables()\n\n``` \nIf the output is empty or doesn\u2019t contain the tables you expect, contact the data provider. \n### Access shared data using pandas \nTo access shared data in pandas using Python, run the following, replacing the variables as follows: \n* `<profile-path>`: the location of the credential file.\n* `<share-name>`: the value of `share=` for the table.\n* `<schema-name>`: the value of `schema=` for the table.\n* `<table-name>`: the value of `name=` for the table. \n```\nimport delta_sharing\ndelta_sharing.load_as_pandas(f\"<profile-path>#<share-name>.<schema-name>.<table-name>\")\n\n``` \n### Access a shared change data feed using pandas \nTo access the change data feed for a shared table in pandas using Python, run the following, replacing the variables as follows. A change data feed may not be available, depending on whether the data provider shared the change data feed for the table. \n* `<starting-version>`: optional. The starting version of the query, inclusive.\n* `<ending-version>`: optional. The ending version of the query, inclusive.\n* `<starting-timestamp>`: optional. The starting timestamp of the query. This is converted to a version created at or after this timestamp.\n* `<ending-timestamp>`: optional. The ending timestamp of the query. This is converted to a version created at or before this timestamp. 
\n```\nimport delta_sharing\ndelta_sharing.load_table_changes_as_pandas(\nf\"<profile-path>#<share-name>.<schema-name>.<table-name>\",\nstarting_version=<starting-version>,\nending_version=<ending-version>)\n\ndelta_sharing.load_table_changes_as_pandas(\nf\"<profile-path>#<share-name>.<schema-name>.<table-name>\",\nstarting_timestamp=<starting-timestamp>,\nending_timestamp=<ending-timestamp>)\n\n``` \nIf the output is empty or doesn\u2019t contain the data you expect, contact the data provider.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Read data shared using Delta Sharing open sharing (for recipients)\n#### Power BI: Read shared data\n\nThe Power BI Delta Sharing connector allows you to discover, analyze, and visualize datasets shared with you through the Delta Sharing open protocol. \n### Requirements \n* Power BI Desktop 2.99.621.0 or above.\n* Access to the credential file that was shared by the data provider. See [Get access in the open sharing model](https:\/\/docs.databricks.com\/data-sharing\/recipient.html#get-access-open). \n### Connect to Databricks \nTo connect to Databricks using the Delta Sharing connector, do the following: \n1. Open the shared credential file with a text editor to retrieve the endpoint URL and the token.\n2. Open Power BI Desktop.\n3. On the **Get Data** menu, search for **Delta Sharing**.\n4. Select the connector and click **Connect**.\n5. Enter the endpoint URL that you copied from the credentials file into the **Delta Sharing Server URL** field.\n6. Optionally, in the **Advanced Options** tab, set a **Row Limit** for the maximum number of rows that you can download. This is set to 1 million rows by default.\n7. Click **OK**.\n8. For **Authentication**, copy the token that you retrieved from the credentials file into **Bearer Token**.\n9. Click **Connect**. 
\n### Limitations of the Power BI Delta Sharing connector \nThe Power BI Delta Sharing Connector has the following limitations: \n* The data that the connector loads must fit into the memory of your machine. To manage this requirement, the connector limits the number of imported rows to the **Row Limit** that you set under the Advanced Options tab in Power BI Desktop.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Read data shared using Delta Sharing open sharing (for recipients)\n#### Tableau: Read shared data\n\nThe Tableau Delta Sharing connector allows you to discover, analyze, and visualize datasets that are shared with you through the Delta Sharing open protocol. \n### Requirements \n* Tableau Desktop and Tableau Server 2024.1 or above\n* Access to the credential file that was shared by the data provider. See [Get access in the open sharing model](https:\/\/docs.databricks.com\/data-sharing\/recipient.html#get-access-open). \n### Connect to Databricks \nTo connect to Databricks using the Delta Sharing connector, do the following: \n1. Go to [Tableau Exchange](https:\/\/exchange.tableau.com\/products\/1019), follow the instructions to download the Delta Sharing Connector, and put it in an appropriate desktop folder.\n2. Open Tableau Desktop.\n3. On the **Connectors** page, search for \u201cDelta Sharing by Databricks\u201d.\n4. Select **Upload Share file**, and choose the credential file that was shared by the provider.\n5. Click **Get Data**.\n6. In the Data Explorer, select the table.\n7. Optionally add SQL filters or row limits.\n8. Click **Get Table Data**. \n### Limitations of the Tableau Delta Sharing connector \nThe Tableau Delta Sharing Connector has the following limitations: \n* The data that the connector loads must fit into the memory of your machine. 
To manage this requirement, the connector limits the number of imported rows to the row limit that you set in Tableau.\n* All columns are returned as type `String`.\n* SQL Filter only works if your Delta Sharing server supports [predicateHint](https:\/\/github.com\/delta-io\/delta-sharing\/blob\/main\/PROTOCOL.md#request-body).\n\n### Read data shared using Delta Sharing open sharing (for recipients)\n#### Request a new credential\n\nIf your credential activation URL or downloaded credential is lost, corrupted, or compromised, or your credential expires without your provider sending you a new one, contact your provider to request a new credential.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n### Common data loading patterns using COPY INTO\n##### Tutorial: COPY INTO with Spark SQL\n\nDatabricks recommends that you use the [COPY INTO](https:\/\/docs.databricks.com\/ingestion\/copy-into\/index.html) command for incremental and bulk data loading for data sources that contain thousands of files. Databricks recommends that you use [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html) for advanced use cases. \nIn this tutorial, you use the `COPY INTO` command to load data from cloud object storage into a table in your Databricks workspace.\n\n##### Tutorial: COPY INTO with Spark SQL\n###### Requirements\n\n1. A Databricks account, and a Databricks workspace in your account. To create these, see [Get started: Account and workspace setup](https:\/\/docs.databricks.com\/getting-started\/index.html).\n2. An all-purpose [cluster](https:\/\/docs.databricks.com\/compute\/index.html) in your workspace running Databricks Runtime 11.3 LTS or above. To create an all-purpose cluster, see [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n3. 
Familiarity with the Databricks workspace user interface. See [Navigate the workspace](https:\/\/docs.databricks.com\/workspace\/index.html).\n4. Familiarity working with [Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html).\n5. A location you can write data to; this demo uses the DBFS root as an example, but Databricks recommends an external storage location configured with Unity Catalog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/tutorial-notebook.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n### Common data loading patterns using COPY INTO\n##### Tutorial: COPY INTO with Spark SQL\n###### Step 1. Configure your environment and create a data generator\n\nThis tutorial assumes basic familiarity with Databricks and a default workspace configuration. If you are unable to run the code provided, contact your workspace administrator to make sure you have access to compute resources and a location to which you can write data. \nNote that the provided code uses a `source` parameter to specify the location you\u2019ll configure as your `COPY INTO` data source. As written, this code points to a location on DBFS root. If you have write permissions on an external object storage location, replace the `dbfs:\/` portion of the source string with the path to your object storage. Because this code block also does a recursive delete to reset this demo, make sure that you don\u2019t point this at production data and that you keep the `\/user\/{username}\/copy-into-demo` nested directory to avoid overwriting or deleting existing data. \n1. [Create a new SQL notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) and [attach it to a cluster](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) running Databricks Runtime 11.3 LTS or above.\n2. 
Copy and run the following code to reset the storage location and database used in this tutorial: \n```\n%python\n# Set parameters for isolation in workspace and reset demo\n\nusername = spark.sql(\"SELECT regexp_replace(current_user(), '[^a-zA-Z0-9]', '_')\").first()[0]\ndatabase = f\"copyinto_{username}_db\"\nsource = f\"dbfs:\/user\/{username}\/copy-into-demo\"\n\nspark.sql(f\"SET c.username='{username}'\")\nspark.sql(f\"SET c.database={database}\")\nspark.sql(f\"SET c.source='{source}'\")\n\nspark.sql(\"DROP DATABASE IF EXISTS ${c.database} CASCADE\")\nspark.sql(\"CREATE DATABASE ${c.database}\")\nspark.sql(\"USE ${c.database}\")\n\ndbutils.fs.rm(source, True)\n\n```\n3. Copy and run the following code to configure some tables and functions that will be used to randomly generate data: \n```\n-- Configure random data generator\n\nCREATE TABLE user_ping_raw\n(user_id STRING, ping INTEGER, time TIMESTAMP)\nUSING json\nLOCATION ${c.source};\n\nCREATE TABLE user_ids (user_id STRING);\n\nINSERT INTO user_ids VALUES\n(\"potato_luver\"),\n(\"beanbag_lyfe\"),\n(\"default_username\"),\n(\"the_king\"),\n(\"n00b\"),\n(\"frodo\"),\n(\"data_the_kid\"),\n(\"el_matador\"),\n(\"the_wiz\");\n\nCREATE FUNCTION get_ping()\nRETURNS INT\nRETURN int(rand() * 250);\n\nCREATE FUNCTION is_active()\nRETURNS BOOLEAN\nRETURN CASE\nWHEN rand() > .25 THEN true\nELSE false\nEND;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/tutorial-notebook.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n### Common data loading patterns using COPY INTO\n##### Tutorial: COPY INTO with Spark SQL\n###### Step 2: Write the sample data to cloud storage\n\nWriting to data formats other than Delta Lake is rare on Databricks. The code provided here writes to JSON, simulating an external system that might dump results from another system into object storage. \n1. 
Copy and run the following code to write a batch of raw JSON data: \n```\n-- Write a new batch of data to the data source\n\nINSERT INTO user_ping_raw\nSELECT *,\nget_ping() ping,\ncurrent_timestamp() time\nFROM user_ids\nWHERE is_active()=true;\n\n```\n\n##### Tutorial: COPY INTO with Spark SQL\n###### Step 3: Use COPY INTO to load JSON data idempotently\n\nYou must create a target Delta Lake table before you can use `COPY INTO`. In Databricks Runtime 11.3 LTS and above, you do not need to provide anything other than a table name in your `CREATE TABLE` statement. For previous versions of Databricks Runtime, you must provide a schema when creating an empty table. \n1. Copy and run the following code to create your target Delta table and load data from your source: \n```\n-- Create target table and load data\n\nCREATE TABLE IF NOT EXISTS user_ping_target;\n\nCOPY INTO user_ping_target\nFROM ${c.source}\nFILEFORMAT = JSON\nFORMAT_OPTIONS (\"mergeSchema\" = \"true\")\nCOPY_OPTIONS (\"mergeSchema\" = \"true\")\n\n``` \nBecause this action is idempotent, you can run it multiple times but data will only be loaded once.\n\n##### Tutorial: COPY INTO with Spark SQL\n###### Step 4: Preview the contents of your table\n\nYou can run a simple SQL query to manually review the contents of this table. \n1. Copy and execute the following code to preview your table: \n```\n-- Review updated table\n\nSELECT * FROM user_ping_target\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/tutorial-notebook.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n### Common data loading patterns using COPY INTO\n##### Tutorial: COPY INTO with Spark SQL\n###### Step 5: Load more data and preview results\n\nYou can re-run steps 2-4 many times to land new batches of random raw JSON data in your source, idempotently load them to Delta Lake with `COPY INTO`, and preview the results. 
Try running these steps out of order or multiple times to simulate multiple batches of raw data being written or executing `COPY INTO` multiple times without new data having arrived.\n\n##### Tutorial: COPY INTO with Spark SQL\n###### Step 6: Clean up tutorial\n\nWhen you are done with this tutorial, you can clean up the associated resources if you no longer want to keep them. \n1. Copy and run the following code to drop the database, tables, and remove all data: \n```\n%python\n# Drop database and tables and remove data\n\nspark.sql(\"DROP DATABASE IF EXISTS ${c.database} CASCADE\")\ndbutils.fs.rm(source, True)\n\n```\n2. To stop your compute resource, go to the **Clusters** tab and **Terminate** your cluster.\n\n##### Tutorial: COPY INTO with Spark SQL\n###### Additional resources\n\n* The [COPY INTO](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-copy-into.html) reference article\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/tutorial-notebook.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Tutorials: Implement ETL workflows with Delta Live Tables\n##### Programmatically create multiple tables\n\nYou can use Python with Delta Live Tables to programmatically create multiple tables to reduce code redundancy. \nYou might have pipelines containing multiple flows or dataset definitions that differ only by a small number of parameters. This redundancy results in pipelines that are error-prone and difficult to maintain. For example, the following diagram shows the graph of a pipeline that uses a fire department dataset to find neighborhoods with the fastest response times for different categories of emergency calls. In this example, the parallel flows differ by only a few parameters. 
\n![Fire dataset flow diagram](https:\/\/docs.databricks.com\/_images\/fire-dataset-flows.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/create-multiple-tables.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Tutorials: Implement ETL workflows with Delta Live Tables\n##### Programmatically create multiple tables\n###### Delta Live Tables metaprogramming with Python example\n\nNote \nThis example reads sample data included in the [Databricks datasets](https:\/\/docs.databricks.com\/discover\/databricks-datasets.html#dbfs-datasets). Because the Databricks datasets are not supported with a pipeline that publishes to Unity Catalog, this example works only with a pipeline configured to publish to the Hive metastore. However, this pattern also works with Unity Catalog enabled pipelines, but you must read data from [external locations](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html). To learn more about using Unity Catalog with Delta Live Tables, see [Use Unity Catalog with your Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html). \nYou can use a metaprogramming pattern to reduce the overhead of generating and maintaining redundant flow definitions. Metaprogramming in Delta Live Tables is done using Python inner functions. Because these functions are lazily evaluated, you can use them to create flows that are identical except for input parameters. Each invocation can include a different set of parameters that controls how each table should be generated, as shown in the following example. \nImportant \nBecause Python functions with Delta Live Tables decorators are invoked lazily, when creating datasets in a loop you must call a separate function to create the datasets to ensure correct parameter values are used. 
Failing to create datasets in a separate function results in multiple tables that use the parameters from the final execution of the loop. \nThe following example calls the `create_table()` function inside a loop to create tables `t1` and `t2`: \n```\nimport dlt\n\ndef create_table(name):\n    @dlt.table(name=name)\n    def t():\n        return spark.read.table(name)\n\ntables = [\"t1\", \"t2\"]\nfor t in tables:\n    create_table(t)\n\n``` \n```\nimport functools\n\nimport dlt\nfrom pyspark.sql.functions import *\n\n@dlt.table(\n    name=\"raw_fire_department\",\n    comment=\"raw table for fire department response\"\n)\n@dlt.expect_or_drop(\"valid_received\", \"received IS NOT NULL\")\n@dlt.expect_or_drop(\"valid_response\", \"responded IS NOT NULL\")\n@dlt.expect_or_drop(\"valid_neighborhood\", \"neighborhood != 'None'\")\ndef get_raw_fire_department():\n    return (\n        spark.read.format('csv')\n        .option('header', 'true')\n        .option('multiline', 'true')\n        .load('\/databricks-datasets\/timeseries\/Fires\/Fire_Department_Calls_for_Service.csv')\n        .withColumnRenamed('Call Type', 'call_type')\n        .withColumnRenamed('Received DtTm', 'received')\n        .withColumnRenamed('Response DtTm', 'responded')\n        .withColumnRenamed('Neighborhooods - Analysis Boundaries', 'neighborhood')\n        .select('call_type', 'received', 'responded', 'neighborhood')\n    )\n\nall_tables = []\n\ndef generate_tables(call_table, response_table, filter):\n    @dlt.table(\n        name=call_table,\n        comment=\"top level tables by call type\"\n    )\n    def create_call_table():\n        return (\n            spark.sql(\"\"\"\n                SELECT\n                    unix_timestamp(received,'M\/d\/yyyy h:m:s a') as ts_received,\n                    unix_timestamp(responded,'M\/d\/yyyy h:m:s a') as ts_responded,\n                    neighborhood\n                FROM LIVE.raw_fire_department\n                WHERE call_type = '{filter}'\n            \"\"\".format(filter=filter))\n        )\n\n    @dlt.table(\n        name=response_table,\n        comment=\"top 10 neighborhoods with fastest response time \"\n    )\n    def create_response_table():\n        return (\n            spark.sql(\"\"\"\n                SELECT\n                    neighborhood,\n                    AVG((ts_responded - ts_received)) as response_time\n                FROM LIVE.{call_table}\n                GROUP 
BY 1\n                ORDER BY response_time\n                LIMIT 10\n            \"\"\".format(call_table=call_table))\n        )\n\n    all_tables.append(response_table)\n\ngenerate_tables(\"alarms_table\", \"alarms_response\", \"Alarms\")\ngenerate_tables(\"fire_table\", \"fire_response\", \"Structure Fire\")\ngenerate_tables(\"medical_table\", \"medical_response\", \"Medical Incident\")\n\n@dlt.table(\n    name=\"best_neighborhoods\",\n    comment=\"which neighborhood appears in the best response time list the most\"\n)\ndef summary():\n    target_tables = [dlt.read(t) for t in all_tables]\n    unioned = functools.reduce(lambda x,y: x.union(y), target_tables)\n    return (\n        unioned.groupBy(col(\"neighborhood\"))\n        .agg(count(\"*\").alias(\"score\"))\n        .orderBy(desc(\"score\"))\n    )\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/create-multiple-tables.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### Prepare data for Foundation Model Training\n\nImportant \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). Reach out to your Databricks account team to enroll in the Public Preview. \nThis section describes the accepted training and evaluation data file formats for supported tasks: **supervised fine-tuning**, **chat completion**, and **continued pre-training**. \nThe following notebook shows how to validate your data. It is designed to be run independently before you begin training. The purpose of this notebook is to validate that your data is in the correct format for Foundation Model Training. 
It also includes code to tokenize your raw dataset to help you estimate costs during your training run.\n\n#### Prepare data for Foundation Model Training\n##### Validate data for training runs notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/large-language-models\/validate-data-estimate-tokens.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/data-preparation.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### Prepare data for Foundation Model Training\n##### Prepare data for supervised training\n\nFor [supervised training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html#tasks) tasks, the training data can be in one of the following schemas: \n* Prompt and response pairs. \n```\n{\"prompt\": \"your-custom-prompt\", \"response\": \"your-custom-response\"}\n\n```\n* Prompt and completion pairs. \n```\n{\"prompt\": \"your-custom-prompt\", \"completion\": \"your-custom-response\"}\n\n``` \nNote \nPrompt-response and prompt-completion are **not** [templated](https:\/\/huggingface.co\/docs\/transformers\/main\/en\/chat_templating), so any model-specific templating, such as Mistral\u2019s [instruct formatting](https:\/\/huggingface.co\/mistralai\/Mistral-7B-Instruct-v0.2#instruction-format) must be performed as a preprocessing step. \nAccepted data formats are: \n* A Unity Catalog Volume with a `.jsonl` file. The training data must be in JSONL format, where each line is a valid JSON object.\n* A Delta table that adheres to one of the accepted schemas mentioned above. For Delta tables, you must provide a `data_prep_cluster_id` parameter for data processing. 
See [Configure a training run](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/create-fine-tune-run.html#configure).\n* A public Hugging Face dataset. \nIf you use a public Hugging Face dataset as your training data, specify the full path with the split, for example, `mosaicml\/instruct-v3\/train` and `mosaicml\/instruct-v3\/test`. This accounts for datasets that have different split schemas. Nested datasets from Hugging Face are not supported. \nFor a more extensive example, see the `mosaicml\/dolly_hhrlhf` dataset on Hugging Face. \nThe following example rows of data are from the `mosaicml\/dolly_hhrlhf` dataset. \n```\n{\"prompt\": \"Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is Kubernetes? ### Response: \",\"response\": \"Kubernetes is an open source container orchestration system for automating software deployment, scaling, and management. Originally designed by Google, the project is now maintained by the Cloud Native Computing Foundation.\"}\n{\"prompt\": \"Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Van Halen famously banned what color M&Ms in their rider? ### Response: \",\"response\": \"Brown.\"}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/data-preparation.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### Prepare data for Foundation Model Training\n##### Prepare data for chat completion\n\nFor chat completion tasks, chat-formatted data must be in a `.jsonl` file, where each line is a separate JSON object representing a single chat session. Each chat session is represented as a JSON object with a single key, `\"messages\"`, that maps to an array of message objects. 
To train on chat data, set `task_type = 'CHAT_COMPLETION'`. \nMessages in chat format are automatically formatted according to the model\u2019s [chat template](https:\/\/huggingface.co\/docs\/transformers\/main\/en\/chat_templating), so there is no need to manually add special chat tokens to signal the beginning or end of a chat turn. An example of a model that uses a custom chat template is [Mistral-instruct](https:\/\/huggingface.co\/mistralai\/Mistral-7B-Instruct-v0.1#instruction-format). \nEach message object in the array represents a single message in the conversation and has the following structure: \n* `role`: A string indicating the author of the message. Possible values are `\"system\"`, `\"user\"`, and `\"assistant\"`. If the role is `\"system\"`, it must be the first message in the messages list. There must be at least one message with the role `\"assistant\"`, and any messages after the (optional) system prompt must alternate roles between user\/assistant. There must not be two adjacent messages with the same role. The last message in the `\"messages\"` array must have the role `\"assistant\"`.\n* `content`: A string containing the text of the message. \nThe following is a chat-formatted data example: \n```\n{\"messages\": [\n{\"role\": \"system\", \"content\": \"A conversation between a user and a helpful assistant.\"},\n{\"role\": \"user\", \"content\": \"Hi there. What's the capital of the moon?\"},\n{\"role\": \"assistant\", \"content\": \"This question doesn't make sense as nobody currently lives on the moon, meaning it would have no government or political institutions. 
Furthermore, international treaties prohibit any nation from asserting sovereignty over the moon and other celestial bodies.\"}\n]\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/data-preparation.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### Prepare data for Foundation Model Training\n##### Prepare data for continued pre-training\n\nFor [continued pre-training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html#tasks) tasks, the training data is your unstructured text data. The training data must be in a Unity Catalog Volume containing `.txt` files. Each `.txt` file is treated as a single sample. If your `.txt` files are in a Unity Catalog Volume folder, those files are also included in your training data. Any non-`txt` files in the Volume are ignored.\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/data-preparation.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Cluster libraries\n\nCluster libraries can be used by all notebooks and jobs running on a cluster. This article details using the **Install library** UI in the Databricks workspace. \nNote \nIf you create compute using a policy that enforces library installations, you can\u2019t install or uninstall libraries on your compute. Workspace admins control all library management at the policy level. \nYou can install libraries to a cluster using the following approaches: \n* Install a library for use with a specific cluster only.\n* Install a library with the REST API. See the [Libraries API](https:\/\/docs.databricks.com\/api\/workspace\/libraries\/install).\n* Install a library with Databricks CLI. See [What is the Databricks CLI?](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html).\n* Install a library using Terraform. 
See [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_library](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/library). \n* Install a library by creating a cluster with a policy that defines library installations. See [Add libraries to a policy](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html#libraries).\n* (Not recommended) Install a library using an init script that runs at cluster creation time. See [Install a library with an init script (legacy)](https:\/\/docs.databricks.com\/archive\/compute\/libraries-init-scripts.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Cluster libraries\n##### Install a library on a cluster\n\nTo install a library on a cluster: \n1. Click ![compute icon](https:\/\/docs.databricks.com\/_images\/clusters-icon.png) **Compute** in the sidebar.\n2. Click a cluster name.\n3. Click the **Libraries** tab.\n4. Click **Install New**.\n5. The **Install library** dialog displays.\n6. Select one of the **Library Source** options, complete the instructions that appear, and then click **Install**. \nImportant \nLibraries can be installed from DBFS when using Databricks Runtime 14.3 LTS and below. However, any workspace user can modify library files stored in DBFS. To improve the security of libraries in a Databricks workspace, storing library files in the DBFS root is deprecated and disabled by default in Databricks Runtime 15.0 and above. See [Storing libraries in DBFS root is deprecated and disabled by default](https:\/\/docs.databricks.com\/release-notes\/runtime\/15.0.html#libraries-dbfs-deprecation). 
\nInstead, Databricks [recommends](https:\/\/docs.databricks.com\/libraries\/index.html#recommendations) uploading all libraries, including Python libraries, JAR files, and Spark connectors, to workspace files or Unity Catalog volumes, or using library package repositories. If your workload does not support these patterns, you can also use libraries stored in cloud object storage. \nNot all cluster access modes support all library configurations. See [Cluster-scoped libraries](https:\/\/docs.databricks.com\/libraries\/index.html#compatibility). \n| Library source | Instructions |\n| --- | --- |\n| **Workspace** | Select a workspace file or upload a Whl, zipped wheelhouse, JAR, ZIP, tar, or requirements.txt file. See [Install libraries from workspace files](https:\/\/docs.databricks.com\/libraries\/workspace-files-libraries.html) |\n| **Volumes** | Select a Whl, JAR, or requirements.txt file from a volume. See [Install libraries from a volume](https:\/\/docs.databricks.com\/libraries\/volume-libraries.html). |\n| **File Path\/S3** | Select the library type and provide the full URI to the library object (for example: `\/Workspace\/path\/to\/library.whl`, `\/Volumes\/path\/to\/library.whl`, or `s3:\/\/bucket-name\/path\/to\/library.whl`). See [Install libraries from object storage](https:\/\/docs.databricks.com\/libraries\/object-storage-libraries.html). |\n| **PyPI** | Enter a PyPI package name. See [PyPI package](https:\/\/docs.databricks.com\/libraries\/package-repositories.html#pypi-libraries). |\n| **Maven** | Specify a Maven coordinate. See [Maven or Spark package](https:\/\/docs.databricks.com\/libraries\/package-repositories.html#maven-libraries). |\n| **CRAN** | Enter the name of a package. See [CRAN package](https:\/\/docs.databricks.com\/libraries\/package-repositories.html#cran-libraries). |\n| **DBFS** (Not recommended) | Load a JAR or Whl file to the DBFS root. This is not recommended, as files stored in DBFS can be modified by any workspace user. 
| \nWhen you install a library on a cluster, a notebook already attached to that cluster will not immediately see the new library. You must first [detach](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#detach) and then [reattach](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) the notebook to the cluster.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Cluster libraries\n##### Install a library using a policy\n\nIf you create a cluster using a policy that enforces library installation, specified libraries automatically install on your cluster. You cannot install additional libraries or uninstall any libraries. \nWorkspace admins can add libraries to policies, allowing them to manage and enforce library installations on all compute that uses the policy. For admin instructions, see [Add libraries to a policy](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html#libraries).\n\n#### Cluster libraries\n##### Uninstall a library from a cluster\n\nNote \nWhen you uninstall a library from a cluster, the library is removed only when you restart the cluster. Until you restart the cluster, the status of the uninstalled library appears as **Uninstall pending restart**. \nTo uninstall a library, you can use the cluster UI: \n1. Click ![compute icon](https:\/\/docs.databricks.com\/_images\/clusters-icon.png) **Compute** in the sidebar.\n2. Click a cluster name.\n3. Click the **Libraries** tab.\n4. Select the checkbox next to the library you want to uninstall, click **Uninstall**, then **Confirm**. The Status changes to **Uninstall pending restart**. \nClick **Restart** and **Confirm** to uninstall the library. The library is removed from the cluster\u2019s Libraries tab.\n\n#### Cluster libraries\n##### View the libraries installed on a cluster\n\n1. 
Click ![compute icon](https:\/\/docs.databricks.com\/_images\/clusters-icon.png) **Compute** in the sidebar.\n2. Click the cluster name.\n3. Click the **Libraries** tab. For each library, the tab displays the name and version, type, [install status](https:\/\/docs.databricks.com\/api\/workspace\/libraries), and, if uploaded, the source file.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Cluster libraries\n##### Update a cluster-installed library\n\nTo update a cluster-installed library, uninstall the old version of the library and install a new version. \nNote \nRequirements.txt files do not require uninstalling and restarting. If you have modified the contents of a requirements.txt file, you can simply reinstall it to update the contents of the installed file.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n### PySpark on Databricks\n##### What is a PySpark DataSource?\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html) in Databricks Runtime 15.2 and above. \nA PySpark DataSource is created by the [Python (PySpark) DataSource API](https:\/\/github.com\/apache\/spark\/blob\/c88fabfee41df1ca4729058450ec6f798641c936\/python\/docs\/source\/user_guide\/sql\/python_data_source.rst), which enables reading from custom data sources and writing to custom data sinks in Apache Spark using Python. 
You can use PySpark DataSources to define custom connections to data systems and implement additional functionality, to build out reusable data sources.\n\n##### What is a PySpark DataSource?\n###### DataSource class\n\nThe PySpark [DataSource](https:\/\/github.com\/apache\/spark\/blob\/0d7c07047a628bd42eb53eb49935f5e3f81ea1a1\/python\/pyspark\/sql\/datasource.py) is a base class that provides methods to create data readers and writers. In addition to defining `name` and `schema`, either `DataSource.reader` or `DataSource.writer` must be implemented by any subclass to make the data source either readable or writable, or both. After implementing this interface, register it, then load or save your data source using the following syntax: \n```\n# Register the data source\nspark.dataSource.register(<DataSourceClass>)\n\n# Read from a custom data source\nspark.read.format(<datasource-name>).load()\n\n# Write to a custom data source\ndf.write.format(<datasource-name>).save()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/pyspark\/datasources.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n### PySpark on Databricks\n##### What is a PySpark DataSource?\n###### Create a PySpark DataSource for batch query\n\nTo demonstrate PySpark DataSource reader capabilities, create a data source that generates example data, generated using the `faker` Python package. For more information about `faker`, see the [Faker documentation](https:\/\/faker.readthedocs.io\/en\/master\/). \n### Step 1: Install dependencies \nDepending on your particular custom data source scenario, you may need to install one or more dependencies. In this example, install the `faker` package using the following command: \n```\n%pip install faker\n\n``` \n### Step 2: Define the DataSource \nNext, define your new PySpark DataSource as a subclass of `DataSource`, with a name, schema, and reader. The `reader()` method must be defined to read from a data source in a batch query. 
\n```\n\nfrom pyspark.sql.datasource import DataSource, DataSourceReader\nfrom pyspark.sql.types import StructType\n\nclass FakeDataSource(DataSource):\n    \"\"\"\n    An example data source for batch query using the `faker` library.\n    \"\"\"\n\n    @classmethod\n    def name(cls):\n        return \"fake\"\n\n    def schema(self):\n        return \"name string, date string, zipcode string, state string\"\n\n    def reader(self, schema: StructType):\n        return FakeDataSourceReader(schema, self.options)\n\n``` \n### Step 3: Implement the reader for a batch query \nNext, implement the reader logic to generate example data. Use the installed `faker` library to populate each field in the schema. \n```\nclass FakeDataSourceReader(DataSourceReader):\n\n    def __init__(self, schema, options):\n        self.schema: StructType = schema\n        self.options = options\n\n    def read(self, partition):\n        # Library imports must be within the read method\n        from faker import Faker\n        fake = Faker()\n\n        # Every value in this `self.options` dictionary is a string.\n        num_rows = int(self.options.get(\"numRows\", 3))\n        for _ in range(num_rows):\n            row = []\n            for field in self.schema.fields:\n                value = getattr(fake, field.name)()\n                row.append(value)\n            yield tuple(row)\n\n``` \n### Step 4: Register and use the example data source \nTo use the data source, register it. By default the `FakeDataSource` has three rows, and the default schema includes these `string` fields: `name`, `date`, `zipcode`, `state`. 
The following example registers, loads, and outputs an example data source with the defaults: \n```\nspark.dataSource.register(FakeDataSource)\nspark.read.format(\"fake\").load().show()\n\n``` \n```\n+-----------------+----------+-------+----------+\n| name| date|zipcode| state|\n+-----------------+----------+-------+----------+\n|Christine Sampson|1979-04-24| 79766| Colorado|\n| Shelby Cox|2011-08-05| 24596| Florida|\n| Amanda Robinson|2019-01-06| 57395|Washington|\n+-----------------+----------+-------+----------+\n\n``` \nOnly `string` fields are supported, but you can specify a schema with any fields that correspond to `faker` package providers\u2019 fields to generate random data for testing and development. The following example loads a data source with `name` and `company` fields: \n```\nspark.read.format(\"fake\").schema(\"name string, company string\").load().show()\n\n``` \n```\n+---------------------+--------------+\n|name |company |\n+---------------------+--------------+\n|Tanner Brennan |Adams Group |\n|Leslie Maxwell |Santiago Group|\n|Mrs. Jacqueline Brown|Maynard Inc |\n+---------------------+--------------+\n\n``` \nTo load a data source with a custom number of rows, specify the `numRows` option. 
The following example specifies 5 rows: \n```\nspark.read.format(\"fake\").option(\"numRows\", 5).load().show()\n\n``` \n```\n+--------------+----------+-------+------------+\n| name| date|zipcode| state|\n+--------------+----------+-------+------------+\n| Pam Mitchell|1988-10-20| 23788| Tennessee|\n|Melissa Turner|1996-06-14| 30851| Nevada|\n| Brian Ramsey|2021-08-21| 55277| Washington|\n| Caitlin Reed|1983-06-22| 89813|Pennsylvania|\n| Douglas James|2007-01-18| 46226| Alabama|\n+--------------+----------+-------+------------+\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/pyspark\/datasources.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n### PySpark on Databricks\n##### What is a PySpark DataSource?\n###### Troubleshooting\n\nIf the output is the following error, your compute does not support PySpark DataSources. You must use Databricks Runtime 15.2 or above. \n`Error: [UNSUPPORTED_FEATURE.PYTHON_DATA_SOURCE] The feature is not supported: Python data sources. SQLSTATE: 0A000`\n\n","doc_uri":"https:\/\/docs.databricks.com\/pyspark\/datasources.html"} +{"content":"# What is data warehousing on Databricks?\n### Write queries and explore data in the SQL Editor\n\nThe Databricks UI includes a SQL editor that you can use to author queries, browse available data, and create visualizations. You can also share your saved queries with other team members in the workspace. \n![SQL editor UI](https:\/\/docs.databricks.com\/_images\/query-editor.png) \nAfter opening the editor, you can author a SQL query or browse the available data. The text editor supports autocomplete, autoformatting, and various other keyboard shortcuts. \nYou can open multiple queries using the query tabs at the top of the text editor. Each query tab has controls for running the query, marking the query as a favorite, and connecting to a SQL warehouse. 
You can also **Save**, **Schedule**, or **Share** queries.\n\n### Write queries and explore data in the SQL Editor\n#### Connect to compute\n\nYou must have at least CAN USE permissions on a running SQL Warehouse to run queries. You can use the drop-down near the top of the editor to see available options. To filter the list, enter text in the text box. \n![SQL Warehouse selector](https:\/\/docs.databricks.com\/_images\/sql-warehouse-selector.png) \nThe first time you create a query, the list of available SQL warehouses appears alphabetically. The last used SQL warehouse is selected the next time you create a query. \nThe icon next to the SQL warehouse indicates the status: \n* Running ![Running](https:\/\/docs.databricks.com\/_images\/endpoint-running.png)\n* Starting ![Starting](https:\/\/docs.databricks.com\/_images\/endpoint-starting.png)\n* Stopped ![Stopped](https:\/\/docs.databricks.com\/_images\/endpoint-stopped.png) \nNote \nIf there are no SQL warehouses in the list, contact your workspace administrator. \nThe selected SQL Warehouse will restart automatically when you run your query. See [Start a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html#start) to learn other ways to start a SQL warehouse.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Write queries and explore data in the SQL Editor\n#### Browse data objects in SQL editor\n\nIf you have metadata read permission, the schema browser in the SQL editor shows the available databases and tables. You can also browse data objects from [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html). 
\n![The schema browser showing the samples catalog, nyctaxi database, trips table, and the columns in that table.](https:\/\/docs.databricks.com\/_images\/schema-browser.png) \nYou can navigate Unity Catalog-governed database objects in Catalog Explorer without active compute. To explore data in the `hive_metastore` and other catalogs not governed by Unity Catalog, you must attach to compute with appropriate privileges. See [Data governance with Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/index.html). \nNote \nIf no data objects exist in the schema browser or Catalog Explorer, contact your workspace administrator. \nClick ![Refresh Schema Icon](https:\/\/docs.databricks.com\/_images\/refresh-schema-icon.png) near the top of the schema browser to refresh the schema. You can filter the schema by typing filter strings in the search box. \nClick a table name to show the columns for that table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Write queries and explore data in the SQL Editor\n#### Create a query\n\nYou can enter text to create a query in the SQL editor. You can insert elements from the schema browser to reference catalogs and tables. \n* Type your query in the SQL editor. \nThe SQL editor supports autocomplete. As you type, autocomplete suggests completions. For example, if a valid completion at the cursor location is a column, autocomplete suggests a column name. If you type `select * from table_name as t where t.`, autocomplete recognizes that `t` is an alias for `table_name` and suggests the columns inside `table_name`. \n![Autocomplete alias](https:\/\/docs.databricks.com\/_images\/autocomplete-alias.png)\n* (Optional) When you are done editing, click **Save**. 
\n### Turn autocomplete on and off \nLive autocomplete can complete schema tokens, query syntax identifiers (like `SELECT` and `JOIN`), and the titles of [query snippets](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-snippets.html). It\u2019s enabled by default unless your database schema exceeds five thousand tokens (tables or columns). \nUse the toggle beneath the SQL Editor to turn live autocomplete off or on. \n* To turn off live autocomplete, press **Ctrl + Space** or click the ![Auto Complete Enabled](https:\/\/docs.databricks.com\/_images\/auto-complete-enabled.png) button beneath the SQL editor.\n\n### Write queries and explore data in the SQL Editor\n#### Save queries\n\nThe **Save** button near the top-right of the SQL editor saves your query. \nImportant \nWhen you modify a query but don\u2019t explicitly click **Save**, that state is retained as a query draft. Query drafts are retained for 30 days. After 30 days, query drafts are automatically deleted. To retain your changes, you must explicitly save them.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Write queries and explore data in the SQL Editor\n#### Edit multiple queries\n\nBy default, the SQL editor uses tabs so you can edit multiple queries simultaneously. To open a new tab, click **+**, then select **Create new query** or **Open existing query**. Click **Open existing query** to see your list of saved queries. Click **My Queries** or **Favorites** to filter the list of queries. In the row containing the query you want to view, click **Open**. \n![Queries Dialog](https:\/\/docs.databricks.com\/_images\/queries-dialog.png)\n\n### Write queries and explore data in the SQL Editor\n#### Run a single query or multiple queries\n\nTo run a query or all queries: \n1. Select a SQL warehouse.\n2. Highlight a query in the SQL editor (if multiple queries are in the query pane).\n3. 
Press **Ctrl\/Cmd + Enter** or click **Run (1000)** to display the results as a table in the results pane. \n![Query result](https:\/\/docs.databricks.com\/_images\/query-result.png) \nNote \n**Limit 1000** is selected by default for all queries to limit query results to 1000 rows. If a query is saved with the **Limit 1000** setting, this setting applies to all query runs (including in dashboards). To return all rows for this query, you can deselect **Limit 1000** by clicking the **Run (1000)** drop-down. If you want to specify a different limit on the number of rows, you can add a `LIMIT` clause in your query with a value of your choice.\n\n### Write queries and explore data in the SQL Editor\n#### Terminate a query\n\nTo terminate a query while it is executing, click **Cancel**. An administrator can stop an executing query that another user started. See [Terminate an executing query](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html#terminate-an-executing-query).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Write queries and explore data in the SQL Editor\n#### Query options\n\nYou can use the ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) kebab context menu near the top of the query editor to access menu options to clone, revert, format, and edit query information. \n### Revert to saved query \nWhen you edit a query, a **Revert changes** option appears in the context menu for the query. You can click **Revert** to go back to your saved version. \n### Discarding and restoring queries \nTo move a query to trash: \n* Click the kebab context menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) next to the query in the SQL editor and select **Move to Trash**.\n* Click **Move to trash** to confirm. \nTo restore a query from trash: \n1. 
In the All Queries list, click ![Trash](https:\/\/docs.databricks.com\/_images\/trash-icon1.png).\n2. Click a query.\n3. Click the kebab context menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) at the top-right of the SQL editor and click **Restore**. \n### Set query description and view query info \nTo set a query description: \n1. Click the ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) kebab context menu next to the query and click **Edit query info**. \n![Context menu](https:\/\/docs.databricks.com\/_images\/query-context-menu.png)\n2. In the **Description** text box, enter your description. Then, click **Save**.\nYou can also view the history of the query, including when it was created and updated, in this dialog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Write queries and explore data in the SQL Editor\n#### Favorite and tag queries\n\nYou can use favorites and tags to filter the lists of queries and dashboards displayed on your workspace landing page, and on each of the listing pages for dashboards and queries. \n**Favorites**: To favorite a query, click the star to the left of its title in the Queries list. The star will turn yellow. \n**Tags**: You can tag queries and dashboards with any string that is meaningful to your organization. \n### Add a tag \nAdd tags in the query editor. \n1. Click the ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) kebab context menu next to the query and click **Edit query info**. A **Query info** dialog appears.\n2. If the query has no tags applied, **Add some tags** shows in the text box where tags will appear. To create a new tag, type it into the box. To enter multiple tags, press tab between entries. \n![Add tags](https:\/\/docs.databricks.com\/_images\/add-tag.png)\n3. 
Click **Save** to apply the tags and close the dialog. \n### Remove tags \n1. Click the ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) kebab context menu next to the query and click **Edit query info**.\n2. Click **X** on any tag you want to remove.\n3. Click **Save** to close the dialog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Write queries and explore data in the SQL Editor\n#### View query results\n\nAfter a query runs, the results appear in the pane below it. The **New result table** is **ON** for new queries. If necessary, click the drop-down to turn it off. The images in this section use the new result table. \nYou can interact with and explore your query results using the result pane. The result pane includes the following features for exploring results: \n### Visualizations, filters, and parameters \nClick the ![Plus Sign Icon](https:\/\/docs.databricks.com\/_images\/plus-sign-icon.png) to add a visualization, filter, or parameter. The following options appear: \n![Available options are shown.](https:\/\/docs.databricks.com\/_images\/explore-results-options.png) \n**Visualization**: Visualizations can help explore the result set. See [Visualization types](https:\/\/docs.databricks.com\/visualizations\/visualization-types.html) for a complete list of available visualization types. \n**Filter**: Filters allow you to limit the result set after a query has run. You can apply filters to selectively show different subsets of the data. See [Query filters](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-filters.html) to learn how to use filters. \n**Parameter**: Parameters allow you to limit the result set by substituting values into a query at runtime. See [Query parameters](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html) to learn how to apply parameters. 
\n### Edit, download, or add to a dashboard \nClick the ![Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret.png) in a results tab to view more options. \n![Options to customize, download results, and add to dashboards.](https:\/\/docs.databricks.com\/_images\/sqle-vis-options.png) \n1. Click **Edit** to customize the results shown in the visualization.\n2. Click **Delete** to delete the results tab.\n3. Click **Duplicate** to clone the results tab.\n4. Click **Add to dashboard** to copy the query and visualization to a new dashboard. \n* This action creates a new dashboard that includes all the visualizations associated with the query. See [Dashboards](https:\/\/docs.databricks.com\/dashboards\/index.html) to learn how to edit your dashboard.\n* You are prompted to choose a name for the new dashboard. The new dashboard is saved to your home folder.\n* You cannot add results to an existing dashboard.\n5. Click **Add to legacy dashboard** to add the results tab to a new or existing legacy dashboard.\n6. Click any of the download options to download the results. See the following description for details and limits. \n**Download results**: You can download results as a CSV, TSV, or Excel file. \nYou can download up to approximately 1GB of results data from Databricks SQL in CSV and TSV format and up to 100,000 rows to an Excel file. \nThe final file download size might be slightly more or less than 1GB, as the 1GB limit is applied to an earlier step than the final file download. \nNote \nIf you cannot download a query, your workspace administrator has disabled download for your workspace.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Write queries and explore data in the SQL Editor\n#### Past executions\n\nYou can view previous runs for the query, including the complete query syntax. 
Past executions open in read-only mode and include buttons to **Clone to new query** or **Resume editing**. This tab does not show [scheduled runs](https:\/\/docs.databricks.com\/sql\/user\/queries\/schedule-query.html). \n![The record shows each time the query has run, including the specific query syntax.](https:\/\/docs.databricks.com\/_images\/past-runs.png)\n\n### Write queries and explore data in the SQL Editor\n#### Explore results\n\nReturned query results appear below the query. The **Raw results** tab populates with the returned rows. You can use built-in filters to reorder the results by ascending or descending values. You can also use the filter to search for result rows that include a specific value. \n![Filter results with search](https:\/\/docs.databricks.com\/_images\/query-result-filter.gif) \nYou can use tabs in the result pane to add visualizations, filters, and parameters. \n![Scatter plot visualization of data with options to create a new visualization, filter, or parameters.](https:\/\/docs.databricks.com\/_images\/query-visualization.png)\n\n### Write queries and explore data in the SQL Editor\n#### Filter the list of saved queries in the queries window\n\nIn the queries window, you can filter the list of all queries by the list of queries you have created (**My Queries**), by favorites, and by tags.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Write queries and explore data in the SQL Editor\n#### Automate updates\n\nYou can use the **Schedule** button to set an automatic cadence for query runs. Automatic updates can help keep your dashboards and reports up-to-date with the most current data. Scheduled queries can also enable Databricks SQL alerts, a special type of scheduled task that sends notifications when a value reaches a specified threshold. \nSee [Schedule a query](https:\/\/docs.databricks.com\/sql\/user\/queries\/schedule-query.html). 
\nSee [What are Databricks SQL alerts?](https:\/\/docs.databricks.com\/sql\/user\/alerts\/index.html).\n\n### Write queries and explore data in the SQL Editor\n#### Share queries\n\nThe **Share** button lets you share your query with other users in your workspace. When sharing, choose between the following options: \n* **Run as owner (owner\u2019s credentials)**: This setting means that viewers are able to see the same query results as the query owner. This applies to scheduled or manual query runs.\n* **Run as viewer (viewer\u2019s credentials)**: This setting limits results to the viewer\u2019s assigned permissions. \nSee [Configure query permissions](https:\/\/docs.databricks.com\/sql\/user\/queries\/index.html#share).\n\n### Write queries and explore data in the SQL Editor\n#### Next step\n\nSee [Access and manage saved queries](https:\/\/docs.databricks.com\/sql\/user\/queries\/index.html) to learn how to work with queries in the Databricks UI.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use time series feature tables with point-in-time support\n\nThe data used to train a model often has time dependencies built into it. For example, if you are training a model to predict which machines on a factory floor need maintenance, you might have historical datasets that contain sensor measurements and usage data for many machines, along with target labels that indicate if the machine needed service or not. The dataset might contain data for machines both before and after a maintenance service was performed. \nWhen you build the model, you must consider only feature values up until the time of the observed target value (needs service or does not need service). 
If you do not explicitly take into account the timestamp of each observation, you might inadvertently use feature values measured after the timestamp of the target value for training. This is called \u201cdata leakage\u201d and can negatively affect the model\u2019s performance. \nTime series feature tables include a timestamp key column that ensures that each row in the training dataset represents the latest known feature values as of the row\u2019s timestamp. You should use time series feature tables whenever feature values change over time, for example with time series data, event-based data, or time-aggregated data. \nNote \n* With Databricks Runtime 13.3 LTS and above, any Delta table in Unity Catalog with primary keys and timestamp keys can be used as a time series feature table. We recommend applying [Z-Ordering](https:\/\/docs.databricks.com\/delta\/data-skipping.html) on time series tables for better performance in point-in-time lookups.\n* Point-in-time lookup functionality is sometimes referred to as \u201ctime travel\u201d. The point-in-time functionality in Databricks Feature Store is not related to [Delta Lake time travel](https:\/\/docs.databricks.com\/delta\/history.html).\n* To use point-in-time functionality, you must specify time-related keys using the `timeseries_columns` argument (for Feature Engineering in Unity Catalog) or the `timestamp_keys` argument (for Workspace Feature Store). This indicates that feature table rows should be joined by matching the most recent value for a particular primary key that is not later than the value of the `timestamp_keys` column, instead of joining based on an exact time match. If you only designate a timeseries column as a primary key column, feature store does not apply point-in-time logic to the timeseries column during joins. 
Instead, it matches only rows with an exact time match, rather than all rows prior to the timestamp.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use time series feature tables with point-in-time support\n##### How time series feature tables work\n\nSuppose you have the following feature tables. This data is taken from the [example notebook](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html#example). \nThe tables contain sensor data measuring the temperature, relative humidity, ambient light, and carbon dioxide in a room. The ground truth table indicates if a person was present in the room. Each of the tables has a primary key (\u2018room\u2019) and a timestamp key (\u2018ts\u2019). For simplicity, only data for a single value of the primary key (\u20180\u2019) is shown. \n![example feature table data](https:\/\/docs.databricks.com\/_images\/feature-tables.png) \nThe following figure illustrates how the timestamp key is used to ensure point-in-time correctness in a training dataset. Feature values are matched based on the primary key (not shown in the diagram) and the timestamp key, using an AS OF join. The AS OF join ensures that the most recent value of the feature at the time of the timestamp is used in the training set. \n![how point in time works](https:\/\/docs.databricks.com\/_images\/point-in-time-diagram.png) \nAs shown in the figure, the training dataset includes the latest feature values for each sensor prior to the timestamp on the observed ground truth. 
\nIf you created a training dataset without taking into account the timestamp key, you might have a row with these feature values and observed ground truth: \n| temp | rh | light | co2 | ground truth |\n| --- | --- | --- | --- | --- |\n| 15.8 | 32 | 212 | 630 | 0 | \nHowever, this is not a valid observation for training, because the co2 reading of 630 was taken at 8:52, after the observation of the ground truth at 8:50. The future data is \u201cleaking\u201d into the training set, which will impair the model\u2019s performance.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use time series feature tables with point-in-time support\n##### Requirements\n\n* For Feature Engineering in Unity Catalog: Feature Engineering in Unity Catalog client (any version)\n* For Workspace Feature Store: Feature Store client v0.3.7 and above\n\n#### Use time series feature tables with point-in-time support\n##### Create a time series feature table in Unity Catalog\n\nIn Unity Catalog, any table with a [TIMESERIES](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-table-constraint.html) primary key is a time series feature table. See [Create a feature table in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html#create-feature-table) for how to create one.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use time series feature tables with point-in-time support\n##### Create a time series feature table in local workspace\n\nTo create a time series feature table in the local Workspace Feature Store, the DataFrame or schema must contain a column that you designate as the timestamp key. 
\nStarting with Feature Store client v0.13.4, timestamp key columns must be specified in the `primary_keys` argument. Timestamp keys are part of the \u201cprimary keys\u201d that uniquely identify each row in the feature table. Like other primary key columns, timestamp key columns cannot contain `NULL` values. \n```\n# Feature Engineering in Unity Catalog client\nfe = FeatureEngineeringClient()\n# user_features_df DataFrame contains the following columns:\n# - user_id\n# - ts\n# - purchases_30d\n# - is_free_trial_active\nfe.create_table(\nname=\"ml.ads_team.user_features\",\nprimary_keys=[\"user_id\", \"ts\"],\ntimeseries_columns=\"ts\",\nfeatures_df=user_features_df,\n)\n\n``` \n```\n# Workspace Feature Store client v0.13.4 and above\nfs = FeatureStoreClient()\n# user_features_df DataFrame contains the following columns:\n# - user_id\n# - ts\n# - purchases_30d\n# - is_free_trial_active\nfs.create_table(\nname=\"ads_team.user_features\",\nprimary_keys=[\"user_id\", \"ts\"],\ntimestamp_keys=\"ts\",\nfeatures_df=user_features_df,\n)\n\n``` \n```\n# Workspace Feature Store client v0.13.3 and below: the timestamp key is not listed in primary_keys\nfs = FeatureStoreClient()\n# user_features_df DataFrame contains the following columns:\n# - user_id\n# - ts\n# - purchases_30d\n# - is_free_trial_active\nfs.create_table(\nname=\"ads_team.user_features\",\nprimary_keys=\"user_id\",\ntimestamp_keys=\"ts\",\nfeatures_df=user_features_df,\n)\n\n``` \nA time series feature table must have one timestamp key and cannot have any partition columns. The timestamp key column must be of `TimestampType` or `DateType`. 
\nDatabricks recommends that time series feature tables have no more than two primary key columns to ensure performant writes and lookups.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use time series feature tables with point-in-time support\n##### Update a time series feature table\n\nWhen writing features to a time series feature table, your DataFrame must supply values for all features of the feature table, unlike regular feature tables. This constraint reduces the sparsity of feature values across timestamps in the time series feature table. \n```\nfe = FeatureEngineeringClient()\n# daily_users_batch_df DataFrame contains the following columns:\n# - user_id\n# - ts\n# - purchases_30d\n# - is_free_trial_active\nfe.write_table(\n\"ml.ads_team.user_features\",\ndaily_users_batch_df,\nmode=\"merge\"\n)\n\n``` \n```\nfs = FeatureStoreClient()\n# daily_users_batch_df DataFrame contains the following columns:\n# - user_id\n# - ts\n# - purchases_30d\n# - is_free_trial_active\nfs.write_table(\n\"ads_team.user_features\",\ndaily_users_batch_df,\nmode=\"merge\"\n)\n\n``` \nStreaming writes to time series feature tables are supported.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use time series feature tables with point-in-time support\n##### Create a training set with a time series feature table\n\nTo perform a point-in-time lookup for feature values from a time series feature table, you must specify a `timestamp_lookup_key` in the feature\u2019s `FeatureLookup`, which indicates the name of the DataFrame column that contains timestamps against which to look up time series features. 
Databricks Feature Store retrieves the latest feature values prior to the timestamps specified in the DataFrame\u2019s `timestamp_lookup_key` column and whose primary keys (excluding timestamp keys) match the values in the DataFrame\u2019s `lookup_key` columns, or `null` if no such feature value exists. \n```\nfeature_lookups = [\nFeatureLookup(\ntable_name=\"ml.ads_team.user_features\",\nfeature_names=[\"purchases_30d\", \"is_free_trial_active\"],\nlookup_key=\"u_id\",\ntimestamp_lookup_key=\"ad_impression_ts\"\n),\nFeatureLookup(\ntable_name=\"ml.ads_team.ad_features\",\nfeature_names=[\"sports_relevance\", \"food_relevance\"],\nlookup_key=\"ad_id\",\n)\n]\n\n# raw_clickstream DataFrame contains the following columns:\n# - u_id\n# - ad_id\n# - ad_impression_ts\ntraining_set = fe.create_training_set(\ndf=raw_clickstream,\nfeature_lookups=feature_lookups,\nexclude_columns=[\"u_id\", \"ad_id\", \"ad_impression_ts\"],\nlabel=\"did_click\",\n)\ntraining_df = training_set.load_df()\n\n``` \n```\nfeature_lookups = [\nFeatureLookup(\ntable_name=\"ads_team.user_features\",\nfeature_names=[\"purchases_30d\", \"is_free_trial_active\"],\nlookup_key=\"u_id\",\ntimestamp_lookup_key=\"ad_impression_ts\"\n),\nFeatureLookup(\ntable_name=\"ads_team.ad_features\",\nfeature_names=[\"sports_relevance\", \"food_relevance\"],\nlookup_key=\"ad_id\",\n)\n]\n\n# raw_clickstream DataFrame contains the following columns:\n# - u_id\n# - ad_id\n# - ad_impression_ts\ntraining_set = fs.create_training_set(\ndf=raw_clickstream,\nfeature_lookups=feature_lookups,\nexclude_columns=[\"u_id\", \"ad_id\", \"ad_impression_ts\"],\nlabel=\"did_click\",\n)\ntraining_df = training_set.load_df()\n\n``` \nAny `FeatureLookup` on a time series feature table must be a point-in-time lookup, so it must specify a `timestamp_lookup_key` column to use in your DataFrame. 
Point-in-time lookup does not skip rows with `null` feature values stored in the time series feature table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use time series feature tables with point-in-time support\n##### Set a time limit for historical feature values\n\nWith Feature Store client v0.13.0 or above, or any version of Feature Engineering in Unity Catalog client, you can exclude feature values with older timestamps from the training set. To do so, use the parameter `lookback_window` in the `FeatureLookup`. \nThe data type of `lookback_window` must be `datetime.timedelta`, and the default value is `None` (all feature values are used, regardless of age). \nFor example, the following code excludes any feature values that are more than 7 days old: \n```\nfrom datetime import timedelta\n\nfeature_lookups = [\nFeatureLookup(\ntable_name=\"ml.ads_team.user_features\",\nfeature_names=[\"purchases_30d\", \"is_free_trial_active\"],\nlookup_key=\"u_id\",\ntimestamp_lookup_key=\"ad_impression_ts\",\nlookback_window=timedelta(days=7)\n)\n]\n\n``` \n```\nfrom datetime import timedelta\n\nfeature_lookups = [\nFeatureLookup(\ntable_name=\"ads_team.user_features\",\nfeature_names=[\"purchases_30d\", \"is_free_trial_active\"],\nlookup_key=\"u_id\",\ntimestamp_lookup_key=\"ad_impression_ts\",\nlookback_window=timedelta(days=7)\n)\n]\n\n``` \nWhen you call `create_training_set` with the above `FeatureLookup`, it automatically performs the point-in-time join and excludes feature values older than 7 days. \nThe lookback window is applied during training and batch inference. 
During online inference, the latest feature value is always used, regardless of the lookback window.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use time series feature tables with point-in-time support\n##### Score models with time series feature tables\n\nWhen you score a model trained with features from time series feature tables, Databricks Feature Store retrieves the appropriate features using point-in-time lookups with metadata packaged with the model during training. The DataFrame you provide to `FeatureEngineeringClient.score_batch` (for Feature Engineering in Unity Catalog) or `FeatureStoreClient.score_batch` (for Workspace Feature Store) must contain a timestamp column with the same name and `DataType` as the `timestamp_lookup_key` of the `FeatureLookup` provided to `FeatureEngineeringClient.create_training_set` or `FeatureStoreClient.create_training_set`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use time series feature tables with point-in-time support\n##### Publish time series features to an online store\n\nYou can use `FeatureEngineeringClient.publish_table` (for Feature Engineering in Unity Catalog) or `FeatureStoreClient.publish_table` (for Workspace Feature Store) to publish time series feature tables to online stores. Databricks Feature Store provides the functionality to publish either a snapshot or a window of time series data to the online store, depending on the `OnlineStoreSpec` that created the online store. The table shows details for each publish mode. 
\n| Online store provider | Snapshot publish mode | Window publish mode |\n| --- | --- | --- |\n| Amazon DynamoDB (v0.3.8 and above) | X | X |\n| Amazon Aurora (MySQL-compatible) | X | |\n| Amazon RDS MySQL | X | | \n### Publish a time series snapshot \nThis publishes the latest feature values for each primary key in the feature table. The online store supports primary key lookup but does not support point-in-time lookup. \nFor online stores that do not support time to live, Databricks Feature Store supports only snapshot publish mode. For online stores that do support time to live, the default publish mode is snapshot unless time to live (`ttl`) is specified in the `OnlineStoreSpec` at the time of creation. \n### Publish a time series window \nThis publishes all feature values for each primary key in the feature table to the online store and automatically removes expired records. A record is considered expired if the record\u2019s timestamp (in UTC) is more than the specified time to live duration in the past. Refer to cloud-specific documentation for details on time-to-live. \nThe online store supports primary key lookup and automatically retrieves the feature value with the latest timestamp. \nTo use this publish mode, you must provide a value for time to live (`ttl`) in the `OnlineStoreSpec` when you create the online store. The `ttl` cannot be changed once set. All subsequent publish calls inherit the `ttl` and are not required to explicitly define it in the `OnlineStoreSpec`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use time series feature tables with point-in-time support\n##### Notebook example: Time series feature table\n\nThe following notebook illustrates point-in-time lookups on time series feature tables in the Workspace Feature Store. 
\n### Time series feature table example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/feature-store-time-series-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Batch inference using Foundation Model APIs\n\nThis article provides example notebooks that perform batch inference on a provisioned throughput endpoint using [Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html). You need both notebooks to accomplish batch inference using Foundation Model APIs. \nThe examples demonstrate batch inference using the [DBRX Instruct](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html#dbrx) model for chat tasks.\n\n##### Batch inference using Foundation Model APIs\n###### Requirements\n\n* A workspace in a [Foundation Model APIs supported region](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html#regions)\n* Databricks Runtime 14.0 ML or above\n* The `provisioned-throughput-batch-inference` notebook and `chat-batch-inference-api` notebook must exist in the same directory in the workspace\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/fmapi-batch-inference.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Batch inference using Foundation Model APIs\n###### Set up input table, batch inference\n\nThe following notebook does the following tasks, using Python: \n* Reads data from the input table and input column\n* Constructs the requests and sends them to a Foundation Model APIs 
endpoint\n* Persists input rows together with the response data to the output table \n### Chat model batch inference tasks using Python notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/large-language-models\/chat-batch-inference-api.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nThe following notebook does the same tasks as the above notebook, but using Spark: \n* Reads data from the input table and input column\n* Constructs the requests and sends them to a Foundation Model APIs endpoint\n* Persists input rows together with the response data to the output table \n### Chat model batch inference tasks using PySpark Pandas UDF notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/large-language-models\/chat-batch-inference-udf.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/fmapi-batch-inference.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Batch inference using Foundation Model APIs\n###### Create provisioned throughput endpoint\n\nIf you want to use the Spark notebook instead of the Python notebook, be sure to update the command that calls the Python notebook. \nThe following notebook does the following tasks: \n* Creates a provisioned throughput serving endpoint\n* Monitors the endpoint until it reaches a ready state\n* Calls the `chat-batch-inference-api` notebook to run batch inference tasks concurrently against the prepared endpoint. 
If you prefer to use Spark, change this reference to call the `chat-batch-inference-udf` notebook.\n* Deletes the provisioned throughput serving endpoint after batch inference completes \n### Perform batch inference on a provisioned throughput endpoint notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/large-language-models\/provisioned-throughput-batch-inference.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n##### Batch inference using Foundation Model APIs\n###### Additional resources\n\n* [Get started querying LLMs on Databricks](https:\/\/docs.databricks.com\/large-language-models\/llm-serving-intro.html)\n* [Try out the DBRX Instruct model in the AI Playground](https:\/\/docs.databricks.com\/large-language-models\/ai-playground.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/fmapi-batch-inference.html"} +{"content":"# \n### What is data warehousing on Databricks?\n\nData warehousing refers to collecting and storing data from multiple sources so it can be quickly accessed for business insights and reporting. This article contains key concepts for building a data warehouse in your data lakehouse.\n\n### What is data warehousing on Databricks?\n#### Data warehousing in your lakehouse\n\nThe lakehouse architecture and Databricks SQL bring cloud data warehousing capabilities to your data lakes. Using familiar data structures, relations, and management tools, you can model a highly-performant, cost-effective data warehouse that runs directly on your data lake. 
For more information, see [What is a data lakehouse?](https:\/\/docs.databricks.com\/lakehouse\/index.html) \n![Lakehouse architecture with a top layer that includes data warehousing, data engineering, data streaming, and data science and ML](https:\/\/docs.databricks.com\/_images\/lakehouse-dw-highlight.png) \nAs with a traditional data warehouse, you model data according to business requirements and then serve it to your end users for analytics and reports. Unlike a traditional data warehouse, you can avoid siloing your business analytics data or creating redundant copies that quickly become stale. \nBuilding a data warehouse inside your lakehouse lets you bring all your data into a single system and lets you take advantage of features such as Unity Catalog and Delta Lake. \n[Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) adds a unified governance model so that you can secure and audit data access and provide lineage information on downstream tables. [Delta Lake](https:\/\/docs.databricks.com\/delta\/index.html) adds ACID transactions and schema evolution, among other powerful tools for keeping your data reliable, scalable, and high-quality.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/index.html"} +{"content":"# \n### What is data warehousing on Databricks?\n#### What is Databricks SQL?\n\nDatabricks SQL is the collection of services that bring data warehousing capabilities and performance to your existing data lakes. Databricks SQL supports open formats and standard ANSI SQL. An in-platform SQL editor and dashboarding tools allow team members to collaborate with other Databricks users directly in the workspace. Databricks SQL also integrates with a variety of tools so that analysts can author queries and dashboards in their favorite environments without adjusting to a new platform. \nDatabricks SQL provides general compute resources that are executed against the tables in the lakehouse. 
Databricks SQL is powered by [SQL warehouses](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html), offering scalable SQL compute resources decoupled from storage. \nSee [What is a SQL warehouse?](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html) for more information on SQL warehouse defaults and options. \nDatabricks SQL integrates with Unity Catalog so that you can discover, audit, and govern data assets from one place. To learn more, see [What is Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/index.html"} +{"content":"# \n### What is data warehousing on Databricks?\n#### Data modeling on Databricks\n\nA lakehouse supports a variety of modeling styles. The following image shows how data is curated and modeled as it moves through different layers of a lakehouse. \n![A diagram showing various data models at each level of the medallion lakehouse architecture.](https:\/\/docs.databricks.com\/_images\/DW-lakehouse-layers.png) \n### Medallion architecture \nThe medallion architecture is a data design pattern that describes a series of incrementally refined data layers that provide a basic structure in the lakehouse. The bronze, silver, and gold layers signify increasing data quality at each level, with gold representing the highest quality. For more information, see [What is the medallion lakehouse architecture?](https:\/\/docs.databricks.com\/lakehouse\/medallion.html). \nInside a lakehouse, each layer can contain one or more tables. The data warehouse is modeled at the silver layer and feeds specialized data marts in the gold layer. \n### Bronze layer \nData can enter your lakehouse in any format and through any combination of batch or streaming transactions. The bronze layer provides the landing space for all of your raw data in its original format. That data is converted to Delta tables. 
\n### Silver layer \nThe silver layer brings the data from different sources together. For the part of the business that focuses on data science and machine learning applications, this is where you start to curate meaningful data assets. This process is often marked by a focus on speed and agility. \nThe silver layer is also where you can carefully integrate data from disparate sources to build a data warehouse in alignment with your existing business processes. Often, this data follows a Third Normal Form (3NF) or Data Vault model. Specifying primary and foreign key constraints allows end users to understand table relationships when using Unity Catalog. Your data warehouse should serve as the single source of truth for your data marts. \nThe data warehouse itself is schema-on-write and atomic. It is optimized for change, so you can quickly modify the data warehouse to match your current needs when your business processes change or evolve. \n### Gold layer \nThe gold layer is the presentation layer, which can contain one or more data marts. Frequently, data marts are dimensional models in the form of a set of related tables that capture a specific business perspective. \nThe gold layer also houses departmental and data science sandboxes to enable self-service analytics and data science across the enterprise. 
Providing these sandboxes and their own separate compute clusters prevents business teams from creating copies of data outside the lakehouse.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/index.html"} +{"content":"# \n### What is data warehousing on Databricks?\n#### Next step\n\nTo learn more about the principles and best practices for implementing and operating a lakehouse using Databricks, see [Introduction to the well-architected data lakehouse](https:\/\/docs.databricks.com\/lakehouse-architecture\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## Prepare data and environment for ML and DL\n### Load data for machine learning and deep learning\n##### Prepare data for distributed training\n\nThis article describes two methods for preparing data for distributed training: Petastorm and TFRecords.\n\n##### Prepare data for distributed training\n###### Petastorm (Recommended)\n\n[Petastorm](https:\/\/github.com\/uber\/petastorm) is an open source data access library that enables directly loading data stored in Apache Parquet format. This is convenient for Databricks and Apache Spark users because Parquet is the recommended data format. The following article illustrates this use case: \n* [Load data using Petastorm](https:\/\/docs.databricks.com\/machine-learning\/load-data\/petastorm.html)\n\n##### Prepare data for distributed training\n###### TFRecord\n\nYou can also use TFRecord format as the data source for distributed deep learning.\nTFRecord format is a simple record-oriented binary format that many TensorFlow applications use for\ntraining data. 
\n[tf.data.TFRecordDataset](https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/data\/TFRecordDataset) is\nthe TensorFlow dataset, which is comprised of records from TFRecords files.\nFor more details about how to consume TFRecord data, see the TensorFlow guide\n[Consuming TFRecord data](https:\/\/www.tensorflow.org\/guide\/data#consuming_tfrecord_data). \nThe following articles describe and illustrate the recommended ways to save your data to TFRecord files and load TFRecord files: \n* [Save Apache Spark DataFrames as TFRecord files](https:\/\/docs.databricks.com\/machine-learning\/load-data\/tfrecords-save-load.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/load-data\/ddl-data.html"} +{"content":"# Databricks data engineering\n## What is DBFS?\n#### Mounting cloud object storage on Databricks\n\nImportant \nMounts are a legacy access pattern. Databricks recommends using Unity Catalog for managing all data access. See [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html). \nDatabricks enables users to mount cloud object storage to the Databricks File System (DBFS) to simplify data access patterns for users that are unfamiliar with cloud concepts. Mounted data does not work with Unity Catalog, and Databricks recommends migrating away from using mounts and instead managing data governance with Unity Catalog.\n\n#### Mounting cloud object storage on Databricks\n##### How does Databricks mount cloud object storage?\n\nDatabricks mounts create a link between a workspace and cloud object storage, which enables you to interact with cloud object storage using familiar file paths relative to the Databricks file system. 
Mounts work by creating a local alias under the `\/mnt` directory that stores the following information: \n* Location of the cloud object storage.\n* Driver specifications to connect to the storage account or container.\n* Security credentials required to access the data.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dbfs\/mounts.html"} +{"content":"# Databricks data engineering\n## What is DBFS?\n#### Mounting cloud object storage on Databricks\n##### What is the syntax for mounting storage?\n\nThe `source` specifies the URI of the object storage (and can optionally encode security credentials). The `mount_point` specifies the local path in the `\/mnt` directory. Some object storage sources support an optional `encryption_type` argument. For some access patterns you can pass additional configuration specifications as a dictionary to `extra_configs`. \nNote \nDatabricks recommends setting mount-specific Spark and Hadoop configuration as options using `extra_configs`. This ensures that configurations are tied to the mount rather than the cluster or session. \n```\ndbutils.fs.mount(\nsource: str,\nmount_point: str,\nencryption_type: Optional[str] = \"\",\nextra_configs: Optional[dict[str, str]] = None\n)\n\n``` \nCheck with your workspace and cloud administrators before configuring or altering data mounts, as improper configuration can provide unsecured access to all users in your workspace. 
\nNote \nIn addition to the approaches described in this article, you can automate mounting a bucket with the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_mount](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/mount).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dbfs\/mounts.html"} +{"content":"# Databricks data engineering\n## What is DBFS?\n#### Mounting cloud object storage on Databricks\n##### Unmount a mount point\n\nTo unmount a mount point, use the following command: \n```\ndbutils.fs.unmount(\"\/mnt\/<mount-name>\")\n\n``` \nWarning \nTo avoid errors, never modify a mount point while other jobs are reading or writing to it. After modifying a mount, always run `dbutils.fs.refreshMounts()` on all other running clusters to propagate any mount updates. See [refreshMounts command (dbutils.fs.refreshMounts)](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#dbutils-fs-refreshmounts).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dbfs\/mounts.html"} +{"content":"# Databricks data engineering\n## What is DBFS?\n#### Mounting cloud object storage on Databricks\n##### Mount an S3 bucket\n\nYou can mount an S3 bucket through [What is DBFS?](https:\/\/docs.databricks.com\/dbfs\/index.html). The mount is a pointer to an S3 location, so the data is never synced locally. \nAfter a mount point is created through a cluster, users of that cluster can immediately access the mount point. To use the mount point in another running cluster, you must run `dbutils.fs.refreshMounts()` on that running cluster to make the newly created mount point available. 
\nYou can use the following methods to mount an S3 bucket: \n* [Mount a bucket using an AWS instance profile](https:\/\/docs.databricks.com\/dbfs\/mounts.html#mount-a-bucket-using-an-aws-instance-profile)\n* [Mount a bucket using AWS keys](https:\/\/docs.databricks.com\/dbfs\/mounts.html#mount-a-bucket-using-aws-keys)\n* [Mount a bucket using instance profiles with the `AssumeRole` policy](https:\/\/docs.databricks.com\/dbfs\/mounts.html#mount-a-bucket-using-instance-profiles-with-the-assumerole-policy) \n### [Mount a bucket using an AWS instance profile](https:\/\/docs.databricks.com\/dbfs\/mounts.html#id1) \nYou can manage authentication and authorization for an S3 bucket using an AWS [instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html). Access to the objects in the bucket is determined by the permissions granted to the instance profile. If the role has write access, users of the mount point can write objects in the bucket. If the role has read access, users of the mount point will be able to read objects in the bucket. \n1. Configure your cluster with an instance profile.\n2. Mount the bucket. \n```\naws_bucket_name = \"<aws-bucket-name>\"\nmount_name = \"<mount-name>\"\ndbutils.fs.mount(f\"s3a:\/\/{aws_bucket_name}\", f\"\/mnt\/{mount_name}\")\ndisplay(dbutils.fs.ls(f\"\/mnt\/{mount_name}\"))\n\n``` \n```\nval AwsBucketName = \"<aws-bucket-name>\"\nval MountName = \"<mount-name>\"\n\ndbutils.fs.mount(s\"s3a:\/\/$AwsBucketName\", s\"\/mnt\/$MountName\")\ndisplay(dbutils.fs.ls(s\"\/mnt\/$MountName\"))\n\n``` \n### [Mount a bucket using AWS keys](https:\/\/docs.databricks.com\/dbfs\/mounts.html#id2) \nYou can mount a bucket using AWS keys. \nImportant \nWhen you mount an S3 bucket using keys, *all users* have read and write access to *all the objects* in the S3 bucket. \nThe following examples use Databricks [secrets](https:\/\/docs.databricks.com\/security\/secrets\/index.html) to store the keys. 
You *must URL escape* the secret key. \n```\naccess_key = dbutils.secrets.get(scope = \"aws\", key = \"aws-access-key\")\nsecret_key = dbutils.secrets.get(scope = \"aws\", key = \"aws-secret-key\")\nencoded_secret_key = secret_key.replace(\"\/\", \"%2F\")\naws_bucket_name = \"<aws-bucket-name>\"\nmount_name = \"<mount-name>\"\n\ndbutils.fs.mount(f\"s3a:\/\/{access_key}:{encoded_secret_key}@{aws_bucket_name}\", f\"\/mnt\/{mount_name}\")\ndisplay(dbutils.fs.ls(f\"\/mnt\/{mount_name}\"))\n\n``` \n```\nval AccessKey = dbutils.secrets.get(scope = \"aws\", key = \"aws-access-key\")\n\/\/ Encode the Secret Key as that can contain \"\/\"\nval SecretKey = dbutils.secrets.get(scope = \"aws\", key = \"aws-secret-key\")\nval EncodedSecretKey = SecretKey.replace(\"\/\", \"%2F\")\nval AwsBucketName = \"<aws-bucket-name>\"\nval MountName = \"<mount-name>\"\n\ndbutils.fs.mount(s\"s3a:\/\/$AccessKey:$EncodedSecretKey@$AwsBucketName\", s\"\/mnt\/$MountName\")\ndisplay(dbutils.fs.ls(s\"\/mnt\/$MountName\"))\n\n``` \n### [Mount a bucket using instance profiles with the `AssumeRole` policy](https:\/\/docs.databricks.com\/dbfs\/mounts.html#id3) \nYou must first configure [Access cross-account S3 buckets with an AssumeRole policy](https:\/\/docs.databricks.com\/archive\/admin-guide\/assume-role.html). 
\nMount buckets while setting S3 options in the `extraConfigs`: \n```\ndbutils.fs.mount(\"s3a:\/\/<s3-bucket-name>\", \"\/mnt\/<s3-bucket-name>\",\nextra_configs = {\n\"fs.s3a.credentialsType\": \"AssumeRole\",\n\"fs.s3a.stsAssumeRole.arn\": \"arn:aws:iam::<bucket-owner-acct-id>:role\/MyRoleB\",\n\"fs.s3a.canned.acl\": \"BucketOwnerFullControl\",\n\"fs.s3a.acl.default\": \"BucketOwnerFullControl\"\n}\n)\n\n``` \n```\ndbutils.fs.mount(\"s3a:\/\/<s3-bucket-name>\", \"\/mnt\/<s3-bucket-name>\",\nextraConfigs = Map(\n\"fs.s3a.credentialsType\" -> \"AssumeRole\",\n\"fs.s3a.stsAssumeRole.arn\" -> \"arn:aws:iam::<bucket-owner-acct-id>:role\/MyRoleB\",\n\"fs.s3a.canned.acl\" -> \"BucketOwnerFullControl\",\n\"fs.s3a.acl.default\" -> \"BucketOwnerFullControl\"\n)\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dbfs\/mounts.html"} +{"content":"# Databricks data engineering\n## What is DBFS?\n#### Mounting cloud object storage on Databricks\n##### Encrypt data in S3 buckets\n\nDatabricks supports encrypting data using server-side encryption. This section covers how to use server-side encryption when writing files in S3 through DBFS. Databricks supports [Amazon S3-managed encryption keys (SSE-S3)](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/dev\/UsingServerSideEncryption.html) and [AWS KMS\u2013managed encryption keys (SSE-KMS)](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/dev\/UsingKMSEncryption.html). \n### Write files using SSE-S3 \n1. To mount your S3 bucket with SSE-S3, run the following command: \n```\ndbutils.fs.mount(s\"s3a:\/\/$AccessKey:$SecretKey@$AwsBucketName\", s\"\/mnt\/$MountName\", \"sse-s3\")\n\n```\n2. To write files to the corresponding S3 bucket with SSE-S3, run: \n```\ndbutils.fs.put(s\"\/mnt\/$MountName\", \"<file content>\")\n\n``` \n### Write files using SSE-KMS \n1. Mount a source directory passing in `sse-kms` or `sse-kms:$KmsKey` as the encryption type. 
\n* To mount your S3 bucket with SSE-KMS using the default KMS master key, run: \n```\ndbutils.fs.mount(s\"s3a:\/\/$AccessKey:$SecretKey@$AwsBucketName\", s\"\/mnt\/$MountName\", \"sse-kms\")\n\n```\n* To mount your S3 bucket with SSE-KMS using a specific KMS key, run: \n```\ndbutils.fs.mount(s\"s3a:\/\/$AccessKey:$SecretKey@$AwsBucketName\", s\"\/mnt\/$MountName\", \"sse-kms:$KmsKey\")\n\n```\n2. To write files to the S3 bucket with SSE-KMS, run: \n```\ndbutils.fs.put(s\"\/mnt\/$MountName\", \"<file content>\")\n\n``` \n### Mounting S3 buckets with the Databricks commit service \nIf you plan to write to a given table stored in S3 from multiple clusters or workloads simultaneously, Databricks recommends that you [Configure Databricks S3 commit services](https:\/\/docs.databricks.com\/security\/network\/classic\/s3-commit-service.html). Your notebook code must mount the bucket and add the `AssumeRole` configuration. This step is necessary only for DBFS mounts, not for accessing [DBFS root](https:\/\/docs.databricks.com\/dbfs\/index.html) storage in your workspace\u2019s root S3 bucket. The following example uses Python: \n```\n\n# If other code has already mounted the bucket without using the new role, unmount it first\ndbutils.fs.unmount(\"\/mnt\/<mount-name>\")\n\n# mount the bucket and assume the new role\ndbutils.fs.mount(\"s3a:\/\/<bucket-name>\/\", \"\/mnt\/<mount-name>\", extra_configs = {\n\"fs.s3a.credentialsType\": \"AssumeRole\",\n\"fs.s3a.stsAssumeRole.arn\": \"<role-arn>\"\n})\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dbfs\/mounts.html"} +{"content":"# Databricks data engineering\n## What is DBFS?\n#### Mounting cloud object storage on Databricks\n##### Mount ADLS Gen2 or Blob Storage with ABFS\n\nYou can mount data in an Azure storage account using a Microsoft Entra ID (formerly Azure Active Directory) application service principal for authentication. 
For more information, see [Access storage using a service principal & Microsoft Entra ID(Azure Active Directory)](https:\/\/docs.databricks.com\/connect\/storage\/aad-storage-service-principal.html). \nImportant \n* All users in the Databricks workspace have access to the mounted ADLS Gen2 account. The service principal you use to access the ADLS Gen2 account should be granted access only to that ADLS Gen2 account; it should not be granted access to other Azure resources.\n* When you create a mount point through a cluster, cluster users can immediately access the mount point. To use the mount point in another running cluster, you must run `dbutils.fs.refreshMounts()` on that running cluster to make the newly created mount point available for use.\n* Unmounting a mount point while jobs are running can lead to errors. Ensure that production jobs do not unmount storage as part of processing.\n* Mount points that use secrets are not automatically refreshed. If mounted storage relies on a secret that is rotated, expires, or is deleted, errors can occur, such as `401 Unauthorized`. To resolve such an error, you must unmount and remount the storage.\n* Hierarchical namespace (HNS) must be enabled to successfully mount an Azure Data Lake Storage Gen2 storage account using the ABFS endpoint. \nRun the following in your notebook to authenticate and create a mount point. 
\n```\nconfigs = {\"fs.azure.account.auth.type\": \"OAuth\",\n\"fs.azure.account.oauth.provider.type\": \"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\",\n\"fs.azure.account.oauth2.client.id\": \"<application-id>\",\n\"fs.azure.account.oauth2.client.secret\": dbutils.secrets.get(scope=\"<scope-name>\",key=\"<service-credential-key-name>\"),\n\"fs.azure.account.oauth2.client.endpoint\": \"https:\/\/login.microsoftonline.com\/<directory-id>\/oauth2\/token\"}\n\n# Optionally, you can add <directory-name> to the source URI of your mount point.\ndbutils.fs.mount(\nsource = \"abfss:\/\/<container-name>@<storage-account-name>.dfs.core.windows.net\/\",\nmount_point = \"\/mnt\/<mount-name>\",\nextra_configs = configs)\n\n``` \n```\nval configs = Map(\n\"fs.azure.account.auth.type\" -> \"OAuth\",\n\"fs.azure.account.oauth.provider.type\" -> \"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\",\n\"fs.azure.account.oauth2.client.id\" -> \"<application-id>\",\n\"fs.azure.account.oauth2.client.secret\" -> dbutils.secrets.get(scope=\"<scope-name>\",key=\"<service-credential-key-name>\"),\n\"fs.azure.account.oauth2.client.endpoint\" -> \"https:\/\/login.microsoftonline.com\/<directory-id>\/oauth2\/token\")\n\/\/ Optionally, you can add <directory-name> to the source URI of your mount point.\ndbutils.fs.mount(\nsource = \"abfss:\/\/<container-name>@<storage-account-name>.dfs.core.windows.net\/\",\nmountPoint = \"\/mnt\/<mount-name>\",\nextraConfigs = configs)\n\n``` \nReplace \n* `<application-id>` with the **Application (client) ID** for the Azure Active Directory application.\n* `<scope-name>` with the Databricks secret scope name.\n* `<service-credential-key-name>` with the name of the key containing the client secret.\n* `<directory-id>` with the **Directory (tenant) ID** for the Azure Active Directory application.\n* `<container-name>` with the name of a container in the ADLS Gen2 storage account.\n* `<storage-account-name>` with the ADLS Gen2 
storage account name.\n* `<mount-name>` with the name of the intended mount point in DBFS.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dbfs\/mounts.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n###### Databricks on AWS GovCloud\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes the Databricks on AWS GovCloud offering and its compliance controls.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/gov-cloud.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n###### Databricks on AWS GovCloud\n####### AWS GovCloud overview\n\nAWS GovCloud gives United States government customers and their partners the flexibility to architect secure cloud solutions that comply with the FedRAMP High baseline and other compliance regimes, including United States International Traffic in Arms Regulations (ITAR) and Export Administration Regulations (EAR). For details, see [AWS GovCloud](https:\/\/aws.amazon.com\/govcloud-us). \nDatabricks on AWS GovCloud provides the Databricks platform deployed in AWS GovCloud with compliance and security controls. Databricks on AWS GovCloud is operated exclusively by US citizens on US soil. \nNote \nThe Databricks GovCloud Help Center is where you submit and manage support cases. Go to <https:\/\/help.databricks.us\/s\/>. Do not share any export-controlled data regarding support cases through channels other than the Databricks GovCloud Help Center. For more information on support, see [Support](https:\/\/docs.databricks.com\/resources\/support.html). \nWhen a Databricks on AWS GovCloud account is provisioned, the account owner receives an email with a short-lived login URL. You can request a new URL by resetting your account password. 
\n### Compliance security profile \nThe compliance security profile is enabled on all Databricks on AWS GovCloud workspaces by default. The compliance security profile has additional monitoring, enforced instance types for inter-node encryption, a hardened compute image, and other features that help meet the requirements of FedRAMP High compliance. Automatic cluster update and enhanced security monitoring are also automatically enabled. \nThe compliance security profile enforces the use of [AWS Nitro](https:\/\/aws.amazon.com\/ec2\/nitro\/) instance types that provide both hardware-implemented network encryption between cluster nodes and encryption at rest for local disks in clusters and Databricks SQL warehouses. Fleet instances are not available in AWS GovCloud. The supported instance types are: \n* **General purpose:** `M5dn`, `M5n`, `M5zn`, `M6i`, `M7i`, `M6id`, `M6in`, `M6idn`, `M6a`, `M7a`\n* **Compute optimized:** `C5a`, `C5ad`, `C5n`, `C6i`, `C6id`, `C7i`, `C6in`, `C6a`, `C7a`\n* **Memory optimized:** `R6i`, `R7i`, `R7iz`, `R6id`, `R6in`, `R6idn`, `R6a`, `R7a`\n* **Storage optimized:** `D3`, `D3en`, `P3dn`, `R5dn`, `R5n`, `I4i`, `I3en`\n* **Accelerated computing:** `G4dn`, `G5`, `P4d`, `P4de`, `P5` \nFor more information on the compliance security profile, see [Compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/gov-cloud.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n###### Databricks on AWS GovCloud\n####### FedRAMP High compliance\n\nThe FedRAMP High authorization status of Databricks on AWS GovCloud is currently **In Process**. 
\nCustomers are responsible for implementing and operating applicable FedRAMP HIGH compliance controls as documented in the **Control Implementation Summary \/ Customer Responsibility Matrix** in SSP Appendix J of the Databricks FedRAMP authorization documentation package. US Government agencies can obtain access to the Databricks FedRAMP High authorization documentation through the FedRAMP package access request form. Follow the instructions on the [Databricks FedRAMP Marketplace listing](https:\/\/marketplace.fedramp.gov\/products\/FR2324740262\/) (package ID: FR2324740262). \nYou must configure the following on Databricks on AWS GovCloud workspaces: \n* Single sign-on authentication, see [Set up SSO for your workspace](https:\/\/docs.databricks.com\/admin\/users-groups\/single-sign-on\/index.html)\n* PrivateLink for both back-end and front-end connections, see [Enable AWS PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/gov-cloud.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n###### Databricks on AWS GovCloud\n####### Databricks for AWS GovCloud region and URLs\n\nThe Databricks AWS account ID for Databricks on AWS GovCloud is `044793339203`. This account ID is required to create and configure a cross-account IAM role for Databricks workspace deployment. See [Create an IAM role for workspace deployment](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/credentials.html). \nDatabricks on AWS GovCloud workspaces are in the `us-gov-west-1` region. For region information, see [Databricks clouds and regions](https:\/\/docs.databricks.com\/resources\/supported-regions.html). \nDatabricks on AWS GovCloud URLs differ from Databricks URLs on the commercial offering. 
Use the following URLs for Databricks on AWS GovCloud: \n* Databricks account console URL: `https:\/\/accounts.cloud.databricks.us`\n* Base URL for account-level REST APIs: `https:\/\/accounts.cloud.databricks.us\/`\n* Databricks workspace URL: `https:\/\/<deployment-name>.cloud.databricks.us` \nFor example, if the deployment name you specified during workspace creation is `ABCSales`, your workspace URL is `https:\/\/abcsales.cloud.databricks.us`.\n* Base URL for workspace-level REST APIs: `https:\/\/<deployment-name>.cloud.databricks.us\/`\n\n###### Databricks on AWS GovCloud\n####### Public Preview feature availability\n\nNotable features that are supported: \n* Unity Catalog\n* Databricks Runtime latest versions and LTS versions\n* Databricks SQL\n* Dashboards\n* MLflow experiments \nFeatures that are unavailable during preview: \n* Serverless compute\n* Model serving\n* In-product messaging\n* Databricks Marketplace\n* Partner Connect\n* System tables\n* Cluster metrics\n* Legacy dashboards\n* In-product support ticket submission\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/gov-cloud.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Use XGBoost on Databricks\n##### Distributed training of XGBoost models using Scala\n\nThe example notebooks show how you can use XGBoost with MLlib: \n* The first example shows how to embed an XGBoost model into an MLlib ML pipeline.\n* The second example shows how to use MLlib cross validation to tune an XGBoost model.\n\n##### Distributed training of XGBoost models using Scala\n###### XGBoost classification with ML pipeline notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/xgboost-classification.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n##### Distributed training of XGBoost models using Scala\n###### XGBoost regression with cross-validation 
notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/xgboost-regression.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost-scala.html"} +{"content":"# \n### Databricks notebook execution contexts\n\nWhen you attach a notebook to a cluster, Databricks creates an execution context. An *execution context* contains the state for a [REPL](https:\/\/en.wikipedia.org\/wiki\/Read%E2%80%93eval%E2%80%93print_loop) environment for each supported programming language: Python, R, Scala, and SQL. When you run a cell in a notebook, the command is dispatched to the appropriate language REPL environment and run. \nYou can also use the [command execution API](https:\/\/docs.databricks.com\/api\/workspace\/commandexecution) to create an execution context and send a command to run in the execution context. Similarly, the command is dispatched to the language REPL environment and run. \nA cluster has a maximum number of execution contexts (145). Once the number of execution contexts has reached this threshold, you cannot attach a notebook to the cluster or create a new execution context.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/execution-context.html"} +{"content":"# \n### Databricks notebook execution contexts\n#### Idle execution contexts\n\nAn execution context is considered *idle* when the last completed execution occurred past a set idle threshold. Last completed execution is the last time the notebook completed execution of commands. The idle threshold is the amount of time that must pass between the last completed execution and any attempt to automatically detach the notebook. \nWhen a cluster has reached the maximum context limit, Databricks removes (evicts) idle execution contexts (starting with the least recently used) as needed. 
Even when a context is removed, the notebook using the context is *still attached to the cluster and appears in the cluster\u2019s notebook list*. Streaming notebooks are considered actively running, and their context is never evicted until their execution has been stopped. If an idle context is evicted, the UI displays a message indicating that the notebook using the context was detached due to being idle. \n![Notebook context evicted](https:\/\/docs.databricks.com\/_images\/notebook-context-evicted.png) \nIf you attempt to attach a notebook to a cluster that has the maximum number of execution contexts and there are no idle contexts (or if auto-eviction is disabled), the UI displays a message saying that the current maximum execution contexts threshold has been reached and the notebook will remain in the detached state. \n![Notebook detached](https:\/\/docs.databricks.com\/_images\/notebook-detached.png) \nIf you fork a process, an idle execution context is still considered idle once execution of the request that forked the process returns. Forking separate processes is *not recommended* with Spark.\n\n### Databricks notebook execution contexts\n#### Configure context auto-eviction\n\nAuto-eviction is enabled by default. 
To disable auto-eviction for a cluster, set the [Spark property](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration) `spark.databricks.chauffeur.enableIdleContextTracking false`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/execution-context.html"} +{"content":"# \n### Databricks notebook execution contexts\n#### Determine Spark and Databricks Runtime version\n\nTo determine the Spark version of the cluster your notebook is attached to, run: \n```\nspark.version\n\n``` \nTo determine the Databricks Runtime version of the cluster your notebook is attached to, run: \n```\nspark.conf.get(\"spark.databricks.clusterUsageTags.sparkVersion\")\n\n``` \nNote \nBoth this `sparkVersion` tag and the `spark_version` property required by the endpoints in the [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters) and [Jobs API](https:\/\/docs.databricks.com\/api\/workspace\/jobs) refer to the [Databricks Runtime version](https:\/\/docs.databricks.com\/api\/workspace\/introduction), not the Spark version.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/execution-context.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Share information between tasks in a Databricks job\n\nYou can use *task values* to pass arbitrary parameters between tasks in a Databricks job. You pass task values using the [taskValues subutility](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#dbutils-jobs-taskvalues) in Databricks Utilities. The taskValues subutility provides a simple API that allows tasks to output values that can be referenced in subsequent tasks, making it easier to create more expressive workflows. For example, you can communicate identifiers or metrics, such as information about the evaluation of a machine learning model, between different tasks within a job run. Each task can set and get multiple task values. 
Task values can be set and retrieved in Python notebooks. \nNote \nYou can now use [dynamic value references](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html) in your notebooks to reference task values set in upstream tasks. For example, to reference the value with the key `name` set by the task `Get_user_data`, use `{{tasks.Get_user_data.values.name}}`. Because they can be used with multiple task types, Databricks recommends using dynamic value references instead of `dbutils.jobs.taskValues.get` to retrieve the task value programmatically.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/share-task-context.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Share information between tasks in a Databricks job\n##### Using task values\n\nThe taskValues subutility provides two commands: `dbutils.jobs.taskValues.set()` to set a variable and `dbutils.jobs.taskValues.get()` to retrieve a value. Suppose you have two notebook tasks: `Get_user_data` and `Analyze_user_data` and want to pass a user\u2019s name and age from the `Get_user_data` task to the `Analyze_user_data` task. The following example sets the user\u2019s name and age in the `Get_user_data` task: \n```\ndbutils.jobs.taskValues.set(key = 'name', value = 'Some User')\ndbutils.jobs.taskValues.set(key = \"age\", value = 30)\n\n``` \n* `key` is the name of the task value key. This name must be unique to the task.\n* `value` is the value for this task value\u2019s key. This command must be able to represent the value internally in JSON format. The size of the JSON representation of the value cannot exceed 48 KiB. 
\nThe following example then gets the values in the `Analyze_user_data` task: \n```\ndbutils.jobs.taskValues.get(taskKey = \"Get_user_data\", key = \"age\", default = 42, debugValue = 0)\ndbutils.jobs.taskValues.get(taskKey = \"Get_user_data\", key = \"name\", default = \"Jane Doe\")\n\n``` \n* `taskKey` is the name of the job task setting the value. If the command cannot find this task, a `ValueError` is raised.\n* `key` is the name of the task value\u2019s key. If the command cannot find this task value\u2019s key, a `ValueError` is raised (unless `default` is specified).\n* `default` is an optional value that is returned if `key` cannot be found. `default` cannot be `None`.\n* `debugValue` is an optional value that is returned if you try to get the task value from within a notebook that is running outside of a job. This can be useful during debugging when you want to run your notebook manually and return some value instead of raising a `TypeError` by default. `debugValue` cannot be `None`. \nAs a more complex example of sharing context between tasks, suppose that you have an application that includes several machine learning models to predict an individual\u2019s income given various personal attributes, and a task that determines the best model to use based on output from the previous three tasks. The models are run by three tasks named `Logistic_Regression`, `Decision_Tree`, and `Random_Forest`, and the `Best_Model` task determines the best model to use based on output from the previous three tasks. \n![Graph of example classification application](https:\/\/docs.databricks.com\/_images\/classifier-graph.png) \nThe accuracy for each model (how well the classifier predicts income) is passed in a task value to determine the best performing algorithm. 
For example, the logistic regression notebook associated with the `Logistic_Regression` task includes the following command: \n```\ndbutils.jobs.taskValues.set(key = \"model_performance\", value = result)\n\n``` \nEach model task sets a value for the `model_performance` key. The `Best_Model` task reads the value for each task, and uses that value to determine the optimal model. The following example reads the value set by the `Logistic_Regression` task: \n```\nlogistic_regression = dbutils.jobs.taskValues.get(taskKey = \"Logistic_Regression\", key = \"model_performance\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/share-task-context.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Share information between tasks in a Databricks job\n##### View task values\n\nTo view the value of a task value after a task runs, go to the [task run history](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#task-history) for the task. The task value results are displayed in the **Output** panel.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/share-task-context.html"} +{"content":"# What is Delta Lake?\n### Convert to Delta Lake\n\nThe `CONVERT TO DELTA` SQL command performs a one-time conversion for Parquet and Iceberg tables to Delta Lake tables. For incremental conversion of Parquet or Iceberg tables to Delta Lake, see [Incrementally clone Parquet and Iceberg tables to Delta Lake](https:\/\/docs.databricks.com\/delta\/clone-parquet.html). \nUnity Catalog supports the `CONVERT TO DELTA` SQL command for Parquet and Iceberg tables stored in external locations managed by Unity Catalog. \nYou can configure existing Parquet data files as external tables in Unity Catalog and then convert them to Delta Lake to unlock all features of the Databricks lakehouse. 
\nFor the technical documentation, see [CONVERT TO DELTA](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-convert-to-delta.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/convert-to-delta.html"} +{"content":"# What is Delta Lake?\n### Convert to Delta Lake\n#### Converting a directory of Parquet or Iceberg files in an external location to Delta Lake\n\nNote \n* Converting Iceberg tables is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html).\n* Converting Iceberg tables is supported in Databricks Runtime 10.4 and above.\n* Converting Iceberg metastore tables is not supported.\n* Converting Iceberg tables that have experienced [partition evolution](https:\/\/iceberg.apache.org\/docs\/latest\/evolution\/#partition-evolution) is not supported.\n* Converting Iceberg merge-on-read tables that have experienced updates, deletions, or merges is not supported.\n* The following are limitations for converting Iceberg tables with partitions defined on truncated columns: \n+ In Databricks Runtime 12.2 LTS and below, the only truncated column type supported is `string`.\n+ In Databricks Runtime 13.3 LTS and above, you can work with truncated columns of types `string`, `long`, or `int`.\n+ Databricks does not support working with truncated columns of type `decimal`. \nYou can convert a directory of Parquet data files to a Delta Lake table as long as you have write access on the storage location; for information on configuring access with Unity Catalog, see [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html). \n```\nCONVERT TO DELTA parquet.`s3:\/\/my-bucket\/parquet-data`;\n\nCONVERT TO DELTA iceberg.`s3:\/\/my-bucket\/iceberg-data`;\n\n``` \nTo load converted tables as external tables to Unity Catalog, you need `CREATE TABLES` permissions on the external location. 
\nNote \nFor Databricks Runtime 11.3 LTS and above, `CONVERT TO DELTA` automatically infers partitioning information for tables registered to the metastore, eliminating the requirement to manually specify partitions.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/convert-to-delta.html"} +{"content":"# What is Delta Lake?\n### Convert to Delta Lake\n#### Converting managed and external tables to Delta Lake on Unity Catalog\n\nUnity Catalog supports many formats for external tables, but only supports Delta Lake for managed tables. To convert a managed Parquet table directly to a managed Unity Catalog Delta Lake table, see [Upgrade a Hive managed table to a Unity Catalog managed table using CLONE](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#clone). \nTo upgrade an external Parquet table to a Unity Catalog external table, see [Upgrade a single Hive table to a Unity Catalog external table using the upgrade wizard](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#migrate-external). \nOnce you\u2019ve registered an external Parquet table to Unity Catalog, you can convert it to an external Delta Lake table. Note that you must provide partitioning information if the parquet table is partitioned. 
\n```\nCONVERT TO DELTA catalog_name.database_name.table_name;\n\nCONVERT TO DELTA catalog_name.database_name.table_name PARTITIONED BY (date_updated DATE);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/convert-to-delta.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Hyperparameter tuning\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/mllib-mlflow-integration.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Hyperparameter tuning\n##### Apache Spark MLlib and automated MLflow tracking\n\nNote \nMLlib automated MLflow tracking is deprecated on clusters that run Databricks Runtime 10.1 ML and above, and it is disabled by default on clusters running Databricks Runtime 10.2 ML and above. Instead, use [MLflow PySpark ML autologging](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.pyspark.ml.html#mlflow.pyspark.ml.autolog) by calling `mlflow.pyspark.ml.autolog()`, which is enabled by default with [Databricks Autologging](https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html). \nTo use the old MLlib automated MLflow tracking in Databricks Runtime 10.2 ML or above, enable it by setting the [Spark configurations](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration) `spark.databricks.mlflow.trackMLlib.enabled true` and `spark.databricks.mlflow.autologging.enabled false`. \n[MLflow](https:\/\/docs.databricks.com\/mlflow\/index.html) is an open source platform for managing the end-to-end machine learning lifecycle. MLflow supports tracking for machine learning model tuning in Python, R, and Scala. 
For Python notebooks only, [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html) and [Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html) support *automated* [MLflow Tracking](https:\/\/docs.databricks.com\/mlflow\/tracking.html) for Apache Spark MLlib model tuning. \nWith MLlib automated MLflow tracking, when you run tuning code that uses `CrossValidator` or `TrainValidationSplit`, hyperparameters and evaluation metrics are automatically logged in MLflow. Without automated MLflow tracking, you must make explicit API calls to log to MLflow.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/mllib-mlflow-integration.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Hyperparameter tuning\n##### Apache Spark MLlib and automated MLflow tracking\n###### Manage MLflow runs\n\n`CrossValidator` or `TrainValidationSplit` log tuning results as nested MLflow runs: \n* Main or parent run: The information for `CrossValidator` or `TrainValidationSplit` is logged to the main run. If there is an active run already, information is logged to this active run and the active run is not stopped. If there is no active run, MLflow creates a new run, logs to it, and ends the run before returning.\n* Child runs: Each hyperparameter setting tested and the corresponding evaluation metric are logged to a child run under the main run. \nWhen calling `fit()`, Databricks recommends active MLflow run management; that is, wrap the call to `fit()` inside a \u201c`with mlflow.start_run():`\u201d statement.\nThis ensures that the information is logged under its own MLflow main run, and makes it easier to log additional tags, parameters, or metrics to that run. \nNote \nWhen `fit()` is called multiple times within the same active MLflow run, it logs those multiple runs to the same main run. 
To resolve name conflicts for MLflow parameters and tags, MLflow appends a UUID to names with conflicts. \nThe following Python notebook demonstrates automated MLflow tracking. \n### Automated MLflow tracking notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mllib-mlflow-integration.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nAfter you perform the actions in the last cell in the notebook, your MLflow UI should display: \n![MLlib-MLflow demo](https:\/\/docs.databricks.com\/_images\/mllib-mlflow-demo.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/mllib-mlflow-integration.html"} +{"content":"# Develop on Databricks\n### Databricks for R developers\n\nThis section provides a guide to developing notebooks and jobs in Databricks using the R language. \nA basic workflow for getting started is: \n1. [Import code](https:\/\/docs.databricks.com\/sparkr\/index.html#manage-code-with-notebooks-and-databricks-git-folders): Either import your own code from files or Git repos, or try a tutorial listed below. Databricks recommends learning to use interactive Databricks notebooks.\n2. [Run your code on a cluster](https:\/\/docs.databricks.com\/sparkr\/index.html#clusters): Either create a cluster of your own, or ensure you have permissions to use a shared cluster. Attach your notebook to the cluster, and run the notebook. 
\nBeyond this, you can branch out into more specific topics: \n* [Work with larger data sets](https:\/\/docs.databricks.com\/sparkr\/index.html#reference) using Apache Spark\n* [Add visualizations](https:\/\/docs.databricks.com\/sparkr\/index.html#visualizations)\n* [Automate your workload](https:\/\/docs.databricks.com\/sparkr\/index.html#jobs) as a job\n* [Use machine learning](https:\/\/docs.databricks.com\/sparkr\/index.html#machine-learning) to analyze your data\n* [Use R developer tools](https:\/\/docs.databricks.com\/sparkr\/index.html#r-developer-tools)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/index.html"} +{"content":"# Develop on Databricks\n### Databricks for R developers\n#### Tutorials\n\nThe following tutorials provide example code and notebooks to learn about common workflows. See [Import a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html#import-notebook) for instructions on importing notebook examples into your workspace. \n* [Tutorial: Analyze data with glm](https:\/\/docs.databricks.com\/sparkr\/glm-tutorial.html)\n* [MLflow quickstart R notebook](https:\/\/docs.databricks.com\/mlflow\/quick-start-r.html)\n* [Tutorial: Delta Lake](https:\/\/docs.databricks.com\/delta\/tutorial.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/index.html"} +{"content":"# Develop on Databricks\n### Databricks for R developers\n#### Reference\n\nThe following subsections list key features and tips to help you begin developing in Databricks with R. \nDatabricks supports two APIs that provide an R interface to Apache Spark: [SparkR](https:\/\/spark.apache.org\/docs\/latest\/sparkr.html) and [sparklyr](https:\/\/spark.rstudio.com\/). \n### SparkR \nThese articles provide an introduction and reference for [SparkR](https:\/\/spark.apache.org\/docs\/latest\/sparkr.html). SparkR is an R interface to Apache Spark that provides a distributed data frame implementation. 
SparkR supports operations like selection, filtering, and aggregation (similar to R data frames) but on large datasets. \n* [SparkR overview](https:\/\/docs.databricks.com\/sparkr\/overview.html)\n* [Tutorial: Analyze data with glm](https:\/\/docs.databricks.com\/sparkr\/glm-tutorial.html) \n### sparklyr \nThis article provides an introduction to [sparklyr](https:\/\/rdocumentation.org\/packages\/sparklyr). sparklyr is an R interface to Apache Spark that provides functionality similar to [dplyr](https:\/\/spark.rstudio.com\/guides\/dplyr.html), `broom`, and [DBI](https:\/\/spark.rstudio.com\/get-started\/prepare-data.html#using-sql). \n* [sparklyr](https:\/\/docs.databricks.com\/sparkr\/sparklyr.html) \n### Comparing SparkR and sparklyr \nThis article explains key similarities and differences between SparkR and sparklyr. \n* [Comparing SparkR and sparklyr](https:\/\/docs.databricks.com\/sparkr\/sparkr-vs-sparklyr.html) \n### Work with DataFrames and tables with SparkR and sparklyr \nThis article describes how to use R, SparkR, sparklyr, and dplyr to work with R data.frames, Spark DataFrames, and Spark tables in Databricks. \n* [Work with DataFrames and tables in R](https:\/\/docs.databricks.com\/sparkr\/dataframes-tables.html) \n### Manage code with notebooks and Databricks Git folders \nDatabricks [notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html) support R. These notebooks provide functionality similar to that of Jupyter, but with additions such as built-in visualizations using big data, Apache Spark integrations for debugging and performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by [importing a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html#import-notebook). 
Once you have access to a cluster, you can [attach a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) to the cluster and [run the notebook](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html). \nDatabricks [Git folders](https:\/\/docs.databricks.com\/repos\/index.html) allow users to synchronize notebooks and other files with Git repositories. Databricks Git folders help with code versioning and collaboration, and they can simplify importing a full repository of code into Databricks, viewing past notebook versions, and integrating with IDE development. Get started by [cloning a remote Git repository](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html). You can then open or create notebooks with the repository clone, [attach the notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) to a cluster, and [run the notebook](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html). \n### Clusters \nDatabricks [Compute](https:\/\/docs.databricks.com\/compute\/index.html) provides compute management for both single nodes and large clusters. You can customize cluster hardware and libraries according to your needs. Data scientists will generally begin work either by [creating a cluster](https:\/\/docs.databricks.com\/compute\/configure.html) or using an existing [shared cluster](https:\/\/docs.databricks.com\/compute\/use-compute.html). Once you have access to a cluster, you can [attach a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) to the cluster or [run a job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#create-a-job) on the cluster. 
\n* For small workloads which only require single nodes, data scientists can use [single node compute](https:\/\/docs.databricks.com\/compute\/configure.html#single-node) for cost savings.\n* For detailed tips, see [Compute configuration best practices](https:\/\/docs.databricks.com\/compute\/cluster-config-best-practices.html).\n* Administrators can set up [cluster policies](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html) to simplify and guide cluster creation. \n#### Single node R and distributed R \nDatabricks clusters consist of an Apache Spark *driver* node and zero or more Spark *worker* (also known as *executor*) nodes. The driver node maintains attached notebook state, maintains the `SparkContext`, interprets notebook and library commands, and runs the Spark master that coordinates with Spark executors. Worker nodes run the Spark executors, one Spark executor per worker node. \nA *single node* cluster has one driver node and no worker nodes, with Spark running in local mode to support access to tables managed by Databricks. Single node clusters support RStudio, notebooks, and libraries, and are useful for R projects that don\u2019t depend on Spark for big data or parallel processing. See [Single-node or multi-node compute](https:\/\/docs.databricks.com\/compute\/configure.html#single-node). \nFor data sizes that R struggles to process (many gigabytes or petabytes), you should use multiple-node or *distributed* clusters instead. Distributed clusters have one driver node and one or more worker nodes. Distributed clusters support not only RStudio, notebooks, and libraries, but R packages such as SparkR and sparklyr, which are uniquely designed to use distributed clusters through the `SparkContext`. These packages provide familiar SQL and DataFrame APIs, which enable assigning and running various Spark tasks and commands in parallel across worker nodes. 
To learn more about sparklyr and SparkR, see [Comparing SparkR and sparklyr](https:\/\/docs.databricks.com\/sparkr\/sparkr-vs-sparklyr.html). \nSome SparkR and sparklyr functions that take particular advantage of distributing related work across worker nodes include the following: \n* [sparklyr::spark\\_apply](https:\/\/spark.rstudio.com\/guides\/distributed-r.html): Runs arbitrary R code at scale within a cluster. This is especially useful for using functionality that is available only in R, or R packages that are not available in Apache Spark nor other Spark packages.\n* [SparkR::dapply](https:\/\/spark.apache.org\/docs\/latest\/api\/R\/reference\/dapply.html): Applies the specified function to each partition of a `SparkDataFrame`.\n* [SparkR::dapplyCollect](https:\/\/spark.apache.org\/docs\/latest\/api\/R\/reference\/dapplyCollect.html): Applies the specified function to each partition of a `SparkDataFrame` and collects the results back to R as a `data.frame`.\n* [SparkR::gapply](https:\/\/spark.apache.org\/docs\/latest\/api\/R\/reference\/gapply.html): Groups a `SparkDataFrame` by using the specified columns and applies the specified R function to each group.\n* [SparkR::gapplyCollect](https:\/\/spark.apache.org\/docs\/latest\/api\/R\/reference\/gapplyCollect.html): Groups a `SparkDataFrame` by using the specified columns, applies the specified R function to each group, and collects the result back to R as a `data.frame`.\n* [SparkR::spark.lapply](https:\/\/spark.apache.org\/docs\/latest\/api\/R\/reference\/spark.lapply.html): Runs the specified function over a list of elements, distributing the computations with Spark. \nFor examples, see the notebook [Distributed R: User Defined Functions in Spark](https:\/\/www.databricks.com\/notebooks\/gallery\/DistributedRUserDefinedFunctions.html). 
\n#### Databricks Container Services \n[Databricks Container Services](https:\/\/docs.databricks.com\/compute\/custom-containers.html) lets you specify a Docker image when you create a cluster. Databricks provides the [databricksruntime\/rbase](https:\/\/hub.docker.com\/r\/databricksruntime\/rbase) base image on Docker Hub as an example to launch a Databricks Container Services cluster with R support. See also the [Dockerfile](https:\/\/github.com\/databricks\/containers\/blob\/master\/ubuntu\/R\/Dockerfile) that is used to generate this base image. \n### Libraries \nDatabricks clusters use the Databricks Runtime, which provides many popular libraries out-of-the-box, including Apache Spark, Delta Lake, and more. You can also install additional third-party or custom R packages into libraries to use with notebooks and jobs. \nStart with the default libraries in the [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). Use [Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html) for machine learning workloads. For full lists of pre-installed libraries, see the \u201cInstalled R libraries\u201d section for the target Databricks Runtime in [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). \nYou can customize your environment by using [Notebook-scoped R libraries](https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html), which allow you to modify your notebook or job environment with libraries from CRAN or other repositories. To do this, you can use the familiar [install.packages](https:\/\/www.rdocumentation.org\/packages\/utils\/versions\/3.6.2\/topics\/install.packages) function from `utils`. 
The following example installs the [Arrow R package](https:\/\/arrow.apache.org\/docs\/r\/) from the default CRAN repository: \n```\ninstall.packages(\"arrow\")\n\n``` \nIf you need an older version than what is included in the Databricks Runtime, you can use a notebook to run [install\\_version](https:\/\/www.rdocumentation.org\/packages\/devtools\/versions\/1.13.6\/topics\/install_version) function from `devtools`. The following example installs [dplyr](https:\/\/dplyr.tidyverse.org\/) version 0.7.4 from CRAN: \n```\nrequire(devtools)\n\ninstall_version(\npackage = \"dplyr\",\nversion = \"0.7.4\",\nrepos = \"http:\/\/cran.r-project.org\"\n)\n\n``` \nPackages installed this way are available across a cluster. They are scoped to the user who installs them. This enables you to install multiple versions of the same package on the same compute without creating package conflicts. \nYou can install other libraries as [Cluster libraries](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html) as needed, for example from CRAN. To do this, in the cluster user interface, click **Libraries > Install new > CRAN** and specify the library\u2019s name. This approach is especially important for when you want to call user-defined functions with SparkR or sparklyr. \nFor more details, see [Libraries](https:\/\/docs.databricks.com\/libraries\/index.html). \nTo install a [custom package](https:\/\/r-pkgs.org\/) into a library: \n1. Build your custom package from the command line or by using [RStudio](https:\/\/support.rstudio.com\/hc\/en-us\/articles\/200486488-Developing-Packages-with-the-RStudio-IDE).\n2. Copy the custom package file from your development machine over to your Databricks workspace. For options, see [Libraries](https:\/\/docs.databricks.com\/libraries\/index.html).\n3. Install the custom package into a library by running `install.packages`. 
\nFor example, from a notebook in your workspace: \n```\ninstall.packages(\npkgs = \"\/path\/to\/tar\/file\/<custom-package>.tar.gz\",\ntype = \"source\",\nrepos = NULL\n)\n\n``` \nOr: \n```\n%sh\nR CMD INSTALL \/path\/to\/tar\/file\/<custom-package>.tar.gz\n\n``` \nAfter you install a custom package into a library, add the library to the search path and then load the library with a single command. \nFor example: \n```\n# Add the library to the search path one time.\n.libPaths(c(\"\/path\/to\/tar\/file\/\", .libPaths()))\n\n# Load the library. You do not need to add the library to the search path again.\nlibrary(<custom-package>)\n\n``` \nTo install a custom package as a library on *each* node in a cluster, you must use [What are init scripts?](https:\/\/docs.databricks.com\/init-scripts\/index.html). \n### Visualizations \nDatabricks R notebooks support various types of visualizations using the `display` function. \n* [Visualizations in R](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#visualizations-in-r) \n### Jobs \nYou can automate R workloads as scheduled or triggered notebook [Create and run Databricks Jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) in Databricks. \n* For details on creating a job via the UI, see [Create a job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#create-a-job).\n* The [Jobs API](https:\/\/docs.databricks.com\/api\/workspace\/jobs) allows you to create, edit, and delete jobs.\n* The [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) provides a convenient command line interface for calling the Jobs API. \n### Machine learning \nDatabricks supports a wide variety of machine learning (ML) workloads, including traditional ML on tabular data, deep learning for computer vision and natural language processing, recommendation systems, graph analytics, and more. 
For general information about machine learning on Databricks, see [Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html). \nFor ML algorithms, you can use pre-installed libraries in [Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html). You can also [install custom libraries](https:\/\/docs.databricks.com\/libraries\/index.html). \nFor machine learning operations (MLOps), Databricks provides a managed service for the open source library MLflow. With [MLflow Tracking](https:\/\/docs.databricks.com\/mlflow\/tracking.html) you can record model development and save models in reusable formats. You can use the [MLflow Model Registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html) to manage and automate the promotion of models towards production. [Jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) and [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) allow hosting models as batch and streaming jobs as REST endpoints. For more information and examples, see the [ML lifecycle management using MLflow](https:\/\/docs.databricks.com\/mlflow\/index.html) or the [MLflow R API docs](https:\/\/mlflow.org\/docs\/latest\/R-api.html). \n### R developer tools \nIn addition to Databricks notebooks, you can also use the following R developer tools: \n* [RStudio on Databricks](https:\/\/docs.databricks.com\/sparkr\/rstudio.html)\n* [Shiny on Databricks](https:\/\/docs.databricks.com\/sparkr\/shiny.html)\n* [`renv` on Databricks](https:\/\/docs.databricks.com\/sparkr\/renv.html) \n* Use SparkR and RStudio Desktop with [Databricks Connect](https:\/\/docs.databricks.com\/dev-tools\/databricks-connect\/index.html).\n* Use sparklyr and RStudio Desktop with [Databricks Connect](https:\/\/docs.databricks.com\/dev-tools\/databricks-connect\/index.html). 
\n### R session customization \nIn Databricks Runtime 12.2 LTS and above, R sessions can be customized by using site-wide profile (`.Rprofile`) files. R notebooks will source the file as R code during startup. To modify the file, find the value of `R_HOME` and modify `$R_HOME\/etc\/Rprofile.site`. Note that Databricks has added configuration in the file to ensure proper functionality for hosted [RStudio on Databricks](https:\/\/docs.databricks.com\/sparkr\/rstudio.html). Removing any of it may cause RStudio to not work as expected. \nIn Databricks Runtime 11.3 LTS and below, this behavior can be enabled by setting the environment variable `DATABRICKS_ENABLE_RPROFILE=true`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/index.html"} +{"content":"# Develop on Databricks\n### Databricks for R developers\n#### Additional resources\n\n* [Knowledge Base](https:\/\/kb.databricks.com\/r-aws.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/index.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Use the Databricks notebook and file editor\n\nThis page describes some of the functions available with the Databricks notebook and file editor, including code suggestions and autocomplete, variable inspection, code folding, and side-by-side diffs. When you use the notebook or the file editor, Databricks Assistant is available to help you generate, explain, and debug code. See [Use Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/use-databricks-assistant.html) for details. \nYou can choose from a selection of editor themes. Select **View > Editor theme** and make a selection from the menu.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-editor.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Use the Databricks notebook and file editor\n##### Autocomplete\n\nAutocomplete automatically completes code segments as you type them. 
Completable objects include types, classes, and objects, as well as SQL database and table names. \n* For Python cells, the notebook must be [attached to a cluster](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) for autocomplete to work, and you must [run all cells](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html) that define completable objects.\n* For SQL cells, autocomplete suggests keywords and basic syntax even if the notebook is not attached to any compute resource. \n+ If the workspace is enabled for Unity Catalog, autocomplete also suggests catalog, schema, table, and column names for tables in Unity Catalog.\n+ If the workspace is not enabled for Unity Catalog, the notebook must be attached to a cluster or a [SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html) to suggest table or column names. \nAutocomplete suggestions automatically appear as you type in a cell. Use the up and down arrow keys or your mouse to select a suggestion, and press **Tab** or **Enter** to insert the selection into the cell. \nNote \nServer autocomplete in R notebooks is blocked during command execution. \nThere are two [user settings](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#configure-notebook-settings) to be aware of: \n* To turn off autocomplete suggestions, toggle **Autocomplete as you type**. When autocomplete is off, you can display autocomplete suggestions by pressing **Ctrl + Space**.\n* To prevent **Enter** from inserting autocomplete suggestions, toggle **Enter key accepts autocomplete suggestions**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-editor.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Use the Databricks notebook and file editor\n##### Variable inspection\n\nTo display information about a variable defined in a SQL or Python notebook, hover your cursor over the variable name. 
Python variable inspection requires Databricks Runtime 12.2 LTS or above. \n![how to inspect a variable](https:\/\/docs.databricks.com\/_images\/variable-inspection.png)\n\n#### Use the Databricks notebook and file editor\n##### Go to definition\n\nWhen a Python notebook is attached to a cluster, you can quickly go to the definition of a variable, function, or the code behind a `%run` statement. To do this, right-click the variable or function name, and then click **Go to definition** or **Peek definition**. \nYou can also hold down the **Cmd** key on macOS or **Ctrl** key on Windows and hover over the variable or function name. The name turns into a hyperlink if a definition is found. \n![how to get function definitions](https:\/\/docs.databricks.com\/_images\/go-to-definition.gif) \nThe \u201cgo to definition\u201d feature is available in Databricks Runtime 12.2 LTS and above.\n\n#### Use the Databricks notebook and file editor\n##### Code folding\n\nCode folding lets you temporarily hide sections of code. This can be helpful when working with long code blocks because it lets you focus on specific sections of code you are working on. \nTo hide code, place your cursor at the far left of a cell. Downward-pointing arrows appear at logical points where you can hide a section of code. Click the arrow to hide a code section. Click the arrow again (now pointing to the right) to show the code. 
\n![how to fold code](https:\/\/docs.databricks.com\/_images\/code-folding.gif) \nFor more details, including keyboard shortcuts, see the [VS Code documentation](https:\/\/code.visualstudio.com\/docs\/editor\/codebasics#_folding).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-editor.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Use the Databricks notebook and file editor\n##### Multicursor support\n\nYou can create multiple cursors to make simultaneous edits easier, as shown in the video: \n![how to use multiple cursors](https:\/\/docs.databricks.com\/_images\/multi-cursor.gif) \nTo create multiple cursors in a cell: \n* On macOS, hold down the `Option` key and click in each location to add a cursor.\n* On Windows, hold down the `Alt` key and click in each location to add a cursor.\n* You also have the option to change the shortcut. See [Change shortcut for multicursor and column selection](https:\/\/docs.databricks.com\/notebooks\/notebook-editor.html#key-modifier-for-editor). \nOn macOS, you can create multiple cursors that are vertically aligned by using the keyboard shortcut `Option`+`Command`+ up or down arrow key.\n\n#### Use the Databricks notebook and file editor\n##### Column (box) selection\n\nTo select multiple items in a column, click at the upper left of the area you want to capture. Then: \n* On macOS, press `Shift` + `Option` and drag to the lower right to capture one or more columns.\n* On Windows, press `Shift` + `Alt` and drag to the lower right to capture one or more columns.\n* You also have the option to change the shortcut. See [Change shortcut for multicursor and column selection](https:\/\/docs.databricks.com\/notebooks\/notebook-editor.html#key-modifier-for-editor). 
\n![how to select columns](https:\/\/docs.databricks.com\/_images\/select-columns.gif)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-editor.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Use the Databricks notebook and file editor\n##### Change shortcut for multicursor and column selection\n\nAn alternate shortcut is available for multicursor and column (box) selection. With the alternate selection, the shortcuts change as follows: \n* To create multiple cursors in a cell: \n+ On macOS, hold down the `Cmd` key and click in each location to add a cursor.\n+ On Windows, hold down the `Ctrl` key and click in each location to add a cursor.\n* To select multiple items in a column, click at the upper left of the area you want to capture. Then: \n+ On macOS, press `Option` and drag to the lower right to capture one or more columns.\n+ On Windows, press `Alt` and drag to the lower right to capture one or more columns. \nTo enable the alternate shortcuts, do the following: \n1. Click your username at the upper-right of the workspace, then click **Settings** in the dropdown list.\n2. In the **Settings** sidebar, select **Developer**.\n3. In the **Code editor** section, change the **Key modifier for multi-cursor click** setting to **Cmd** for macOS or **Ctrl** for Windows. \nWhen you enable alternate shortcuts, the keyboard shortcut for creating multiple cursors that are vertically aligned does not change.\n\n#### Use the Databricks notebook and file editor\n##### Bracket matching\n\nWhen you click near a parenthesis, square bracket, or curly brace, the editor highlights that character and its matching bracket. 
\n![show the corresponding bracket](https:\/\/docs.databricks.com\/_images\/bracket-matching.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-editor.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Use the Databricks notebook and file editor\n##### Side-by-side diff in version history\n\nWhen you [display previous notebook versions](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#version-control), the editor displays side-by-side diffs with color highlighting. \n![show the code diffs](https:\/\/docs.databricks.com\/_images\/code-diffs.png)\n\n#### Use the Databricks notebook and file editor\n##### Syntax error highlighting\n\nWhen a notebook is [connected to a cluster](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach), syntax errors are highlighted by a squiggly red line. For Python, the cluster must be running Databricks Runtime 12.2 LTS or above. \n![example of syntax error highlighting](https:\/\/docs.databricks.com\/_images\/syntax-error-example.png) \nTo enable or disable syntax error highlighting, do the following: \n1. Click your username at the upper-right of the workspace, then click **Settings** in the dropdown list.\n2. In the **Settings** sidebar, select **Developer**.\n3. In the **Code editor** section, toggle the setting for **SQL syntax error highlighting** or **Python syntax error highlighting**.\n\n#### Use the Databricks notebook and file editor\n##### Possible actions on syntax errors and warnings\n\nWhen you see a syntax error, you can hover over it and select **Quick Fix** for possible actions. \n![example for code actions on syntax error highlighting](https:\/\/docs.databricks.com\/_images\/syntax-error-quick-fix.png) \nNote \nThis feature uses [Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/use-databricks-assistant.html). 
If you don\u2019t see any actions, that means your administrator needs to enable Databricks Assistant first.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-editor.html"} +{"content":"# AI and Machine Learning on Databricks\n## Prepare data and environment for ML and DL\n#### Preprocess data for machine learning and deep learning\n\nYou can use [Databricks Feature Store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html) to create new features, explore and re-use existing features, select features for training and scoring machine learning models, and publish features to low-latency online stores for real-time inference. \nOn large datasets, you can use Spark SQL and MLlib for feature engineering. Third-party libraries included in Databricks Runtime ML such as scikit-learn also provide useful helper methods. For examples, see the following machine learning notebooks for scikit-learn and MLlib: \n* [Feature engineering with scikit-learn](https:\/\/docs.databricks.com\/machine-learning\/preprocess-data\/scikit-learn.html)\n* [Feature engineering with MLlib](https:\/\/docs.databricks.com\/machine-learning\/preprocess-data\/mllib.html) \nFor more complex deep learning feature processing, this example notebook illustrates how to use transfer learning for featurization: \n* [Featurization for transfer learning](https:\/\/docs.databricks.com\/machine-learning\/preprocess-data\/transfer-learning-tensorflow.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/preprocess-data\/index.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Run Delta Live Tables pipelines\n\nYou run Delta Live Tables pipelines by starting a pipeline *update*. Most commonly, you run full updates to refresh all of the datasets in a pipeline, but Delta Live Tables offers other update options to support different tasks. For example, you can run an update for only selected tables for testing or debugging. 
\nUpdates can be run manually in the Delta Live Tables UI or a Databricks notebook, or as a scheduled task using the REST API or the Databricks CLI. Updates can also be scheduled or included in a workflow using an orchestration tool such as Databricks Workflows or Apache Airflow. The articles in this section detail these options for running your Delta Live Tables pipelines.\n\n#### Run Delta Live Tables pipelines\n##### Run pipelines manually\n\nTo learn how the datasets defined in a pipeline are processed when an update is run, the different types of updates supported, and recommendations for selecting settings for updates, see [Run an update on a Delta Live Tables pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/updates.html).\n\n#### Run Delta Live Tables pipelines\n##### Run pipelines using orchestration tools\n\nTo learn how to run a pipeline on a scheduled basis or as part of a larger data processing workflow, see [Run a Delta Live Tables pipeline in a workflow](https:\/\/docs.databricks.com\/delta-live-tables\/workflows.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/run-schedule-updates.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Configure pipeline settings for Delta Live Tables\n\nThis article provides details on configuring pipeline settings for Delta Live Tables. Delta Live Tables provides a user interface for configuring and editing pipeline settings. The UI also provides an option to display and edit settings in JSON. \nNote \nYou can configure most settings with either the UI or a JSON specification. Some advanced options are only available using the JSON configuration. \nDatabricks recommends familiarizing yourself with Delta Live Tables settings using the UI. If necessary, you can directly edit the JSON configuration in the workspace. 
JSON configuration files are also useful when deploying pipelines to new environments or when using the CLI or [REST API](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html). \nFor a full reference to the Delta Live Tables JSON configuration settings, see [Delta Live Tables pipeline configurations](https:\/\/docs.databricks.com\/delta-live-tables\/properties.html#config-settings).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/settings.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Configure pipeline settings for Delta Live Tables\n###### Choose a product edition\n\nSelect the Delta Live Tables product edition with the features best suited for your pipeline requirements. The following product editions are available: \n* `Core` to run streaming ingest workloads. Select the `Core` edition if your pipeline doesn\u2019t require advanced features such as change data capture (CDC) or Delta Live Tables expectations.\n* `Pro` to run streaming ingest and CDC workloads. The `Pro` product edition supports all of the `Core` features, plus support for workloads that require updating tables based on changes in source data.\n* `Advanced` to run streaming ingest workloads, CDC workloads, and workloads that require expectations. The `Advanced` product edition supports the features of the `Core` and `Pro` editions, and also supports enforcement of data quality constraints with Delta Live Tables expectations. \nYou can select the product edition when you create or edit a pipeline. You can select a different edition for each pipeline. See the [Delta Live Tables product page](https:\/\/www.databricks.com\/product\/pricing\/delta-live). \nNote \nIf your pipeline includes features not supported by the selected product edition, for example, expectations, you will receive an error message with the reason for the error. 
You can then edit the pipeline to select the appropriate edition.\n\n##### Configure pipeline settings for Delta Live Tables\n###### Choose a pipeline mode\n\nYou can update your pipeline continuously or with manual triggers based on the pipeline mode. See [Continuous vs. triggered pipeline execution](https:\/\/docs.databricks.com\/delta-live-tables\/updates.html#continuous-triggered).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/settings.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Configure pipeline settings for Delta Live Tables\n###### Select a cluster policy\n\nUsers must have permissions to deploy compute to configure and update Delta Live Tables pipelines. Workspace admins can configure cluster policies to provide users with access to compute resources for Delta Live Tables. See [Define limits on Delta Live Tables pipeline compute](https:\/\/docs.databricks.com\/admin\/clusters\/policy-definition.html#dlt). \nNote \n* Cluster policies are optional. 
Check with your workspace administrator if you lack compute privileges required for Delta Live Tables.\n* To ensure that cluster policy default values are correctly applied, set the `apply_policy_default_values` value to `true` in the [cluster configurations](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#cluster-config) in your pipeline configuration: \n```\n{\n\"clusters\": [\n{\n\"label\": \"default\",\n\"policy_id\": \"<policy-id>\",\n\"apply_policy_default_values\": true\n}\n]\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/settings.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Configure pipeline settings for Delta Live Tables\n###### Configure source code libraries\n\nYou can use the file selector in the Delta Live Tables UI to configure the source code defining your pipeline. Pipeline source code is defined in Databricks notebooks or in SQL or Python scripts stored in workspace files. When you create or edit your pipeline, you can add one or more notebooks or workspace files or a combination of notebooks and workspace files. \nBecause Delta Live Tables automatically analyzes dataset dependencies to construct the processing graph for your pipeline, you can add source code libraries in any order. \nYou can also modify the JSON file to include Delta Live Tables source code defined in SQL and Python scripts stored in workspace files. 
The following example includes notebooks and workspace files: \n```\n{\n\"name\": \"Example pipeline 3\",\n\"storage\": \"dbfs:\/pipeline-examples\/storage-location\/example3\",\n\"libraries\": [\n{ \"notebook\": { \"path\": \"\/example-notebook_1\" } },\n{ \"notebook\": { \"path\": \"\/example-notebook_2\" } },\n{ \"file\": { \"path\": \"\/Workspace\/Users\/<user-name>@databricks.com\/Apply_Changes_Into\/apply_changes_into.sql\" } },\n{ \"file\": { \"path\": \"\/Workspace\/Users\/<user-name>@databricks.com\/Apply_Changes_Into\/apply_changes_into.py\" } }\n]\n}\n\n```\n\n##### Configure pipeline settings for Delta Live Tables\n###### Specify a storage location\n\nYou can specify a storage location for a pipeline that publishes to the Hive metastore. The primary motivation for specifying a location is to control the object storage location for data written by your pipeline. \nBecause all tables, data, checkpoints, and metadata for Delta Live Tables pipelines are fully managed by Delta Live Tables, most interaction with Delta Live Tables datasets happens through tables registered to the Hive metastore or Unity Catalog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/settings.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Configure pipeline settings for Delta Live Tables\n###### Specify a target schema for pipeline output tables\n\nWhile optional, you should specify a target to publish tables created by your pipeline anytime you move beyond development and testing for a new pipeline. Publishing a pipeline to a target makes datasets available for querying elsewhere in your Databricks environment. 
See [Publish data from Delta Live Tables to the Hive metastore](https:\/\/docs.databricks.com\/delta-live-tables\/publish.html) or [Use Unity Catalog with your Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/settings.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Configure pipeline settings for Delta Live Tables\n###### Configure your compute settings\n\nEach Delta Live Tables pipeline has two associated clusters: \n* The `updates` cluster processes pipeline updates.\n* The `maintenance` cluster runs daily maintenance tasks. \nThe configuration used by these clusters is determined by the `clusters` attribute specified in your pipeline settings. \nYou can add compute settings that apply to only a specific type of cluster by using cluster labels. There are three labels you can use when configuring pipeline clusters: \nNote \nThe cluster label setting can be omitted if you are defining only one cluster configuration. The `default` label is applied to cluster configurations if no setting for the label is provided. The cluster label setting is required only if you need to customize settings for different cluster types. \n* The `default` label defines compute settings to apply to both the `updates` and `maintenance` clusters. Applying the same settings to both clusters improves the reliability of maintenance runs by ensuring that required configurations, for example, data access credentials for a storage location, are applied to the maintenance cluster.\n* The `maintenance` label defines compute settings to apply to only the `maintenance` cluster. You can also use the `maintenance` label to override settings configured by the `default` label.\n* The `updates` label defines settings to apply to only the `updates` cluster. 
Use the `updates` label to configure settings that should not be applied to the `maintenance` cluster. \nSettings defined using the `default` and `updates` labels are merged to create the final configuration for the `updates` cluster. If the same setting is defined using both `default` and `updates` labels, the setting defined with the `updates` label overrides the setting defined with the `default` label. \nThe following example defines a Spark configuration parameter that is added only to the configuration for the `updates` cluster: \n```\n{\n\"clusters\": [\n{\n\"label\": \"default\",\n\"autoscale\": {\n\"min_workers\": 1,\n\"max_workers\": 5,\n\"mode\": \"ENHANCED\"\n}\n},\n{\n\"label\": \"updates\",\n\"spark_conf\": {\n\"key\": \"value\"\n}\n}\n]\n}\n\n``` \nDelta Live Tables provides similar options for cluster settings as other compute on Databricks. Like other pipeline settings, you can modify the JSON configuration for clusters to specify options not present in the UI. See [Compute](https:\/\/docs.databricks.com\/compute\/index.html). \nNote \n* Because the Delta Live Tables runtime manages the lifecycle of pipeline clusters and runs a custom version of Databricks Runtime, you cannot manually set some cluster settings in a pipeline configuration, such as the Spark version or cluster names. See [Cluster attributes that are not user settable](https:\/\/docs.databricks.com\/delta-live-tables\/properties.html#non-settable-attrs).\n* You can configure Delta Live Tables pipelines to leverage Photon. 
See [What is Photon?](https:\/\/docs.databricks.com\/compute\/photon.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/settings.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Configure pipeline settings for Delta Live Tables\n###### Select instance types to run a pipeline\n\nBy default, Delta Live Tables selects the instance types for the driver and worker nodes that run your pipeline, but you can also manually configure the instance types. For example, you might want to select instance types to improve pipeline performance or address memory issues when running your pipeline. You can configure instance types when you [create](https:\/\/docs.databricks.com\/api\/workspace\/pipelines\/create) or [edit](https:\/\/docs.databricks.com\/api\/workspace\/pipelines\/update) a pipeline with the REST API, or in the Delta Live Tables UI. \nTo configure instance types when you create or edit a pipeline in the Delta Live Tables UI: \n1. Click the **Settings** button.\n2. In the **Advanced** section of the pipeline settings, in the **Worker type** and **Driver type** drop-down menus, select the instance types for the pipeline. \nTo configure instance types in the pipeline\u2019s JSON settings, click the **JSON** button and enter the instance type configurations in the cluster configuration: \nNote \nTo avoid assigning unnecessary resources to the `maintenance` cluster, this example uses the `updates` label to set the instance types for only the `updates` cluster. To assign the instance types to both `updates` and `maintenance` clusters, use the `default` label or omit the setting for the label. The `default` label is applied to pipeline cluster configurations if no setting for the label is provided. See [Configure your compute settings](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#cluster-config). 
\n```\n{\n\"clusters\": [\n{\n\"label\": \"updates\",\n\"node_type_id\": \"r6i.xlarge\",\n\"driver_node_type_id\": \"i3.large\",\n\"...\" : \"...\"\n}\n]\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/settings.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Configure pipeline settings for Delta Live Tables\n###### Use autoscaling to increase efficiency and reduce resource usage\n\nUse Enhanced Autoscaling to optimize the cluster utilization of your pipelines. Enhanced Autoscaling adds additional resources only if the system determines those resources will increase pipeline processing speed. Resources are freed when no longer needed, and clusters are shut down as soon as all pipeline updates are complete. \nTo learn more about Enhanced Autoscaling, including configuration details, see [Optimize the cluster utilization of Delta Live Tables pipelines with Enhanced Autoscaling](https:\/\/docs.databricks.com\/delta-live-tables\/auto-scaling.html).\n\n##### Configure pipeline settings for Delta Live Tables\n###### Delay compute shutdown\n\nBecause a Delta Live Tables cluster automatically shuts down when not in use, referencing a cluster policy that sets `autotermination_minutes` in your cluster configuration results in an error. To control cluster shutdown behavior, you can use development or production mode or use the `pipelines.clusterShutdown.delay` setting in the pipeline configuration. The following example sets the `pipelines.clusterShutdown.delay` value to 60 seconds: \n```\n{\n\"configuration\": {\n\"pipelines.clusterShutdown.delay\": \"60s\"\n}\n}\n\n``` \nWhen `production` mode is enabled, the default value for `pipelines.clusterShutdown.delay` is `0 seconds`. 
When `development` mode is enabled, the default value is `2 hours`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/settings.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Configure pipeline settings for Delta Live Tables\n###### Create a single node cluster\n\nIf you set `num_workers` to 0 in cluster settings, the cluster is created as a [Single Node cluster](https:\/\/docs.databricks.com\/compute\/configure.html#single-node). Configuring an autoscaling cluster and setting `min_workers` to 0 and `max_workers` to 0 also creates a Single Node cluster. \nIf you configure an autoscaling cluster and set only `min_workers` to 0, then the cluster is not created as a Single Node cluster. The cluster has at least one active worker at all times until terminated. \nAn example cluster configuration to create a Single Node cluster in Delta Live Tables: \n```\n{\n\"clusters\": [\n{\n\"num_workers\": 0\n}\n]\n}\n\n```\n\n##### Configure pipeline settings for Delta Live Tables\n###### Configure cluster tags\n\nYou can use [cluster tags](https:\/\/docs.databricks.com\/compute\/configure.html#tags) to monitor usage for your pipeline clusters. Add cluster tags in the Delta Live Tables UI when you create or edit a pipeline, or by editing the JSON settings for your pipeline clusters.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/settings.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Configure pipeline settings for Delta Live Tables\n###### Cloud storage configuration\n\nYou use AWS instance profiles to configure access to [S3 storage in AWS](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html). To add an instance profile in the Delta Live Tables UI when you create or edit a pipeline: \n1. 
On the **Pipeline details** page for your pipeline, click the **Settings** button.\n2. In the **Compute** section of the pipeline settings, select an instance profile from the **Instance profile** drop-down menu. \nTo configure an AWS instance profile by editing the JSON settings for your pipeline clusters, click the **JSON** button and enter the instance profile configuration in the `aws_attributes.instance_profile_arn` field in the cluster configuration: \n```\n{\n\"clusters\": [\n{\n\"aws_attributes\": {\n\"instance_profile_arn\": \"arn:aws:...\"\n}\n}\n]\n}\n\n``` \nYou can also configure instance profiles when you create cluster policies for your Delta Live Tables pipelines. For an example, see the [knowledge base](https:\/\/kb.databricks.com\/clusters\/set-instance-profile-arn-optional-policy).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/settings.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Configure pipeline settings for Delta Live Tables\n###### Parameterize pipelines\n\nThe Python and SQL code that defines your datasets can be parameterized by the pipeline\u2019s settings. Parameterization enables the following use cases: \n* Separating long paths and other variables from your code.\n* Reducing the amount of data processed in development or staging environments to speed up testing.\n* Reusing the same transformation logic to process data from multiple data sources. 
\nThe following example uses the `startDate` configuration value to limit the development pipeline to a subset of the input data: \n```\nCREATE OR REFRESH LIVE TABLE customer_events\nAS SELECT * FROM sourceTable WHERE date > '${mypipeline.startDate}';\n\n``` \n```\nimport dlt\nfrom pyspark.sql.functions import col\n\n@dlt.table\ndef customer_events():\n    start_date = spark.conf.get(\"mypipeline.startDate\")\n    return spark.read.table(\"sourceTable\").where(col(\"date\") > start_date)\n\n``` \n```\n{\n\"name\": \"Data Ingest - DEV\",\n\"configuration\": {\n\"mypipeline.startDate\": \"2021-01-02\"\n}\n}\n\n``` \n```\n{\n\"name\": \"Data Ingest - PROD\",\n\"configuration\": {\n\"mypipeline.startDate\": \"2010-01-02\"\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/settings.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Configure pipeline settings for Delta Live Tables\n###### Pipelines trigger interval\n\nYou can use `pipelines.trigger.interval` to control the trigger interval for a flow updating a table or an entire pipeline. Because a triggered pipeline processes each table only once, the `pipelines.trigger.interval` is used only with continuous pipelines. \nDatabricks recommends setting `pipelines.trigger.interval` on individual tables because of different defaults for streaming versus batch queries. Set the value on a pipeline only when your processing requires controlling updates for the entire pipeline graph. 
\nYou set `pipelines.trigger.interval` on a table using `spark_conf` in Python, or `SET` in SQL: \n```\n@dlt.table(\n    spark_conf={\"pipelines.trigger.interval\": \"10 seconds\"}\n)\ndef <function-name>():\n    return (<query>)\n\n``` \n```\nSET pipelines.trigger.interval=10 seconds;\n\nCREATE OR REFRESH LIVE TABLE TABLE_NAME\nAS SELECT ...\n\n``` \nTo set `pipelines.trigger.interval` on a pipeline, add it to the `configuration` object in the pipeline settings: \n```\n{\n\"configuration\": {\n\"pipelines.trigger.interval\": \"10 seconds\"\n}\n}\n\n```\n\n##### Configure pipeline settings for Delta Live Tables\n###### Allow non-admin users to view the driver logs from a Unity Catalog-enabled pipeline\n\nBy default, only the pipeline owner and workspace admins have permission to view the driver logs from the cluster that runs a Unity Catalog-enabled pipeline. You can enable access to the driver logs for any user with [CAN MANAGE, CAN VIEW, or CAN RUN permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#dlt) by adding the following Spark configuration parameter to the `configuration` object in the pipeline settings: \n```\n{\n\"configuration\": {\n\"spark.databricks.acl.needAdminPermissionToViewLogs\": \"false\"\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/settings.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Configure pipeline settings for Delta Live Tables\n###### Add email notifications for pipeline events\n\nYou can configure one or more email addresses to receive notifications when the following occurs: \n* A pipeline update completes successfully.\n* A pipeline update fails, either with a retryable or a non-retryable error. Select this option to receive a notification for all pipeline failures.\n* A pipeline update fails with a non-retryable (fatal) error. 
Select this option to receive a notification only when a non-retryable error occurs.\n* A single data flow fails. \nTo configure email notifications when you [create](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#create-pipeline) or edit a pipeline: \n1. Click **Add notification**.\n2. Enter one or more email addresses to receive notifications.\n3. Click the check box for each notification type to send to the configured email addresses.\n4. Click **Add notification**.\n\n##### Configure pipeline settings for Delta Live Tables\n###### Control tombstone management for SCD type 1 queries\n\nThe following settings can be used to control the behavior of tombstone management for `DELETE` events during SCD type 1 processing: \n* **`pipelines.applyChanges.tombstoneGCThresholdInSeconds`**: Set this value to match the highest expected interval, in seconds, between out-of-order data. The default is 172800 seconds (2 days).\n* **`pipelines.applyChanges.tombstoneGCFrequencyInSeconds`**: This setting controls how frequently, in seconds, tombstones are checked for cleanup. The default is 1800 seconds (30 minutes). \nSee [APPLY CHANGES API: Simplify change data capture in Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/settings.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Configure pipeline settings for Delta Live Tables\n###### Configure pipeline permissions\n\nYou must have the `CAN MANAGE` or `IS OWNER` permission on the pipeline in order to manage permissions on it. \n1. In the sidebar, click **Delta Live Tables**.\n2. Select the name of a pipeline.\n3. Click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png), and select **Permissions**.\n4. 
In **Permissions Settings**, select the **Select User, Group or Service Principal\u2026** drop-down menu and then select a user, group, or service principal. \n![Permissions Settings dialog](https:\/\/docs.databricks.com\/_images\/select-permission-job.png)\n5. Select a permission from the permission drop-down menu.\n6. Click **Add**.\n7. Click **Save**.\n\n##### Configure pipeline settings for Delta Live Tables\n###### Enable RocksDB state store for Delta Live Tables\n\nYou can enable RocksDB-based state management by setting the following configuration before deploying a pipeline: \n```\n{\n\"configuration\": {\n\"spark.sql.streaming.stateStore.providerClass\": \"com.databricks.sql.streaming.state.RocksDBStateStoreProvider\"\n}\n}\n\n``` \nTo learn more about the RocksDB state store, including configuration recommendations for RocksDB, see [Configure RocksDB state store on Databricks](https:\/\/docs.databricks.com\/structured-streaming\/rocksdb-state-store.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/settings.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Set up Databricks Git folders (Repos)\n\nLearn how to set up Databricks Git folders (formerly Repos) for version control. Once you set up Git folders in your Databricks, you can perform common Git operations such as clone, checkout, commit, push, pull, and branch management on them from the Databricks UI. You can also see diffs for your changes as you develop with notebooks and files in Databricks.\n\n#### Set up Databricks Git folders (Repos)\n##### Configure user settings\n\nDatabricks Git folders uses a personal access token (PAT) or an equivalent credential to authenticate with your Git provider to perform operations such as clone, push, pull etc. To use Git folders, you must first add your Git PAT and Git provider username to Databricks. 
See [Configure Git credentials & connect a remote repo to Databricks](https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html). \nYou can clone public remote repositories without Git credentials (a personal access token and a username). To modify a public remote repository or to clone or modify a private remote repository, you must have a Git provider username and PAT with **Write** (or greater) permissions for the remote repository. \nGit folders are enabled by default. For more details on enabling or disabling Git folder support, see [Enable or disable the Databricks Git folder feature](https:\/\/docs.databricks.com\/repos\/enable-disable-repos-with-api.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/repos-setup.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Set up Databricks Git folders (Repos)\n##### Add or edit Git credentials in Databricks\n\nImportant \nDatabricks Git folders support just one Git credential per user, per workspace. \n1. Select the down arrow next to the account name at the top right of your screen, and then select **Settings**.\n2. Select the **Linked accounts** tab.\n3. If you\u2019re adding credentials for the first time, follow the on-screen instructions. \nIf you have previously entered credentials, click **Config** > **Edit** and go to the next step.\n4. In the Git provider drop-down, select the provider name.\n5. Enter your Git user name or email.\n6. In the **Token** field, add a personal access token (PAT) or other credentials from your Git provider. For details, see [Configure Git credentials & connect a remote repo to Databricks](https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html) \nImportant \nDatabricks recommends that you set an expiration date for all personal access tokens. \nFor Azure DevOps, Git integration does not support Microsoft Entra ID (formerly Azure Active Directory) tokens. 
You must use an Azure DevOps personal access token. See [Connect to Azure DevOps project using a DevOps token](https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html#diff-tenancy). \nIf your organization has SAML SSO enabled in GitHub, [authorize your personal access token for SSO](https:\/\/docs.github.com\/en\/github\/authenticating-to-github\/authorizing-a-personal-access-token-for-use-with-saml-single-sign-on).\n7. Enter your username in the **Git provider username** field.\n8. Click **Save**. \nYou can also save a Git PAT and username to Databricks using the [Databricks Repos API](https:\/\/docs.databricks.com\/api\/workspace\/repos).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/repos-setup.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Set up Databricks Git folders (Repos)\n##### Network connectivity between Databricks Git folders and a Git provider\n\nGit folders need network connectivity to your Git provider to function. Ordinarily, this is over the internet and works out of the box. However, you might have set up additional restrictions on your Git provider for controlling access. For example, you might have an IP allow list in place, or you might host your own on-premises Git server using services like GitHub Enterprise (GHE), Bitbucket Server (BBS), or GitLab Self-managed. Depending on your network hosting and configuration, your Git server might not be accessible via the internet. \nNote \n* If your Git server is internet-accessible but has an IP allowlist in place, such as [GitHub allow lists](https:\/\/docs.github.com\/organizations\/keeping-your-organization-secure\/managing-security-settings-for-your-organization\/managing-allowed-ip-addresses-for-your-organization), you must add Databricks control plane NAT IPs to the Git server\u2019s IP allowlist. 
See [Databricks clouds and regions](https:\/\/docs.databricks.com\/resources\/supported-regions.html) for a list of control plane NAT IP addresses by region. Use the IP for the region that your Databricks workspace is in.\n* If you are privately hosting a Git server, read [Set up private Git connectivity for Databricks Git folders (Repos)](https:\/\/docs.databricks.com\/repos\/git-proxy.html) or contact your Databricks account team for onboarding instructions for access.\n\n#### Set up Databricks Git folders (Repos)\n##### Security features in Git folders\n\nDatabricks Git folders have many security features. The following sections walk you through their setup and use: \n* Use of encrypted Git credentials\n* An allowlist\n* Workspace access control\n* Audit logging\n* Secrets detection\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/repos-setup.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Set up Databricks Git folders (Repos)\n##### Bring your own key: Encrypt Git credentials\n\nYou can use AWS Key Management Service to encrypt a Git personal access token (PAT) or other Git credential. Using a key from an encryption service is referred to as a customer-managed key (CMK) or bring your own key (BYOK). \nFor more information, see [Customer-managed keys for managed services](https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html#managed-services).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/repos-setup.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Set up Databricks Git folders (Repos)\n##### Restrict usage to URLs in an allow list\n\nA workspace admin can limit which remote repositories users can clone from and commit & push to. This helps prevent exfiltration of your code; for example, users cannot push code to an arbitrary repository if you have turned on the allow list restrictions. 
You can also prevent users from using unlicensed code by restricting clone operations to a list of allowed repositories. \nTo set up an allow list: \n1. Go to the [settings page](https:\/\/docs.databricks.com\/admin\/index.html#admin-settings).\n2. Click the **Workspace admin** tab (it is open by default).\n3. In the **Development** section, choose an option from **Git URL allow list permission**: \n* **Disabled (no restrictions)**: There are no checks against the allow list.\n* **Restrict Clone, Commit & Push to Allowed Git Repositories**: Clone, commit, and push operations are allowed only for repository URLs in the allow list.\n* **Only Restrict Commit & Push to Allowed Git Repositories**: Commit and push operations are allowed only for repository URLs in the allow list. Clone and pull operations are not restricted. \n![The Development pane under Admin Settings, used to set user Git access](https:\/\/docs.databricks.com\/_images\/git-folder-admin-pane.png) \n4. Click the **Edit** button next to **Git URL allow list: Empty list** and enter a comma-separated list of URL prefixes. \n![The Edit allow list button in the Development admin settings](https:\/\/docs.databricks.com\/_images\/git-folder-admin-pane2.png) \n5. Click **Save**. \nNote \n* The list you save overwrites the existing set of saved URL prefixes.\n* It can take up to 15 minutes for the changes to take effect.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/repos-setup.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Set up Databricks Git folders (Repos)\n##### Allow access to all repositories\n\nTo disable an existing allow list and allow access to all repositories: \n1. Go to the [settings page](https:\/\/docs.databricks.com\/admin\/index.html#admin-settings).\n2. Click the **Workspace admin** tab.\n3. 
In the **Development** section, under **Git URL allow list permission**, select **Disabled (no restrictions)**.\n\n#### Set up Databricks Git folders (Repos)\n##### Control access for a repo in your workspace\n\nNote \nAccess control is available only in the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons). \nSet permissions for a repo to control access. Permissions for a repo apply to all content in that repo. You can assign five permission levels to files: NO PERMISSIONS, CAN READ, CAN RUN, CAN EDIT, and CAN MANAGE. \nFor more details on Git folder permissions, see [Git folder ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#git-folders).\n\n#### Set up Databricks Git folders (Repos)\n##### (Optional) Set up a proxy for enterprise Git servers\n\nIf your company uses an on-premises enterprise Git service, such as GitHub Enterprise or Azure DevOps Server, you can use the [Databricks Git Server Proxy](https:\/\/docs.databricks.com\/repos\/git-proxy.html) to connect your Databricks workspaces to the repos it serves.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/repos-setup.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Set up Databricks Git folders (Repos)\n##### Audit logging\n\nWhen [audit logging](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html) is enabled, audit events are logged when you interact with a Git folder. For example, an audit event is logged when you create, update, or delete a Git folder, when you list all Git folders associated with a workspace, and when you sync changes between your Git folder and the remote Git repo.\n\n#### Set up Databricks Git folders (Repos)\n##### Secrets detection\n\nGit folders scan code for access key IDs that begin with the prefix `AKIA` and warn the user before committing. 
\n### Use a repo config file \nYou can add settings for each notebook to your repo in a `.databricks\/commit_outputs` file that you create manually. \nSpecify the notebooks whose outputs you want to include, using patterns similar to [gitignore patterns](https:\/\/git-scm.com\/docs\/gitignore). \n### Patterns for a repo config file \nThe file contains positive and negative file path patterns. File path patterns include a notebook file extension such as `.ipynb`. \n* Positive patterns enable output inclusion for matching notebooks.\n* Negative patterns disable output inclusion for matching notebooks. \nPatterns are evaluated in order for all notebooks. Invalid paths or paths not resolving to `.ipynb` notebooks are ignored. \n**To include outputs from a notebook path** `folder\/innerfolder\/notebook.ipynb`, use the following patterns: \n```\n**\/*\nfolder\/**\nfolder\/innerfolder\/note*\n\n``` \n**To exclude outputs for a notebook,** check that none of the positive patterns match, or add a negative pattern at the correct position in the configuration file. Negative (exclude) patterns start with `!`: \n```\n!folder\/innerfolder\/*.ipynb\n!folder\/**\/*.ipynb\n!**\/notebook.ipynb\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/repos-setup.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Set up Databricks Git folders (Repos)\n##### Move Git folder to trash (delete)\n\nTo delete a Git folder from your workspace: \n1. Right-click the Git folder, and then select **Move to trash.**\n2. In the dialog box, type the name of the Git folder you want to delete. 
Then, click **Confirm & move to trash.** \n![Confirm Move to Trash dialog box.](https:\/\/docs.databricks.com\/_images\/repos-move-to-trash.png)\n\n#### Set up Databricks Git folders (Repos)\n##### Next steps\n\n* [Run Git operations on Databricks Git folders (Repos)](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html)\n* [What are workspace files?](https:\/\/docs.databricks.com\/files\/workspace.html)\n* [CI\/CD techniques with Git and Databricks Git folders (Repos)](https:\/\/docs.databricks.com\/repos\/ci-cd-techniques-with-repos.html)\n* [Set up private Git connectivity for Databricks Git folders (Repos)](https:\/\/docs.databricks.com\/repos\/git-proxy.html)\n* [Run a first dbt job with Git folders](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbt-in-workflows.html#first-dbt-job)\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/repos-setup.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n#### PyTorch\n\n[PyTorch project](https:\/\/github.com\/pytorch) is a Python package that provides GPU accelerated tensor computation and\nhigh level functionalities for building deep learning networks.\nFor licensing details, see the PyTorch [license doc on GitHub](https:\/\/github.com\/pytorch\/pytorch\/blob\/a90c259edad1ea4fa1b8773e3cb37240df680d62\/LICENSE). \nTo monitor and debug your PyTorch models, consider using [TensorBoard](https:\/\/docs.databricks.com\/machine-learning\/train-model\/tensorboard.html). \nPyTorch is included in Databricks Runtime for Machine Learning. If you are using Databricks Runtime, see [Install PyTorch](https:\/\/docs.databricks.com\/machine-learning\/train-model\/pytorch.html#install-pytorch) for instructions on installing PyTorch. \nNote \nThis is not a comprehensive guide to PyTorch. 
For more information, see the [PyTorch website](https:\/\/pytorch.org\/).\n\n#### PyTorch\n##### Single node and distributed training\n\nTo test and migrate single-machine workflows, use a [Single Node cluster](https:\/\/docs.databricks.com\/compute\/configure.html#single-node). \nFor distributed training options for deep learning, see [Distributed training](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/index.html).\n\n#### PyTorch\n##### Example notebook\n\n### PyTorch notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/pytorch-single-node.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/pytorch.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n#### PyTorch\n##### Install PyTorch\n\n### Databricks Runtime for ML \n[Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html) includes PyTorch so you can create the cluster and start using PyTorch. For the version of PyTorch installed in the Databricks Runtime ML version you are using, see the [release notes](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). \n### Databricks Runtime \nDatabricks recommends that you use the PyTorch included in Databricks Runtime for Machine Learning. However, if you must use [the standard Databricks Runtime](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html), PyTorch can be installed as a [Databricks PyPI library](https:\/\/docs.databricks.com\/libraries\/index.html). 
The following example shows how to install PyTorch 1.5.0: \n* On GPU clusters, install `pytorch` and `torchvision` by specifying the following: \n+ `torch==1.5.0`\n+ `torchvision==0.6.0`\n* On CPU clusters, install `pytorch` and `torchvision` by using the following Python wheel files: \n```\nhttps:\/\/download.pytorch.org\/whl\/cpu\/torch-1.5.0%2Bcpu-cp37-cp37m-linux_x86_64.whl\n\nhttps:\/\/download.pytorch.org\/whl\/cpu\/torchvision-0.6.0%2Bcpu-cp37-cp37m-linux_x86_64.whl\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/pytorch.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n#### PyTorch\n##### Errors and troubleshooting for distributed PyTorch\n\nThe following sections describe common error messages and troubleshooting guidance for the classes: [PyTorch DataParallel](https:\/\/pytorch.org\/docs\/stable\/generated\/torch.nn.DataParallel.html) or [PyTorch DistributedDataParallel](https:\/\/pytorch.org\/docs\/stable\/generated\/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel). Most of these errors can likely be resolved with [TorchDistributor](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/reference\/api\/pyspark.ml.torch.distributor.TorchDistributor.html), which is available on Databricks Runtime ML 13.0 and above. However, if `TorchDistributor` is not a viable solution, recommended solutions are also provided within each section. \nThe following is an example of how to use TorchDistributor: \n```\n\nfrom pyspark.ml.torch.distributor import TorchDistributor\n\ndef train_fn(learning_rate):\n    # ...\n    pass\n\nnum_processes = 2\ndistributor = TorchDistributor(num_processes=num_processes, local_mode=True)\n\ndistributor.run(train_fn, 1e-3)\n\n``` \n### `process 0 terminated with exit code 1` \nThis error occurs when using notebooks, regardless of environment: Databricks, local machine, etc. 
To avoid this error, use `torch.multiprocessing.start_processes` with `start_method=\"fork\"` instead of `torch.multiprocessing.spawn`. \nFor example: \n```\nimport torch\n\ndef train_fn(rank, learning_rate):\n    # required setup, e.g. setup(rank)\n    # ...\n    pass\n\nnum_processes = 2\ntorch.multiprocessing.start_processes(train_fn, args=(1e-3,), nprocs=num_processes, start_method=\"fork\")\n\n``` \n### `The server socket has failed to bind to [::]:{PORT NUMBER} (errno: 98 - Address already in use).` \nThis error appears when you restart distributed training after interrupting the cell while training is in progress. \nTo resolve, restart the cluster. If that does not solve the problem, there may be an error in the training function code. \n### CUDA related errors \nYou can run into additional issues with CUDA since `start_method=\"fork\"` is [not CUDA-compatible](https:\/\/github.com\/pytorch\/pytorch\/blob\/master\/torch\/multiprocessing\/spawn.py#L173). Using any `.cuda` commands in any cell might lead to failures. To avoid these errors, add the following check before you call `torch.multiprocessing.start_processes`: \n```\nif torch.cuda.is_initialized():\n    raise Exception(\"CUDA was initialized; distributed training will fail.\")  # or something similar\n\n``` \n* [Train a PyTorch model](https:\/\/docs.databricks.com\/mlflow\/tracking-ex-pytorch.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/pytorch.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Get started with MLflow experiments\n##### Quickstart Java and Scala\n###### MLflow quickstart Scala notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-quick-start-scala.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nTo view the experiment, run, and notebook revision used in the quickstart: \n1. 
Open the experiment `\/Shared\/Quickstart` in the workspace: \n![Quickstart experiment](https:\/\/docs.databricks.com\/_images\/quick-start-exp.png)\n2. Click a date to view a run: \n![View run](https:\/\/docs.databricks.com\/_images\/quick-start-run.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/quick-start-java-scala.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n#### Hyperparameter tuning\n\nDatabricks Runtime for Machine Learning incorporates Hyperopt, an open source tool that automates the process of model selection and hyperparameter tuning.\n\n#### Hyperparameter tuning\n##### Hyperparameter tuning with Ray\n\nDatabricks Runtime ML includes [Ray](https:\/\/docs.ray.io\/en\/latest\/ray-overview\/index.html), an open-source framework that specializes in parallel compute processing for scaling ML workflows and AI applications. See [Use Ray on Databricks](https:\/\/docs.databricks.com\/machine-learning\/ray-integration.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n#### Hyperparameter tuning\n##### Hyperparameter tuning with Hyperopt\n\nDatabricks Runtime ML includes [Hyperopt](https:\/\/github.com\/hyperopt\/hyperopt), a Python library that facilitates distributed hyperparameter tuning and model selection. With Hyperopt, you can scan a set of Python models while varying algorithms and hyperparameters across spaces that you define. Hyperopt works with both distributed ML algorithms such as Apache Spark MLlib and Horovod, as well as with single-machine ML models such as scikit-learn and TensorFlow. \nThe basic steps when using Hyperopt are: \n1. Define an objective function to minimize. Typically this is the training or validation loss.\n2. Define the hyperparameter search space. 
Hyperopt provides a conditional search space, which lets you compare different ML algorithms in the same run.\n3. Specify the search algorithm. Hyperopt uses stochastic tuning algorithms that perform a more efficient search of hyperparameter space than a deterministic grid search.\n4. Run the Hyperopt function `fmin()`. `fmin()` takes the items you defined in the previous steps and identifies the set of hyperparameters that minimizes the objective function. \nTo get started quickly using Hyperopt with scikit-learn algorithms, see: \n* [Parallelize hyperparameter tuning with scikit-learn and MLflow](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-spark-mlflow-integration.html)\n* [Compare model types with Hyperopt and MLflow](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-model-selection.html) \nFor more details about how Hyperopt works, and for additional examples, see: \n* [Hyperopt concepts](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-concepts.html)\n* [Use distributed training algorithms with Hyperopt](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-distributed-ml.html)\n* [Best practices: Hyperparameter tuning with Hyperopt](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-best-practices.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n#### Hyperparameter tuning\n##### Automated MLflow tracking\n\nNote \nMLlib automated MLflow tracking is deprecated and disabled by default on clusters that run Databricks Runtime 10.4 LTS ML and above. 
Instead, use [MLflow PySpark ML autologging](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.pyspark.ml.html#mlflow.pyspark.ml.autolog) by calling `mlflow.pyspark.ml.autolog()`, which is enabled by default with [Databricks Autologging](https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html). \nTo use the old MLlib automated MLflow tracking in Databricks Runtime 10.4 LTS ML and above, enable it by setting the [Spark configurations](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration) `spark.databricks.mlflow.trackMLlib.enabled true` and `spark.databricks.mlflow.autologging.enabled false`. \n* [Apache Spark MLlib and automated MLflow tracking](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/mllib-mlflow-integration.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/index.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Track and export serving endpoint health metrics to Prometheus and Datadog\n\nThis article provides an overview of serving endpoint health metrics and shows how to use the metrics export API to export endpoint metrics to [Prometheus](https:\/\/prometheus.io\/docs\/introduction\/overview\/) and [Datadog](https:\/\/docs.datadoghq.com\/api\/latest\/). \nEndpoint health metrics measure infrastructure metrics such as latency, request rate, error rate, CPU usage, and memory usage. They tell you how your serving infrastructure is behaving.\n\n#### Track and export serving endpoint health metrics to Prometheus and Datadog\n##### Requirements\n\n* Read access to the desired endpoint and a personal access token (PAT), which can be generated in **Settings** in the Databricks Machine Learning UI, to access the endpoint.\n* An existing [model serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) endpoint. 
You can validate this by checking the endpoint health with the following: \n```\ncurl -n -X GET -H \"Authorization: Bearer [PAT]\" https:\/\/[DATABRICKS_HOST]\/api\/2.0\/serving-endpoints\/[ENDPOINT_NAME]\n\n```\n* Validate the export metrics API: \n```\ncurl -n -X GET -H \"Authorization: Bearer [PAT]\" https:\/\/[DATABRICKS_HOST]\/api\/2.0\/serving-endpoints\/[ENDPOINT_NAME]\/metrics\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/metrics-export-serving-endpoint.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Track and export serving endpoint health metrics to Prometheus and Datadog\n##### Serving endpoint metrics definitions\n\n| Metric | Description |\n| --- | --- |\n| **Latency (ms)** | Captures the median (P50) and 99th percentile (P99) round-trip latency times within Databricks. This does not include additional Databricks-related latencies like authentication and rate limiting |\n| **Request rate (per second)** | Measures the number of requests processed per second. This rate is calculated by totaling the number of requests within a minute and then dividing by 60 (the number of seconds in a minute). |\n| **Request error rate (per second)** | Tracks the rate of 4xx and 5xx HTTP error responses per second. Similar to the request rate, it\u2019s computed by aggregating the total number of unsuccessful requests within a minute and dividing by 60. |\n| **CPU usage (%)** | Shows the average CPU utilization percentage across all server replicas. In the context of Databricks infrastructure, a replica refers to virtual machine nodes. Depending on your configured concurrency settings, Databricks creates multiple replicas to manage model traffic efficiently. |\n| **Memory usage (%)** | Shows the average memory utilization percentage across all server replicas. 
|\n| **Provisioned concurrency** | Provisioned concurrency is the maximum number of parallel requests that the system can handle. Provisioned concurrency dynamically adjusts within the minimum and maximum limits of the compute scale-out range, varying in response to incoming traffic. |\n| **GPU usage (%)** | Represents the average GPU utilization, as reported by the [NVIDIA DCGM](https:\/\/developer.nvidia.com\/dcgm) exporter. If the instance type has multiple GPUs, each is tracked separately (such as, `gpu0`, `gpu1`, \u2026, `gpuN`). The utilization is averaged across all server replicas and sampled once a minute. Note: The infrequent sampling means this metric is most accurate under a constant load. |\n| **GPU memory usage (%)** | Indicates the average percentage of utilized frame buffer memory on each GPU based on NVIDIA DCGM exporter data. As with GPU usage, this metric is averaged across replicas and sampled every minute. It\u2019s most reliable under consistent load conditions. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/metrics-export-serving-endpoint.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Track and export serving endpoint health metrics to Prometheus and Datadog\n##### Prometheus integration\n\nNote \nRegardless of which type of deployment you have in your production environment, the scraping configuration should be similar. \nThe guidance in this section follows the Prometheus documentation to start a Prometheus service locally using docker. \n1. Write a `yaml` config file and name it `prometheus.yml`. 
The following is an example: \n```\nglobal:\n  scrape_interval: 1m\n  scrape_timeout: 10s\nscrape_configs:\n  - job_name: \"prometheus\"\n    metrics_path: \"\/api\/2.0\/serving-endpoints\/[ENDPOINT_NAME]\/metrics\"\n    scheme: \"https\"\n    authorization:\n      type: \"Bearer\"\n      credentials: \"[PAT_TOKEN]\"\n\n    static_configs:\n      - targets: [\"dbc-741cfa95-12d1.dev.databricks.com\"]\n\n```\n2. Start Prometheus locally with the following command: \n```\ndocker run \\\n  -p 9090:9090 \\\n  -v \/path\/to\/prometheus.yml:\/etc\/prometheus\/prometheus.yml \\\n  prom\/prometheus\n\n```\n3. Navigate to `http:\/\/localhost:9090` to check if your local Prometheus service is up and running.\n4. Check the Prometheus scraper status and debug errors from: `http:\/\/localhost:9090\/targets?search=`\n5. Once the target is fully up and running, you can query the provided metrics, like `cpu_usage_percentage` or `mem_usage_percentage`, in the UI.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/metrics-export-serving-endpoint.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Track and export serving endpoint health metrics to Prometheus and Datadog\n##### Datadog integration\n\nNote \nThe preliminary setup for this example is based on the free edition. \n[Datadog](https:\/\/docs.datadoghq.com\/) has a variety of agents that can be deployed in different environments. For demonstration purposes, the following launches a macOS agent locally that scrapes the metrics endpoint in your Databricks host. The configuration for other agents should follow a similar pattern. \n1. Register a Datadog account.\n2. Install the OpenMetrics integration in your [account dashboard](https:\/\/app.datadoghq.com\/integrations), so Datadog can accept and process OpenMetrics data.\n3. Follow the [Datadog documentation](https:\/\/app.datadoghq.com\/account\/settings#agent\/mac) to get your Datadog agent up and running. 
For this example, use the DMG package option to have everything installed, including `launchctl` and `datadog-agent`.\n4. Locate your OpenMetrics configuration. For this example, the configuration is at `~\/.datadog-agent\/conf.d\/openmetrics.d\/conf.yaml.default`. The following is an example configuration `yaml` file. \n```\ninstances:\n  - openmetrics_endpoint: https:\/\/[DATABRICKS_HOST]\/api\/2.0\/serving-endpoints\/[ENDPOINT_NAME]\/metrics\n\n    metrics:\n      - cpu_usage_percentage:\n          name: cpu_usage_percentage\n          type: gauge\n      - mem_usage_percentage:\n          name: mem_usage_percentage\n          type: gauge\n      - provisioned_concurrent_requests_total:\n          name: provisioned_concurrent_requests_total\n          type: gauge\n      - request_4xx_count_total:\n          name: request_4xx_count_total\n          type: gauge\n      - request_5xx_count_total:\n          name: request_5xx_count_total\n          type: gauge\n      - request_count_total:\n          name: request_count_total\n          type: gauge\n      - request_latency_ms:\n          name: request_latency_ms\n          type: histogram\n\n    tag_by_endpoint: false\n\n    send_distribution_buckets: true\n\n    headers:\n      Authorization: Bearer [PAT]\n      Content-Type: application\/openmetrics-text\n\n```\n5. Start the Datadog agent using `launchctl start com.datadoghq.agent`.\n6. Every time you change your config, restart the agent to pick up the change. \n```\nlaunchctl stop com.datadoghq.agent\nlaunchctl start com.datadoghq.agent\n\n```\n7. Check the agent health with `datadog-agent health`.\n8. Check agent status with `datadog-agent status`. You should be able to see a response like the following. If not, debug with the error message. Potential issues may be due to an expired PAT or an incorrect URL. 
\n```\nopenmetrics (2.2.2)\n-------------------\nInstance ID: openmetrics: xxxxxxxxxxxxxxxx [OK]\nConfiguration Source: file:\/opt\/datadog-agent\/etc\/conf.d\/openmetrics.d\/conf.yaml.default\nTotal Runs: 1\nMetric Samples: Last Run: 2, Total: 2\nEvents: Last Run: 0, Total: 0\nService Checks: Last Run: 1, Total: 1\nAverage Execution Time : 274ms\nLast Execution Date : 2022-09-21 23:00:41 PDT \/ 2022-09-22 06:00:41 UTC (xxxxxxxx)\nLast Successful Execution Date : 2022-09-21 23:00:41 PDT \/ 2022-09-22 06:00:41 UTC (xxxxxxx)\n\n```\n9. Agent status can also be seen from the UI at <http:\/\/127.0.0.1:5002\/>. \nIf your agent is fully up and running, you can navigate back to your Datadog dashboard to query the metrics. You can also create a monitor or alert based on the metric data: <https:\/\/app.datadoghq.com\/monitors\/create\/metric>.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/metrics-export-serving-endpoint.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Create a monitor using the Databricks UI\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article demonstrates how to create a data monitor using the Databricks UI. You can also use [the API](https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-api.html). \nTo access the Databricks UI, do the following: \n1. In the workspace left sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) to open [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html).\n2. Navigate to the table you want to monitor.\n3. Click the **Quality** tab.\n4. Click the **Get started** button.\n5. 
In **Create monitor**, choose the options you want to set up the monitor.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-ui.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Create a monitor using the Databricks UI\n#### Profiling\n\nFrom the **Profile type** menu, select the type of monitor you want to create. The profile types are shown in the table. \n| Profile type | Description |\n| --- | --- |\n| Time series profile | A table containing values measured over time. This table includes a timestamp column. |\n| Inference profile | A table containing predicted values output by a machine learning classification or regression model. This table includes a timestamp, a model id, model inputs (features), a column containing model predictions, and optional columns containing unique observation IDs and ground truth labels. It can also contain metadata, such as demographic information, that is not used as input to the model but might be useful for fairness and bias investigations or other monitoring. |\n| Snapshot profile | Any Delta managed table, external table, view, materialized view, or streaming table. | \nIf you select `TimeSeries` or `Inference`, additional parameters are required and are described in the following sections. \nNote \n* When you first create a time series or inference profile, the monitor analyzes only data from the 30 days prior to its creation. After the monitor is created, all new data is processed.\n* Monitors defined on materialized views and streaming tables do not support incremental processing. \n### `TimeSeries` profile \nFor a `TimeSeries` profile, you must make the following selections: \n* Specify the **Metric granularities** that determine how to partition the data in windows across time.\n* Specify the **Timestamp column**, the column in the table that contains the timestamp. 
The timestamp column data type must be either `TIMESTAMP` or a type that can be converted to timestamps using the `to_timestamp` [PySpark function](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/reference\/pyspark.sql\/api\/pyspark.sql.functions.to_timestamp.html). \n### `Inference` profile \nFor an `Inference` profile, in addition to the granularities and the timestamp, you must make the following selections: \n* Select the **Problem type**, either classification or regression.\n* Specify the **Prediction column**, the column containing the model\u2019s predicted values.\n* Optionally specify the **Label column**, the column containing the ground truth for model predictions.\n* Specify the **Model ID column**, the column containing the ID of the model used for prediction.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-ui.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Create a monitor using the Databricks UI\n#### Schedule\n\nTo set up a monitor to run on a scheduled basis, select **Refresh on schedule** and select the frequency and time for the monitor to run. If you do not want the monitor to run automatically, select **Refresh manually**. If you select **Refresh manually**, you can later refresh the metrics from the **Quality** tab.\n\n### Create a monitor using the Databricks UI\n#### Notifications\n\nTo set up email notifications for a monitor, enter the email address to be notified and select the notifications to enable. 
Up to 5 emails are supported per notification event type.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-ui.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Create a monitor using the Databricks UI\n#### General\n\nIn the **General** section, you need to specify one required setting and some additional configuration options: \n* You must specify the Unity Catalog schema where the metric tables created by the monitor are stored. The location must be in the format {catalog}.{schema}. \nYou can also specify the following settings: \n* **Assets directory**. Enter the absolute path to the existing directory to store monitoring assets such as the generated dashboard. By default, assets are stored in the default directory: \u201c\/Users\/{user\\_name}\/databricks\\_lakehouse\\_monitoring\/{table\\_name}\u201d. If you enter a different location in this field, assets are created under \u201c\/{table\\_name}\u201d in the directory you specify. This directory can be anywhere in the workspace. For monitors intended to be shared within an organization, you can use a path in the \u201c\/Shared\/\u201d directory. \nThis field cannot be left blank.\n* **Unity Catalog baseline table name**. Name of a table or view that contains baseline data for comparison. For more information about baseline tables, see [Primary input table and baseline table](https:\/\/docs.databricks.com\/lakehouse-monitoring\/index.html#baseline-table).\n* **Metric slicing expressions**. Slicing expressions let you define subsets of the table to monitor in addition to the table as a whole. To create a slicing expression, click **Add expression** and enter the expression definition. For example the expression `\"col_2 > 10\"` generates two slices: one for `col_2 > 10` and one for `col_2 <= 10`. As another example, the expression `\"col_1\"` will generate one slice for each unique value in `col_1`. 
The data is grouped by each expression independently, resulting in a separate slice for each predicate and its complements.\n* **Custom metrics**. Custom metrics appear in the metric tables like any built-in metric. For details, see [Use custom metrics with Databricks Lakehouse Monitoring](https:\/\/docs.databricks.com\/lakehouse-monitoring\/custom-metrics.html).\nTo configure a custom metric, click **Add custom metric**.\n- Enter a **Name** for the custom metric.\n- Select the custom metric **Type**, one of `Aggregate`, `Derived`, or `Drift`. For definitions, see [Types of custom metrics](https:\/\/docs.databricks.com\/lakehouse-monitoring\/custom-metrics.html#custom-metric-types).\n- From the drop-down list in **Input columns**, select the columns to apply the metric to.\n- In the **Output type** field, select the Spark data type of the metric.\n- In the **Definition** field, enter SQL code that defines the custom metric.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-ui.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Create a monitor using the Databricks UI\n#### Edit monitor settings in the UI\n\nAfter you have created a monitor, you can make changes to the monitor\u2019s settings by clicking the **Edit monitor configuration** button on the **Quality** tab.\n\n### Create a monitor using the Databricks UI\n#### Refresh and view monitor results in the UI\n\nTo run the monitor manually, click **Refresh metrics**. \nFor information about the statistics that are stored in monitor metric tables, see [Monitor metric tables](https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-output.html). Metric tables are Unity Catalog tables. 
You can query them in notebooks or in the SQL query explorer, and view them in Catalog Explorer.\n\n### Create a monitor using the Databricks UI\n#### Control access to monitor outputs\n\nThe metric tables and dashboard created by a monitor are owned by the user who created the monitor. You can use Unity Catalog privileges to control access to metric tables. To share dashboards within a workspace, click the **Share** button on the upper-right side of the dashboard.\n\n### Create a monitor using the Databricks UI\n#### Delete a monitor from the UI\n\nTo delete a monitor from the UI, click the kebab menu next to the **Refresh metrics** button and select **Delete monitor**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-ui.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Collaborate using Databricks notebooks\n\nThis page describes how to give coworkers access to a notebook and how you can leave comments in a notebook. \nNote \nAccess control is available only in the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-collaborate.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Collaborate using Databricks notebooks\n##### Share a notebook\n\nTo share a notebook with a coworker, click ![Notebook header share button](https:\/\/docs.databricks.com\/_images\/nb-header-share.png) at the top of the notebook. The Sharing dialog opens, which you can use to select who to share the notebook with and what level of access they have. \nYou can also manage permissions in a fully automated setup using [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_permissions](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/permissions#notebook-usage). 
\n### Notebook permissions \nYou can assign five permission levels to notebooks: NO PERMISSIONS, CAN READ, CAN RUN, CAN EDIT, and CAN MANAGE. The table lists the abilities for each permission. \n| Ability | NO PERMISSIONS | CAN READ | CAN RUN | CAN EDIT | CAN MANAGE |\n| --- | --- | --- | --- | --- | --- |\n| View cells | | x | x | x | x |\n| Comment | | x | x | x | x |\n| Run via %run or notebook workflows | | x | x | x | x |\n| Attach and detach notebooks | | | x | x | x |\n| Run commands | | | x | x | x |\n| Edit cells | | | | x | x |\n| Modify permissions | | | | | x | \nWorkspace admins have the CAN MANAGE permission on all notebooks in their workspace. Users automatically have the CAN MANAGE permission for notebooks they create. \n### Manage notebook permissions with folders \nYou can manage notebook permissions by adding notebooks to folders. Notebooks in a folder inherit all permission settings of that folder. For example, a user who has CAN RUN permission on a folder has CAN RUN permission on the notebooks in that folder. To learn about configuring permissions on folders, see [Folder ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#folders). \nTo learn more about organizing notebooks into folders, see [Workspace browser](https:\/\/docs.databricks.com\/workspace\/workspace-browser\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-collaborate.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Collaborate using Databricks notebooks\n##### Command comments\n\nYou can have discussions with collaborators using command comments. \nTo toggle the Comments sidebar, click the **Comments** icon ![Toggle notebook comments](https:\/\/docs.databricks.com\/_images\/nb-header-comments.png) in the notebook\u2019s right sidebar. \nTo add a comment to a command: \n1. 
Highlight the command text and click the comment bubble: \n![Open comments](https:\/\/docs.databricks.com\/_images\/add-comment.png)\n2. Add your comment and click **Comment**. \n![Add comment](https:\/\/docs.databricks.com\/_images\/save-comment.png) \nTo edit, delete, or reply to a comment, click the comment and choose an action. \n![Edit comment](https:\/\/docs.databricks.com\/_images\/edit-comment.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-collaborate.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### IPython kernel\n\nThe [IPython kernel](https:\/\/ipython.readthedocs.io\/en\/stable\/index.html#ipython-documentation) is a Jupyter kernel for Python code execution. Jupyter, and other compatible notebooks, use the IPython kernel for executing Python notebook code. \nIn Databricks Runtime 11.3 LTS and above, Python notebooks use the IPython kernel to execute Python code. \nIn Databricks Runtime 11.3 LTS and above, you can pass input to ipykernel in Python notebooks. This allows you to use interactive tools such as the Python debugger in the notebook. For an example notebook that illustrates how to use the Python debugger, see [Debug in Python notebooks](https:\/\/docs.databricks.com\/languages\/python.html#python-debugger).\n\n#### IPython kernel\n##### Benefits of using the IPython kernel\n\nThe IPython kernel allows Databricks to add better support for open source tools built for Jupyter notebooks. Using the IPython kernel on Databricks adds support for IPython\u2019s display and output tooling. See [IPython.core.display](https:\/\/IPython.readthedocs.io\/en\/stable\/api\/generated\/IPython.display.html) for more information. 
Also, the IPython kernel captures the stdout and stderr outputs of child processes created by a notebook, allowing that output to be included in the notebook\u2019s command results.\n\n#### IPython kernel\n##### Known issue\n\nThe IPython command `update_display` only updates the outputs of the current cell.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/ipython-kernel.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Optimize stateful Structured Streaming queries\n##### Read Structured Streaming state information\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nIn Databricks Runtime 14.3 LTS and above, you can use DataFrame operations or SQL table-value functions to query Structured Streaming state data and metadata. You can use these functions to observe state information for Structured Streaming stateful queries, which can be useful for monitoring and debugging. \nYou must have read access to the checkpoint path for a streaming query in order to query state data or metadata. The functions described in this article provide read-only access to state data and metadata. You can only use batch read semantics to query state information. \nNote \nYou cannot query state information for Delta Live Tables pipelines, streaming tables, or materialized views.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/read-state.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Optimize stateful Structured Streaming queries\n##### Read Structured Streaming state information\n###### Read Structured Streaming state store\n\nYou can read state store information for Structured Streaming queries executed in any supported Databricks Runtime. 
Use the following syntax: \n```\ndf = (spark.read\n.format(\"statestore\")\n.load(\"\/checkpoint\/path\"))\n\n``` \n```\nSELECT * FROM read_statestore('\/checkpoint\/path')\n\n``` \nThe following optional configurations are supported: \n| Option | Type | Default value | Description |\n| --- | --- | --- | --- |\n| `batchId` | Long | latest batch ID | Represents the target batch to read from. Specify this option to query state information for an earlier state of the query. The batch must be committed but not yet cleaned up. |\n| `operatorId` | Long | 0 | Represents the target operator to read from. This option is used when the query is using multiple stateful operators. |\n| `storeName` | String | \u201cDEFAULT\u201d | Represents the target state store name to read from. This option is used when the stateful operator uses multiple state store instances. Either `storeName` or `joinSide` must be specified for a stream-stream join, but not both. |\n| `joinSide` | String (\u201cleft\u201d or \u201cright\u201d) | | Represents the target side to read from. This option is used when users want to read the state from a stream-stream join. | \nThe returned data has the following schema: \n| Column | Type | Description |\n| --- | --- | --- |\n| `key` | Struct (further type derived from the state key) | The key for a stateful operator record in the state checkpoint. |\n| `value` | Struct (further type derived from the state value) | The value for a stateful operator record in the state checkpoint. |\n| `partition_id` | Integer | The partition of the state checkpoint that contains the stateful operator record. 
|\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/read-state.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Optimize stateful Structured Streaming queries\n##### Read Structured Streaming state information\n###### Read Structured Streaming state metadata\n\nImportant \nYou must run streaming queries on Databricks Runtime 14.2 or above to record state metadata. State metadata files do not break backward compatibility. If you choose to run a streaming query on Databricks Runtime 14.1 or below, existing state metadata files are ignored and no new state metadata files are written. \nYou can read state metadata information for Structured Streaming queries run on Databricks Runtime 14.2 or above. Use the following syntax: \n```\ndf = (spark.read\n.format(\"state-metadata\")\n.load(\"<checkpointLocation>\"))\n\n``` \n```\nSELECT * FROM read_state_metadata('\/checkpoint\/path')\n\n``` \nThe returned data has the following schema: \n| Column | Type | Description |\n| --- | --- | --- |\n| `operatorId` | Integer | The integer ID of the stateful streaming operator. |\n| `operatorName` | String | Name of the stateful streaming operator. |\n| `stateStoreName` | String | Name of the state store of the operator. |\n| `numPartitions` | Integer | Number of partitions of the state store. |\n| `minBatchId` | Long | The minimum batch ID available for querying state. |\n| `maxBatchId` | Long | The maximum batch ID available for querying state. | \nNote \nThe batch ID values provided by `minBatchId` and `maxBatchId` reflect the state at the time the checkpoint was written. 
Old batches are cleaned up automatically with micro-batch execution, so the value provided here is not guaranteed to still be available.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/read-state.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Monitor metric tables\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis page describes the metric tables created by Databricks Lakehouse Monitoring. For information about the dashboard created by a monitor, see [Use the generated SQL dashboard](https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-dashboard.html). \nWhen a monitor runs on a Databricks table, it creates or updates two metric tables: a profile metrics table and a drift metrics table. \n* The profile metrics table contains summary statistics for each column and for each combination of time window, slice, and grouping columns. For `InferenceLog` analysis, the analysis table also contains model accuracy metrics.\n* The drift metrics table contains statistics that track changes in distribution for a metric. Drift tables can be used to visualize or alert on changes in the data instead of specific values. The following types of drift are computed: \n+ Consecutive drift compares a window to the previous time window. Consecutive drift is only calculated if a consecutive time window exists after aggregation according to the specified granularities.\n+ Baseline drift compares a window to the baseline distribution determined by the baseline table. 
Baseline drift is only calculated if a baseline table is provided.\n\n### Monitor metric tables\n#### Where metric tables are located\n\nMonitor metric tables are saved to `{output_schema}.{table_name}_profile_metrics` and `{output_schema}.{table_name}_drift_metrics`, where: \n* `{output_schema}` is the catalog and schema specified by `output_schema_name`.\n* `{table_name}` is the name of the table being monitored.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-output.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Monitor metric tables\n#### How monitor statistics are computed\n\nEach statistic and metric in the metric tables is computed for a specified time interval (called a \u201cwindow\u201d). For `Snapshot` analysis, the time window is a single point in time corresponding to the time the metric was refreshed. For `TimeSeries` and `InferenceLog` analysis, the time window is based on the granularities specified in `create_monitor` and the values in the `timestamp_col` specified in the `profile_type` argument. \nMetrics are always computed for the entire table. In addition, if you provide a slicing expression, metrics are computed for each data slice defined by a value of the expression. \nFor example: \n`slicing_exprs=[\"col_1\", \"col_2 > 10\"]` \ngenerates the following slices: one for `col_2 > 10`, one for `col_2 <= 10`, and one for each unique value in `col_1`. \nSlices are identified in the metrics tables by the column names `slice_key` and `slice_value`. In this example, one slice key would be \u201ccol\\_2 > 10\u201d and the corresponding values would be \u201ctrue\u201d and \u201cfalse\u201d. The entire table is equivalent to `slice_key` = NULL and `slice_value` = NULL. Slices are defined by a single slice key. \nMetrics are computed for all possible groups defined by the time windows and slice keys and values. In addition, for `InferenceLog` analysis, metrics are computed for each model id. 
For details, see [Column schemas for generated tables](https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-output.html#output_schema). \n### Additional statistics for model accuracy monitoring (`InferenceLog` analysis only) \nAdditional statistics are calculated for `InferenceLog` analysis. \n* Model quality is calculated if both `label_col` and `prediction_col` are provided.\n* Slices are automatically created based on the distinct values of `model_id_col`.\n* For classification models, [fairness and bias statistics](https:\/\/docs.databricks.com\/lakehouse-monitoring\/fairness-bias.html) are calculated for slices that have a Boolean value.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-output.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Monitor metric tables\n#### Query analysis and drift metrics tables\n\nYou can query the metrics tables directly. The following example is based on `InferenceLog` analysis: \n```\nSELECT\nwindow.start, column_name, count, num_nulls, distinct_count, frequent_items\nFROM census_monitor_db.adult_census_profile_metrics\nWHERE model_id = 1 -- Constrain to version 1\nAND slice_key IS NULL -- Look at aggregate metrics over the whole data\nAND column_name = \"income_predicted\"\nORDER BY window.start\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-output.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Monitor metric tables\n#### Column schemas for generated tables\n\nFor each column in the primary table, the metrics tables contain one row for each combination of grouping columns. The column associated with each row is shown in the column `column_name`. \nFor metrics based on more than one column such as model accuracy metrics, `column_name` is set to `:table`. 
\nFor profile metrics, the following grouping columns are used: \n* time window\n* granularity (`TimeSeries` and `InferenceLog` analysis only)\n* log type - input table or baseline table\n* slice key and value\n* model id (`InferenceLog` analysis only) \nFor drift metrics, the following additional grouping columns are used: \n* comparison time window\n* drift type (comparison to previous window or comparison to baseline table) \nThe schemas of the metric tables are shown below, and are also shown in the [Databricks Lakehouse Monitoring API reference documentation](https:\/\/api-docs.databricks.com\/python\/lakehouse-monitoring\/latest\/index.html). \n### Profile metrics table schema \nThe following table shows the schema of the profile metrics table. Where a metric is not applicable to a row, the corresponding cell is null. \n| Column name | Type | Description |\n| --- | --- | --- |\n| **Grouping columns** | | |\n| window | Struct. See [1] below. | Time window. |\n| granularity | string | Window duration, set by `granularities` parameter. [2] |\n| model\\_id\\_col | string | Optional. Only used for `InferenceLog` analysis type. |\n| log\\_type | string | Table used to calculate metrics. BASELINE or INPUT. |\n| slice\\_key | string | Slice expression. NULL for default, which is all data. |\n| slice\\_value | string | Value of the slicing expression. |\n| column\\_name | string | Name of column in primary table. `:table` is a special name for metrics that apply to the whole table, such as model accuracy. |\n| data\\_type | string | Spark data type of `column_name`. |\n| logging\\_table\\_commit\\_version | int | Ignore. |\n| monitor\\_version | bigint | Version of the monitor configuration used to calculate the metrics in the row. See [3] below for details. |\n| **Metrics columns - summary statistics** | | |\n| count | bigint | Number of non-null values. |\n| num\\_nulls | bigint | Number of null values in `column_name`. 
|\n| avg | double | Arithmetic mean of the column, ignoring nulls. |\n| quantiles | `array<double>` | Array of 1000 quantiles. See [4] below. |\n| distinct\_count | bigint | Number of distinct values in `column_name`. |\n| min | double | Minimum value in `column_name`. |\n| max | double | Maximum value in `column_name`. |\n| stddev | double | Standard deviation of `column_name`. |\n| num\_zeros | bigint | Number of zeros in `column_name`. |\n| num\_nan | bigint | Number of NaN values in `column_name`. |\n| min\_size | double | Minimum size of arrays or structures in `column_name`. |\n| max\_size | double | Maximum size of arrays or structures in `column_name`. |\n| avg\_size | double | Average size of arrays or structures in `column_name`. |\n| min\_len | double | Minimum length of string and binary values in `column_name`. |\n| max\_len | double | Maximum length of string and binary values in `column_name`. |\n| avg\_len | double | Average length of string and binary values in `column_name`. |\n| frequent\_items | Struct. See [1] below. | Top 100 most frequently occurring items. |\n| non\_null\_columns | `array<string>` | List of columns with at least one non-null value. |\n| median | double | Median value of `column_name`. |\n| percent\_null | double | Percent of null values in `column_name`. |\n| percent\_zeros | double | Percent of values that are zero in `column_name`. |\n| percent\_distinct | double | Percent of values that are distinct in `column_name`. |\n| **Metrics columns - classification model accuracy** [5] | | |\n| accuracy\_score | double | Accuracy of model, calculated as (number of correct predictions \/ total number of predictions), ignoring null values. |\n| confusion\_matrix | Struct. See [1] below. | |\n| precision | Struct. See [1] below. | |\n| recall | Struct. See [1] below. | |\n| f1\_score | Struct. See [1] below. 
| |\n| **Metrics columns - regression model accuracy** [5] | | |\n| mean\\_squared\\_error | double | Mean squared error between `prediction_col` and `label_col`. |\n| root\\_mean\\_squared\\_error | double | Root mean squared error between `prediction_col` and `label_col`. |\n| mean\\_average\\_error | double | Mean average error between `prediction_col` and `label_col`. |\n| mean\\_absolute\\_percentage\\_error | double | Mean absolute percentage error between `prediction_col` and `label_col`. |\n| r2\\_score | double | R-squared score between `prediction_col` and `label_col`. |\n| **Metrics columns - fairness and bias** [6] | | |\n| predictive\\_parity | double | Measures whether the two groups have equal precision across all predicted classes. `label_col` is required. |\n| predictive\\_equality | double | Measures whether the two groups have equal false positive rate across all predicted classes. `label_col` is required. |\n| equal\\_opportunity | double | Measures whether the two groups have equal recall across all predicted classes. `label_col` is required. |\n| statistical\\_parity | double | Measures whether the two groups have equal acceptance rate. Acceptance rate here is defined as the empirical probability to be predicted as a certain class, across all predicted classes. 
| \n[1] Struct formats for `window`, `frequent_items`, `confusion_matrix`, `precision`, `recall`, and `f1_score`: \n| Column name | Type |\n| --- | --- |\n| window | `struct<start: timestamp, end: timestamp>` |\n| frequent\_items | `array<struct<item: string, count: bigint>>` |\n| confusion\_matrix | `struct<prediction: string, label: string, count: bigint>` |\n| precision | `struct<one_vs_all: map<string,double>, macro: double, weighted: double>` |\n| recall | `struct<one_vs_all: map<string,double>, macro: double, weighted: double>` |\n| f1\_score | `struct<one_vs_all: map<string,double>, macro: double, weighted: double>` | \n[2] For time series or inference profiles, the monitor looks back 30 days from the time the monitor is created. Due to this cutoff, the first analysis might include a partial window. For example, the 30 day limit might fall in the middle of a week or month, in which case the full week or month is not included in the calculation. This issue affects only the first window. \n[3] The version shown in this column is the version that was used to calculate the statistics in the row and might not be the current version of the monitor. Each time you refresh the metrics, the monitor attempts to recompute previously calculated metrics using the current monitor configuration. The current monitor version appears in the monitor information returned by the API and Python Client. \n[4] Sample code to retrieve the 50th percentile: `SELECT element_at(quantiles, int((size(quantiles)+1)\/2)) AS p50 ...` or `SELECT quantiles[500] ...` . \n[5] Only shown if the monitor has `InferenceLog` analysis type and both `label_col` and `prediction_col` are provided. \n[6] Only shown if the monitor has `InferenceLog` analysis type and `problem_type` is `classification`. \n### Drift metrics table schema \nThe following table shows the schema of the drift metrics table. 
The drift table is only generated if a baseline table is provided, or if a consecutive time window exists after aggregation according to the specified granularities. \n| Column name | Type | Description |\n| --- | --- | --- |\n| **Grouping columns** | | |\n| window | `struct<start: timestamp, end: timestamp>` | Time window. |\n| window\\_cmp | `struct<start: timestamp, end: timestamp>` | Comparison window for drift\\_type `CONSECUTIVE`. |\n| drift\\_type | string | BASELINE or CONSECUTIVE. Whether the drift metrics compare to the previous time window or to the baseline table. |\n| granularity | string | Window duration, set by `granularities` parameter. [7] |\n| model\\_id\\_col | string | Optional. Only used for `InferenceLog` analysis type. |\n| slice\\_key | string | Slice expression. NULL for default, which is all data. |\n| slice\\_value | string | Value of the slicing expression. |\n| column\\_name | string | Name of column in primary table. `:table` is a special name for metrics that apply to the whole table, such as model accuracy. |\n| data\\_type | string | Spark data type of `column_name`. |\n| monitor\\_version | bigint | Version of the monitor configuration used to calculate the metrics in the row. See [8] below for details. |\n| **Metrics columns - drift** | | Differences are calculated as current window - comparison window. |\n| count\\_delta | double | Difference in `count`. |\n| avg\\_delta | double | Difference in `avg`. |\n| percent\\_null\\_delta | double | Difference in `percent_null`. |\n| percent\\_zeros\\_delta | double | Difference in `percent_zeros`. |\n| percent\\_distinct\\_delta | double | Difference in `percent_distinct`. |\n| non\\_null\\_columns\\_delta | `struct<added: int, missing: int>` | Number of columns with any increase or decrease in non-null values. |\n| chi\\_squared\\_test | `struct<statistic: double, pvalue: double>` | Chi-square test for drift in distribution. 
|\n| ks\\_test | `struct<statistic: double, pvalue: double>` | KS test for drift in distribution. Calculated for numeric columns only. |\n| tv\\_distance | double | Total variation distance for drift in distribution. |\n| l\\_infinity\\_distance | double | L-infinity distance for drift in distribution. |\n| js\\_distance | double | Jensen\u2013Shannon distance for drift in distribution. Calculated for categorical columns only. |\n| wasserstein\\_distance | double | Drift between two numeric distributions using the Wasserstein distance metric. |\n| population\\_stability\\_index | double | Metric for comparing the drift between two numeric distributions using the population stability index metric. See [9] below for details. | \n[7] For time series or inference profiles, the monitor looks back 30 days from the time the monitor is created. Due to this cutoff, the first analysis might include a partial window. For example, the 30 day limit might fall in the middle of a week or month, in which case the full week or month is not included in the calculation. This issue affects only the first window. \n[8] The version shown in this column is the version that was used to calculate the statistics in the row and might not be the current version of the monitor. Each time you refresh the metrics, the monitor attempts to recompute previously calculated metrics using the current monitor configuration. The current monitor version appears in the monitor information returned by the API and Python Client. \n[9] The output of the population stability index is a numeric value that represents how different two distributions are. The range is [0, inf). PSI < 0.1 means no significant population change. PSI < 0.2 indicates moderate population change. 
PSI >= 0.2 indicates significant population change.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-output.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Load data using the add data UI\n#### Create or modify a table using file upload\n\nThe **Create or modify a table using file upload** page allows you to upload CSV, TSV, JSON, Avro, Parquet, or text files to create or overwrite a managed Delta Lake table. \nYou can create managed Delta tables in Unity Catalog or in the Hive metastore. \nNote \nYou can also load files from cloud storage [using the add data UI](https:\/\/docs.databricks.com\/ingestion\/add-data\/add-data-external-locations.html) or [using COPY INTO](https:\/\/docs.databricks.com\/ingestion\/copy-into\/tutorial-dbsql.html). \nImportant \n* You must have access to a running compute resource and permissions to create tables in a target schema.\n* Workspace admins can [disable the Create or modify a table using file upload page](https:\/\/docs.databricks.com\/admin\/workspace-settings\/disable-upload-data-ui.html). \nYou can use the UI to create a Delta table by importing small CSV, TSV, JSON, Avro, Parquet, or text files from your local machine. \n* The **Create or modify a table using file upload** page supports uploading up to 10 files at a time.\n* The total size of uploaded files must be under 2 gigabytes.\n* The file must be a CSV, TSV, JSON, Avro, Parquet, or text file and have the extension \u201c.csv\u201d, \u201c.tsv\u201d (or \u201c.tab\u201d), \u201c.json\u201d, \u201c.avro\u201d, \u201c.parquet\u201d, or \u201c.txt\u201d.\n* Compressed files such as `zip` and `tar` files are not supported.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-data.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Load data using the add data UI\n#### Create or modify a table using file upload\n##### Upload the file\n\n1. 
Click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New > Add data**.\n2. Click **Create or modify a table**.\n3. Click the file browser button or drag and drop files directly on the drop zone. \nNote \nImported files are uploaded to a secure internal location within your account which is garbage collected daily.\n\n#### Create or modify a table using file upload\n##### Preview, configure, and create a table\n\nYou can upload data to the staging area without connecting to compute resources, but you must select an active compute resource to preview and configure your table. \nYou can preview 50 rows of your data when you configure the options for the uploaded table. Click the grid or list buttons under the file name to switch the presentation of your data. \nDatabricks stores data files for managed tables in the locations configured for the containing schema. You need proper permissions to create a table in a schema. \nSelect the desired schema in which to create a table by doing the following: \n1. (For Unity Catalog-enabled workspaces only) You can select a catalog or the legacy `hive_metastore`.\n2. Select a schema.\n3. (Optional) Edit the table name. \nNote \nYou can use the dropdown to select **Overwrite existing table** or **Create new table**. Operations that attempt to create new tables with name conflicts display an error message. \nYou can configure [options](https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-data.html#options) or [columns](https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-data.html#columns) before you create the table. \nTo create the table, click **Create** at the bottom of the page.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-data.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Load data using the add data UI\n#### Create or modify a table using file upload\n##### Format options\n\nFormat options depend on the file format you upload. 
Common format options appear in the header bar, while less commonly used options are available on the **Advanced attributes** dialog. \n* For CSV, the following options are available: \n+ **First row contains the header** (enabled by default): This option specifies whether the CSV\/TSV file contains a header.\n+ **Column delimiter**: The separator character between columns. Only a single character is allowed, and backslash is not supported. This defaults to comma for CSV files.\n+ **Automatically detect column types** (enabled by default): Automatically detect column types from file content. You can edit types in the preview table. If this is set to false, all column types are inferred as `STRING`.\n+ **Rows span multiple lines** (disabled by default): Whether a column\u2019s value can span multiple lines in the file.\n+ **Merge the schema across multiple files**: Whether to infer the schema across multiple files and to merge the schema of each file. If disabled, the schema from one file is used.\n* For JSON, the following options are available: \n+ **Automatically detect column types** (enabled by default): Automatically detect column types from file content. You can edit types in the preview table. If this is set to false, all column types are inferred as `STRING`.\n+ **Rows span multiple lines** (enabled by default): Whether a column\u2019s value can span multiple lines in the file.\n+ **Allow comments** (enabled by default): Whether comments are allowed in the file.\n+ **Allow single quotes** (enabled by default): Whether single quotes are allowed in the file.\n+ **Infer timestamp** (enabled by default): Whether to try to infer timestamp strings as `TimestampType`.\n* For JSON, the following options are available: \n+ **Automatically detect column types** (enabled by default): Automatically detect column types from file content. You can edit types in the preview table. 
If this is set to false, all column types are inferred as `STRING`.\n+ **Rows span multiple lines** (disabled by default): Whether a column\u2019s value can span multiple lines in the file.\n+ **Allow comments**: Whether comments are allowed in the file.\n+ **Allow single quotes**: Whether single quotes are allowed in the file.\n+ **Infer timestamp**: Whether to try to infer timestamp strings as `TimestampType`. \nThe data preview updates automatically when you edit format options. \nNote \nWhen you upload multiple files, the following rules apply: \n* Header settings apply to all files. Make sure headers are consistently absent or present in all uploaded files to avoid data loss.\n* Uploaded files are combined by appending all data as rows in the target table. Joining or merging records during file upload is not supported.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-data.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Load data using the add data UI\n#### Create or modify a table using file upload\n##### Column names and types\n\nYou can edit column names and types. \n* To edit types, click the icon with the type. \nNote \nYou can\u2019t edit nested types for `STRUCT` or `ARRAY`.\n* To edit the column name, click the input box at the top of the column. \nColumn names do not support commas, backslashes, or Unicode characters (such as emojis). \nColumn data types are inferred by default for CSV and JSON files. You can interpret all columns as `STRING` type by disabling **Advanced attributes** > **Automatically detect column types**. \nNote \n* Schema inference makes a best-effort attempt to detect column types. Changing column types can lead to some values being cast to `NULL` if the value cannot be cast correctly to the target data type. Casting `BIGINT` to `DATE` or `TIMESTAMP` columns is not supported. 
Databricks recommends that you create a table first and then transform these columns using SQL functions afterwards.\n* To support table column names with special characters, the **Create or modify a table using file upload** page leverages [Column Mapping](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html).\n* To add comments to columns, create the table and navigate to [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html) where you can add comments. \n### Supported data types \nThe **Create or modify a table using file upload** page supports the following data types. For more information about individual data types see [SQL data types](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-datatypes.html). \n| Data Type | Description |\n| --- | --- |\n| `BIGINT` | 8-byte signed integer numbers. |\n| `BOOLEAN` | Boolean (`true`, `false`) values. |\n| `DATE` | Values comprising values of fields year, month, and day, without a time-zone. |\n| `DOUBLE` | 8-byte double-precision floating point numbers. |\n| `STRING` | Character string values. |\n| `TIMESTAMP` | Values comprising values of fields year, month, day, hour, minute, and second, with the session local timezone. |\n| `STRUCT` | Values with the structure described by a sequence of fields. |\n| `ARRAY` | Values comprising a sequence of elements with the type `elementType`. |\n| `DECIMAL(P,S)` | Numbers with maximum precision `P` and fixed scale `S`. 
|\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-data.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Load data using the add data UI\n#### Create or modify a table using file upload\n##### Known issues\n\nCasting `BIGINT` to non-castable types like `DATE`, such as dates in the format of \u2018yyyy\u2019, may trigger errors.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-data.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Optimize stateful Structured Streaming queries\n\nManaging the intermediate state information of stateful Structured Streaming queries can help prevent unexpected latency and production problems. \nDatabricks recommends: \n* Use compute-optimized instances as workers.\n* Set the number of shuffle partitions to 1-2 times the number of cores in the cluster.\n* Set the `spark.sql.streaming.noDataMicroBatches.enabled` configuration to `false` in the SparkSession. This prevents the streaming micro-batch engine from processing micro-batches that do not contain data. Note that with this configuration set to `false`, stateful operations that use watermarks or processing-time timeouts might not emit output until new data arrives, rather than emitting it immediately. \nDatabricks recommends using RocksDB with changelog checkpointing to manage the state for stateful streams. See [Configure RocksDB state store on Databricks](https:\/\/docs.databricks.com\/structured-streaming\/rocksdb-state-store.html). \nNote \nThe state management scheme cannot be changed between query restarts. 
That is, if a query has been started with the default management, then it cannot be changed without starting the query from scratch with a new checkpoint location.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/stateful-streaming.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Optimize stateful Structured Streaming queries\n##### Work with multiple stateful operators in Structured Streaming\n\nIn Databricks Runtime 13.3 LTS and above, Databricks offers advanced support for stateful operators in Structured Streaming workloads. You can now chain multiple stateful operators together, meaning that you can feed the output of an operation such as a windowed aggregation to another stateful operation such as a join. \nThe following examples demonstrate several patterns you can use. \nImportant \nThe following limitations exist when working with multiple stateful operators: \n* `flatMapGroupsWithState` is not supported.\n* Only the append output mode is supported. \n### Chained time window aggregation \n```\nfrom pyspark.sql.functions import window, window_time\n\nwords = ... # streaming DataFrame of schema { timestamp: Timestamp, word: String }\n\n# Group the data by window and word and compute the count of each group\nwindowedCounts = words.groupBy(\nwindow(words.timestamp, \"10 minutes\", \"5 minutes\"),\nwords.word\n).count()\n\n# Group the windowed data by another window and word and compute the count of each group\nanotherWindowedCounts = windowedCounts.groupBy(\nwindow(window_time(windowedCounts.window), \"1 hour\"),\nwindowedCounts.word\n).count()\n\n``` \n```\nimport spark.implicits._\n\nval words = ... 
\/\/ streaming DataFrame of schema { timestamp: Timestamp, word: String }\n\n\/\/ Group the data by window and word and compute the count of each group\nval windowedCounts = words.groupBy(\nwindow($\"timestamp\", \"10 minutes\", \"5 minutes\"),\n$\"word\"\n).count()\n\n\/\/ Group the windowed data by another window and word and compute the count of each group\nval anotherWindowedCounts = windowedCounts.groupBy(\nwindow($\"window\", \"1 hour\"),\n$\"word\"\n).count()\n\n``` \n### Time window aggregation in two different streams followed by stream-stream window join \n```\nclicksWindow = clicksWithWatermark.groupBy(\nclicksWithWatermark.clickAdId,\nwindow(clicksWithWatermark.clickTime, \"1 hour\")\n).count()\n\nimpressionsWindow = impressionsWithWatermark.groupBy(\nimpressionsWithWatermark.impressionAdId,\nwindow(impressionsWithWatermark.impressionTime, \"1 hour\")\n).count()\n\nclicksWindow.join(impressionsWindow, \"window\", \"inner\")\n\n``` \n```\nval clicksWindow = clicksWithWatermark\n.groupBy(window($\"clickTime\", \"1 hour\"))\n.count()\n\nval impressionsWindow = impressionsWithWatermark\n.groupBy(window($\"impressionTime\", \"1 hour\"))\n.count()\n\nclicksWindow.join(impressionsWindow, \"window\", \"inner\")\n\n``` \n### Stream-stream time interval join followed by time window aggregation \n```\njoined = impressionsWithWatermark.join(\nclicksWithWatermark,\nexpr(\"\"\"\nclickAdId = impressionAdId AND\nclickTime >= impressionTime AND\nclickTime <= impressionTime + interval 1 hour\n\"\"\"),\n\"leftOuter\" # can be \"inner\", \"leftOuter\", \"rightOuter\", \"fullOuter\", \"leftSemi\"\n)\n\njoined.groupBy(\njoined.clickAdId,\nwindow(joined.clickTime, \"1 hour\")\n).count()\n\n``` \n```\nval joined = impressionsWithWatermark.join(\nclicksWithWatermark,\nexpr(\"\"\"\nclickAdId = impressionAdId AND\nclickTime >= impressionTime AND\nclickTime <= impressionTime + interval 1 hour\n\"\"\"),\njoinType = \"leftOuter\" \/\/ can be \"inner\", \"leftOuter\", \"rightOuter\", 
\"fullOuter\", \"leftSemi\"\n)\n\njoined\n.groupBy($\"clickAdId\", window($\"clickTime\", \"1 hour\"))\n.count()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/stateful-streaming.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Optimize stateful Structured Streaming queries\n##### State rebalancing for Structured Streaming\n\nState rebalancing is enabled by default for all streaming workloads in Delta Live Tables. In Databricks Runtime 11.3 LTS and above, you can set the following configuration option in the Spark cluster configuration to enable state rebalancing: \n```\nspark.sql.streaming.statefulOperator.stateRebalancing.enabled true\n\n``` \nState rebalancing benefits stateful Structured Streaming pipelines that undergo cluster resizing events. Stateless streaming operations do not benefit, regardless of changing cluster sizes. \nNote \nCompute autoscaling has limitations when scaling down cluster size for Structured Streaming workloads. Databricks recommends using Delta Live Tables with Enhanced Autoscaling for streaming workloads. See [Optimize the cluster utilization of Delta Live Tables pipelines with Enhanced Autoscaling](https:\/\/docs.databricks.com\/delta-live-tables\/auto-scaling.html). \nCluster resizing events trigger state rebalancing. During rebalancing events, micro-batches might have higher latency as the state loads from cloud storage to the new executors.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/stateful-streaming.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Optimize stateful Structured Streaming queries\n##### Specify initial state for `mapGroupsWithState`\n\nYou can specify a user-defined initial state for Structured Streaming stateful processing using `flatMapGroupsWithState` or `mapGroupsWithState`. This allows you to avoid reprocessing data when starting a stateful stream without a valid checkpoint. 
\n```\ndef mapGroupsWithState[S: Encoder, U: Encoder](\ntimeoutConf: GroupStateTimeout,\ninitialState: KeyValueGroupedDataset[K, S])(\nfunc: (K, Iterator[V], GroupState[S]) => U): Dataset[U]\n\ndef flatMapGroupsWithState[S: Encoder, U: Encoder](\noutputMode: OutputMode,\ntimeoutConf: GroupStateTimeout,\ninitialState: KeyValueGroupedDataset[K, S])(\nfunc: (K, Iterator[V], GroupState[S]) => Iterator[U]): Dataset[U]\n\n``` \nExample use case that specifies an initial state to the `flatMapGroupsWithState` operator: \n```\n\/\/ Add the number of new values for the key to the running count kept in state\nval fruitCountFunc = (key: String, values: Iterator[String], state: GroupState[RunningCount]) => {\nval count = state.getOption.map(_.count).getOrElse(0L) + values.size\nstate.update(new RunningCount(count))\nIterator((key, count.toString))\n}\n\nval fruitCountInitialDS: Dataset[(String, RunningCount)] = Seq(\n(\"apple\", new RunningCount(1)),\n(\"orange\", new RunningCount(2)),\n(\"mango\", new RunningCount(5)),\n).toDS()\n\n\/\/ Key the initial state by fruit name\nval fruitCountInitial = fruitCountInitialDS.groupByKey(x => x._1).mapValues(_._2)\n\nfruitStream\n.groupByKey(x => x)\n.flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.NoTimeout, fruitCountInitial)(fruitCountFunc)\n\n``` \nExample use case that specifies an initial state to the `mapGroupsWithState` operator: \n```\n\/\/ Add the number of new values for the key to the running count kept in state\nval fruitCountFunc = (key: String, values: Iterator[String], state: GroupState[RunningCount]) => {\nval count = state.getOption.map(_.count).getOrElse(0L) + values.size\nstate.update(new RunningCount(count))\n(key, count.toString)\n}\n\nval fruitCountInitialDS: Dataset[(String, RunningCount)] = Seq(\n(\"apple\", new RunningCount(1)),\n(\"orange\", new RunningCount(2)),\n(\"mango\", new RunningCount(5)),\n).toDS()\n\n\/\/ Key the initial state by fruit name\nval fruitCountInitial = fruitCountInitialDS.groupByKey(x => x._1).mapValues(_._2)\n\nfruitStream\n.groupByKey(x => x)\n.mapGroupsWithState(GroupStateTimeout.NoTimeout, fruitCountInitial)(fruitCountFunc)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/stateful-streaming.html"} +{"content":"# Databricks 
data engineering\n## Streaming on Databricks\n#### Optimize stateful Structured Streaming queries\n##### Test the `mapGroupsWithState` update function\n\nThe `TestGroupState` API enables you to test the state update function used for `Dataset.groupByKey(...).mapGroupsWithState(...)` and `Dataset.groupByKey(...).flatMapGroupsWithState(...)`. \nThe state update function takes the previous state as input using an object of type `GroupState`. See the Apache Spark [GroupState reference documentation](https:\/\/spark.apache.org\/docs\/latest\/api\/java\/org\/apache\/spark\/sql\/streaming\/GroupState.html). For example: \n```\nimport org.apache.spark.sql.streaming._\nimport org.apache.spark.api.java.Optional\n\ntest(\"flatMapGroupsWithState's state update function\") {\nvar prevState = TestGroupState.create[UserStatus](\noptionalState = Optional.empty[UserStatus],\ntimeoutConf = GroupStateTimeout.EventTimeTimeout,\nbatchProcessingTimeMs = 1L,\neventTimeWatermarkMs = Optional.of(1L),\nhasTimedOut = false)\n\nval userId: String = ...\nval actions: Iterator[UserAction] = ...\n\nassert(!prevState.hasUpdated)\n\nupdateState(userId, actions, prevState)\n\nassert(prevState.hasUpdated)\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/stateful-streaming.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query parameters\n\nA query parameter lets you substitute values into a query at runtime. Any string between double curly braces `{{ }}` is treated as a query parameter. A widget appears above the results pane where you set the parameter value. Query parameters are more flexible than query filters and should only be used in cases where query filters are not sufficient.\n\n#### Query parameters\n##### Add a query parameter\n\n1. Type `Cmd + I`. The parameter is inserted at the text caret and the **Add Parameter** dialog appears. 
\n* **Keyword**: The keyword that represents the parameter in the query.\n* **Title**: The title that appears over the widget. By default, the title is the same as the keyword.\n* **Type**: Supported types are Text, Number, Date, Date and Time, Date and Time (with Seconds), Dropdown List, and Query Based Dropdown List. The default is Text.\n2. Enter the keyword, optionally override the title, and select the parameter type.\n3. Click **Add Parameter**.\n4. In the parameter widget, set the parameter value.\n5. Click **Apply Changes**.\n6. Click **Save**. \nAlternatively, type double curly braces `{{ }}` and click the gear icon near the parameter widget to edit the settings. \nTo re-run the query with a different parameter value, enter the value in the widget and click **Apply Changes**.\n\n#### Query parameters\n##### Edit a query parameter\n\nTo edit a parameter, click the gear icon beside the parameter widget. To prevent users who don\u2019t own the query from changing the parameter, click **Show Results Only**. The **`<Keyword>`** parameter dialog appears.\n\n#### Query parameters\n##### Remove a query parameter\n\nTo remove a query parameter, delete the parameter from your query. 
The parameter widget disappears, and you can rewrite your query using static values.\n\n#### Query parameters\n##### Change the order of parameters\n\nTo change the order in which parameters are shown, you can click and drag each parameter to the desired position.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query parameters\n##### Query parameter types\n\n* [Text](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#text)\n* [Number](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#number)\n* [Dropdown List](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#dropdown-list)\n* [Query-Based Dropdown List](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#query-based-dropdown-list)\n* [Date and Time](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#date-and-time) \n### [Text](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#id2) \nTakes a string as input. Backslashes, single quotation marks, and double quotation marks are escaped, and Databricks adds quotation marks to this parameter. For example, a string like `mr's Li\"s` is transformed to `'mr\\'s Li\\\"s'`. For example: \n```\nSELECT * FROM users WHERE name={{ text_param }}\n\n``` \n### [Number](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#id3) \nTakes a number as its input. For example: \n```\nSELECT * FROM users WHERE age={{ number_param }}\n\n``` \n### [Dropdown List](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#id4) \nTo restrict the scope of possible parameter values when running a query, use the **Dropdown List** parameter type. An example would be `SELECT * FROM users WHERE name='{{ dropdown_param }}'`. 
When selected from the parameter settings panel, a text box appears where you enter your allowed values, each value separated by a new line. Dropdown lists are text parameters. To use dates or dates and times in your Dropdown List, enter them in the format your data source requires. The strings are not escaped. You can choose between a single-value or multi-value dropdown. \n* **Single value**: Single quotation marks around the parameter are required.\n* **Multi-value**: Toggle the **Allow multiple values** option. In the **Quotation** drop-down, choose whether to leave the parameters as entered (no quotation marks) or wrap the parameters with single or double quotation marks. You don\u2019t need to add quotation marks around the parameter if you choose quotation marks. \nChange your `WHERE` clause to use the `IN` keyword in your query. \n```\nSELECT ...\nFROM ...\nWHERE field IN ( {{ Multi Select Parameter }} )\n\n``` \nThe parameter multi-selection widget lets you pass multiple values to the database. If you select the **Double Quotation Mark** option for the **Quotation** parameter, your query reflects the following format: `WHERE IN (\"value1\", \"value2\", \"value3\")` \n### [Query-Based Dropdown List](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#id5) \nTakes the result of a query as its input. It has the same behavior as the **Dropdown List** parameter. You must save the Databricks SQL dropdown list query to use it as an input in another query. \n1. Click **Query Based Dropdown list** under **Type** in the settings panel.\n2. Click the **Query** field and select a query. If your target query returns a large number of records, the performance will degrade. \nIf your target query returns more than one column, Databricks SQL uses the *first* one. If your target query returns `name` and `value` columns, Databricks SQL populates the parameter selection widget with the `name` column but executes the query with the associated `value`. 
\nFor example, suppose the following query returns the data in the table. \n```\nSELECT user_uuid AS 'value', username AS 'name'\nFROM users\n\n``` \n| value | name |\n| --- | --- |\n| 1001 | John Smith |\n| 1002 | Jane Doe |\n| 1003 | Bobby Tables | \nWhen Databricks runs the query, the value passed to the database would be 1001, 1002, or 1003. \n### [Date and Time](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#id6) \nDatabricks has several options to parameterize date and timestamp values, including options to simplify the parameterization of time ranges. Select from three options of varying precision: \n| Option | Precision | Type |\n| --- | --- | --- |\n| **Date** | day | `DATE` |\n| **Date and Time** | minute | `TIMESTAMP` |\n| **Date and Time (with seconds)** | second | `TIMESTAMP` | \nWhen choosing a **Range** parameter option, you create two parameters designated by `.start` and `.end` suffixes. All options pass parameters to your query as string literals; Databricks requires that you wrap date and time values in single quotation marks (`'`). For example: \n```\n-- Date parameter\nSELECT *\nFROM usage_logs\nWHERE date = '{{ date_param }}'\n\n-- Date and Time Range parameter\nSELECT *\nFROM usage_logs\nWHERE modified_time > '{{ date_range.start }}' and modified_time < '{{ date_range.end }}'\n\n``` \nDate parameters use a calendar-picking interface and default to the current date and time. \nNote \nThe Date Range parameter only returns correct results for columns of `DATE` type. For `TIMESTAMP` columns, use one of the Date and Time Range options. \n#### Dynamic date and date range values \nWhen you add a date or date range parameter to your query, the selection widget shows a blue lightning bolt icon. Click it to display dynamic values like `today`, `yesterday`, `this week`, `last week`, `last month`, or `last year`. These values update dynamically. 
\nImportant \nDynamic dates and date ranges aren\u2019t compatible with scheduled queries.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query parameters\n##### Using query parameters in dashboards\n\nOptionally, queries can use parameters or static values. When a visualization based on a parameterized query is added to a dashboard, the visualization can be configured to use one of the following: \n* Widget parameter \nWidget parameters are specific to a single visualization in a dashboard, appear in the visualization panel, and the parameter values specified apply only to the query underlying the visualization.\n* Dashboard parameter \nDashboard parameters can apply to multiple visualizations. When you add a visualization based on a parameterized query to a dashboard, the parameter will be added as a dashboard parameter by default. Dashboard parameters are configured for one or more visualizations in a dashboard and appear at the top of the dashboard. The parameter values specified for a dashboard parameter apply to visualizations reusing that particular dashboard parameter. A dashboard can have multiple parameters, each of which can apply to some visualizations and not others.\n* Static value \nStatic values are used in place of a parameter that responds to changes. Static values allow you to hard-code a value in place of a parameter and will make the parameter \u201cdisappear\u201d from the dashboard or widget where it previously appeared. \nWhen you add a visualization containing a parameterized query, you can choose the title and the source for the parameter in the visualization query by clicking the appropriate pencil icon. You can also select the keyword and a default value. See [Parameter properties](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#parameter-properties). 
\nAfter adding a visualization to a dashboard, access the parameter mapping interface by clicking the vertical ellipsis on the upper-right of a dashboard widget and then clicking **Change widget settings**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query parameters\n##### Parameter properties\n\n* **Title**: The display name that appears beside the value selector on your dashboard. It defaults to the parameter **Keyword**. To edit it, click the pencil icon ![Pencil Icon](https:\/\/docs.databricks.com\/_images\/pencil-icon.png). Titles are not displayed for static dashboard parameters because the value selector is hidden. If you select **Static value** as your **Value Source**, the **Title** field is grayed out.\n* **Keyword**: The string literal for this parameter in the underlying query. This is useful for debugging if your dashboard does not return the expected results.\n* **Default Value**: The value used if no other value is specified. To change this from the query screen, run the query with your desired parameter value and click the **Save** button.\n* **Value Source**: The source of the parameter value. Click the pencil icon ![Pencil Icon](https:\/\/docs.databricks.com\/_images\/pencil-icon.png) to choose a source. \n+ **New dashboard parameter**: Create a new dashboard-level parameter. This lets you set a parameter value in one place on your dashboard and map it to multiple visualizations.\n+ **Existing dashboard parameter**: Map parameter to an existing dashboard parameter. You must specify which pre-existing dashboard parameter.\n+ **Widget parameter**: Displays a value selector inside your dashboard widget. This is useful for one-off parameters that are not shared between widgets.\n+ **Static value**: Choose a static value for the widget, regardless of the values used on other widgets. 
Statically mapped parameter values do not display a value selector anywhere on the dashboard, which is more compact. This lets you take advantage of the flexibility of query parameters without cluttering the user interface on a dashboard when certain parameters are not expected to change frequently.\n![Change parameter mapping](https:\/\/docs.databricks.com\/_images\/dashboard_parameter_mapping_change.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query parameters\n##### Frequently Asked Questions (FAQ)\n\n* [Can I reuse the same parameter multiple times in a single query?](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#can-i-reuse-the-same-parameter-multiple-times-in-a-single-query)\n* [Can I use multiple parameters in a single query?](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#can-i-use-multiple-parameters-in-a-single-query) \n### [Can I reuse the same parameter multiple times in a single query?](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#id7) \nYes. Use the same identifier in the curly brackets. This example uses the `{{org_id}}` parameter twice. \n```\nSELECT {{org_id}}, count(0)\nFROM queries\nWHERE org_id = {{org_id}}\n\n``` \n### [Can I use multiple parameters in a single query?](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#id8) \nYes. Use a unique name for each parameter. This example uses two parameters: `{{org_id}}` and `{{start_date}}`. 
\n```\nSELECT count(0)\nFROM queries\nWHERE org_id = {{org_id}} AND created_at > '{{start_date}}'\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Organize workspace objects into folders\n\nThis article explains how to use folders to organize your workspace objects.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/workspace-objects.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Organize workspace objects into folders\n##### Folders\n\nFolders contain all static assets within a workspace: notebooks, libraries, files (in Databricks Runtime 11.3 LTS and above), experiments, and other folders. Icons indicate the type of the object contained in a folder. Click a folder name to open or close the folder and view its contents. \n![Open folder](https:\/\/docs.databricks.com\/_images\/folder-open.png) \nTo perform an action on a folder, click the ![Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret.png) at the right side of a folder and select a menu item. \n![Folder menu](https:\/\/docs.databricks.com\/_images\/folder-menu.png) \n### Special folders \nA Databricks workspace has three special folders: Workspace, Shared, and Users. You cannot rename or move a special folder. \n#### Workspace root folder \nTo navigate to the Workspace root folder: \n1. In the sidebar, click **Workspace**.\n2. Click the ![Scroll Left Icon](https:\/\/docs.databricks.com\/_images\/scroll-left-icon.png) icon. \nThe Workspace root folder is a container for all of your organization\u2019s Databricks static assets. \n![Workspace Root](https:\/\/docs.databricks.com\/_images\/workspace-folder.png) \nWithin the Workspace root folder: \n* ![Shared Icon](https:\/\/docs.databricks.com\/_images\/shared-icon.png) **Shared** is for sharing objects across your organization. 
All users have full permissions for all objects in Shared.\n* ![Users Icon](https:\/\/docs.databricks.com\/_images\/users-icon.png) **Users** contains a folder for each user. \nBy default, the Workspace root folder and all of its contained objects are *available to all users*. You can control who can manage and access objects by setting [permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html). \nTo sort all objects alphabetically or by type across all folders, click the ![Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret.png) to the right of the Workspace folder and select **Sort > [Alphabetical | Type]**: \n![Workspace Sort](https:\/\/docs.databricks.com\/_images\/workspace-sort.png) \n#### User home folders \nEach user has a *home folder* for their notebooks and libraries: \n![Workspace label](https:\/\/docs.databricks.com\/_images\/workspace.png) > ![Home Folder](https:\/\/docs.databricks.com\/_images\/home-folder.png) \nBy default objects in this folder are private to that user. \nNote \nWhen you [remove a user](https:\/\/docs.databricks.com\/admin\/users-groups\/users.html#remove-user) from a workspace, the user\u2019s home folder is retained. 
If you re-add a user to the workspace, their home folder is restored.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/workspace-objects.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Organize workspace objects into folders\n##### Workspace object operations\n\nThe objects stored in the Workspace root folder are [folders](https:\/\/docs.databricks.com\/workspace\/workspace-objects.html#folders), [notebooks](https:\/\/docs.databricks.com\/workspace\/workspace-assets.html#ws-notebooks), [files](https:\/\/docs.databricks.com\/files\/index.html) (in Databricks Runtime 11.3 LTS and above), [libraries](https:\/\/docs.databricks.com\/workspace\/workspace-assets.html#ws-libraries), and [experiments](https:\/\/docs.databricks.com\/mlflow\/experiments.html). To perform an action on a Workspace object, right-click the object or click the ![Menu Dropdown](https:\/\/docs.databricks.com\/_images\/menu-dropdown.png) at the right side of an object. \n![Object menu](https:\/\/docs.databricks.com\/_images\/object-menu.png) \nFrom the drop-down menu you can: \n* If the object is a folder: \n+ Create a [notebook](https:\/\/docs.databricks.com\/notebooks\/index.html), [library](https:\/\/docs.databricks.com\/libraries\/index.html), [file](https:\/\/docs.databricks.com\/files\/index.html) (in Databricks Runtime 11.3 LTS and above), [MLflow experiment](https:\/\/docs.databricks.com\/mlflow\/experiments.html), or folder.\n+ Import a [notebook or Databricks archive](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html).\n* Clone the object. (Files cannot be cloned.)\n* Rename the object.\n* Move the object to another folder.\n* Move the object to Trash. See [Delete an object](https:\/\/docs.databricks.com\/workspace\/workspace-objects.html#delete-object).\n* Export a folder or notebook as a Databricks archive.\n* If the object is a notebook, copy the notebook\u2019s file path.\n* Set permissions on the object. 
\nIn addition to the procedures listed in this article, you can also do the following: \n* Create a folder with the [databricks workspace mkdirs](https:\/\/docs.databricks.com\/archive\/dev-tools\/cli\/workspace-cli.html#create-a-directory-in-a-workspace) command in the Databricks CLI, the [POST \/api\/2.0\/workspace\/mkdirs](https:\/\/docs.databricks.com\/api\/workspace\/introduction) operation in the Workspace API 2.0, and the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_directory](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/directory).\n* Create a notebook with the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_notebook](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/notebook).\n* Export a folder or notebook with the [databricks workspace export\\_dir](https:\/\/docs.databricks.com\/archive\/dev-tools\/cli\/workspace-cli.html#export-a-directory-from-a-workspace-to-your-local-filesystem) or [databricks workspace export](https:\/\/docs.databricks.com\/archive\/dev-tools\/cli\/workspace-cli.html#export-a-file-from-a-workspace-to-your-local-filesystem) commands in the Databricks CLI, and the [GET \/api\/2.0\/workspace\/export](https:\/\/docs.databricks.com\/api\/workspace\/introduction) operation in the Workspace API 2.0.\n* Set permissions on the following workspace objects: \n+ For notebooks, with the [PUT \/api\/2.0\/preview\/permissions\/notebooks\/{notebook\\_id}](https:\/\/docs.databricks.com\/api\/workspace\/permissions) or [PATCH \/api\/2.0\/preview\/permissions\/notebooks\/{notebook\\_id}](https:\/\/docs.databricks.com\/api\/workspace\/permissions) operations in the Permissions API 2.0.\n+ For folders, with the [PUT 
\/api\/2.0\/preview\/permissions\/directories\/{directory\\_id}](https:\/\/docs.databricks.com\/api\/workspace\/permissions) or [PATCH \/api\/2.0\/preview\/permissions\/directories\/{directory\\_id}](https:\/\/docs.databricks.com\/api\/workspace\/permissions) operations in the Permissions API 2.0. \n### Access recently used objects \nYou can access recently used objects by clicking ![Recents Icon](https:\/\/docs.databricks.com\/_images\/recents-icon.png) **Recents** in the sidebar or the Recents column on the workspace landing page. \nNote \nThe Recents list is cleared after deleting the browser cache and cookies. \n### Move an object \nTo move an object, you can drag-and-drop the object or click the ![Menu Dropdown](https:\/\/docs.databricks.com\/_images\/menu-dropdown.png) or ![Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret.png) at the right side of the object and select **Move**: \n![Move object](https:\/\/docs.databricks.com\/_images\/move-drop-down.png) \nTo move all the objects *inside a folder* to another folder, select the **Move** action on the source folder and select the **Move all items in `<folder-name>` rather than the folder itself** checkbox. \n![Move folder contents](https:\/\/docs.databricks.com\/_images\/move-contents-check-box.png) \n### Delete an object \nTo delete a folder, notebook, library, repository, or experiment, click the ![Menu Dropdown](https:\/\/docs.databricks.com\/_images\/menu-dropdown.png) or ![Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret.png) at the right side of the object and select **Move to Trash**. Contents of the Trash folder are automatically deleted permanently after **30 days**. \nYou can permanently delete an object in the Trash by selecting the ![Menu Dropdown](https:\/\/docs.databricks.com\/_images\/menu-dropdown.png) to the right of the object and selecting **Delete Immediately**. 
\n![Delete immediately](https:\/\/docs.databricks.com\/_images\/trash-delete.png) \nYou can permanently delete all objects in the Trash by selecting the ![Menu Dropdown](https:\/\/docs.databricks.com\/_images\/menu-dropdown.png) to the right of the Trash folder and selecting **Empty Trash**. \n![Empty trash](https:\/\/docs.databricks.com\/_images\/empty-trash.png) \nYou can also delete objects with the [databricks workspace delete](https:\/\/docs.databricks.com\/archive\/dev-tools\/cli\/workspace-cli.html#delete-an-object-from-a-workspace) or [databricks workspace rm](https:\/\/docs.databricks.com\/archive\/dev-tools\/cli\/workspace-cli.html#delete-an-object-from-a-workspace) commands in the Databricks CLI, and the [POST \/api\/2.0\/workspace\/delete](https:\/\/docs.databricks.com\/api\/workspace\/introduction) operation in the Workspace API 2.0. \nNote \nIf you delete an object using the Databricks CLI or the Workspace API 2.0, the object doesn\u2019t appear in the Trash folder. \n### Restore an object \nYou restore an object by dragging it from the ![Trash](https:\/\/docs.databricks.com\/_images\/trash-icon1.png) Trash folder to another folder.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/workspace-objects.html"} +{"content":"# What is Databricks?\n### Databricks integrations overview\n\nThe articles listed here provide information about how to connect to the large assortment of data sources, BI tools, and developer tools that you can use with Databricks. Many of these are available through our system of partners and our Partner Connect hub.\n\n### Databricks integrations overview\n#### Partner Connect\n\nPartner Connect is a user interface that allows validated solutions to integrate more quickly and easily with your Databricks clusters and SQL warehouses. 
\nFor more information, see [What is Databricks Partner Connect?](https:\/\/docs.databricks.com\/partner-connect\/index.html).\n\n### Databricks integrations overview\n#### Data sources\n\nDatabricks can read data from and write data to a variety of data formats such as CSV, [Delta Lake](https:\/\/docs.databricks.com\/delta\/index.html), JSON, Parquet, XML, and other formats, as well as data storage providers such as Amazon S3, Google BigQuery and Cloud Storage, Snowflake, and other providers. \nSee [Data ingestion](https:\/\/docs.databricks.com\/integrations\/index.html#ingest), [Connect to data sources](https:\/\/docs.databricks.com\/connect\/index.html), and [Data format options](https:\/\/docs.databricks.com\/query\/formats\/index.html).\n\n### Databricks integrations overview\n#### BI tools\n\nDatabricks has validated integrations with your favorite BI tools, including Power BI, Tableau, and others, allowing you to work with data through Databricks clusters and SQL warehouses, in many cases with low-code and no-code experiences. \nFor a comprehensive list, with connection instructions, see [BI and visualization](https:\/\/docs.databricks.com\/integrations\/index.html#bi).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/connect\/index.html"} +{"content":"# What is Databricks?\n### Databricks integrations overview\n#### Other ETL tools\n\nIn addition to access to all kinds of [data sources](https:\/\/docs.databricks.com\/getting-started\/connect\/index.html#data-sources), Databricks provides integrations with ETL\/ELT tools like dbt, Prophecy, and Azure Data Factory, as well as data pipeline orchestration tools like Airflow and SQL database tools like DataGrip, DBeaver, and SQL Workbench\/J. 
\nFor connection instructions, see: \n* **ETL tools**: [Data preparation and transformation](https:\/\/docs.databricks.com\/integrations\/index.html#prep)\n* **Airflow**: [Orchestrate Databricks jobs with Apache Airflow](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-airflow-with-jobs.html)\n* **SQL database tools**: [Use a SQL database tool](https:\/\/docs.databricks.com\/dev-tools\/index-sql.html).\n\n### Databricks integrations overview\n#### IDEs and other developer tools\n\nDatabricks supports developer tools such as DataGrip, IntelliJ, PyCharm, Visual Studio Code, and others, that allow you to programmatically access Databricks [compute](https:\/\/docs.databricks.com\/compute\/index.html), including [SQL warehouses](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html). \nFor a comprehensive list of tools that support developers, see [Develop on Databricks](https:\/\/docs.databricks.com\/languages\/index.html).\n\n### Databricks integrations overview\n#### Git\n\nDatabricks Git folders provide repository-level integration with your favorite Git providers, so you can develop code in a Databricks notebook and sync it with a remote Git repository. See [Git integration with Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/connect\/index.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage data recipients for Delta Sharing\n\nThis article explains how to create and manage recipients for Delta Sharing. \nA recipient is the named object that represents the identity of a user or group of users in the real world who consume shared data. 
The way you create recipients differs depending on whether or not your recipient has access to a Databricks workspace that is enabled for [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html): \n* For recipients with access to a Databricks workspace that is enabled for Unity Catalog, you can create a recipient object with a secure connection managed entirely by Databricks. This sharing mode is called *Databricks-to-Databricks sharing*.\n* For recipients without access to a Databricks workspace that is enabled for Unity Catalog, you must use *open sharing*, with a secure connection that you manage using token-based authentication. \nFor more information about these two sharing modes and when to choose which, see [Open sharing versus Databricks-to-Databricks sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html#open-vs-d-to-d).\n\n### Create and manage data recipients for Delta Sharing\n#### Requirements\n\nTo create a recipient: \n* You must be a metastore admin or have the `CREATE_RECIPIENT` privilege for the Unity Catalog metastore where the data you want to share is registered.\n* You must create the recipient using a Databricks workspace that has that Unity Catalog metastore attached.\n* If you use a Databricks notebook to create the recipient, your cluster must use Databricks Runtime 11.3 LTS or above and either shared or single-user cluster access mode. 
\nFor other recipient management operations (such as view, delete, update, and grant recipient access to a share) see the permissions requirements listed in the operation-specific sections of this article.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage data recipients for Delta Sharing\n#### Create a recipient object for users who have access to Databricks (Databricks-to-Databricks sharing)\n\nIf your data recipient has access to a Databricks workspace that has been [enabled for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html), you can create a recipient object with an authentication type of `DATABRICKS`. \nA recipient object with the authentication type of `DATABRICKS` represents a data recipient on a particular Unity Catalog metastore, identified in the recipient object definition by a *sharing identifier* string consisting of the metastore\u2019s cloud, region, and UUID. The data shared with this recipient can be accessed only on that metastore. \n### Step 1: Request the recipient\u2019s sharing identifier \nAsk a recipient user to send you the sharing identifier for the Unity Catalog metastore that is attached to the workspaces where the recipient user or group of users will work with the shared data. \nThe sharing identifier is a string consisting of the metastore\u2019s cloud, region, and UUID (the unique identifier for the metastore), in the format `<cloud>:<region>:<uuid>`. \nFor example, in the following screenshot, the complete sharing identifier string is `aws:us-west-2:19a84bee-54bc-43a2-87de-023d0ec16016`. 
\n![example of CURRENT_METASTORE](https:\/\/docs.databricks.com\/_images\/share-metastore.png) \nThe recipient can find the identifier using Catalog Explorer, the Databricks Unity Catalog CLI, or the default SQL function `CURRENT_METASTORE` in a Databricks notebook or Databricks SQL query that runs on a Unity-Catalog-capable cluster in the workspace they intend to use. \nTo get the sharing identifier using Catalog Explorer: \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared with me**.\n3. Above the Providers tab, click the **Sharing identifier** copy icon. \nRun the following command in a notebook or the Databricks SQL query editor: \n```\nSELECT CURRENT_METASTORE();\n\n``` \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). The sharing identifier is returned as the `global_metastore_id`. \n```\ndatabricks metastores summary\n\n``` \nYou can help the recipient by sending your contact the information contained in this step, or you can point them to [Get access in the Databricks-to-Databricks model](https:\/\/docs.databricks.com\/data-sharing\/recipient.html#get-access-db-to-db). \n### Step 2: Create the recipient \nTo create a recipient for Databricks-to-Databricks sharing, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `CREATE RECIPIENT` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: Metastore admin or user with the `CREATE_RECIPIENT` privilege for the Unity Catalog metastore where the data you want to share is registered. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. Click **New Recipient**.\n4. 
Enter the recipient **Name** and **Sharing identifier**. \nUse the entire sharing identifier string in the format `<cloud>:<region>:<uuid>`. For example, `aws:us-west-2:19a84bee-54bc-43a2-87de-023d0ec16016`.\n5. (Optional) Enter a comment.\n6. Click **Create**.\n7. (Optional) Create custom **Recipient properties**. \nClick **Edit properties > +Add property**. Then add a property name (**Key**) and **Value**. For details, see [Manage recipient properties](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#properties). \nRun the following command in a notebook or the Databricks SQL query editor: \n```\nCREATE RECIPIENT [IF NOT EXISTS] <recipient-name>\nUSING ID '<sharing-identifier>'\n[COMMENT \"<comment>\"];\n\n``` \nUse the entire sharing identifier string in the format `<cloud>:<region>:<uuid>`. For example, `aws:eu-west-1:g0c979c8-3e68-4cdf-94af-d05c120ed1ef`. \nYou can also add custom properties for the recipient. For details, see [Manage recipient properties](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#properties). \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). Replace the placeholder values: \n* `<recipient-name>`: The name of the recipient.\n* `<sharing-identifier>`: The entire sharing identifier string in the format `<cloud>:<region>:<uuid>`. For example, `aws:eu-west-1:g0c979c8-3e68-4cdf-94af-d05c120ed1ef`.\n* `<authentication-type>`: Set to `DATABRICKS` when a sharing identifier string in the format `<cloud>:<region>:<uuid>` is provided for `<sharing-identifier>`. \n```\ndatabricks recipients create <recipient-name> <authentication-type> --sharing-code <sharing-identifier>\n\n``` \nYou can also add custom properties for the recipient. For details, see [Manage recipient properties](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#properties). 
\nThe recipient is created with the `authentication_type` of `DATABRICKS`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage data recipients for Delta Sharing\n#### Create a recipient object for all other users (open sharing)\n\nIf you want to share data with users outside of your Databricks workspace, regardless of whether they use Databricks themselves, you can use open Delta Sharing to share your data securely. Here\u2019s how it works: \n1. As a data provider, you create the recipient object in your Unity Catalog metastore.\n2. When you create the recipient object, Databricks generates a token, a credential file that includes the token, and an activation link for you to share with the recipient. The recipient object has the authentication type of `TOKEN`.\n3. The recipient accesses the activation link, downloads the credential file, and uses the credential file to authenticate and get read access to the tables you include in the shares you give them access to. \n### Step 1: Create the recipient \nTo create a recipient for open sharing, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `CREATE RECIPIENT` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: Metastore admin or user with the `CREATE_RECIPIENT` privilege for the Unity Catalog metastore where the data you want to share is registered. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. Click **New Recipient**.\n4. Enter the recipient **Name**\n5. (Optional) Enter a comment.\n6. Click **Create**. \nYou do not use the sharing identifier for open sharing recipients.\n7. (Optional) Create custom **Recipient properties**. 
\nClick **Edit properties > +Add property**. Then add a property name (**Key**) and **Value**. For details, see [Manage recipient properties](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#properties). \nRun the following command in a notebook or the Databricks SQL query editor: \n```\nCREATE RECIPIENT [IF NOT EXISTS] <recipient-name>\n[COMMENT \"<comment>\"];\n\n``` \nYou can also add custom properties for the recipient. For details, see [Manage recipient properties](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#properties). \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). \n```\ndatabricks recipients create <recipient-name>\n\n``` \nYou can also add custom properties for the recipient. For details, see [Manage recipient properties](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#properties). \nOutput includes the `activation_url` that you share with the recipient. \nThe recipient is created with the `authentication_type` of `TOKEN`. \nNote \nWhen you create the recipient, you have the option to limit recipient access to a restricted set of IP addresses. You can also add an IP access list to an existing recipient. See [Restrict Delta Sharing recipient access using IP access lists (open sharing)](https:\/\/docs.databricks.com\/data-sharing\/access-list.html). \n### Step 2: Get the activation link \nTo get the new recipient\u2019s activation link, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `DESCRIBE RECIPIENT` SQL command in a Databricks notebook or the Databricks SQL query editor. \nIf the recipient has already downloaded the credential file, the activation link is not returned or displayed. \n**Permissions required**: Metastore admin, user with the `USE RECIPIENT` privilege, or the recipient object owner. \n1. 
In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Recipients** tab, find and select the recipient.\n4. On the recipient details page, copy the **Activation link**. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nDESCRIBE RECIPIENT <recipient-name>;\n\n``` \nOutput includes the `activation_link`. \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). \n```\ndatabricks recipients get <recipient-name>\n\n``` \nOutput includes the `activation_url`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage data recipients for Delta Sharing\n#### Grant the recipient access to a share\n\nOnce you\u2019ve created the recipient and [created shares](https:\/\/docs.databricks.com\/data-sharing\/create-share.html), you can grant the recipient access to those shares. \nTo grant share access to recipients, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `GRANT ON SHARE` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: One of the following: \n* Metastore admin.\n* Delegated permissions or ownership on both the share and the recipient objects ((`USE SHARE` + `SET SHARE PERMISSION`) or share owner) AND (`USE RECIPIENT` or recipient owner). 
\nFor instructions, see [Manage access to Delta Sharing data shares (for providers)](https:\/\/docs.databricks.com\/data-sharing\/grant-access.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage data recipients for Delta Sharing\n#### Send the recipient their connection information\n\nYou must let the recipient know how to access the data that you are sharing with them. The information that you share with the recipient depends on whether you are using Databricks-to-Databricks sharing or open sharing: \n* **For Databricks-to-Databricks sharing**, you send a link to instructions for accessing the data you are sharing. \nA provider object that lists available shares is automatically created in the recipient\u2019s metastore. Recipients don\u2019t need to do anything but view and select the shares they want to use. See [Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html).\n* **For open sharing**, you use a secure channel to share the [activation link](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#get-activation-link) and a link to [instructions for using it](https:\/\/docs.databricks.com\/data-sharing\/recipient.html#get-access-open). \nYou can download the credential file only once. Recipients should treat the downloaded credential as a secret and must not share it outside of their organization. If you have concerns that a credential may have been handled insecurely, you can [rotate a recipient\u2019s credential](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#rotate-credential) at any time. 
For more information about managing credentials to ensure secure recipient access, see [Security considerations for tokens](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#security-considerations).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage data recipients for Delta Sharing\n#### View recipients\n\nTo view a list of recipients, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `SHOW RECIPIENTS` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: You must be a metastore admin or have the `USE RECIPIENT` privilege to view all recipients in the metastore. Other users have access only to the recipients that they own. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. Open the **Recipients** tab. \nRun the following command in a notebook or the Databricks SQL query editor. Optionally, replace `<pattern>` with a [`LIKE` predicate](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/like.html). \n```\nSHOW RECIPIENTS [LIKE <pattern>];\n\n``` \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). \n```\ndatabricks recipients list\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage data recipients for Delta Sharing\n#### View recipient details\n\nTo view details about a recipient, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `DESCRIBE RECIPIENT` SQL command in a Databricks notebook or the Databricks SQL query editor. 
\n**Permissions required**: Metastore admin, user with the `USE RECIPIENT` privilege, or the recipient object owner. \nDetails include: \n* The recipient\u2019s creator, creation timestamp, comments, and authentication type (`TOKEN` or `DATABRICKS`).\n* If the recipient uses open sharing: the token lifetime, activation link, activation status (whether the credential has been downloaded), and IP access lists, if assigned.\n* If the recipient uses Databricks-to-Databricks sharing: the cloud, region, and metastore ID of the recipient\u2019s Unity Catalog metastore, as well as activation status.\n* Recipient properties, including custom properties. See [Manage recipient properties](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#properties). \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Recipients** tab, find and select the recipient.\n4. View recipient details on the **Details** tab. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nDESCRIBE RECIPIENT <recipient-name>;\n\n``` \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). \n```\ndatabricks recipients get <recipient-name>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage data recipients for Delta Sharing\n#### View a recipient\u2019s share permissions\n\nTo view the list of shares that a recipient has been granted access to, you can use Catalog Explorer, the Databricks CLI, or the `SHOW GRANTS TO RECIPIENT` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: Metastore admin, user with the `USE RECIPIENT` privilege, or the recipient object owner. 
\n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Recipients** tab, find and select the recipient.\n4. Go to the **Shares** tab to view the list of shares shared with the recipient. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nSHOW GRANTS TO RECIPIENT <recipient-name>;\n\n``` \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). \n```\ndatabricks recipients share-permissions <recipient-name>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage data recipients for Delta Sharing\n#### Update a recipient\n\nTo update a recipient, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `ALTER RECIPIENT` SQL command in a Databricks notebook or the Databricks SQL query editor. \nProperties you can update include recipient name, owner, comment, and custom properties. You cannot update the recipient name using Catalog Explorer. \n**Permissions required**: You must be a metastore admin or owner of the recipient object to update the owner. You must be a metastore admin (or user with the `CREATE_RECIPIENT` privilege) *and* the owner to update the name. You must be the owner to update the comment or custom properties. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Recipients** tab, find and select the recipient.\n4. On the details page, you can: \n* Update the owner.\n* Edit or add a comment.\n* Edit, remove, or add custom **Recipient properties**. \nClick **Edit properties**. 
To add a property, click **+Add property** and enter a property name (**Key**) and **Value**. For details, see [Manage recipient properties](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#properties). \nRun one or more of the following commands in a notebook or the Databricks SQL query editor. \n```\nALTER RECIPIENT <recipient-name> RENAME TO <new-recipient-name>;\n\nALTER RECIPIENT <recipient-name> OWNER TO <new-owner>;\n\nCOMMENT ON RECIPIENT <recipient-name> IS \"<new-comment>\";\n\nALTER RECIPIENT <recipient-name> SET PROPERTIES ( <property-key> = property_value [, ...] )\n\nALTER RECIPIENT <recipient-name> UNSET PROPERTIES ( <property-key> [, ...] )\n\n``` \nFor more information about properties, see [Manage recipient properties](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#properties). \nCreate a JSON file that includes an update to the recipient name, comment, owner, IP access list, or custom properties. \n```\n{\n\"name\": \"new-recipient-name\",\n\"owner\": \"someone-else@example.com\",\n\"comment\": \"something new\",\n\"ip_access_list\": {\n\"allowed_ip_addresses\": [\"8.8.8.8\", \"8.8.8.4\/10\"]\n},\n\"property\": {\n\"country\": \"us\",\n\"id\": \"001\"\n}\n}\n\n``` \nThen run the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). Replace `<recipient-name>` with the current recipient name and replace `update-recipient-settings.json` with the filename of the JSON file. 
\n```\ndatabricks recipients update --json-file update-recipient-settings.json\n\n``` \nFor more information about properties, see [Manage recipient properties](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#properties).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage data recipients for Delta Sharing\n#### Manage recipient tokens (open sharing)\n\nIf you are sharing data with a recipient using [open sharing](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#open-sharing), you may need to rotate that recipient\u2019s token. \nYou should rotate a recipient\u2019s credential and generate a new activation link in the following circumstances: \n* When the existing recipient token is about to expire.\n* If a recipient loses their activation URL or if it is compromised.\n* If the credential is corrupted, lost, or compromised after it is downloaded by a recipient.\n* When you modify the recipient token lifetime for a metastore. See [Modify the recipient token lifetime](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#modify-recipient-token-lifetime). \n### Security considerations for tokens \nAt any given time, a recipient can have at most two tokens: an active token and a rotated token. Until the rotated token expires, attempting to rotate the token again results in an error. \nWhen you rotate a recipient\u2019s token, you can optionally set `--existing-token-expire-in-seconds` to the number of seconds before the existing recipient token expires. If you set the value to `0`, the existing recipient token expires immediately. \nDatabricks recommends that you set `--existing-token-expire-in-seconds` to a relatively short period that gives the recipient organization time to access the new activation URL while minimizing the amount of time that the recipient has two active tokens. 
If you suspect that the recipient token is compromised, Databricks recommends that you force the existing recipient token to expire immediately. \nIf a recipient\u2019s existing activation URL has never been accessed and the recipient has not been rotated, rotating the recipient invalidates the existing activation URL and replaces it with a new one. \nIf all recipient tokens have expired, rotating the recipient replaces the existing activation URL with a new one. Databricks recommends that you promptly rotate or drop a recipient whose token has expired. \nIf a recipient activation link is inadvertently sent to the wrong person or is sent over an insecure channel, Databricks recommends that you: \n1. [Revoke the recipient\u2019s access to the share](https:\/\/docs.databricks.com\/data-sharing\/grant-access.html#revoke).\n2. Rotate the recipient and set `--existing-token-expire-in-seconds` to `0`.\n3. Share the new activation link with the intended recipient over a secure channel.\n4. After the activation URL has been accessed, grant the recipient access to the share again. \nIn extreme situations, instead of rotating the recipient\u2019s token, you can [drop](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#delete-recipient) and re-create the recipient. \n### Rotate a recipient\u2019s token \nTo rotate a recipient\u2019s token, you can use Catalog Explorer or the Databricks Unity Catalog CLI. \n**Permissions required**: Recipient object owner. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Recipients** tab, find and select the recipient.\n4. On the **Details** tab, under **Token Expiration**, click **Rotate**.\n5. On the **Rotate token** dialog, set the token to expire either immediately or for a set period of time. 
For advice about when to expire existing tokens, see [Security considerations for tokens](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#security-considerations).\n6. Click **Rotate**.\n7. On the **Details** tab, copy the new **Activation link** and share it with the recipient over a secure channel. See [Step 2: Get the activation link](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#get-activation-link). \n1. Run the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). Replace the placeholder values: \n* `<recipient-name>`: the name of the recipient.\n* `<expiration-seconds>`: The number of seconds until the existing recipient token should expire. During this period, the existing token will continue to work. A value of `0` means the existing token expires immediately. For advice about when to expire existing tokens, see [Security considerations for tokens](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#security-considerations).\n```\ndatabricks recipients rotate-token \\\n<recipient-name> \\\n<expiration-seconds>\n\n```\n2. Get the recipient\u2019s new activation link and share it with the recipient over a secure channel. See [Step 2: Get the activation link](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#get-activation-link). \n### Modify the recipient token lifetime \nIf you need to modify the default recipient token lifetime for your Unity Catalog metastore, you can use Catalog Explorer or the Databricks Unity Catalog CLI. \nNote \nThe recipient token lifetime for existing recipients is not updated automatically when you change the default recipient token lifetime for a metastore. In order to apply the new token lifetime to a given recipient, you must rotate their token. See [Manage recipient tokens (open sharing)](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#rotate-credential). 
\n**Permissions required**: Account admin. \n1. Log in to the [account console](https:\/\/accounts.cloud.databricks.com).\n2. In the sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. Click the metastore name.\n4. Enable **Set expiration**.\n5. Enter a number of seconds, minutes, hours, or days, and select the unit of measure.\n6. Click **Enable**. \nIf you disable **Set expiration**, recipient tokens do not expire. Databricks recommends that you configure tokens to expire. \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). Replace `12a345b6-7890-1cd2-3456-e789f0a12b34` with the metastore UUID, and replace `86400` with the number of seconds before the recipient token expires. If you set this value to `0`, recipient tokens do not expire. Databricks recommends that you configure tokens to expire. \n```\ndatabricks metastores update \\\n12a345b6-7890-1cd2-3456-e789f0a12b34 \\\n--delta-sharing-recipient-token-lifetime-in-seconds 86400\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage data recipients for Delta Sharing\n#### (Optional) Restrict recipient access using access lists\n\nYou can limit recipient access to a restricted set of IP addresses when you configure the recipient object. See [Restrict Delta Sharing recipient access using IP access lists (open sharing)](https:\/\/docs.databricks.com\/data-sharing\/access-list.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage data recipients for Delta Sharing\n#### Manage recipient properties\n\nRecipient objects include predefined properties that you can use to refine data sharing access. 
For example, you can use them to do the following: \n* Share different table partitions with different recipients, enabling you to use the same shares with multiple recipients while maintaining data boundaries between them.\n* Share dynamic views that limit recipient access to table data at the row or column level based on recipient properties. \nYou can also create custom properties. \nThe predefined properties start with `databricks.` and include the following: \n* `databricks.accountId`: The Databricks account that a data recipient belongs to (Databricks-to-Databricks sharing only).\n* `databricks.metastoreId`: The Unity Catalog metastore that a data recipient belongs to (Databricks-to-Databricks sharing only).\n* `databricks.name`: The name of the data recipient. \nCustom properties that might be of value could include, for example, `country`. For example, if you attach the custom property `'country' = 'us'` to a recipient, you can partition table data by country and share only rows that have US data with the recipients that have that property assigned. You can also share a dynamic view that restricts row or column access based on recipient properties. For more detailed examples, see [Use recipient properties to do partition filtering](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#properties) and [Add dynamic views to a share to filter rows and columns](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#dynamic-views). \n### Requirements \nRecipient properties are supported in Databricks Runtime 12.2 and above. \n### Add properties when you create or update a recipient \nYou can add properties when you create a recipient or update them for an existing recipient. You can use Catalog Explorer, the Databricks Unity Catalog CLI, or SQL commands in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: Metastore admin or user with the `CREATE RECIPIENT` privilege for the Unity Catalog metastore. 
\nWhen you [create](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#create-recipient-db-to-db) or [update](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#update) a recipient using Catalog Explorer, add or update custom properties by doing the following: \n1. Go to the Recipient details page. \nIf you are creating a new recipient, you land on this page after you click **Create**. If you are updating an existing recipient, go to this page by clicking **Delta Sharing > Shared by me > Recipients** and selecting the recipient.\n2. Click **Edit properties > +Add property**.\n3. Enter a property name (**Key**) and **Value**. \nFor example, if you want to filter shared data by country and share only US data with this recipient, you can create a key named \u201ccountry\u201d with a value of \u201cUS.\u201d\n4. Click **Save**. \nTo add a custom property when you create a recipient, run the following command in a notebook or the Databricks SQL query editor: \n```\nCREATE RECIPIENT [IF NOT EXISTS] <recipient-name>\n[USING ID '<sharing-identifier>'] \/* Skip this if you are using open sharing *\/\n[COMMENT \"<comment>\"]\nPROPERTIES ( '<property-key>' = '<property-value>' [, ...] );\n\n``` \n`<property-key>` can be a string literal or identifier. `<property-value>` must be a string literal. \nFor example: \n```\nCREATE RECIPIENT acme PROPERTIES ('country' = 'us', 'partner_id' = '001');\n\n``` \nTo add, edit, or delete custom properties for an existing recipient, run one of the following: \n```\nALTER RECIPIENT <recipient-name> SET PROPERTIES ( '<property-key>' = '<property-value>' [, ...] );\n\nALTER RECIPIENT <recipient-name> UNSET PROPERTIES ( '<property-key>' );\n\n``` \nTo add custom properties when you create a recipient, run the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). 
Replace the placeholder values: \n* `<recipient-name>`: The name of the recipient.\n* `<property-key>` can be a string literal or identifier.\n* `<property-value>` must be a string literal. \n```\ndatabricks recipients create \\\n--json='{\n\"name\": \"<recipient-name>\",\n\"properties_kvpairs\": {\n\"properties\": {\n\"<property-key>\": \"<property-value>\"\n}\n}\n}'\n\n``` \nFor example: \n```\ndatabricks recipients create \\\n--json='{\n\"name\": \"<recipient-name>\",\n\"properties_kvpairs\": {\n\"properties\": {\n\"country\": \"us\",\n\"partner_id\":\"001\"\n}\n}\n}'\n\n``` \nTo add or edit custom properties for an existing recipient, use `update` instead of `create`: \n```\ndatabricks recipients update \\\n--json='{\n\"name\": \"<recipient-name>\",\n\"properties_kvpairs\": {\n\"properties\": {\n\"country\": \"us\",\n\"partner_id\":\"001\"\n}\n}\n}'\n\n``` \n### View recipient properties \nTo view recipient properties, follow the instructions in [View recipient details](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#recipient-details).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage data recipients for Delta Sharing\n#### Delete a recipient\n\nTo delete a recipient, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `DROP RECIPIENT` SQL command in a Databricks notebook or the Databricks SQL query editor. You must be the recipient object owner to delete the recipient. \nWhen you delete a recipient, the users represented by the recipient can no longer access the shared data. Tokens that recipients use in an open sharing scenario are invalidated. \n**Permissions required**: Recipient object owner. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. 
In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Recipients** tab, find and select the recipient.\n4. Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu (also known as the three-dot menu) and select **Delete**.\n5. On the confirmation dialog, click **Delete**. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nDROP RECIPIENT [IF EXISTS] <recipient-name>;\n\n``` \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). \n```\ndatabricks recipients delete <recipient-name>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Capture and view data lineage using Unity Catalog\n\nThis article describes how to capture and visualize data lineage using Catalog Explorer, the data lineage system tables, and the REST API. \nYou can use Unity Catalog to capture runtime data lineage across queries run on Databricks. Lineage is supported for all languages and is captured down to the column level. Lineage data includes notebooks, workflows, and dashboards related to the query. Lineage can be visualized in Catalog Explorer in near real time and retrieved programmatically using the lineage system tables and the Databricks REST API. \nLineage is aggregated across all workspaces attached to a Unity Catalog metastore. This means that lineage captured in one workspace is visible in any other workspace sharing that metastore. Users must have the correct permissions to view the lineage data. Lineage data is retained for 1 year. 
\nFor information about tracking the lineage of a machine learning model, see [Track the data lineage of a model in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#model-lineage).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Capture and view data lineage using Unity Catalog\n##### Requirements\n\nThe following are required to capture data lineage using Unity Catalog: \n* The workspace must have [Unity Catalog enabled](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html).\n* Tables must be registered in a Unity Catalog metastore.\n* Queries must use the Spark DataFrame (for example, Spark SQL functions that return a DataFrame) or Databricks SQL interfaces. For examples of Databricks SQL and PySpark queries, see [Examples](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html#lineage-examples).\n* To view the lineage of a table or view, users must have at least the `BROWSE` privilege on the table\u2019s or view\u2019s parent catalog.\n* To view lineage information for notebooks, workflows, or dashboards, users must have permissions on these objects as defined by the access control settings in the workspace. See [Lineage permissions](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html#permissions).\n* To view lineage for a [Unity Catalog-enabled pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html), you must have `CAN_VIEW` permissions on the pipeline. \n* You might need to update your outbound firewall rules to allow for connectivity to the Amazon Kinesis endpoint in the Databricks control plane. Typically this applies if your Databricks workspace is deployed in your own VPC or you use AWS PrivateLink within your Databricks network environment. 
To get the Kinesis endpoint for your workspace region, see [Kinesis addresses](https:\/\/docs.databricks.com\/resources\/supported-regions.html#kinesis). See also [Configure a customer-managed VPC](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html) and [Enable AWS PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Capture and view data lineage using Unity Catalog\n##### Limitations\n\n* Streaming between Delta tables is supported only in Databricks Runtime 11.3 LTS or above.\n* Because lineage is computed on a 1-year rolling window, lineage collected more than 1 year ago is not displayed. For example, if a job or query reads data from table A and writes to table B, the link between table A and table B is displayed for only 1 year. \nYou can filter lineage data by time frame. When \u201cAll lineage\u201d is selected, lineage data collected starting in June 2023 is displayed.\n* Workflows that use the Jobs API `runs submit` request are unavailable when viewing lineage. Table and column level lineage is still captured when using the `runs submit` request, but the link to the run is not captured.\n* Unity Catalog captures lineage to the column level as much as possible. However, there are some cases where column-level lineage cannot be captured.\n* Column lineage is only supported when both the source and target are referenced by table name (Example: `select * from <catalog>.<schema>.<table>`). Column lineage cannot be captured if the source or the target are addressed by path (Example: `select * from delta.\"s3:\/\/<bucket>\/<path>\"`).\n* If a table is renamed, lineage is not captured for the renamed table.\n* If you use Spark SQL dataset checkpointing, lineage is not captured. 
See [pyspark.sql.DataFrame.checkpoint](https:\/\/spark.apache.org\/docs\/3.1.1\/api\/python\/reference\/api\/pyspark.sql.DataFrame.checkpoint.html#pyspark.sql.DataFrame.checkpoint) in the Apache Spark documentation.\n* Unity Catalog captures lineage from Delta Live Tables pipelines in most cases. However, in some instances, complete lineage coverage cannot be guaranteed, such as when pipelines use the [APPLY CHANGES](https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html) API or [TEMPORARY tables](https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html#sql-properties).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Capture and view data lineage using Unity Catalog\n##### Examples\n\nNote \n* The following examples use the catalog name `lineage_data` and the schema name `lineagedemo`. To use a different catalog and schema, change the names used in the examples.\n* To complete this example, you must have `CREATE` and `USE SCHEMA` privileges on a schema. A metastore admin, catalog owner, or schema owner can grant these privileges. For example, to give all users in the group \u2018data\\_engineers\u2019 permission to create tables in the `lineagedemo` schema in the `lineage_data` catalog, a user with one of the above privileges or roles can run the following queries: \n```\nCREATE SCHEMA lineage_data.lineagedemo;\nGRANT USE SCHEMA, CREATE on SCHEMA lineage_data.lineagedemo to `data_engineers`;\n\n``` \n### Capture and explore lineage \nTo capture lineage data, use the following steps: \n1. Go to your Databricks landing page, click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar, and select **Notebook** from the menu.\n2. Enter a name for the notebook and select **SQL** in **Default Language**.\n3. In **Cluster**, select a cluster with access to Unity Catalog.\n4. 
Click **Create**.\n5. In the first notebook cell, enter the following queries: \n```\nCREATE TABLE IF NOT EXISTS\nlineage_data.lineagedemo.menu (\nrecipe_id INT,\napp string,\nmain string,\ndessert string\n);\n\nINSERT INTO lineage_data.lineagedemo.menu\n(recipe_id, app, main, dessert)\nVALUES\n(1,\"Ceviche\", \"Tacos\", \"Flan\"),\n(2,\"Tomato Soup\", \"Souffle\", \"Creme Brulee\"),\n(3,\"Chips\",\"Grilled Cheese\",\"Cheesecake\");\n\nCREATE TABLE\nlineage_data.lineagedemo.dinner\nAS SELECT\nrecipe_id, concat(app,\" + \", main,\" + \",dessert)\nAS\nfull_menu\nFROM\nlineage_data.lineagedemo.menu\n\n```\n6. To run the queries, click in the cell and press **shift+enter** or click ![Run Menu](https:\/\/docs.databricks.com\/_images\/run-menu.png) and select **Run Cell**. \nTo use Catalog Explorer to view the lineage generated by these queries, use the following steps: \n1. In the **Search** box in the top bar of the Databricks workspace, enter `lineage_data.lineagedemo.dinner` and click **Search lineage\\_data.lineagedemo.dinner in Databricks**.\n2. Under **Tables**, click the `dinner` table.\n3. Select the **Lineage** tab. The lineage panel appears and displays related tables (for this example it\u2019s the `menu` table).\n4. To view an interactive graph of the data lineage, click **See Lineage Graph**. By default, one level is displayed in the graph. You can click on the ![Plus Sign Icon](https:\/\/docs.databricks.com\/_images\/plus-sign-icon.png) icon on a node to reveal more connections if they are available.\n5. Click on an arrow connecting nodes in the lineage graph to open the **Lineage connection** panel. The **Lineage connection** panel shows details about the connection, including source and target tables, notebooks, and workflows. \n![Lineage graph](https:\/\/docs.databricks.com\/_images\/uc-lineage-graph.png)\n6. 
To show the notebook associated with the `dinner` table, select the notebook in the **Lineage connection** panel or close the lineage graph and click **Notebooks**. To open the notebook in a new tab, click on the notebook name.\n7. To view the column-level lineage, click on a column in the graph to show links to related columns. For example, clicking on the \u2018full\\_menu\u2019 column shows the upstream columns the column was derived from: \n![Full menu column lineage](https:\/\/docs.databricks.com\/_images\/uc-lineage-column-lineage.png) \nTo demonstrate creating and viewing lineage with a different language, for example, Python, use the following steps: \n1. Open the notebook you created previously, create a new cell, and enter the following Python code: \n```\n%python\nfrom pyspark.sql.functions import rand, round\ndf = spark.range(3).withColumn(\"price\", round(10*rand(seed=42),2)).withColumnRenamed(\"id\",\"recipe_id\")\n\ndf.write.mode(\"overwrite\").saveAsTable(\"lineage_data.lineagedemo.price\")\n\ndinner = spark.read.table(\"lineage_data.lineagedemo.dinner\")\nprice = spark.read.table(\"lineage_data.lineagedemo.price\")\n\ndinner_price = dinner.join(price, on=\"recipe_id\")\ndinner_price.write.mode(\"overwrite\").saveAsTable(\"lineage_data.lineagedemo.dinner_price\")\n\n```\n2. Run the cell by clicking in the cell and pressing **shift+enter** or clicking ![Run Menu](https:\/\/docs.databricks.com\/_images\/run-menu.png) and selecting **Run Cell**.\n3. In the **Search** box in the top bar of the Databricks workspace, enter `lineage_data.lineagedemo.price` and click **Search lineage\\_data.lineagedemo.price in Databricks**.\n4. Under **Tables**, click the `price` table.\n5. Select the **Lineage** tab and click **See Lineage Graph**. Click on the ![Plus Sign Icon](https:\/\/docs.databricks.com\/_images\/plus-sign-icon.png) icons to explore the data lineage generated by the SQL and Python queries. 
\n![Expanded lineage graph](https:\/\/docs.databricks.com\/_images\/uc-expanded-lineage-graph.png)\n6. Click on an arrow connecting nodes in the lineage graph to open the **Lineage connection** panel. The **Lineage connection** panel shows details about the connection, including source and target tables, notebooks, and workflows. \n### Capture and view workflow lineage \nLineage is also captured for any workflow that reads or writes to Unity Catalog. To demonstrate viewing lineage for a Databricks workflow, use the following steps: \n1. Click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar and select **Notebook** from the menu.\n2. Enter a name for the notebook and select **SQL** in **Default Language**.\n3. Click **Create**.\n4. In the first notebook cell, enter the following query: \n```\nSELECT * FROM lineage_data.lineagedemo.menu\n\n```\n5. Click **Schedule** in the top bar. In the schedule dialog, select **Manual**, select a cluster with access to Unity Catalog, and click **Create**.\n6. Click **Run now**.\n7. In the **Search** box in the top bar of the Databricks workspace, enter `lineage_data.lineagedemo.menu` and click **Search lineage\\_data.lineagedemo.menu in Databricks**.\n8. Under **Tables View all tables**, click the `menu` table.\n9. Select the **Lineage** tab, click **Workflows**, and select the **Downstream** tab. The job name appears under **Job Name** as a consumer of the `menu` table. \n### Capture and view dashboard lineage \nTo demonstrate viewing lineage for a SQL dashboard, use the following steps: \n1. Go to your Databricks landing page and open Catalog Explorer by clicking **Catalog** in the sidebar.\n2. Click on the catalog name, click **lineagedemo**, and select the `menu` table. You can also use the **Search tables** text box in the top bar to search for the `menu` table.\n3. Click **Actions > Create a quick dashboard**.\n4. Select columns to add to the dashboard and click **Create**.\n5. 
In the **Search** box in the top bar of the Databricks workspace, enter `lineage_data.lineagedemo.menu` and click **Search lineage\_data.lineagedemo.menu in Databricks**.\n6. Under **Tables View all tables**, click the `menu` table.\n7. Select the **Lineage** tab and click **Dashboards**. The dashboard name appears under **Dashboard Name** as a consumer of the menu table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Capture and view data lineage using Unity Catalog\n##### Lineage permissions\n\nLineage graphs share the same [permission model](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html) as Unity Catalog. If a user does not have the `BROWSE` or `SELECT` privilege on a table, they cannot explore the lineage. Additionally, users can only see notebooks, workflows, and dashboards that they have permission to view. For example, if you run the following commands for a non-admin user `userA`: \n```\nGRANT USE SCHEMA on SCHEMA lineage_data.lineagedemo to `userA@company.com`;\nGRANT SELECT on lineage_data.lineagedemo.menu to `userA@company.com`;\n\n``` \nWhen `userA` views the lineage graph for the `lineage_data.lineagedemo.menu` table, they will see the `menu` table. They will not be able to see information about associated tables, such as the downstream `lineage_data.lineagedemo.dinner` table. The `dinner` table is displayed as a `masked` node in the display to `userA`, and `userA` cannot expand the graph to reveal downstream tables from tables they do not have permission to access. \nIf you run the following command to grant the `BROWSE` permission to a non-admin user `userB`: \n```\nGRANT BROWSE on CATALOG lineage_data to `userB@company.com`;\n\n``` \n`userB` can now view the lineage graph for any table in the `lineage_data` catalog. 
\nFor more information about managing access to securable objects in Unity Catalog, see [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html). For more information about managing access to workspace objects like notebooks, workflows, and dashboards, see [Access control lists](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Capture and view data lineage using Unity Catalog\n##### Delete lineage data\n\nWarning \nThe following instructions delete all objects stored in Unity Catalog. Use these instructions only if necessary. For example, to meet compliance requirements. \nTo delete lineage data, you must delete the metastore managing the Unity Catalog objects. For more information about deleting the metastore, see [Delete a metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html#delete). Data will be deleted within 90 days.\n\n#### Capture and view data lineage using Unity Catalog\n##### Query lineage data using system tables\n\nYou can use the lineage system tables to programmatically query lineage data. For detailed instructions, see [Monitor usage with system tables](https:\/\/docs.databricks.com\/admin\/system-tables\/index.html) and [Lineage system tables reference](https:\/\/docs.databricks.com\/admin\/system-tables\/lineage.html). 
\nIf your workspace is in a region that doesn\u2019t support lineage system tables, you can alternatively use the Data Lineage REST API to retrieve lineage data programmatically.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Capture and view data lineage using Unity Catalog\n##### Retrieve lineage using the Data Lineage REST API\n\nThe data lineage API allows you to retrieve table and column lineage. However, if your workspace is in a region that supports the lineage system tables, you should use system table queries instead of the REST API. System tables are a better option for programmatic retrieval of lineage data. Most regions support the lineage system tables. \nImportant \nTo access Databricks REST APIs, you must [authenticate](https:\/\/docs.databricks.com\/dev-tools\/auth\/index.html). \n### Retrieve table lineage \nThis example retrieves lineage data for the `dinner` table. \n#### Request \n```\ncurl --netrc -X GET \\\n-H 'Content-Type: application\/json' \\\nhttps:\/\/<workspace-instance>\/api\/2.0\/lineage-tracking\/table-lineage \\\n-d '{\"table_name\": \"lineage_data.lineagedemo.dinner\", \"include_entity_lineage\": true}'\n\n``` \nReplace `<workspace-instance>`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file. 
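For scripted use, the same table lineage request can be built in Python. This is a minimal sketch that uses only the standard library. The helper name `build_table_lineage_request` and the bearer-token header are assumptions for illustration (the curl example authenticates with a .netrc file instead), and the workspace instance is still a value you supply.

```python
import json
import urllib.request


def build_table_lineage_request(workspace_instance, token, table_name):
    # Hypothetical helper: mirrors the curl call (a GET request with a JSON body).
    url = f'https://{workspace_instance}/api/2.0/lineage-tracking/table-lineage'
    body = json.dumps({'table_name': table_name, 'include_entity_lineage': True})
    return urllib.request.Request(
        url,
        data=body.encode('utf-8'),
        headers={
            'Content-Type': 'application/json',
            # Assumption: token-based authentication in place of the .netrc file.
            'Authorization': f'Bearer {token}',
        },
        method='GET',
    )


# Sending the request needs a real workspace and valid credentials:
# with urllib.request.urlopen(request) as response:
#     lineage = json.load(response)
```

The helper only builds the request object, so it can be inspected or logged before anything is sent over the network.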
\n#### Response \n```\n{\n\"upstreams\": [\n{\n\"tableInfo\": {\n\"name\": \"menu\",\n\"catalog_name\": \"lineage_data\",\n\"schema_name\": \"lineagedemo\",\n\"table_type\": \"TABLE\"\n},\n\"notebookInfos\": [\n{\n\"workspace_id\": 4169371664718798,\n\"notebook_id\": 1111169262439324\n}\n]\n}\n],\n\"downstreams\": [\n{\n\"notebookInfos\": [\n{\n\"workspace_id\": 4169371664718798,\n\"notebook_id\": 1111169262439324\n}\n]\n},\n{\n\"tableInfo\": {\n\"name\": \"dinner_price\",\n\"catalog_name\": \"lineage_data\",\n\"schema_name\": \"lineagedemo\",\n\"table_type\": \"TABLE\"\n},\n\"notebookInfos\": [\n{\n\"workspace_id\": 4169371664718798,\n\"notebook_id\": 1111169262439324\n}\n]\n}\n]\n}\n\n``` \n### Retrieve column lineage \nThis example retrieves column lineage for the `full_menu` column of the `dinner` table. \n#### Request \n```\ncurl --netrc -X GET \\\n-H 'Content-Type: application\/json' \\\nhttps:\/\/<workspace-instance>\/api\/2.0\/lineage-tracking\/column-lineage \\\n-d '{\"table_name\": \"lineage_data.lineagedemo.dinner\", \"column_name\": \"full_menu\"}'\n\n``` \nReplace `<workspace-instance>`. \nThis example uses a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file. 
\n#### Response \n```\n{\n\"upstream_cols\": [\n{\n\"name\": \"dessert\",\n\"catalog_name\": \"lineage_data\",\n\"schema_name\": \"lineagedemo\",\n\"table_name\": \"menu\",\n\"table_type\": \"TABLE\"\n},\n{\n\"name\": \"main\",\n\"catalog_name\": \"lineage_data\",\n\"schema_name\": \"lineagedemo\",\n\"table_name\": \"menu\",\n\"table_type\": \"TABLE\"\n},\n{\n\"name\": \"app\",\n\"catalog_name\": \"lineage_data\",\n\"schema_name\": \"lineagedemo\",\n\"table_name\": \"menu\",\n\"table_type\": \"TABLE\"\n}\n],\n\"downstream_cols\": [\n{\n\"name\": \"full_menu\",\n\"catalog_name\": \"lineage_data\",\n\"schema_name\": \"lineagedemo\",\n\"table_name\": \"dinner_price\",\n\"table_type\": \"TABLE\"\n}\n]\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n### Compare Auto Loader file detection modes\n##### What is Auto Loader directory listing mode?\n\nAuto Loader uses directory listing mode by default. In directory listing mode, Auto Loader identifies new files by listing the input directory. Directory listing mode allows you to quickly start Auto Loader streams without any permission configurations other than access to your data on cloud storage. \nFor best performance with directory listing mode, use Databricks Runtime 9.1 or above. This article describes the default functionality of directory listing mode as well as optimizations based on [lexical ordering of files](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/directory-listing-mode.html#lexical-examples).\n\n##### What is Auto Loader directory listing mode?\n###### How does directory listing mode work?\n\nDatabricks has optimized directory listing mode for Auto Loader to discover files in cloud storage more efficiently than other Apache Spark options. 
\nFor example, if you have files being uploaded every 5 minutes as `\/some\/path\/YYYY\/MM\/DD\/HH\/fileName`, to find all the files in these directories, the Apache Spark file source lists all subdirectories in parallel. The following algorithm estimates the total number of API `LIST` directory calls to object storage: \n> 1 (base directory) + 365 (per day) \\* 24 (per hour) = 8761 calls \nBy receiving a flattened response from storage, Auto Loader reduces the number of API calls to the number of files in storage divided by the number of results returned by each API call, greatly reducing your cloud costs. The following table shows the number of files returned by each API call for common object storage: \n| Results returned per call | Object storage |\n| --- | --- |\n| 1000 | S3 |\n| 5000 | ADLS Gen2 |\n| 1024 | GCS |\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/directory-listing-mode.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n### Compare Auto Loader file detection modes\n##### What is Auto Loader directory listing mode?\n###### Incremental Listing (deprecated)\n\nImportant \nThis feature has been deprecated. Databricks recommends using [file notification mode](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html) instead of incremental listing. \nNote \nAvailable in [Databricks Runtime 9.1 LTS](https:\/\/docs.databricks.com\/release-notes\/runtime\/9.1lts.html) and above. \nIncremental listing is available for Azure Data Lake Storage Gen2 (`abfss:\/\/`), S3 (`s3:\/\/`) and GCS (`gs:\/\/`). \nFor lexicographically generated files, Auto Loader leverages the lexical file ordering and optimized listing APIs to improve the efficiency of directory listing by listing from recently ingested files rather than listing the contents of the entire directory. 
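The idea behind listing from recently ingested files can be illustrated with a small sketch. Because new files sort after existing ones, a listing only needs the keys that sort after the last ingested key. The function name `list_new_files` and the in-memory list standing in for object storage are assumptions for illustration; this is not Auto Loader's actual implementation.

```python
from bisect import bisect_right


def list_new_files(sorted_keys, last_ingested):
    # Return only the keys that sort strictly after the last ingested key.
    # Assumes sorted_keys is already in lexicographic order.
    if last_ingested is None:
        return list(sorted_keys)
    return sorted_keys[bisect_right(sorted_keys, last_ingested):]


keys = [
    'table/LOAD00000001.csv',
    'table/LOAD00000002.csv',
    'table/LOAD00000003.csv',
]
new_files = list_new_files(keys, 'table/LOAD00000001.csv')
print(new_files)
```

This shortcut is only safe when later uploads always sort after earlier ones. With unpadded names such as `hour=3` and `hour=10`, the string `hour=10` sorts first, so a file written later could fall before the last ingested key and be missed.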
\nBy default, Auto Loader automatically detects whether a given directory is applicable for incremental listing by checking and comparing file paths of previously completed directory listings. To ensure eventual completeness of data in `auto` mode, Auto Loader automatically triggers a full directory list after completing 7 consecutive incremental lists. You can control the frequency of full directory lists by setting `cloudFiles.backfillInterval` to trigger asynchronous backfills at a given interval.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/directory-listing-mode.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n### Compare Auto Loader file detection modes\n##### What is Auto Loader directory listing mode?\n###### Lexical ordering of files\n\nFor files to be lexically ordered, new files that are uploaded need to have a prefix that is lexicographically greater than existing files. Some examples of lexical ordered directories are shown below. \n### Versioned files \nDelta Lake makes commits to table transaction logs in a lexical order. \n```\n<path-to-table>\/_delta_log\/00000000000000000000.json\n<path-to-table>\/_delta_log\/00000000000000000001.json <- guaranteed to be written after version 0\n<path-to-table>\/_delta_log\/00000000000000000002.json <- guaranteed to be written after version 1\n...\n\n``` \n[AWS DMS](https:\/\/docs.aws.amazon.com\/dms\/latest\/userguide\/CHAP_Target.S3.html) uploads CDC files to AWS S3 in a versioned manner. \n```\ndatabase_schema_name\/table_name\/LOAD00000001.csv\ndatabase_schema_name\/table_name\/LOAD00000002.csv\n...\n\n``` \n### Date partitioned files \nFiles can be uploaded in a date partitioned format. 
Some examples of this are: \n```\n\/\/ <base-path>\/yyyy\/MM\/dd\/HH:mm:ss-randomString\n<base-path>\/2021\/12\/01\/10:11:23-b1662ecd-e05e-4bb7-a125-ad81f6e859b4.json\n<base-path>\/2021\/12\/01\/10:11:23-b9794cf3-3f60-4b8d-ae11-8ea320fad9d1.json\n...\n\n\/\/ <base-path>\/year=yyyy\/month=MM\/day=dd\/hour=HH\/minute=mm\/randomString\n<base-path>\/year=2021\/month=12\/day=04\/hour=08\/minute=22\/442463e5-f6fe-458a-8f69-a06aa970fc69.csv\n<base-path>\/year=2021\/month=12\/day=04\/hour=08\/minute=22\/8f00988b-46be-4112-808d-6a35aead0d44.csv <- this may be uploaded before the file above as long as processing happens less frequently than a minute\n\n``` \nWhen files are uploaded with date partitioning, some things to keep in mind are: \n* Months, days, hours, minutes need to be left padded with zeros to ensure lexical ordering (should be uploaded as `hour=03`, instead of `hour=3` or `2021\/05\/03` instead of `2021\/5\/3`).\n* Files don\u2019t necessarily have to be uploaded in lexical order in the deepest directory as long as processing happens less frequently than the parent directory\u2019s time granularity. \nSome services that can upload files in a date partitioned lexical ordering are: \n* [Azure Data Factory](https:\/\/learn.microsoft.com\/azure\/data-factory\/connector-azure-data-lake-storage?tabs=data-factory#sink-properties) can be configured to upload files in a lexical order. 
See an example [here](https:\/\/social.technet.microsoft.com\/wiki\/contents\/articles\/53406.azure-data-factory-dynamically-add-timestamp-in-copied-filename.aspx).\n* [Kinesis Firehose](https:\/\/docs.aws.amazon.com\/firehose\/latest\/dev\/basic-deliver.html#s3-object-name)\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/directory-listing-mode.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n### Compare Auto Loader file detection modes\n##### What is Auto Loader directory listing mode?\n###### Change source path for Auto Loader\n\nIn Databricks Runtime 11.3 LTS and above, you can change the directory input path for Auto Loader configured with directory listing mode without having to choose a new checkpoint directory. \nWarning \nThis functionality is not supported for file notification mode. If file notification mode is used and the path is changed, you might fail to ingest files that are already present in the new directory at the time of the directory update. 
\nFor example, if you wish to run a daily ingestion job that loads all data from a directory structure organized by day, such as `\/YYYYMMDD\/`, you can use the same checkpoint to track ingestion state information across a different source directory each day while maintaining state information for files ingested from all previously used source directories.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/directory-listing-mode.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n#### Track ML and deep learning training runs\n###### Track ML Model training data with Delta Lake\n\nThe following notebook demonstrates how to use MLflow with [Delta Lake](https:\/\/delta.io)\nto track and reproduce the training data used for ML model training, as well as identify\nML models and runs derived from a particular dataset.\n\n###### Track ML Model training data with Delta Lake\n####### MLflow training data in Delta Lake notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-delta-training.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/tracking-ex-delta.html"} +{"content":"# AI and Machine Learning on Databricks\n### What is a feature store?\n\nThis page explains what a feature store is and what benefits it provides, and the specific advantages of Databricks Feature Store. \nA feature store is a centralized repository that enables data scientists to find and share features and also ensures that the same code used to compute the feature values is used for model training and inference. \nMachine learning uses existing data to build a model to predict future outcomes. In almost all cases, the raw data requires preprocessing and transformation before it can be used to build a model. 
This process is called feature engineering, and the outputs of this process are called features - the building blocks of the model. \nDeveloping features is complex and time-consuming. An additional complication is that for machine learning, feature calculations need to be done for model training, and then again when the model is used to make predictions. These implementations may not be done by the same team or using the same code environment, which can lead to delays and errors. Also, different teams in an organization will often have similar feature needs but may not be aware of work that other teams have done. A feature store is designed to address these problems.\n\n### What is a feature store?\n#### Why use Databricks Feature Store?\n\nDatabricks Feature Store is fully integrated with other components of Databricks. \n* Discoverability. The Feature Store UI, accessible from the Databricks workspace, lets you browse and search for existing features.\n* Lineage. When you create a feature table in Databricks, the data sources used to create the feature table are saved and accessible. For each feature in a feature table, you can also access the models, notebooks, jobs, and endpoints that use the feature.\n* Integration with model scoring and serving. When you use features from Feature Store to train a model, the model is packaged with feature metadata. When you use the model for batch scoring or online inference, it automatically retrieves features from Feature Store. The caller does not need to know about them or include logic to look up or join features to score new data. This makes model deployment and updates much easier.\n* Point-in-time lookups. 
Feature Store supports time series and event-based use cases that require point-in-time correctness.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n### What is a feature store?\n#### Feature Engineering in Unity Catalog\n\nWith Databricks Runtime 13.3 LTS and above, if your workspace is enabled for Unity Catalog, Unity Catalog becomes your feature store. You can use any Delta table or Delta Live Table in Unity Catalog with a primary key as a feature table for model training or inference. Unity Catalog provides feature discovery, governance, lineage, and cross-workspace access.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n### What is a feature store?\n#### How does Databricks Feature Store work?\n\nThe typical machine learning workflow using Feature Store follows this path: \n1. Write code to convert raw data into features and create a Spark DataFrame containing the desired features.\n2. For workspaces that are enabled for Unity Catalog, [write the DataFrame as a feature table in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html#create-feature-table). If your workspace is not enabled for Unity Catalog, [write the DataFrame as a feature table in the Workspace Feature Store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html#create-feature-table).\n3. Train a model using features from the feature store. When you do this, the model stores the specifications of features used for training. When the model is used for inference, it automatically joins features from the appropriate feature tables.\n4. Register model in [Model Registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html). 
\nYou can now use the model to make predictions on new data. \nFor batch use cases, the model automatically retrieves the features it needs from Feature Store. \n![Feature Store workflow for batch machine learning use cases.](https:\/\/docs.databricks.com\/_images\/feature-store-flow-gcp.png) \nFor real-time serving use cases, publish the features to an [online table](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html). Third-party online stores are also supported. See [Third-party online stores](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-feature-stores.html). \nAt inference time, the model reads pre-computed features from the online store and joins them with the data provided in the client request to the model serving endpoint. \n![Feature Store flow for machine learning models that are served.](https:\/\/docs.databricks.com\/_images\/feature-store-flow-with-online-store.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n### What is a feature store?\n#### Start using Feature Store\n\nSee the following articles to get started with Feature Store: \n* Try one of the [example notebooks](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/example-notebooks.html) that illustrate feature store capabilities.\n* See the reference material for the [Feature Store Python API](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html).\n* Learn about [training models with Feature Store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/train-models-with-feature-store.html).\n* Learn about [Feature Engineering in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html).\n* Learn about [the Workspace Feature 
Store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html).\n* Use [time series feature tables and point-in-time lookups](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html) to retrieve the latest feature values as of a particular time for training or scoring a model.\n* Learn about [publishing features to online stores](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-feature-stores.html) or [online tables](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html) for real-time serving and automatic feature lookup.\n* Learn about [Feature Serving](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html), which makes features in the Databricks platform available with low latency to models or applications deployed outside of Databricks. \nWhen you use Feature Engineering in Unity Catalog, Unity Catalog takes care of sharing feature tables across workspaces, and you use [Unity Catalog privileges](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html) to control access to the feature tables. 
The following links are for the Workspace Feature Store only: \n* [Share feature tables across workspaces](https:\/\/docs.databricks.com\/archive\/machine-learning\/feature-store\/multiple-workspaces.html).\n* [Control access to feature tables](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/access-control.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n### What is a feature store?\n#### Supported data types\n\nFeature Engineering in Unity Catalog and Workspace Feature Store support the following [PySpark data types](https:\/\/spark.apache.org\/docs\/latest\/sql-ref-datatypes.html): \n* `IntegerType`\n* `FloatType`\n* `BooleanType`\n* `StringType`\n* `DoubleType`\n* `LongType`\n* `TimestampType`\n* `DateType`\n* `ShortType`\n* `ArrayType`\n* `BinaryType` [1]\n* `DecimalType` [1]\n* `MapType` [1] \n[1] `BinaryType`, `DecimalType`, and `MapType` are supported in all versions of Feature Engineering in Unity Catalog and in Workspace Feature Store v0.3.5 or above. \nThe data types listed above support feature types that are common in machine learning applications. For example: \n* You can store dense vectors, tensors, and embeddings as `ArrayType`.\n* You can store sparse vectors, tensors, and embeddings as `MapType`.\n* You can store text as `StringType`. \nWhen published to online stores, `ArrayType` and `MapType` features are stored in JSON format. 
\nThe Feature Store UI displays metadata on feature data types: \n![Complex data types example](https:\/\/docs.databricks.com\/_images\/complex-data-type-example.png)\n\n### What is a feature store?\n#### More information\n\nFor more information on best practices for using Feature Store, download [The Comprehensive Guide to Feature Stores](https:\/\/www.databricks.com\/p\/ebook\/the-comprehensive-guide-to-feature-stores).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/compute.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Compute settings for the Databricks ODBC Driver\n\nThis article describes how to configure Databricks compute resource settings for the [Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/index.html). \nThe driver requires the following compute resource configuration settings: \n| Setting | Description |\n| --- | --- |\n| `Driver` | The driver\u2019s full installation path. To get this path, see [Download and install the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html). |\n| `Host` | The Databricks compute resource\u2019s **Server Hostname** value. |\n| `Port` | 443 |\n| `HTTPPath` | The Databricks compute resource\u2019s **HTTP Path** value. |\n| `SSL` | 1 |\n| `ThriftTransport` | 2 |\n| `Schema` (optional) | The name of the default schema to use. |\n| `Catalog` (optional) | The name of the default catalog to use. 
| \nA DSN that uses the preceding settings has the following format: \n```\n[Databricks]\nDriver=<path-to-driver>\nHost=<server-hostname>\nPort=443\nHTTPPath=<http-path>\nSSL=1\nThriftTransport=2\n<setting1>=<value1>\n<setting2>=<value2>\n<settingN>=<valueN>\n\n``` \nA DSN-less connection string that uses the preceding settings has the following format. Line breaks have been added for readability. The string must not contain these line breaks: \n```\nDriver=<path-to-driver>;\nHost=<server-hostname>;\nPort=443;\nHTTPPath=<http-path>;\nSSL=1;\nThriftTransport=2;\n<setting1>=<value1>;\n<setting2>=<value2>;\n<settingN>=<valueN>\n\n``` \n* Replace `<setting>` and `<value>` as needed for each of the target Databricks [authentication settings](https:\/\/docs.databricks.com\/integrations\/odbc\/authentication.html) and any special or advanced [driver capability settings](https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html).\n* To get the values for `<server-hostname>` and `<http-path>`, see the following procedures. \nTo get the connection details for a Databricks [cluster](https:\/\/docs.databricks.com\/compute\/configure.html), do the following: \n1. Log in to your Databricks workspace.\n2. In the sidebar, click **Compute**.\n3. In the list of available clusters, click the target cluster\u2019s name.\n4. On the **Configuration** tab, expand **Advanced options**.\n5. Click the **JDBC\/ODBC** tab.\n6. Copy the connection details that you need, such as **Server Hostname**, **Port**, and **HTTP Path**. \nTo get the connection details for a Databricks SQL [warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html), do the following: \n1. Log in to your Databricks workspace.\n2. In the sidebar, click **SQL > SQL Warehouses**.\n3. In the list of available warehouses, click the target warehouse\u2019s name.\n4. On the **Connection Details** tab, copy the connection details that you need, such as **Server hostname**, **Port**, and **HTTP path**. 
\nTo use the driver with a Databricks [cluster](https:\/\/docs.databricks.com\/compute\/configure.html), there are two [permissions](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions) that the calling user or service principal needs when connecting to or restarting the cluster: \n* CAN ATTACH TO permission to connect to the running cluster.\n* CAN RESTART permission to automatically trigger the cluster to start if its state is terminated when connecting. \nTo use the driver with a Databricks SQL [warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html), the calling user or service principal needs CAN USE [permission](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#sql-warehouses). The Databricks SQL warehouse automatically starts if it was stopped. \nNote \nDatabricks SQL warehouses are recommended when using Microsoft Power BI in **DirectQuery** mode.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/compute.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/legacy.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Legacy Databricks JDBC Driver\n\nNote \nThe following information applies to legacy Databricks JDBC Driver 2.6.22 and below. \nFor information about Databricks JDBC Driver 2.6.25 and above, see [Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/index.html). 
\nTo use a JDBC connection URL to authenticate using a Databricks personal access token, set the following properties collection, replacing `<personal-access-token>` with your Databricks personal access token: \n```\njdbc:spark:\/\/<server-hostname>:443;httpPath=<http-path>;transportMode=http;SSL=1;UID=token;PWD=<personal-access-token>\n\n``` \nTo use Java code to authenticate using a Databricks personal access token, set the following properties collection, replacing `<personal-access-token>` with your Databricks personal access token: \n```\n\/\/ ...\nString url = \"jdbc:spark:\/\/<server-hostname>:443;httpPath=<http-path>;transportMode=http;SSL=1\";\nProperties p = new java.util.Properties();\np.put(\"UID\", \"token\");\np.put(\"PWD\", \"<personal-access-token>\");\n\/\/ ...\nDriverManager.getConnection(url, p);\n\/\/ ...\n\n``` \n* The legacy Databricks JDBC Driver requires setting the **transportMode** and **SSL** properties. Databricks recommends that you set these values to `http` and `1`, respectively.\n* For a complete Java code example that you can adapt as needed, see the beginning of [Authentication settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html).\n* To get the values for `<server-hostname>` and `<http-path>`, see the following procedures. \nTo get the connection details for a Databricks [cluster](https:\/\/docs.databricks.com\/compute\/configure.html), do the following: \n1. Log in to your Databricks workspace.\n2. In the sidebar, click **Compute**.\n3. In the list of available clusters, click the target cluster\u2019s name.\n4. On the **Configuration** tab, expand **Advanced options**.\n5. Click the **JDBC\/ODBC** tab.\n6. Copy the connection details that you need, such as **Server Hostname**, **Port**, and **HTTP Path**. \nTo get the connection details for a Databricks SQL [warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html), do the following: \n1. 
Log in to your Databricks workspace.\n2. In the sidebar, click **SQL > SQL Warehouses**.\n3. In the list of available warehouses, click the target warehouse\u2019s name.\n4. On the **Connection Details** tab, copy the connection details that you need, such as **Server hostname**, **Port**, and **HTTP path**. \nTo use the driver with a Databricks [cluster](https:\/\/docs.databricks.com\/compute\/configure.html), there are two [permissions](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions) that the calling user or service principal needs when connecting to or restarting the cluster: \n* CAN ATTACH TO permission to connect to the running cluster.\n* CAN RESTART permission to automatically trigger the cluster to start if its state is terminated when connecting. \nTo use the driver with a Databricks SQL [warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html), the calling user or service principal needs CAN USE [permission](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#sql-warehouses). The Databricks SQL warehouse automatically starts if it was stopped. \nNote \nDatabricks SQL warehouses are recommended when using Microsoft Power BI in **DirectQuery** mode.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/legacy.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Tutorial: Analyze data with glm\n\nLearn how to perform linear and logistic regression using a generalized linear model (GLM) in Databricks. `glm` fits a Generalized Linear Model, similar to R\u2019s `glm()`. \n**Syntax**: `glm(formula, data, family...)` \n**Parameters**: \n* `formula`: Symbolic description of the model to be fitted, for example: `ResponseVariable ~ Predictor1 + Predictor2`. 
Supported operators: `~`, `+`, `-`, and `.`\n* `data`: Any SparkDataFrame\n* `family`: String, `\"gaussian\"` for linear regression or `\"binomial\"` for logistic regression\n* `lambda`: Numeric, Regularization parameter\n* `alpha`: Numeric, Elastic-net mixing parameter \n**Output**: MLlib PipelineModel \nThis tutorial shows how to perform linear and logistic regression on the diamonds dataset.\n\n#### Tutorial: Analyze data with glm\n##### Load diamonds data and split into training and test sets\n\n```\nrequire(SparkR)\n\n# Read diamonds.csv dataset as SparkDataFrame\ndiamonds <- read.df(\"\/databricks-datasets\/Rdatasets\/data-001\/csv\/ggplot2\/diamonds.csv\",\nsource = \"com.databricks.spark.csv\", header=\"true\", inferSchema = \"true\")\ndiamonds <- withColumnRenamed(diamonds, \"\", \"rowID\")\n\n# Split data into Training set and Test set\ntrainingData <- sample(diamonds, FALSE, 0.7)\ntestData <- except(diamonds, trainingData)\n\n# Exclude rowIDs\ntrainingData <- trainingData[, -1]\ntestData <- testData[, -1]\n\nprint(count(diamonds))\nprint(count(trainingData))\nprint(count(testData))\n\n``` \n```\nhead(trainingData)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/glm-tutorial.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Tutorial: Analyze data with glm\n##### Train a linear regression model using `glm()`\n\nThis section shows how to predict a diamond\u2019s price from its features by training a linear regression model using the training data. \nThere is a mix of categorical features (cut - Ideal, Premium, Very Good\u2026) and continuous features (depth, carat). SparkR automatically encodes these features so you don\u2019t have to encode these features manually. 
\n```\n# Family = \"gaussian\" to train a linear regression model\nlrModel <- glm(price ~ ., data = trainingData, family = \"gaussian\")\n\n# Print a summary of the trained model\nsummary(lrModel)\n\n``` \nUse `predict()` on the test data to see how well the model works on new data. \n**Syntax:** `predict(model, newData)` \n**Parameters:** \n* `model`: MLlib model\n* `newData`: SparkDataFrame, typically your test set \n**Output:** `SparkDataFrame` \n```\n# Generate predictions using the trained model\npredictions <- predict(lrModel, newData = testData)\n\n# View predictions against price column\ndisplay(select(predictions, \"price\", \"prediction\"))\n\n``` \nEvaluate the model. \n```\nerrors <- select(predictions, predictions$price, predictions$prediction, alias(predictions$price - predictions$prediction, \"error\"))\ndisplay(errors)\n\n# Calculate RMSE\nhead(select(errors, alias(sqrt(sum(errors$error^2 , na.rm = TRUE) \/ nrow(errors)), \"RMSE\")))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/glm-tutorial.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Tutorial: Analyze data with glm\n##### Train a logistic regression model using `glm()`\n\nThis section shows how to create a logistic regression on the same dataset to predict a diamond\u2019s cut based on some of its features. \nLogistic regression in MLlib supports binary classification. To test the algorithm in this example, subset the data to work with two labels. 
\n```\n# Subset data to include rows where diamond cut = \"Premium\" or diamond cut = \"Very Good\"\ntrainingDataSub <- subset(trainingData, trainingData$cut %in% c(\"Premium\", \"Very Good\"))\ntestDataSub <- subset(testData, testData$cut %in% c(\"Premium\", \"Very Good\"))\n\n``` \n```\n# Family = \"binomial\" to train a logistic regression model\nlogrModel <- glm(cut ~ price + color + clarity + depth, data = trainingDataSub, family = \"binomial\")\n\n# Print summary of the trained model\nsummary(logrModel)\n\n``` \n```\n# Generate predictions using the trained model\npredictionsLogR <- predict(logrModel, newData = testDataSub)\n\n# View predictions against label column\ndisplay(select(predictionsLogR, \"label\", \"prediction\"))\n\n``` \nEvaluate the model. \n```\nerrorsLogR <- select(predictionsLogR, predictionsLogR$label, predictionsLogR$prediction, alias(abs(predictionsLogR$label - predictionsLogR$prediction), \"error\"))\ndisplay(errorsLogR)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/glm-tutorial.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Qlik Sense\n\nQlik Sense delivers best-in-class cloud analytics that help people of all skill levels to make data-driven decisions and take action. \nThis article describes how to use Qlik Sense with a Databricks cluster or a Databricks SQL warehouse (formerly Databricks SQL endpoint) to analyze data in Delta Lake. \nNote \nFor information about Qlik Replicate, a solution that helps you pull data from multiple data sources (Oracle, Microsoft SQL Server, SAP, mainframe, and more) into Delta Lake, see [Connect to Qlik Replicate](https:\/\/docs.databricks.com\/partners\/ingestion\/qlik.html).\n\n#### Connect to Qlik Sense\n##### Connect to Qlik Sense using Partner Connect\n\nNote \nPartner Connect only supports SQL warehouses for Qlik Sense. To connect a cluster to Qlik Sense, connect to Qlik Sense manually. 
\nTo connect to Qlik Sense using Partner Connect, do the following: \n1. [Connect to BI partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/bi.html).\n2. On the **Qlik Sense Databricks Connect** page, click the **Sign up here** link.\n3. Follow the on-screen instructions to create a Qlik account and start your free Qlik Sense trial, then return to the **Qlik Sense Databricks Connect** page.\n4. Enter your Qlik Sense tenant URL.\n5. Click the **Click here** link. \nA new tab opens in your browser that displays the Qlik Sense API Access Key Help page.\n6. Follow the instructions to generate an API key for your Qlik Sense tenant, then return to the **Qlik Sense Databricks Connect** page.\n7. Enter your Qlik Sense Tenant API key, then click **Submit**. \nThe Qlik Management Console displays.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/qlik-sense.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Qlik Sense\n##### Connect to Qlik Sense manually\n\nThis section describes how to connect to Qlik Sense manually. \n### Requirements \nBefore you connect to Qlik Sense manually, you must have the following: \n* A cluster or SQL warehouse in your Databricks workspace. \n+ [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n+ [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* The connection details for your cluster or SQL warehouse, specifically the **Server Hostname**, **Port**, and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). To create a personal access token, do the following: \n1. 
In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**.\nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. See your workspace administrator or the following: \n+ [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n+ [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). 
\nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \n### Steps to connect \nTo connect to Qlik Sense manually, do the following: \n1. Sign in to the Qlik Sense app or website for your organization.\n2. Do one of the following: \n* If you have an existing app that you want to use, click the app\u2019s tile on the home page to open it.\n* If you do not have an existing app, click **Add new > New analytics app**, and follow the on-screen directions to finish creating the app and to open it.\n3. With the app open, click **Prepare > Data manager**.\n4. Click **Add data > Files and other sources**.\n5. For **Connect to a new data source**, click **Databricks**.\n6. In the **Create new connection (Databricks)** dialog, enter the following information: \n1. For **Host name**, enter the **Server Hostname** value.\n2. For **Port**, enter the **Port** value.\n3. For **Database name**, enter the name of the database that you want to use.\n4. For **HTTP Path**, enter the **HTTP Path** value.\n5. In **Credentials**, for **User name**, enter the word `token`.\n6. For **Password**, enter the token.\n7. For **SSL Options**, select the boxes **Enable SSL**, **Allow Self-signed Server Certificate**, **Allow Common Name Host Name Mismatch**, and **Use System Trust Store**.\n8. For **Name**, enter a name for this connection, or leave the default name.\n9. You can leave the rest of the settings in this dialog with their default settings.\n7. Click **Test connection**.\n8. After the connection succeeds, click **Create**.\n9. 
Follow the on-screen directions to add tables to your connection and to filter the tables\u2019 data.\n10. Click **Next**.\n11. Follow the on-screen directions to analyze your data with [sheets](https:\/\/help.qlik.com\/en-us\/cloud-services\/Subsystems\/Hub\/Content\/Sense_Hub\/Sheets\/create-sheets-for-structure.htm), [visualizations](https:\/\/help.qlik.com\/en-us\/cloud-services\/Subsystems\/Hub\/Content\/Sense_Hub\/Visualizations\/visualizations.htm), and other [data analytics and visualization resources](https:\/\/help.qlik.com\/en-US\/cloud-services\/Subsystems\/Hub\/Content\/Sense_Hub\/Introduction\/creating-analytics-and-visualizing-data.htm).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/qlik-sense.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Qlik Sense\n##### Additional resources\n\nTo continue using Qlik Sense, see the following resources: \n* [Qlik Sense](https:\/\/www.qlik.com\/products\/qlik-sense)\n* [Qlik Sense demos](https:\/\/demos.qlik.com\/qliksense)\n* [Qlik help videos](https:\/\/www.youtube.com\/channel\/UCFxZPr8pHfZS0n3jxx74rpA)\n* [Qlik Sense for developers help](https:\/\/qlik.dev\/)\n* [Qlik support services and resources](https:\/\/www.qlik.com\/services\/support)\n* [Contact Qlik](https:\/\/www.qlik.com\/contact)\n* [Support](https:\/\/support.qlik.com\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/qlik-sense.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Optimize performance with caching on Databricks\n\nDatabricks uses disk caching to accelerate data reads by creating copies of remote Parquet data files in nodes\u2019 local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved reading speed. 
The cache works for all Parquet data files (including Delta Lake tables). \nNote \nIn SQL warehouses and Databricks Runtime 14.2 and above, the `CACHE SELECT` command is ignored. An enhanced disk caching algorithm is used instead.\n\n#### Optimize performance with caching on Databricks\n##### Delta cache renamed to disk cache\n\nDisk caching on Databricks was formerly referred to as the Delta cache and the DBIO cache. Disk caching behavior is a proprietary Databricks feature. This name change seeks to resolve confusion that it was part of the Delta Lake protocol.\n\n#### Optimize performance with caching on Databricks\n##### Disk cache vs. Spark cache\n\nThe Databricks disk cache differs from Apache Spark caching. Databricks recommends using automatic disk caching. \nThe following table summarizes the key differences between disk and Apache Spark caching so that you can choose the best tool for your workflow: \n| Feature | disk cache | Apache Spark cache |\n| --- | --- | --- |\n| Stored as | Local files on a worker node. | In-memory blocks, but it depends on storage level. |\n| Applied to | Any Parquet table stored on S3, ABFS, and other file systems. | Any DataFrame or RDD. |\n| Triggered | Automatically, on the first read (if cache is enabled). | Manually, requires code changes. |\n| Evaluated | Lazily. | Lazily. |\n| Availability | Can be enabled or disabled with configuration flags, enabled by default on certain node types. | Always available. |\n| Evicted | Automatically in LRU fashion or on any file change, manually when restarting a cluster. | Automatically in LRU fashion, manually with `unpersist`. 
|\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/disk-cache.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Optimize performance with caching on Databricks\n##### Disk cache consistency\n\nThe disk cache automatically detects when data files are created, deleted, modified, or overwritten and updates its content accordingly. You can write, modify, and delete table data with no need to explicitly invalidate cached data. Any stale entries are automatically invalidated and evicted from the cache.\n\n#### Optimize performance with caching on Databricks\n##### Selecting instance types to use disk caching\n\nThe recommended (and easiest) way to use disk caching is to choose a worker type with SSD volumes when you configure your cluster. Such workers are enabled and configured for disk caching. \nThe disk cache is configured to use at most half of the space available on the local SSDs provided with the worker nodes. For configuration options, see [Configure the disk cache](https:\/\/docs.databricks.com\/optimizations\/disk-cache.html#configure-cache).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/disk-cache.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Optimize performance with caching on Databricks\n##### Configure the disk cache\n\nDatabricks recommends that you choose cache-accelerated worker instance types for your compute. Such instances are automatically configured optimally for the disk cache. \nNote \nWhen a worker is decommissioned, the Spark cache stored on that worker is lost. So if autoscaling is enabled, there is some instability with the cache. Spark would then need to reread missing partitions from source as needed. 
\n### Configure disk usage \nTo configure how the disk cache uses the worker nodes\u2019 local storage, specify the following [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration) settings during cluster creation: \n* `spark.databricks.io.cache.maxDiskUsage`: disk space per node reserved for cached data in bytes\n* `spark.databricks.io.cache.maxMetaDataCache`: disk space per node reserved for cached metadata in bytes\n* `spark.databricks.io.cache.compression.enabled`: should the cached data be stored in compressed format \nExample configuration: \n```\nspark.databricks.io.cache.maxDiskUsage 50g\nspark.databricks.io.cache.maxMetaDataCache 1g\nspark.databricks.io.cache.compression.enabled false\n\n``` \n### Enable or disable the disk cache \nTo enable and disable the disk cache, run: \n```\nspark.conf.set(\"spark.databricks.io.cache.enabled\", \"[true | false]\")\n\n``` \nDisabling the cache does not result in dropping the data that is already in the local storage. Instead, it prevents queries from adding new data to the cache and reading data from the cache.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/disk-cache.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query optimization using primary key constraints\n\nPrimary key constraints, which capture relationships between fields in tables, can help users and tools understand relationships in your data. This article contains examples that show how you can use primary keys with the `RELY` option to optimize some common types of queries.\n\n#### Query optimization using primary key constraints\n##### Add primary key constraints\n\nYou can add a primary key constraint in your table creation statement, as in the following example, or add a constraint to a table using the `ADD CONSTRAINT` clause. 
\n```\nCREATE TABLE customer (\nc_customer_sk int,\nPRIMARY KEY (c_customer_sk)\n...\n)\n\n``` \nIn this example, `c_customer_sk` is the customer ID key. The primary key constraint specifies that each customer ID value should be unique in the table. Databricks does not enforce key constraints. They can be validated through your existing data pipeline or ETL. See [Manage data quality with Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html) to learn about working with expectations on streaming tables and materialized views. See [Constraints on Databricks](https:\/\/docs.databricks.com\/tables\/constraints.html) to learn about working with constraints on Delta tables.\n\n#### Query optimization using primary key constraints\n##### Use `RELY` to enable optimizations\n\nWhen you know that a primary key constraint is valid, you can enable optimizations based on the constraint by specifying it with the `RELY` option. See [ADD CONSTRAINT clause](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table-add-constraint.html) for the complete syntax. \nThe `RELY` option allows Databricks to exploit the constraint to rewrite queries. The following optimizations can only be performed if the `RELY` option is specified in an `ADD CONSTRAINT` clause or `ALTER TABLE` statement. \nUsing `ALTER TABLE`, you can modify a table\u2019s primary key to include the `RELY` option, as shown in the following example. 
\n```\nALTER TABLE\ncustomer DROP PRIMARY KEY;\nALTER TABLE\ncustomer\nADD\nPRIMARY KEY (c_customer_sk) RELY;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-optimization-constraints.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query optimization using primary key constraints\n##### Optimization examples\n\nThe following examples extend the previous example that creates a `customer` table where `c_customer_sk` is a verified unique identifier named as a `PRIMARY KEY` with the `RELY` option specified. \n### Example 1: Eliminate unnecessary aggregations \nThe following shows a query that applies a `DISTINCT` operation to a primary key. \n```\nSELECT\nDISTINCT c_customer_sk\nFROM\ncustomer;\n\n``` \nBecause the `c_customer_sk` column has a verified `PRIMARY KEY` constraint, all values in the column are unique. With the `RELY` option specified, Databricks can optimize the query by not performing the `DISTINCT` operation. \n### Example 2: Eliminate unnecessary joins \nThe following example shows a query where Databricks can eliminate an unnecessary join. \nThe query joins a fact table, `store_sales` with a dimension table, `customer`. It performs a left outer join, so the query result includes all records from the `store_sales` table and matched records from the `customer` table. If there is no matching record in the `customer` table, the query result shows a `NULL` value for the `c_customer_sk` column. \n```\nSELECT\nSUM(ss_quantity)\nFROM\nstore_sales ss\nLEFT JOIN customer c ON ss.customer_sk = c.c_customer_sk;\n\n``` \nTo understand why this join is unnecessary, consider the query statement. It requires only the `ss_quantity` column from the `store_sales` table. The `customer` table is joined on its primary key, so each row of `store_sales` matches at most one row in `customer`. 
Because the operation is an outer join, all records from the `store_sales` table are preserved, so the join does not change any data from that table. The `SUM` aggregation is the same whether or not these tables are joined. \nUsing the primary key constraint with `RELY` gives the query optimizer the information it needs to eliminate the join. The optimized query looks more like this: \n```\nSELECT\nSUM(ss_quantity)\nFROM\nstore_sales ss\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-optimization-constraints.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query optimization using primary key constraints\n##### Next steps\n\nSee [View the Entity Relationship Diagram](https:\/\/docs.databricks.com\/catalog-explorer\/entity-relationship-diagram.html) to learn how to explore primary key and foreign key relationships in the Catalog Explorer UI.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-optimization-constraints.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n#### Security, compliance, and privacy for the data lakehouse\n\nThe architectural principles of the **security, compliance, and privacy** pillar are about protecting a Databricks application, customer workloads, and customer data from threats. As a starting point, the [Databricks Security and Trust Center](https:\/\/www.databricks.com\/trust) provides a good overview of the Databricks approach to security. \n![Security, compliance, and privacy lakehouse architecture diagram for Databricks.](https:\/\/docs.databricks.com\/_images\/security.png)\n\n#### Security, compliance, and privacy for the data lakehouse\n##### Principles of security, compliance, and privacy\n\n1. 
**Manage identity and access using least privilege** \nThe practice of identity and access management (IAM) helps you ensure that the right people can access the right resources. IAM addresses the following aspects of authentication and authorization: account management including provisioning, identity governance, authentication, access control (authorization), and identity federation.\n2. **Protect data in transit and at rest** \nClassify your data into sensitivity levels and use mechanisms such as encryption, tokenization, and access control where appropriate.\n3. **Secure your network and identify and protect endpoints** \nSecure your network and monitor and protect the network integrity of internal and external endpoints through security appliances or cloud services like firewalls.\n4. **Review the Shared Responsibility Model** \nSecurity and compliance are a shared responsibility between Databricks, the Databricks customer, and the cloud provider. It is important to understand which party is responsible for what part.\n5. **Meet compliance and data privacy requirements** \nYou might have internal (or external) requirements that require you to control the data storage locations and processing. These requirements vary based on systems design objectives, industry regulatory concerns, national law, tax implications, and culture. Be mindful that you might need to obfuscate or redact personally identifiable information (PII) to meet your regulatory requirements. Where possible, automate your compliance efforts.\n6. **Monitor system security** \nUse automated tools to monitor your application and infrastructure. 
To scan your infrastructure for vulnerabilities and detect security incidents, use automated scanning in your continuous integration and continuous deployment (CI\/CD) pipelines.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/security-compliance-and-privacy\/index.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n#### Security, compliance, and privacy for the data lakehouse\n##### Next: Best practices for security, compliance, and privacy\n\nSee [Best practices for security, compliance & privacy](https:\/\/docs.databricks.com\/lakehouse-architecture\/security-compliance-and-privacy\/best-practices.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/security-compliance-and-privacy\/index.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Monitor fairness and bias for classification models\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nWith Databricks Lakehouse Monitoring, you can monitor the predictions of a classification model to see if the model performs similarly on data associated with different groups. For example, you can investigate whether a loan-default classifier generates the same false-positive rate for applicants from different demographics.\n\n### Monitor fairness and bias for classification models\n#### Work with fairness and bias metrics\n\nTo monitor for fairness and bias, you create a Boolean slice expression. The group defined by the slice expression evaluating to `True` is considered the protected group (that is, the group you are checking for bias against). 
For example, if you create `slicing_exprs=[\"age < 25\"]`, the slice identified by `slice_key` = \u201cage < 25\u201d and `slice_value` = `True` is considered the protected group, and the slice identified by `slice_key` = \u201cage < 25\u201d and `slice_value` = `False` is considered the unprotected group. \nThe monitor automatically computes metrics that compare the performance of the classification model between groups. The following metrics are reported in the profile metrics table: \n* `predictive_parity`, which compares the model\u2019s precision between groups.\n* `predictive_equality`, which compares false positive rates between groups.\n* `equal_opportunity`, which measures whether a label is predicted equally well for both groups.\n* `statistical_parity`, which measures the difference in predicted outcomes between groups. \nThese metrics are calculated only if the analysis type is `InferenceLog` and `problem_type` is `classification`. \nFor definitions of these metrics, see the following references: \n* Wikipedia article on fairness in machine learning: `https:\/\/en.wikipedia.org\/wiki\/Fairness_(machine_learning)`\n* [Fairness Definitions Explained, Verma and Rubin, 2018](http:\/\/fairware.cs.umass.edu\/papers\/Verma.pdf)\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/fairness-bias.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Monitor fairness and bias for classification models\n#### Fairness and bias metrics outputs\n\nSee the [API reference](https:\/\/api-docs.databricks.com\/python\/lakehouse-monitoring\/latest\/index.html) for details about these metrics and how to view them in the metric tables. All fairness and bias metrics share the same data type as shown below, showing fairness scores computed across all predicted classes in a \u201cone-vs-all\u201d manner as key-value pairs. \nYou can create an alert on these metrics. 
For instance, the owner of the model can set up an alert when the fairness metric exceeds some threshold and then route that alert to an on-call person or team for investigation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/fairness-bias.html"} +{"content":"# AI and Machine Learning on Databricks\n## GraphFrames\n#### Graph analysis tutorial with GraphFrames\n\nThis tutorial notebook shows you how to use GraphFrames to perform graph analysis. Databricks recommends using a cluster running [Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html), as it includes an optimized installation of GraphFrames. \nTo run the notebook: \n1. If you are not using a cluster running Databricks Runtime ML, use one of [these methods](https:\/\/docs.databricks.com\/libraries\/index.html) to install the [GraphFrames library](https:\/\/spark-packages.org\/package\/graphframes\/graphframes).\n2. Download the SF Bay Area Bike Share [data](https:\/\/www.kaggle.com\/datasets\/benhamner\/sf-bay-area-bike-share) from Kaggle and unzip it. You must sign into Kaggle using third-party authentication or create and sign into a Kaggle account.\n3. Upload `station.csv` and `trip.csv` using the [add data UI](https:\/\/docs.databricks.com\/ingestion\/add-data\/index.html). 
\nThe tables are named `station_csv` and `trip_csv`.\n\n#### Graph analysis tutorial with GraphFrames\n##### Graph Analysis with GraphFrames notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/graph-analysis-graphframes.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/graphframes\/graph-analysis-tutorial.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n#### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### MLflow Model Registry Webhooks on Databricks\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nWebhooks enable you to listen for Model Registry events so your integrations can automatically trigger actions. You can use webhooks to automate and integrate your machine learning pipeline with existing CI\/CD tools and workflows. For example, you can trigger CI builds when a new model version is created or notify your team members through Slack each time a model transition to production is requested. \nWebhooks are available through the [Databricks REST API](https:\/\/docs.databricks.com\/api\/workspace\/experiments) or the Python client `databricks-registry-webhooks` on [PyPI](https:\/\/pypi.org\/project\/databricks-registry-webhooks\/). \nNote \nWebhooks are not available when you use [Models in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html). For an alternative, see [Can I use stage transition requests or trigger webhooks on events?](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html#manual-approval). 
Sending webhooks to private endpoints (endpoints that are not accessible from the public internet) is not supported.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/model-registry-webhooks.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n#### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### MLflow Model Registry Webhooks on Databricks\n####### Webhook events\n\nYou can specify a webhook to trigger upon one or more of these events: \n* **MODEL\\_VERSION\\_CREATED**: A new model version was created for the associated model.\n* **MODEL\\_VERSION\\_TRANSITIONED\\_STAGE**: A model version\u2019s stage was changed.\n* **TRANSITION\\_REQUEST\\_CREATED**: A user requested a model version\u2019s stage be transitioned.\n* **COMMENT\\_CREATED**: A user wrote a comment on a registered model.\n* **REGISTERED\\_MODEL\\_CREATED**: A new registered model was created. This event type can only be specified for a registry-wide webhook, which can be created by not specifying a model name in the create request.\n* **MODEL\\_VERSION\\_TAG\\_SET**: A user set a tag on the model version.\n* **MODEL\\_VERSION\\_TRANSITIONED\\_TO\\_STAGING**: A model version was transitioned to staging.\n* **MODEL\\_VERSION\\_TRANSITIONED\\_TO\\_PRODUCTION**: A model version was transitioned to production.\n* **MODEL\\_VERSION\\_TRANSITIONED\\_TO\\_ARCHIVED**: A model version was archived.\n* **TRANSITION\\_REQUEST\\_TO\\_STAGING\\_CREATED**: A user requested a model version be transitioned to staging.\n* **TRANSITION\\_REQUEST\\_TO\\_PRODUCTION\\_CREATED**: A user requested a model version be transitioned to production.\n* **TRANSITION\\_REQUEST\\_TO\\_ARCHIVED\\_CREATED**: A user requested a model version be archived.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/model-registry-webhooks.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle 
management using MLflow\n### Manage model lifecycle in Unity Catalog\n#### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### MLflow Model Registry Webhooks on Databricks\n####### Types of webhooks\n\nThere are two types of webhooks based on their trigger targets: \n* **Webhooks with HTTP endpoints (HTTP registry webhooks)**: Send triggers to an HTTP endpoint.\n* **Webhooks with job triggers (job registry webhooks)**: Trigger a job in a Databricks workspace. If IP allowlisting is enabled in the job\u2019s workspace, you must allowlist the workspace IPs of the model registry. See [IP allowlisting for job registry webhooks](https:\/\/docs.databricks.com\/mlflow\/model-registry-webhooks.html#ip-allowlisting) for more information. \nThere are also two types of webhooks based on their scope, with different access control requirements: \n* **Model-specific webhooks**: The webhook applies to a specific registered model. You must have CAN MANAGE permissions on the registered model to create, modify, delete, or test model-specific webhooks.\n* **Registry-wide webhooks**: The webhook is triggered by events on any registered model in the workspace, including the creation of a new registered model. To create a registry-wide webhook, omit the `model_name` field on creation. You must have workspace admin permissions to create, modify, delete, or test registry-wide webhooks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/model-registry-webhooks.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n#### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### MLflow Model Registry Webhooks on Databricks\n####### Webhook payload\n\nEach event trigger has minimal fields included in the payload for the outgoing request to the webhook endpoint. \n* Sensitive information like artifact path location is excluded. 
Users and principals with appropriate ACLs can use client or REST APIs to query the Model Registry for this information.\n* Payloads are not encrypted. See [Security](https:\/\/docs.databricks.com\/mlflow\/model-registry-webhooks.html#security) for information on how to validate that Databricks is the source of the webhook.\n* The `text` field facilitates Slack integration. To send a Slack message, provide a Slack webhook endpoint as the webhook URL. \n### Job registry webhook payload \nThe payload for a job registry webhook depends on the type of job and is sent to the `jobs\/run-now` endpoint in the target workspace. \n#### Single-task jobs \nSingle-task jobs have one of three payloads based on the task type. \n##### Notebook and Python wheel jobs \nNotebook and Python wheel jobs have a JSON payload with a parameter dictionary that contains a field `event_message`. \n```\n{\n\"job_id\": 1234567890,\n\"notebook_params\": {\n\"event_message\": \"<Webhook Payload>\"\n}\n}\n\n``` \n##### Python, JAR, and Spark Submit jobs \nPython, JAR, and Spark submit jobs have a JSON payload with a parameter list. \n```\n{\n\"job_id\": 1234567890,\n\"python_params\": [\"<Webhook Payload>\"]\n}\n\n``` \n##### All other jobs \nAll other types of jobs have a JSON payload with no parameters. \n```\n{\n\"job_id\": 1234567890\n}\n\n``` \n#### Multi-task jobs \nMulti-task jobs have a JSON payload with all parameters populated to account for different task types. 
\n```\n{\n\"job_id\": 1234567890,\n\"notebook_params\": {\n\"event_message\": \"<Webhook Payload>\"\n},\n\"python_named_params\": {\n\"event_message\": \"<Webhook Payload>\"\n},\n\"jar_params\": [\"<Webhook Payload>\"],\n\"python_params\": [\"<Webhook Payload>\"],\n\"spark_submit_params\": [\"<Webhook Payload>\"]\n}\n\n``` \n### Example payloads \n#### event: `MODEL_VERSION_TRANSITIONED_STAGE` \n**Response** \n```\nPOST\n\/your\/endpoint\/for\/event\/model-versions\/stage-transition\n--data {\n\"event\": \"MODEL_VERSION_TRANSITIONED_STAGE\",\n\"webhook_id\": \"c5596721253c4b429368cf6f4341b88a\",\n\"event_timestamp\": 1589859029343,\n\"model_name\": \"Airline_Delay_SparkML\",\n\"version\": \"8\",\n\"to_stage\": \"Production\",\n\"from_stage\": \"None\",\n\"text\": \"Registered model 'someModel' version 8 transitioned from None to Production.\"\n}\n\n``` \n#### event: `MODEL_VERSION_TAG_SET` \n**Response** \n```\nPOST\n\/your\/endpoint\/for\/event\/model-versions\/tag-set\n--data {\n\"event\": \"MODEL_VERSION_TAG_SET\",\n\"webhook_id\": \"8d7fc634e624474f9bbfde960fdf354c\",\n\"event_timestamp\": 1589859029343,\n\"model_name\": \"Airline_Delay_SparkML\",\n\"version\": \"8\",\n\"tags\": [{\"key\":\"key1\",\"value\":\"value1\"},{\"key\":\"key2\",\"value\":\"value2\"}],\n\"text\": \"example@yourdomain.com set version tag(s) 'key1' => 'value1', 'key2' => 'value2' for registered model 'someModel' version 8.\"\n}\n\n``` \n#### event: `COMMENT_CREATED` \n**Response** \n```\nPOST\n\/your\/endpoint\/for\/event\/comments\/create\n--data {\n\"event\": \"COMMENT_CREATED\",\n\"webhook_id\": \"8d7fc634e624474f9bbfde960fdf354c\",\n\"event_timestamp\": 1589859029343,\n\"model_name\": \"Airline_Delay_SparkML\",\n\"version\": \"8\",\n\"comment\": \"Raw text content of the comment\",\n\"text\": \"A user commented on registered model 'someModel' version 8.\"\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/model-registry-webhooks.html"} +{"content":"# AI and Machine 
Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n#### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### MLflow Model Registry Webhooks on Databricks\n####### Security\n\nFor security, Databricks includes an `X-Databricks-Signature` header in each request. The signature is computed from the payload and the shared secret key associated with the webhook using the [HMAC with SHA-256 algorithm](https:\/\/en.wikipedia.org\/wiki\/HMAC). \nIn addition, you can include a standard Authorization header in the outgoing request by specifying one in the `HttpUrlSpec` of the webhook. \n### Client verification \nIf a shared secret is set, the payload recipient should verify the source of the HTTP request by using the shared secret to HMAC-encode the payload, and then comparing the encoded value with the `X-Databricks-Signature` from the header. This is particularly important if SSL certificate validation is disabled (that is, if the `enable_ssl_verification` field is set to `false`). \nNote \n`enable_ssl_verification` is `true` by default. For self-signed certificates, this field must be `false`, and the destination server must disable certificate validation. \nFor security purposes, Databricks recommends that you perform secret validation with the HMAC-encoded portion of the payload. If you disable host name validation, you increase the risk that a request could be maliciously routed to an unintended host. \n```\nimport hmac\nimport hashlib\nimport json\n\n# shared_secret is the secret string configured on the webhook\nsecret = shared_secret.encode('utf-8')\nsignature_key = 'X-Databricks-Signature'\n\ndef validate_signature(request):\n    if signature_key not in request.headers:\n        raise Exception('No X-Signature. 
Webhook cannot be trusted.')\n\n    x_sig = request.headers.get(signature_key)\n    body = request.body.encode('utf-8')\n    h = hmac.new(secret, body, hashlib.sha256)\n    computed_sig = h.hexdigest()\n\n    # Compare str with str; mixing str and bytes raises TypeError in Python 3\n    if not hmac.compare_digest(computed_sig, x_sig):\n        raise Exception('X-Signature mismatch. Webhook cannot be trusted.')\n\n``` \n### Authorization header for HTTP registry webhooks \nIf an Authorization header is set, clients should verify the source of the HTTP request by verifying the bearer token or authorization credentials in the Authorization header. \n### IP allowlisting for job registry webhooks \nTo use a webhook that triggers job runs in a different workspace that has IP allowlisting enabled, you must [allowlist](https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-workspace.html) the region NAT IP where the webhook is located to accept incoming requests. \nIf the webhook and the job are in the same workspace, you do not need to add any IPs to your allowlist. \nContact your accounts team to identify the IPs you need to allowlist.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/model-registry-webhooks.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n#### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### MLflow Model Registry Webhooks on Databricks\n####### Audit logging\n\nIf audit logging is enabled for your workspace, the following events are included in the audit logs: \n* Create webhook\n* Update webhook\n* List webhook\n* Delete webhook\n* Test webhook\n* Webhook trigger \n### Webhook trigger audit logging \nFor webhooks with HTTP endpoints, the HTTP request sent to the URL specified for the webhook along with the URL and `enable_ssl_verification` values are logged. 
\nFor webhooks with job triggers, the `job_id` and `workspace_url` values are logged.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/model-registry-webhooks.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n#### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### MLflow Model Registry Webhooks on Databricks\n####### Examples\n\nThis section includes: \n* [HTTP registry webhook workflow example](https:\/\/docs.databricks.com\/mlflow\/model-registry-webhooks.html#http-registry-webhook-example-workflow).\n* [job registry webhook workflow example](https:\/\/docs.databricks.com\/mlflow\/model-registry-webhooks.html#job-registry-webhook-example-workflow).\n* [list webhooks example](https:\/\/docs.databricks.com\/mlflow\/model-registry-webhooks.html#list-registry-webhooks-example).\n* two [example notebooks](https:\/\/docs.databricks.com\/mlflow\/model-registry-webhooks.html#notebooks): one illustrating the REST API, and one illustrating the Python client. \n### HTTP registry webhook example workflow \n#### 1. Create a webhook \nWhen an HTTPS endpoint is ready to receive the webhook event request, you can create a webhook using the webhooks Databricks REST API. For example, the webhook\u2019s URL can point to Slack to post messages to a channel. 
\n```\n$ curl -X POST -H \"Authorization: Bearer <access-token>\" -d \\\n'{\"model_name\": \"<model-name>\",\n\"events\": [\"MODEL_VERSION_CREATED\"],\n\"description\": \"Slack notifications\",\n\"status\": \"TEST_MODE\",\n\"http_url_spec\": {\n\"url\": \"https:\/\/hooks.slack.com\/services\/...\",\n\"secret\": \"anyRandomString\",\n\"authorization\": \"Bearer AbcdEfg1294\"}}' https:\/\/<databricks-instance>\/api\/2.0\/mlflow\/registry-webhooks\/create\n\n``` \n```\nfrom databricks_registry_webhooks import RegistryWebhooksClient, HttpUrlSpec\n\nhttp_url_spec = HttpUrlSpec(\nurl=\"https:\/\/hooks.slack.com\/services\/...\",\nsecret=\"secret_string\",\nauthorization=\"Bearer AbcdEfg1294\"\n)\nhttp_webhook = RegistryWebhooksClient().create_webhook(\nmodel_name=\"<model-name>\",\nevents=[\"MODEL_VERSION_CREATED\"],\nhttp_url_spec=http_url_spec,\ndescription=\"Slack notifications\",\nstatus=\"TEST_MODE\"\n)\n\n``` \n**Response** \n```\n{\"webhook\": {\n\"id\":\"1234567890\",\n\"creation_timestamp\":1571440826026,\n\"last_updated_timestamp\":1582768296651,\n\"status\":\"TEST_MODE\",\n\"events\":[\"MODEL_VERSION_CREATED\"],\n\"http_url_spec\": {\n\"url\": \"https:\/\/hooks.slack.com\/services\/...\",\n\"enable_ssl_verification\": true\n}}}\n\n``` \nYou can also create an HTTP registry webhook with the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_mlflow\\_webhook](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/mlflow_webhook). \n#### 2. Test the webhook \nThe previous webhook was created in `TEST_MODE`, so a mock event can be triggered to send a request to the specified URL. However, the webhook does not trigger on a real event. The test endpoint returns the received status code and body from the specified URL. 
\n```\n$ curl -X POST -H \"Authorization: Bearer <access-token>\" -d \\\n'{\"id\": \"1234567890\"}' \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/mlflow\/registry-webhooks\/test\n\n``` \n```\nfrom databricks_registry_webhooks import RegistryWebhooksClient\n\nhttp_webhook = RegistryWebhooksClient().test_webhook(\nid=\"1234567890\"\n)\n\n``` \n**Response** \n```\n{\n\"status\":200,\n\"body\":\"OK\"\n}\n\n``` \n#### 3. Update the webhook to active status \nTo enable the webhook for real events, set its status to `ACTIVE` through an update call, which can also be used to change any of its other properties. \n```\n$ curl -X PATCH -H \"Authorization: Bearer <access-token>\" -d \\\n'{\"id\": \"1234567890\", \"status\": \"ACTIVE\"}' \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/mlflow\/registry-webhooks\/update\n\n``` \n```\nfrom databricks_registry_webhooks import RegistryWebhooksClient\n\nhttp_webhook = RegistryWebhooksClient().update_webhook(\nid=\"1234567890\",\nstatus=\"ACTIVE\"\n)\n\n``` \n**Response** \n```\n{\"webhook\": {\n\"id\":\"1234567890\",\n\"creation_timestamp\":1571440826026,\n\"last_updated_timestamp\":1582768296651,\n\"status\": \"ACTIVE\",\n\"events\":[\"MODEL_VERSION_CREATED\"],\n\"http_url_spec\": {\n\"url\": \"https:\/\/hooks.slack.com\/services\/...\",\n\"enable_ssl_verification\": true\n}}}\n\n``` \n#### 4. Delete the webhook \nTo disable the webhook, set its status to `DISABLED` (using a similar update command as above), or delete it. 
\n```\n$ curl -X DELETE -H \"Authorization: Bearer <access-token>\" -d \\\n'{\"id\": \"1234567890\"}' \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/mlflow\/registry-webhooks\/delete\n\n``` \n```\nfrom databricks_registry_webhooks import RegistryWebhooksClient\n\nhttp_webhook = RegistryWebhooksClient().delete_webhook(\nid=\"1234567890\"\n)\n\n``` \n**Response** \n```\n{}\n\n``` \n### Job registry webhook example workflow \nThe workflow for managing job registry webhooks is similar to HTTP registry webhooks, with the only difference being the `job_spec` field that replaces the `http_url_spec` field. \nWith webhooks, you can trigger jobs in the same workspace or in a different workspace. The workspace is specified using the optional parameter `workspace_url`. If no `workspace_url` is present, the default behavior is to trigger a job in the same workspace as the webhook. \n#### Requirements \n* An existing [job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html).\n* A [personal access token](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement). Note that access tokens are not included in the webhook object returned by the APIs. \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). 
\n#### Create a job registry webhook \n```\n$ curl -X POST -H \"Authorization: Bearer <access-token>\" -d \\\n'{\"model_name\": \"<model-name>\",\n\"events\": [\"TRANSITION_REQUEST_CREATED\"],\n\"description\": \"Job webhook trigger\",\n\"status\": \"TEST_MODE\",\n\"job_spec\": {\n\"job_id\": \"1\",\n\"workspace_url\": \"https:\/\/my-databricks-workspace.com\",\n\"access_token\": \"dapi12345...\"}}' \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/mlflow\/registry-webhooks\/create\n\n``` \n```\nfrom databricks_registry_webhooks import RegistryWebhooksClient, JobSpec\n\njob_spec = JobSpec(\njob_id=\"1\",\nworkspace_url=\"https:\/\/my-databricks-workspace.com\",\naccess_token=\"dapi12345...\"\n)\njob_webhook = RegistryWebhooksClient().create_webhook(\nmodel_name=\"<model-name>\",\nevents=[\"TRANSITION_REQUEST_CREATED\"],\njob_spec=job_spec,\ndescription=\"Job webhook trigger\",\nstatus=\"TEST_MODE\"\n)\n\n``` \n**Response** \n```\n{\"webhook\": {\n\"id\":\"1234567891\",\n\"creation_timestamp\":1591440826026,\n\"last_updated_timestamp\":1591440826026,\n\"status\":\"TEST_MODE\",\n\"events\":[\"TRANSITION_REQUEST_CREATED\"],\n\"job_spec\": {\n\"job_id\": \"1\",\n\"workspace_url\": \"https:\/\/my-databricks-workspace.com\"\n}}}\n\n``` \nYou can also create a job registry webhook with the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_mlflow\\_webhook](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/mlflow_webhook). 
\n### List registry webhooks example \n```\n$ curl -X GET -H \"Authorization: Bearer <access-token>\" -d \\\n'{\"model_name\": \"<model-name>\"}' \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/mlflow\/registry-webhooks\/list\n\n``` \n```\nfrom databricks_registry_webhooks import RegistryWebhooksClient\n\nwebhooks_list = RegistryWebhooksClient().list_webhooks(model_name=\"<model-name>\")\n\n``` \n**Response** \n```\n{\"webhooks\": [{\n\"id\":\"1234567890\",\n\"creation_timestamp\":1571440826026,\n\"last_updated_timestamp\":1582768296651,\n\"status\": \"ACTIVE\",\n\"events\":[\"MODEL_VERSION_CREATED\"],\n\"http_url_spec\": {\n\"url\": \"https:\/\/hooks.slack.com\/services\/...\",\n\"enable_ssl_verification\": true\n}},\n{\n\"id\":\"1234567891\",\n\"creation_timestamp\":1591440826026,\n\"last_updated_timestamp\":1591440826026,\n\"status\":\"TEST_MODE\",\n\"events\":[\"TRANSITION_REQUEST_CREATED\"],\n\"job_spec\": {\n\"job_id\": \"1\",\n\"workspace_url\": \"https:\/\/my-databricks-workspace.com\"\n}}]}\n\n``` \n### Notebooks \n#### MLflow Model Registry webhooks REST API example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-model-registry-webhooks-rest-api-example.html) \n#### MLflow Model Registry webhooks Python client example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-model-registry-webhooks-python-client-example.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/model-registry-webhooks.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Using Auto Loader with Unity Catalog\n\nAuto Loader can securely ingest data from external locations configured with Unity Catalog. 
To learn more about securely connecting storage with Unity Catalog, see [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html). Auto Loader relies on Structured Streaming for incremental processing; for recommendations and limitations see [Using Unity Catalog with Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/unity-catalog.html). \nNote \nIn Databricks Runtime 11.3 LTS and above, you can use Auto Loader with either shared or single user access modes. \nDirectory listing mode is supported by default. File notification mode is only supported on single user clusters.\n\n#### Using Auto Loader with Unity Catalog\n##### Ingesting data from external locations managed by Unity Catalog with Auto Loader\n\nYou can use Auto Loader to ingest data from any external location managed by Unity Catalog. You must have `READ FILES` permissions on the external location. \nNote \nUnity Catalog external locations do not support cross-cloud or cross-account configurations for Auto Loader.\n\n#### Using Auto Loader with Unity Catalog\n##### Specifying locations for Auto Loader resources for Unity Catalog\n\nThe Unity Catalog security model assumes that all storage locations referenced in a workload will be managed by Unity Catalog. Databricks recommends always storing checkpoint and schema evolution information in storage locations managed by Unity Catalog. 
Unity Catalog does not allow you to nest checkpoint or schema inference and evolution files under the table directory.\n\n#### Using Auto Loader with Unity Catalog\n##### Examples\n\nThe following examples assume the executing user has owner privileges on the target tables and the following configurations and grants: \n| Storage location | Grant |\n| --- | --- |\n| s3:\/\/autoloader-source\/json-data | READ FILES |\n| s3:\/\/dev-bucket | READ FILES, WRITE FILES, CREATE TABLE |\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/unity-catalog.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Using Auto Loader with Unity Catalog\n##### Using Auto Loader to load to a Unity Catalog managed table\n\n```\ncheckpoint_path = \"s3:\/\/dev-bucket\/_checkpoint\/dev_table\"\n\n(spark.readStream\n.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"json\")\n.option(\"cloudFiles.schemaLocation\", checkpoint_path)\n.load(\"s3:\/\/autoloader-source\/json-data\")\n.writeStream\n.option(\"checkpointLocation\", checkpoint_path)\n.trigger(availableNow=True)\n.toTable(\"dev_catalog.dev_database.dev_table\"))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/unity-catalog.html"} +{"content":"# \n### Product philosophy\n\nRAG Studio product philosophy is underpinned by the following principles.\n\n### Product philosophy\n#### Measuring quality\n\n* **Quality Through Metrics**: Objective metrics are the cornerstone of quality assessment. Metrics provide indicators for evaluating the RAG application\u2019s quality and cost\/latency performance and thereby identifying areas for improvement.\n* **Comprehensive \u201calways-on\u201d Logging**: Metrics work best if they can be computed for any invocation of the RAG app. Therefore, every invocation of the app, both in development and production, must be logged. 
The log must capture all inputs and outputs, as well as the detailed steps that transform inputs into outputs.\n* **Human Feedback as the Benchmark**: Collecting human feedback is costly, but its value as a quality measure is unmatched. RAG Studio is designed to make the collection of human feedback as efficient as possible.\n* **LLM Judges Scale Feedback**: Using RAG LLM judges in tandem with human feedback accelerates the development loop, allowing for quicker development cycles without proportionally scaling the number of human evaluators. However, RAG LLM judges are not a substitute for human feedback but a supplement to it.\n\n### Product philosophy\n#### Development lifecycle\n\n* **Rapid Iteration**: The cycle of creating and testing new versions of a RAG Application must be quick.\n* **Effortless Version Management**: Tracking and management of versions must be seamless, reducing cognitive load and letting developers concentrate on enhancing the application rather than on administrative tasks.\n* **Development and Production Are Unified**: The tools, schemas, and processes used in development should be consistent with those in production environments, ensuring a consistent workflow for quality improvement from development to deployment *with the same code base*.\n\n### Product philosophy\n#### Opportunities for quality\n\nRAG Studio is built upon the belief that quality opportunities exist across the entire RAG Application - the models, data processing pipelines, and chains. RAG Studio recognizes the interconnected nature of these components: while individual components can and should be optimized in isolation, the impact of these changes must be evaluated within the context of the entire RAG Application. 
\n![RAG application architecture all up](https:\/\/docs.databricks.com\/_images\/rag_quality_opps.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/approach-overview.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n### Query databases using JDBC\n##### Query SQL Server with Databricks\n\nThis article shows how you can connect Databricks to Microsoft SQL server to read and write data. \nNote \nYou may prefer Lakehouse Federation for managing queries on SQL Server data. See [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/sql-server.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n### Query databases using JDBC\n##### Query SQL Server with Databricks\n###### Configure a connection to SQL server\n\nIn Databricks Runtime 11.3 LTS and above, you can use the `sqlserver` keyword to use the included driver for connecting to SQL server. 
When working with DataFrames, use the following syntax: \n```\nremote_table = (spark.read\n.format(\"sqlserver\")\n.option(\"host\", \"hostName\")\n.option(\"port\", \"port\") # optional, can use default port 1433 if omitted\n.option(\"user\", \"username\")\n.option(\"password\", \"password\")\n.option(\"database\", \"databaseName\")\n.option(\"dbtable\", \"schemaName.tableName\") # (if schemaName not provided, default to \"dbo\")\n.load()\n)\n\n``` \n```\nval remote_table = spark.read\n.format(\"sqlserver\")\n.option(\"host\", \"hostName\")\n.option(\"port\", \"port\") \/\/ optional, can use default port 1433 if omitted\n.option(\"user\", \"username\")\n.option(\"password\", \"password\")\n.option(\"database\", \"databaseName\")\n.option(\"dbtable\", \"schemaName.tableName\") \/\/ (if schemaName not provided, default to \"dbo\")\n.load()\n\n``` \nWhen working with SQL, specify `sqlserver` in the `USING` clause and pass options while creating a table, as shown in the following example: \n```\nDROP TABLE IF EXISTS sqlserver_table;\nCREATE TABLE sqlserver_table\nUSING sqlserver\nOPTIONS (\ndbtable '<schema-name.table-name>',\nhost '<host-name>',\nport '1433',\ndatabase '<database-name>',\nuser '<username>',\npassword '<password>'\n);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/sql-server.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n### Query databases using JDBC\n##### Query SQL Server with Databricks\n###### Use the legacy JDBC driver\n\nIn Databricks Runtime 10.4 LTS and below, you must specify the driver and configurations using the JDBC settings. The following example queries SQL Server using its JDBC driver. For more details on reading, writing, configuring parallelism, and query pushdown, see [Query databases using JDBC](https:\/\/docs.databricks.com\/connect\/external-systems\/jdbc.html). 
\n```\ndriver = \"com.microsoft.sqlserver.jdbc.SQLServerDriver\"\n\ndatabase_host = \"<database-host-url>\"\ndatabase_port = \"1433\" # update if you use a non-default port\ndatabase_name = \"<database-name>\"\ntable = \"<table-name>\"\nuser = \"<username>\"\npassword = \"<password>\"\n\nurl = f\"jdbc:sqlserver:\/\/{database_host}:{database_port};database={database_name}\"\n\nremote_table = (spark.read\n.format(\"jdbc\")\n.option(\"driver\", driver)\n.option(\"url\", url)\n.option(\"dbtable\", table)\n.option(\"user\", user)\n.option(\"password\", password)\n.load()\n)\n\n``` \n```\nval driver = \"com.microsoft.sqlserver.jdbc.SQLServerDriver\"\n\nval database_host = \"<database-host-url>\"\nval database_port = \"1433\" \/\/ update if you use a non-default port\nval database_name = \"<database-name>\"\nval table = \"<table-name>\"\nval user = \"<username>\"\nval password = \"<password>\"\n\n\/\/ The s-interpolator requires a $ before each interpolated expression\nval url = s\"jdbc:sqlserver:\/\/${database_host}:${database_port};database=${database_name}\"\n\nval remote_table = spark.read\n.format(\"jdbc\")\n.option(\"driver\", driver)\n.option(\"url\", url)\n.option(\"dbtable\", table)\n.option(\"user\", user)\n.option(\"password\", password)\n.load()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/sql-server.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Guiding principles for the lakehouse\n\nGuiding principles are level-zero rules that define and influence your architecture. 
To build a data lakehouse that helps your business succeed now and in the future, consensus among stakeholders in your organization is critical.\n\n### Guiding principles for the lakehouse\n#### Curate data and offer trusted data-as-products\n\nCurating data is essential to creating a high-value data lake for BI and ML\/AI. Treat data like a product with a clear definition, schema, and lifecycle. 
Ensure semantic consistency and that the data quality improves from layer to layer so that business users can fully trust the data. \n![Curate data and offer trusted data-as-products](https:\/\/docs.databricks.com\/_images\/gp-data-as-product.png) \nCurating data by establishing a layered (or multi-hop) architecture is a critical best practice for the lakehouse, as it allows data teams to structure the data according to quality levels and define roles and responsibilities per layer. A common layering approach is: \n* **Ingest layer**: Source data gets ingested into the lakehouse into the first layer and should be persisted there. When all downstream data is created from the Ingest Layer, rebuilding the subsequent layers from this layer is possible, if needed.\n* **Curated layer**: The purpose of the second layer is to hold cleansed, refined, filtered and aggregated data. The goal of this layer is to provide a sound, reliable foundation for analyses and reports across all roles and functions.\n* **Final layer**: The third layer is created around business or project needs; it provides a different view as data products to other business units or projects, preparing data around security needs (for example, anonymized data), or optimizing for performance (with pre-aggregated views). The data products in this layer are seen as the truth for the business. \nPipelines across all layers need to ensure that data quality constraints are met, meaning that data is accurate, complete, accessible, and consistent at all times, even during concurrent reads and writes. The validation of new data happens at the time of data entry into the curated layer, and the following ETL steps work to improve the quality of this data. 
Data quality must improve as data progresses through the layers and, as such, the trust in the data subsequently increases from a business point of view.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/guiding-principles.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Guiding principles for the lakehouse\n#### Eliminate data silos and minimize data movement\n\nDon\u2019t create copies of a dataset with business processes relying on these different copies. Copies may become data silos that get out of sync, leading to lower quality of your data lake, and finally to outdated or incorrect insights. Also, for sharing data with external partners, use an enterprise sharing mechanism that allows direct access to the data in a secure way. \n![Eliminate data silos and minimize data movement](https:\/\/docs.databricks.com\/_images\/gp-no-data-silos.png) \nTo make the distinction clear between a data copy versus a data silo: A standalone or throwaway copy of data is not harmful on its own. It is sometimes necessary for boosting agility, experimentation, and innovation. However, if these copies become operational with downstream business data products dependent on them, they become data silos. \nTo prevent data silos, data teams usually attempt to build a mechanism or data pipeline to keep all copies in sync with the original. Since this is unlikely to happen consistently, data quality eventually degrades. This also can lead to higher costs and a significant loss of trust by the users. On the other hand, several business use cases require data sharing with partners or suppliers. \nAn important aspect is to securely and reliably share the latest version of the dataset. Copies of the dataset are often not sufficient, because they can get out of sync quickly. 
Instead, data should be shared via enterprise data-sharing tools.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/guiding-principles.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Guiding principles for the lakehouse\n#### Democratize value creation through self-service\n\nThe best data lake cannot provide sufficient value, if users cannot access the platform or data for their BI and ML\/AI tasks easily. Lower the barriers to accessing data and platforms for all business units. Consider lean data management processes and provide self-service access for the platform and the underlying data. \n![Democratize value creation through self-service](https:\/\/docs.databricks.com\/_images\/gp-data-self-service.png) \nBusinesses that have successfully moved to a data-driven culture will thrive. This means every business unit derives its decisions from analytical models or from analyzing its own or centrally provided data. For consumers, data has to be easily discoverable and securely accessible. \nA good concept for data producers is \u201cdata as a product\u201d: The data is offered and maintained by one business unit or business partner like a product and consumed by other parties with proper permission control. Instead of relying on a central team and potentially slow request processes, these data products must be created, offered, discovered, and consumed in a self-service experience. \nHowever, it\u2019s not just the data that matters. The democratization of data requires the right tools to enable everyone to produce or consume and understand the data. 
For this, you need the data lakehouse to be a modern data and AI platform that provides the infrastructure and tooling for building data products without duplicating the effort of setting up another tool stack.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/guiding-principles.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Guiding principles for the lakehouse\n#### Adopt an organization-wide data governance strategy\n\nData is a critical asset of any organization, but you cannot give everyone access to all data. Data access must be actively managed. Access control, auditing, and lineage-tracking are key for the correct and secure use of data. \n![Adopt an organizationwide data governance strategy](https:\/\/docs.databricks.com\/_images\/gp-organization-data-governance.png) \nData governance is a broad topic. The lakehouse covers the following dimensions: \n* **Data quality** \nThe most important prerequisite for correct and meaningful reports, analysis results, and models is high-quality data. Quality assurance (QA) needs to exist around all pipeline steps. Examples of how to implement this include having data contracts, meeting SLAs, keeping schemas stable, and evolving them in a controlled way.\n* **Data catalog** \nAnother important aspect is data discovery: Users of all business areas, especially in a self-service model, must be able to discover relevant data easily. Therefore, a lakehouse needs a data catalog that covers all business-relevant data. The primary goals of a data catalog are as follows: \n+ Ensure the same business concept is uniformly called and declared across the business. 
You might think of it as a semantic model in the curated and the final layer.\n+ Track the data lineage precisely so that users can explain how these data arrived at their current shape and form.\n+ Maintain high-quality metadata, which is as important as the data itself for proper use of the data.\n* **Access control** \nAs the value creation from the data in the lakehouse happens across all business areas, the lakehouse must be built with security as a first-class citizen. Companies might have a more open data access policy or strictly follow the principle of least privilege. Independent of that, data access controls must be in place in every layer. It is important to implement fine-grained permission schemes from the very beginning (column- and row-level access control, role-based or attribute-based access control). Companies can start with less strict rules. But as the lakehouse platform grows, all mechanisms and processes for a more sophisticated security regime should already be in place. Additionally, all access to the data in the lakehouse must be recorded in audit logs from the get-go.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/guiding-principles.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Guiding principles for the lakehouse\n#### Encourage open interfaces and open formats\n\nOpen interfaces and data formats are crucial for interoperability between the lakehouse and other tools. It simplifies integration with existing systems and also opens up an ecosystem of partners who have integrated their tools with the platform. \n![Encourage open interfaces and open formats](https:\/\/docs.databricks.com\/_images\/gp-open-interfaces-formats.png) \nOpen interfaces are critical to enabling interoperability and preventing dependency on any single vendor. Traditionally, vendors built proprietary technologies and closed interfaces that limited enterprises in the way they can store, process and share data. 
\nBuilding upon open interfaces helps you build for the future: \n* It increases the longevity and portability of the data so that you can use it with more applications and for more use cases.\n* It opens an ecosystem of partners who can quickly leverage the open interfaces to integrate their tools into the lakehouse platform. \nFinally, by standardizing on open formats for data, total costs will be significantly lower; one can access the data directly on the cloud storage without the need to pipe it through a proprietary platform that can incur high egress and computation costs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/guiding-principles.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Guiding principles for the lakehouse\n#### Build to scale and optimize for performance and cost\n\nData inevitably continues to grow and become more complex. To equip your organization for future needs, your lakehouse should be able to scale. For example, you should be able to add new resources easily on demand. Costs should be limited to the actual consumption. \n![Build to scale and optimize for performance and cost](https:\/\/docs.databricks.com\/_images\/gp-scale-optimize-performance-cost.png) \nStandard ETL processes, business reports, and dashboards often have a predictable resource need from a memory and computation perspective. However, new projects, seasonal tasks, or modern approaches like model training (churn, forecast, maintenance) generate peaks of resource need. To enable a business to perform all these workloads, a scalable platform for memory and computation is necessary. New resources must be added easily on demand, and only the actual consumption should generate costs. As soon as the peak is over, resources can be freed up again and costs reduced accordingly. Often, this is referred to as horizontal scaling (fewer or more nodes) and vertical scaling (larger or smaller nodes). 
\nScaling also enables businesses to improve the performance of queries by selecting nodes with more resources or clusters with more nodes. However, instead of permanently provisioning large machines and clusters, businesses can provision them on demand, only for the time needed, to optimize the overall performance-to-cost ratio. Another aspect of optimization is storage versus compute resources. Since there is no clear relation between the volume of the data and workloads using this data (for example, only using parts of the data or doing intensive calculations on small data), it is a good practice to settle on an infrastructure platform that decouples storage and compute resources.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/guiding-principles.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Optimize the cluster utilization of Delta Live Tables pipelines with Enhanced Autoscaling\n\nDatabricks Enhanced Autoscaling optimizes cluster utilization by automatically allocating cluster resources based on workload volume, with minimal impact to the data processing latency of your pipelines. \nEnhanced Autoscaling improves on the Databricks [cluster autoscaling functionality](https:\/\/docs.databricks.com\/compute\/configure.html#autoscaling) with the following features: \n* Enhanced Autoscaling implements optimization of streaming workloads, and adds enhancements to improve the performance of batch workloads. Enhanced Autoscaling optimizes costs by adding or removing machines as the workload changes.\n* Enhanced Autoscaling proactively shuts down under-utilized nodes while guaranteeing there are no failed tasks during shutdown. The existing cluster autoscaling feature scales down nodes only if the node is idle. \nEnhanced Autoscaling is the default autoscaling mode when you create a new pipeline in the Delta Live Tables UI. 
You can enable Enhanced Autoscaling for existing pipelines by editing the pipeline settings in the UI. You can also enable Enhanced Autoscaling when you create or edit pipelines with the Delta Live Tables [API](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/auto-scaling.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Optimize the cluster utilization of Delta Live Tables pipelines with Enhanced Autoscaling\n###### Enable Enhanced Autoscaling\n\nTo use Enhanced Autoscaling, do one of the following: \n* Set **Cluster mode** to **Enhanced autoscaling** when you create a pipeline or edit a pipeline in the Delta Live Tables UI.\n* Add the `autoscale` setting to the pipeline cluster configuration and set the `mode` field to `ENHANCED`. See [Configure your compute settings](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#cluster-config). \nUse the following guidelines when configuring Enhanced Autoscaling for production pipelines: \n* Leave the `Min workers` setting at the default.\n* Set the `Max workers` setting to a value based on budget and pipeline priority. \nThe following example configures an Enhanced Autoscaling cluster with a minimum of 5 workers and a maximum of 10 workers. `max_workers` must be greater than or equal to `min_workers`. \nNote \n* Enhanced Autoscaling is available for `updates` clusters only. The existing autoscaling feature is used for `maintenance` clusters.\n* The `autoscale` configuration has two modes: \n+ `LEGACY`: Use [cluster autoscaling](https:\/\/docs.databricks.com\/compute\/configure.html#autoscaling).\n+ `ENHANCED`: Use Enhanced Autoscaling. 
\n```\n{\n\"clusters\": [\n{\n\"autoscale\": {\n\"min_workers\": 5,\n\"max_workers\": 10,\n\"mode\": \"ENHANCED\"\n}\n}\n]\n}\n\n``` \nIf the pipeline is configured for continuous execution, it is automatically restarted after the autoscaling configuration changes. After the restart, expect a short period of increased latency. Following this brief period, the cluster size should be updated based on your `autoscale` configuration, and pipeline latency should return to its previous characteristics.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/auto-scaling.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Optimize the cluster utilization of Delta Live Tables pipelines with Enhanced Autoscaling\n###### Monitoring Enhanced Autoscaling enabled pipelines\n\nYou can use the event log in the Delta Live Tables user interface to monitor Enhanced Autoscaling metrics. Enhanced Autoscaling events have the `autoscale` event type. 
The following are example events: \n| Event | Message |\n| --- | --- |\n| Cluster resize request started | `Scaling [up or down] to <y> executors from current cluster size of <x>` |\n| Cluster resize request succeeded | `Achieved cluster size <x> for cluster <cluster-id> with status SUCCEEDED` |\n| Cluster resize request partially succeeded | `Achieved cluster size <x> for cluster <cluster-id> with status PARTIALLY_SUCCEEDED` |\n| Cluster resize request failed | `Achieved cluster size <x> for cluster <cluster-id> with status FAILED` | \nYou can also view Enhanced Autoscaling events by directly querying the [event log](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html#event-log): \n* To query the event log for backlog metrics, see [Monitor data backlog by querying the event log](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html#backlog-metrics).\n* To monitor cluster resizing requests and responses during Enhanced Autoscaling operations, see [Monitor Enhanced Autoscaling events from the event log](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html#autoscaling).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/auto-scaling.html"} +{"content":"# AI and Machine Learning on Databricks\n### Deep learning\n\nThis article gives a brief introduction to using PyTorch, Tensorflow, and distributed training for developing and fine-tuning deep learning models on Databricks. It also includes links to pages with example notebooks illustrating how to use those tools. 
\n* For general guidelines on optimizing deep learning workflows on Databricks, see [Best practices for deep learning on Databricks](https:\/\/docs.databricks.com\/machine-learning\/train-model\/dl-best-practices.html).\n* For information about working with large language models and generative AI on Databricks, see: \n+ [Large language models (LLMs) on Databricks](https:\/\/docs.databricks.com\/large-language-models\/index.html).\n+ [Generative AI and large language models (LLMs) on Databricks](https:\/\/docs.databricks.com\/generative-ai\/generative-ai.html).\n\n### Deep learning\n#### PyTorch\n\nPyTorch is included in Databricks Runtime ML and provides GPU accelerated tensor computation and high-level functionalities for building deep learning networks. You can perform single node training or distributed training with PyTorch on Databricks. See [PyTorch](https:\/\/docs.databricks.com\/machine-learning\/train-model\/pytorch.html).\n\n### Deep learning\n#### TensorFlow\n\nDatabricks Runtime ML includes TensorFlow and TensorBoard, so you can use these libraries without installing any packages. TensorFlow supports deep-learning and general numerical computations on CPUs, GPUs, and clusters of GPUs. TensorBoard provides visualization tools to help you debug and optimize machine learning and deep learning workflows. See [TensorFlow](https:\/\/docs.databricks.com\/machine-learning\/train-model\/tensorflow.html) for single node and distributed training examples.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/deep-learning.html"} +{"content":"# AI and Machine Learning on Databricks\n### Deep learning\n#### Distributed training\n\nBecause deep learning models are data and computation-intensive, distributed training can be important. 
For examples of distributed deep learning using integrations with Horovod, `spark-tensorflow-distributor`, TorchDistributor, and DeepSpeed, see [Distributed training](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/index.html).\n\n### Deep learning\n#### Track deep learning model development\n\nTracking remains a cornerstone of the MLflow ecosystem and is especially vital for the iterative nature of deep learning. Databricks uses MLflow to track deep learning training runs and model development. See [Track model development using MLflow](https:\/\/docs.databricks.com\/machine-learning\/track-model-development\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/deep-learning.html"} +{"content":"# \n### Bokeh\n\n[Bokeh](https:\/\/docs.bokeh.org\/en\/latest\/) is a Python interactive visualization library. \nTo use Bokeh, install the Bokeh PyPI package through the [Libraries](https:\/\/docs.databricks.com\/libraries\/index.html) UI, and attach it to your cluster. \nTo display a Bokeh plot in Databricks: \n1. Generate a plot following the instructions in the [Bokeh documentation](https:\/\/docs.bokeh.org\/en\/latest\/docs\/user_guide.html).\n2. Generate an HTML file containing the data for the plot, for example by using Bokeh\u2019s `file_html()` or `output_file()` functions.\n3. Pass this HTML to the Databricks `displayHTML()` function. \nImportant \nThe maximum size for a notebook cell, both contents and output, is 20 MB. Make sure that the size of the HTML you pass to the `displayHTML()` function does not exceed this value.\n\n### Bokeh\n#### Notebook example: Bokeh\n\nThe following notebook shows a Bokeh example. 
\n### bokeh demo notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/bokeh-demo.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/bokeh.html"} +{"content":"# \n### Generative AI and large language models (LLMs) on Databricks\n\nThis article provides an overview of generative AI on Databricks and includes links to example notebooks and demos.\n\n### Generative AI and large language models (LLMs) on Databricks\n#### What is generative AI?\n\nGenerative AI is a type of artificial intelligence focused on the ability of computers to use models to create content like images, text, code, and synthetic data. \nGenerative AI applications are built on top of large language models (LLMs) and foundation models. \n* **LLMs** are deep learning models that consume and train on massive datasets to excel in language processing tasks. They create new combinations of text that mimic natural language based on its training data.\n* **Foundation models** are large ML models pre-trained with the intention that they are to be fine-tuned for more specific language understanding and generation tasks. These models are utilized to discern patterns within the input data. \nAfter these models have completed their learning processes, together they generate statistically probable outputs when prompted and they can be employed to accomplish various tasks, including: \n* Image generation based on existing ones or utilizing the style of one image to modify or create a new one.\n* Speech tasks such as transcription, translation, question\/answer generation, and interpretation of the intent or meaning of text. \nImportant \nWhile many LLMs or other generative AI models have safeguards, they can still generate harmful or inaccurate information. 
\nGenerative AI has the following design patterns: \n* Prompt Engineering: Crafting specialized prompts to guide LLM behavior\n* Retrieval Augmented Generation (RAG): Combining an LLM with external knowledge retrieval\n* Fine-tuning: Adapting a pre-trained LLM to specific data sets or domains\n* Pre-training: Training an LLM from scratch\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/generative-ai.html"} +{"content":"# \n### Generative AI and large language models (LLMs) on Databricks\n#### Develop generative AI and LLMs on Databricks\n\nDatabricks unifies the AI lifecycle from data collection and preparation, to model development and LLMOps, to serving and monitoring. The following features are specifically optimized to facilitate the development of generative AI applications: \n* [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) for governance, discovery, versioning, and access control for data, features, models, and functions.\n* [MLflow](https:\/\/docs.databricks.com\/mlflow\/tracking.html) for model development tracking and [LLM evaluation](https:\/\/docs.databricks.com\/mlflow\/llm-evaluate.html).\n* [Feature engineering and serving](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html).\n* [Databricks Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) for deploying LLMs. You can configure a model serving endpoint specifically for accessing foundation models: \n+ State-of-the-art open LLMs using [Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html).\n+ Third-party models hosted outside of Databricks. 
See [External models in Databricks Model Serving](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html).\n* [Databricks Vector Search](https:\/\/docs.databricks.com\/generative-ai\/vector-search.html) provides a queryable vector database that stores embedding vectors and can be configured to automatically sync to your knowledge base.\n* [Lakehouse Monitoring](https:\/\/docs.databricks.com\/lakehouse-monitoring\/index.html) for data monitoring and tracking model prediction quality and drift using [automatic payload logging with inference tables](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html).\n* [AI Playground](https:\/\/docs.databricks.com\/large-language-models\/ai-playground.html) for testing foundation models from your Databricks workspace. You can prompt, compare and adjust settings such as system prompt and inference parameters.\n* [Foundation Model Training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html) for customizing a foundation model using your own data to optimize its performance for your specific application.\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/generative-ai.html"} +{"content":"# \n### Generative AI and large language models (LLMs) on Databricks\n#### Additional resources\n\n* See [Retrieval Augmented Generation (RAG) on Databricks](https:\/\/docs.databricks.com\/generative-ai\/retrieval-augmented-generation.html). 
\n+ See [Build a Q&A chatbot with LLama2 and Databricks](https:\/\/www.databricks.com\/resources\/demos\/tutorials\/data-science-and-ai\/lakehouse-ai-deploy-your-llm-chatbot).\n* For information about using Hugging Face models on Databricks, see [Hugging Face Transformers](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/index.html).\n* The [databricks-ml-examples](https:\/\/github.com\/databricks\/databricks-ml-examples) repository on GitHub contains example implementations of state-of-the-art (SOTA) LLMs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/generative-ai.html"} +{"content":"# What is Databricks?\n### What is a data lakehouse?\n\nA data lakehouse is a data management system that combines the benefits of data lakes and data warehouses. This article describes the lakehouse architectural pattern and what you can do with it on Databricks. \n![A diagram of the lakehouse architecture using Unity Catalog and delta tables.](https:\/\/docs.databricks.com\/_images\/lakehouse-diagram.png)\n\n### What is a data lakehouse?\n#### What is a data lakehouse used for?\n\nA data lakehouse provides scalable storage and processing capabilities for modern organizations that want to avoid isolated systems for processing different workloads, like machine learning (ML) and business intelligence (BI). A data lakehouse can help establish a single source of truth, eliminate redundant costs, and ensure data freshness. \nData lakehouses often use a data design pattern that incrementally improves, enriches, and refines data as it moves through layers of staging and transformation. Each layer of the lakehouse can include one or more layers. This pattern is frequently referred to as a medallion architecture. 
For more information, see [What is the medallion lakehouse architecture?](https:\/\/docs.databricks.com\/lakehouse\/medallion.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/index.html"} +{"content":"# What is Databricks?\n### What is a data lakehouse?\n#### How does the Databricks lakehouse work?\n\nDatabricks is built on Apache Spark. Apache Spark enables a massively scalable engine that runs on compute resources decoupled from storage. For more information, see [Apache Spark on Databricks](https:\/\/docs.databricks.com\/spark\/index.html) \nThe Databricks lakehouse uses two additional key technologies: \n* Delta Lake: an optimized storage layer that supports ACID transactions and schema enforcement.\n* Unity Catalog: a unified, fine-grained governance solution for data and AI. \n### Data ingestion \nAt the ingestion layer, batch or streaming data arrives from a variety of sources and in a variety of formats. This first logical layer provides a place for that data to land in its raw format. As you convert those files to Delta tables, you can use the schema enforcement capabilities of Delta Lake to check for missing or unexpected data. You can use Unity Catalog to register tables according to your data governance model and required data isolation boundaries. Unity Catalog allows you to track the lineage of your data as it is transformed and refined, as well as apply a unified governance model to keep sensitive data private and secure. \n### Data processing, curation, and integration \nOnce verified, you can start curating and refining your data. Data scientists and machine learning practitioners frequently work with data at this stage to start combining or creating new features and complete data cleansing. Once your data has been thoroughly cleansed, it can be integrated and reorganized into tables designed to meet your particular business needs. 
\nA schema-on-write approach, combined with Delta schema evolution capabilities, means that you can make changes to this layer without necessarily having to rewrite the downstream logic that serves data to your end users. \n### Data serving \nThe final layer serves clean, enriched data to end users. The final tables should be designed to serve data for all your use cases. A unified governance model means you can track data lineage back to your single source of truth. Data layouts, optimized for different tasks, allow end users to access data for machine learning applications, data engineering, and business intelligence and reporting. \nTo learn more about Delta Lake, see [What is Delta Lake?](https:\/\/docs.databricks.com\/delta\/index.html)\nTo learn more about Unity Catalog, see [What is Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/index.html"} +{"content":"# What is Databricks?\n### What is a data lakehouse?\n#### Capabilities of a Databricks lakehouse\n\nA lakehouse built on Databricks replaces the current dependency on data lakes and data warehouses for modern data companies. Some key tasks you can perform include: \n* **Real-time data processing:** Process streaming data in real-time for immediate analysis and action.\n* **Data integration:** Unify your data in a single system to enable collaboration and establish a single source of truth for your organization.\n* **Schema evolution:** Modify data schema over time to adapt to changing business needs without disrupting existing data pipelines.\n* **Data transformations:** Using Apache Spark and Delta Lake brings speed, scalability, and reliability to your data.\n* **Data analysis and reporting:** Run complex analytical queries with an engine optimized for data warehousing workloads.\n* **Machine learning and AI:** Apply advanced analytics techniques to all of your data. 
Use ML to enrich your data and support other workloads.\n* **Data versioning and lineage:** Maintain version history for datasets and track lineage to ensure data provenance and traceability.\n* **Data governance:** Use a single, unified system to control access to your data and perform audits.\n* **Data sharing:** Facilitate collaboration by allowing the sharing of curated data sets, reports, and insights across teams.\n* **Operational analytics:** Monitor data quality metrics, model quality metrics, and drift by applying machine learning to lakehouse monitoring data.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/index.html"} +{"content":"# What is Databricks?\n### What is a data lakehouse?\n#### Lakehouse vs Data Lake vs Data Warehouse\n\nData warehouses have powered business intelligence (BI) decisions for about 30 years, having evolved as a set of design guidelines for systems controlling the flow of data. Enterprise data warehouses optimize queries for BI reports, but can take minutes or even hours to generate results. Designed for data that is unlikely to change with high frequency, data warehouses seek to prevent conflicts between concurrently running queries. Many data warehouses rely on proprietary formats, which often limit support for machine learning. Data warehousing on Databricks leverages the capabilities of a Databricks lakehouse and Databricks SQL. For more information, see [What is data warehousing on Databricks?](https:\/\/docs.databricks.com\/sql\/index.html). \nPowered by technological advances in data storage and driven by exponential increases in the types and volume of data, data lakes have come into widespread use over the last decade. Data lakes store and process data cheaply and efficiently. Data lakes are often defined in opposition to data warehouses: A data warehouse delivers clean, structured data for BI analytics, while a data lake permanently and cheaply stores data of any nature in any format. 
Many organizations use data lakes for data science and machine learning, but not for BI reporting due to its unvalidated nature. \nThe data lakehouse combines the benefits of data lakes and data warehouses and provides: \n* Open, direct access to data stored in standard data formats.\n* Indexing protocols optimized for machine learning and data science.\n* Low query latency and high reliability for BI and advanced analytics. \nBy combining an optimized metadata layer with validated data stored in standard formats in cloud object storage, the data lakehouse allows data scientists and ML engineers to build models from the same data-driven BI reports.\n\n### What is a data lakehouse?\n#### Next step\n\nTo learn more about the principles and best practices for implementing and operating a lakehouse using Databricks, see [Introduction to the well-architected data lakehouse](https:\/\/docs.databricks.com\/lakehouse-architecture\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/index.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Fivetran\n\nFivetran automated data integration adapts as schemas and APIs change, ensuring reliable data access and simplified analysis with ready-to-query schemas. \nYou can integrate your Databricks SQL warehouses (formerly Databricks SQL endpoints) and Databricks clusters with Fivetran. The Fivetran integration with Databricks helps you centralize data from disparate data sources into Delta Lake.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/fivetran.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Fivetran\n##### Connect to Fivetran using Partner Connect\n\nThis section describes how to connect to Fivetran using Partner Connect. Each user creates their own connection. 
\nNote \nThe per-user connection experience is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). You can disable the ability to create per-user connections by contacting your Databricks account team. \n### Before you connect using Partner Connect \nBefore you connect to Fivetran using Partner Connect, make sure you have the following: \n* The workspace admin user role or the following permissions: \n+ The [CAN USE](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#sql-warehouses) permission for a SQL warehouse\n+ The [CAN USE](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#permissions) permission for token usage\n* For data managed by Unity Catalog, the following metastore object privileges for the catalog you want Fivetran to write to: \n+ `USE CATALOG` and `CREATE SCHEMA` on the catalog.\n+ (Optional) To specify a destination location, **CREATE EXTERNAL TABLE** on the external location and access to data in cloud object storage.\n* For data managed by the legacy Hive metastore, the following metastore object privileges for the catalog you want Fivetran to write to: \n+ `USAGE` and `CREATE` on the catalog\n+ (Optional) To specify a destination location, access to data in cloud object storage. \nPrivileges for Unity Catalog metastore objects can be granted by a metastore admin, the owner of the object, or the owner of the catalog or schema that contains the object. For more information, see [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html). \nPrivileges for legacy Hive metastore objects can be granted by a workspace admin or the owner of the object. For more information, see [Hive metastore privileges and securable objects (legacy)](https:\/\/docs.databricks.com\/data-governance\/table-acls\/object-privileges.html). 
\nAccess to data in cloud object storage must be configured by a workspace admin with sufficient permissions in the cloud object storage account. For more information, see [Enable data access configuration](https:\/\/docs.databricks.com\/admin\/sql\/data-access-configuration.html). \n### Partner Connect steps \nTo connect your Databricks workspace to Fivetran using Partner Connect, do the following: \n1. In the sidebar, click **Partner Connect**.\n2. Click the **Fivetran** tile. \nThe steps in this section create a user-level Fivetran trial account. To sign in to an existing workspace-level Fivetran trial account, click **Use existing connection**, complete the on-screen instructions to sign in to Fivetran, and skip the rest of the steps in this article.\n3. Select a SQL warehouse. If the SQL warehouse is stopped, click **Start**.\n4. If your workspace is enabled for Unity Catalog, select a catalog for Fivetran to write to, then click **Next**. \nPartner Connect generates a Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html) that is associated with your user.\n5. Click **Connect to Fivetran**. \nA new tab opens in your web browser that displays the Fivetran website.\n6. Complete the on-screen instructions on the Fivetran website to create your trial partner account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/fivetran.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Fivetran\n##### Connect to Fivetran manually\n\nFor an overview of the manual connection procedure, watch this YouTube video (2 minutes). \nNote \nTo connect a SQL warehouse with Fivetran faster, use Partner Connect. \n### Before you connect manually \nBefore you connect to Fivetran manually, you must have the following: \n* A cluster or SQL warehouse in your Databricks workspace. 
\n+ [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n+ [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* The connection details for your cluster or SQL warehouse, specifically the **Server Hostname**, **Port**, and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). To create a personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**.\nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. 
See your workspace administrator or the following: \n+ [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n+ [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \nTip \nIf the **Fivetran** tile in Partner Connect in your workspace has a check mark icon inside of it, you can get the connection details for the connected SQL warehouse by clicking the tile and then expanding **Connection details**. The **Personal access token** is hidden; you must [create a replacement personal access token](https:\/\/docs.databricks.com\/partner-connect\/index.html#how-to-create-token) and enter that new token instead when Fivetran asks you for it. \n### Manual steps \nTo connect to Fivetran manually, do the following: \n1. Sign in to your Fivetran account, or create a new Fivetran account, at <https:\/\/fivetran.com\/login>. \nImportant \nIf you sign in to your organization\u2019s Fivetran account, a **Choose Destination** page may display, listing one or more existing destination entries with the Databricks logo. 
*These entries might contain connection details for compute resources in workspaces that are separate from yours.* If you still want to reuse one of these connections, and you trust the compute resource and have access to it, choose that destination and then skip ahead to next steps. Otherwise, choose any available destination to get past this page.\n2. In your **Dashboard** page in Fivetran, click the **Destinations** tab.\n3. Click **Add Destination**.\n4. Enter a **Destination name** and click **Add**.\n5. On the **Fivetran is modern ELT** page, click **Set up a connector**.\n6. Click a data source, and then click **Next**.\n7. Follow the on-screen instructions in the **Setup Guide** in Fivetran to finish setting up the connector.\n8. Click **Save & Test**.\n9. After the test succeeds, click **Continue**.\n10. On the **Select your data\u2019s destination** page, click **Databricks on AWS**.\n11. Click **Continue Setup**.\n12. Complete the on-screen instructions in Fivetran to enter the connection details for your existing Databricks compute resource, specifically the **Server Hostname** and **HTTP Path** field values, and the token that you generated earlier.\n13. Click **Save & Test**.\n14. After the test succeeds, click **Continue**.\n15. 
Continue to next steps.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/fivetran.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Fivetran\n##### Additional resources\n\nExplore one or more of the following resources on the Fivetran website: \n* [Getting Started](https:\/\/fivetran.com\/docs\/getting-started)\n* [Architecture](https:\/\/fivetran.com\/docs\/getting-started\/architecture)\n* [Connectors](https:\/\/fivetran.com\/docs\/getting-started\/fivetran-dashboard\/connectors)\n* [Destinations](https:\/\/fivetran.com\/docs\/getting-started\/fivetran-dashboard\/destination)\n* [Sync Overview](https:\/\/fivetran.com\/docs\/getting-started\/syncoverview)\n* [Transformations](https:\/\/fivetran.com\/docs\/transformations)\n* [Fivetran Documentation](https:\/\/fivetran.com\/docs)\n* [Fivetran Support](https:\/\/support.fivetran.com\/hc)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/fivetran.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to RudderStack\n\nRudderStack is a lakehouse-native customer data platform that helps data teams collect, unify, and activate first-party data securely. \nYou can integrate your Databricks SQL warehouses and Databricks clusters with RudderStack.\n\n#### Connect to RudderStack\n##### Connect to RudderStack using Partner Connect\n\nTo connect to RudderStack using Partner Connect, see [Connect to ingestion partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/ingestion.html). \nNote \nPartner Connect only supports SQL warehouses for RudderStack. To connect using a cluster, do so manually.\n\n#### Connect to RudderStack\n##### Connect to RudderStack manually\n\nThis section describes how to connect to RudderStack manually. \nNote \nYou can use Partner Connect to simplify the connection experience with a SQL warehouse. 
\nTo connect to RudderStack manually, see the [Databricks Delta Lake](https:\/\/www.rudderstack.com\/docs\/destinations\/warehouse-destinations\/delta-lake\/) article in the RudderStack documentation.\n\n#### Connect to RudderStack\n##### Additional resources\n\nExplore the following RudderStack resources: \n* [Website](https:\/\/www.rudderstack.com\/)\n* [Documentation](https:\/\/www.rudderstack.com\/docs\/)\n* [Resource center](https:\/\/www.rudderstack.com\/resource-center\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/rudderstack.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Third-party online stores\n##### Automatic feature lookup with MLflow models on Databricks\n\n[Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) can automatically look up feature values from published online stores or from online tables. This article describes how to work with online stores. For information about working with online tables, see [Use online tables for real-time feature serving](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html).\n\n##### Automatic feature lookup with MLflow models on Databricks\n###### Requirements\n\n* The model must have been logged with `FeatureEngineeringClient.log_model` (for Feature Engineering in Unity Catalog) or `FeatureStoreClient.log_model` (for Workspace Feature Store, requires v0.3.5 and above).\n* The online store must be [published with read-only credentials](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/fs-authentication.html#provide-online-store-credentials-using-databricks-secrets). 
\nNote \nYou can publish the feature table at any time prior to model deployment, including after model training.\n\n##### Automatic feature lookup with MLflow models on Databricks\n###### Automatic feature lookup\n\nDatabricks [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) supports automatic feature lookup from these online stores: \n* Amazon DynamoDB (v0.3.8 and above) \nAutomatic feature lookup is supported for the following data types: \n* `IntegerType`\n* `FloatType`\n* `BooleanType`\n* `StringType`\n* `DoubleType`\n* `LongType`\n* `TimestampType`\n* `DateType`\n* `ShortType`\n* `DecimalType`\n* `ArrayType`\n* `MapType`\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/automatic-feature-lookup.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Third-party online stores\n##### Automatic feature lookup with MLflow models on Databricks\n###### Override feature values in online model scoring\n\nAll features required by the model (logged with `FeatureEngineeringClient.log_model` or `FeatureStoreClient.log_model`) are automatically looked up from online stores for model scoring. To override feature values when scoring a model using a REST API with [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html#score) include the feature values as a part of the API payload. \nNote \nThe new feature values must conform to the feature\u2019s data type as expected by the underlying model.\n\n##### Automatic feature lookup with MLflow models on Databricks\n###### Notebook examples: Unity Catalog\n\nWith Databricks Runtime 13.3 LTS and above, any Delta table in Unity Catalog with a primary key can be used as a feature table. When you use a table registered in Unity Catalog as a feature table, all Unity Catalog capabilities are automatically available to the feature table. 
\nThis example notebook illustrates how to publish features to an online store and then serve a trained model that automatically looks up features from the online store. \n### Online Store with Unity Catalog example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/feature-store-with-uc-online-example-dynamodb.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/automatic-feature-lookup.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Third-party online stores\n##### Automatic feature lookup with MLflow models on Databricks\n###### Notebook examples: Workspace Feature Store\n\nThis example notebook illustrates how to publish features to an online store and then serve a trained model that automatically looks up features from the online store. \n### Online Store example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/feature-store-online-example-dynamodb.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/automatic-feature-lookup.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### Train Spark ML models on Databricks Connect with `pyspark.ml.connect`\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
\nThis article provides an example that demonstrates how to use the `pyspark.ml.connect` module to perform distributed training to train Spark ML models and run model inference on Databricks Connect.\n\n##### Train Spark ML models on Databricks Connect with `pyspark.ml.connect`\n###### What is `pyspark.ml.connect`?\n\nSpark 3.5 introduces `pyspark.ml.connect` which is designed for supporting Spark connect mode and Databricks Connect. Learn more about [Databricks Connect](https:\/\/docs.databricks.com\/en\/dev-tools\/databricks-connect.html). \nThe `pyspark.ml.connect` module consists of common learning algorithms and utilities, including classification, feature transformers, ML pipelines, and cross validation. This module provides similar interfaces to the legacy [`pyspark.ml` module](https:\/\/spark.apache.org\/docs\/latest\/ml-guide.html), but the `pyspark.ml.connect` module currently only contains a subset of the algorithms in `pyspark.ml`. The supported algorithms are listed below: \n* Classification algorithm: `pyspark.ml.connect.classification.LogisticRegression`\n* Feature transformers: `pyspark.ml.connect.feature.MaxAbsScaler` and `pyspark.ml.connect.feature.StandardScaler`\n* Evaluator: `pyspark.ml.connect.RegressionEvaluator`, `pyspark.ml.connect.BinaryClassificationEvaluator` and `MulticlassClassificationEvaluator`\n* Pipeline: `pyspark.ml.connect.pipeline.Pipeline`\n* Model tuning: `pyspark.ml.connect.tuning.CrossValidator`\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/distributed-ml-for-spark-connect.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### Train Spark ML models on Databricks Connect with `pyspark.ml.connect`\n###### Requirements\n\n* Set up Databricks Connect on your clusters. 
See [Cluster configuration for Databricks Connect](https:\/\/docs.databricks.com\/dev-tools\/databricks-connect\/cluster-config.html).\n* Databricks Runtime 14.0 ML or higher installed.\n* Cluster access mode of `Assigned`.\n\n##### Train Spark ML models on Databricks Connect with `pyspark.ml.connect`\n###### Example notebook\n\nThe following notebook demonstrates how to use Distributed ML on Databricks Connect: \n### Distributed ML on Databricks Connect \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/distributed-ml-spark-connect.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nFor reference information about APIs in `pyspark.ml.connect`, Databricks recommends the [Apache Spark API reference](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/distributed-ml-for-spark-connect.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks widgets\n\nInput widgets allow you to add parameters to your notebooks and dashboards. You can add a widget from the Databricks UI or using the widget API. To add or edit a widget, you must have CAN EDIT [permissions on the notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-collaborate.html#notebook-permissions). \nIf you are running Databricks Runtime 11.3 LTS or above, you can also use [ipywidgets in Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/ipywidgets.html). \nDatabricks widgets are best for: \n* Building a notebook or dashboard that is re-executed with different parameters.\n* Quickly exploring results of a single query with different parameters. 
\nTo view the documentation for the widget API in Scala, Python, or R, use the following command: `dbutils.widgets.help()`\n\n#### Databricks widgets\n##### Databricks widget types\n\nThere are 4 types of widgets: \n* `text`: Input a value in a text box.\n* `dropdown`: Select a value from a list of provided values.\n* `combobox`: Combination of text and dropdown. Select a value from a provided list or input one in the text box.\n* `multiselect`: Select one or more values from a list of provided values. \nWidget dropdowns and text boxes appear immediately following the notebook toolbar. Widgets only accept string values. \n![Widget in header](https:\/\/docs.databricks.com\/_images\/widget-dropdown.png)\n\n#### Databricks widgets\n##### Create widgets\n\nThis section shows you how to create widgets using the UI or programmatically using either SQL magics or the widget API for Python, Scala, and R.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/widgets.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks widgets\n##### Create widgets using the UI\n\nCreate a widget using the notebook UI. If you are connected to a SQL warehouse, this is the only way you can create widgets. \nSelect **Edit > Add widget**. In the **Add widget** dialog, enter the widget name, optional label, type, parameter type, possible values, and optional default value. In the dialog, **Parameter Name** is the name you use to reference the widget in your code. **Widget Label** is an optional name that appears over the widget in the UI. \n![create widget dialog](https:\/\/docs.databricks.com\/_images\/widget-dialog.png) \nAfter you\u2019ve created a widget, you can hover over the widget name to display a tooltip that describes how to reference the widget. 
\n![widget tooltip](https:\/\/docs.databricks.com\/_images\/widget-tool-tip.png) \nYou can use the kebab menu to edit or remove the widget: \n![widget kebab menu](https:\/\/docs.databricks.com\/_images\/widget-kebab-menu.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/widgets.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks widgets\n##### Create widgets with SQL, Python, R, and Scala\n\nProgrammatically create widgets in a notebook attached to a compute cluster. \nThe widget API is designed to be consistent in Scala, Python, and R. The widget API in SQL is slightly different but equivalent to the other languages. You manage widgets through the [Databricks Utilities (dbutils) reference](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html) interface. \n* The first argument for all widget types is `name`. This is the name you use to access the widget.\n* The second argument is `defaultValue`, the widget\u2019s default setting.\n* The third argument for all widget types (except `text`) is `choices`, a list of values the widget can take on. This argument is not used for `text` type widgets.\n* The last argument is `label`, an optional value for the label shown over the widget text box or dropdown. \n```\ndbutils.widgets.dropdown(\"state\", \"CA\", [\"CA\", \"IL\", \"MI\", \"NY\", \"OR\", \"VA\"])\n\n``` \n```\ndbutils.widgets.dropdown(\"state\", \"CA\", [\"CA\", \"IL\", \"MI\", \"NY\", \"OR\", \"VA\"])\n\n``` \n```\ndbutils.widgets.dropdown(\"state\", \"CA\", [\"CA\", \"IL\", \"MI\", \"NY\", \"OR\", \"VA\"])\n\n``` \n```\nCREATE WIDGET DROPDOWN state DEFAULT \"CA\" CHOICES SELECT * FROM (VALUES (\"CA\"), (\"IL\"), (\"MI\"), (\"NY\"), (\"OR\"), (\"VA\"))\n\n``` \nInteract with the widget from the widget panel. 
\n![Interact with widget](https:\/\/docs.databricks.com\/_images\/widget-demo.png) \nYou can access the current value of the widget or get a mapping of all widgets: \n```\ndbutils.widgets.get(\"state\")\n\ndbutils.widgets.getAll()\n\n``` \n```\ndbutils.widgets.get(\"state\")\n\ndbutils.widgets.getAll()\n\n``` \n```\ndbutils.widgets.get(\"state\")\n\n``` \n```\nSELECT :state\n\n``` \nFinally, you can remove a widget or all widgets in a notebook: \n```\ndbutils.widgets.remove(\"state\")\n\ndbutils.widgets.removeAll()\n\n``` \n```\ndbutils.widgets.remove(\"state\")\n\ndbutils.widgets.removeAll()\n\n``` \n```\ndbutils.widgets.remove(\"state\")\n\ndbutils.widgets.removeAll()\n\n``` \n```\nREMOVE WIDGET state\n\n``` \nIf you remove a widget, you cannot create one in the same cell. You must create the widget in another cell.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/widgets.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks widgets\n##### Use widget values in Spark SQL and SQL Warehouse\n\nSpark SQL and SQL Warehouse access widget values using [parameter markers](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-parameter-marker.html). Parameter markers protect your code from SQL injection attacks by clearly separating provided values from the SQL statements. \nParameter markers for widgets are available in Databricks Runtime 15.2 and above. On earlier versions of Databricks Runtime, use the old [syntax for DBR 15.1 and below](https:\/\/docs.databricks.com\/notebooks\/widgets.html#widgets-pre-dbr15). \nYou can access widgets defined in any language from Spark SQL while executing notebooks interactively. Consider the following workflow: \n1. Create a dropdown widget of all databases in the current catalog: \n```\ndbutils.widgets.dropdown(\"database\", \"default\", [database[0] for database in spark.catalog.listDatabases()])\n\n```\n2. 
Create a text widget to manually specify a table name: \n```\ndbutils.widgets.text(\"table\", \"\")\n\n```\n3. Run a SQL query to see all tables in a database (selected from the dropdown list): \n```\nSHOW TABLES IN IDENTIFIER(:database)\n\n``` \nNote \nYou must use the SQL `IDENTIFIER()` clause to parse strings as object identifiers such as names for databases, tables, views, functions, columns, and fields.\n4. Manually enter a table name into the `table` widget.\n5. Create a text widget to specify a filter value: \n```\ndbutils.widgets.text(\"filter_value\", \"\")\n\n```\n6. Preview the contents of a table without needing to edit the contents of the query: \n```\nSELECT *\nFROM IDENTIFIER(:database || '.' || :table)\nWHERE col == :filter_value\nLIMIT 100\n\n``` \n### Use widget values in Databricks Runtime 15.1 and below \nThis section describes how to pass Databricks widgets values to `%sql` notebook cells in Databricks Runtime 15.1 and below. \n1. Create widgets to specify text values. \n```\ndbutils.widgets.text(\"database\", \"\")\ndbutils.widgets.text(\"table\", \"\")\ndbutils.widgets.text(\"filter_value\", \"100\")\n\n``` \n```\ndbutils.widgets.text(\"database\", \"\")\ndbutils.widgets.text(\"table\", \"\")\ndbutils.widgets.text(\"filter_value\", \"100\")\n\n``` \n```\ndbutils.widgets.text(\"database\", \"\")\ndbutils.widgets.text(\"table\", \"\")\ndbutils.widgets.text(\"filter_value\", \"100\")\n\n``` \n```\nCREATE WIDGET TEXT database DEFAULT \"\"\nCREATE WIDGET TEXT table DEFAULT \"\"\nCREATE WIDGET TEXT filter_value DEFAULT \"100\"\n\n``` \n2. Pass in the widget values using the `${param}` syntax. \n```\nSELECT *\nFROM ${database}.${table}\nWHERE col == ${filter_value}\nLIMIT 100\n\n``` \nNote \nTo escape the `$` character in a [SQL string literal](https:\/\/docs.databricks.com\/sql\/language-manual\/data-types\/string-type.html), use `\\$`. For example, to express the string `$1,000`, use `\"\\$1,000\"`. 
The `$` character cannot be escaped for [SQL identifiers](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-identifiers.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/widgets.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks widgets\n##### Configure widget settings\n\nYou can configure the behavior of widgets when a new value is selected, whether the widget panel is always pinned to the top of the notebook, and change the layout of widgets in the notebook. \n1. Click the ![gear icon](https:\/\/docs.databricks.com\/_images\/gear.png) icon at the right end of the Widget panel.\n2. In the pop-up Widget Panel Settings dialog box, choose the widget\u2019s execution behavior. \n![Widget settings](https:\/\/docs.databricks.com\/_images\/widget-settings.png) \n* **Run Notebook**: Every time a new value is selected, the entire notebook is rerun.\n* **Run Accessed Commands**: Every time a new value is selected, only cells that retrieve the values for that particular widget are rerun. This is the default setting when you create a widget. SQL cells are not rerun in this configuration.\n* **Do Nothing**: Every time a new value is selected, nothing is rerun.\n3. To pin the widgets to the top of the notebook or to place the widgets above the first cell, click ![pin icon](https:\/\/docs.databricks.com\/_images\/pin.png). The setting is saved on a per-user basis. Click the thumbtack icon again to reset to the default behavior.\n4. If you have CAN MANAGE permission for notebooks, you can configure the widget layout by clicking ![edit icon](https:\/\/docs.databricks.com\/_images\/edit.png). Each widget\u2019s order and size can be customized. To save or dismiss your changes, click ![accept and cancel icons](https:\/\/docs.databricks.com\/_images\/checkbox.png). \nThe widget layout is saved with the notebook. 
If you change the widget layout from the default configuration, new widgets are not added alphabetically.\n5. To reset the widget layout to a default order and size, click ![gear icon](https:\/\/docs.databricks.com\/_images\/gear.png) to open the **Widget Panel Settings** dialog and then click **Reset Layout**. The `removeAll()` command does not reset the widget layout.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/widgets.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks widgets\n##### Example notebook\n\nThe following notebook shows how the **Run Accessed Commands** setting works. The `year` widget is created with the setting `2014` and is used in DataFrame API and SQL commands. \n![Widgets](https:\/\/docs.databricks.com\/_images\/widget-demo.png) \nWhen you change the setting of the `year` widget to `2007`, the DataFrame command reruns, but the SQL command is not rerun. \nThis notebook illustrates the use of widgets in a notebook attached to a cluster, not a SQL warehouse. \n### Widget demo notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/widget-demo.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n#### Databricks widgets\n##### Databricks widgets in dashboards\n\nWhen you create a dashboard from a notebook with input widgets, all the widgets display at the top. In presentation mode, every time you update the value of a widget, you can click the **Update** button to re-run the notebook and update your dashboard with new values. \n![Dashboard with widgets](https:\/\/docs.databricks.com\/_images\/widget-dashboard2.png)\n\n#### Databricks widgets\n##### Use Databricks widgets with %run\n\nIf you [run a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-workflows.html#run) that contains widgets, the specified notebook is run with the widget\u2019s default values. 
\nIf the notebook is attached to a cluster (not a SQL warehouse), you can also pass values to widgets. For example: \n```\n%run \/path\/to\/notebook $X=\"10\" $Y=\"1\"\n\n``` \nThis example runs the specified notebook and passes `10` into widget X and `1` into widget Y.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/widgets.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Databricks widgets\n##### Limitations\n\n* The following limits apply to widgets: \n+ A maximum of 512 widgets can be created in a notebook.\n+ A widget name is limited to 1024 characters.\n+ A widget label is limited to 2048 characters.\n+ A maximum of 2048 characters can be input to a text widget.\n+ There can be a maximum of 1024 choices for a multi-select, combo box, or dropdown widget.\n* There is a known issue where a widget state may not properly clear after pressing **Run All**, even after clearing or removing the widget in code. If this happens, you will see a discrepancy between the widget\u2019s visual and printed states. Re-running the cells individually may bypass this issue. To avoid this issue entirely, Databricks recommends using [ipywidgets](https:\/\/docs.databricks.com\/notebooks\/ipywidgets.html).\n* You should not access widget state directly in asynchronous contexts like threads, subprocesses, or Structured Streaming ([foreachBatch](https:\/\/docs.databricks.com\/structured-streaming\/foreach.html)), as widget state can change while the asynchronous code is running. 
If you need to access widget state in an asynchronous context, pass it in as an argument.\nFor example, if you have the following code that uses threads: \n```\nimport threading\n\ndef thread_func():\n    # Unsafe access in a thread\n    value = dbutils.widgets.get('my_widget')\n    print(value)\n\nthread = threading.Thread(target=thread_func)\nthread.start()\nthread.join()\n\n``` \nThen you should write it this way instead: \n```\nimport threading\n\n# Access widget values outside the asynchronous context and pass them to the function\nvalue = dbutils.widgets.get('my_widget')\n\ndef thread_func(val):\n    # Use the passed value safely inside the thread\n    print(val)\n\nthread = threading.Thread(target=thread_func, args=(value,))\nthread.start()\nthread.join()\n\n```\n* In general, widgets can\u2019t pass arguments between different languages within a notebook. You can create a widget `arg1` in a Python cell and use it in a SQL or Scala cell if you run one cell at a time. However, this does not work if you use **Run All** or run the notebook as a job. Some workarounds are: \n+ For notebooks that do not mix languages, you can create a notebook for each language and pass the arguments when you [run the notebook](https:\/\/docs.databricks.com\/notebooks\/widgets.html#widgets-and-percent-run).\n+ You can access the widget using a `spark.sql()` call. For example, in Python: `spark.sql(\"select getArgument('arg1')\").take(1)[0][0]`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/widgets.html"} +{"content":"# What is data warehousing on Databricks?\n### Dashboards\n\nYou can use dashboards to build data visualizations and share informative data insights with your team. The latest version of dashboards features an enhanced visualization library and a streamlined configuration experience so that you can quickly transform data into sharable insights. \nNote \nDashboards (formerly Lakeview dashboards) are now generally available. 
\n* Original Databricks SQL dashboards are now called **legacy dashboards**. They will continue to be supported and updated with critical bug fixes, but new functionality will be limited. You can continue to use legacy dashboards for both authoring and consumption.\n* Convert legacy dashboards using the migration tool or REST API. See [Clone a legacy dashboard to a Lakeview dashboard](https:\/\/docs.databricks.com\/dashboards\/clone-legacy-to-lakeview.html) for instructions on using the built-in migration tool. See [Dashboard tutorials](https:\/\/docs.databricks.com\/dashboards\/tutorials\/index.html) for tutorials on creating and managing dashboards using the REST API. \nDashboards have the following components: \n* **Data**: The **Data** tab allows users to define datasets for use in the dashboard. Datasets are bundled with dashboards when sharing, importing, or exporting them using the UI or API.\n* **Canvas**: The **Canvas** tab allows users to create visualizations and construct their dashboards. \nNote \nYou can define up to 100 datasets per dashboard. The **Canvas** can contain up to 100 widgets per dashboard.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Dashboards\n#### Define your datasets\n\nUse the **Data** tab to define the underlying datasets for your dashboard. \nYou can define datasets as any of the following: \n* A new query against one or more tables or views.\n* An existing Unity Catalog table or view. \nYou can define datasets on any type of table or view. You can define multiple datasets by writing additional queries or selecting additional tables or views. After defining a dataset, you can use the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu to the right of the dataset name to rename, clone, or delete it. You can also download the dataset as a CSV, TSV, or Excel file. 
\n![Menu shows the dataset options](https:\/\/docs.databricks.com\/_images\/lakeview-dataset-options.png) \n### Limit data access with SQL \nAll the data in a dashboard dataset can be accessible to dashboard viewers, even if it\u2019s not displayed in a visualization. To prevent sensitive data from being sent to the browser, limit the columns specified in the SQL query that defines the dataset. For example, rather than selecting all columns from a table, select only the specific columns that the visualizations need. Apply this limit in the SQL statement itself rather than in the table configuration.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Dashboards\n#### Add or remove visualizations, text, and filter widgets on the canvas\n\nUse the **Canvas** tab to construct your dashboard. Use the toolbar at the bottom of the canvas to add widgets such as visualizations, text boxes, and filters. \n### Visualizations \nCreate a visualization by adding a visualization widget to the canvas. Supported visualizations include area, bar, combo, counter, heatmap, histogram, line, pie, pivot, scatter, and table chart types. \nNote \nQueries used by visualizations do not always correspond precisely to the dataset. For example, if you apply aggregations to a visualization, the visualization shows the aggregated values. \n* Use the Databricks Assistant: Create visualizations by describing the chart you want to see in natural language and let the assistant generate a chart. After it is created, you can modify the generated chart using the configuration panel. You cannot use Assistant to create table or pivot table chart types.\n* Use the configuration panel: Apply additional aggregations or time bins in the visualization configuration without modifying the dataset directly. You can choose a dataset, x-axis values, y-axis values, and colors in the configuration panel. 
See [Dashboard visualization types](https:\/\/docs.databricks.com\/dashboards\/visualization-types.html) for configuration details and examples of each supported visualization type. See [Table options](https:\/\/docs.databricks.com\/visualizations\/tables.html) to learn how you can control data presentation in table visualizations. \nNote \nWhen you apply temporal transformations in the visualization configuration, the date shown in the visualization represents the start of that period. \n### Text widgets \nMarkdown is a markup language for formatting text in a plain text editor. You can use markdown in text widgets to format text, insert links, and add images to your dashboard. \n* To add a static image in a text widget, add markdown image syntax with a desired description and URL: `![description](URL)` from a publicly available URL. For example, the following markdown will insert an image of the Databricks logo: `![The Databricks Logo](https:\/\/upload.wikimedia.org\/wikipedia\/commons\/6\/63\/Databricks_Logo.png)`. To resize the image, resize the widget dimensions.\n* To add an image from DBFS, add markdown image syntax with a desired description and FileStore path: `![description](files\/path_to_dbfs_image)`. To resize the image, resize the widget dimensions. For more information on DBFS, see [What is DBFS?](https:\/\/docs.databricks.com\/dbfs\/index.html). \nFor more information on markdown syntax, see [this guide](https:\/\/www.markdownguide.org\/cheat-sheet\/#basic-syntax). \n### Filters \nFilters are widgets that allow dashboard viewers to narrow down results by filtering on specific fields or setting dataset parameters. They function in the same way as slicers in other BI tools, allowing dashboard viewers to manipulate and refine the data presented in visualizations. Each filter widget can be configured to filter on dataset fields or to assign values to predefined parameters in a dataset query. 
Filters and parameters can be combined in a single widget when using query-based parameters. See [Use query-based parameters](https:\/\/docs.databricks.com\/dashboards\/tutorials\/query-based-params.html) to learn how to apply a query-based parameter. \n#### Filter on fields \nDashboards support the following filter types for filtering fields: \n* Single value\n* Multiple values\n* Date picker\n* Date range picker\n* Text entry\n* Range slider \nFilters can be applied to fields of one or more datasets. To connect a filter to fields from more than one dataset, add multiple **Fields**, up to one per dataset. The filter applies to all visualizations built on the selected datasets. Filter selection cascades across all other filters. \nDashboard filters always apply to the entire dataset. If the dataset is small, the dashboard filter is applied directly in the browser to improve performance. If the dataset is larger, the filter is added to the query that is run in the SQL warehouse. \n#### Filter on parameters \nIf a filter is connected to a parameter, it runs a query against the SQL warehouse, regardless of the dataset size. \nDashboards support the following filter types for setting parameters: \n* String\n* Date\n* Date and Time\n* Decimal\n* Integer \nSee [What are dashboard parameters?](https:\/\/docs.databricks.com\/dashboards\/parameters.html). \nNote \nUsing parameters to specify date ranges is unsupported. To specify a date range, apply filters on the fields that include the start and end dates of the desired range. \n### Copy widgets \nUse keyboard shortcuts to copy a selected widget and paste it back on the canvas. After you create a new widget, you can edit it as you would any other widget. \nTo clone a widget on your draft dashboard canvas, complete the following steps: \n* Right-click on a widget.\n* Click **Clone**. \nA clone of your widget appears below the original. 
\n### Remove widgets \nDelete widgets by selecting a widget and pressing the delete key on your keyboard. Or, right-click on the widget. Then, click **Delete**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Dashboards\n#### Download results\n\nYou can download datasets as CSV, TSV, or Excel files. From a draft dashboard, access download options from the **Data** tab or right-click on a visualization on the canvas. \nYou can download up to approximately 1GB of results data in CSV and TSV format and up to 100,000 rows to an Excel file. \nThe final file download size might be slightly more or less than 1GB, as the 1GB limit is applied to an earlier step than the final file download. \nFor published dashboards, viewers can download results by right-clicking on a visualization. \nWorkspace admins can adjust their security settings to prevent users from downloading results with the following steps: \n1. Click your username in the top bar of the Databricks workspace and select **Settings**.\n2. Click **Security**.\n3. Turn the **SQL results download** option off.\n\n### Dashboards\n#### Draft and collaborate on a dashboard\n\nNew dashboards begin as a draft. You can share the draft with other users in your workspace to collaborate. All users use their own credentials to interact with the data and visualizations in dashboard drafts. \nFor more on permission levels, see [Dashboard ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#lakeview).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Dashboards\n#### Publish a dashboard\n\nPublish a dashboard to create a clean copy of the current dashboard you can share with any user in your Databricks workspace. 
After publishing your dashboard, the published version remains unchanged and accessible for sharing until you publish again. You can continue making modifications and improvements in a draft version without affecting the publicly shared copy. \nPublishing does not automatically share dashboards with users. To share a published dashboard, explicitly grant view permissions to users or groups. \nYou must have at least **Can Edit** permissions to publish a dashboard. \n1. Open a dashboard.\n2. In the **Share** drop-down menu in the upper-right, click **Publish**. The **Publish** dialog appears.\n3. Choose the credentials to use for the published dashboard. You can choose whether to embed your credentials. \n* **Embed credentials**: All viewers of a published dashboard can run queries using your credentials for data and compute. This allows users to see the dashboard even if they don\u2019t have access to the underlying data or SQL warehouse. This might expose data to users who have not been granted direct access to it. This is the default option.\n* **Don\u2019t embed credentials**: All viewers of the published dashboard run queries using their own data and compute credentials. Viewers need access to the workspace, the attached SQL warehouse, and the associated data to view results in the dashboard. \n4. Click **Publish**. \nYou can share the published dashboard with any user in your Databricks workspace. For more on controlling access to your dashboard, see [Dashboard ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#lakeview). \nTo access the published dashboard, click **Published** in the drop-down menu near the top of the dashboard. 
\n![Drop-down menu showing available draft and published dashboard versions.](https:\/\/docs.databricks.com\/_images\/draft-published-switcher.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Dashboards\n#### Schedule dashboards for periodic updates\n\nYou can set up scheduled updates to automatically refresh your dashboard and periodically send emails with the latest data to your subscribers. \nUsers with at least **Can Edit** permissions can create a schedule so published dashboards with embedded credentials run periodically. Each dashboard can have up to ten schedules. \nFor each scheduled dashboard update, the following occurs: \n* All SQL logic that defines datasets runs on the designated time interval.\n* Results populate the query result cache and help to improve initial dashboard load time. \nTo create a schedule: \n1. Click **Schedule** in the upper-right corner of the dashboard. The **Add Schedule** dialog appears. \n![Add schedule dialog](https:\/\/docs.databricks.com\/_images\/lakeview-add-schedule.png)\n2. Use the drop-down selectors to specify the frequency and time zone. Optionally, select the **Show cron syntax** checkbox to edit the schedule in [Quartz Cron Syntax](http:\/\/www.quartz-scheduler.org\/documentation\/quartz-2.3.0\/tutorials\/crontrigger.html).\n3. Click **Create**. The **Schedules** dialog appears and shows the schedule you created. If other schedules exist for this dashboard, the dialog also shows those.\n4. Optionally, click **Subscribe** to add yourself as a subscriber and receive an email with a PDF snapshot of the dashboard after a scheduled run completes. \nNote \nIf a schedule has already been created for this dashboard, the button in the upper-right corner says **Subscribe**. 
You can use the previously described workflow to add a schedule.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Dashboards\n#### Manage subscriptions\n\nSchedule subscribers receive an email with a PDF snapshot of the current dashboard each time the schedule runs. Eligible subscribers include workspace users and email notification destinations. \nWorkspace admins must define email notification destinations before they can be selected as subscribers. See [Manage notification destinations](https:\/\/docs.databricks.com\/admin\/workspace-settings\/notification-destinations.html). Account users, distribution lists, and users outside of the account (like users at partner or client organizations) can all be configured as email notification destinations and subscribed. However, they can\u2019t be subscribed directly. \nImportant \nSubscription lists can contain up to 100 subscribers. An email notification destination counts as one subscriber regardless of the number of emails it sends. \nYou can add and remove other subscribers to receive updates if you have at least **Can Edit** privileges on the dashboard. You can add and remove yourself as a subscriber to an existing schedule if you have at least **Can View** privileges on the dashboard. \n* To subscribe other users: \n1. Click **Subscribe** in the upper-right corner of the dashboard. The **Schedules** dialog appears.\n2. Identify the schedule that you want to add subscribers to. Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) to the right of that schedule. Then, click **Edit**.\nNote \nYou can also use this context menu to pause or delete a schedule. \nIf you have **Can View** access to a dashboard that has an assigned schedule, you can subscribe yourself to receive updates each time a scheduled run occurs.\n* To subscribe yourself to an existing schedule: \n1. 
Click the **Subscribe** button near the upper-right corner of the dashboard. The **Schedules** dialog shows all schedules for the dashboard.\n2. Click **Subscribe** to the right of the schedule you choose. If you cannot add yourself as a subscriber, check the following reasons: \n+ A workspace admin has turned off the **Enable dashboard subscriptions** option for the workspace. \nThis setting supersedes all others. If the workspace admin has turned this setting off, dashboard editors can still assign a schedule, but no subscribers can be assigned.\n+ The dashboard is not shared with embedded credentials. \nDashboards shared without embedded credentials cannot be assigned a schedule, so they cannot be assigned subscribers.\n+ You do not have permission to access the workspace. \nAccount users can be added as subscribers only through an email notification destination. There is no **Subscribe** button on the dashboard for account users.\n+ No schedules have been defined. \nFor dashboards without a defined schedule, workspace users with **Can View** or **Can Run** access to a dashboard cannot interact with the **Subscribe** button. \n### Unsubscribe from email updates \nSubscribers can choose to stop receiving emails by unsubscribing from the schedule. \n* To unsubscribe using the dashboard UI: \n1. Click the **Subscribe** button near the upper-right corner of the dashboard. The **Schedules** dialog shows all schedules for the dashboard.\n2. Click **Subscribed** to unsubscribe. The button text changes to **Subscribe**.\n![UI changes from Subscribed to Subscribe.](https:\/\/docs.databricks.com\/_images\/lakeview-sched-unsubscribe.gif)\n* Use the **Unsubscribe** link in the email footer to unsubscribe from scheduled updates. \nNote \nWhen a user who is included in a larger distribution list set up for email notifications chooses to unsubscribe using the link in the email footer, the action unsubscribes the entire distribution list. 
The group as a whole is removed from the subscription and will not receive future PDF snapshot updates. \n### Workspace admin subscription controls \nWorkspace admins can prevent users from distributing dashboards using subscriptions. \nTo prevent sharing email updates: \n1. Click your username in the top bar of the Databricks workspace and select **Settings**.\n2. Click **Notifications**.\n3. Turn the **Enable dashboard email subscriptions** option off. \nChanging this setting prevents all users from adding email subscribers. Dashboard editors cannot add subscribers, and dashboard viewers do not have the option to subscribe to a dashboard. \nIf this setting is turned off, existing subscriptions are paused, and no one can modify existing subscription lists. If this setting is turned back on, subscriptions resume using the existing list.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Dashboards\n#### Dashboard size limits for subscriptions\n\nDashboard subscription emails include the following base64 encoded files: \n* PDF: A PDF file that includes the full dashboard.\n* DesktopImage: An image file optimized for viewing on desktop computers.\n* MobileImage: An image file optimized for viewing on mobile devices. \nA maximum limit of 9MB is imposed on the combined size of the three files. The following descriptions outline the expected behavior when the combined file size exceeds the limit: \n* **If the PDF file is greater than 9MB:** The subscription email does not include the PDF attachment or any images. It includes a note that says the dashboard has exceeded the size limit and shows the actual file size of the current dashboard.\n* **If the combined size of the PDF and DesktopImage files is greater than 9MB:** Only the PDF is attached to the email. 
The inline message includes a link to the dashboard but no inline image for mobile or desktop viewing.\n* **If the combined file size of all files is greater than 9MB:** The MobileImage is excluded from the email.\n\n### Dashboards\n#### Transfer ownership of a dashboard\n\nIf you are a workspace admin, you can transfer ownership of a dashboard to a different user. \n1. Go to the list of dashboards. Click a dashboard name to edit.\n2. Click **Share**.\n3. Click the ![Gear icon](https:\/\/docs.databricks.com\/_images\/gear-icon.png) icon at the top-right of the **Sharing** dialog.\n![Share dialog with gear icon](https:\/\/docs.databricks.com\/_images\/lakeview-transfer-owner.png)\n4. Begin typing a username to search for and select the new owner.\n5. Click **Confirm**. \nThe new owner appears in the **Sharing** dialog with CAN MANAGE permissions. To view dashboards listed by owner, go to the list of available dashboards by choosing the ![Dashboards Icon](https:\/\/docs.databricks.com\/_images\/dashboards-icon.png).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Dashboards\n#### Export, import, or replace a dashboard\n\nYou can export and import dashboards as files to facilitate the sharing of editable dashboards across different workspaces. To transfer a dashboard to a different workspace, export it as a file and then import it into the new workspace. You can also replace dashboard files in place. That means that when you edit a dashboard file directly, you can upload that file to the original workspace and overwrite the existing file while maintaining existing sharing settings. \nThe following steps explain how to export and import dashboards in the UI. You can also use the Databricks API to import and export dashboards programmatically. See [POST \/api\/2.0\/workspace\/import](https:\/\/docs.databricks.com\/api\/workspace\/workspace\/import). 
\n### Export a dashboard file \n* From a draft dashboard, click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu at the screen\u2019s upper-right corner, then click **Export dashboard**.\n* Confirm or cancel the action using the **Export dashboard** dialog. When the export succeeds, a `.lvdash.json` file is saved to your web browser\u2019s default download directory. \n### Import a dashboard file \n* From the Dashboards listing page, click ![Blue Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret-blue.png)**> Import dashboard from file**.\n* Click **Choose file** to open your local file dialog, then select the `.lvdash.json` file you want to import.\n* Click **Import dashboard** to confirm and create the dashboard. \nThe imported dashboard is saved to your user folder. If an imported dashboard with the same name already exists in that location, the conflict is automatically resolved by appending a number in parentheses to create a unique name. \n### Replace a dashboard from a file \n* From a draft dashboard, click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu in the screen\u2019s upper-right corner, then click **Replace dashboard**.\n* Click **Choose file** to open the file dialog and select the `.lvdash.json` file to import.\n* Click **Overwrite** to overwrite the existing dashboard.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Dashboards\n#### What is share to account?\n\nDashboard share to account allows users to share published dashboards with users and groups outside the workspace where the dashboard was drafted. \nAdding users to the Databricks account is not the same as adding them to a workspace. When users are added to an account, their credentials do not grant them automatic access to a workspace, data, or compute resources. 
Instead, the registration establishes their identity within the system, which Databricks will later use to verify that shared dashboards are available to only their intended recipients. \nThe following image shows an example of how published dashboards can be shared across multiple workspaces and at the account level. \n![Example of dashboard sharing as explained in the following list.](https:\/\/docs.databricks.com\/_images\/lakeview-acct-sharing.png) \nPublished dashboards can be shared with the following: \n* One or more specific users assigned to the originating workspace.\n* Workspace groups (including all workspace users).\n* One or more specific users in the Databricks account.\n* Databricks account groups (including all account users). \nFor more information on users and group relationships in your Databricks account, see [How do admins assign users to workspaces?](https:\/\/docs.databricks.com\/admin\/users-groups\/index.html#how-do-admins-assign-users-to-workspaces) \nImportant \nDashboard account sharing requires that unified login with single sign-on (SSO) is enabled. For more information, see [Unified login](https:\/\/docs.databricks.com\/admin\/users-groups\/single-sign-on\/index.html#unified-login).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Dashboards\n#### How to share a dashboard with other Databricks account users\n\nDraft dashboards cannot be shared with users outside of the workspace. Published dashboards can be shared with Databricks account users and groups. Adding users and groups to a Databricks account does not automatically assign any workspace, data, or compute permissions. See [Manage users, service principals, and groups](https:\/\/docs.databricks.com\/admin\/users-groups\/index.html) for details on identity management with Databricks. \nUse the following steps to publish and share your dashboard with account users. 
\n* Navigate to the draft dashboard.\n* Publish the dashboard with the **Embed credentials (default)** setting. \nEmbedding your credentials means the SQL warehouse and queries use the publisher\u2019s data and warehouse permissions to update the published dashboard. This is necessary if you want to share with users outside of the originating workspace, as they do not have their own credentials.\n* Click the **Share** button, and use the **Sharing** dialog to set permissions for users and groups in your Databricks account. \n+ At the top of the **Sharing** dialog, enter workspace users, workspace groups, specific account users, or account groups. For users in your workspace, you can assign **Can Manage**, **Can Edit**, **Can Run**, or **Can View** permission. Account users are limited to **Can View** access even if assigned a higher permission in the **Sharing** dialog. \nTo quickly assign view access for all account users, use the **Sharing settings** option at the bottom of the **Sharing** dialog. \n![Sharing dialog showing settings for organization-wide sharing](https:\/\/docs.databricks.com\/_images\/lakeview-sharing-dialog.png)\n* Share the link with users. \nClick **Copy link** near the bottom of the **Sharing** dialog to copy a shareable URL for the published dashboard. \nNote \nUsers who do not have access to the workspace are limited to **Can View** permissions. If you grant elevated permissions such as **Can Edit** to a user who does not have access to the workspace, the permissions appear in the UI but are not actually applied until the user is added to the workspace. \nFor more information on dashboard permission levels, see [Dashboard ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#lakeview). \n### Network considerations \nIf IP access lists are configured, a dashboard published to the account is only accessible to account users if they access it from within the approved IP range, such as when using a VPN. 
For more information on configuring access, see [Manage IP access lists](https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Dashboards\n#### Monitor Lakeview activity\n\nAdmins can monitor the activity on dashboards using audit logs. See [Dashboards events](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html#dashboards).\n\n### Dashboards\n#### Managing dashboards with the REST API\n\nSee [Use Databricks APIs to manage dashboards](https:\/\/docs.databricks.com\/dashboards\/tutorials\/index.html#apis) for tutorials that demonstrate how to use Databricks REST APIs to manage dashboards. The included tutorials explain how to convert legacy dashboards into Lakeview dashboards, as well as how to create, manage, and share them.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/index.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### Create a training run using the Foundation Model Training API\n\nImportant \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). Reach out to your Databricks account team to enroll in the Public Preview. \nThis page describes how to create and configure a training run using the Foundation Model Training API and describes all of the parameters used in the API call. You can also create a run using the UI. 
For instructions, see [Create a training run using the Foundation Model Training UI](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/ui.html).\n\n#### Create a training run using the Foundation Model Training API\n##### Requirements\n\nSee [Requirements](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html#required).\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/create-fine-tune-run.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### Create a training run using the Foundation Model Training API\n##### Create a training run\n\nTo create training runs programmatically, use the `create()` function. This function trains a model on the provided dataset and converts the final [Composer](https:\/\/github.com\/mosaicml\/composer\/) checkpoint to a Hugging Face formatted checkpoint for inference. \nThe required inputs are the model you want to train, the location of your training dataset, and where to register your model. There are also optional fields that allow you to perform evaluation and change the hyperparameters of your run. After you create a run, the checkpoints are saved to the MLflow run, and the final checkpoint is registered to Unity Catalog for easy deployment. \nSee [Configure a training run](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/create-fine-tune-run.html#configure) for details about arguments for the `create()` function. 
\n```\nfrom databricks.model_training import foundation_model as fm\n\nrun = fm.create(\n    model='meta-llama\/Llama-2-7b-chat-hf',\n    train_data_path='dbfs:\/Volumes\/main\/mydirectory\/ift\/train.jsonl', # UC Volume with JSONL formatted data\n    # Public HF dataset is also supported\n    # train_data_path='mosaicml\/dolly_hhrlhf\/train'\n    register_to='main.mydirectory', # UC catalog and schema to register the model to\n)\n\n``` \nAfter the run completes, the run and its final checkpoints are saved, and the model is registered to Unity Catalog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/create-fine-tune-run.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### Create a training run using the Foundation Model Training API\n##### Configure a training run\n\nThe following table summarizes the fields for the `create()` function. \n| Field | Required | Type | Description |\n| --- | --- | --- | --- |\n| `model` | x | str | The name of the model to use. See [Supported models](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html#supported-model). |\n| `train_data_path` | x | str | The location of your training data. This can be a location in Unity Catalog (`<catalog>.<schema>.<table>` or `dbfs:\/Volumes\/<catalog>\/<schema>\/<volume>\/<dataset>.jsonl`), or a Hugging Face dataset. For `INSTRUCTION_FINETUNE`, the data should be formatted with each row containing a `prompt` and `response` field. For `CONTINUED_PRETRAIN`, this is a folder of `.txt` files. See [Prepare data for Foundation Model Training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/data-preparation.html) for accepted data formats and [Recommended data size for model training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html#data-size) for data size recommendations. 
|\n| `register_to` | x | str | The Unity Catalog catalog and schema (`<catalog>.<schema>` or `<catalog>.<schema>.<custom-name>`) where the model is registered after training for easy deployment. If `custom-name` is not provided, this defaults to the run name. |\n| `data_prep_cluster_id` | | str | The cluster ID of the cluster to use for Spark data processing. This is required for supervised training tasks where the training data is in a Delta table. For information on how to find the cluster ID, see [Get cluster id](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/create-fine-tune-run.html#cluster-id). |\n| `experiment_path` | | str | The path to the MLflow experiment where the training run output (metrics and checkpoints) is saved. Defaults to the run name within the user\u2019s personal workspace (i.e. `\/Users\/<username>\/<run_name>`). |\n| `task_type` | | str | The type of task to run. Can be `INSTRUCTION_FINETUNE` (default), `CHAT_COMPLETION`, or `CONTINUED_PRETRAIN`. |\n| `eval_data_path` | | str | The remote location of your evaluation data (if any). Must follow the same format as `train_data_path`. |\n| `eval_prompts` | | str | A list of prompt strings to generate responses during evaluation. Default is `None` (do not generate prompts). Results are logged to the experiment every time the model is checkpointed. Generations occur at every model checkpoint with the following generation parameters: `max_new_tokens: 100`, `temperature: 1`, `top_k: 50`, `top_p: 0.95`, `do_sample: true`. |\n| `custom_weights_path` | | str | The remote location of a custom model checkpoint for training. Default is `None`, meaning the run starts from the original pretrained weights of the chosen model. If custom weights are provided, these weights are used instead of the original pretrained weights of the model. 
These weights must be a Composer checkpoint and must match the architecture of the `model` specified. |\n| `training_duration` | | str | The total duration of your run. Default is one epoch (`1ep`). Can be specified in epochs (`10ep`) or tokens (`1000000tok`). |\n| `learning_rate` | | str | The learning rate for model training. Default is `5e-7`. The optimizer is DecoupledLionW with betas of 0.99 and 0.95 and no weight decay. The learning rate scheduler is LinearWithWarmupSchedule with a warmup of 2% of the total training duration and a final learning rate multiplier of 0. |\n| `context_length` | | str | The maximum sequence length of a data sample. This is used to truncate any data that is too long and to package shorter sequences together for efficiency. Defaults to the context length of the provided model. Increasing the context length beyond each model\u2019s default is not supported. See [Supported models](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html#supported-model) for the context length of each model. |\n| `validate_inputs` | | Boolean | Whether to validate access to the input paths before submitting the training job. Default is `True`. | \n### Build on custom model weights \nFoundation Model Training supports training any of the supported models starting from custom weights using the optional parameter `custom_weights_path`. \nFor example, you can create a domain-specific model with your custom data and then pass the desired checkpoint as an input for further training. \nYou can provide the remote location to the [Composer](https:\/\/github.com\/mosaicml\/composer\/) checkpoint from your previous run for training. Checkpoint paths can be found in the **Artifacts** tab of a previous MLflow run and are of the form: `dbfs:\/databricks\/mlflow-tracking\/<experiment_id>\/<run_id>\/artifacts\/<run_name>\/checkpoints\/<checkpoint_folder>[.symlink]`, where the symlink extension is optional. 
This checkpoint folder name corresponds to the batch and epoch of a particular snapshot, such as `ep29-ba30\/`. The final snapshot is accessible with the symlink `latest-sharded-rank0.symlink`. \n![Artifacts tab for a previous MLflow run](https:\/\/docs.databricks.com\/_images\/checkpoint-path.png) \nThe path can then be passed to the `custom_weights_path` parameter in your configuration. \n```\nmodel = 'meta-llama\/Llama-2-7b-chat-hf'\ncustom_weights_path = 'your\/checkpoint\/path'\n\n``` \n### Get cluster id \nTo retrieve the cluster id: \n1. In the left nav bar of the Databricks workspace, click **Compute**.\n2. In the table, click the name of your cluster.\n3. Click ![More button](https:\/\/docs.databricks.com\/_images\/more-button.png) in the upper-right corner and select **View JSON** from the drop-down menu.\n4. The Cluster JSON file appears. Copy the cluster id, which is the first line in the file. \n![cluster id](https:\/\/docs.databricks.com\/_images\/cluster-id.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/create-fine-tune-run.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### Create a training run using the Foundation Model Training API\n##### Get status of a run\n\nYou can track the progress of a run using the Experiment page in the Databricks UI or using the API command `get_events()`. For details, see [View, manage, and analyze Foundation Model Training runs](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/view-manage-runs.html). 
\nExample output from `get_events()`: \n![Use API to get run status](https:\/\/docs.databricks.com\/_images\/get-events-output.png) \nSample run details on the Experiment page: \n![Get run status from the experiments UI](https:\/\/docs.databricks.com\/_images\/run-details.png)\n\n#### Create a training run using the Foundation Model Training API\n##### Next steps\n\nAfter your training run is complete, you can review metrics in MLflow and deploy your model for inference. See steps 5 through 7 of [Tutorial: Create and deploy a training run using Foundation Model Training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/fine-tune-run-tutorial.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/create-fine-tune-run.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### Create a training run using the Foundation Model Training API\n##### Additional resources\n\n* [Foundation Model Training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html)\n* [Tutorial: Create and deploy a training run using Foundation Model Training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/fine-tune-run-tutorial.html)\n* [Create a training run using the Foundation Model Training UI](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/ui.html)\n* [View, manage, and analyze Foundation Model Training runs](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/view-manage-runs.html)\n* [Prepare data for Foundation Model Training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/data-preparation.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/create-fine-tune-run.html"} +{"content":"# What is Delta Lake?\n### Incrementally clone Parquet 
and Iceberg tables to Delta Lake\n\nYou can use Databricks clone functionality to incrementally convert data from Parquet or Iceberg data sources to managed or external Delta tables. \nDatabricks clone for Parquet and Iceberg combines functionality used to [clone Delta tables](https:\/\/docs.databricks.com\/delta\/clone.html) and [convert tables to Delta Lake](https:\/\/docs.databricks.com\/delta\/convert-to-delta.html). This article describes use cases and limitations for this feature and provides examples. \nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nNote \nThis feature requires Databricks Runtime 11.3 or above.\n\n### Incrementally clone Parquet and Iceberg tables to Delta Lake\n#### When to use clone for incremental ingestion of Parquet or Iceberg data\n\nDatabricks provides a number of options for [ingesting data into the lakehouse](https:\/\/docs.databricks.com\/ingestion\/index.html). Databricks recommends using clone to ingest Parquet or Iceberg data in the following situations: \nNote \nThe term *source table* refers to the table and data files to be cloned, while the *target table* refers to the Delta table created or updated by the operation. 
\n* You are performing a migration from Parquet or Iceberg to Delta Lake, but need to continue using source tables.\n* You need to maintain an ingest-only sync between a target table and production source table that receives appends, updates, and deletes.\n* You want to create an [ACID-compliant](https:\/\/docs.databricks.com\/lakehouse\/acid.html) snapshot of source data for reporting, machine learning, or batch ETL.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clone-parquet.html"} +{"content":"# What is Delta Lake?\n### Incrementally clone Parquet and Iceberg tables to Delta Lake\n#### What is the syntax for clone?\n\nClone for Parquet and Iceberg uses the same basic syntax used to clone Delta tables, with support for shallow and deep clones. For more information, see [Clone types](https:\/\/docs.databricks.com\/delta\/clone.html#clone-types). \nDatabricks recommends using clone incrementally for most workloads. Clone support for Parquet and Iceberg uses SQL syntax. \nNote \nClone for Parquet and Iceberg has different requirements and guarantees than either clone or convert to Delta. See [Requirements and limitations for cloning Parquet and Iceberg tables](https:\/\/docs.databricks.com\/delta\/clone-parquet.html#limitations). 
\nTo deep clone a Parquet or Iceberg table using a file path, use the following syntax: \n```\nCREATE OR REPLACE TABLE <target-table-name> CLONE parquet.`\/path\/to\/data`;\n\nCREATE OR REPLACE TABLE <target-table-name> CLONE iceberg.`\/path\/to\/data`;\n\n``` \nTo shallow clone a Parquet or Iceberg table using a file path, use the following syntax: \n```\nCREATE OR REPLACE TABLE <target-table-name> SHALLOW CLONE parquet.`\/path\/to\/data`;\n\nCREATE OR REPLACE TABLE <target-table-name> SHALLOW CLONE iceberg.`\/path\/to\/data`;\n\n``` \nYou can also create deep or shallow clones for Parquet tables registered to the metastore, as shown in the following examples: \n```\nCREATE OR REPLACE TABLE <target-table-name> CLONE <source-table-name>;\n\nCREATE OR REPLACE TABLE <target-table-name> SHALLOW CLONE <source-table-name>;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clone-parquet.html"} +{"content":"# What is Delta Lake?\n### Incrementally clone Parquet and Iceberg tables to Delta Lake\n#### Requirements and limitations for cloning Parquet and Iceberg tables\n\nWhether using deep or shallow clones, changes applied to the target table after the clone occurs cannot be synced back to the source table. Incremental syncing with clone is unidirectional, allowing changes to source tables to be automatically applied to target Delta tables. \nThe following additional limitations apply when using clone with Parquet and Iceberg tables: \n* You must register Parquet tables with partitions to a catalog such as the Hive metastore before cloning, and you must use the table name to identify the source table. 
You cannot use path-based clone syntax for Parquet tables with partitions.\n* You cannot clone Iceberg tables that have experienced partition evolution.\n* You cannot clone Iceberg merge-on-read tables that have experienced updates, deletions, or merges.\n* The following are limitations for cloning Iceberg tables with partitions defined on truncated columns: \n+ In Databricks Runtime 12.2 LTS and below, the only truncated column type supported is `string`.\n+ In Databricks Runtime 13.3 LTS and above, you can work with truncated columns of types `string`, `long`, or `int`.\n+ Databricks does not support working with truncated columns of type `decimal`.\n* Incremental clone syncs the schema changes and properties from the source table. Any schema changes and data files written locally to the cloned table are overridden.\n* Unity Catalog does not support shallow clones.\n* You cannot use glob patterns when defining a path. \nNote \nIn Databricks Runtime 11.3, this operation does not collect file-level statistics. As such, target tables do not benefit from Delta Lake data skipping. File-level statistics are collected in Databricks Runtime 12.2 LTS and above.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clone-parquet.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### What is Databricks Feature Serving?\n\nDatabricks Feature Serving makes data in the Databricks platform available to models or applications deployed outside of Databricks. Feature Serving endpoints automatically scale to adjust to real-time traffic and provide a high-availability, low-latency service for serving features. This page describes how to set up and use Feature Serving. For a step-by-step tutorial, see [Tutorial: Deploy and query a feature serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-serving-tutorial.html). 
\nWhen you use Databricks Model Serving to serve a model that was built using features from Databricks, the model automatically looks up and transforms features for inference requests. With Databricks Feature Serving, you can serve structured data for retrieval augmented generation (RAG) applications, as well as features that are required for other applications, such as models served outside of Databricks or any other application that requires features based on data in Unity Catalog. \n![when to use feature serving](https:\/\/docs.databricks.com\/_images\/when-to-use-feature-serving.png)\n\n#### What is Databricks Feature Serving?\n##### Why use Feature Serving?\n\nDatabricks Feature Serving provides a single interface that serves pre-materialized and on-demand features. It also includes the following benefits: \n* Simplicity. Databricks handles the infrastructure. With a single API call, Databricks creates a production-ready serving environment.\n* High availability and scalability. Feature Serving endpoints automatically scale up and down to adjust to the volume of serving requests.\n* Security. Endpoints are deployed in a secure network boundary and use dedicated compute that terminates when the endpoint is deleted or scaled to zero.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### What is Databricks Feature Serving?\n##### Requirements\n\n* Databricks Runtime 14.2 ML or above.\n* To use the Python API, Feature Serving requires `databricks-feature-engineering` version 0.1.2 or above, which is built into Databricks Runtime 14.2 ML. For earlier Databricks Runtime ML versions, manually install the required version using `%pip install databricks-feature-engineering>=0.1.2`. 
If you are using a Databricks notebook, you must then restart the Python kernel by running this command in a new cell: `dbutils.library.restartPython()`.\n* To use the Databricks SDK, Feature Serving requires `databricks-sdk` version 0.18.0 or above. To manually install the required version, use `%pip install databricks-sdk>=0.18.0`. If you are using a Databricks notebook, you must then restart the Python kernel by running this command in a new cell: `dbutils.library.restartPython()`. \nDatabricks Feature Serving provides a UI and several programmatic options for creating, updating, querying, and deleting endpoints. This article includes instructions for each of the following options: \n* Databricks UI\n* REST API\n* Python API\n* Databricks SDK \nTo use the REST API or MLflow Deployments SDK, you must have a Databricks API token. \nImportant \nAs a security best practice for production scenarios, Databricks recommends that you use [machine-to-machine OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html) for authentication during production. \nFor testing and development, Databricks recommends using a personal access token belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. 
To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### What is Databricks Feature Serving?\n##### Authentication for Feature Serving\n\nFor information about authentication, see [Authentication for Databricks automation - overview](https:\/\/docs.databricks.com\/dev-tools\/auth\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### What is Databricks Feature Serving?\n##### Create a `FeatureSpec`\n\nA `FeatureSpec` is a user-defined set of features and functions. You can combine features and functions in a `FeatureSpec`. `FeatureSpecs` are stored in and managed by Unity Catalog and appear in Catalog Explorer. \nThe tables specified in a `FeatureSpec` must be published to an online table or a third-party online store. See [Use online tables for real-time feature serving](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html) or [Third-party online stores](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-feature-stores.html). \nYou must use the `databricks-feature-engineering` package to create a `FeatureSpec`. 
\n```\nfrom databricks.feature_engineering import (\nFeatureFunction,\nFeatureLookup,\nFeatureEngineeringClient,\n)\n\nfe = FeatureEngineeringClient()\n\nfeatures = [\n# Lookup column `average_yearly_spend` and `country` from a table in UC by the input `user_id`.\nFeatureLookup(\ntable_name=\"main.default.customer_profile\",\nlookup_key=\"user_id\",\nfeatures=[\"average_yearly_spend\", \"country\"]\n),\n# Calculate a new feature called `spending_gap` - the difference between `ytd_spend` and `average_yearly_spend`.\nFeatureFunction(\nudf_name=\"main.default.difference\",\noutput_name=\"spending_gap\",\n# Bind the function parameter with input from other features or from request.\n# The function calculates a - b.\ninput_bindings={\"a\": \"ytd_spend\", \"b\": \"average_yearly_spend\"},\n),\n]\n\n# Create a `FeatureSpec` with the features defined above.\n# The `FeatureSpec` can be accessed in Unity Catalog as a function.\nfe.create_feature_spec(\nname=\"main.default.customer_features\",\nfeatures=features,\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### What is Databricks Feature Serving?\n##### Create an endpoint\n\nThe `FeatureSpec` defines the endpoint. For more information, see [Create custom model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html), [the Python API documentation](https:\/\/api-docs.databricks.com\/python\/feature-engineering\/latest\/ml_features.endpoint_core_config.html), or the [Databricks SDK documentation](https:\/\/github.com\/databricks\/databricks-sdk-py\/blob\/e290be90ac0bcf8d01207f8186bc757890c75a09\/databricks\/sdk\/service\/serving.py#L159) for details. 
\nNote \nFor workloads that are latency sensitive or require high queries per second, Model Serving offers route optimization on custom model serving endpoints. See [Configure route optimization on serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/route-optimization.html). \n```\ncurl -X POST -u token:$DATABRICKS_API_TOKEN ${WORKSPACE_URL}\/api\/2.0\/serving-endpoints \\\n-H 'Content-Type: application\/json' \\\n-d '{\n\"name\": \"customer-features\",\n\"config\": {\n\"served_entities\": [\n{\n\"entity_name\": \"main.default.customer_features\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": true\n}\n]\n}\n}'\n\n``` \n```\nfrom databricks.sdk import WorkspaceClient\nfrom databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput\n\nworkspace = WorkspaceClient()\n\n# Create endpoint\nworkspace.serving_endpoints.create(\nname=\"my-serving-endpoint\",\nconfig = EndpointCoreConfigInput(\nserved_entities=[\nServedEntityInput(\nentity_name=\"main.default.customer_features\",\nscale_to_zero_enabled=True,\nworkload_size=\"Small\"\n)\n]\n)\n)\n\n``` \n```\nfrom databricks.feature_engineering.entities.feature_serving_endpoint import (\nServedEntity,\nEndpointCoreConfig,\n)\n\nfe.create_feature_serving_endpoint(\nname=\"customer-features\",\nconfig=EndpointCoreConfig(\nserved_entities=ServedEntity(\nfeature_spec_name=\"main.default.customer_features\",\nworkload_size=\"Small\",\nscale_to_zero_enabled=True,\ninstance_profile_arn=None,\n)\n)\n)\n\n``` \nTo see the endpoint, click **Serving** in the left sidebar of the Databricks UI. When the state is **Ready**, the endpoint is ready to respond to queries. 
To learn more about Databricks Model Serving, see [Databricks Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### What is Databricks Feature Serving?\n##### Get an endpoint\n\nYou can use the Databricks SDK or the Python API to get the metadata and status of an endpoint. \n```\nendpoint = workspace.serving_endpoints.get(name=\"customer-features\")\n# print(endpoint)\n\n``` \n```\nendpoint = fe.get_feature_serving_endpoint(name=\"customer-features\")\n# print(endpoint)\n\n```\n\n#### What is Databricks Feature Serving?\n##### Get the schema of an endpoint\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nYou can use the REST API to get the schema of an endpoint. For more information about the endpoint schema, see [Get a model serving endpoint schema](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/manage-serving-endpoints.html#get-schema). \n```\nACCESS_TOKEN=<token>\nENDPOINT_NAME=<endpoint name>\n\ncurl \"https:\/\/example.databricks.com\/api\/2.0\/serving-endpoints\/$ENDPOINT_NAME\/openapi\" -H \"Authorization: Bearer $ACCESS_TOKEN\" -H \"Content-Type: application\/json\"\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### What is Databricks Feature Serving?\n##### Query an endpoint\n\nYou can use the REST API, the MLflow Deployments SDK, or the Serving UI to query an endpoint. \nThe following code shows how to set up credentials and create the client when using the MLflow Deployments SDK. 
\n```\n# Set up credentials\nexport DATABRICKS_HOST=...\nexport DATABRICKS_TOKEN=...\n\n``` \n```\n# Set up the client\nimport mlflow.deployments\n\nclient = mlflow.deployments.get_deploy_client(\"databricks\")\n\n``` \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \n### Query an endpoint using APIs \nThis section includes examples of querying an endpoint using the REST API or the MLflow Deployments SDK. \n```\ncurl -X POST -u token:$DATABRICKS_API_TOKEN $ENDPOINT_INVOCATION_URL \\\n-H 'Content-Type: application\/json' \\\n-d '{\"dataframe_records\": [\n{\"user_id\": 1, \"ytd_spend\": 598},\n{\"user_id\": 2, \"ytd_spend\": 280}\n]}'\n\n``` \nImportant \nThe following example uses the `predict()` API from the [MLflow Deployments SDK](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.deployments.html#mlflow.deployments.DatabricksDeploymentClient.predict). This API is [Experimental](https:\/\/docs.databricks.com\/release-notes\/release-types.html) and the API definition might change. \n```\nimport mlflow.deployments\n\nclient = mlflow.deployments.get_deploy_client(\"databricks\")\nresponse = client.predict(\nendpoint=\"test-feature-endpoint\",\ninputs={\n\"dataframe_records\": [\n{\"user_id\": 1, \"ytd_spend\": 598},\n{\"user_id\": 2, \"ytd_spend\": 280},\n]\n},\n)\n\n``` \n### Query an endpoint using the UI \nYou can query a serving endpoint directly from the Serving UI. 
The UI includes generated code examples that you can use to query the endpoint. \n1. In the left sidebar of the Databricks workspace, click **Serving**.\n2. Click the endpoint you want to query.\n3. In the upper-right of the screen, click **Query endpoint**. \n![query endpoint button](https:\/\/docs.databricks.com\/_images\/query-endpoint-button.png)\n4. In the **Request** box, type the request body in JSON format.\n5. Click **Send request**. \n```\n\/\/ Example of a request body.\n{\n\"dataframe_records\": [\n{\"user_id\": 1, \"ytd_spend\": 598},\n{\"user_id\": 2, \"ytd_spend\": 280}\n]\n}\n\n``` \nThe **Query endpoint** dialog includes generated example code in curl, Python, and SQL. Click the tabs to view and copy the example code. \n![query endpoint dialog](https:\/\/docs.databricks.com\/_images\/query-endpoint-dialog.png) \nTo copy the code, click the copy icon in the upper-right of the text box. \n![copy button in query endpoint dialog](https:\/\/docs.databricks.com\/_images\/query-endpoint-dialog-with-code.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### What is Databricks Feature Serving?\n##### Update an endpoint\n\nYou can update an endpoint using the REST API, the Databricks SDK, or the Serving UI. 
\n### Update an endpoint using APIs \n```\ncurl -X PUT -u token:$DATABRICKS_API_TOKEN ${WORKSPACE_URL}\/api\/2.0\/serving-endpoints\/<endpoint_name>\/config \\\n-H 'Content-Type: application\/json' \\\n-d '{\n\"served_entities\": [\n{\n\"name\": \"customer-features\",\n\"entity_name\": \"main.default.customer_features_new\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": true\n}\n]\n}'\n\n``` \n```\nworkspace.serving_endpoints.update_config(\nname=\"my-serving-endpoint\",\nserved_entities=[\nServedEntityInput(\nentity_name=\"main.default.customer_features\",\nscale_to_zero_enabled=True,\nworkload_size=\"Small\"\n)\n]\n)\n\n``` \n### Update an endpoint using the UI \nFollow these steps to use the Serving UI: \n1. In the left sidebar of the Databricks workspace, click **Serving**.\n2. In the table, click the name of the endpoint you want to update. The endpoint screen appears.\n3. In the upper-right of the screen, click **Edit endpoint**.\n4. In the **Edit serving endpoint** dialog, edit the endpoint settings as needed.\n5. Click **Update** to save your changes. \n![update an endpoint](https:\/\/docs.databricks.com\/_images\/update-endpoint.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### What is Databricks Feature Serving?\n##### Delete an endpoint\n\nWarning \nThis action is irreversible. \nYou can delete an endpoint using the REST API, the Databricks SDK, the Python API, or the Serving UI. \n### Delete an endpoint using APIs \n```\ncurl -X DELETE -u token:$DATABRICKS_API_TOKEN ${WORKSPACE_URL}\/api\/2.0\/serving-endpoints\/<endpoint_name>\n\n``` \n```\nworkspace.serving_endpoints.delete(name=\"customer-features\")\n\n``` \n```\nfe.delete_feature_serving_endpoint(name=\"customer-features\")\n\n``` \n### Delete an endpoint using the UI \nFollow these steps to delete an endpoint using the Serving UI: \n1. 
In the left sidebar of the Databricks workspace, click **Serving**.\n2. In the table, click the name of the endpoint you want to delete. The endpoint screen appears.\n3. In the upper-right of the screen, click the kebab menu ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) and select **Delete**. \n![delete an endpoint](https:\/\/docs.databricks.com\/_images\/delete-endpoint.png)\n\n#### What is Databricks Feature Serving?\n##### Monitor the health of an endpoint\n\nFor information about the logs and metrics available for Feature Serving endpoints, see [Monitor model quality and endpoint health](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/monitor-diagnose-endpoints.html).\n\n#### What is Databricks Feature Serving?\n##### Access control\n\nFor information about permissions on Feature Serving endpoints, see [Manage permissions on your model serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/manage-serving-endpoints.html#permissions).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### What is Databricks Feature Serving?\n##### Example notebook\n\nThis notebook illustrates how to use the Databricks SDK to create a Feature Serving endpoint using Databricks Online Tables. 
\n### Feature Serving example notebook with online tables \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/feature-function-serving-online-tables-dbsdk.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Foundation model REST API reference\n\nThis article provides general API information for Databricks Foundation Model APIs and the models they support. The Foundation Model APIs are designed to be similar to OpenAI\u2019s REST API to make migrating existing projects easier. Both the pay-per-token and provisioned throughput endpoints accept the same REST API request format.\n\n#### Foundation model REST API reference\n##### Endpoints\n\nEach pay-per-token model has a single endpoint, and users can interact with these endpoints using HTTP POST requests. Provisioned throughput endpoints can be [created using the API or the Serving UI](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html). These endpoints also support multiple models per endpoint for A\/B testing, as long as both served models expose the same API format (for example, both are chat models). \nRequests and responses use JSON. The exact JSON structure depends on an endpoint\u2019s task type. Chat and completion endpoints support streaming responses. \nPay-per-token workloads support only certain models. See [Supported models for pay-per-token](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html) for those models and accepted API formats.\n\n#### Foundation model REST API reference\n##### Usage\n\nResponses include a `usage` sub-message which reports the number of tokens in the request and response. 
The format of this sub-message is the same across all task types. \n| Field | Type | Description |\n| --- | --- | --- |\n| `completion_tokens` | Integer | Number of generated tokens. Not included in embedding responses. |\n| `prompt_tokens` | Integer | Number of tokens from the input prompt(s). |\n| `total_tokens` | Integer | Number of total tokens. | \nFor models like `llama-2-70b-chat` a user prompt is transformed using a prompt template before being passed into the model. For pay-per-token endpoints, a system prompt might also be added. `prompt_tokens` includes all text added by our server.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Foundation model REST API reference\n##### Chat task\n\nChat tasks are optimized for multi-turn conversations with a model. Each request describes the conversation so far, where the `messages` field must alternate between `user` and `assistant` roles, ending with a `user` message. The model response provides the next `assistant` message in the conversation. \n### Chat request \n| Field | Default | Type | Description |\n| --- | --- | --- | --- |\n| `messages` | | [ChatMessage](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#chat-message) list | A list of messages representing the current conversation. **(Required)** |\n| `max_tokens` | `nil` | Integer greater than zero or `nil`, which represents infinity | The maximum number of tokens to generate. |\n| `stream` | `true` | Boolean | Stream responses back to a client in order to allow partial results for requests. If this parameter is included in the request, responses are sent using the [Server-sent events](https:\/\/html.spec.whatwg.org\/multipage\/server-sent-events.html#server-sent-events) standard. |\n| `temperature` | `1.0` | Float in [0,2] | The sampling temperature. 
0 is deterministic and higher values introduce more randomness. |\n| `top_p` | `1.0` | Float in (0,1] | The probability threshold used for nucleus sampling. |\n| `top_k` | `nil` | Integer greater than zero or `nil`, which represents infinity | Defines the number of k most likely tokens to use for top-k-filtering. Set this value to 1 to make outputs deterministic. |\n| `stop` | [] | String or List[String] | Model stops generating further tokens when any one of the sequences in `stop` is encountered. |\n| `n` | 1 | Integer greater than zero | The API returns `n` independent chat completions when `n` is specified. Recommended for workloads that generate multiple completions on the same input for additional inference efficiency and cost savings. Only available for provisioned throughput endpoints. | \n#### `ChatMessage` \n| Field | Type | Description |\n| --- | --- | --- |\n| `role` | String | **Required.** The role of the author of the message. Can be `\"system\"`, `\"user\"`, or `\"assistant\"`. |\n| `content` | String | **Required.** The content of the message. | \nThe `system` role can only be used once, as the first message in a conversation. It overrides the model\u2019s default system prompt. \n### Chat response \nFor non-streaming requests, the response is a single chat completion object. For streaming requests, the response is a `text\/event-stream` where each event is a completion chunk object. The top-level structure of completion and chunk objects is almost identical: only `choices` has a different type. \n| Field | Type | Description |\n| --- | --- | --- |\n| `id` | String | Unique identifier for the chat completion. |\n| `choices` | List[[ChatCompletionChoice](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#chat-choice)] or List[[ChatCompletionChunk](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#chat-chunk)] (streaming) | List of chat completion texts. 
`n` choices are returned if the `n` parameter is specified. |\n| `object` | String | The object type. Equal to either `\"chat.completions\"` for non-streaming or `\"chat.completion.chunk\"` for streaming. |\n| `created` | Integer | The time the chat completion was generated in seconds. |\n| `model` | String | The model version used to generate the response. |\n| `usage` | [Usage](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#usage) | Token usage metadata. May not be present on streaming responses. | \n#### `ChatCompletionChoice` \n| Field | Type | Description |\n| --- | --- | --- |\n| `index` | Integer | The index of the choice in the list of generated choices. |\n| `message` | [ChatMessage](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#chat-message) | A chat completion message returned by the model. The role will be `assistant`. |\n| `finish_reason` | String | The reason the model stopped generating tokens. | \n#### `ChatCompletionChunk` \n| Field | Type | Description |\n| --- | --- | --- |\n| `index` | Integer | The index of the choice in the list of generated choices. |\n| `delta` | [ChatMessage](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#chat-message) | A chat completion message part of generated streamed responses from the model. Only the first chunk is guaranteed to have `role` populated. |\n| `finish_reason` | String | The reason the model stopped generating tokens. Only the last chunk will have this populated. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Foundation model REST API reference\n##### Completion task\n\nText completion tasks are for generating responses to a single prompt. 
Unlike Chat, this task supports batched inputs: multiple independent prompts can be sent in one request. \n### Completion request \n| Field | Default | Type | Description |\n| --- | --- | --- | --- |\n| `prompt` | | String or List[String] | The prompt(s) for the model. **(Required)** |\n| `max_tokens` | `nil` | Integer greater than zero or `nil`, which represents infinity | The maximum number of tokens to generate. |\n| `stream` | `true` | Boolean | Stream responses back to a client in order to allow partial results for requests. If this parameter is included in the request, responses are sent using the [Server-sent events](https:\/\/html.spec.whatwg.org\/multipage\/server-sent-events.html#server-sent-events) standard. |\n| `temperature` | `1.0` | Float in [0,2] | The sampling temperature. 0 is deterministic and higher values introduce more randomness. |\n| `top_p` | `1.0` | Float in (0,1] | The probability threshold used for nucleus sampling. |\n| `top_k` | `nil` | Integer greater than zero or `nil`, which represents infinity | Defines the number of k most likely tokens to use for top-k-filtering. Set this value to 1 to make outputs deterministic. |\n| `error_behavior` | `\"error\"` | `\"truncate\"` or `\"error\"` | For timeouts and context-length-exceeded errors. One of: `\"truncate\"` (return as many tokens as possible) and `\"error\"` (return an error). This parameter is only accepted by pay per token endpoints. |\n| `n` | 1 | Integer greater than zero | The API returns `n` independent chat completions when `n` is specified. Recommended for workloads that generate multiple completions on the same input for additional inference efficiency and cost savings. Only available for provisioned throughput endpoints. |\n| `stop` | [] | String or List[String] | Model stops generating further tokens when any one of the sequences in `stop` is encountered. |\n| `suffix` | `\"\"` | String | A string that is appended to the end of every completion. 
|\n| `echo` | `false` | Boolean | Returns the prompt along with the completion. |\n| `use_raw_prompt` | `false` | Boolean | If `true`, pass the `prompt` directly into the model without any transformation. | \n### Completion response \n| Field | Type | Description |\n| --- | --- | --- |\n| `id` | String | Unique identifier for the text completion. |\n| `choices` | [CompletionChoice](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#chat-chunk) | A list of text completions. For every prompt passed in, `n` choices are generated if `n` is specified. Default `n` is 1. |\n| `object` | String | The object type. Equal to `\"text_completion\"` |\n| `created` | Integer | The time the completion was generated in seconds. |\n| `usage` | [Usage](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#usage) | Token usage metadata. | \n#### `CompletionChoice` \n| Field | Type | Description |\n| --- | --- | --- |\n| `index` | Integer | The index of the prompt in request. |\n| `text` | String | The generated completion. |\n| `finish_reason` | String | The reason the model stopped generating tokens. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Foundation model REST API reference\n##### Embedding task\n\nEmbedding tasks map input strings into embedding vectors. Many inputs can be batched together in each request. \n### Embedding request \n| Field | Type | Description |\n| --- | --- | --- |\n| `input` | String or List[String] | The input text to embed. Can be a string or a list of strings. **(Required)** |\n| `instruction` | String | An optional instruction to pass to the embedding model. | \nInstructions are optional and highly model specific. 
For instance, the BGE authors recommend no instruction when indexing chunks and recommend the instruction `\"Represent this sentence for searching relevant passages:\"` for retrieval queries. Other models like Instructor-XL support a wide range of instruction strings. \n### Embeddings response \n| Field | Type | Description |\n| --- | --- | --- |\n| `id` | String | Unique identifier for the embedding. |\n| `object` | String | The object type. Equal to `\"list\"`. |\n| `model` | String | The name of the embedding model used to create the embedding. |\n| `data` | [EmbeddingObject](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#embedding-object) | The embedding object. |\n| `usage` | [Usage](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#usage) | Token usage metadata. | \n#### `EmbeddingObject` \n| Field | Type | Description |\n| --- | --- | --- |\n| `object` | String | The object type. Equal to `\"embedding\"`. |\n| `index` | Integer | The index of the embedding in the list of embeddings generated by the model. |\n| `embedding` | List[Float] | The embedding vector. 
Each model will return a fixed size vector (1024 for BGE-Large) |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Foundation model REST API reference\n##### Additional resources\n\n* [Databricks Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html)\n* [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html)\n* [Supported models for pay-per-token](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html"} +{"content":"# Discover data\n### Explore storage and find data files\n\nThis article focuses on discovering and exploring directories and data files managed with Unity Catalog volumes, including UI-based instructions for exploring volumes with Catalog Explorer. This article also provides examples for programmatic exploration of data in cloud object storage using volume paths and cloud URIs. \nDatabricks recommends using volumes to manage access to data in cloud object storage. For more information on connecting to data in cloud object storage, see [Connect to data sources](https:\/\/docs.databricks.com\/connect\/index.html). \nFor a full walkthrough of how to interact with files in all locations, see [Work with files on Databricks](https:\/\/docs.databricks.com\/files\/index.html). \nImportant \nWhen searching for **Files** in the workspace UI, you might discover data files stored as workspace files. Databricks recommends using workspace files primarily for code (such as scripts and libraries), init scripts, or configuration files. 
You should ideally limit data stored as workspace files to small datasets that might be used for tasks such as testing during development and QA. See [What are workspace files?](https:\/\/docs.databricks.com\/files\/workspace.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/files.html"} +{"content":"# Discover data\n### Explore storage and find data files\n#### Volumes vs. legacy cloud object configurations\n\nWhen you use volumes to manage access to data in cloud object storage, you can only use the volumes path to access data, and these paths are available with all Unity Catalog-enabled compute. You cannot register data files backing Unity Catalog tables using volumes. Databricks recommends using table names instead of file paths to interact with structured data registered as Unity Catalog tables. See [How do paths work for data managed by Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/paths.html). \nIf you use a legacy method for configuring access to data in cloud object storage, Databricks reverts to legacy table ACLs permissions. Users wishing to access data using cloud URIs from SQL warehouses or compute configured with shared access mode require the `ANY FILE` permission. See [Hive metastore table access control (legacy)](https:\/\/docs.databricks.com\/data-governance\/table-acls\/index.html). \nDatabricks provides several APIs for listing files in cloud object storage. Most examples in this article focus on using volumes. For examples on interacting with data on object storage configured without volumes, see [List files with URIs](https:\/\/docs.databricks.com\/discover\/files.html#uri).\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/files.html"} +{"content":"# Discover data\n### Explore storage and find data files\n#### Explore volumes\n\nYou can use Catalog Explorer to explore data in volumes and review the details of a volume. 
You are only able to see volumes that you have permissions to read, so you can query all data discovered this way. \nYou can use SQL to explore volumes and their metadata. To list files in volumes, you can use SQL, the `%fs` magic command, or Databricks utilities. When interacting with data in volumes, you use the path provided by Unity Catalog, which always has the following format: \n```\n\/Volumes\/catalog_name\/schema_name\/volume_name\/path\/to\/data\n\n``` \n### Display volumes \nRun the following command to see a list of volumes in a given schema. \n```\nSHOW VOLUMES IN catalog_name.schema_name;\n\n``` \nSee [SHOW VOLUMES](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-show-volumes.html). \nTo display volumes in a given schema with Catalog Explorer, do the following: \n1. Select the ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** icon.\n2. Select a catalog.\n3. Select a schema.\n4. Click **Volumes** to expand all volumes in the schema. \nNote \nIf no volumes are registered to a schema, the **Volumes** option is not displayed. Instead, you see a list of available tables. \n### See volume details \nRun the following command to describe a volume. \n```\nDESCRIBE VOLUME volume_name\n\n``` \nSee [DESCRIBE VOLUME](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-describe-volume.html). \nClick the volume name and select the **Details** tab to review volume details. \n### See files in volumes \nRun the following command to list the files in a volume. \n```\nLIST '\/Volumes\/catalog_name\/schema_name\/volume_name\/'\n\n``` \nClick the volume name and select the **Details** tab to review volume details. \nRun the following command to list the files in a volume. \n```\n%fs ls \/Volumes\/catalog_name\/schema_name\/volume_name\/\n\n``` \nRun the following command to list the files in a volume. 
\n```\ndbutils.fs.ls(\"\/Volumes\/catalog_name\/schema_name\/volume_name\/\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/files.html"} +{"content":"# Discover data\n### Explore storage and find data files\n#### List files with URIs\n\nYou can query cloud object storage configured with methods other than volumes using URIs. You must be connected to compute with privileges to access the cloud location. The `ANY FILE` permission is required on SQL warehouses and compute configured with shared access mode. \nNote \nURI access to object storage configured with volumes is not supported. You cannot use Catalog Explorer to review contents of object storage not configured with volumes. \nThe following examples include example URIs for data stored with Azure Data Lake Storage Gen2, S3, and GCS. \nRun the following command to list files in cloud object storage. \n```\n-- ADLS 2\nLIST 'abfss:\/\/container-name@storage-account-name.dfs.core.windows.net\/path\/to\/data'\n\n-- S3\nLIST 's3:\/\/bucket-name\/path\/to\/data'\n\n-- GCS\nLIST 'gs:\/\/bucket-name\/path\/to\/data'\n\n``` \nRun the following command to list files in cloud object storage. \n```\n# ADLS 2\n%fs ls abfss:\/\/container-name@storage-account-name.dfs.core.windows.net\/path\/to\/data\n\n# S3\n%fs ls s3:\/\/bucket-name\/path\/to\/data\n\n# GCS\n%fs ls gs:\/\/bucket-name\/path\/to\/data\n\n``` \nRun the following command to list files in cloud object storage. 
\n```\n\n# ADLS 2\ndbutils.fs.ls(\"abfss:\/\/container-name@storage-account-name.dfs.core.windows.net\/path\/to\/data\")\n\n# S3\ndbutils.fs.ls(\"s3:\/\/bucket-name\/path\/to\/data\")\n\n# GCS\ndbutils.fs.ls(\"gs:\/\/bucket-name\/path\/to\/data\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/files.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Models in Unity Catalog example\n\nThis example illustrates how to use Models in Unity Catalog to build a machine learning application that forecasts the daily power output of a wind farm. The example shows how to: \n* Track and log models with MLflow\n* Register models to Unity Catalog\n* Describe models and deploy them for inference using aliases\n* Integrate registered models with production applications\n* Search and discover models in Unity Catalog\n* Archive and delete models \nThe article describes how to perform these steps using the MLflow Tracking and Models in Unity Catalog UIs and APIs.\n\n##### Models in Unity Catalog example\n###### Requirements\n\nMake sure you meet all the requirements in [Requirements](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#requirements). In addition, the code examples in this article assume that you have the following privileges: \n* `USE CATALOG` privilege on the `main` catalog.\n* `CREATE MODEL` and `USE SCHEMA` privileges on the `main.default` schema.\n\n##### Models in Unity Catalog example\n###### Notebook\n\nAll of the code in this article is provided in the following notebook. 
\n### Models in Unity Catalog example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/models-in-uc-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n##### Models in Unity Catalog example\n###### Install MLflow Python client\n\nThis example requires the MLflow Python client version 2.5.0 or above and TensorFlow. Add the following commands at the top of your notebook to install these dependencies. \n```\n%pip install --upgrade \"mlflow-skinny[databricks]>=2.5.0\" tensorflow\ndbutils.library.restartPython()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/models-in-uc-example.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Models in Unity Catalog example\n###### Load dataset, train model, and register to Unity Catalog\n\nThis section shows how to load the wind farm dataset, train a model, and register the model to Unity Catalog. The model training run and metrics are tracked in an [experiment run](https:\/\/docs.databricks.com\/mlflow\/experiments.html). \n### Load dataset \nThe following code loads a dataset containing weather data and power output information for a wind farm in the United States. The dataset contains `wind direction`, `wind speed`, and `air temperature` features sampled every eight hours (once at `00:00`, once at `08:00`, and once at `16:00`), as well as daily aggregate power output (`power`), over several years. 
\n```\nimport pandas as pd\nwind_farm_data = pd.read_csv(\"https:\/\/github.com\/dbczumar\/model-registry-demo-notebook\/raw\/master\/dataset\/windfarm_data.csv\", index_col=0)\n\ndef get_training_data():\n    training_data = pd.DataFrame(wind_farm_data[\"2014-01-01\":\"2018-01-01\"])\n    X = training_data.drop(columns=\"power\")\n    y = training_data[\"power\"]\n    return X, y\n\ndef get_validation_data():\n    validation_data = pd.DataFrame(wind_farm_data[\"2018-01-01\":\"2019-01-01\"])\n    X = validation_data.drop(columns=\"power\")\n    y = validation_data[\"power\"]\n    return X, y\n\ndef get_weather_and_forecast():\n    format_date = lambda pd_date : pd_date.date().strftime(\"%Y-%m-%d\")\n    today = pd.Timestamp('today').normalize()\n    week_ago = today - pd.Timedelta(days=5)\n    week_later = today + pd.Timedelta(days=5)\n\n    past_power_output = pd.DataFrame(wind_farm_data)[format_date(week_ago):format_date(today)]\n    weather_and_forecast = pd.DataFrame(wind_farm_data)[format_date(week_ago):format_date(week_later)]\n    if len(weather_and_forecast) < 10:\n        past_power_output = pd.DataFrame(wind_farm_data).iloc[-10:-5]\n        weather_and_forecast = pd.DataFrame(wind_farm_data).iloc[-10:]\n\n    return weather_and_forecast.drop(columns=\"power\"), past_power_output[\"power\"]\n\n``` \n### Configure MLflow client to access models in Unity Catalog \nBy default, the MLflow Python client creates models in the workspace model registry on Databricks. To upgrade to models in Unity Catalog, configure the client to access models in Unity Catalog: \n```\nimport mlflow\nmlflow.set_registry_uri(\"databricks-uc\")\n\n``` \n### Train and register model \nThe following code trains a neural network using TensorFlow Keras to predict power output based on the weather features in the dataset and uses MLflow APIs to register the fitted model to Unity Catalog. 
\n```\nfrom tensorflow.keras.models import Sequential\nfrom tensorflow.keras.layers import Dense\n\nMODEL_NAME = \"main.default.wind_forecasting\"\n\ndef train_and_register_keras_model(X, y):\n    with mlflow.start_run():\n        model = Sequential()\n        model.add(Dense(100, input_shape=(X.shape[-1],), activation=\"relu\", name=\"hidden_layer\"))\n        model.add(Dense(1))\n        model.compile(loss=\"mse\", optimizer=\"adam\")\n\n        model.fit(X, y, epochs=100, batch_size=64, validation_split=.2)\n        example_input = X[:10].to_numpy()\n        mlflow.tensorflow.log_model(\n            model,\n            artifact_path=\"model\",\n            input_example=example_input,\n            registered_model_name=MODEL_NAME\n        )\n        return model\n\nX_train, y_train = get_training_data()\nmodel = train_and_register_keras_model(X_train, y_train)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/models-in-uc-example.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Models in Unity Catalog example\n###### View the model in the UI\n\nYou can view and manage registered models and model versions in Unity Catalog using the [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html). Look for the model you just created under the `main` catalog and `default` schema. \n![Registered model page](https:\/\/docs.databricks.com\/_images\/uc_registered_model.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/models-in-uc-example.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Models in Unity Catalog example\n###### Deploy a model version for inference\n\nModels in Unity Catalog support [aliases](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#uc-model-aliases) for model deployment. 
Aliases provide mutable, named references (for example, \u201cChampion\u201d or \u201cChallenger\u201d) to a particular version of a registered model. You can reference and target model versions using these aliases in downstream inference workflows. \nOnce you\u2019ve navigated to the registered model in Catalog Explorer, click under the **Aliases** column to assign the \u201cChampion\u201d alias to the latest model version, and press \u201cContinue\u201d to save changes. \n![Set registered model alias](https:\/\/docs.databricks.com\/_images\/uc_set_registered_model_alias.png) \n### Load model versions using the API \nThe MLflow Models component defines functions for loading models from several machine learning frameworks. For example, `mlflow.tensorflow.load_model()` is used to load TensorFlow models that were saved in MLflow format, and `mlflow.sklearn.load_model()` is used to load scikit-learn models that were saved in MLflow format. \nThese functions can load models from Models in Unity Catalog. \n```\nimport mlflow.pyfunc\n\nmodel_version_uri = \"models:\/{model_name}\/1\".format(model_name=MODEL_NAME)\n\nprint(\"Loading registered model version from URI: '{model_uri}'\".format(model_uri=model_version_uri))\nmodel_version_1 = mlflow.pyfunc.load_model(model_version_uri)\n\nmodel_champion_uri = \"models:\/{model_name}@Champion\".format(model_name=MODEL_NAME)\n\nprint(\"Loading registered model version from URI: '{model_uri}'\".format(model_uri=model_champion_uri))\nchampion_model = mlflow.pyfunc.load_model(model_champion_uri)\n\n``` \n### Forecast power output with the champion model \nIn this section, the champion model is used to evaluate weather forecast data for the wind farm. The `forecast_power()` application loads the version of the forecasting model assigned to the specified alias and uses it to forecast power production over the next five days. 
\n```\nfrom mlflow.tracking import MlflowClient\n\ndef plot(model_name, model_alias, model_version, power_predictions, past_power_output):\n    import matplotlib.dates as mdates\n    from matplotlib import pyplot as plt\n    index = power_predictions.index\n    fig = plt.figure(figsize=(11, 7))\n    ax = fig.add_subplot(111)\n    ax.set_xlabel(\"Date\", size=20, labelpad=20)\n    ax.set_ylabel(\"Power\\noutput\\n(MW)\", size=20, labelpad=60, rotation=0)\n    ax.tick_params(axis='both', which='major', labelsize=17)\n    ax.xaxis.set_major_formatter(mdates.DateFormatter('%m\/%d'))\n    ax.plot(index[:len(past_power_output)], past_power_output, label=\"True\", color=\"red\", alpha=0.5, linewidth=4)\n    ax.plot(index, power_predictions.squeeze(), \"--\", label=\"Predicted by '%s'\\nwith alias '%s' (Version %d)\" % (model_name, model_alias, model_version), color=\"blue\", linewidth=3)\n    ax.set_ylim(ymin=0, ymax=max(3500, int(max(power_predictions.values) * 1.3)))\n    ax.legend(fontsize=14)\n    plt.title(\"Wind farm power output and projections\", size=24, pad=20)\n    plt.tight_layout()\n    display(plt.show())\n\ndef forecast_power(model_name, model_alias):\n    import pandas as pd\n    client = MlflowClient()\n    model_version = client.get_model_version_by_alias(model_name, model_alias).version\n    model_uri = \"models:\/{model_name}@{model_alias}\".format(model_name=model_name, model_alias=model_alias)\n    model = mlflow.pyfunc.load_model(model_uri)\n    weather_data, past_power_output = get_weather_and_forecast()\n    power_predictions = pd.DataFrame(model.predict(weather_data))\n    power_predictions.index = pd.to_datetime(weather_data.index)\n    print(power_predictions)\n    plot(model_name, model_alias, int(model_version), power_predictions, past_power_output)\n\nforecast_power(MODEL_NAME, \"Champion\")\n\n``` \n### Add model and model version descriptions using the API \nThe code in this section shows how you can add model and model version descriptions using the MLflow API. 
\n```\nclient = MlflowClient()\nclient.update_registered_model(\n    name=MODEL_NAME,\n    description=\"This model forecasts the power output of a wind farm based on weather data. The weather data consists of three features: wind speed, wind direction, and air temperature.\"\n)\n\nclient.update_model_version(\n    name=MODEL_NAME,\n    version=1,\n    description=\"This model version was built using TensorFlow Keras. It is a feed-forward neural network with one hidden layer.\"\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/models-in-uc-example.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Models in Unity Catalog example\n###### Create a new model version\n\nClassical machine learning techniques are also effective for power forecasting. The following code trains a random forest model using scikit-learn and registers it to Unity Catalog using the `mlflow.sklearn.log_model()` function. \n```\nimport mlflow.sklearn\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.metrics import mean_squared_error\n\nwith mlflow.start_run():\n    n_estimators = 300\n    mlflow.log_param(\"n_estimators\", n_estimators)\n\n    rand_forest = RandomForestRegressor(n_estimators=n_estimators)\n    rand_forest.fit(X_train, y_train)\n\n    val_x, val_y = get_validation_data()\n    mse = mean_squared_error(rand_forest.predict(val_x), val_y)\n    print(\"Validation MSE: %d\" % mse)\n    mlflow.log_metric(\"mse\", mse)\n\n    example_input = val_x.iloc[[0]]\n\n    # Specify the `registered_model_name` parameter of the `mlflow.sklearn.log_model()`\n    # function to register the model to Unity Catalog. 
This automatically\n    # creates a new model version\n    mlflow.sklearn.log_model(\n        sk_model=rand_forest,\n        artifact_path=\"sklearn-model\",\n        input_example=example_input,\n        registered_model_name=MODEL_NAME\n    )\n\n``` \n### Fetch the new model version number \nThe following code shows how to retrieve the latest model version number for a model name. \n```\nclient = MlflowClient()\nmodel_version_infos = client.search_model_versions(\"name = '%s'\" % MODEL_NAME)\n# Compare versions as integers so that, for example, version 10 sorts above version 9\nnew_model_version = max(int(model_version_info.version) for model_version_info in model_version_infos)\n\n``` \n### Add a description to the new model version \n```\nclient.update_model_version(\n    name=MODEL_NAME,\n    version=new_model_version,\n    description=\"This model version is a random forest containing 300 decision trees that was trained in scikit-learn.\"\n)\n\n``` \n### Mark new model version as Challenger and test the model \nBefore deploying a model to serve production traffic, it is a best practice to test it on a sample of production data. Previously, you used the \u201cChampion\u201d alias to denote the model version serving the majority of production workloads. The following code assigns the \u201cChallenger\u201d alias to the new model version, and evaluates its performance. \n```\nclient.set_registered_model_alias(\n    name=MODEL_NAME,\n    alias=\"Challenger\",\n    version=new_model_version\n)\n\nforecast_power(MODEL_NAME, \"Challenger\")\n\n``` \n### Deploy the new model version as the Champion model version \nAfter verifying that the new model version performs well in tests, the following code assigns the \u201cChampion\u201d alias to the new model version and uses the exact same application code from the [Forecast power output with the champion model](https:\/\/docs.databricks.com\/mlflow\/models-in-uc-example.html#forecast-power-output-with-the-champion-model) section to produce a power forecast. 
\n```\nclient.set_registered_model_alias(\nname=MODEL_NAME,\nalias=\"Champion\",\nversion=new_model_version\n)\n\nforecast_power(MODEL_NAME, \"Champion\")\n\n``` \nThere are now two versions of the forecasting model: the version trained in Keras and the version trained in scikit-learn. Note that the \u201cChallenger\u201d alias remains assigned to the new scikit-learn model version, so any downstream workloads that target the \u201cChallenger\u201d model version continue to run successfully: \n![Product model versions](https:\/\/docs.databricks.com\/_images\/uc_model_versions.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/models-in-uc-example.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Models in Unity Catalog example\n###### Archive and delete models\n\nWhen a model version is no longer being used, you can delete it. You can also delete an entire registered model; this removes all associated model versions. Note that deleting a model version clears any aliases assigned to the model version. \n### Delete `Version 1` using the MLflow API \n```\nclient.delete_model_version(\nname=MODEL_NAME,\nversion=1,\n)\n\n``` \n### Delete the model using the MLflow API \n```\nclient = MlflowClient()\nclient.delete_registered_model(name=MODEL_NAME)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/models-in-uc-example.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use Databricks SQL in a Databricks job\n\nYou can use the SQL task type in a Databricks job, allowing you to create, schedule, operate, and monitor workflows that include Databricks SQL objects such as queries, legacy dashboards, and alerts. 
For example, your workflow can ingest data, prepare the data, perform analysis using Databricks SQL queries, and then display the results in a legacy dashboard. \nThis article provides an example workflow that creates a legacy dashboard displaying metrics for GitHub contributions. In this example, you will: \n* Ingest GitHub data using a Python script and the [GitHub REST API](https:\/\/docs.github.com\/rest).\n* Transform the GitHub data using a Delta Live Tables pipeline.\n* Trigger Databricks SQL queries performing analysis on the prepared data.\n* Display the analysis in a legacy dashboard. \n![GitHub analysis dashboard](https:\/\/docs.databricks.com\/_images\/github-dashboard.png)\n\n##### Use Databricks SQL in a Databricks job\n###### Before you begin\n\nYou need the following to complete this walkthrough: \n* A [GitHub personal access token](https:\/\/docs.github.com\/authentication\/keeping-your-account-and-data-secure\/creating-a-personal-access-token). This token must have the **repo** permission.\n* A serverless SQL warehouse or a pro SQL warehouse. See [SQL warehouse types](https:\/\/docs.databricks.com\/admin\/sql\/warehouse-types.html).\n* A Databricks [secret scope](https:\/\/docs.databricks.com\/security\/secrets\/secret-scopes.html). The secret scope is used to securely store the GitHub token. 
See [Step 1: Store the GitHub token in a secret](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html#add-token-to-secret).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use Databricks SQL in a Databricks job\n###### Step 1: Store the GitHub token in a secret\n\nInstead of hardcoding credentials such as the GitHub personal access token in a job, Databricks recommends using a secret scope to store and manage secrets securely. The following Databricks CLI commands are an example of creating a secret scope and storing the GitHub token in a secret in that scope: \n```\ndatabricks secrets create-scope <scope-name>\ndatabricks secrets put-secret <scope-name> <token-key> --string-value <token>\n\n``` \n* Replace `<scope-name>` with the name of a Databricks secret scope to store the token.\n* Replace `<token-key>` with the name of a key to assign to the token.\n* Replace `<token>` with the value of the GitHub personal access token.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use Databricks SQL in a Databricks job\n###### Step 2: Create a script to fetch GitHub data\n\nThe following Python script uses the GitHub REST API to fetch data on commits and contributions from a GitHub repo. Input arguments specify the GitHub repo. The records are saved to a location in DBFS specified by another input argument. 
\nThis example uses DBFS to store the Python script, but you can also use [Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html) or [workspace files](https:\/\/docs.databricks.com\/files\/workspace.html) to store and manage the script. \n* Save this script to a location on your local disk: \n```\nimport json\nimport requests\nimport sys\n\napi_url = \"https:\/\/api.github.com\"\n\ndef get_commits(owner, repo, token, path):\n    page = 1\n    request_url = f\"{api_url}\/repos\/{owner}\/{repo}\/commits\"\n    more = True\n\n    get_response(request_url, f\"{path}\/commits\", token)\n\ndef get_contributors(owner, repo, token, path):\n    page = 1\n    request_url = f\"{api_url}\/repos\/{owner}\/{repo}\/contributors\"\n    more = True\n\n    get_response(request_url, f\"{path}\/contributors\", token)\n\ndef get_response(request_url, path, token):\n    page = 1\n    more = True\n\n    while more:\n        response = requests.get(request_url, params={'page': page}, headers={'Authorization': \"token \" + token})\n        if response.text != \"[]\":\n            write(path + \"\/records-\" + str(page) + \".json\", response.text)\n            page += 1\n        else:\n            more = False\n\ndef write(filename, contents):\n    dbutils.fs.put(filename, contents)\n\ndef main():\n    args = sys.argv[1:]\n    if len(args) < 6:\n        print(\"Usage: github-api.py owner repo request output-dir secret-scope secret-key\")\n        sys.exit(1)\n\n    owner = sys.argv[1]\n    repo = sys.argv[2]\n    request = sys.argv[3]\n    output_path = sys.argv[4]\n    secret_scope = sys.argv[5]\n    secret_key = sys.argv[6]\n\n    token = dbutils.secrets.get(scope=secret_scope, key=secret_key)\n\n    if (request == \"commits\"):\n        get_commits(owner, repo, token, output_path)\n    elif (request == \"contributors\"):\n        get_contributors(owner, repo, token, output_path)\n\nif __name__ == \"__main__\":\n    main()\n\n```\n* Upload the script to DBFS: \n1. Go to your Databricks landing page and click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** in the sidebar.\n2. Click **Browse DBFS**.\n3. 
In the DBFS file browser, click **Upload**. The **Upload Data to DBFS** dialog appears.\n4. Enter a path in DBFS to store the script, click **Drop files to upload, or click to browse**, and select the Python script.\n5. Click **Done**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use Databricks SQL in a Databricks job\n###### Step 3: Create a Delta Live Tables pipeline to process the GitHub data\n\nIn this section, you create a Delta Live Tables pipeline to convert the raw GitHub data into tables that can be analyzed by Databricks SQL queries. To create the pipeline, perform the following steps: \n1. In the sidebar, click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** and select **Notebook** from the menu. The **Create Notebook** dialog appears.\n2. In **Default Language**, enter a name and select **Python**. You can leave **Cluster** set to the default value. The Delta Live Tables runtime creates a cluster before it runs your pipeline.\n3. Click **Create**.\n4. Copy the Python code example and paste it into your new notebook. You can add the example code to a single cell of the notebook or multiple cells. 
\n```\nimport dlt\nfrom pyspark.sql.functions import *\n\ndef parse(df):\n    return (df\n        .withColumn(\"author_date\", to_timestamp(col(\"commit.author.date\")))\n        .withColumn(\"author_email\", col(\"commit.author.email\"))\n        .withColumn(\"author_name\", col(\"commit.author.name\"))\n        .withColumn(\"comment_count\", col(\"commit.comment_count\"))\n        .withColumn(\"committer_date\", to_timestamp(col(\"commit.committer.date\")))\n        .withColumn(\"committer_email\", col(\"commit.committer.email\"))\n        .withColumn(\"committer_name\", col(\"commit.committer.name\"))\n        .withColumn(\"message\", col(\"commit.message\"))\n        .withColumn(\"sha\", col(\"commit.tree.sha\"))\n        .withColumn(\"tree_url\", col(\"commit.tree.url\"))\n        .withColumn(\"url\", col(\"commit.url\"))\n        .withColumn(\"verification_payload\", col(\"commit.verification.payload\"))\n        .withColumn(\"verification_reason\", col(\"commit.verification.reason\"))\n        .withColumn(\"verification_signature\", col(\"commit.verification.signature\"))\n        .withColumn(\"verification_verified\", col(\"commit.verification.verified\").cast(\"string\"))\n        .drop(\"commit\")\n    )\n\n@dlt.table(\n    comment=\"Raw GitHub commits\"\n)\ndef github_commits_raw():\n    df = spark.read.json(spark.conf.get(\"commits-path\"))\n    return parse(df.select(\"commit\"))\n\n@dlt.table(\n    comment=\"Info on the author of a commit\"\n)\ndef commits_by_author():\n    return (\n        dlt.read(\"github_commits_raw\")\n        .withColumnRenamed(\"author_date\", \"date\")\n        .withColumnRenamed(\"author_email\", \"email\")\n        .withColumnRenamed(\"author_name\", \"name\")\n        .select(\"sha\", \"date\", \"email\", \"name\")\n    )\n\n@dlt.table(\n    comment=\"GitHub repository contributors\"\n)\ndef github_contributors_raw():\n    return(\n        spark.readStream.format(\"cloudFiles\")\n        .option(\"cloudFiles.format\", \"json\")\n        .load(spark.conf.get(\"contribs-path\"))\n    )\n\n```\n5. 
In the sidebar, click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows**, click the **Delta Live Tables** tab, and click **Create Pipeline**.\n6. Give the pipeline a name, for example, `Transform GitHub data`.\n7. In the **Notebook libraries** field, enter the path to your notebook or click ![File Picker Icon](https:\/\/docs.databricks.com\/_images\/file-picker.png) to select the notebook.\n8. Click **Add configuration**. In the `Key` text box, enter `commits-path`. In the `Value` text box, enter the DBFS path where the GitHub records will be written. This can be any path you choose and is the same path you\u2019ll use when configuring the first Python task when you [create the workflow](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html#create-workflow).\n9. Click **Add configuration** again. In the `Key` text box, enter `contribs-path`. In the `Value` text box, enter the DBFS path where the GitHub records will be written. This can be any path you choose and is the same path you\u2019ll use when configuring the second Python task when you [create the workflow](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html#create-workflow).\n10. In the **Target** field, enter a target database, for example, `github_tables`. Setting a target database publishes the output data to the metastore and is required for the downstream queries analyzing the data produced by the pipeline.\n11. Click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use Databricks SQL in a Databricks job\n###### Step 4: Create a workflow to ingest and transform GitHub data\n\nBefore analyzing and visualizing the GitHub data with Databricks SQL, you need to ingest and prepare the data. 
To create a workflow to complete these tasks, perform the following steps: \n### Create a Databricks job and add the first task \n1. Go to your Databricks landing page and do one of the following: \n* In the sidebar, click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** and click ![Create Job Button](https:\/\/docs.databricks.com\/_images\/create-job.png).\n* In the sidebar, click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** and select **Job** from the menu.\n2. In the task dialog box that appears on the **Tasks** tab, replace **Add a name for your job\u2026** with your job name, for example, `GitHub analysis workflow`.\n3. In **Task name**, enter a name for the task, for example, `get_commits`.\n4. In **Type**, select **Python script**.\n5. In **Source**, select **DBFS \/ S3**.\n6. In **Path**, enter the path to the script in DBFS.\n7. In **Parameters**, enter the following arguments for the Python script: \n`[\"<owner>\",\"<repo>\",\"commits\",\"<DBFS-output-dir>\",\"<scope-name>\",\"<github-token-key>\"]` \n* Replace `<owner>` with the name of the repository owner. For example, to fetch records from the `github.com\/databrickslabs\/overwatch` repository, enter `databrickslabs`.\n* Replace `<repo>` with the repository name, for example, `overwatch`.\n* Replace `<DBFS-output-dir>` with a path in DBFS to store the records fetched from GitHub.\n* Replace `<scope-name>` with the name of the secret scope you created to store the GitHub token.\n* Replace `<github-token-key>` with the name of the key you assigned to the GitHub token.\n8. Click **Save task**. \n### Add another task \n1. Click ![Add Task Button](https:\/\/docs.databricks.com\/_images\/add-task.png) below the task you just created.\n2. In **Task name**, enter a name for the task, for example, `get_contributors`.\n3. In **Type**, select the **Python script** task type.\n4. In **Source**, select **DBFS \/ S3**.\n5. 
In **Path**, enter the path to the script in DBFS.\n6. In **Parameters**, enter the following arguments for the Python script: \n`[\"<owner>\",\"<repo>\",\"contributors\",\"<DBFS-output-dir>\",\"<scope-name>\",\"<github-token-key>\"]` \n* Replace `<owner>` with the name of the repository owner. For example, to fetch records from the `github.com\/databrickslabs\/overwatch` repository, enter `databrickslabs`.\n* Replace `<repo>` with the repository name, for example, `overwatch`.\n* Replace `<DBFS-output-dir>` with a path in DBFS to store the records fetched from GitHub.\n* Replace `<scope-name>` with the name of the secret scope you created to store the GitHub token.\n* Replace `<github-token-key>` with the name of the key you assigned to the GitHub token.\n7. Click **Save task**. \n### Add a task to transform the data \n1. Click ![Add Task Button](https:\/\/docs.databricks.com\/_images\/add-task.png) below the task you just created.\n2. In **Task name**, enter a name for the task, for example, `transform_github_data`.\n3. In **Type**, select **Delta Live Tables pipeline** and enter a name for the task.\n4. In **Pipeline**, select the pipeline created in [Step 3: Create a Delta Live Tables pipeline to process the GitHub data](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html#create-pipeline).\n5. Click **Create**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use Databricks SQL in a Databricks job\n###### Step 5: Run the data transformation workflow\n\nClick ![Run Now Button](https:\/\/docs.databricks.com\/_images\/run-now-button.png) to run the workflow. 
To view [details for the run](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#job-run-details), click the link in the **Start time** column for the run in the [job runs](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#view-job-run-list) view. Click on each task to view details for the task run.\n\n##### Use Databricks SQL in a Databricks job\n###### Step 6: (Optional) To view the output data after the workflow run completes, perform the following steps:\n\n1. In the run details view, click on the Delta Live Tables task.\n2. In the **Task run details** panel, click on the pipeline name under **Pipeline**. The **Pipeline details** page displays.\n3. Select the `commits_by_author` table in the pipeline DAG.\n4. Click the table name next to **Metastore** in the **commits\\_by\\_author** panel. The Catalog Explorer page opens. \nIn Catalog Explorer, you can view the table schema, sample data, and other details for the data. Follow the same steps to view data for the `github_contributors_raw` table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use Databricks SQL in a Databricks job\n###### Step 7: Remove the GitHub data\n\nIn a real-world application, you might be continuously ingesting and processing data. Because this example downloads and processes the entire data set, you must remove the already downloaded GitHub data to prevent an error when re-running the workflow. To remove the downloaded data, perform the following steps: \n1. 
Create a new notebook and enter the following commands in the first cell: \n```\ndbutils.fs.rm(\"<commits-path>\", True)\ndbutils.fs.rm(\"<contributors-path>\", True)\n\n``` \nReplace `<commits-path>` and `<contributors-path>` with the DBFS paths you configured when creating the Python tasks.\n2. Click ![Run Menu](https:\/\/docs.databricks.com\/_images\/run-menu.png) and select **Run Cell**. \nYou can also add this notebook as a task in the workflow.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use Databricks SQL in a Databricks job\n###### Step 8: Create the Databricks SQL queries\n\nAfter running the workflow and creating the required tables, create queries to analyze the prepared data. To create the example queries and visualizations, perform the following steps: \n### Display the top 10 contributors by month \n1. Click the icon below the Databricks logo ![Databricks logo](https:\/\/docs.databricks.com\/_images\/databricks-logo.png) in the sidebar and select **SQL**.\n2. Click **Create a query** to open the Databricks SQL query editor.\n3. Make sure the catalog is set to **hive\\_metastore**. Click **default** next to **hive\\_metastore** and set the database to the **Target** value you set in the Delta Live Tables pipeline.\n4. In the **New Query** tab, enter the following query: \n```\nSELECT\ndate_part('YEAR', date) AS year,\ndate_part('MONTH', date) AS month,\nname,\ncount(1)\nFROM\ncommits_by_author\nWHERE\nname IN (\nSELECT\nname\nFROM\ncommits_by_author\nGROUP BY\nname\nORDER BY\ncount(name) DESC\nLIMIT 10\n)\nAND\ndate_part('YEAR', date) >= 2022\nGROUP BY\nname, year, month\nORDER BY\nyear, month, name\n\n```\n5. Click the **New query** tab and rename the query, for example, `Commits by month top 10 contributors`.\n6. 
By default, the results are displayed as a table. To change how the data is visualized, for example, using a bar chart, in the **Results** panel click ![Jobs Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/jobs-vertical-ellipsis.png) and click **Edit**.\n7. In **Visualization type**, select **Bar**.\n8. In **X column**, select **month**.\n9. In **Y columns**, select **count(1)**.\n10. In **Group by**, select **name**.\n11. Click **Save**. \n### Display the top 20 contributors \n1. Click **+ > Create new query** and make sure the catalog is set to **hive\\_metastore**. Click **default** next to **hive\\_metastore** and set the database to the **Target** value you set in the Delta Live Tables pipeline.\n2. Enter the following query: \n```\nSELECT\nlogin,\ncast(contributions AS INTEGER)\nFROM\ngithub_contributors_raw\nORDER BY\ncontributions DESC\nLIMIT 20\n\n```\n3. Click the **New query** tab and rename the query, for example, `Top 20 contributors`.\n4. To change the visualization from the default table, in the **Results** panel, click ![Jobs Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/jobs-vertical-ellipsis.png) and click **Edit**.\n5. In **Visualization type**, select **Bar**.\n6. In **X column**, select **login**.\n7. In **Y columns**, select **contributions**.\n8. Click **Save**. \n### Display the total commits by author \n1. Click **+ > Create new query** and make sure the catalog is set to **hive\\_metastore**. Click **default** next to **hive\\_metastore** and set the database to the **Target** value you set in the Delta Live Tables pipeline.\n2. Enter the following query: \n```\nSELECT\nname,\ncount(1) commits\nFROM\ncommits_by_author\nGROUP BY\nname\nORDER BY\ncommits DESC\nLIMIT 10\n\n```\n3. Click the **New query** tab and rename the query, for example, `Total commits by author`.\n4. 
To change the visualization from the default table, in the **Results** panel, click ![Jobs Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/jobs-vertical-ellipsis.png) and click **Edit**.\n5. In **Visualization type**, select **Bar**.\n6. In **X column**, select **name**.\n7. In **Y columns**, select **commits**.\n8. Click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use Databricks SQL in a Databricks job\n###### Step 9: Create a dashboard\n\n1. In the sidebar, click ![Dashboards Icon](https:\/\/docs.databricks.com\/_images\/dashboards-icon.png) **Dashboards**\n2. Click **Create dashboard**.\n3. Enter a name for the dashboard, for example, `GitHub analysis`.\n4. For each query and visualization created in [Step 8: Create the Databricks SQL queries](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html#create-queries), click **Add > Visualization** and select each visualization.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use Databricks SQL in a Databricks job\n###### Step 10: Add the SQL tasks to the workflow\n\nTo add the new query tasks to the workflow you created in [Create a Databricks job and add the first task](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html#create-job), for each query you created in [Step 8: Create the Databricks SQL queries](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html#create-queries): \n1. 
Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar.\n2. In the **Name** column, click the job name.\n3. Click the **Tasks** tab.\n4. Click ![Add Task Button](https:\/\/docs.databricks.com\/_images\/add-task.png) below the last task.\n5. Enter a name for the task, in **Type** select **SQL**, and in **SQL task** select **Query**.\n6. Select the query in **SQL query**.\n7. In **SQL warehouse**, select a serverless SQL warehouse or a pro SQL warehouse to run the task.\n8. Click **Create**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use Databricks SQL in a Databricks job\n###### Step 11: Add a dashboard task\n\n1. Click ![Add Task Button](https:\/\/docs.databricks.com\/_images\/add-task.png) below the last task.\n2. Enter a name for the task, in **Type**, select **SQL**, and in **SQL task** select **Legacy dashboard**.\n3. Select the dashboard created in [Step 9: Create a dashboard](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html#create-dashboard).\n4. In **SQL warehouse**, select a serverless SQL warehouse or a pro SQL warehouse to run the task.\n5. Click **Create**.\n\n##### Use Databricks SQL in a Databricks job\n###### Step 12: Run the complete workflow\n\nTo run the workflow, click ![Run Now Button](https:\/\/docs.databricks.com\/_images\/run-now-button.png). 
To view [details for the run](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#job-run-details), click the link in the **Start time** column for the run in the [job runs](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#view-job-run-list) view.\n\n##### Use Databricks SQL in a Databricks job\n###### Step 13: View the results\n\nTo view the results when the run completes, click the final dashboard task and click the dashboard name under **SQL dashboard** in the right panel.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Map options\n\nThe map visualizations display results on a geographic map. The query result set must include the appropriate geographic data: \n* **Choropleth**: Geographic localities, such as countries or states, are colored according to the aggregate values of each key column. The query must return geographic locations by name. For an example, see [choropleth example](https:\/\/docs.databricks.com\/visualizations\/visualization-types.html#choropleth).\n* **Marker**: A marker is placed at a set of coordinates on the map. The query result must return latitude and longitude pairs. For an example, see [marker example](https:\/\/docs.databricks.com\/visualizations\/visualization-types.html#marker).\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/maps.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Map options\n###### Choropleth options\n\nThis section covers the options for choropleth visualizations. \n### General \nTo configure general options, click **General** and configure each of the following required settings: \n* **Map**: Which geographic map to use. One of **Countries**, **USA**, or **Japan\/Prefectures**. 
The results in the key column must appear in the map you choose.\n* **Geographic column**: The column containing the locations for the map.\n* **Geographic type**: The format of values in the key column. One of **Abbreviated name**, **Short name**, **Full name**, **ISO code (2 letters)**, **ISO code (3 letters)**, **ISO code (3 digits)**: \n| Format | Value |\n| --- | --- |\n| Abbreviated name | U.S.A, Brazil |\n| Short name | United States, Brazil |\n| Full name | United States, Brazil |\n| ISO code (2 letters) | US, BR |\n| ISO code (3 letters) | USA, BRA |\n| ISO code (numeric) | 840, 076 | \nIf a geographic column value doesn\u2019t match one of the target field formats, no data is shown for that locality. \nFor more information on supported map values and their corresponding abbreviated values, see [Supported map values](https:\/\/docs.databricks.com\/visualizations\/maps.html#supported-map-values).\n* **Value Column**: Aggregates the result per key column. \n### Colors \n* **Clustering mode**: How data is segmented into steps along the color gradient. \n+ **Equidistant** (default): Results are segmented into the number of segments you specify in the **Steps** field, and each segment\u2019s range is the same size.\n+ **Quantile**: Results are segmented into a number of segments less than or equal to the value you specify in the **Steps** field, where each segment contains the same number of results, regardless of the segment\u2019s numeric range.\n+ **K-mean**: Results are segmented into the number of segments you specify in the **Steps** field, and the results in each segment are clustered near a common mean.\n* **Steps**: The number of colors to use inclusive of **Min Color** and **Max Color** in the map and the legend. 
If **Clustering Mode** is **Quantile**, there may be fewer steps than you specify here.\n* **Min Color**: The color for the lowest value range.\n* **Max Color**: The color for the highest value range.\n* **No Value Color**: The color for locations with no result in the query. Often, this color is outside of the gradient range between **Min Color** and **Max Color**.\n* **Background Color**: The color for the map background.\n* **Border Color**: The color for borders between locations. \n### Format \n* **Value format**: Optionally override the automatically detected format for the value column.\n* **Value placeholder**: An optional value to display for locations with no results. Defaults to **N\/A**.\n* **Show legend**: Uncheck to disable the legend.\n* **Legend position**: Where in the visualization to display the legend.\n* **Legend text alignment**: How to align text in the legend.\n* **Show tooltip**: Whether to display a tooltip when you hover your cursor over a location.\n* **Tooltip template**: An optional value to override the template used for tooltips.\n* **Show popup**: Whether to display a popup when you click a location.\n* **Popup template**: An optional value to override the template used for popups. \n### Bounds \nTo show only a portion of the map in the visualization, you can configure coordinates for the four corners of the area to show. You can set the coordinates manually, zoom in or out, or use the rectangular selector.\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/maps.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Map options\n###### Marker options\n\nThis section covers the options for marker visualizations. \n### General \n* **Latitude column**: Latitude portion of the coordinate pair.\n* **Longitude column**: Longitude portion of the coordinate pair.\n* **Group by**: The column to use for grouping, if different from the original query. 
\n### Format \n* **Show tooltip**: If selected, a tooltip displays when you hover the mouse over a marker in the visualization.\n* **Tooltip template**: Override the default tooltip template using a format you specify.\n* **Show popup**: If selected, a popup displays when you click a marker in the visualization.\n* **Popup template**: Override the default popup template using a format you specify. \n### Style \n* **Map tiles**: Override the default map tiles with one of the included sets.\n* **Cluster markers**: If selected, markers that are near to each other are clustered into a single marker that indicates the number of markers in the cluster.\n* **Override default style**: If selected, you can override the marker shape, icon, color, background color, and border color. \nTo hide a series in a visualization, click the series in the legend. To show the series again, click it again in the legend. \nTo show only a single series, double-click the series in the legend. To show other series, click each one.\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/maps.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Map options\n###### Supported map values\n\nThe following table lists supported map values and their corresponding abbreviated values. \n| Abbreviated name | Short name | Full name | ISO code (2 letters) | ISO code (3 letters) | ISO code (numeric) |\n| --- | --- | --- | --- | --- | --- |\n| Belize | Belize | Belize | BZ | BLZ | 84 |\n| Can. | Canada | Canada | CA | CAN | 124 |\n| Bhs. | Bahamas | Bahamas | BS | BHS | 44 |\n| C.R. | Costa Rica | Costa Rica | CR | CRI | 188 |\n| Grlnd. | Greenland | Greenland | GL | GRL | 304 |\n| Dom. Rep. | Dominican Rep. | Dominican Republic | DO | DOM | 214 |\n| Cuba | Cuba | Cuba | CU | CUB | 192 |\n| Hond. | Honduras | Honduras | HN | HND | 340 |\n| Guat. 
| Guatemala | Guatemala | GT | GTM | 320 |\n| Haiti | Haiti | Haiti | HT | HTI | 332 |\n| Mex. | Mexico | Mexico | MX | MEX | 484 |\n| Jam. | Jamaica | Jamaica | JM | JAM | 388 |\n| P.R. | Puerto Rico | Puerto Rico | PR | PRI | 630 |\n| Pan. | Panama | Panama | PA | PAN | 591 |\n| Nic. | Nicaragua | Nicaragua | NI | NIC | 558 |\n| El. S. | El Salvador | El Salvador | SV | SLV | 222 |\n| Tr.T. | Trinidad and Tobago | Trinidad and Tobago | TT | TTO | 780 |\n| U.S.A. | United States | United States | US | USA | 840 |\n| Arg. | Argentina | Argentina | AR | ARG | 32 |\n| Bolivia | Bolivia | Bolivia | BO | BOL | 68 |\n| Brazil | Brazil | Brazil | BR | BRA | 76 |\n| Chile | Chile | Chile | CL | CHL | 152 |\n| Col. | Colombia | Colombia | CO | COL | 170 |\n| Ecu. | Ecuador | Ecuador | EC | ECU | 218 |\n| Flk. Is. | Falkland Is. | Falkland Islands | FK | FLK | 238 |\n| Guy. | Guyana | Guyana | GY | GUY | 328 |\n| Para. | Paraguay | Paraguay | PY | PRY | 600 |\n| Sur. | Suriname | Suriname | SR | SUR | 740 |\n| Ury. | Uruguay | Uruguay | UY | URY | 858 |\n| Peru | Peru | Peru | PE | PER | 604 |\n| Ven. | Venezuela | Venezuela | VE | VEN | 862 |\n| Afg. | Afghanistan | Afghanistan | AF | AFG | 4 |\n| U.A.E. | United Arab Emirates | United Arab Emirates | AE | ARE | 784 |\n| Arm. | Armenia | Armenia | AM | ARM | 51 |\n| Aze. | Azerbaijan | Azerbaijan | AZ | AZE | 31 |\n| Brunei | Brunei | Brunei Darussalam | BN | BRN | 96 |\n| Bang. | Bangladesh | Bangladesh | BD | BGD | 50 |\n| China | China | China | CN | CHN | 156 |\n| Bhutan | Bhutan | Bhutan | BT | BTN | 64 |\n| N. Cy. | N. Cyprus | Northern Cyprus | | | |\n| Cyp. | Cyprus | Cyprus | CY | CYP | 196 |\n| Geo. | Georgia | Georgia | GE | GEO | 268 |\n| Indo. | Indonesia | Indonesia | ID | IDN | 360 |\n| India | India | India | IN | IND | 356 |\n| Iran | Iran | Iran | IR | IRN | 364 |\n| Jord. | Jordan | Jordan | JO | JOR | 400 |\n| Isr. 
| Israel | Israel | IL | ISR | 376 |\n| Japan | Japan | Japan | JP | JPN | 392 |\n| Iraq | Iraq | Iraq | IQ | IRQ | 368 |\n| Kaz. | Kazakhstan | Kazakhstan | KZ | KAZ | 398 |\n| Kgz. | Kyrgyzstan | Kyrgyzstan | KG | KGZ | 417 |\n| S.K. | Korea | Republic of Korea | KR | KOR | 410 |\n| Camb. | Cambodia | Cambodia | KH | KHM | 116 |\n| Kwt. | Kuwait | Kuwait | KW | KWT | 414 |\n| Laos | Lao PDR | Lao PDR | LA | LAO | 418 |\n| Leb. | Lebanon | Lebanon | LB | LBN | 422 |\n| Sri L. | Sri Lanka | Sri Lanka | LK | LKA | 144 |\n| Myan. | Myanmar | Myanmar | MM | MMR | 104 |\n| Mong. | Mongolia | Mongolia | MN | MNG | 496 |\n| Malay. | Malaysia | Malaysia | MY | MYS | 458 |\n| Nepal | Nepal | Nepal | NP | NPL | 524 |\n| Oman | Oman | Oman | OM | OMN | 512 |\n| Pak. | Pakistan | Pakistan | PK | PAK | 586 |\n| Phil. | Philippines | Philippines | PH | PHL | 608 |\n| N.K. | Dem. Rep. Korea | Dem. Rep. Korea | KP | PRK | 408 |\n| Pal. | Palestine | Palestine | PS | PSE | 275 |\n| Qatar | Qatar | Qatar | QA | QAT | 634 |\n| Syria | Syria | Syria | SY | SYR | 760 |\n| Saud. | Saudi Arabia | Saudi Arabia | SA | SAU | 682 |\n| Thai. | Thailand | Thailand | TH | THA | 764 |\n| Tjk. | Tajikistan | Tajikistan | TJ | TJK | 762 |\n| Turkm. | Turkmenistan | Turkmenistan | TM | TKM | 795 |\n| T.L. | Timor-Leste | Timor-Leste | TL | TLS | 626 |\n| Tur. | Turkey | Turkey | TR | TUR | 792 |\n| Taiwan | Taiwan | Taiwan | TW | TWN | 158 |\n| Viet. | Vietnam | Vietnam | VN | VNM | 704 |\n| Uzb. | Uzbekistan | Uzbekistan | UZ | UZB | 860 |\n| Yem. | Yemen | Yemen | YE | YEM | 887 |\n| Ang. | Angola | Angola | AO | AGO | 24 |\n| Bur. | Burundi | Burundi | BI | BDI | 108 |\n| B.F. | Burkina Faso | Burkina Faso | BF | BFA | 854 |\n| Bwa. | Botswana | Botswana | BW | BWA | 72 |\n| Benin | Benin | Benin | BJ | BEN | 204 |\n| C.A.R. | Central African Rep. | Central African Republic | CF | CAF | 140 |\n| I.C. | C\u00f4te d\u2019Ivoire | C\u00f4te d\u2019Ivoire | CI | CIV | 384 |\n| Cam. 
| Cameroon | Cameroon | CM | CMR | 120 |\n| D.R.C. | Dem. Rep. Congo | Democratic Republic of the Congo | CD | COD | 180 |\n| Rep. Congo | Congo | Republic of Congo | CG | COG | 178 |\n| Dji. | Djibouti | Djibouti | DJ | DJI | 262 |\n| Alg. | Algeria | Algeria | DZ | DZA | 12 |\n| Egypt | Egypt | Egypt | EG | EGY | 818 |\n| Gabon | Gabon | Gabon | GA | GAB | 266 |\n| Erit. | Eritrea | Eritrea | ER | ERI | 232 |\n| Eth. | Ethiopia | Ethiopia | ET | ETH | 231 |\n| Ghana | Ghana | Ghana | GH | GHA | 288 |\n| Gin. | Guinea | Guinea | GN | GIN | 324 |\n| Gambia | Gambia | The Gambia | GM | GMB | 270 |\n| GnB. | Guinea-Bissau | Guinea-Bissau | GW | GNB | 624 |\n| Ken. | Kenya | Kenya | KE | KEN | 404 |\n| Eq. G. | Eq. Guinea | Equatorial Guinea | GQ | GNQ | 226 |\n| Liberia | Liberia | Liberia | LR | LBR | 430 |\n| Libya | Libya | Libya | LY | LBY | 434 |\n| Mor. | Morocco | Morocco | MA | MAR | 504 |\n| Les. | Lesotho | Lesotho | LS | LSO | 426 |\n| Mad. | Madagascar | Madagascar | MG | MDG | 450 |\n| Mali | Mali | Mali | ML | MLI | 466 |\n| Moz. | Mozambique | Mozambique | MZ | MOZ | 508 |\n| Nam. | Namibia | Namibia | NA | NAM | 516 |\n| Niger | Niger | Niger | NE | NER | 562 |\n| Mal. | Malawi | Malawi | MW | MWI | 454 |\n| Mrt. | Mauritania | Mauritania | MR | MRT | 478 |\n| Nigeria | Nigeria | Nigeria | NG | NGA | 566 |\n| Rwa. | Rwanda | Rwanda | RW | RWA | 646 |\n| W. Sah. | W. Sahara | Western Sahara | EH | ESH | 732 |\n| S. Sud. | S. Sudan | South Sudan | SS | SSD | 728 |\n| S.L. | Sierra Leone | Sierra Leone | SL | SLE | 694 |\n| Sen. | Senegal | Senegal | SN | SEN | 686 |\n| Solnd. | Somaliland | Somaliland | | | |\n| Sudan | Sudan | Sudan | SD | SDN | 729 |\n| Som. | Somalia | Somalia | SO | SOM | 706 |\n| Swz. | Swaziland | Swaziland | SZ | SWZ | 748 |\n| Chad | Chad | Chad | TD | TCD | 148 |\n| Togo | Togo | Togo | TG | TGO | 768 |\n| Tun. | Tunisia | Tunisia | TN | TUN | 788 |\n| Tanz. | Tanzania | Tanzania | TZ | TZA | 834 |\n| Uga. 
| Uganda | Uganda | UG | UGA | 800 |\n| S.Af. | South Africa | South Africa | ZA | ZAF | 710 |\n| Zambia | Zambia | Zambia | ZM | ZMB | 894 |\n| Zimb. | Zimbabwe | Zimbabwe | ZW | ZWE | 716 |\n| Alb. | Albania | Albania | AL | ALB | 8 |\n| Aust. | Austria | Austria | AT | AUT | 40 |\n| Belg. | Belgium | Belgium | BE | BEL | 56 |\n| Bulg. | Bulgaria | Bulgaria | BG | BGR | 100 |\n| Bela. | Belarus | Belarus | BY | BLR | 112 |\n| Switz. | Switzerland | Switzerland | CH | CHE | 756 |\n| B.H. | Bosnia and Herz. | Bosnia and Herzegovina | BA | BIH | 70 |\n| Cz. Rep. | Czech Rep. | Czech Republic | CZ | CZE | 203 |\n| Ger. | Germany | Germany | DE | DEU | 276 |\n| Den. | Denmark | Denmark | DK | DNK | 208 |\n| Sp. | Spain | Spain | ES | ESP | 724 |\n| Est. | Estonia | Estonia | EE | EST | 233 |\n| Fin. | Finland | Finland | FI | FIN | 246 |\n| Fr. | France | France | FR | FRA | 250 |\n| U.K. | United Kingdom | United Kingdom | GB | GBR | 826 |\n| Cro. | Croatia | Croatia | HR | HRV | 191 |\n| Ire. | Ireland | Ireland | IE | IRL | 372 |\n| Greece | Greece | Greece | GR | GRC | 300 |\n| Iceland | Iceland | Iceland | IS | ISL | 352 |\n| Hun. | Hungary | Hungary | HU | HUN | 348 |\n| Italy | Italy | Italy | IT | ITA | 380 |\n| Kos. | Kosovo | Kosovo | | | |\n| Lith. | Lithuania | Lithuania | LT | LTU | 440 |\n| Lat. | Latvia | Latvia | LV | LVA | 428 |\n| Lux. | Luxembourg | Luxembourg | LU | LUX | 442 |\n| Mkd. | Macedonia | Macedonia | MK | MKD | 807 |\n| Mda. | Moldova | Moldova | MD | MDA | 498 |\n| Mont. | Montenegro | Montenegro | ME | MNE | 499 |\n| Nor. | Norway | Norway | NO | NOR | 578 |\n| Neth. | Netherlands | Netherlands | NL | NLD | 528 |\n| Pol. | Poland | Poland | PL | POL | 616 |\n| Port. | Portugal | Portugal | PT | PRT | 620 |\n| Rom. | Romania | Romania | RO | ROU | 642 |\n| Rus. | Russia | Russian Federation | RU | RUS | 643 |\n| Serb. | Serbia | Serbia | RS | SRB | 688 |\n| Slo. | Slovenia | Slovenia | SI | SVN | 705 |\n| Svk. 
| Slovakia | Slovakia | SK | SVK | 703 |\n| Ukr. | Ukraine | Ukraine | UA | UKR | 804 |\n| Swe. | Sweden | Sweden | SE | SWE | 752 |\n| Auz. | Australia | Australia | AU | AUS | 36 |\n| Fiji | Fiji | Fiji | FJ | FJI | 242 |\n| New C. | New Caledonia | New Caledonia | NC | NCL | 540 |\n| N.Z. | New Zealand | New Zealand | NZ | NZL | 554 |\n| P.N.G. | Papua New Guinea | Papua New Guinea | PG | PNG | 598 |\n| S. Is. | Solomon Is. | Solomon Islands | SB | SLB | 90 |\n| Van. | Vanuatu | Vanuatu | VU | VUT | 548 |\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/maps.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Feature Engineering in Unity Catalog\n\nThis page describes how to create and work with feature tables in Unity Catalog. \nThis page applies only to workspaces that are enabled for Unity Catalog. If your workspace is not enabled for Unity Catalog, see [Work with features in Workspace Feature Store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html).\n\n#### Feature Engineering in Unity Catalog\n##### Requirements\n\nFeature Engineering in Unity Catalog requires Databricks Runtime 13.2 or above. In addition, the Unity Catalog metastore must have [Privilege Model Version 1.0](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/upgrade-privilege-model.html).\n\n#### Feature Engineering in Unity Catalog\n##### Install Feature Engineering in Unity Catalog Python client\n\nFeature Engineering in Unity Catalog has a Python client `FeatureEngineeringClient`. The class is available on PyPI with the `databricks-feature-engineering` package and is pre-installed in Databricks Runtime 13.3 LTS ML and above. If you use a non-ML Databricks Runtime, you must install the client manually. 
Use the [compatibility matrix](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html#feature-engineering-compatibility-matrix) to find the correct version for your Databricks Runtime version. \n```\n%pip install databricks-feature-engineering\n\ndbutils.library.restartPython()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Feature Engineering in Unity Catalog\n##### Create a catalog and a schema for feature tables in Unity Catalog\n\nYou must create a new [catalog](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#catalog) or use an existing catalog for feature tables. \nTo create a new catalog, you must have the `CREATE CATALOG` privilege on the [metastore](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#metastore). \n```\nCREATE CATALOG IF NOT EXISTS <catalog-name>\n\n``` \nTo use an existing catalog, you must have the `USE CATALOG` privilege on the catalog. \n```\nUSE CATALOG <catalog-name>\n\n``` \nFeature tables in Unity Catalog must be stored in a [schema](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#database). To create a new schema in the catalog, you must have the `CREATE SCHEMA` privilege on the catalog. \n```\nCREATE SCHEMA IF NOT EXISTS <schema-name>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Feature Engineering in Unity Catalog\n##### Create a feature table in Unity Catalog\n\nNote \nYou can use an existing Delta table in Unity Catalog that includes a primary key constraint as a feature table. If the table does not have a primary key defined, you must update the table using `ALTER TABLE` DDL statements to add the constraint. 
See [Use an existing Delta table in Unity Catalog as a feature table](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html#use-existing-uc-table). \nHowever, adding a primary key to a streaming table or materialized view that was published to Unity Catalog by a Delta Live Tables pipeline requires modifying the schema of the streaming table or materialized view definition to include the primary key and then refreshing the streaming table or materialized view. See [Use a streaming table or materialized view created by a Delta Live Tables pipeline as a feature table](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html#use-existing-delta-live-table). \nFeature tables in Unity Catalog are [Delta tables](https:\/\/docs.databricks.com\/delta\/index.html). Feature tables must have a primary key. Feature tables, like other data assets in Unity Catalog, are accessed using a three-level namespace: `<catalog-name>.<schema-name>.<table-name>`. \nYou can use Databricks SQL, the Python `FeatureEngineeringClient`, or a Delta Live Tables pipeline to create feature tables in Unity Catalog. \nYou can use any Delta table with a [primary key constraint](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table-add-constraint.html) as a feature table. The following code shows how to create a table with a primary key: \n```\nCREATE TABLE ml.recommender_system.customer_features (\ncustomer_id int NOT NULL,\nfeat1 long,\nfeat2 varchar(100),\nCONSTRAINT customer_features_pk PRIMARY KEY (customer_id)\n);\n\n``` \nTo create a [time series feature table](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html), add a time column as a primary key column and specify the [TIMESERIES](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-table-constraint.html) keyword. The TIMESERIES keyword requires Databricks Runtime 13.3 LTS or above. 
\n```\nCREATE TABLE ml.recommender_system.customer_features (\ncustomer_id int NOT NULL,\nts timestamp NOT NULL,\nfeat1 long,\nfeat2 varchar(100),\nCONSTRAINT customer_features_pk PRIMARY KEY (customer_id, ts TIMESERIES)\n);\n\n``` \nAfter the table is created, you can write data to it like other Delta tables, and it can be used as a feature table. \nFor details about the commands and parameters used in the following examples, see the [Feature Engineering Python API reference](https:\/\/api-docs.databricks.com\/python\/feature-engineering\/latest\/index.html). \n1. Write the Python functions to compute the features. The output of each function should be an Apache Spark DataFrame with a unique primary key. The primary key can consist of one or more columns.\n2. Create a feature table by instantiating a `FeatureEngineeringClient` and using `create_table`.\n3. Populate the feature table using `write_table`. \n```\nfrom databricks.feature_engineering import FeatureEngineeringClient\n\nfe = FeatureEngineeringClient()\n\n# Prepare feature DataFrame\ndef compute_customer_features(data):\n    '''Feature computation code returns a DataFrame with 'customer_id' as primary key'''\n    pass\n\ncustomer_features_df = compute_customer_features(df)\n\n# Create feature table with `customer_id` as the primary key.\n# Take schema from DataFrame output by compute_customer_features\ncustomer_feature_table = fe.create_table(\n    name='ml.recommender_system.customer_features',\n    primary_keys='customer_id',\n    schema=customer_features_df.schema,\n    description='Customer features'\n)\n\n# An alternative is to use `create_table` and specify the `df` argument.\n# This code automatically saves the features to the underlying Delta table.\n\n# customer_feature_table = fe.create_table(\n# ...\n# df=customer_features_df,\n# ...\n# )\n\n# To use a composite primary key, pass all primary key columns in the create_table call\n\n# customer_feature_table = fe.create_table(\n# ...\n# primary_keys=['customer_id', 
'date'],\n# ...\n# )\n\n# To create a time series table, set the timeseries_columns argument\n\n# customer_feature_table = fe.create_table(\n# ...\n# primary_keys=['customer_id', 'date'],\n# timeseries_columns='date',\n# ...\n# )\n\n``` \nNote \nDelta Live Tables support for table constraints is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). The following code examples must be run using the Delta Live Tables [preview channel](https:\/\/docs.databricks.com\/release-notes\/delta-live-tables\/index.html#runtime-channels). \nAny table published from a Delta Live Tables pipeline that includes a [primary key constraint](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table-add-constraint.html) can be used as a feature table. Use the following syntax to create a table in a Delta Live Tables pipeline with a primary key: \n```\nCREATE LIVE TABLE customer_features (\ncustomer_id int NOT NULL,\nfeat1 long,\nfeat2 varchar(100),\nCONSTRAINT customer_features_pk PRIMARY KEY (customer_id)\n) AS SELECT * FROM ...;\n\n``` \nTo create a [time series feature table](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html), add a time column as a primary key column and specify the [TIMESERIES](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-table-constraint.html) keyword. \n```\nCREATE LIVE TABLE customer_features (\ncustomer_id int NOT NULL,\nts timestamp NOT NULL,\nfeat1 long,\nfeat2 varchar(100),\nCONSTRAINT customer_features_pk PRIMARY KEY (customer_id, ts TIMESERIES)\n) AS SELECT * FROM ...;\n\n``` \nAfter the table is created, you can write data to it like other Delta Live Tables datasets, and it can be used as a feature table. \nDefining table constraints is supported only by the Delta Live Tables SQL interface. 
To set primary keys for streaming tables or materialized views that were declared in Python, see [Use a streaming table or materialized view created by a Delta Live Tables pipeline as a feature table](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html#use-existing-delta-live-table).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Feature Engineering in Unity Catalog\n##### Use an existing Delta table in Unity Catalog as a feature table\n\nAny Delta table in Unity Catalog with a primary key can be a feature table in Unity Catalog, and you can use the Features UI and API with the table. \nNote \n* Only the table owner can declare primary key constraints. The owner\u2019s name is displayed on the table detail page of Catalog Explorer.\n* Verify the data type in the Delta table is supported by Feature Engineering in Unity Catalog. See [Supported data types](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html#supported-data-types).\n* The [TIMESERIES](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-table-constraint.html) keyword requires Databricks Runtime 13.3 LTS or above. \nIf an existing Delta table does not have a [primary key constraint](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table-add-constraint.html), you can create one as follows: \n1. Set primary key columns to `NOT NULL`. For each primary key column, run: \n```\nALTER TABLE <full_table_name> ALTER COLUMN <pk_col_name> SET NOT NULL\n\n```\n2. Alter the table to add the primary key constraint: \n```\nALTER TABLE <full_table_name> ADD CONSTRAINT <pk_name> PRIMARY KEY(pk_col1, pk_col2, ...)\n\n``` \n`pk_name` is the name of the primary key constraint. 
By convention, you can use the table name (without schema and catalog) with a `_pk` suffix. For example, a table with the name `\"ml.recommender_system.customer_features\"` would have `customer_features_pk` as the name of its primary key constraint. \nTo make the table a [time series feature table](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html), specify the [TIMESERIES](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-table-constraint.html) keyword on one of the primary key columns, as follows: \n```\nALTER TABLE <full_table_name> ADD CONSTRAINT <pk_name> PRIMARY KEY(pk_col1 TIMESERIES, pk_col2, ...)\n\n``` \nAfter you add the primary key constraint on the table, the table appears in the Features UI and you can use it as a feature table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Feature Engineering in Unity Catalog\n##### Use a streaming table or materialized view created by a Delta Live Tables pipeline as a feature table\n\nAny streaming table or materialized view in Unity Catalog with a primary key can be a feature table in Unity Catalog, and you can use the Features UI and API with the table. \nNote \n* Delta Live Tables support for table constraints is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). The following code examples must be run using the Delta Live Tables [preview channel](https:\/\/docs.databricks.com\/release-notes\/delta-live-tables\/index.html#runtime-channels).\n* Only the table owner can declare primary key constraints. The owner\u2019s name is displayed on the table detail page of Catalog Explorer.\n* Verify that Feature Engineering in Unity Catalog supports the data type in the Delta table. 
See [Supported data types](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html#supported-data-types). \n### Add a primary key to a streaming table or materialized view that was created using SQL \nTo set primary keys for an existing streaming table or materialized view that was created using the Delta Live Tables SQL interface, update the schema of the streaming table or materialized view in the notebook that manages the object. Then, [refresh the table](https:\/\/docs.databricks.com\/delta-live-tables\/updates.html#refresh-selection) to update the Unity Catalog object. \nThe following is the syntax to add a primary key to a materialized view: \n```\nCREATE OR REFRESH MATERIALIZED VIEW existing_live_table(\nid int NOT NULL PRIMARY KEY,\n...\n) AS SELECT ...\n\n``` \n### Add primary key to a streaming table or materialized view that was created using Python \nTo create primary keys for an existing streaming table or materialized view that was created by a Delta Live Tables pipeline, you must use the Delta Live Tables SQL interface, even if the streaming table or materialized view was created using the Delta Live Tables Python interface. To add a primary key to a streaming table or materialized view created in Python, create a new SQL notebook to define a new streaming table or materialized view that reads from the existing streaming table or materialized view. Then, run the notebook as a step of the existing Delta Live Tables pipeline or in a new pipeline. 
\nThe following is an example of the syntax to use in the new SQL notebook to add a primary key to a materialized view: \n```\nCREATE OR REFRESH MATERIALIZED VIEW new_live_table_with_constraint(\nid int NOT NULL PRIMARY KEY,\n...\n) AS SELECT * FROM existing_live_table\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Feature Engineering in Unity Catalog\n##### Control access to feature tables in Unity Catalog\n\nAccess control for feature tables in Unity Catalog is managed by Unity Catalog. See [Unity Catalog privileges](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Feature Engineering in Unity Catalog\n##### Update a feature table in Unity Catalog\n\nYou can update a feature table in Unity Catalog by adding new features or by modifying specific rows based on the primary key. \nThe following feature table metadata should not be updated: \n* Primary key.\n* Partition key.\n* Name or data type of an existing feature. \nAltering them will cause downstream pipelines that use features for training and serving models to break. \n### Add new features to an existing feature table in Unity Catalog \nYou can add new features to an existing feature table in one of two ways: \n* Update the existing feature computation function and run `write_table` with the returned DataFrame. This updates the feature table schema and merges new feature values based on the primary key.\n* Create a new feature computation function to calculate the new feature values. The DataFrame returned by this new computation function must contain the feature tables\u2019 primary and partition keys (if defined). 
Run `write_table` with the DataFrame to write the new features to the existing feature table using the same primary key. \n### Update only specific rows in a feature table \nUse `mode = \"merge\"` in `write_table`. Rows whose primary key does not exist in the DataFrame sent in the `write_table` call remain unchanged. \n```\nfrom databricks.feature_engineering import FeatureEngineeringClient\nfe = FeatureEngineeringClient()\nfe.write_table(\nname='ml.recommender_system.customer_features',\ndf = customer_features_df,\nmode = 'merge'\n)\n\n``` \n### Schedule a job to update a feature table \nTo ensure that features in feature tables always have the most recent values, Databricks recommends that you [create a job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) that runs a notebook to update your feature table on a regular basis, such as every day. If you already have a non-scheduled job created, you can convert it to a [scheduled job](https:\/\/docs.databricks.com\/workflows\/jobs\/schedule-jobs.html#job-schedule) to ensure the feature values are always up-to-date. \nCode to update a feature table uses `mode='merge'`, as shown in the following example. \n```\nfrom databricks.feature_engineering import FeatureEngineeringClient\nfe = FeatureEngineeringClient()\n\ncustomer_features_df = compute_customer_features(data)\n\nfe.write_table(\ndf=customer_features_df,\nname='ml.recommender_system.customer_features',\nmode='merge'\n)\n\n``` \n### Store past values of daily features \nDefine a feature table with a composite primary key. Include the date in the primary key. For example, for a feature table `customer_features`, you might use a composite primary key (`date`, `customer_id`) and partition key `date` for efficient reads. 
\n```\nCREATE TABLE ml.recommender_system.customer_features (\ncustomer_id int NOT NULL,\n`date` date NOT NULL,\nfeat1 long,\nfeat2 varchar(100),\nCONSTRAINT customer_features_pk PRIMARY KEY (`date`, customer_id)\n)\nPARTITIONED BY (`date`)\nCOMMENT \"Customer features\";\n\n``` \n```\nfrom databricks.feature_engineering import FeatureEngineeringClient\nfe = FeatureEngineeringClient()\nfe.create_table(\nname='ml.recommender_system.customer_features',\nprimary_keys=['date', 'customer_id'],\npartition_columns=['date'],\nschema=customer_features_df.schema,\ndescription='Customer features'\n)\n\n``` \nYou can then create code to read from the feature table filtering `date` to the time period of interest. \nYou can also create a [time series feature table](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html) which enables point-in-time lookups when you use `create_training_set` or `score_batch`. See [Create a feature table in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html#create-feature-table). \nTo keep the feature table up to date, set up a regularly scheduled job to write features or stream new feature values into the feature table. \n### Create a streaming feature computation pipeline to update features \nTo create a streaming feature computation pipeline, pass a streaming `DataFrame` as an argument to `write_table`. This method returns a `StreamingQuery` object. 
\n```\nfrom databricks.feature_engineering import FeatureEngineeringClient\n\ndef compute_additional_customer_features(data):\n    '''Returns a streaming DataFrame'''\n    pass\n\nfe = FeatureEngineeringClient()\n\ncustomer_transactions = spark.readStream.load(\"dbfs:\/events\/customer_transactions\")\nstream_df = compute_additional_customer_features(customer_transactions)\n\nfe.write_table(\n    df=stream_df,\n    name='ml.recommender_system.customer_features',\n    mode='merge'\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Feature Engineering in Unity Catalog\n##### Read from a feature table in Unity Catalog\n\nUse `read_table` to read feature values. \n```\nfrom databricks.feature_engineering import FeatureEngineeringClient\nfe = FeatureEngineeringClient()\ncustomer_features_df = fe.read_table(\nname='ml.recommender_system.customer_features',\n)\n\n```\n\n#### Feature Engineering in Unity Catalog\n##### Search and browse feature tables in Unity Catalog\n\nUse the Features UI to search for or browse feature tables in Unity Catalog. \n1. Click ![Feature Store Icon](https:\/\/docs.databricks.com\/_images\/feature-store-icon.png) **Features** in the sidebar to display the Features UI.\n2. Select a catalog with the catalog selector to view all of the available feature tables in that catalog. In the search box, enter all or part of the name of a feature table, a feature, or a comment. You can also enter all or part of the [key or value of a tag](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html#feature-table-tags). Search text is case-insensitive. \n![Feature search example](https:\/\/docs.databricks.com\/_images\/feature-search-example-uc.png)\n\n#### Feature Engineering in Unity Catalog\n##### Get metadata of feature tables in Unity Catalog\n\nUse `get_table` to get feature table metadata. 
\n```\nfrom databricks.feature_engineering import FeatureEngineeringClient\nfe = FeatureEngineeringClient()\nft = fe.get_table(name=\"ml.recommender_system.user_feature_table\")\nprint(ft.features)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Feature Engineering in Unity Catalog\n##### Use tags with feature tables and features in Unity Catalog\n\nYou can use tags, which are simple key-value pairs, to categorize and manage your feature tables and features. \nFor feature tables, you can create, edit, and delete tags using Catalog Explorer, SQL statements in a notebook or SQL query editor, or the Feature Engineering Python API. \nFor features, you can create, edit, and delete tags using Catalog Explorer or SQL statements in a notebook or SQL query editor. \nSee [Apply tags to Unity Catalog securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/tags.html) and [Python API](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html). \nThe following example shows how to use the Feature Engineering Python API to create, update, and delete feature table tags. 
\n```\nfrom databricks.feature_engineering import FeatureEngineeringClient\nfe = FeatureEngineeringClient()\n\n# Create feature table with tags\ncustomer_feature_table = fe.create_table(\n# ...\ntags={\"tag_key_1\": \"tag_value_1\", \"tag_key_2\": \"tag_value_2\", ...},\n# ...\n)\n\n# Upsert a tag\nfe.set_feature_table_tag(name=\"customer_feature_table\", key=\"tag_key_1\", value=\"new_key_value\")\n\n# Delete a tag\nfe.delete_feature_table_tag(name=\"customer_feature_table\", key=\"tag_key_2\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Feature Engineering in Unity Catalog\n##### Delete a feature table in Unity Catalog\n\nYou can delete a feature table in Unity Catalog by directly deleting the Delta table in Unity Catalog using Catalog Explorer or using the [Feature Engineering Python API](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html). \nNote \n* Deleting a feature table can lead to unexpected failures in upstream producers and downstream consumers (models, endpoints, and scheduled jobs). You must delete published online stores with your cloud provider.\n* When you delete a feature table in Unity Catalog, the underlying Delta table is also dropped.\n* `drop_table` is not supported in Databricks Runtime 13.1 ML or below. Use SQL command to delete the table. 
\nYou can use Databricks SQL or `FeatureEngineeringClient.drop_table` to delete a feature table in Unity Catalog: \n```\nDROP TABLE ml.recommender_system.customer_features;\n\n``` \n```\nfrom databricks.feature_engineering import FeatureEngineeringClient\nfe = FeatureEngineeringClient()\nfe.drop_table(\nname='ml.recommender_system.customer_features'\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Provisioned throughput Foundation Model APIs\n\nThis article demonstrates how to deploy models using [Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html) with provisioned throughput. Databricks recommends provisioned throughput for production workloads, and it provides optimized inference for foundation models with performance guarantees. \nSee [Provisioned throughput Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#throughput) for a list of supported model architectures.\n\n##### Provisioned throughput Foundation Model APIs\n###### Requirements\n\nSee [requirements](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html). \nFor deploying fine-tuned foundation models, \n* Your model must be logged using MLflow 2.11 or above, OR Databricks Runtime 15.0 ML or above.\n* Databricks recommends using models in Unity Catalog for faster upload and download of large models.\n\n##### Provisioned throughput Foundation Model APIs\n###### [Recommended] Deploy foundation models from Unity Catalog\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nDatabricks recommends using the foundation models that are pre-installed in Unity Catalog. 
You can find these models under the catalog `system` in the schema `ai` (`system.ai`). \nTo deploy a foundation model: \n1. Navigate to `system.ai` in Catalog Explorer.\n2. Click on the name of the model to deploy.\n3. On the model page, click the **Serve this model** button.\n4. The **Create serving endpoint** page appears. See [Create your provisioned throughput endpoint using the UI](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html#provisioned-throughput-endpoint-ui).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Provisioned throughput Foundation Model APIs\n###### Deploy foundation models from Databricks Marketplace\n\nAlternatively, you can install foundation models to Unity Catalog from [Databricks Marketplace](https:\/\/marketplace.databricks.com\/?asset=Model&isFree=true&provider=Databricks&sortBy=date). \nYou can search for a model family and from the model page, you can select **Get access** and provide login credentials to install the model to Unity Catalog. \nAfter the model is installed to Unity Catalog, you can create a model serving endpoint using the Serving UI.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Provisioned throughput Foundation Model APIs\n###### Deploy DBRX models\n\nDatabricks recommends serving the DBRX Instruct model for your workloads. 
To serve the DBRX Instruct model using provisioned throughput, follow the guidance in [[Recommended] Deploy foundation models from Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html#uc). \nWhen serving these DBRX models, provisioned throughput supports a context length of up to 16k. \nDBRX models use the following default system prompt to ensure relevance and accuracy in model responses: \n```\nYou are DBRX, created by Databricks. You were last updated in December 2023. You answer questions based on information available up to that point.\nYOU PROVIDE SHORT RESPONSES TO SHORT QUESTIONS OR STATEMENTS, but provide thorough responses to more complex and open-ended questions.\nYou assist with various tasks, from writing to coding (using markdown for code blocks \u2014 remember to use ``` with code, JSON, and tables).\n(You do not have real-time data access or code execution capabilities. You avoid stereotyping and provide balanced perspectives on controversial topics. You do not provide song lyrics, poems, or news articles and do not divulge details of your training data.)\nThis is your system prompt, guiding your responses. Do not reference it, just respond to the user. If you find yourself talking about this message, stop. 
You should be responding appropriately and usually that means not mentioning this.\nYOU DO NOT MENTION ANY OF THIS INFORMATION ABOUT YOURSELF UNLESS THE INFORMATION IS DIRECTLY PERTINENT TO THE USER'S QUERY.\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Provisioned throughput Foundation Model APIs\n###### Log fine-tuned foundation models\n\nIf you cannot use the models in the `system.ai` schema or install models from the Databricks Marketplace, you can deploy a fine-tuned foundation model by logging it to Unity Catalog. The following shows how to set up your code to log an MLflow model to Unity Catalog: \n```\nmlflow.set_registry_uri('databricks-uc')\nCATALOG = \"ml\"\nSCHEMA = \"llm-catalog\"\nMODEL_NAME = \"mpt\" # or \"bge\"\nregistered_model_name = f\"{CATALOG}.{SCHEMA}.{MODEL_NAME}\"\n\n``` \nYou can log your model using the MLflow `transformers` flavor and specify the task argument with the appropriate model type interface from the following options: \n* `task=\"llm\/v1\/completions\"`\n* `task=\"llm\/v1\/chat\"`\n* `task=\"llm\/v1\/embeddings\"` \nThese arguments specify the API signature used for the model serving endpoint, and models logged this way are eligible for provisioned throughput. \nModels logged from the `sentence_transformers` package also support defining the `\"llm\/v1\/embeddings\"` endpoint type. \nFor models logged using MLflow 2.12 or above, the `log_model` argument `task` sets the `metadata` `task` key\u2019s value automatically. If the `task` argument and the `metadata` `task` argument are set to different values, an `Exception` is raised. 
\nThe following example logs a text-completion language model using MLflow 2.12 or above: \n```\nimport mlflow\nimport numpy as np\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nmodel = AutoModelForCausalLM.from_pretrained(\"mosaicml\/mpt-7b-instruct\", torch_dtype=torch.bfloat16)\ntokenizer = AutoTokenizer.from_pretrained(\"mosaicml\/mpt-7b-instruct\")\nwith mlflow.start_run():\n    components = {\n        \"model\": model,\n        \"tokenizer\": tokenizer,\n    }\n    mlflow.transformers.log_model(\n        transformers_model=components,\n        artifact_path=\"model\",\n        input_example={\"prompt\": np.array([\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\\n\\n### Instruction:\\nWhat is Apache Spark?\\n\\n### Response:\\n\"])},\n        task=\"llm\/v1\/completions\",\n        registered_model_name=registered_model_name\n    )\n\n``` \nFor models logged using MLflow 2.11 or above, you can specify the interface for the endpoint using the following metadata values: \n* `metadata = {\"task\": \"llm\/v1\/completions\"}`\n* `metadata = {\"task\": \"llm\/v1\/chat\"}`\n* `metadata = {\"task\": \"llm\/v1\/embeddings\"}` \nThe following example logs a text-completion language model using MLflow 2.11 or above: \n```\nimport mlflow\nimport numpy as np\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nmodel = AutoModelForCausalLM.from_pretrained(\"mosaicml\/mpt-7b-instruct\", torch_dtype=torch.bfloat16)\ntokenizer = AutoTokenizer.from_pretrained(\"mosaicml\/mpt-7b-instruct\")\nwith mlflow.start_run():\n    components = {\n        \"model\": model,\n        \"tokenizer\": tokenizer,\n    }\n    mlflow.transformers.log_model(\n        transformers_model=components,\n        artifact_path=\"model\",\n        input_example={\"prompt\": np.array([\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\\n\\n### Instruction:\\nWhat is Apache Spark?\\n\\n### Response:\\n\"])},\n        task=\"llm\/v1\/completions\",\n        metadata={\"task\": \"llm\/v1\/completions\"},\n        registered_model_name=registered_model_name\n    )\n\n``` \nProvisioned throughput also supports both the small and large BGE embedding models. 
The following example logs the `BAAI\/bge-small-en-v1.5` model so that it can be served with provisioned throughput using MLflow 2.11 or above: \n```\nimport mlflow\nfrom transformers import AutoModel, AutoTokenizer\n\nmodel = AutoModel.from_pretrained(\"BAAI\/bge-small-en-v1.5\")\ntokenizer = AutoTokenizer.from_pretrained(\"BAAI\/bge-small-en-v1.5\")\nwith mlflow.start_run():\n    components = {\n        \"model\": model,\n        \"tokenizer\": tokenizer,\n    }\n    mlflow.transformers.log_model(\n        transformers_model=components,\n        artifact_path=\"bge-small-transformers\",\n        task=\"llm\/v1\/embeddings\",\n        metadata={\"task\": \"llm\/v1\/embeddings\"}, # not needed for MLflow >=2.12.1\n        registered_model_name=registered_model_name\n    )\n\n``` \nWhen logging a fine-tuned BGE model, you must also specify the `model_type` metadata key: \n```\nmetadata={\n\"task\": \"llm\/v1\/embeddings\",\n\"model_type\": \"bge-large\" # Or \"bge-small\"\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Provisioned throughput Foundation Model APIs\n###### Create your provisioned throughput endpoint using the UI\n\nAfter the logged model is in Unity Catalog, create a provisioned throughput serving endpoint with the following steps: \n1. Navigate to the **Serving UI** in your workspace.\n2. Select **Create serving endpoint**.\n3. In the **Entity** field, select your model from Unity Catalog. For eligible models, the UI for the Served Entity shows the Provisioned Throughput screen.\n4. In the **Up to** dropdown you can configure the maximum tokens per second throughput for your endpoint. \n1. Provisioned throughput endpoints automatically scale, so you can select **Modify** to view the minimum tokens per second your endpoint can scale down to. 
\n![Provisioned Throughput](https:\/\/docs.databricks.com\/_images\/create-provisioned-throughput-ui.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Provisioned throughput Foundation Model APIs\n###### Create your provisioned throughput endpoint using the REST API\n\nTo deploy your model in provisioned throughput mode using the REST API, you must specify the `min_provisioned_throughput` and `max_provisioned_throughput` fields in your request. \nTo identify the suitable range of provisioned throughput for your model, see [Get provisioned throughput in increments](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html#get-increments). \n```\nimport requests\nimport json\n\n# Name of the serving endpoint to create\nendpoint_name = \"llama2-13b-chat\"\n\n# Name of the registered MLflow model\nmodel_name = \"ml.llm-catalog.llama-13b\"\n\n# Version of the registered MLflow model to deploy\nmodel_version = 3\n\n# Get the API endpoint and token for the current notebook context\nAPI_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()\nAPI_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()\n\nheaders = {\"Content-Type\": \"application\/json\", \"Authorization\": f\"Bearer {API_TOKEN}\"}\n\n# Check whether the model is eligible for provisioned throughput\noptimizable_info = requests.get(\n    url=f\"{API_ROOT}\/api\/2.0\/serving-endpoints\/get-model-optimization-info\/{model_name}\/{model_version}\",\n    headers=headers,\n).json()\n\nif 'optimizable' not in optimizable_info or not optimizable_info['optimizable']:\n    raise ValueError(\"Model is not eligible for provisioned throughput\")\n\nchunk_size = optimizable_info['throughput_chunk_size']\n\n# Minimum desired provisioned 
throughput\nmin_provisioned_throughput = 2 * chunk_size\n\n# Maximum desired provisioned throughput\nmax_provisioned_throughput = 3 * chunk_size\n\n# Send the POST request to create the serving endpoint\ndata = {\n\"name\": endpoint_name,\n\"config\": {\n\"served_entities\": [\n{\n\"entity_name\": model_name,\n\"entity_version\": model_version,\n\"min_provisioned_throughput\": min_provisioned_throughput,\n\"max_provisioned_throughput\": max_provisioned_throughput,\n}\n]\n},\n}\n\nresponse = requests.post(\nurl=f\"{API_ROOT}\/api\/2.0\/serving-endpoints\", json=data, headers=headers\n)\n\nprint(json.dumps(response.json(), indent=4))\n\n``` \n### Get provisioned throughput in increments \nProvisioned throughput is available in increments of tokens per second with specific increments varying by model. To identify the suitable range for your needs, Databricks recommends using the model optimization information API within the platform. \n```\nGET api\/2.0\/serving-endpoints\/get-model-optimization-info\/{registered_model_name}\/{version}\n\n``` \nThe following is an example response from the API: \n```\n{\n\"optimizable\": true,\n\"model_type\": \"llama\",\n\"throughput_chunk_size\": 980\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Provisioned throughput Foundation Model APIs\n###### Notebook examples\n\nThe following notebooks show examples of how to create a provisioned throughput Foundation Model API: \n### Provisioned throughput serving for Llama2 model notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/large-language-models\/provisioned-throughput-llama-serving.html) \n### 
Provisioned throughput serving for Mistral model notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/large-language-models\/provisioned-throughput-mistral-serving.html) \n### Provisioned throughput serving for BGE model notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/large-language-models\/provisioned-throughput-bge-serving.html)\n\n##### Provisioned throughput Foundation Model APIs\n###### Limitations\n\n* Model deployment might fail due to GPU capacity issues, which results in a timeout during endpoint creation or update. Reach out to your Databricks account team to help resolve the issue.\n* Auto-scaling for Foundation Model APIs is slower than CPU model serving. Databricks recommends over-provisioning to avoid request timeouts.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Provisioned throughput Foundation Model APIs\n###### Additional resources\n\n* [What do tokens per second ranges in provisioned throughput mean?](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/prov-throughput-tokens.html)\n* [Conduct your own LLM endpoint benchmarking](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/prov-throughput-run-benchmark.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html"} +{"content":"# What is Delta Lake?\n### Delta Lake generated columns\n\nPreview \nThis feature is in [Public 
Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nDelta Lake supports generated columns, which are a special type of column whose values are automatically generated based on a user-specified function over other columns in the Delta table. When you write to a table with generated columns and you do not explicitly provide values for them, Delta Lake automatically computes the values. For example, you can automatically generate a date column (for partitioning the table by date) from the timestamp column; any writes into the table need only specify the data for the timestamp column. However, if you explicitly provide values for them, the values must satisfy the [constraint](https:\/\/docs.databricks.com\/tables\/constraints.html) `(<value> <=> <generation expression>) IS TRUE` or the write will fail with an error. \nImportant \nTables created with generated columns have a higher table writer protocol version than the default. See [How does Databricks manage Delta Lake feature compatibility?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html) to understand table protocol versioning and what it means to have a higher table protocol version.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/generated-columns.html"} +{"content":"# What is Delta Lake?\n### Delta Lake generated columns\n#### Create a table with generated columns\n\nThe following example shows how to create a table with generated columns: \n```\nCREATE TABLE default.people10m (\nid INT,\nfirstName STRING,\nmiddleName STRING,\nlastName STRING,\ngender STRING,\nbirthDate TIMESTAMP,\ndateOfBirth DATE GENERATED ALWAYS AS (CAST(birthDate AS DATE)),\nssn STRING,\nsalary INT\n)\n\n``` \n```\nfrom delta.tables import DeltaTable\nfrom pyspark.sql.types import DateType\n\nDeltaTable.create(spark) \\\n.tableName(\"default.people10m\") \\\n.addColumn(\"id\", \"INT\") \\\n.addColumn(\"firstName\", \"STRING\") \\\n.addColumn(\"middleName\", \"STRING\") \\\n.addColumn(\"lastName\", \"STRING\", comment = \"surname\") \\\n.addColumn(\"gender\", 
\"STRING\") \\\n.addColumn(\"birthDate\", \"TIMESTAMP\") \\\n.addColumn(\"dateOfBirth\", DateType(), generatedAlwaysAs=\"CAST(birthDate AS DATE)\") \\\n.addColumn(\"ssn\", \"STRING\") \\\n.addColumn(\"salary\", \"INT\") \\\n.execute()\n\n``` \n```\nimport io.delta.tables.DeltaTable\nimport org.apache.spark.sql.types.DateType\n\nDeltaTable.create(spark)\n.tableName(\"default.people10m\")\n.addColumn(\"id\", \"INT\")\n.addColumn(\"firstName\", \"STRING\")\n.addColumn(\"middleName\", \"STRING\")\n.addColumn(\nDeltaTable.columnBuilder(\"lastName\")\n.dataType(\"STRING\")\n.comment(\"surname\")\n.build())\n.addColumn(\"gender\", \"STRING\")\n.addColumn(\"birthDate\", \"TIMESTAMP\")\n.addColumn(\nDeltaTable.columnBuilder(\"dateOfBirth\")\n.dataType(DateType)\n.generatedAlwaysAs(\"CAST(birthDate AS DATE)\")\n.build())\n.addColumn(\"ssn\", \"STRING\")\n.addColumn(\"salary\", \"INT\")\n.execute()\n\n``` \nGenerated columns are stored as if they were normal columns. That is, they occupy storage. \nThe following restrictions apply to generated columns: \n* A generation expression can use any SQL functions in Spark that always return the same result when given the same argument values, except the following types of functions: \n+ User-defined functions.\n+ Aggregate functions.\n+ Window functions.\n+ Functions returning multiple rows. \nDelta Lake can generate partition filters for a query whenever a partition column is defined by one of the following expressions: \nNote \nPhoton is required in Databricks Runtime 10.4 LTS and below. Photon is not required in Databricks Runtime 11.3 LTS and above. 
\n* `CAST(col AS DATE)` and the type of `col` is `TIMESTAMP`.\n* `YEAR(col)` and the type of `col` is `TIMESTAMP`.\n* Two partition columns defined by `YEAR(col), MONTH(col)` and the type of `col` is `TIMESTAMP`.\n* Three partition columns defined by `YEAR(col), MONTH(col), DAY(col)` and the type of `col` is `TIMESTAMP`.\n* Four partition columns defined by `YEAR(col), MONTH(col), DAY(col), HOUR(col)` and the type of `col` is `TIMESTAMP`.\n* `SUBSTRING(col, pos, len)` and the type of `col` is `STRING`.\n* `DATE_FORMAT(col, format)` and the type of `col` is `TIMESTAMP`. \n+ You can only use date formats with the following patterns: `yyyy-MM` and `yyyy-MM-dd-HH`.\n+ In Databricks Runtime 10.4 LTS and above, you can also use the following pattern: `yyyy-MM-dd`. \nIf a partition column is defined by one of the preceding expressions, and a query filters data using the underlying base column of a generation expression, Delta Lake looks at the relationship between the base column and the generated column, and populates partition filters based on the generated partition column if possible. For example, given the following table: \n```\nCREATE TABLE events(\neventId BIGINT,\ndata STRING,\neventType STRING,\neventTime TIMESTAMP,\neventDate DATE GENERATED ALWAYS AS (CAST(eventTime AS DATE))\n)\nPARTITIONED BY (eventType, eventDate)\n\n``` \nIf you then run the following query: \n```\nSELECT * FROM events\nWHERE eventTime >= \"2020-10-01 00:00:00\" AND eventTime <= \"2020-10-01 12:00:00\"\n\n``` \nDelta Lake automatically generates a partition filter so that the preceding query only reads the data in partition `eventDate=2020-10-01` even if a partition filter is not specified. 
\nAs another example, given the following table: \n```\nCREATE TABLE events(\neventId BIGINT,\ndata STRING,\neventType STRING,\neventTime TIMESTAMP,\nyear INT GENERATED ALWAYS AS (YEAR(eventTime)),\nmonth INT GENERATED ALWAYS AS (MONTH(eventTime)),\nday INT GENERATED ALWAYS AS (DAY(eventTime))\n)\nPARTITIONED BY (eventType, year, month, day)\n\n``` \nIf you then run the following query: \n```\nSELECT * FROM events\nWHERE eventTime >= \"2020-10-01 00:00:00\" AND eventTime <= \"2020-10-01 12:00:00\"\n\n``` \nDelta Lake automatically generates a partition filter so that the preceding query only reads the data in partition `year=2020\/month=10\/day=01` even if a partition filter is not specified. \nYou can use an [EXPLAIN](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-qry-explain.html) clause and check the provided plan to see whether Delta Lake automatically generates any partition filters.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/generated-columns.html"} +{"content":"# What is Delta Lake?\n### Delta Lake generated columns\n#### Use identity columns in Delta Lake\n\nImportant \nDeclaring an identity column on a Delta table disables concurrent transactions. Only use identity columns in use cases where concurrent writes to the target table are not required. \nDelta Lake identity columns are a type of generated column that assigns a unique value to each record inserted into a table. The following example shows the basic syntax for declaring an identity column during a create table statement: \n```\nCREATE TABLE table_name (\nidentity_col BIGINT GENERATED BY DEFAULT AS IDENTITY,\nother_column ...)\n\n``` \nTo see all syntax options for creating tables with identity columns, see [CREATE TABLE [USING]](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-table-using.html). \nYou can optionally specify the following: \n* A starting value.\n* A step size, which can be positive or negative. 
\nValues assigned by identity columns are unique and increment in the direction of the specified step, and in multiples of the specified step size, but are not guaranteed to be contiguous. For example, with a starting value of `0` and a step size of `2`, all values are non-negative even numbers, but some even numbers might be skipped. \nWhen using the clause `GENERATED BY DEFAULT AS IDENTITY`, insert operations can specify values for the identity column. Change the clause to `GENERATED ALWAYS AS IDENTITY` to prevent insert operations from manually setting values. \nIdentity columns only support the `BIGINT` type, and operations fail if the assigned value exceeds the range supported by `BIGINT`. \nTo learn about syncing identity column values with data, see [ALTER TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table.html). \n### Identity column limitations \nThe following limitations exist when working with identity columns: \n* Concurrent transactions are not supported on tables with identity columns enabled.\n* You cannot partition a table by an identity column.\n* You cannot use `ALTER TABLE` to `ADD`, `REPLACE`, or `CHANGE` an identity column.\n* You cannot update the value of an identity column for an existing record. \nNote \nTo change the `IDENTITY` value for an existing record, you must delete the record and `INSERT` it as a new record.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/generated-columns.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Feature Engineering and Workspace Feature Store notebooks\n\nThis page includes example notebooks that illustrate the feature engineering workflow in Databricks for both Feature Engineering in Unity Catalog and Workspace Feature Store scenarios. \nNote \nWith Databricks Runtime 13.3 LTS and above, any Delta table in Unity Catalog that has a primary key is automatically a feature table that you can use for model training and inference. 
When you use a table registered in Unity Catalog as a feature table, all Unity Catalog capabilities are automatically available to the feature table.\n\n#### Feature Engineering and Workspace Feature Store notebooks\n##### Basic feature engineering example\n\nThe basic feature engineering example notebook steps you through how to create a feature table, use it to train a model, and then perform batch scoring using automatic feature lookup. It also introduces you to the Feature Engineering UI and shows how you can use it to search for features and understand how features are created and used. \n### Basic Feature Engineering in Unity Catalog example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/feature-store-with-uc-basic-example.html) \nUse the following notebook if your workspace is not enabled for Unity Catalog. \n### Basic Workspace Feature Store example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/feature-store-basic-example.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/example-notebooks.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Feature Engineering and Workspace Feature Store notebooks\n##### Taxi example with point-in-time lookup\n\nThe taxi example notebook illustrates the process of creating features, updating them, and using them for model training and batch inference. 
\n### Feature Engineering in Unity Catalog taxi example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/feature-store-with-uc-taxi-example.html) \nUse the following notebook if your workspace is not enabled for Unity Catalog. \n### Workspace Feature Store taxi example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/feature-store-taxi-example.html)\n\n#### Feature Engineering and Workspace Feature Store notebooks\n##### Create Feature Serving endpoints\n\nThis notebook illustrates how to use the Databricks SDK to create a Feature Serving endpoint using Databricks Online Tables. \n### Feature Serving example notebook with online tables \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/feature-function-serving-online-tables-dbsdk.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/example-notebooks.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Failing jobs or executors removed\n\nSo you\u2019re seeing failed jobs or removed executors: \n![Failing Jobs](https:\/\/docs.databricks.com\/_images\/failing-jobs.png) \nThe most common reasons for executors being removed are: \n* **Autoscaling**: In this case it\u2019s expected and not an error. See [Enable autoscaling](https:\/\/docs.databricks.com\/compute\/configure.html#autoscaling).\n* **Spot instance losses**: The cloud provider is reclaiming your VMs. 
You can learn more about Spot instances [here](https:\/\/www.databricks.com\/blog\/2016\/10\/25\/running-apache-spark-clusters-with-spot-instances-in-databricks.html).\n* **Executors running out of memory**\n\n##### Failing jobs or executors removed\n###### Failing jobs\n\nIf you see any failing jobs click on them to get to their pages. Then scroll down to see the failed stage and a failure reason: \n![Failure Reason](https:\/\/docs.databricks.com\/_images\/failed-stage-reason.png) \nYou may get a generic error. Click on the link in the description to see if you can get more info: \n![Failure Description](https:\/\/docs.databricks.com\/_images\/failed-stage-description.png) \nIf you scroll down in this page, you will be able to see why each task failed. In this case it\u2019s becoming clear there\u2019s a memory issue: \n![Failed Tasks](https:\/\/docs.databricks.com\/_images\/failed-tasks.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/failing-spark-jobs.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Failing jobs or executors removed\n###### Failing executors\n\nTo find out why your executors are failing, you\u2019ll first want to check the compute\u2019s **Event log** to see if there\u2019s any explanation for why the executors failed. For example, it\u2019s possible you\u2019re using spot instances and the cloud provider is taking them back. \n![Event Log](https:\/\/docs.databricks.com\/_images\/event-log.png) \nSee if there are any events explaining the loss of executors. For example you may see messages indicating that the cluster is resizing or spot instances are being lost. 
\n* If you are using spot instances, see [Losing spot instances](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/losing-spot-instances.html).\n* If your compute was resized with autoscaling, it\u2019s expected and not an error. See [Learn more about cluster resizing](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#autoscaling). \nIf you don\u2019t see any information in the event log, navigate back to the **Spark UI** then click the **Executors** tab: \n![Executors](https:\/\/docs.databricks.com\/_images\/executors.png) \nHere you can get the logs from the failed executors: \n![Failed Executors](https:\/\/docs.databricks.com\/_images\/failed-executors.png)\n\n##### Failing jobs or executors removed\n###### Next step\n\nIf you\u2019ve gotten this far, the likeliest explanation is a memory issue. The next step is to dig into memory issues. See [Spark memory issues](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/spark-memory-issues.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/failing-spark-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Package cells\n\nTo use custom Scala classes and objects defined within notebooks reliably in Spark and across notebook sessions, you should define classes in package cells. A *package cell* is a cell that is compiled when it is run. A package cell has no visibility with respect to the rest of the notebook. You can think of it as a separate Scala file. Only `class` and `object` definitions can go in a package cell. You cannot have any values, variables, or function definitions. \nThe following notebook shows what can happen if you do not use package cells and provides some examples, caveats, and best practices.\n\n#### Package cells\n##### Notebook example: package cells\n\nThe following notebook shows an example of how to package cells. 
\n### Package Cells notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/package-cells.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/package-cells.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Configure Unity Catalog storage account for CORS\n\nYou must configure cross-origin resource sharing (CORS) for Databricks to upload files efficiently to managed volumes defined in Unity Catalog. \nYou can configure CORS settings during initial deployment of your Unity Catalog metastore storage or change these settings later. Only sufficiently privileged cloud administrators can apply these changes. The instructions that follow assume that you have proper credentials and are logged into the cloud console for the account containing your storage account.\n\n#### Configure Unity Catalog storage account for CORS\n##### Configure CORS settings for S3\n\nThe instructions that follow show how to use the AWS console to update S3 bucket permissions with the required CORS configuration. \nNote \nIf instead you want to use an AWS CloudFormation template, be aware that CloudFormation uses some property names that differ from those listed in these instructions. Use the [CORS configuration instructions](https:\/\/docs.aws.amazon.com\/AWSCloudFormation\/latest\/UserGuide\/aws-properties-s3-bucket-cors.html) in the AWS CloudFormation reference to get the correct property names. \n1. Use the AWS console to select your bucket from the buckets list.\n2. Select **Permissions**.\n3. Select **Edit** under **Cross-origin resource sharing (CORS)**.\n4. Copy the following JSON configuration into the text box: \n```\n[\n{\n\"AllowedHeaders\": [],\n\"AllowedMethods\": [\n\"PUT\"\n],\n\"AllowedOrigins\": [\n\"https:\/\/*.databricks.com\"\n],\n\"ExposeHeaders\": [],\n\"MaxAgeSeconds\": 1800\n}\n]\n\n```\n5. 
Select **Save changes**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/storage-cors.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Third-party online stores\n##### Publish features to an online store\n\nThis article describes how to publish features to an online store for real-time serving. \nDatabricks Feature Store supports these online stores: \n| Online store provider | Publish with Feature Engineering in Unity Catalog | Publish with Workspace Feature Store | Feature lookup in Legacy MLflow Model Serving | Feature lookup in Model Serving |\n| --- | --- | --- | --- | --- |\n| Amazon DynamoDB | X | X (Feature Store client v0.3.8 and above) | X | X |\n| Amazon Aurora (MySQL-compatible) | | X | X | |\n| Amazon RDS MySQL | | X | X | | \nNote \nThe DynamoDB online store uses a different schema than the offline store. Specifically, in the online store, primary keys are stored as a combined key in the column `_feature_store_internal__primary_keys`. \nTo ensure that Feature Store can access the DynamoDB online store, you must create the table in the online store by using `publish_table()`. Do not manually create a table inside DynamoDB. `publish_table()` does that for you automatically.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/publish-features.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Third-party online stores\n##### Publish features to an online store\n###### Publish batch-computed features to an online store\n\nYou can create and schedule a Databricks job to regularly publish updated features. This job can also include the code to calculate the updated features, or you can create and run separate jobs to calculate and publish feature updates. 
\nFor SQL stores, the following code assumes that an online database named \u201crecommender\\_system\u201d already exists in the online store and matches the name of the offline store. If there is no table named \u201ccustomer\\_features\u201d in the database, this code creates one. It also assumes that features are computed each day and stored as a partitioned column `_dt`. \nThe following code assumes that you have [created secrets](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/fs-authentication.html#provide-online-store-credentials-using-databricks-secrets) to access this online store. \nIf you are using DynamoDB, Databricks recommends that you provide write authentication through [an instance profile attached to a Databricks cluster](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/fs-authentication.html#auth-instance-profile). The instance profile can only be used for publishing features; to look up feature values, you must use Databricks secrets. \nDynamoDB support is available in all versions of Feature Engineering in Unity Catalog client, and Feature Store client v0.3.8 and above. 
\n```\nimport datetime\nfrom databricks.feature_engineering import FeatureEngineeringClient\nfrom databricks.feature_engineering.online_store_spec import AmazonDynamoDBSpec\n# or databricks.feature_store.online_store_spec for Workspace Feature Store\n\nfe = FeatureEngineeringClient()\n\n# do not pass `write_secret_prefix` if you intend to use the instance profile attached to the cluster.\nonline_store = AmazonDynamoDBSpec(\nregion='<region>',\nread_secret_prefix='<read-scope>\/<prefix>',\nwrite_secret_prefix='<write-scope>\/<prefix>'\n)\n\nfe.publish_table( # or fs.publish_table for Workspace Feature Store\nname='ml.recommender_system.customer_features',\nonline_store=online_store,\nfilter_condition=f\"_dt = '{str(datetime.date.today())}'\",\nmode='merge'\n)\n\n``` \n```\nimport datetime\nfrom databricks.feature_store import FeatureStoreClient\nfrom databricks.feature_store.online_store_spec import AmazonRdsMySqlSpec\n\nfs = FeatureStoreClient()\n\nonline_store = AmazonRdsMySqlSpec(\nhostname='<hostname>',\nport='<port>',\nread_secret_prefix='<read-scope>\/<prefix>',\nwrite_secret_prefix='<write-scope>\/<prefix>'\n)\n\nfs.publish_table(\nname='recommender_system.customer_features',\nonline_store=online_store,\nfilter_condition=f\"_dt = '{str(datetime.date.today())}'\",\nmode='merge'\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/publish-features.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Third-party online stores\n##### Publish features to an online store\n###### Publish streaming features to an online store\n\nTo continuously stream features to the online store, set `streaming=True`. \n```\nfe.publish_table( # or fs.publish_table for Workspace Feature Store\nname='ml.recommender_system.customer_features',\nonline_store=online_store,\nstreaming=True\n)\n\n```\n\n##### Publish features to an online store\n###### Publish selected features to an online store\n\nTo publish only selected features to the online store, use the `features` argument to specify the feature name(s) to publish. Primary keys and timestamp keys are always published. 
If you do not specify the `features` argument or if the value is None, all features from the offline feature table are published. \n```\nfe.publish_table( # or fs.publish_table for Workspace Feature Store\nname='ml.recommender_system.customer_features',\nonline_store=online_store,\nfeatures=[\"total_purchases_30d\"]\n)\n\n```\n\n##### Publish features to an online store\n###### Publish a feature table to a specific database\n\nIn the [online store spec](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html), specify the database name (`database_name`) and the table name (`table_name`). If you do not specify these parameters, the offline database name and feature table name are used. `database_name` must already exist in the online store. \n```\nonline_store = AmazonRdsMySqlSpec(\nhostname='<hostname>',\nport='<port>',\ndatabase_name='<database-name>',\ntable_name='<table-name>',\nread_secret_prefix='<read-scope>\/<prefix>',\nwrite_secret_prefix='<write-scope>\/<prefix>'\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/publish-features.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Third-party online stores\n##### Publish features to an online store\n###### Overwrite an existing online feature table or specific rows\n\nUse `mode='overwrite'` in the `publish_table` call. The online table is completely overwritten by the data in the offline table. \nNote \nAmazon DynamoDB does not support overwrite mode. 
\n```\nfs.publish_table(\nname='recommender_system.customer_features',\nonline_store=online_store,\nmode='overwrite'\n)\n\n``` \nTo overwrite only certain rows, use the `filter_condition` argument: \n```\nfs.publish_table(\nname='recommender_system.customer_features',\nonline_store=online_store,\nfilter_condition=f\"_dt = '{str(datetime.date.today())}'\",\nmode='merge'\n)\n\n```\n\n##### Publish features to an online store\n###### Delete a published table from an online store\n\nWith Feature Store client v0.12.0 and above, you can use `drop_online_table` to delete a published table from an online store. When you delete a published table with `drop_online_table`, the table is deleted from your online store provider and the online store metadata is removed from Databricks. \n```\nfe.drop_online_table( # or fs.drop_online_table for Workspace Feature Store\nname='recommender_system.customer_features',\nonline_store = online_store\n)\n\n``` \nNote \n* `drop_online_table` deletes the published table from the online store. It does not delete the feature table in Databricks.\n* Before you delete a published table, you should ensure that the table is not used for Model Serving feature lookup and has no other downstream dependencies. The delete is irreversible and might cause dependencies to fail.\n* To check for any dependencies, consider rotating the keys for the published table you plan to delete for a day before you execute `drop_online_table`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/publish-features.html"} +{"content":"# \n### Technology partners\n\nDatabricks has validated integrations with various third-party solutions that allow you to work with data through Databricks clusters and SQL warehouses, in many cases with low-code and no-code experiences. These solutions enable common scenarios such as data ingestion, data preparation and transformation, business intelligence (BI), and machine learning. 
\nDatabricks also includes [Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/index.html), a user interface that allows some of these validated solutions to integrate more quickly and easily with your Databricks clusters and SQL warehouses.\n\n### Technology partners\n#### Build an integration with Databricks\n\nThis section provides instructions and best practices for technology partners to build and maintain their integrations with Databricks. \n* [Best practices for ingestion partners using Unity Catalog volumes as staging locations for data](https:\/\/docs.databricks.com\/_extras\/documents\/best-practices-ingestion-partner-volumes.pdf)\n\n### Technology partners\n#### All Databricks technology partners\n\nFor a list of all Databricks partner solutions, see [Databricks Technology Partners](https:\/\/www.databricks.com\/company\/partners\/technology). Some of these partner solutions are featured in Databricks Partner Connect.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/index.html"} +{"content":"# \n### Technology partners\n#### Databricks Partner Connect partners\n\nThis section lists the partner solutions that are featured in Partner Connect. 
\n### Data ingestion \n| Partner | Unity Catalog support | Steps to connect |\n| --- | --- | --- |\n| [Fivetran logo](https:\/\/fivetran.com\/) | Yes | [Connect to Fivetran using Partner Connect](https:\/\/docs.databricks.com\/partners\/ingestion\/fivetran.html#partner-connect) |\n| [Hevo Data logo](https:\/\/hevodata.com\/) | No | [Connect to Hevo Data using Partner Connect](https:\/\/docs.databricks.com\/partners\/ingestion\/hevo.html#partner-connect) |\n| [Rivery logo](https:\/\/rivery.io\/) | No | [Connect to Rivery using Partner Connect](https:\/\/docs.databricks.com\/partners\/ingestion\/rivery.html#partner-connect) |\n| [Rudderstack logo](https:\/\/www.rudderstack.com\/) | Yes | [Connect to RudderStack using Partner Connect](https:\/\/docs.databricks.com\/partners\/ingestion\/rudderstack.html#partner-connect) |\n| [Snowplow logo](https:\/\/snowplow.io\/) | Yes | [Connect to Snowplow using Partner Connect](https:\/\/docs.databricks.com\/partners\/ingestion\/snowplow.html#partner-connect) | \n### Data preparation and transformation \n| Partner | Unity Catalog support | Steps to connect |\n| --- | --- | --- |\n| [dbt Labs logo](https:\/\/www.getdbt.com\/) | Yes | [Connect to dbt Cloud using Partner Connect](https:\/\/docs.databricks.com\/partners\/prep\/dbt-cloud.html#partner-connect) |\n| [Matillion logo](https:\/\/www.matillion.com\/) | Yes | [Connect to Matillion using Partner Connect](https:\/\/docs.databricks.com\/partners\/prep\/matillion.html#partner-connect) |\n| [Prophecy logo](https:\/\/www.prophecy.io\/) | Yes | [Connect to Prophecy using Partner Connect](https:\/\/docs.databricks.com\/partners\/prep\/prophecy.html#partner-connect) | \n### Machine learning \n| Partner | Unity Catalog support | Steps to connect |\n| --- | --- | --- |\n| [Dataiku logo](https:\/\/dataiku.com\/) | Yes | [Connect to Dataiku using Partner Connect](https:\/\/docs.databricks.com\/partners\/ml\/dataiku.html#partner-connect) |\n| [John Snow Labs 
logo](https:\/\/www.johnsnowlabs.com) | N\/A | [Connect to John Snow Labs using Partner Connect](https:\/\/docs.databricks.com\/partners\/ml\/john-snow-labs.html#partner-connect) |\n| [Labelbox logo](https:\/\/labelbox.com\/) | N\/A | [Connect to Labelbox using Partner Connect](https:\/\/docs.databricks.com\/partners\/ml\/labelbox.html#connect-with-partner-connect) | \n### BI and visualization \n| Partner | Unity Catalog support | Steps to connect |\n| --- | --- | --- |\n| [Hex logo](https:\/\/hex.tech\/) | Yes | [Connect to Hex using Partner Connect](https:\/\/docs.databricks.com\/partners\/bi\/hex.html#partner-connect) |\n| [Power BI logo](https:\/\/powerbi.microsoft.com\/) | Yes | [Connect Power BI Desktop to Databricks using Partner Connect](https:\/\/docs.databricks.com\/partners\/bi\/power-bi.html#connect-with-partner-connect) |\n| [Preset logo](https:\/\/preset.io\/) | No | [Connect to Preset using Partner Connect](https:\/\/docs.databricks.com\/partners\/bi\/preset.html#partner-connect) |\n| [Qlik logo](https:\/\/www.qlik.com\/) | * Partner Connect: No * Manual connection: Yes | [Connect to Qlik Sense using Partner Connect](https:\/\/docs.databricks.com\/partners\/bi\/qlik-sense.html#partner-connect) |\n| [Sigma logo](https:\/\/www.sigmacomputing.com\/) | Yes | [Connect to Sigma using Partner Connect](https:\/\/docs.databricks.com\/partners\/bi\/sigma.html#partner-connect) |\n| [Tableau logo](https:\/\/www.tableau.com\/) | Yes | [Connect to Tableau Desktop using Partner Connect](https:\/\/docs.databricks.com\/partners\/bi\/tableau.html#connect-with-partner-connect) |\n| [ThoughtSpot logo](https:\/\/www.thoughtspot.com\/) | Yes | [Connect to ThoughtSpot using Partner Connect](https:\/\/docs.databricks.com\/partners\/bi\/thoughtspot.html#partner-connect) | \n### Reverse ETL \n| Partner | Unity Catalog support | Steps to connect |\n| --- | --- | --- |\n| [Census logo](https:\/\/getcensus.com\/) | Yes | [Connect to Census using Partner 
Connect](https:\/\/docs.databricks.com\/partners\/reverse-etl\/census.html#partner-connect) |\n| [Hightouch logo](https:\/\/hightouch.com\/) | * Partner Connect: No * Manual connection: Yes | [Connect to Hightouch using Partner Connect](https:\/\/docs.databricks.com\/partners\/reverse-etl\/hightouch.html#partner-connect) | \n### Security \n| Partner | Unity Catalog support | Steps to connect |\n| --- | --- | --- |\n| [Hunters logo](https:\/\/www.hunters.security\/) | Yes | [Connect to Hunters using Partner Connect](https:\/\/docs.databricks.com\/partners\/data-security\/hunters.html#partner-connect) |\n| [Privacera logo](https:\/\/www.privacera.com) | Yes | [Connect to Privacera using Partner Connect](https:\/\/docs.databricks.com\/partners\/data-security\/privacera.html#partner-connect) | \n### Data governance \n| Partner | Unity Catalog support | Steps to connect |\n| --- | --- | --- |\n| [Alation logo](https:\/\/www.alation.com\/) | Yes | [Connect to Alation using Partner Connect](https:\/\/docs.databricks.com\/partners\/data-governance\/alation.html#partner-connect) |\n| [Anomalo logo](https:\/\/www.anomalo.com) | Yes | [Connect to Anomalo using Partner Connect](https:\/\/docs.databricks.com\/partners\/data-governance\/anomalo.html#partner-connect) |\n| [erwin logo](https:\/\/www.erwin.com) | No | [Connect to erwin Data Modeler using Partner Connect](https:\/\/docs.databricks.com\/partners\/data-governance\/erwin.html#partner-connect) |\n| [Lightup logo](https:\/\/www.lightup.ai) | Yes | [Connect to Lightup using Partner Connect](https:\/\/docs.databricks.com\/partners\/data-governance\/lightup.html#partner-connect) |\n| [Monte Carlo logo](https:\/\/www.montecarlodata.com) | Yes | [Connect to Monte Carlo using Partner Connect](https:\/\/docs.databricks.com\/partners\/data-governance\/monte-carlo.html#partner-connect) | \n### Semantic layer \n| Partner | Unity Catalog support | Steps to connect |\n| --- | --- | --- |\n| [AtScale logo](https:\/\/www.atscale.com\/) | Yes 
| [Connect to AtScale using Partner Connect](https:\/\/docs.databricks.com\/partners\/semantic-layer\/atscale.html#partner-connect) |\n| [Stardog logo](https:\/\/www.stardog.com\/) | Yes | [Connect to Stardog using Partner Connect](https:\/\/docs.databricks.com\/partners\/semantic-layer\/stardog.html#partner-connect) |\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/index.html"} +{"content":"# Security and compliance guide\n## Secret management\n#### Secrets\n\nA secret is a key-value pair that stores secret material, with a key name unique within a [secret scope](https:\/\/docs.databricks.com\/security\/secrets\/secret-scopes.html). Each scope is limited to 1000 secrets. The maximum allowed secret value size is 128 KB. \nSee also the [Secrets API](https:\/\/docs.databricks.com\/api\/workspace\/secrets).\n\n#### Secrets\n##### Create a secret\n\nSecret names are case insensitive. \n### Create a secret in a Databricks-backed scope \nTo create a secret in a Databricks-backed scope using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) (version 0.205 and above): \n```\ndatabricks secrets put-secret --json '{\n\"scope\": \"<scope-name>\",\n\"key\": \"<key-name>\",\n\"string_value\": \"<secret>\"\n}'\n\n``` \nIf you are creating a multi-line secret, you can pass the secret using standard input. For example: \n```\n(cat << EOF\nthis\nis\na\nmulti\nline\nsecret\nEOF\n) | databricks secrets put-secret <secret_scope> <secret_key>\n\n``` \nYou can also provide a secret from a file. For more information about writing secrets, see [What is the Databricks CLI?](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html).\n\n#### Secrets\n##### List secrets\n\nTo list secrets in a given scope: \n```\ndatabricks secrets list-secrets <scope-name>\n\n``` \nThe response displays metadata information about the secrets, such as the secrets\u2019 key names. 
You use the [Secrets utility (dbutils.secrets)](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#dbutils-secrets) in a notebook or job to list this metadata. For example: \n```\ndbutils.secrets.list('my-scope')\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/secrets\/secrets.html"} +{"content":"# Security and compliance guide\n## Secret management\n#### Secrets\n##### Read a secret\n\nYou create secrets using the REST API or CLI, but you must use the [Secrets utility (dbutils.secrets)](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#dbutils-secrets) in a notebook or job to read a secret.\n\n#### Secrets\n##### Delete a secret\n\nTo delete a secret from a scope with the Databricks CLI: \n```\ndatabricks secrets delete-secret <scope-name> <key-name>\n\n``` \nYou can also use the [Secrets API](https:\/\/docs.databricks.com\/api\/workspace\/secrets).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/secrets\/secrets.html"} +{"content":"# Security and compliance guide\n## Secret management\n#### Secrets\n##### Use a secret in a Spark configuration property or environment variable\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nNote \nAvailable in Databricks Runtime 6.4 Extended Support and above. \nYou can reference a secret in a Spark configuration property or environment variable. Retrieved secrets are redacted from notebook output and Spark driver and executor logs. \nImportant \nKeep the following security implications in mind when referencing secrets in a Spark configuration property or environment variable: \n* If table access control is not enabled on a cluster, any user with Can Attach To permissions on a cluster or Run permissions on a notebook can read Spark configuration properties from within the notebook. This includes users who do not have direct permission to read a secret. 
Databricks recommends enabling [table access control](https:\/\/docs.databricks.com\/data-governance\/table-acls\/table-acl.html) on all clusters or managing access to secrets using [secret scopes](https:\/\/docs.databricks.com\/security\/secrets\/secret-scopes.html#create-a-databricks-backed-secret-scope).\n* Even when table access control is enabled, users with Can Attach To permissions on a cluster or Run permissions on a notebook can read cluster environment variables from within the notebook. Databricks does not recommend storing secrets in cluster environment variables if they must not be available to all users on the cluster.\n* Secrets *are not* redacted from the Spark driver log `stdout` and `stderr` streams. To protect sensitive data, by default, Spark driver logs are viewable only by users with CAN MANAGE permission on job, single user access mode, and shared access mode clusters. To allow users with CAN ATTACH TO or CAN RESTART permission to view the logs on these clusters, set the following Spark configuration property in the cluster configuration: `spark.databricks.acl.needAdminPermissionToViewLogs false`. \nOn No Isolation Shared access mode clusters, the Spark driver logs can be viewed by users with CAN ATTACH TO or CAN MANAGE permission. To limit who can read the logs to only users with the CAN MANAGE permission, set `spark.databricks.acl.needAdminPermissionToViewLogs` to `true`. \n### Requirements and limitations \nThe following requirements and limitations apply to referencing secrets in Spark configuration properties and environment variables: \n* Cluster owners must have CAN READ permission on the secret scope.\n* Only cluster owners can add a reference to a secret in a Spark configuration property or environment variable and edit the existing scope and name. Owners change a secret using the [Secrets API](https:\/\/docs.databricks.com\/api\/workspace\/secrets). 
You must restart your cluster to fetch the secret again.\n* Users with the CAN MANAGE permission on the cluster can delete a secret Spark configuration property or environment variable. \n### Syntax for referencing secrets in a Spark configuration property or environment variable \nYou can refer to a secret using any valid variable name or Spark configuration property. Databricks enables special behavior for variables referencing secrets based on the syntax of the value being set, not the variable name. \nThe syntax of the Spark configuration property or environment variable value must be `{{secrets\/<scope-name>\/<secret-name>}}`. The value must start with `{{secrets\/` and end with `}}`. \nThe variable portions of the Spark configuration property or environment variable are: \n* `<scope-name>`: The name of the scope with which the secret is associated.\n* `<secret-name>`: The unique name of the secret in the scope. \nFor example, `{{secrets\/scope1\/key1}}`. \nNote \n* There should be no spaces between the curly brackets. If there are spaces, they are treated as part of the scope or secret name. \n### Reference a secret with a Spark configuration property \nYou specify a reference to a secret in a [Spark configuration property](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration) in the following format: \n```\nspark.<property-name> {{secrets\/<scope-name>\/<secret-name>}}\n\n``` \nAny Spark configuration `<property-name>` can reference a secret. Each Spark configuration property can only reference one secret, but you can configure multiple Spark properties to reference secrets. 
\nFor example: \nYou set a Spark configuration to reference a secret: \n```\nspark.password {{secrets\/scope1\/key1}}\n\n``` \nTo fetch the secret in the notebook and use it: \n```\nspark.conf.get(\"spark.password\")\n\n``` \n```\nSELECT ${spark.password};\n\n``` \n### Reference a secret in an environment variable \nYou specify a secret path in an [environment variable](https:\/\/docs.databricks.com\/compute\/configure.html#environment-variables) in the following format: \n```\n<variable-name>={{secrets\/<scope-name>\/<secret-name>}}\n\n``` \nYou can use any valid variable name when you reference a secret. Access to secrets referenced in environment variables is determined by the permissions of the user who configured the cluster. Secrets stored in environment variables are accessible by all users of the cluster, but are redacted from plaintext display like secrets referenced elsewhere. \nEnvironment variables that reference secrets are accessible from a cluster-scoped init script. See [Set and use environment variables with init scripts](https:\/\/docs.databricks.com\/init-scripts\/environment-variables.html). \nFor example: \nYou set an environment variable to reference a secret: \n```\nSPARKPASSWORD={{secrets\/scope1\/key1}}\n\n``` \nTo fetch the secret in an init script, access `$SPARKPASSWORD` using the following pattern: \n```\nif [ -n \"$SPARKPASSWORD\" ]; then\n# code to use ${SPARKPASSWORD}\nfi\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/secrets\/secrets.html"} +{"content":"# Security and compliance guide\n## Secret management\n#### Secrets\n##### Manage secrets permissions\n\nThis section describes how to manage secret access control using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) (version 0.205 and above). 
You can also use the [Secrets API](https:\/\/docs.databricks.com\/api\/workspace\/secrets) or [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html). For secret permission levels, see [Secret ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#secrets). \n### Create a secret ACL \nTo create a secret ACL for a given secret scope using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) (version 0.205 and above): \n```\ndatabricks secrets put-acl <scope-name> <principal> <permission>\n\n``` \nMaking a put request for a principal that already has an applied permission overwrites the existing permission level. \nThe `principal` field specifies an existing Databricks principal. A user is specified using their email address, a service principal using its `applicationId` value, and a group using its group name. \n### View secret ACLs \nTo view all secret ACLs for a given secret scope: \n```\ndatabricks secrets list-acls <scope-name>\n\n``` \nTo get the secret ACL applied to a principal for a given secret scope: \n```\ndatabricks secrets get-acl <scope-name> <principal>\n\n``` \nIf no ACL exists for the given principal and scope, this request will fail. \n### Delete a secret ACL \nTo delete a secret ACL applied to a principal for a given secret scope: \n```\ndatabricks secrets delete-acl <scope-name> <principal>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/secrets\/secrets.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Microsoft SQL Server\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes how to set up Lakehouse Federation to run federated queries on SQL Server data that is not managed by Databricks. 
To learn more about Lakehouse Federation, see [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). \nTo connect to your SQL Server database using Lakehouse Federation, you must create the following in your Databricks Unity Catalog metastore: \n* A *connection* to your SQL Server database.\n* A *foreign catalog* that mirrors your SQL Server database in Unity Catalog so that you can use Unity Catalog query syntax and data governance tools to manage Databricks user access to the database.\n\n#### Run federated queries on Microsoft SQL Server\n##### Before you begin\n\nWorkspace requirements: \n* Workspace enabled for Unity Catalog. \nCompute requirements: \n* Network connectivity from your Databricks Runtime cluster or SQL warehouse to the target database systems. See [Networking recommendations for Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/networking.html).\n* Databricks clusters must use Databricks Runtime 13.3 LTS or above and shared or single-user access mode.\n* SQL warehouses must be Pro or Serverless. \nPermissions required: \n* To create a connection, you must be a metastore admin or a user with the `CREATE CONNECTION` privilege on the Unity Catalog metastore attached to the workspace.\n* To create a foreign catalog, you must have the `CREATE CATALOG` permission on the metastore and be either the owner of the connection or have the `CREATE FOREIGN CATALOG` privilege on the connection. \nAdditional permission requirements are specified in each task-based section that follows.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/sql-server.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Microsoft SQL Server\n##### Create a connection\n\nA connection specifies a path and credentials for accessing an external database system. 
To create a connection, you can use Catalog Explorer or the `CREATE CONNECTION` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** Metastore admin or user with the `CREATE CONNECTION` privilege. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **External Data** menu and select **Connections**.\n3. Click **Create connection**.\n4. Enter a user-friendly **Connection name**.\n5. Select a **Connection type** of **SQL Server**.\n6. Enter the following connection properties for your SQL Server instance. \n* **Host**\n* **Port**\n* **trustServerCertificate**: Defaults to `false`. When set to `true`, the transport layer uses SSL to encrypt the channel and bypasses the certificate chain to validate trust. Leave this set to the default unless you have a specific need to bypass trust validation.\n* **User**\n* **Password**\n7. (Optional) Click **Test connection** to confirm that it works.\n8. (Optional) Add a comment.\n9. Click **Create**. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nCREATE CONNECTION <connection-name> TYPE sqlserver\nOPTIONS (\nhost '<hostname>',\nport '<port>',\nuser '<user>',\npassword '<password>'\n);\n\n``` \nWe recommend that you use Databricks [secrets](https:\/\/docs.databricks.com\/security\/secrets\/index.html) instead of plaintext strings for sensitive values like credentials. 
For example: \n```\nCREATE CONNECTION <connection-name> TYPE sqlserver\nOPTIONS (\nhost '<hostname>',\nport '<port>',\nuser secret ('<secret-scope>','<secret-key-user>'),\npassword secret ('<secret-scope>','<secret-key-password>')\n)\n\n``` \nFor information about setting up secrets, see [Secret management](https:\/\/docs.databricks.com\/security\/secrets\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/sql-server.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Microsoft SQL Server\n##### Create a foreign catalog\n\nA foreign catalog mirrors a database in an external data system so that you can query and manage access to data in that database using Databricks and Unity Catalog. To create a foreign catalog, you use a connection to the data source that has already been defined. \nTo create a foreign catalog, you can use Catalog Explorer or the `CREATE FOREIGN CATALOG` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** `CREATE CATALOG` permission on the metastore and either ownership of the connection or the `CREATE FOREIGN CATALOG` privilege on the connection. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Click the **Create Catalog** button.\n3. On the **Create a new catalog** dialog, enter a name for the catalog and select a **Type** of **Foreign**.\n4. Select the **Connection** that provides access to the database that you want to mirror as a Unity Catalog catalog.\n5. Enter the name of the **Database** that you want to mirror as a catalog.\n6. Click **Create.** \nRun the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional. 
Replace the placeholder values: \n* `<catalog-name>`: Name for the catalog in Databricks.\n* `<connection-name>`: The [connection object](https:\/\/docs.databricks.com\/query-federation\/index.html#connection) that specifies the data source, path, and access credentials.\n* `<database-name>`: Name of the database you want to mirror as a catalog in Databricks. \n```\nCREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>\nOPTIONS (database '<database-name>');\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/sql-server.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Microsoft SQL Server\n##### Supported pushdowns\n\nThe following pushdowns are supported on all compute: \n* Filters\n* Projections\n* Limit\n* Functions: partial, only for filter expressions. (String functions, Mathematical functions, Date, Time and Timestamp functions, and other miscellaneous functions, such as Alias, Cast, SortOrder) \nThe following pushdowns are supported on Databricks Runtime 13.3 LTS and above, and on SQL warehouse compute: \n* Aggregates\n* The following Boolean operators: =, <, <=, >, >=, <=>\n* The following mathematical functions (not supported if ANSI is disabled): +, -, \\*, %, \/\n* The following miscellaneous operators: ^, |, ~\n* Sorting, when used with limit \nThe following pushdowns are not supported: \n* Joins\n* Window functions\n\n#### Run federated queries on Microsoft SQL Server\n##### Data type mappings\n\nWhen you read from SQL Server to Spark, data types map as follows: \n| SQL Server type | Spark type |\n| --- | --- |\n| bigint (unsigned), decimal, money, numeric, smallmoney | DecimalType |\n| smallint | ShortType |\n| int, tinyint | IntegerType |\n| bigint (if signed) | LongType |\n| real | FloatType |\n| float | DoubleType |\n| char, nchar, uniqueidentifier | CharType |\n| nvarchar, varchar | VarcharType |\n| text, xml | StringType |\n| 
binary, geography, geometry, image, timestamp, udt, varbinary | BinaryType |\n| bit | BooleanType |\n| date | DateType |\n| datetime, datetime2, smalldatetime, time | TimestampType\/TimestampNTZType | \n\\*When you read from SQL Server, SQL Server `datetimes` are mapped to Spark `TimestampType` if `preferTimestampNTZ = false` (default). SQL Server `datetimes` are mapped to `TimestampNTZType` if `preferTimestampNTZ = true`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/sql-server.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Export and import Databricks notebooks\n\nThis page describes how to import and export notebooks in Databricks and the notebook formats that Databricks supports.\n\n#### Export and import Databricks notebooks\n##### Supported notebook formats\n\nDatabricks can import and export notebooks in the following formats: \n* Source file: A file containing only source code statements with the extension `.scala`, `.py`, `.sql`, or `.r`.\n* HTML: A Databricks notebook with the extension `.html`.\n* Databricks `.dbc` archive.\n* IPython notebook: A [Jupyter notebook](https:\/\/jupyter-notebook.readthedocs.io\/en\/stable\/) with the extension `.ipynb`.\n* RMarkdown: An [R Markdown document](https:\/\/rmarkdown.rstudio.com\/) with the extension `.Rmd`.\n\n#### Export and import Databricks notebooks\n##### Import a notebook\n\nYou can import an external notebook from a URL or a file. You can also import a ZIP archive of notebooks [exported in bulk](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html#export-all-notebooks-in-a-folder) from a Databricks workspace. \n1. Click ![Workspace Icon](https:\/\/docs.databricks.com\/_images\/workspace-icon.png) **Workspace** in the sidebar. 
Do one of the following: \n* Right-click on a folder and select **Import**.\n* To import a notebook at the top level of the current workspace folder, click the kebab menu at the upper right and select **Import**.\n2. Specify the URL or browse to a file containing a supported external format or a ZIP archive of notebooks exported from a Databricks workspace.\n3. Click **Import**. \n* If you choose a single notebook, it is imported into the current folder.\n* If you choose a DBC or ZIP archive, its folder structure is recreated in the current folder and each notebook is imported.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Export and import Databricks notebooks\n##### Import a file and convert it to a notebook\n\nYou can convert Python, SQL, Scala, and R scripts to single-cell notebooks by adding a comment to the first cell of the file: \n```\n# Databricks notebook source\n\n``` \n```\n-- Databricks notebook source\n\n``` \n```\n\/\/ Databricks notebook source\n\n``` \n```\n# Databricks notebook source\n\n``` \nTo define cells in a script, use the special comment shown below. When you import the script to Databricks, cells are created as marked by the `COMMAND` lines shown. \n```\n# COMMAND ----------\n\n``` \n```\n-- COMMAND ----------\n\n``` \n```\n\/\/ COMMAND ----------\n\n``` \n```\n# COMMAND ----------\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Export and import Databricks notebooks\n##### Export notebooks\n\nNote \nWhen you export a notebook as HTML, IPython notebook (.ipynb), or archive (DBC), and you have not [cleared the command outputs](https:\/\/docs.databricks.com\/notebooks\/notebook-outputs.html#clear), the outputs are included in the export. 
\nTo export a notebook, select **File > Export** in the notebook toolbar and select the export format. \nTo export all folders in a workspace folder as a ZIP archive: \n1. Click ![Workspace Icon](https:\/\/docs.databricks.com\/_images\/workspace-icon.png) **Workspace** in the sidebar.\n2. Right-click the folder and select **Export**.\n3. Select the export format: \n* **DBC Archive**: Export a Databricks archive, a binary format that includes metadata and notebook command outputs.\n* **Source File**: Export a ZIP archive of notebook source files, which can be imported into a Databricks workspace, used in a CI\/CD pipeline, or viewed as source files in each notebook\u2019s default language. Notebook command outputs are not included.\n* **HTML Archive**: Export a ZIP archive of HTML files. Each notebook\u2019s HTML file can be imported into a Databricks workspace or viewed as HTML. Notebook command outputs are included.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html"} +{"content":"# Query data\n## Data format options\n#### Binary file\n\nDatabricks Runtime supports the [binary file](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-binaryFile.html) data source, which reads binary files and converts each file into a single\nrecord that contains the raw content and metadata of the file. The binary file data source produces a DataFrame with the following columns and possibly partition columns: \n* `path (StringType)`: The path of the file.\n* `modificationTime (TimestampType)`: The modification time of the file.\nIn some Hadoop FileSystem implementations, this parameter might be unavailable and the value would be set to a default value.\n* `length (LongType)`: The length of the file in bytes.\n* `content (BinaryType)`: The contents of the file. 
\nTo read binary files, specify the data source `format` as `binaryFile`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/binary.html"} +{"content":"# Query data\n## Data format options\n#### Binary file\n##### Images\n\nDatabricks recommends that you use the binary file data source to load image data. \nThe Databricks `display` function supports displaying image data loaded using the binary data source. \nIf all the loaded files have a file name with an [image extension](https:\/\/github.com\/AdoptOpenJDK\/openjdk-jdk11\/blob\/19fb8f93c59dfd791f62d41f332db9e306bc1422\/src\/java.base\/windows\/classes\/sun\/net\/www\/content-types.properties#L182-L251), image preview is automatically enabled: \n```\ndf = spark.read.format(\"binaryFile\").load(\"<path-to-image-dir>\")\ndisplay(df) # image thumbnails are rendered in the \"content\" column\n\n``` \n![image preview](https:\/\/docs.databricks.com\/_images\/binary-file-image-preview1.png) \nAlternatively, you can force the image preview functionality by using the `mimeType` option with a string value `\"image\/*\"` to annotate the binary column. Images are decoded based on their format information in the binary content. Supported image types are `bmp`, `gif`, `jpeg`, and `png`. Unsupported files appear as a broken image icon. 
\n```\ndf = spark.read.format(\"binaryFile\").option(\"mimeType\", \"image\/*\").load(\"<path-to-dir>\")\ndisplay(df) # unsupported files are displayed as a broken image icon\n\n``` \n![image preview with unsupported file type](https:\/\/docs.databricks.com\/_images\/binary-file-image-preview2.png) \nSee [Reference solution for image applications](https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/images-etl-inference.html) for the recommended workflow to handle image data.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/binary.html"} +{"content":"# Query data\n## Data format options\n#### Binary file\n##### Options\n\nTo load files with paths matching a given glob pattern while keeping the behavior of partition discovery,\nyou can use the `pathGlobFilter` option. The following code reads all JPG files from the\ninput directory with partition discovery: \n```\ndf = spark.read.format(\"binaryFile\").option(\"pathGlobFilter\", \"*.jpg\").load(\"<path-to-dir>\")\n\n``` \nIf you want to ignore partition discovery and recursively search files under the input directory,\nuse the `recursiveFileLookup` option. This option searches through nested directories\neven if their names *do not* follow a partition naming scheme like `date=2019-07-01`.\nThe following code reads all JPG files recursively from the input directory and ignores partition discovery: \n```\ndf = spark.read.format(\"binaryFile\") \\\n.option(\"pathGlobFilter\", \"*.jpg\") \\\n.option(\"recursiveFileLookup\", \"true\") \\\n.load(\"<path-to-dir>\")\n\n``` \nSimilar APIs exist for Scala, Java, and R. 
\nNote \nTo improve read performance when you load data back,\nDatabricks recommends turning off compression when you save data loaded from binary files: \n```\nspark.conf.set(\"spark.sql.parquet.compression.codec\", \"uncompressed\")\ndf.write.format(\"delta\").save(\"<path-to-table>\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/binary.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deploy models for batch inference and prediction\n#### Model inference example\n##### Model inference\n\nThis notebook uses an ElasticNet model trained on the diabetes dataset described in [Track scikit-learn model training with MLflow](https:\/\/docs.databricks.com\/mlflow\/tracking-ex-scikit.html). This notebook shows how to: \n* Select a model to deploy using the MLflow experiment UI\n* Load the trained model as a scikit-learn model\n* Create a PySpark UDF from the model\n* Apply the UDF to add a prediction column to a DataFrame \n### MLflow inference notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-quick-start-inference.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/model-example.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Connect to Amazon S3\n\nNote \nThis article describes legacy patterns for configuring access to S3. Databricks recommends using Unity Catalog to configure access to S3 and volumes for direct interaction with files. See [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html). 
\nThis article explains how to connect to AWS S3 from Databricks.\n\n#### Connect to Amazon S3\n##### Access S3 buckets using instance profiles\n\nYou can load IAM roles as instance profiles in Databricks and attach instance profiles to clusters to control data access to S3. Databricks recommends using instance profiles when Unity Catalog is unavailable for your environment or workload. For a tutorial on using instance profiles with Databricks, see [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html). \nThe AWS user who creates the IAM role must: \n* Be an AWS account user with permission to create or update IAM roles, IAM policies, S3 buckets, and cross-account trust relationships. \nThe Databricks user who adds the IAM role as an instance profile in Databricks must: \n* Be a workspace admin. \nOnce you add the instance profile to your workspace, you can grant users, groups, or service principals permission to launch clusters with the instance profile. See [Manage instance profiles in Databricks](https:\/\/docs.databricks.com\/admin\/workspace-settings\/manage-instance-profiles.html). \nUse both cluster access control and notebook access control together to protect access to the instance profile. See [Compute permissions](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions) and [Collaborate using Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-collaborate.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/amazon-s3.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Connect to Amazon S3\n##### Access S3 buckets with URIs and AWS keys\n\nYou can set Spark properties to configure AWS keys to access S3. \nDatabricks recommends using secret scopes for storing all credentials. 
You can grant users, service principals, and groups in your workspace access to read the secret scope. This protects the AWS key while allowing users to access S3. To create a secret scope, see [Secret scopes](https:\/\/docs.databricks.com\/security\/secrets\/secret-scopes.html). \nThe credentials can be scoped to either a cluster or a notebook. Use both cluster access control and notebook access control together to protect access to S3. See [Compute permissions](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions) and [Collaborate using Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-collaborate.html). \nTo set Spark properties, use the following snippet in a cluster\u2019s Spark configuration to set the AWS keys stored in [secret scopes as environment variables](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html#spark-conf-env-var): \n```\nAWS_SECRET_ACCESS_KEY={{secrets\/scope\/aws_secret_access_key}}\nAWS_ACCESS_KEY_ID={{secrets\/scope\/aws_access_key_id}}\n\n``` \nYou can then read from S3 using the following commands: \n```\naws_bucket_name = \"my-s3-bucket\"\n\ndf = spark.read.load(f\"s3a:\/\/{aws_bucket_name}\/flowers\/delta\/\")\ndisplay(df)\ndbutils.fs.ls(f\"s3a:\/\/{aws_bucket_name}\/\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/amazon-s3.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Connect to Amazon S3\n##### Access S3 with open-source Hadoop options\n\nDatabricks Runtime supports configuring the S3A filesystem using [open-source Hadoop options](https:\/\/hadoop.apache.org\/docs\/r3.1.3\/hadoop-aws\/tools\/hadoop-aws). You can configure global properties and per-bucket properties. 
\n### Global configuration \n```\n# Global S3 configuration\nspark.hadoop.fs.s3a.aws.credentials.provider <aws-credentials-provider-class>\nspark.hadoop.fs.s3a.endpoint <aws-endpoint>\nspark.hadoop.fs.s3a.server-side-encryption-algorithm SSE-KMS\n\n``` \n### Per-bucket configuration \nYou configure per-bucket properties using the syntax `spark.hadoop.fs.s3a.bucket.<bucket-name>.<configuration-key>`. This lets you set up buckets with different credentials, endpoints, and so on. \nFor example, in addition to global S3 settings you can configure each bucket individually using the following keys: \n```\n# Set up authentication and endpoint for a specific bucket\nspark.hadoop.fs.s3a.bucket.<bucket-name>.aws.credentials.provider <aws-credentials-provider-class>\nspark.hadoop.fs.s3a.bucket.<bucket-name>.endpoint <aws-endpoint>\n\n# Configure a different KMS encryption key for a specific bucket\nspark.hadoop.fs.s3a.bucket.<bucket-name>.server-side-encryption.key <aws-kms-encryption-key>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/amazon-s3.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Connect to Amazon S3\n##### Access Requester Pays buckets\n\nTo enable access to [Requester Pays](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/dev\/RequesterPaysBuckets.html) buckets, add the following line to your cluster\u2019s [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration): \n```\nspark.hadoop.fs.s3a.requester-pays.enabled true\n\n``` \nNote \nDatabricks does not support Delta Lake writes to Requester Pays buckets.\n\n#### Connect to Amazon S3\n##### Deprecated patterns for storing and accessing data from Databricks\n\nThe following are deprecated storage patterns: \n* Databricks no longer recommends mounting external data locations to Databricks Filesystem. 
See [Mounting cloud object storage on Databricks](https:\/\/docs.databricks.com\/dbfs\/mounts.html). \n* Databricks no longer recommends using credential passthrough with S3. See [Credential passthrough (legacy)](https:\/\/docs.databricks.com\/archive\/credential-passthrough\/index.html). \nImportant \n* The S3A filesystem enables caching by default and releases resources on `FileSystem.close()`. To avoid other threads incorrectly using a reference to the cached file system, do not explicitly call `FileSystem.close()`.\n* The S3A filesystem does not remove directory markers when closing an output stream. Legacy applications based on Hadoop versions that do not include [HADOOP-13230](https:\/\/issues.apache.org\/jira\/browse\/HADOOP-13230) can misinterpret them as empty directories even if there are files inside.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/amazon-s3.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n#### Provisioned throughput Foundation Model APIs\n###### Conduct your own LLM endpoint benchmarking\n\nThis article provides a Databricks-recommended notebook example for benchmarking an LLM endpoint. It also includes a brief introduction to how Databricks performs LLM inference and calculates latency and throughput as endpoint performance metrics. \nLLM inference on Databricks measures tokens per second for provisioned throughput mode for Foundation Model APIs. See [What do tokens per second ranges in provisioned throughput mean?](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/prov-throughput-tokens.html).\n\n###### Conduct your own LLM endpoint benchmarking\n####### Benchmarking example notebook\n\nYou can import the following notebook into your Databricks environment and specify the name of your LLM endpoint to run a load test. 
\n### Benchmarking an LLM endpoint \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/large-language-models\/llm-benchmarking.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/prov-throughput-run-benchmark.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n#### Provisioned throughput Foundation Model APIs\n###### Conduct your own LLM endpoint benchmarking\n####### LLM inference introduction\n\nLLMs perform inference in a two-step process: \n* **Prefill**, where the tokens in the input prompt are processed in parallel.\n* **Decoding**, where text is generated one token at a time in an auto-regressive manner. Each generated token is appended to the input and fed back into the model to generate the next token. Generation stops when the LLM outputs a special stop token or when a user-defined condition is met. \nMost production applications have a latency budget, and Databricks recommends you maximize throughput given that latency budget. \n* The number of input tokens has a substantial impact on the required memory to process requests.\n* The number of output tokens dominates overall response latency. \nDatabricks divides LLM inference into the following sub-metrics: \n* **Time to first token** (TTFT): This is how quickly users start seeing the model\u2019s output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.\n* **Time per output token** (TPOT): Time to generate an output token for each user that is querying the system. 
This metric corresponds with how each user perceives the \u201cspeed\u201d of the model. For example, a TPOT of 100 milliseconds per token would be 10 tokens per second, or ~450 words per minute, which is faster than a typical person can read. \nBased on these metrics, total latency and throughput can be defined as follows: \n* **Latency** = TTFT + (TPOT) \\* (the number of tokens to be generated)\n* **Throughput** = number of output tokens per second across all concurrent requests \nOn Databricks, LLM serving endpoints are able to scale to match the load sent by clients with multiple concurrent requests. There is a trade-off between latency and throughput. This is because, on LLM serving endpoints, concurrent requests are processed at the same time. At low concurrent request loads, latency is the lowest possible. However, if you increase the request load, latency might go up, but throughput likely also goes up. This is because two requests\u2019 worth of tokens can be processed in less than double the time. \nTherefore, controlling the number of parallel requests into your system is core to balancing latency with throughput. If you have a low-latency use case, you want to send fewer concurrent requests to the endpoint to keep latency low. 
If you have a high-throughput use case, you want to saturate the endpoint with many concurrent requests, since higher throughput is worth it even at the expense of latency.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/prov-throughput-run-benchmark.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n#### Provisioned throughput Foundation Model APIs\n###### Conduct your own LLM endpoint benchmarking\n####### Databricks benchmarking harness\n\nThe previously shared [benchmarking example notebook](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/prov-throughput-run-benchmark.html#benchmark) is Databricks\u2019 benchmarking harness. The notebook displays the latency and throughput metrics, and plots the throughput versus latency curve across different numbers of parallel requests. Databricks endpoint autoscaling is based on a \u201cbalanced\u201d strategy between latency and throughput. In the notebook, you observe that as more concurrent users query the endpoint at the same time, both latency and throughput go up. 
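The latency and throughput definitions used for LLM serving endpoints can be made concrete with a short calculation. The following is a minimal sketch: the helper names are hypothetical, and the TTFT, TPOT, token-count, and concurrency values are illustrative assumptions, not measurements from any endpoint.

```python
def total_latency_s(ttft_s, tpot_s, output_tokens):
    # Latency = TTFT + TPOT * (number of tokens to be generated)
    return ttft_s + tpot_s * output_tokens


def throughput_tokens_per_s(output_tokens, concurrent_requests, latency_s):
    # Output tokens per second, summed across all concurrent requests
    return output_tokens * concurrent_requests / latency_s


# Assumed values: one request at low load, eight concurrent requests at high load.
low_load_latency = total_latency_s(ttft_s=0.5, tpot_s=0.05, output_tokens=200)
high_load_latency = total_latency_s(ttft_s=1.0, tpot_s=0.08, output_tokens=200)

low_load_throughput = throughput_tokens_per_s(200, 1, low_load_latency)
high_load_throughput = throughput_tokens_per_s(200, 8, high_load_latency)

print(f'low load:  {low_load_latency:.1f} s latency, {low_load_throughput:.1f} tokens/s')
print(f'high load: {high_load_latency:.1f} s latency, {high_load_throughput:.1f} tokens/s')
```

With these assumed numbers, per-request latency is higher under the heavier load, but aggregate throughput is higher too, which mirrors the latency and throughput trade-off for concurrent requests.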
\n![Throughput-Latency Graph](https:\/\/docs.databricks.com\/_images\/llm-throughput-latency.png) \nMore details on the Databricks philosophy about LLM performance benchmarking are described in the [LLM Inference Performance Engineering: Best Practices blog](https:\/\/www.databricks.com\/blog\/llm-inference-performance-engineering-best-practices).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/prov-throughput-run-benchmark.html"} +{"content":"# What is data warehousing on Databricks?\n## Get started with data warehousing using Databricks SQL\n#### Databricks SQL concepts\n\nThis article introduces the set of fundamental concepts you need to understand in order to use Databricks SQL effectively.\n\n#### Databricks SQL concepts\n##### Interface\n\nThis section describes the interfaces that Databricks supports for accessing your Databricks SQL assets: UI and API. \n**UI**: A graphical interface to the workspace browser, dashboards and queries, SQL warehouses, query history, and alerts. \n**[REST API](https:\/\/docs.databricks.com\/api\/workspace)**: An interface that allows you to automate tasks on Databricks SQL objects. \nImportant \nYou can also attach a notebook to a SQL warehouse. See [Notebooks and SQL warehouses](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#notebook-sql-warehouse) for more information and limitations.\n\n#### Databricks SQL concepts\n##### Data management\n\n**[Visualization](https:\/\/docs.databricks.com\/sql\/user\/visualizations\/index.html)**: A graphical presentation of the result of running a query. \n**[Dashboard](https:\/\/docs.databricks.com\/dashboards\/index.html)**: A presentation of query visualizations and commentary. 
\n**[Alert](https:\/\/docs.databricks.com\/sql\/user\/alerts\/index.html)**: A notification that a field returned by a query has reached a threshold.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/get-started\/concepts.html"} +{"content":"# What is data warehousing on Databricks?\n## Get started with data warehousing using Databricks SQL\n#### Databricks SQL concepts\n##### Computation management\n\nThis section describes concepts that you need to know to run SQL queries in Databricks SQL. \n**[Query](https:\/\/docs.databricks.com\/sql\/user\/queries\/index.html)**: A valid SQL statement. \n**[SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html)**: A compute resource on which you execute SQL queries. \n**[Query history](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html)**: A list of executed queries and their performance characteristics.\n\n#### Databricks SQL concepts\n##### Authentication and authorization\n\nThis section describes concepts that you need to know when you manage Databricks SQL users and groups and their access to assets. \n**User and group**: A user is a unique individual who has access to the system. A group is a collection of users. \n**[Personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html)**: An opaque string used to authenticate to the REST API and used by tools listed in [Technology partners](https:\/\/docs.databricks.com\/integrations\/index.html) to connect to SQL warehouses. \n**[Access control list](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html)**: A set of permissions attached to a principal that requires access to an object. An ACL entry specifies the object and the actions allowed on the object. Each entry in an ACL specifies a principal, action type, and object. 
\n**[Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html)**: Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/get-started\/concepts.html"} +{"content":"# AI and Machine Learning on Databricks\n### Use Ray on Databricks\n\nWith Ray 2.3.0 and above, you can create Ray clusters and run Ray applications on Apache Spark clusters with Databricks. For information about getting started with machine learning on Ray, including tutorials and examples, see the [Ray documentation](https:\/\/docs.ray.io\/en\/latest\/). For more information about the Ray and Apache Spark integration, see the [Ray on Spark API documentation](https:\/\/docs.ray.io\/en\/latest\/cluster\/vms\/user-guides\/community\/spark.html#ray-on-spark-apis).\n\n### Use Ray on Databricks\n#### Requirements\n\n* Databricks Runtime 12.2 LTS ML and above.\n* Databricks Runtime cluster access mode must be either \u201cAssigned\u201d mode or \u201cNo isolation shared\u201d mode.\n\n### Use Ray on Databricks\n#### Install Ray\n\nUse the following command to install Ray. The `[default]` extra is required by the Ray dashboard component. \n```\n%pip install ray[default]>=2.3.0\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ray-integration.html"} +{"content":"# AI and Machine Learning on Databricks\n### Use Ray on Databricks\n#### Create a user-specific Ray cluster in a Databricks cluster\n\nTo create a Ray cluster, use the [ray.util.spark.setup\\_ray\\_cluster](https:\/\/docs.ray.io\/en\/latest\/cluster\/vms\/user-guides\/community\/spark.html#ray.util.spark.setup_ray_cluster) API. 
\nIn any Databricks notebook that is attached to a Databricks cluster, you can run the following command: \n```\nfrom ray.util.spark import setup_ray_cluster, shutdown_ray_cluster\n\nsetup_ray_cluster(\nnum_worker_nodes=2,\nnum_cpus_worker_node=4,\ncollect_log_to_path=\"\/dbfs\/path\/to\/ray_collected_logs\"\n)\n\n``` \nThe `ray.util.spark.setup_ray_cluster` API creates a Ray cluster on Spark. Internally, it creates a background Spark job. Each Spark task in the job creates a Ray worker node, and the Ray head node is created on the driver. The argument `num_worker_nodes` represents the number of Ray worker nodes to create. To specify the number of CPU or GPU cores assigned to each Ray worker node, set the argument `num_cpus_worker_node` (default value: 1) or `num_gpus_worker_node` (default value: 0). \nAfter a Ray cluster is created, you can run any Ray application code directly in your notebook. Click **Open Ray Cluster Dashboard in a new tab** to view the Ray dashboard for the cluster. \nTip \nIf you\u2019re using a Databricks single user cluster, you can set `num_worker_nodes` to `ray.util.spark.MAX_NUM_WORKER_NODES` to use all available resources for your Ray cluster. \n```\nsetup_ray_cluster(\n# ...\nnum_worker_nodes=ray.util.spark.MAX_NUM_WORKER_NODES,\n)\n\n``` \nSet the argument `collect_log_to_path` to specify the destination path where you want to collect the Ray cluster logs. Log collection runs after the Ray cluster is shut down. Databricks recommends that you set a path starting with `\/dbfs\/` so that the logs are preserved even if you terminate the Spark cluster. Otherwise, your logs are not recoverable since the local storage on the cluster is deleted when the cluster is shut down. 
\nNote \nCalling `ray.util.spark.setup_ray_cluster` sets the `RAY_ADDRESS` environment variable to the address of the Ray cluster, so your Ray application automatically uses the Ray cluster that was created. You can specify an alternative cluster address using the `address` argument of the [ray.init](https:\/\/docs.ray.io\/en\/latest\/ray-core\/package-ref.html?highlight=init#ray-init) API.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ray-integration.html"} +{"content":"# AI and Machine Learning on Databricks\n### Use Ray on Databricks\n#### Run a Ray application\n\nAfter the Ray cluster has been created, you can run any Ray application code in a Databricks notebook. \nImportant \nDatabricks recommends that you install any necessary libraries for your application with `%pip install <your-library-dependency>` to ensure they are available to both your Ray cluster and your application. Specifying dependencies in the `ray.init` function call installs the dependencies in a location inaccessible to the Spark worker nodes, which results in version incompatibilities and import errors. 
\nFor example, you can run a simple Ray application in a Databricks notebook as follows: \n```\nimport ray\nimport random\nimport time\nfrom fractions import Fraction\n\nray.init()\n\n@ray.remote\ndef pi4_sample(sample_count):\n    \"\"\"pi4_sample runs sample_count experiments, and returns the\n    fraction of time it was inside the circle.\n    \"\"\"\n    in_count = 0\n    for i in range(sample_count):\n        x = random.random()\n        y = random.random()\n        if x*x + y*y <= 1:\n            in_count += 1\n    return Fraction(in_count, sample_count)\n\nSAMPLE_COUNT = 1000 * 1000\nstart = time.time()\nfuture = pi4_sample.remote(sample_count=SAMPLE_COUNT)\npi4 = ray.get(future)\nend = time.time()\ndur = end - start\nprint(f'Running {SAMPLE_COUNT} tests took {dur} seconds')\n\npi = pi4 * 4\nprint(float(pi))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ray-integration.html"} +{"content":"# AI and Machine Learning on Databricks\n### Use Ray on Databricks\n#### Create a Ray cluster in autoscaling mode\n\nIn Ray 2.8.0 and above, Ray clusters started on Databricks support integration with Databricks autoscaling. See [Databricks cluster autoscaling](https:\/\/docs.databricks.com\/compute\/configure.html#autoscaling). \nWith Ray 2.8.0 and above, you can create a Ray cluster on a Databricks cluster that supports scaling up or down according to workloads. This autoscaling integration triggers Databricks cluster autoscaling internally within the Databricks environment. \nTo enable autoscaling, run the following command: \n```\nfrom ray.util.spark import setup_ray_cluster\n\nsetup_ray_cluster(\n    num_worker_nodes=8,\n    autoscale=True,\n    # ... other arguments\n)\n\n``` \nIf autoscaling is enabled, `num_worker_nodes` indicates the maximum number of Ray worker nodes. The default minimum number of Ray worker nodes is 0. This default setting means that when the Ray cluster is idle, it scales down to zero Ray worker nodes. 
This might slow the response to new work in some scenarios, but it can greatly reduce costs. \nIn autoscaling mode, `num_worker_nodes` cannot be set to `ray.util.spark.MAX_NUM_WORKER_NODES`. \nThe following arguments configure the upscaling and downscaling speed: \n* `autoscale_upscaling_speed` represents the number of nodes allowed to be pending as a multiple of the current number of nodes. The higher the value, the more aggressive the upscaling. For example, if this is set to 1.0, the cluster can grow in size by at most 100% at any time.\n* `autoscale_idle_timeout_minutes` represents the number of minutes that need to pass before an idle worker node is removed by the autoscaler. The smaller the value, the more aggressive the downscaling. \nWith Ray 2.9.0 and above, you can also set `autoscale_min_worker_nodes` to prevent the Ray cluster from scaling down to zero workers when the Ray cluster is idle.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ray-integration.html"} +{"content":"# AI and Machine Learning on Databricks\n### Use Ray on Databricks\n#### Connect to remote Ray cluster using Ray client\n\nIn Ray 2.9.3, create a Ray cluster by calling the `setup_ray_cluster` API. In the same notebook, call the `ray.init()` API to connect to this Ray cluster. \nFor a Ray cluster that is not in global mode, get the remote connection string with the following code: \n```\nfrom ray.util.spark import setup_ray_cluster\n\n_, remote_conn_str = setup_ray_cluster(\n    num_worker_nodes=2,\n    # ... other arguments\n)\n\n``` \nConnect to the remote cluster using this remote connection string: \n```\nimport ray\nray.init(remote_conn_str)\n\n``` \nThe Ray client does not support the Ray dataset API defined in the `ray.data` module. 
As a workaround, you can wrap your code that calls the Ray dataset API inside a remote Ray task, as shown in the following code: \n```\nimport ray\nimport pandas as pd\nray.init(\"ray:\/\/<ray_head_node_ip>:10001\")\n\n@ray.remote\ndef ray_data_task():\n    p1 = pd.DataFrame({'a': [3,4] * 10000, 'b': [5,6] * 10000})\n    ds = ray.data.from_pandas(p1)\n    return ds.repartition(4).to_pandas()\n\nray.get(ray_data_task.remote())\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ray-integration.html"} +{"content":"# AI and Machine Learning on Databricks\n### Use Ray on Databricks\n#### Load data from a Spark DataFrame\n\nTo load a Spark DataFrame as a Ray Dataset, you must first save the Spark DataFrame to UC volumes or Databricks Filesystem (deprecated) in Parquet format. To control Databricks Filesystem access securely, Databricks recommends that you [mount cloud object storage to DBFS](https:\/\/docs.databricks.com\/dbfs\/mounts.html). Then, you can create a `ray.data.Dataset` instance from the saved Spark DataFrame path using the following helper method: \n```\nimport ray\nimport os\nfrom urllib.parse import urlparse\n\ndef create_ray_dataset_from_spark_dataframe(spark_dataframe, dbfs_tmp_path):\n    spark_dataframe.write.mode('overwrite').parquet(dbfs_tmp_path)\n    fuse_path = \"\/dbfs\" + urlparse(dbfs_tmp_path).path\n    return ray.data.read_parquet(fuse_path)\n\n# For example, read a Delta Table as a Spark DataFrame\nspark_df = spark.read.table(\"diviner_demo.diviner_pedestrians_data_500\")\n\n# Provide a dbfs location to write the table to\ndata_location_2 = (\n    \"dbfs:\/home\/example.user@databricks.com\/data\/ray_test\/test_data_2\"\n)\n\n# Convert the Spark DataFrame to a Ray dataset\nray_dataset = create_ray_dataset_from_spark_dataframe(\n    spark_dataframe=spark_df,\n    dbfs_tmp_path=data_location_2\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ray-integration.html"} +{"content":"# AI and Machine Learning on Databricks\n### Use Ray 
on Databricks\n#### Load data from a Unity Catalog table through Databricks SQL warehouse\n\nFor Ray 2.8.0 and above, you can call the `ray.data.read_databricks_tables` API to load data from a Databricks Unity Catalog table. \nFirst, you need to set the `DATABRICKS_TOKEN` environment variable to your Databricks warehouse access token. If you\u2019re not running your program on Databricks Runtime, set the `DATABRICKS_HOST` environment variable to the Databricks workspace URL, as shown in the following: \n```\nexport DATABRICKS_HOST=adb-<workspace-id>.<random-number>.azuredatabricks.net\n\n``` \nThen, call `ray.data.read_databricks_tables()` to read from the Databricks SQL warehouse. \n```\nimport ray\n\nray_dataset = ray.data.read_databricks_tables(\nwarehouse_id='...', # Databricks SQL warehouse ID\ncatalog='catalog_1', # Unity catalog name\nschema='db_1', # Schema name\nquery=\"SELECT title, score FROM movie WHERE year >= 1980\",\n)\n\n```\n\n### Use Ray on Databricks\n#### Configure resources used by Ray head node\n\nBy default, for the Ray on Spark configuration, Databricks restricts resources allocated to the Ray head node to: \n* 0 CPU cores\n* 0 GPUs\n* 128 MB heap memory\n* 128 MB object store memory \nThis is because the Ray head node is usually used for global coordination, not for executing Ray tasks. The Spark driver node resources are shared with multiple users, so the default setting saves resources on the Spark driver side. \nWith Ray 2.8.0 and above, you can configure resources used by the Ray head node. 
Use the following arguments in the `setup_ray_cluster` API: \n* `num_cpus_head_node`: the number of CPU cores used by the Ray head node\n* `num_gpus_head_node`: the number of GPUs used by the Ray head node\n* `object_store_memory_head_node`: the object store memory size used by the Ray head node\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ray-integration.html"} +{"content":"# AI and Machine Learning on Databricks\n### Use Ray on Databricks\n#### Support for heterogeneous clusters\n\nFor more efficient and cost-effective training runs, you can create a Ray on Spark cluster and set different configurations between the Ray head node and Ray worker nodes. However, all Ray worker nodes must have the same configuration. Databricks clusters do not fully support heterogeneous clusters, but you can create a Databricks cluster with different driver and worker instance types by setting a cluster policy. \nFor example: \n```\n{\n\"node_type_id\": {\n\"type\": \"fixed\",\n\"value\": \"i3.xlarge\"\n},\n\"driver_node_type_id\": {\n\"type\": \"fixed\",\n\"value\": \"g4dn.xlarge\"\n},\n\"spark_version\": {\n\"type\": \"fixed\",\n\"value\": \"13.x-snapshot-gpu-ml-scala2.12\"\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ray-integration.html"} +{"content":"# AI and Machine Learning on Databricks\n### Use Ray on Databricks\n#### Tune the Ray cluster configuration\n\nThe recommended configuration for each Ray worker node is: \n* Minimum 4 CPU cores per Ray worker node.\n* Minimum 10 GB heap memory for each Ray worker node. \nWhen calling `ray.util.spark.setup_ray_cluster`, Databricks recommends setting `num_cpus_worker_node` to a value >= `4`. \nSee [Memory allocation for Ray worker nodes](https:\/\/docs.databricks.com\/machine-learning\/ray-integration.html#ray-memory) for details about tuning heap memory for each Ray worker node. \n### Memory allocation for Ray worker nodes \nEach Ray worker node uses two types of memory: heap memory and object store memory. 
The allocated memory size for each type is determined as described below. \nThe total memory allocated to each Ray worker node is: \n```\nRAY_WORKER_NODE_TOTAL_MEMORY = (SPARK_WORKER_NODE_PHYSICAL_MEMORY \/ MAX_NUMBER_OF_LOCAL_RAY_WORKER_NODES * 0.8)\n\n``` \n`MAX_NUMBER_OF_LOCAL_RAY_WORKER_NODES` is the maximum number of Ray worker nodes that can be launched on the Spark worker node. This is determined by the argument `num_cpus_worker_node` or `num_gpus_worker_node`. \nIf you do not set the argument `object_store_memory_per_node`, then the heap memory size and\nobject store memory size allocated to each Ray worker node are: \n```\nRAY_WORKER_NODE_HEAP_MEMORY = RAY_WORKER_NODE_TOTAL_MEMORY * 0.7\nOBJECT_STORE_MEMORY_PER_NODE = RAY_WORKER_NODE_TOTAL_MEMORY * 0.3\n\n``` \nIf you do set the argument `object_store_memory_per_node`: \n```\nRAY_WORKER_NODE_HEAP_MEMORY = RAY_WORKER_NODE_TOTAL_MEMORY - argument_object_store_memory_per_node\n\n``` \nIn addition, the object store memory size per Ray worker node is limited by the shared memory of the operating system. The maximum value is: \n```\nOBJECT_STORE_MEMORY_PER_NODE_CAP = (SPARK_WORKER_NODE_OS_SHARED_MEMORY \/ MAX_NUMBER_OF_LOCAL_RAY_WORKER_NODES * 0.8)\n\n``` \n`SPARK_WORKER_NODE_OS_SHARED_MEMORY` is the `\/dev\/shm` disk size configured for the Spark worker node.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ray-integration.html"} +{"content":"# AI and Machine Learning on Databricks\n### Use Ray on Databricks\n#### Best practices\n\n### How to set CPU \/ GPU number for each Ray worker node ? \nDatabricks recommends setting `num_cpus_worker_node` to the number of CPU cores per Spark worker node and setting `num_gpus_worker_node` to the number of GPUs per Spark worker node. In this config, each Spark worker node launches one Ray worker node that fully utilizes the resources of the Spark worker node. \n### GPU cluster configuration \nThe Ray cluster runs on top of a Databricks Spark cluster. 
A common scenario is to use a Spark job and Spark UDFs for simple data preprocessing tasks that do not need GPU resources, and then use Ray to execute complicated machine learning tasks that benefit from GPUs. In this case, Databricks recommends setting the Spark cluster-level configuration parameter `spark.task.resource.gpu.amount` to `0`, so that Spark DataFrame transformations and Spark UDF executions do not use GPU resources. \nThe benefits of this configuration are the following: \n* It increases Spark job parallelism, because the GPU instance type usually has many more CPU cores than GPU devices.\n* If the Spark cluster is shared with multiple users, this configuration prevents Spark jobs from competing for GPU resources with concurrently running Ray workloads. \n### Disable `transformers` trainer MLflow integration if using it in Ray tasks \nThe `transformers` trainer MLflow integration is on by default. If you use Ray Train to train a `transformers` model, the Ray task fails because the Databricks MLflow service credential is not configured for Ray tasks. \nTo avoid this issue, set the `DISABLE_MLFLOW_INTEGRATION` environment variable to `TRUE` in the Databricks cluster config. For information on logging to MLflow in your Ray trainer tasks, see the section \u201cUsing MLflow in Ray tasks\u201d. \n### Address Ray remote function pickling error \nTo execute Ray tasks, Ray uses pickle to serialize the task function. If pickling fails, determine the line(s) in your code where the failure occurs. Often, moving `import` statements into the task function addresses common pickling errors. For example, `datasets.load_dataset` is a widely used function that happens to be patched within Databricks Runtime, potentially rendering an external import unpicklable. 
To correct this issue, you can update your code like this: \n```\ndef ray_task_func():\n    from datasets import load_dataset  # import the function inside the task function\n    ...\n\n``` \n### Disable Ray memory monitor if the Ray task is unexpectedly killed with OOM error \nIn Ray 2.9.3, the Ray memory monitor has known issues that cause Ray tasks to be erroneously killed. \nTo address the issue, disable the Ray memory monitor by setting the environment variable `RAY_memory_monitor_refresh_ms` to `0` in the Databricks cluster config. \n### Memory resource configuration for Spark and Ray hybrid workloads \nIf you run hybrid Spark and Ray workloads in a Databricks cluster, Databricks recommends that you reduce Spark executor memory to a small value, for example by setting `spark.executor.memory 4g` in the Databricks cluster config. The Spark executor runs within a Java process that triggers garbage collection (GC) lazily, and Spark dataset caching puts high pressure on memory, which reduces the memory available to Ray. To avoid potential OOM errors, Databricks recommends that you set `spark.executor.memory` to a value smaller than the default. \n### Computation resource configuration for Spark and Ray hybrid workloads \nIf you run hybrid Spark and Ray workloads in a Databricks cluster, enable autoscaling for the Spark cluster nodes, for the Ray worker nodes, or for both. \nFor example, if you have a fixed number of worker nodes in a Databricks cluster, consider enabling Ray-on-Spark autoscaling, so that when there is no Ray workload running, the Ray cluster scales down. As a result, the idle cluster resources are released so that Spark jobs can use them. \nWhen the Spark job completes and a Ray job starts, the Ray-on-Spark cluster scales up to meet the processing demands. \nYou can also make both the Databricks cluster and the Ray-on-Spark cluster auto-scalable. 
Specifically, you can configure the Databricks cluster to autoscale to a maximum of 10 nodes and the Ray-on-Spark cluster to a maximum of 4 worker nodes (with one Ray worker node per Spark worker node), leaving Spark free to allocate up to 6 nodes for Spark tasks. This means that Ray workloads can use the resources of at most 4 nodes at the same time, while the Spark job can use the resources of at most 6 nodes. \n### Applying a transformation function to batches of data \nWhen processing data in batches, Databricks recommends that you use the Ray Data API with the `map_batches` function. This approach can be more efficient and scalable, especially for large datasets or when performing complex computations that benefit from batch processing. Any Spark DataFrame can be converted to a Ray dataset using the `ray.data.from_spark` API, and can be written out to a Databricks Unity Catalog table using the API `ray.data.write_databricks_table`. \n### Using MLflow in Ray tasks \nTo use MLflow in Ray tasks, configure the following: \n* Databricks MLflow credentials in Ray tasks\n* MLflow runs on the Spark driver side that pass the generated `run_id` values to Ray tasks. 
\nThe following code is an example: \n```\nimport mlflow\nimport ray\nfrom mlflow.utils.databricks_utils import get_databricks_env_vars\nmlflow_db_creds = get_databricks_env_vars(\"databricks\")\n\nexperiment_name = \"\/Users\/<your-name>@databricks.com\/mlflow_test\"\nmlflow.set_experiment(experiment_name)\n\n@ray.remote\ndef ray_task(x, run_id):\n    import os\n    os.environ.update(mlflow_db_creds)\n    mlflow.set_experiment(experiment_name)\n    # We need to use the run created in Spark driver side,\n    # and set `nested=True` to make it a nested run inside the\n    # parent run.\n    with mlflow.start_run(run_id=run_id, nested=True):\n        mlflow.log_metric(f\"task_{x}_metric\", x)\n    return x\n\nwith mlflow.start_run() as run:  # create MLflow run in Spark driver side.\n    results = ray.get([ray_task.remote(x, run.info.run_id) for x in range(10)])\n\n``` \n### Using notebook-scoped Python libraries or cluster Python libraries in Ray tasks \nCurrently, a known issue in Ray prevents Ray tasks from using notebook-scoped Python libraries or cluster Python libraries. To address this limitation, run the following command in your notebook prior to launching a Ray-on-Spark cluster: \n```\n%pip install ray==<The Ray version you want to use> --force-reinstall\n\n``` \nThen run the following command in your notebook to restart the Python kernel: \n```\ndbutils.library.restartPython()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ray-integration.html"} +{"content":"# AI and Machine Learning on Databricks\n### Use Ray on Databricks\n#### Enable stack traces and flame graphs on the Ray Dashboard Actors page\n\nOn the Ray **Dashboard Actors** page, you can view stack traces and flame graphs for active Ray actors. 
\nTo view this information, install `py-spy` before starting the Ray cluster: \n```\n%pip install py-spy\n\n```\n\n### Use Ray on Databricks\n#### Shut down a Ray cluster\n\nTo shut down a Ray cluster running on Databricks, call the [ray.util.spark.shutdown\\_ray\\_cluster](https:\/\/docs.ray.io\/en\/master\/cluster\/vms\/user-guides\/community\/spark.html#ray.util.spark.shutdown_ray_cluster) API. \nNote \nRay clusters also shut down when: \n* You detach your interactive notebook from your Databricks cluster.\n* Your Databricks [job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) completes.\n* Your Databricks cluster is restarted or terminated.\n* There\u2019s no activity for the specified idle time.\n\n### Use Ray on Databricks\n#### Example notebook\n\nThe following notebook demonstrates how to create a Ray cluster and run a Ray application on Databricks. \n### Ray on Spark starter notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/ray-starter-notebook.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ray-integration.html"} +{"content":"# AI and Machine Learning on Databricks\n### Use Ray on Databricks\n#### Limitations\n\n* Multi-user shared Databricks clusters (isolation mode enabled) are not supported.\n* When using %pip to install packages, the Ray cluster will shut down. Make sure to start Ray after you\u2019re done installing all of your libraries with %pip.\n* Using integrations that override the configuration from `ray.util.spark.setup_ray_cluster` can cause the Ray cluster to become unstable and can crash the Ray context. 
For example, using the `xgboost_ray` package and setting `RayParams` with an actor or `cpus_per_actor` configuration in excess of the Ray cluster configuration can silently crash the Ray cluster.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ray-integration.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Develop code in Databricks notebooks\n\nThis page describes how to develop code in Databricks notebooks, including autocomplete, automatic formatting for Python and SQL, combining Python and SQL in a notebook, and tracking the notebook version history. \nFor more details about advanced functionality available with the editor, such as autocomplete, variable selection, multi-cursor support, and side-by-side diffs, see [Use the Databricks notebook and file editor](https:\/\/docs.databricks.com\/notebooks\/notebook-editor.html). \nWhen you use the notebook or the file editor, Databricks Assistant is available to help you generate, explain, and debug code. See [Use Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/use-databricks-assistant.html) for more information. \nDatabricks notebooks also include a built-in interactive debugger for Python notebooks. See [Use the Databricks interactive debugger](https:\/\/docs.databricks.com\/notebooks\/debugger.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Develop code in Databricks notebooks\n##### Get coding help from Databricks Assistant\n\nDatabricks Assistant is a context-aware AI assistant that you can interact with using a conversational interface, making you more productive inside Databricks. You can describe your task in English and let the assistant generate Python code or SQL queries, explain complex code, and automatically fix errors. 
The assistant uses Unity Catalog metadata to understand your tables, columns, descriptions, and popular data assets across your company to provide personalized responses. \nDatabricks Assistant can help you with the following tasks: \n* Generate code.\n* Debug code, including identifying and suggesting fixes for errors.\n* Transform and optimize code.\n* Explain code.\n* Help you find relevant information in the Databricks documentation. \nFor information about using Databricks Assistant to help you code more efficiently, see [Use Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/use-databricks-assistant.html). For general information about Databricks Assistant, see [DatabricksIQ-powered features](https:\/\/docs.databricks.com\/databricksiq\/index.html).\n\n#### Develop code in Databricks notebooks\n##### Access notebook for editing\n\nTo open a notebook, use the workspace [Search function](https:\/\/docs.databricks.com\/search\/index.html) or use the workspace browser to [navigate to the notebook](https:\/\/docs.databricks.com\/workspace\/workspace-objects.html) and click on the notebook\u2019s name or icon.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Develop code in Databricks notebooks\n##### Browse data\n\nUse the schema browser to explore tables and volumes available for the notebook. Click ![notebook data icon](https:\/\/docs.databricks.com\/_images\/notebook-data-icon.png) at the left side of the notebook to open the schema browser. \nThe **For you** button displays only those tables that you\u2019ve used in the current session or previously marked as a Favorite. \nAs you type text into the **Filter** box, the display changes to show only those items that contain the text you type. Only items that are currently open or have been opened in the current session appear. 
The **Filter** box does not do a complete search of the catalogs, schemas, and tables available for the notebook. \nTo open the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu, hover the cursor over the item\u2019s name as shown: \n![kebab menu in schema browser](https:\/\/docs.databricks.com\/_images\/schema-browser-kebab.png) \nIf the item is a table, you can do the following: \n* Automatically create and run a cell to display a preview of the data in the table. Select **Preview in a new cell** from the kebab menu for the table.\n* View a catalog, schema, or table in Catalog Explorer. Select **Open in Catalog Explorer** from the kebab menu. A new tab opens showing the selected item.\n* Get the path to a catalog, schema, or table. Select **Copy \u2026 path** from the kebab menu for the item.\n* Add a table to Favorites. Select **Add table to favorites** from the kebab menu for the table. \nIf the item is a catalog, schema, or volume, you can copy the item\u2019s path or open it in Catalog Explorer. \nTo insert a table or column name directly into a cell: \n1. Click your cursor in the cell at the location you want to enter the name.\n2. Move your cursor over the table name or column name in the schema browser.\n3. Click the double arrow ![double arrow](https:\/\/docs.databricks.com\/_images\/schema-browser-double-arrow.png)that appears at the right of the item\u2019s name.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Develop code in Databricks notebooks\n##### Keyboard shortcuts\n\nTo display keyboard shortcuts, select **Help > Keyboard shortcuts**. 
The keyboard shortcuts available depend on whether the cursor is in a code cell (edit mode) or not (command mode).\n\n#### Develop code in Databricks notebooks\n##### Find and replace text\n\nTo find and replace text within a notebook, select **Edit > Find and Replace**. The current match is highlighted in orange and all other matches are highlighted in yellow. \n![Matching text](https:\/\/docs.databricks.com\/_images\/find-replace-example.png) \nTo replace the current match, click **Replace**. To replace all matches in the notebook, click **Replace All**. \nTo move between matches, click the **Prev** and **Next** buttons. You can also press\n**shift+enter** and **enter** to go to the previous and next matches, respectively. \nTo close the find and replace tool, click ![Delete Icon](https:\/\/docs.databricks.com\/_images\/delete-icon.png) or press **esc**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Develop code in Databricks notebooks\n##### Variable explorer\n\nYou can directly observe Python, Scala, and R variables in the notebook UI. For Python on Databricks Runtime 12.2 LTS and above, the variables update as a cell runs. For Scala, R, and for Python on Databricks Runtime 11.3 LTS and below, variables update after a cell finishes running. \nTo open the variable explorer, click ![the variable explorer icon](https:\/\/docs.databricks.com\/_images\/variable-explorer-icon.png) in [the right sidebar](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#right-sidebar). The variable explorer opens, showing the value and data type, including shape, for each variable that is currently defined in the notebook. (The shape of a PySpark dataframe is \u2018?\u2019, because calculating the shape can be computationally expensive.) \nTo filter the display, enter text into the search box. The list is automatically filtered as you type. 
\nVariable values are automatically updated as you run notebook cells. \n![example variable explorer panel](https:\/\/docs.databricks.com\/_images\/variable-explorer-example.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Develop code in Databricks notebooks\n##### Run selected cells\n\nYou can run a single cell or a collection of cells. To select a single cell, click anywhere in the cell. To select multiple cells, hold down the `Command` key on macOS or the `Ctrl` key on Windows, and click in the cell outside of the text area as shown in the screenshot. \n![how to select multiple cells](https:\/\/docs.databricks.com\/_images\/select-multiple-cells.png) \nTo run the selected cells, select **Run > Run selected cell(s)**. \nThe behavior of this command depends on the cluster that the notebook is attached to. \n* On a cluster running Databricks Runtime 13.3 LTS or below, selected cells are executed individually. If an error occurs in a cell, the execution continues with subsequent cells.\n* On a cluster running Databricks Runtime 14.0 or above, or on a SQL warehouse, selected cells are executed as a batch. Any error halts execution, and you cannot cancel the execution of individual cells. You can use the **Interrupt** button to stop execution of all cells.\n\n#### Develop code in Databricks notebooks\n##### Modularize your code\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nWith Databricks Runtime 11.3 LTS and above, you can create and manage source code files in the Databricks workspace, and then import these files into your notebooks as needed. 
\nFor more information on working with source code files, see [Share code between Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/share-code.html) and [Work with Python and R modules](https:\/\/docs.databricks.com\/files\/workspace-modules.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Develop code in Databricks notebooks\n##### Run selected text\n\nYou can highlight code or SQL statements in a notebook cell and run only that selection. This is useful when you want to quickly iterate on code and queries. \n1. Highlight the lines you want to run.\n2. Select **Run > Run selected text** or use the keyboard shortcut `Ctrl`+`Shift`+`Enter`. If no text is highlighted, **Run Selected Text** executes the current line. \n![run selected lines](https:\/\/docs.databricks.com\/_images\/run-selected-text.gif) \nIf you are using [mixed languages in a cell](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#language-magic), you must include the `%<language>` line in the selection. \n**Run selected text** also executes collapsed code, if there is any in the highlighted selection. \nSpecial cell commands such as `%run`, `%pip`, and `%sh` are supported. \nYou cannot use **Run selected text** on cells that have multiple output tabs (that is, cells where you have defined a data profile or visualization).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Develop code in Databricks notebooks\n##### Format code cells\n\nDatabricks provides tools that allow you to format Python and SQL code in notebook cells quickly and easily. These tools reduce the effort to keep your code formatted and help to enforce the same coding standards across your notebooks. 
\n### Format Python cells \nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nDatabricks supports Python code formatting using [Black](https:\/\/black.readthedocs.io\/en\/stable\/) within the notebook. The notebook must be attached to a cluster with `black` and `tokenize-rt` Python packages installed, and the Black formatter executes on the cluster that the notebook is attached to. \nOn Databricks Runtime 11.3 LTS and above, Databricks preinstalls `black` and `tokenize-rt`. You can use the formatter directly without needing to install these libraries. \nOn Databricks Runtime 10.4 LTS and below, you must install `black==22.3.0` and `tokenize-rt==4.2.1` from PyPI on your notebook or cluster to use the Python formatter. You can run the following command in your notebook: \n```\n%pip install black==22.3.0 tokenize-rt==4.2.1\n\n``` \nor [install the library on your cluster](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html). \nFor more details about installing libraries, see [Python environment management](https:\/\/docs.databricks.com\/libraries\/index.html#python-environment-management). \nFor files and notebooks in Databricks Git folders, you can configure the Python formatter based on the `pyproject.toml` file. To use this feature, create a `pyproject.toml` file in the Git folder root directory and configure it according to the [Black configuration format](https:\/\/black.readthedocs.io\/en\/stable\/usage_and_configuration\/the_basics.html#configuration-format). Edit the [tool.black] section in the file. The configuration is applied when you format any file and notebook in that Git folder. \n### How to format Python and SQL cells \nYou must have [CAN EDIT permission](https:\/\/docs.databricks.com\/notebooks\/notebooks-collaborate.html#notebook-permissions) on the notebook to format code. 
\nYou can trigger the formatter in the following ways: \n* **Format a single cell** \n+ Keyboard shortcut: Press **Cmd+Shift+F**.\n+ Command context menu: \n- Format SQL cell: Select **Format SQL** in the command context dropdown menu of a SQL cell. This menu item is visible only in SQL notebook cells or those with a `%sql` [language magic](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#language-magic).\n- Format Python cell: Select **Format Python** in the command context dropdown menu of a Python cell. This menu item is visible only in Python notebook cells or those with a `%python` [language magic](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#language-magic).\n+ Notebook **Edit** menu: Select a Python or SQL cell, and then select **Edit > Format Cell(s)**.\n* **Format multiple cells** \n[Select multiple cells](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#run-selected-cells) and then select **Edit > Format Cell(s)**. If you select cells of more than one language, only SQL and Python cells are formatted. This includes those that use `%sql` and `%python`.\n* **Format all Python and SQL cells in the notebook** \nSelect **Edit > Format Notebook**. If your notebook contains more than one language, only SQL and Python cells are formatted. This includes those that use `%sql` and `%python`. \n### Limitations of code formatting \n* Black enforces [PEP 8](https:\/\/peps.python.org\/pep-0008\/) standards for 4-space indentation. Indentation is not configurable.\n* Formatting embedded Python strings inside a SQL UDF is not supported. 
Similarly, formatting SQL strings inside a Python UDF is not supported.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Develop code in Databricks notebooks\n##### Version history\n\nDatabricks notebooks maintain a history of notebook versions, allowing you to view and restore previous snapshots of the notebook. You can perform the following actions on versions: add comments, restore and delete versions, and clear version history. \nYou can also [sync your work in Databricks with a remote Git repository](https:\/\/docs.databricks.com\/repos\/index.html). \nTo access notebook versions, click ![version history icon](https:\/\/docs.databricks.com\/_images\/version-history.png) in the right sidebar. The notebook version history appears. You can also select **File > Version history**. \n### Add a comment \nTo add a comment to the latest version: \n1. Click the version.\n2. Click **Save now**. \n![Save comment](https:\/\/docs.databricks.com\/_images\/version-comment.png)\n3. In the Save Notebook Version dialog, enter a comment.\n4. Click **Save**. The notebook version is saved with the entered comment. \n### Restore a version \nTo restore a version: \n1. Click the version.\n2. Click **Restore this version**. \n![Restore version](https:\/\/docs.databricks.com\/_images\/restore-version.png)\n3. Click **Confirm**. The selected version becomes the latest version of the notebook. \n### Delete a version \nTo delete a version entry: \n1. Click the version.\n2. Click the trash icon ![Trash](https:\/\/docs.databricks.com\/_images\/trash-icon1.png). \n![Delete version](https:\/\/docs.databricks.com\/_images\/delete-version.png)\n3. Click **Yes, erase**. The selected version is deleted from the history. \n### Clear version history \nThe version history cannot be recovered after it has been cleared. \nTo clear the version history for a notebook: \n1. 
Select **File > Clear version history**.\n2. Click **Yes, clear**. The notebook version history is cleared.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Develop code in Databricks notebooks\n##### Code languages in notebooks\n\n### Set default language \nThe default language for the notebook appears next to the notebook name. \n![Notebook default language](https:\/\/docs.databricks.com\/_images\/toolbar.png) \nTo change the default language, click the language button and select the new language from the dropdown menu. To ensure that existing commands continue to work, commands of the previous default language are automatically prefixed with a language magic command. \n### Mix languages \nBy default, cells use the default language of the notebook. You can override the default language in a cell by clicking the language button and selecting a language from the dropdown menu. \n![Cell language drop down](https:\/\/docs.databricks.com\/_images\/cell-language-button.png) \nAlternately, you can use the language magic command `%<language>` at the beginning of a cell. The supported magic commands are: `%python`, `%r`, `%scala`, and `%sql`. \nNote \nWhen you invoke a language magic command, the command is dispatched to the REPL in the [execution context](https:\/\/docs.databricks.com\/notebooks\/execution-context.html) for the notebook. Variables defined in one language (and hence in the REPL for that language) are not available in the REPL of another language. REPLs can share state only through external resources such as files in DBFS or objects in object storage. \nNotebooks also support a few auxiliary magic commands: \n* `%sh`: Allows you to run shell code in your notebook. To fail the cell if the shell command has a non-zero exit status, add the `-e` option. This command runs only on the Apache Spark driver, and not the workers. 
To run a shell command on all nodes, use an [init script](https:\/\/docs.databricks.com\/init-scripts\/index.html).\n* `%fs`: Allows you to use `dbutils` filesystem commands. For example, to run the `dbutils.fs.ls` command to list files, you can specify `%fs ls` instead. For more information, see [Work with files on Databricks](https:\/\/docs.databricks.com\/files\/index.html).\n* `%md`: Allows you to include various types of documentation, including text, images, and mathematical formulas and equations. See the next section. \n### SQL syntax highlighting and autocomplete in Python commands \nSyntax highlighting and SQL [autocomplete](https:\/\/docs.databricks.com\/notebooks\/notebook-editor.html#autocomplete) are available when you use SQL inside a Python command, such as in a `spark.sql` command. \n### Explore SQL cell results in Python notebooks using Python \nYou might want to load data using SQL and explore it using Python. In a Databricks Python notebook, table results from a SQL language cell are automatically made available as a Python DataFrame assigned to the variable `_sqldf`. \nIn Databricks Runtime 13.3 LTS and above, you can also access the DataFrame result using [IPython\u2019s output caching system](https:\/\/ipython.readthedocs.io\/en\/stable\/interactive\/reference.html#output-caching-system). The prompt counter appears in the output message displayed at the bottom of the cell results. For the example shown, you would reference the result as `Out[2]`. \nNote \n* The variable `_sqldf` may be reassigned each time a `%sql` cell is run. 
To avoid losing reference to the DataFrame result, assign it to a new variable name before you run the next `%sql` cell: \n```\nnew_dataframe_name = _sqldf\n\n```\n* If the query uses a [widget](https:\/\/docs.databricks.com\/notebooks\/widgets.html) for parameterization, the results are not available as a Python DataFrame.\n* If the query uses the keywords `CACHE TABLE` or `UNCACHE TABLE`, the results are not available as a Python DataFrame. \nThe screenshot shows an example: \n![sql results dataframe](https:\/\/docs.databricks.com\/_images\/implicit-df.png) \n### Execute SQL cells in parallel \nWhile a command is running and your notebook is attached to an interactive cluster, you can run a SQL cell simultaneously with the current command. The SQL cell is executed in a new, parallel session. \nTo execute a cell in parallel: \n1. [Run the cell](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html).\n2. Click **Run Now**. The cell is immediately executed. \n![Run a SQL cell in parallel with current running cell](https:\/\/docs.databricks.com\/_images\/parallel-sql-execution.png) \nBecause the cell is run in a new session, temporary views, UDFs, and the [implicit Python DataFrame](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#implicit-sql-df) (`_sqldf`) are not supported for cells that are executed in parallel. In addition, the default catalog and database names are used during parallel execution. If your code refers to a table in a different catalog or database, you must specify the table name using three-level namespace (`catalog`.`schema`.`table`). \n### Execute SQL cells on a SQL warehouse \nYou can run SQL commands in a Databricks notebook on a [SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html), a type of compute that is optimized for SQL analytics. 
See [Use a notebook with a SQL warehouse](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#notebook-sql-warehouse).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Develop code in Databricks notebooks\n##### Display images\n\nTo display images stored in the [FileStore](https:\/\/docs.databricks.com\/archive\/legacy\/filestore.html#static-images), use the following syntax: \n```\n%md\n![test](files\/image.png)\n\n``` \nFor example, suppose you have the Databricks logo image file in FileStore: \n```\ndbfs ls dbfs:\/FileStore\/\n\n``` \n```\ndatabricks-logo-mobile.png\n\n``` \nWhen you include the following code in a Markdown cell: \n![Image in Markdown cell](https:\/\/docs.databricks.com\/_images\/image-code.png) \nthe image is rendered in the cell: \n![Rendered image](https:\/\/docs.databricks.com\/_images\/image-render.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Develop code in Databricks notebooks\n##### Display mathematical equations\n\nNotebooks support [KaTeX](https:\/\/github.com\/Khan\/KaTeX\/wiki) for displaying mathematical formulas and equations. 
For example, \n```\n%md\n\\\\(c = \\\\pm\\\\sqrt{a^2 + b^2} \\\\)\n\n\\\\(A{_i}{_j}=B{_i}{_j}\\\\)\n\n$$c = \\\\pm\\\\sqrt{a^2 + b^2}$$\n\n\\\\[A{_i}{_j}=B{_i}{_j}\\\\]\n\n``` \nrenders as: \n![Rendered equation 1](https:\/\/docs.databricks.com\/_images\/equations.png) \nand \n```\n%md\n\\\\( f(\\beta)= -Y_t^T X_t \\beta + \\sum log( 1+{e}^{X_t\\bullet\\beta}) + \\frac{1}{2}\\delta^t S_t^{-1}\\delta\\\\)\n\nwhere \\\\(\\delta=(\\beta - \\mu_{t-1})\\\\)\n\n``` \nrenders as: \n![Rendered equation 2](https:\/\/docs.databricks.com\/_images\/equations2.png)\n\n#### Develop code in Databricks notebooks\n##### Include HTML\n\nYou can include HTML in a notebook by using the function `displayHTML`. See [HTML, D3, and SVG in notebooks](https:\/\/docs.databricks.com\/visualizations\/html-d3-and-svg.html) for an example of how to do this. \nNote \nThe `displayHTML` iframe is served from the domain `databricksusercontent.com` and the iframe sandbox includes the `allow-same-origin` attribute. `databricksusercontent.com` must be accessible from your browser. If it is currently blocked by your corporate network, it must be added to an allow list.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Develop code in Databricks notebooks\n##### Link to other notebooks\n\nYou can link to other notebooks or folders in Markdown cells using relative paths. 
Specify the `href`\nattribute of an anchor tag as the relative path, starting with a `$` and then follow the same\npattern as in Unix file systems: \n```\n%md\n<a href=\"$.\/myNotebook\">Link to notebook in same folder as current notebook<\/a>\n<a href=\"$..\/myFolder\">Link to folder in parent folder of current notebook<\/a>\n<a href=\"$.\/myFolder2\/myNotebook2\">Link to nested notebook<\/a>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Install libraries from object storage\n\nThis article walks you through the steps required to install libraries from cloud object storage on Databricks. \nNote \nThis article refers to cloud object storage as a general concept, and assumes that you are directly interacting with data stored in object storage using URIs. Databricks recommends using Unity Catalog volumes to configure access to files in cloud object storage. See [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html). \nYou can store custom JAR and Python Whl libraries in cloud object storage, instead of storing them in the DBFS root. See [Cluster-scoped libraries](https:\/\/docs.databricks.com\/libraries\/index.html#compatibility) for full library compatibility details. \nImportant \nLibraries can be installed from DBFS when using Databricks Runtime 14.3 LTS and below. However, any workspace user can modify library files stored in DBFS. To improve the security of libraries in a Databricks workspace, storing library files in the DBFS root is deprecated and disabled by default in Databricks Runtime 15.0 and above. See [Storing libraries in DBFS root is deprecated and disabled by default](https:\/\/docs.databricks.com\/release-notes\/runtime\/15.0.html#libraries-dbfs-deprecation). 
\nInstead, Databricks [recommends](https:\/\/docs.databricks.com\/libraries\/index.html#recommendations) uploading all libraries, including Python libraries, JAR files, and Spark connectors, to workspace files or Unity Catalog volumes, or using library package repositories. If your workload does not support these patterns, you can also use libraries stored in cloud object storage.\n\n#### Install libraries from object storage\n##### Load libraries to object storage\n\nYou can load libraries to object storage the same way you load other files. You must have proper permissions in your cloud provider to create new object storage containers or load files into cloud object storage.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/object-storage-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Install libraries from object storage\n##### Grant read-only permissions to object storage\n\nDatabricks recommends configuring all privileges related to library installation with read-only permissions. \nDatabricks allows you to assign security permissions to individual clusters that govern access to data in cloud object storage. These policies can be expanded to add read-only access to cloud object storage that contains libraries. \nNote \nIn Databricks Runtime 12.2 LTS and below, you cannot load JAR libraries when using clusters with shared access modes. In Databricks Runtime 13.3 LTS and above, you must add JAR libraries to the Unity Catalog allowlist. See [Allowlist libraries and init scripts on shared compute](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/allowlist.html). \nDatabricks recommends using instance profiles to manage access to libraries stored in S3. Use the documentation linked in the following steps to complete this setup: \n1. Create an IAM role with read and list permissions on your desired buckets. 
See [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html).\n2. Launch a cluster with the instance profile. See [Instance profiles](https:\/\/docs.databricks.com\/compute\/configure.html#instance-profiles).\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/object-storage-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Install libraries from object storage\n##### Install libraries to clusters\n\nTo install a library stored in cloud object storage to a cluster, complete the following steps: \n1. Select a cluster from the list in the clusters UI.\n2. Select the **Libraries** tab.\n3. Select the **File path\/S3** option.\n4. Provide the full URI path to the library object (for example, `s3:\/\/bucket-name\/path\/to\/library.whl`).\n5. Click **Install**. \nYou can also install libraries using the [REST API](https:\/\/docs.databricks.com\/api\/workspace\/libraries\/install) or [CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html).\n\n#### Install libraries from object storage\n##### Install libraries to notebooks\n\nYou can use `%pip` to install custom Python wheel files stored in object storage scoped to a notebook-isolated SparkSession. To use this method, you must either store libraries in publicly readable object storage or use a pre-signed URL. \nSee [Notebook-scoped Python libraries](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html). \nNote \nJAR libraries cannot be installed in the notebook. You must install JAR libraries at the cluster level.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/object-storage-libraries.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n#### AI Functions on Databricks\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
\nThis article describes Databricks AI Functions, built-in SQL functions that allow you to apply AI to your data directly from SQL. \nSQL is crucial for data analysis due to its versatility, efficiency, and widespread use. Its simplicity enables swift retrieval, manipulation, and management of large datasets. Incorporating AI functions into SQL for data analysis enhances efficiency, which enables businesses to swiftly extract insights. \nIntegrating AI into analysis workflows provides access to information previously inaccessible to analysts, and empowers them to make more informed decisions, manage risks, and sustain a competitive advantage through data-driven innovation and efficiency.\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/ai-functions.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n#### AI Functions on Databricks\n##### AI functions using Databricks Foundation Model APIs\n\nNote \nFor Databricks Runtime 15.0 and above, these functions are supported in notebook environments including Databricks notebooks and workflows. \nThese functions invoke a state-of-the-art generative AI model from [Databricks Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html) to perform tasks like sentiment analysis, classification, and translation. See [Analyze customer reviews using AI Functions](https:\/\/docs.databricks.com\/large-language-models\/ai-functions-example.html). 
\n* [ai\\_analyze\\_sentiment](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_analyze_sentiment.html)\n* [ai\\_classify](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_classify.html)\n* [ai\\_extract](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_extract.html)\n* [ai\\_fix\\_grammar](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_fix_grammar.html)\n* [ai\\_gen](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_gen.html)\n* [ai\\_mask](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_mask.html)\n* [ai\\_similarity](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_similarity.html)\n* [ai\\_summarize](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_summarize.html)\n* [ai\\_translate](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_translate.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/ai-functions.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n#### AI Functions on Databricks\n##### ai\\_query\n\nNote \n* For Databricks Runtime 14.2 and above, this function is supported in notebook environments including Databricks notebooks and workflows.\n* For Databricks Runtime 14.1 and below, this function is not supported in notebook environments, including Databricks notebooks. \nThe `ai_query()` function allows you to serve your machine learning models and large language models using [Databricks Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) and query them using SQL. To do so, this function invokes an existing Databricks Model Serving endpoint and parses and returns its response. 
You can use `ai_query()` to query endpoints that serve custom models, foundation models made available using [Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html), and [external models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html). \n* [ai\\_query function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_query.html).\n* [Query a served model with ai\\_query()](https:\/\/docs.databricks.com\/large-language-models\/how-to-ai-query.html).\n* [Query an external model with ai\\_query()](https:\/\/docs.databricks.com\/large-language-models\/ai-query-external-model.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/ai-functions.html"} +{"content":"# AI and Machine Learning on Databricks\n## GraphFrames\n#### GraphFrames user guide - Scala\n\nThis article demonstrates examples from the [GraphFrames User Guide](https:\/\/graphframes.github.io\/graphframes\/docs\/_site\/user-guide.html). \n```\nimport org.apache.spark.sql._\nimport org.apache.spark.sql.functions._\nimport org.graphframes._\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/graphframes\/user-guide-scala.html"} +{"content":"# AI and Machine Learning on Databricks\n## GraphFrames\n#### GraphFrames user guide - Scala\n##### Creating GraphFrames\n\nYou can create GraphFrames from vertex and edge DataFrames. \n* Vertex DataFrame: A vertex DataFrame should contain a special column\nnamed `id` which specifies unique IDs for each vertex in the graph.\n* Edge DataFrame: An edge DataFrame should contain two special columns:\n`src` (source vertex ID of edge) and `dst` (destination vertex ID of\nedge). \nBoth DataFrames can have arbitrary other columns. Those columns can\nrepresent vertex and edge attributes. 
\nCreate the vertices and edges \n```\n\/\/ Vertex DataFrame\nval v = spark.createDataFrame(List(\n(\"a\", \"Alice\", 34),\n(\"b\", \"Bob\", 36),\n(\"c\", \"Charlie\", 30),\n(\"d\", \"David\", 29),\n(\"e\", \"Esther\", 32),\n(\"f\", \"Fanny\", 36),\n(\"g\", \"Gabby\", 60)\n)).toDF(\"id\", \"name\", \"age\")\n\/\/ Edge DataFrame\nval e = spark.createDataFrame(List(\n(\"a\", \"b\", \"friend\"),\n(\"b\", \"c\", \"follow\"),\n(\"c\", \"b\", \"follow\"),\n(\"f\", \"c\", \"follow\"),\n(\"e\", \"f\", \"follow\"),\n(\"e\", \"d\", \"friend\"),\n(\"d\", \"a\", \"friend\"),\n(\"a\", \"e\", \"friend\")\n)).toDF(\"src\", \"dst\", \"relationship\")\n\n``` \nLet\u2019s create a graph from these vertices and these edges: \n```\nval g = GraphFrame(v, e)\n\n``` \n```\n\/\/ This example graph also comes with the GraphFrames package.\n\/\/ val g = examples.Graphs.friends\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/graphframes\/user-guide-scala.html"} +{"content":"# AI and Machine Learning on Databricks\n## GraphFrames\n#### GraphFrames user guide - Scala\n##### Basic graph and DataFrame queries\n\nGraphFrames provide simple graph queries, such as node degree. \nAlso, since GraphFrames represent graphs as pairs of vertex and edge\nDataFrames, it is easy to make powerful queries directly on the vertex\nand edge DataFrames. Those DataFrames are available as vertices and\nedges fields in the GraphFrame. \n```\ndisplay(g.vertices)\n\n``` \n```\ndisplay(g.edges)\n\n``` \nThe incoming degree of the vertices: \n```\ndisplay(g.inDegrees)\n\n``` \nThe outgoing degree of the vertices: \n```\ndisplay(g.outDegrees)\n\n``` \nThe degree of the vertices: \n```\ndisplay(g.degrees)\n\n``` \nYou can run queries directly on the vertices DataFrame. For example, we\ncan find the age of the youngest person in the graph: \n```\nval youngest = g.vertices.groupBy().min(\"age\")\ndisplay(youngest)\n\n``` \nLikewise, you can run queries on the edges DataFrame. 
For example, let\nus count the number of \u2018follow\u2019 relationships in the graph: \n```\nval numFollows = g.edges.filter(\"relationship = 'follow'\").count()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/graphframes\/user-guide-scala.html"} +{"content":"# AI and Machine Learning on Databricks\n## GraphFrames\n#### GraphFrames user guide - Scala\n##### Motif finding\n\nBuild more complex relationships involving edges and vertices using motifs. The following cell finds the pairs of vertices with edges in both directions between them. The result is a DataFrame, in which the column names are motif keys. \nCheck out the [GraphFrame User\nGuide](https:\/\/graphframes.github.io\/graphframes\/docs\/_site\/user-guide.html#motif-finding)\nfor more details on the API. \n```\n\/\/ Search for pairs of vertices with edges in both directions between them.\nval motifs = g.find(\"(a)-[e]->(b); (b)-[e2]->(a)\")\ndisplay(motifs)\n\n``` \nSince the result is a DataFrame, you can build more complex queries on top of the motif. Let us find all the reciprocal relationships in which one person is older than 30: \n```\nval filtered = motifs.filter(\"b.age > 30\")\ndisplay(filtered)\n\n``` \n### Stateful queries \nMost motif queries are stateless and simple to express, as in the examples above. The next examples demonstrate more complex queries which carry state along a path in the motif. Express these queries by combining GraphFrame motif finding with filters on the result, where the filters use sequence operations to construct a series of DataFrame columns. \nFor example, suppose you want to identify a chain of 4 vertices with\nsome property defined by a sequence of functions. That is, among chains\nof 4 vertices `a->b->c->d`, identify the subset of chains matching\nthis complex filter: \n* Initialize state on path.\n* Update state based on vertex a.\n* Update state based on vertex b.\n* Etc. 
for c and d.\n* If final state matches some condition, then the filter accepts the chain. \nThe following code snippets demonstrate this process, where we identify\nchains of 4 vertices such that at least 2 of the 3 edges are \u201cfriend\u201d\nrelationships. In this example, the state is the current count of\n\u201cfriend\u201d edges; in general, it could be any DataFrame column. \n```\n\/\/ Find chains of 4 vertices.\nval chain4 = g.find(\"(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)\")\n\n\/\/ Query on sequence, with state (cnt)\n\/\/ (a) Define method for updating state given the next element of the motif.\ndef sumFriends(cnt: Column, relationship: Column): Column = {\nwhen(relationship === \"friend\", cnt + 1).otherwise(cnt)\n}\n\/\/ (b) Use sequence operation to apply method to sequence of elements in motif.\n\/\/ In this case, the elements are the 3 edges.\nval condition = Seq(\"ab\", \"bc\", \"cd\").\nfoldLeft(lit(0))((cnt, e) => sumFriends(cnt, col(e)(\"relationship\")))\n\/\/ (c) Apply filter to DataFrame.\nval chainWith2Friends2 = chain4.where(condition >= 2)\ndisplay(chainWith2Friends2)\n\n``` \n### Subgraphs \nGraphFrames provides APIs for building subgraphs by filtering on edges and vertices. These filters\ncan be composed together. For example, the following subgraph contains only people who are friends and\nwho are more than 30 years old. \n```\n\/\/ Select subgraph of users older than 30, and edges of type \"friend\"\nval g2 = g\n.filterEdges(\"relationship = 'friend'\")\n.filterVertices(\"age > 30\")\n.dropIsolatedVertices()\n\n``` \n#### Complex triplet filters \nThe following example shows how to select a subgraph based upon triplet filters that operate on an edge and its \u201csrc\u201d\nand \u201cdst\u201d vertices. Extending this example to go beyond triplets by using more complex motifs is simple. 
\n```\n\/\/ Select subgraph based on edges \"e\" of type \"follow\"\n\/\/ pointing from a younger user \"a\" to an older user \"b\".\nval paths = g.find(\"(a)-[e]->(b)\")\n.filter(\"e.relationship = 'follow'\")\n.filter(\"a.age < b.age\")\n\/\/ \"paths\" contains vertex info. Extract the edges.\nval e2 = paths.select(\"e.src\", \"e.dst\", \"e.relationship\")\n\/\/ In Spark 1.5+, the user may simplify this call:\n\/\/ val e2 = paths.select(\"e.*\")\n\n\/\/ Construct the subgraph\nval g2 = GraphFrame(g.vertices, e2)\n\n``` \n```\ndisplay(g2.vertices)\n\n``` \n```\ndisplay(g2.edges)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/graphframes\/user-guide-scala.html"} +{"content":"# AI and Machine Learning on Databricks\n## GraphFrames\n#### GraphFrames user guide - Scala\n##### Standard graph algorithms\n\nThis section describes the standard graph algorithms built into GraphFrames. \n### Breadth-first search (BFS) \nSearch from \u201cEsther\u201d for users of age < 32. \n```\nval paths: DataFrame = g.bfs.fromExpr(\"name = 'Esther'\").toExpr(\"age < 32\").run()\ndisplay(paths)\n\n``` \nThe search may also limit edge filters and maximum path lengths. \n```\nval filteredPaths = g.bfs.fromExpr(\"name = 'Esther'\").toExpr(\"age < 32\")\n.edgeFilter(\"relationship != 'friend'\")\n.maxPathLength(3)\n.run()\ndisplay(filteredPaths)\n\n``` \n### Connected components \nCompute the connected component membership of each vertex and return a\ngraph with each vertex assigned a component ID. \n```\nval result = g.connectedComponents.run() \/\/ doesn't work on Spark 1.4\ndisplay(result)\n\n``` \n### Strongly connected components \nCompute the strongly connected component (SCC) of each vertex and return\na graph with each vertex assigned to the SCC containing that vertex. 
\n```\nval result = g.stronglyConnectedComponents.maxIter(10).run()\ndisplay(result.orderBy(\"component\"))\n\n``` \n### Label propagation \nRun the static Label Propagation Algorithm (LPA) for detecting communities in\nnetworks. \nEach node in the network is initially assigned to its own community. At\nevery superstep, nodes send their community affiliation to all neighbors\nand update their state to the mode community affiliation of incoming\nmessages. \nLPA is a standard community detection algorithm for graphs. It is inexpensive computationally,\nalthough (1) convergence is not guaranteed and (2) one can end up with trivial solutions (all nodes are assigned to a single community). \n```\nval result = g.labelPropagation.maxIter(5).run()\ndisplay(result.orderBy(\"label\"))\n\n``` \n### PageRank \nIdentify important vertices in a graph based on connections. \n```\n\/\/ Run PageRank until convergence to tolerance \"tol\".\nval results = g.pageRank.resetProbability(0.15).tol(0.01).run()\ndisplay(results.vertices)\n\n``` \n```\ndisplay(results.edges)\n\n``` \n```\n\/\/ Run PageRank for a fixed number of iterations.\nval results2 = g.pageRank.resetProbability(0.15).maxIter(10).run()\ndisplay(results2.vertices)\n\n``` \n```\n\/\/ Run PageRank personalized for vertex \"a\"\nval results3 = g.pageRank.resetProbability(0.15).maxIter(10).sourceId(\"a\").run()\ndisplay(results3.vertices)\n\n``` \n### Shortest paths \nComputes shortest paths to the given set of landmark vertices, where landmarks are specified by vertex ID. \n```\nval paths = g.shortestPaths.landmarks(Seq(\"a\", \"d\")).run()\ndisplay(paths)\n\n``` \n### Triangle counting \nComputes the number of triangles passing through each vertex. 
\n```\nimport org.graphframes.examples\nval g: GraphFrame = examples.Graphs.friends \/\/ get example graph\n\nval results = g.triangleCount.run()\nresults.select(\"id\", \"count\").show()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/graphframes\/user-guide-scala.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Manage Delta Sharing providers (for data recipients)\n\nThis article describes how to use Unity Catalog to get information about data providers who are sharing data with you using Delta Sharing. It also describes what a provider object is and when you might need to create a provider object in your Unity Catalog metastore, a task that most recipients should never need to do. \nImportant \nData recipients must have access to a Databricks workspace that is enabled for [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) to use the functionality described in this article. This article does not apply to recipients who do not have Unity Catalog-enabled workspaces.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/manage-provider.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Manage Delta Sharing providers (for data recipients)\n#### Do recipients need to create provider objects?\n\nIn Delta Sharing on Databricks, the term \u201cprovider\u201d can mean both the organization that is sharing data with you and a securable object in a recipient\u2019s Unity Catalog metastore that represents that organization. The existence of that securable object in a recipient\u2019s Unity Catalog metastore enables recipients to [manage their team\u2019s access to shared data using Unity Catalog](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html). \nAs a recipient with access to a Unity Catalog metastore, you typically do not need to create provider objects. 
This is because data should be shared with you using [Databricks-to-Databricks sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html#d-to-d), and provider objects are created automatically in your Unity Catalog metastore. \nNote \nIf you are the rare recipient on Unity Catalog who is receiving data from a provider that is not sharing from a Unity Catalog-enabled Databricks workspace, you may want to create provider objects in Unity Catalog so that you can manage that shared data using Unity Catalog. If you are in that category, you can use the [POST \/api\/2.1\/unity-catalog\/providers](https:\/\/docs.databricks.com\/api\/workspace\/providers\/create) REST API call or the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) to create the Unity Catalog provider object. You must be a metastore admin or user with the `CREATE_PROVIDER` privilege for the metastore.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/manage-provider.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Manage Delta Sharing providers (for data recipients)\n#### View providers\n\nTo view a list of available data providers, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `SHOW PROVIDERS` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: You must be a metastore admin or have the `USE PROVIDER` privilege to view all providers in the metastore. Other users have access only to the providers that they own. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared with me**.\n3. On the **Providers** tab, view all available providers. \nRun the following command in a notebook or the Databricks SQL query editor. 
Optionally, replace `<pattern>` with a [`LIKE` predicate](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/like.html). \n```\nSHOW PROVIDERS [LIKE <pattern>];\n\n``` \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). \n```\ndatabricks providers list\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/manage-provider.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Manage Delta Sharing providers (for data recipients)\n#### View provider details\n\nTo view details about a provider, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `DESCRIBE PROVIDER` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: Metastore admin, user with the `USE PROVIDER` privilege, or the provider object owner. \nDetails include: \n* Shares shared by the provider (see [View shares that a provider has shared with you](https:\/\/docs.databricks.com\/data-sharing\/manage-provider.html#view-shares)).\n* The provider\u2019s creator, creation timestamp, comments, and authentication type (`TOKEN` or `DATABRICKS`). `TOKEN` represents providers who have shared data with you using the Delta Sharing open sharing protocol. `DATABRICKS` represents providers who have shared data with you using the Databricks-to-Databricks sharing protocol.\n* If the provider uses Databricks-to-Databricks sharing: the cloud, region, and metastore ID of the provider\u2019s Unity Catalog metastore.\n* If the provider uses open sharing: your recipient profile endpoint, which is where the Delta Sharing sharing server is hosted. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared with me**.\n3. On the **Providers** tab, find and select the provider.\n4. 
View provider details on the **Details** tab. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nDESC PROVIDER <provider-name>;\n\n``` \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). \n```\ndatabricks providers get <provider-name>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/manage-provider.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Manage Delta Sharing providers (for data recipients)\n#### View shares that a provider has shared with you\n\nTo view the shares that a provider has shared with you, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `SHOW SHARES IN PROVIDER` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: Metastore admin, user with the `USE PROVIDER` privilege, or the provider object owner. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared with me**.\n3. On the **Providers** tab, select the provider whose shares you want to view. \nRun the following command in a notebook or the Databricks SQL query editor. Replace `<provider-name>` with the name of the provider. Optionally, replace `<pattern>` with a [`LIKE` predicate](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/like.html). \n```\nSHOW SHARES IN PROVIDER <provider-name> [LIKE <pattern>];\n\n``` \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). 
\n```\ndatabricks providers list-shares <provider-name>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/manage-provider.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Manage Delta Sharing providers (for data recipients)\n#### Update a provider (rename, change owner, comment)\n\nYou can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `ALTER PROVIDER` SQL command in a Databricks notebook or the Databricks SQL query editor to modify the provider object in your Unity Catalog metastore: \n* Rename the provider to modify the way users see the provider object in their Databricks interfaces.\n* Change the owner of the provider object.\n* Add or modify comments. \n**Permissions required**: You must be a metastore admin or owner of the provider object to update the owner. You must be a metastore admin (or user with the `CREATE_PROVIDER` privilege) *and* provider owner to update the provider name. You must be the owner to update the comment. The initial owner is the metastore admin. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared with you**.\n3. On the **Providers** tab, find and select the provider.\n4. On the details page, update the owner, comment, or provider name. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nALTER PROVIDER <provider-name> RENAME TO <new-provider-name>\nOWNER TO <new-owner>\nCOMMENT \"<comment>\";\n\n``` \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). Replace `<provider-name>` with the current provider name and `<new-provider-name>` with the new name. 
\n```\ndatabricks providers update <provider-name> \\\n--new-name <new-provider-name> \\\n--comment \"<new comment>\" \\\n--owner <new-owner-name>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/manage-provider.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Manage Delta Sharing providers (for data recipients)\n#### Delete a provider\n\nTo delete a provider, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `DROP PROVIDER` SQL command in a Databricks notebook or the Databricks SQL query editor. You must be the provider object owner to delete the provider. \nWhen you delete a provider, you and the users in your organization (the recipient) can no longer access the data shared by the provider. \n**Permissions required**: Provider object owner. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared with you**.\n3. On the **Providers** tab, find and select the provider.\n4. Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu (also known as the three-dot menu) and select **Delete**.\n5. In the confirmation dialog, click **Delete**. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nDROP PROVIDER [IF EXISTS] <provider-name>;\n\n``` \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). 
\n```\ndatabricks providers delete <provider-name>\n\n``` \nIf the operation is successful, no results are returned.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/manage-provider.html"} +{"content":"# Security and compliance guide\n## Networking\n### Users to Databricks networking\n##### Manage IP access lists\n\nThis guide introduces IP access lists for the Databricks account and workspaces.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list.html"} +{"content":"# Security and compliance guide\n## Networking\n### Users to Databricks networking\n##### Manage IP access lists\n###### IP access lists overview\n\nNote \nThis feature requires the [Enterprise pricing tier](https:\/\/www.databricks.com\/product\/pricing\/platform-addons). \nBy default, users can connect to Databricks from any computer or IP address. IP access lists enable you to restrict access to your Databricks account and workspaces based on a user\u2019s IP address. For example, you can configure IP access lists to allow users to connect only through existing corporate networks with a secure perimeter. If the internal VPN network is authorized, users who are remote or traveling can use the VPN to connect to the corporate network. If a user attempts to connect to Databricks from an insecure network, like from a coffee shop, access is blocked. \nThere are two IP access list features: \n* **IP access lists for the account console**: Account admins can configure IP access lists for the account console to allow users to connect to the account console UI and account-level REST APIs only through a set of approved IP addresses. Account admins can use an account console UI or a REST API to configure allowed and blocked IP addresses and subnets. See [Configure IP access lists for the account console](https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-account.html). 
\n* **IP access lists for workspaces**: Workspace admins can configure IP access lists for Databricks workspaces to allow users to connect to the workspace or workspace-level APIs only through a set of approved IP addresses. Workspace admins use a REST API to configure allowed and blocked IP addresses and subnets. See [Configure IP access lists for workspaces](https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-workspace.html). \nNote \nIf you use PrivateLink, IP access lists apply only to requests over the internet (public IP addresses). Private IP addresses from PrivateLink traffic cannot be blocked by IP access lists. To block specific private IP addresses from PrivateLink traffic, use AWS Network Firewall. If you want to restrict the PrivateLink connection to a set of registered PrivateLink endpoints, change your workspace\u2019s private access settings object to use the ENDPOINT access level. See [Enable AWS PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list.html"} +{"content":"# Security and compliance guide\n## Networking\n### Users to Databricks networking\n##### Manage IP access lists\n###### How is access checked?\n\nThe IP access lists feature allows you to configure allow lists and block lists for the Databricks account console and workspaces: \n* **Allow lists** contain the set of IP addresses on the public internet that are allowed access. Allow multiple IP addresses explicitly or as entire subnets (for example `216.58.195.78\/28`).\n* **Block lists** contain the IP addresses or subnets to block, even if they are included in the allow list. You might use this feature if an allowed IP address range includes a smaller range of infrastructure IP addresses that in practice are outside the actual secure network perimeter. \nWhen a connection is attempted: \n1. 
**First all block lists are checked.** If the connection IP address matches any block list, the connection is rejected.\n2. **If the connection was not rejected by block lists**, the IP address is compared with the allow lists. If there is at least one allow list, the connection is allowed only if the IP address matches an allow list. If there are no allow lists, all IP addresses are allowed. \nIf the feature is disabled, all access is allowed to your account or workspace. \n![IP access list flow diagram](https:\/\/docs.databricks.com\/_images\/ip-access-list-flow.png) \nFor all allow lists and block lists combined, the account console supports a maximum of 1000 IP\/CIDR values, where one CIDR counts as a single value. \nChanges to IP access lists can take a few minutes to take effect.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n#### HorovodRunner: distributed deep learning with Horovod\n###### Adapt single node PyTorch to distributed deep learning\n\nLearn how to perform distributed training of machine learning models using PyTorch. \nThis notebook follows the recommended [development workflow](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-runner.html#development-workflow). 
It first shows how to train the model on a single node, and then how to adapt the code using HorovodRunner for distributed training.\n\n###### Adapt single node PyTorch to distributed deep learning\n####### HorovodRunner PyTorch MNIST example notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/mnist-pytorch.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/mnist-pytorch.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Manage dashboard permissions using the Workspace API\n\nThis tutorial demonstrates how to manage dashboard permissions using the Workspace API. Each step includes a sample request and response and explanations about how to use the API tools and properties together.\n\n##### Manage dashboard permissions using the Workspace API\n###### Prerequisites\n\n* You need a personal access token to connect with your workspace. See [Databricks personal access token authentication](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html).\n* You need the workspace ID of the workspace you want to access. See [Workspace instance names, URLs, and IDs](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-instance-names-urls-and-ids)\n* Familiarity with the [Databricks REST API reference](https:\/\/docs.databricks.com\/api\/workspace\/introduction).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/manage-permissions.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Manage dashboard permissions using the Workspace API\n###### Path parameters\n\nEach endpoint request in this article requires two path parameters, `workspace_object_type` and `workspace_object_id`. 
\n* **`workspace_object_type`**: For Lakeview dashboards, the object type is `dashboards`.\n* **`workspace_object_id`**: This corresponds to the `resource_id` associated with the dashboard. You can use [GET \/api\/2.0\/workspace\/list](https:\/\/docs.databricks.com\/api\/workspace\/workspace\/list) or [GET \/api\/2.0\/workspace\/get-status](https:\/\/docs.databricks.com\/api\/workspace\/workspace\/getstatus) to retrieve that value. It is a 32-character string similar to `01eec14769f616949d7a44244a53ed10`. \nSee [Step 1: Explore a workspace directory](https:\/\/docs.databricks.com\/dashboards\/tutorials\/workspace-lakeview-api.html#explore-directory) for an example of listing workspace objects.\nSee [GET \/api\/2.0\/workspace\/list](https:\/\/docs.databricks.com\/api\/workspace\/workspace\/list) for details about the Workspace List API.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/manage-permissions.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Manage dashboard permissions using the Workspace API\n###### Get workspace object permission levels\n\nThis section uses the **Get workspace object permission levels** endpoint to get the permission levels that a user can have on a dashboard. See [GET \/api\/workspace\/workspace\/getpermissionlevels](https:\/\/docs.databricks.com\/api\/workspace\/workspace\/getpermissionlevels). \nIn the following example, the request includes the sample path parameters described above. The response includes the permissions that can be applied to the dashboard indicated in the request. 
\n```\nGET \/api\/2.0\/permissions\/dashboards\/01eec14769f616949d7a44244a53ed10\/permissionLevels\n\nResponse:\n{\n\"permission_levels\": [\n{\n\"permission_level\": \"CAN_READ\",\n\"description\": \"Can view the Lakeview dashboard\"\n},\n{\n\"permission_level\": \"CAN_RUN\",\n\"description\": \"Can view, attach\/detach, and run the Lakeview dashboard\"\n},\n{\n\"permission_level\": \"CAN_EDIT\",\n\"description\": \"Can view, attach\/detach, run, and edit the Lakeview dashboard\"\n},\n{\n\"permission_level\": \"CAN_MANAGE\",\n\"description\": \"Can view, attach\/detach, run, edit, and change permissions of the Lakeview dashboard\"\n}\n]\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/manage-permissions.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Manage dashboard permissions using the Workspace API\n###### Get workspace object permission details\n\nThe **Get workspace object permissions** endpoint gets the assigned permissions on a specific workspace object. See [GET \/api\/workspace\/workspace\/getpermissions](https:\/\/docs.databricks.com\/api\/workspace\/workspace\/getpermissions). \nThe following example shows a request and response for the dashboard in the previous example. The response includes details about the dashboard and users and groups with permissions on the dashboard. Permissions on this object have been inherited for both items in the `access_control_list` portion of the response. In the first entry, permissions are inherited from a folder in the workspace. The second entry shows permissions inherited by membership in the group, `admins`. 
\n```\n\nGET \/api\/2.0\/permissions\/dashboards\/01eec14769f616949d7a44244a53ed10\n\nResponse:\n{\n\"object_id\": \"\/dashboards\/490384175243923\",\n\"object_type\": \"dashboard\",\n\"access_control_list\": [\n{\n\"user_name\": \"first.last@example.com\",\n\"display_name\": \"First Last\",\n\"all_permissions\": [\n{\n\"permission_level\": \"CAN_MANAGE\",\n\"inherited\": true,\n\"inherited_from_object\": [\n\"\/directories\/2951435987702195\"\n]\n}\n]\n},\n{\n\"group_name\": \"admins\",\n\"all_permissions\": [\n{\n\"permission_level\": \"CAN_MANAGE\",\n\"inherited\": true,\n\"inherited_from_object\": [\n\"\/directories\/\"\n]\n}\n]\n}\n]\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/manage-permissions.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Manage dashboard permissions using the Workspace API\n###### Set workspace object permissions\n\nYou can set permissions on dashboards using the **Set workspace object permissions** endpoint. See [PUT \/api\/workspace\/workspace\/setpermissions](https:\/\/docs.databricks.com\/api\/workspace\/workspace\/setpermissions). \nThe following example gives CAN EDIT permission to all workspace users for the `workspace_object_id` in the PUT request. \n```\nPUT \/api\/2.0\/permissions\/dashboards\/01eec14769f616949d7a44244a53ed10\n\nRequest body:\n\n{\n\"access_control_list\": [\n{\n\"group_name\": \"users\",\n\"permission_level\": \"CAN_EDIT\"\n}\n]\n}\n\n``` \nFor Lakeview dashboards, you can use the group `account users` to assign view permission to all users registered to the Databricks account. 
See [What is share to account?](https:\/\/docs.databricks.com\/dashboards\/index.html#share-to-account).\n\n##### Manage dashboard permissions using the Workspace API\n###### Update workspace object permissions\n\nThe **Update workspace object permissions** endpoint works similarly to the **Set workspace object permissions** endpoint. It assigns permissions using a `PATCH` request instead of a `PUT` request. \nSee [PATCH \/api\/workspace\/workspace\/updatepermissions](https:\/\/docs.databricks.com\/api\/workspace\/workspace\/updatepermissions). \n```\n\nPATCH \/api\/2.0\/permissions\/dashboards\/01eec14769f616949d7a44244a53ed10\n\nRequest body:\n\n{\n\"access_control_list\": [\n{\n\"group_name\": \"account users\",\n\"permission_level\": \"CAN_VIEW\"\n}\n]\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/manage-permissions.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Format numeric types in visualizations\n\nIn many visualizations you can control how the numeric types are formatted. You control the format by supplying a format string. That formatting applies to numbers in the [table](https:\/\/docs.databricks.com\/visualizations\/tables.html) visualization and when hovering over data points on a chart visualization, but not to the [counter](https:\/\/docs.databricks.com\/visualizations\/visualization-types.html#counter) visualization or when formatting axis values. 
Here are examples for the various format options.\n\n##### Format numeric types in visualizations\n###### Numbers\n\n| Number | Format | Output |\n| --- | --- | --- |\n| 10000 | \u20180,0.0000\u2019 | 10,000.0000 |\n| 10000.23 | \u20180,0\u2019 | 10,000 |\n| 10000.23 | \u2018+0,0\u2019 | +10,000 |\n| -10000 | \u20180,0.0\u2019 | -10,000.0 |\n| 10000.1234 | \u20180.000\u2019 | 10000.123 |\n| 100.1234 | \u201800000\u2019 | 00100 |\n| 1000.1234 | \u2018000000,0\u2019 | 001,000 |\n| 10 | \u2018000.00\u2019 | 010.00 |\n| 10000.1234 | \u20180[.]00000\u2019 | 10000.12340 |\n| -10000 | \u2018(0,0.0000)\u2019 | (10,000.0000) |\n| -0.23 | \u2018.00\u2019 | -.23 |\n| -0.23 | \u2018(.00)\u2019 | (.23) |\n| 0.23 | \u20180.00000\u2019 | 0.23000 |\n| 0.23 | \u20180.0[0000]\u2019 | 0.23 |\n| 1230974 | \u20180.0a\u2019 | 1.2m |\n| 1460 | \u20180 a\u2019 | 1 k |\n| -104000 | \u20180a\u2019 | -104k |\n| 1 | \u20180o\u2019 | 1st |\n| 100 | \u20180o\u2019 | 100th |\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/format-numeric-types.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Format numeric types in visualizations\n###### Currency\n\nThe following examples use a `$` symbol to denote currency. You can use any other currency symbol that is available on your keyboard. 
\n| Number | Format | Output |\n| --- | --- | --- |\n| 1000.234 | \u2018$0,000.00\u2019 | $1,000.23 |\n| 1000.2 | \u20180,0[.]00 $\u2019 | 1,000.20 $ |\n| 1001 | \u2018$ 0,0[.]00\u2019 | $ 1,001 |\n| -1000.234 | \u2018($0,0)\u2019 | ($1,000) |\n| -1000.234 | \u2018$0.00\u2019 | -$1000.23 |\n| 1230974 | \u2018($ 0.00 a)\u2019 | $ 1.23 m |\n\n##### Format numeric types in visualizations\n###### Bytes\n\n| Number | Format | Output |\n| --- | --- | --- |\n| 100 | \u20180b\u2019 | 100B |\n| 1024 | \u20180b\u2019 | 1KB |\n| 2048 | \u20180 ib\u2019 | 2 KiB |\n| 3072 | \u20180.0 b\u2019 | 3.1 KB |\n| 7884486213 | \u20180.00b\u2019 | 7.88GB |\n| 3467479682787 | \u20180.000 ib\u2019 | 3.154 TiB |\n\n##### Format numeric types in visualizations\n###### Percentages\n\n| Number | Format | Output |\n| --- | --- | --- |\n| 100 | \u20180%\u2019 | 100% |\n| 97.4878234 | \u20180.000%\u2019 | 97.488% |\n| -4.3 | \u20180 %\u2019 | -4 % |\n| 65.43 | \u2018(0.000 %)\u2019 | 65.430 % |\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/format-numeric-types.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Format numeric types in visualizations\n###### Exponentials\n\n| Number | Format | Output |\n| --- | --- | --- |\n| 1123456789 | \u20180,0e+0\u2019 | 1e+9 |\n| 12398734.202 | \u20180.00e+0\u2019 | 1.24e+7 |\n| 0.000123987 | \u20180.000e+0\u2019 | 1.240e-4 |\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/format-numeric-types.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Upgrade ML workflows to target models in Unity Catalog\n\nThis article explains how to migrate and upgrade ML [workflows](https:\/\/docs.databricks.com\/workflows\/index.html) to target models in Unity Catalog.\n\n##### Upgrade ML workflows to target models in Unity Catalog\n###### Requirements\n\nBefore 
getting started, make sure you meet the requirements in [Requirements](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#requirements). In particular, make sure that the users or principals used to execute your model training, deployment, and inference workflows have the necessary privileges on a registered model in Unity Catalog: \n* Training: Ownership of the registered model (required to create new model versions), plus `USE CATALOG` and `USE SCHEMA` privileges on the enclosing catalog and schema.\n* Deployment: Ownership of the registered model (required to set aliases on the model), plus `USE CATALOG` and `USE SCHEMA` privileges on the enclosing catalog and schema.\n* Inference: `EXECUTE` privilege on the registered model (required to read and perform inference with model versions), plus `USE CATALOG` and `USE SCHEMA` privileges on the enclosing catalog and schema.\n\n##### Upgrade ML workflows to target models in Unity Catalog\n###### Creating parallel training, deployment, and inference workflows\n\nTo upgrade model training and inference workflows to Unity Catalog, Databricks recommends an incremental approach in which you create a parallel training, deployment, and inference pipeline that leverages models in Unity Catalog. When you\u2019re comfortable with the results using Unity Catalog, you can switch downstream consumers to read the batch inference output, or increase the traffic routed to models in Unity Catalog in serving endpoints.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Upgrade ML workflows to target models in Unity Catalog\n###### Model training workflow\n\n[Clone](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#create-a-job-from-an-existing-job) your model training workflow. 
Then, ensure that: \n1. The workflow cluster has access to Unity Catalog and meets the requirements described in [Requirements](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#requirements).\n2. The principal running the workflow has the [necessary permissions](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html#requirements) on a registered model in Unity Catalog. \nNext, modify model training code in the cloned workflow. You may need to clone the notebook run by the workflow, or create and target a new git branch in the cloned workflow. Follow [these steps](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#upgrade-training-workloads-for-uc) to install the necessary version of MLflow, configure the client to target Unity Catalog in your training code, and update the model training code to register models to Unity Catalog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Upgrade ML workflows to target models in Unity Catalog\n###### Model deployment workflow\n\nClone your model deployment workflow, following similar steps as in [Model training workflow](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html#model-training) to update its compute configuration to enable access to Unity Catalog. \nEnsure the principal who owns the cloned workflow has the [necessary permissions](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html#requirements). If you have model validation logic in your deployment workflow, update it to [load model versions from UC](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#load-models-for-inference). 
Use [aliases](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#uc-model-aliases) to manage production model rollouts.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Upgrade ML workflows to target models in Unity Catalog\n###### Model inference workflow\n\n### Batch inference workflow \nFollow similar steps as in [Model training workflow](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html#model-training) to clone the batch inference workflow and update its compute configuration to enable access to Unity Catalog. Ensure the principal running the cloned batch inference job has the [necessary permissions](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html#requirements) to load the model for inference. \n### Model serving workflow \nIf you are using [Databricks Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html), you do not need to clone your existing endpoint. Instead, you can leverage the [traffic split](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/serve-multiple-models-to-serving-endpoint.html) feature to route a small fraction of traffic to models in Unity Catalog. \nFirst, ensure the principal who owns the model serving endpoint has the [necessary permissions](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html#requirements) to load the model for inference. 
Then, update your [cloned model deployment workflow](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html#model-deployment) to assign a small percentage of traffic to model versions in Unity Catalog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Upgrade ML workflows to target models in Unity Catalog\n###### Promote a model across environments\n\nDatabricks recommends that you deploy ML pipelines as code. This eliminates the need to promote models across environments, as all production models can be produced through automated training workflows in a production environment. \nHowever, in some cases, it may be too expensive to retrain models across environments. Instead, you can copy model versions across registered models in Unity Catalog to promote them across environments. \nYou need the following privileges to execute the example code below: \n* `USE CATALOG` on the `staging` and `prod` catalogs.\n* `USE SCHEMA` on the `staging.ml_team` and `prod.ml_team` schemas.\n* `EXECUTE` on `staging.ml_team.fraud_detection`. \nIn addition, you must be the owner of the registered model `prod.ml_team.fraud_detection`. \nThe following code snippet uses the `copy_model_version` [MLflow Client API](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.client.html#mlflow.client.MlflowClient.copy_model_version), available in MLflow version 2.8.0 and above. 
\n```\nimport mlflow\nmlflow.set_registry_uri(\"databricks-uc\")\n\nclient = mlflow.tracking.MlflowClient()\nsrc_model_name = \"staging.ml_team.fraud_detection\"\nsrc_model_version = \"1\"\nsrc_model_uri = f\"models:\/{src_model_name}\/{src_model_version}\"\ndst_model_name = \"prod.ml_team.fraud_detection\"\ncopied_model_version = client.copy_model_version(src_model_uri, dst_model_name)\n\n``` \nAfter the model version is in the production environment, you can perform any necessary pre-deployment validation. Then, you can mark the model version for deployment [using aliases](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#uc-model-aliases). \n```\nclient = mlflow.tracking.MlflowClient()\nclient.set_registered_model_alias(name=\"prod.ml_team.fraud_detection\", alias=\"Champion\", version=copied_model_version.version)\n\n``` \nIn the example above, only users who can read from the `staging.ml_team.fraud_detection` registered model and write to the `prod.ml_team.fraud_detection` registered model can promote staging models to the production environment. The same users can also use aliases to manage which model versions are deployed within the production environment. You don\u2019t need to configure any other rules or policies to govern model promotion and deployment. \nYou can customize this flow to promote the model version across multiple environments that match your setup, such as `dev`, `qa`, and `prod`. 
Access control is enforced as configured in each environment.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Upgrade ML workflows to target models in Unity Catalog\n###### Use job webhooks for manual approval for model deployment\n\nDatabricks recommends that you automate model deployment if possible, using appropriate checks and tests during the model deployment process. However, if you do need to perform manual approvals to deploy production models, you can use\n[job webhooks](https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html#configure-system-notifications) to call out to external CI\/CD systems to request manual approval for deploying a model, after your model training job completes successfully. After manual approval is provided, your CI\/CD system can then deploy the model version to serve traffic, for example by setting the \u201cChampion\u201d alias on it.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### AI Functions on Databricks\n##### Analyze customer reviews using AI Functions\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article illustrates how to use AI Functions to examine customer reviews and determine if a response needs to be generated. The AI Functions used in this example are built-in Databricks SQL functions, powered by generative AI models made available by Databricks Foundation Model APIs. See [AI Functions on Databricks](https:\/\/docs.databricks.com\/large-language-models\/ai-functions.html). 
\nThis example performs the following on a test dataset called `product_reviews` using AI Functions: \n* Determines the sentiment of a review.\n* For negative reviews, extracts information from the review to classify the cause.\n* Identifies whether a response is required back to the customer.\n* Generates a response mentioning alternative products that may satisfy the customer.\n\n##### Analyze customer reviews using AI Functions\n###### Requirements\n\n* A workspace in a Foundation Model APIs [pay-per-token supported region](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#required).\n* These functions are not available on Databricks SQL Classic.\n* During the preview, these functions have restrictions on their performance. Reach out to your Databricks account team if you require a higher quota for your use cases.\n\n##### Analyze customer reviews using AI Functions\n###### Analyze sentiment of reviews\n\nYou can use [ai\\_analyze\\_sentiment()](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_analyze_sentiment.html) to understand how customers feel based on their reviews. In the following example, the sentiment can be positive, negative, neutral, or mixed. \n```\nSELECT\nreview,\nai_analyze_sentiment(review) AS sentiment\nFROM\nproduct_reviews;\n\n``` \nThe following results show that the function returns the sentiment for each review without any prompt engineering or result parsing. 
\n![Results for ai_sentiment function](https:\/\/docs.databricks.com\/_images\/ai-sentiment-results.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/ai-functions-example.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### AI Functions on Databricks\n##### Analyze customer reviews using AI Functions\n###### Classify reviews\n\nIn this example, after identifying negative reviews you can use [ai\\_classify()](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_classify.html) to gain more insights into customer reviews, like whether the negative review is due to poor logistics, product quality, or other factors. \n```\nSELECT\nreview,\nai_classify(\nreview,\nARRAY(\n\"Arrives too late\",\n\"Wrong size\",\n\"Wrong color\",\n\"Dislike the style\"\n)\n) AS reason\nFROM\nproduct_reviews\nWHERE\nai_analyze_sentiment(review) = \"negative\"\n\n``` \nIn this case, `ai_classify()` is able to correctly categorize the negative reviews based on custom labels to allow for further analysis. \n![Results for ai_classify function](https:\/\/docs.databricks.com\/_images\/ai-classify-results.png)\n\n##### Analyze customer reviews using AI Functions\n###### Extract information from reviews\n\nYou might want to improve your product description based on the reasons customers had for their negative reviews. You can find key information from a blob of text using [ai\\_extract()](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_extract.html). 
The following example extracts information and classifies if the negative review was based on sizing issues with the product: \n```\nSELECT\nreview,\nai_extract(review, array(\"usual size\")) AS usual_size,\nai_classify(review, array(\"Size is wrong\", \"Size is right\")) AS fit\nFROM\nproduct_reviews\n\n``` \nThe following are a sample of results: \n![Results for ai_extract function](https:\/\/docs.databricks.com\/_images\/ai-extract-results.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/ai-functions-example.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### AI Functions on Databricks\n##### Analyze customer reviews using AI Functions\n###### Generate responses with recommendations\n\nAfter reviewing the customer responses, you can use the [ai\\_gen()](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_gen.html) function to generate a response to a customer based on their complaint and strengthen customer relationships with prompt replies to their feedback. 
\n```\nSELECT\nreview,\nai_gen(\n\"Generate a reply in 60 words to address the customer's review.\nMention their opinions are valued and a 30% discount coupon code has been sent to their email.\nCustomer's review: \" || review\n) AS reply\nFROM\nproduct_reviews\nWHERE\nai_analyze_sentiment(review) = \"negative\"\n\n``` \nThe following are a sample of results: \n![Results for ai_gen_results function](https:\/\/docs.databricks.com\/_images\/ai-gen-results.png)\n\n##### Analyze customer reviews using AI Functions\n###### Additional resources\n\n* [AI Functions on Databricks](https:\/\/docs.databricks.com\/large-language-models\/ai-functions.html)\n* [Databricks Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/ai-functions-example.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Log, load, register, and deploy MLflow models\n\nAn MLflow [Model](https:\/\/mlflow.org\/docs\/latest\/models.html) is a standard format for packaging machine learning models that can be used in a variety of downstream tools\u2014for example, batch inference on Apache Spark or real-time serving through a REST API. The format defines a convention that lets you save a model in different [flavors](https:\/\/www.mlflow.org\/docs\/latest\/models.html#built-in-model-flavors) (python-function, pytorch, sklearn, and so on), that can be understood by different model [serving and inference platforms](https:\/\/www.mlflow.org\/docs\/latest\/models.html#built-in-deployment-tools).\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/models.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Log, load, register, and deploy MLflow models\n##### Log and load models\n\nWhen you log a model, MLflow automatically logs `requirements.txt` and `conda.yaml` files. 
You can use these files to recreate the model development environment and reinstall dependencies using `virtualenv` (recommended) or `conda`. \n* [Databricks Autologging](https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html) \nImportant \nAnaconda Inc. updated their [terms of service](https:\/\/www.anaconda.com\/terms-of-service) for anaconda.org channels. Based on the new terms of service, you may require a commercial license if you rely on Anaconda\u2019s packaging and distribution. See [Anaconda Commercial Edition FAQ](https:\/\/www.anaconda.com\/blog\/anaconda-commercial-edition-faq) for more information. Your use of any Anaconda channels is governed by their terms of service. \nMLflow models logged before [v1.18](https:\/\/mlflow.org\/news\/2021\/06\/18\/1.18.0-release\/index.html) (Databricks Runtime 8.3 ML or earlier) were by default logged with the conda `defaults` channel (<https:\/\/repo.anaconda.com\/pkgs\/>) as a dependency. Because of this license change, Databricks has stopped the use of the `defaults` channel for models logged using MLflow v1.18 and above. The default channel logged is now `conda-forge`, which points at the community-managed <https:\/\/conda-forge.org\/>. \nIf you logged a model before MLflow v1.18 without excluding the `defaults` channel from the conda environment for the model, that model may have a dependency on the `defaults` channel that you may not have intended.\nTo manually confirm whether a model has this dependency, you can examine the `channels` value in the `conda.yaml` file that is packaged with the logged model. 
For example, a model\u2019s `conda.yaml` with a `defaults` channel dependency may look like this: \n```\nchannels:\n- defaults\ndependencies:\n- python=3.8.8\n- pip\n- pip:\n- mlflow\n- scikit-learn==0.23.2\n- cloudpickle==1.6.0\nname: mlflow-env\n\n``` \nBecause Databricks cannot determine whether your use of the Anaconda repository to interact with your models is permitted under your relationship with Anaconda, Databricks is not forcing its customers to make any changes. If your use of the Anaconda.com repo through the use of Databricks is permitted under Anaconda\u2019s terms, you do not need to take any action. \nIf you would like to change the channel used in a model\u2019s environment, you can re-register the model to the model registry with a new `conda.yaml`. You can do this by specifying the channel in the `conda_env` parameter of `log_model()`. \nFor more information on the `log_model()` API, see the MLflow documentation for the model flavor you are working with, for example, [log\\_model for scikit-learn](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.sklearn.html#mlflow.sklearn.log_model). \nFor more information on `conda.yaml` files, see the [MLflow documentation](https:\/\/www.mlflow.org\/docs\/latest\/models.html#additional-logged-files). \n### API commands \nTo log a model to the MLflow [tracking server](https:\/\/docs.databricks.com\/mlflow\/tracking.html), use `mlflow.<model-type>.log_model(model, ...)`. \nTo load a previously logged model for inference or further development, use `mlflow.<model-type>.load_model(modelpath)`, where `modelpath` is one of the following: \n* a run-relative path (such as `runs:\/{run_id}\/{model-path}`)\n* a DBFS path\n* a [registered model](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html) path (such as `models:\/{model_name}\/{model_stage}`). 
\nFor a complete list of options for loading MLflow models, see [Referencing Artifacts in the MLflow documentation](https:\/\/www.mlflow.org\/docs\/latest\/concepts.html#artifact-locations). \nFor Python MLflow models, an additional option is to use `mlflow.pyfunc.load_model()` to load the model as a generic Python function.\nYou can use the following code snippet to load the model and score data points. \n```\nmodel = mlflow.pyfunc.load_model(model_path)\nmodel.predict(model_input)\n\n``` \nAs an alternative, you can export the model as an Apache Spark UDF to use for scoring on a Spark cluster,\neither as a batch job or as a real-time [Spark Streaming](https:\/\/docs.databricks.com\/structured-streaming\/index.html) job. \n```\n# load input data table as a Spark DataFrame\ninput_data = spark.table(input_table_name)\nmodel_udf = mlflow.pyfunc.spark_udf(spark, model_path)\ndf = input_data.withColumn(\"prediction\", model_udf())\n\n``` \n### Log model dependencies \nTo accurately load a model, you should make sure the model dependencies are loaded with the correct versions into the notebook environment. In Databricks Runtime 10.5 ML and above, MLflow warns you if a mismatch is detected between the current environment and the model\u2019s dependencies. \nAdditional functionality to simplify restoring model dependencies is included in Databricks Runtime 11.0 ML and above. In Databricks Runtime 11.0 ML and above, for `pyfunc` flavor models, you can call `mlflow.pyfunc.get_model_dependencies` to retrieve and download the model dependencies. This function returns a path to the dependencies file which you can then install by using `%pip install <file-path>`. When you load a model as a PySpark UDF, specify `env_manager=\"virtualenv\"` in the `mlflow.pyfunc.spark_udf` call. This restores model dependencies in the context of the PySpark UDF and does not affect the outside environment. 
\nYou can also use this functionality in Databricks Runtime 10.5 or below by manually installing [MLflow version 1.25.0 or above](https:\/\/www.mlflow.org\/docs\/latest\/index.html): \n```\n%pip install \"mlflow>=1.25.0\"\n\n``` \nFor additional information on how to log model dependencies (Python and non-Python) and artifacts, see [Log model dependencies](https:\/\/docs.databricks.com\/mlflow\/log-model-dependencies.html). \nLearn how to log model dependencies and custom artifacts for model serving: \n* [Deploy models with dependencies](https:\/\/docs.databricks.com\/mlflow\/log-model-dependencies.html#deploy-dependencies)\n* [Use custom Python libraries with Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/private-libraries-model-serving.html)\n* [Package custom artifacts for Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-custom-artifacts.html) \n* [Log model dependencies](https:\/\/docs.databricks.com\/mlflow\/log-model-dependencies.html) \n### Automatically generated code snippets in the MLflow UI \nWhen you log a model in a Databricks notebook, Databricks automatically generates code snippets that you can copy and use to load and run the model. To view these code snippets: \n1. Navigate to the Runs screen for the run that generated the model. (See [View notebook experiment](https:\/\/docs.databricks.com\/mlflow\/experiments.html#view-notebook-experiment) for how to display the Runs screen.)\n2. Scroll to the **Artifacts** section.\n3. Click the name of the logged model. A panel opens to the right showing code you can use to load the logged model and make predictions on Spark or pandas DataFrames. \n![Artifact panel code snippets](https:\/\/docs.databricks.com\/_images\/code-snippets.png) \n### Examples \nFor examples of logging models, see the examples in [Track machine learning training runs examples](https:\/\/docs.databricks.com\/mlflow\/tracking.html#tracking-examples). 
For an example of loading a logged model for inference, see the [Model inference example](https:\/\/docs.databricks.com\/mlflow\/model-example.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/models.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Log, load, register, and deploy MLflow models\n##### Register models in the Model Registry\n\nYou can register models in the MLflow Model Registry, a centralized model store that provides a UI and set of APIs to manage the full lifecycle of MLflow Models. For instructions on how to use the Model Registry to manage models in Databricks Unity Catalog, see [Manage model lifecycle in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html). To use the Workspace Model Registry, see [Manage model lifecycle using the Workspace Model Registry (legacy)](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html). \nTo register a model using the API, use `mlflow.register_model(\"runs:\/{run_id}\/{model-path}\", \"{registered-model-name}\")`.\n\n#### Log, load, register, and deploy MLflow models\n##### Save models to DBFS\n\nTo save a model locally, use `mlflow.<model-type>.save_model(model, modelpath)`. `modelpath` must be a [DBFS](https:\/\/docs.databricks.com\/dbfs\/index.html) path. 
For example, if you use a DBFS location `dbfs:\/my_project_models` to store your project work, you must use the model path `\/dbfs\/my_project_models`: \n```\nmodelpath = \"\/dbfs\/my_project_models\/model-%f-%f\" % (alpha, l1_ratio)\nmlflow.sklearn.save_model(lr, modelpath)\n\n``` \nFor MLlib models, use [ML Pipelines](https:\/\/spark.apache.org\/docs\/latest\/ml-pipeline.html#ml-persistence-saving-and-loading-pipelines).\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/models.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Log, load, register, and deploy MLflow models\n##### Download model artifacts\n\nYou can download the logged model artifacts (such as model files, plots, and metrics) for a registered model with various APIs. \n[Python API](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.tracking.html#mlflow.tracking.MlflowClient.get_model_version_download_uri) example: \n```\nfrom mlflow.tracking import MlflowClient\nfrom mlflow.store.artifact.models_artifact_repo import ModelsArtifactRepository\n\n# get_model_version_download_uri is an instance method, so create a client first.\nclient = MlflowClient()\nmodel_uri = client.get_model_version_download_uri(model_name, model_version)\nModelsArtifactRepository(model_uri).download_artifacts(artifact_path=\"\")\n\n``` \n[Java API](https:\/\/mlflow.org\/docs\/latest\/java_api\/org\/mlflow\/tracking\/MlflowClient.html#downloadModelVersion-java.lang.String-java.lang.String-) example: \n```\nMlflowClient mlflowClient = new MlflowClient();\n\/\/ Get the model URI for a registered model version.\nString modelURI = mlflowClient.getModelVersionDownloadUri(modelName, modelVersion);\n\n\/\/ Or download the model artifacts directly.\nFile modelFile = mlflowClient.downloadModelVersion(modelName, modelVersion);\n\n``` \n[CLI command](https:\/\/www.mlflow.org\/docs\/latest\/cli.html#mlflow-artifacts-download) example: \n```\nmlflow artifacts download --artifact-uri models:\/<name>\/<version|stage>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/models.html"} +{"content":"# AI and 
Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Log, load, register, and deploy MLflow models\n##### Deploy models for online serving\n\nYou can use [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) to host machine learning models from the Model Registry as REST endpoints. These endpoints are updated automatically based on the availability of model versions and their stages. \nYou can also deploy a model to third-party serving frameworks using [MLflow\u2019s built-in deployment tools](https:\/\/mlflow.org\/docs\/latest\/models.html#built-in-deployment-tools). \nSee the following examples: \n* [scikit-learn model deployment on SageMaker](https:\/\/docs.databricks.com\/mlflow\/scikit-learn-model-deployment-on-sagemaker.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/models.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Google BigQuery\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes how to set up Lakehouse Federation to run federated queries on BigQuery data that is not managed by Databricks. To learn more about Lakehouse Federation, see [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). \nTo connect to your BigQuery database using Lakehouse Federation, you must create the following in your Databricks Unity Catalog metastore: \n* A *connection* to your BigQuery database.\n* A *foreign catalog* that mirrors your BigQuery database in Unity Catalog so that you can use Unity Catalog query syntax and data governance tools to manage Databricks user access to the database.\n\n#### Run federated queries on Google BigQuery\n##### Before you begin\n\nWorkspace requirements: \n* Workspace enabled for Unity Catalog. 
\nCompute requirements: \n* Network connectivity from your Databricks Runtime cluster or SQL warehouse to the target database systems. See [Networking recommendations for Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/networking.html).\n* Databricks clusters must use Databricks Runtime 13.3 LTS or above and shared or single-user access mode.\n* SQL warehouses must be Pro or Serverless. \nPermissions required: \n* To create a connection, you must be a metastore admin or a user with the `CREATE CONNECTION` privilege on the Unity Catalog metastore attached to the workspace.\n* To create a foreign catalog, you must have the `CREATE CATALOG` permission on the metastore and be either the owner of the connection or have the `CREATE FOREIGN CATALOG` privilege on the connection. \nAdditional permission requirements are specified in each task-based section that follows.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/bigquery.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Google BigQuery\n##### Create a connection\n\nA connection specifies a path and credentials for accessing an external database system. To create a connection, you can use Catalog Explorer or the `CREATE CONNECTION` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** Metastore admin or user with the `CREATE CONNECTION` privilege. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **External Data** menu and select **Connections**.\n3. Click **Create connection**.\n4. Enter a user-friendly **Connection name**.\n5. Select a **Connection type** of BigQuery.\n6. Enter the following connection property for your BigQuery instance. \n**GoogleServiceAccountKeyJson**: A raw JSON object that is used to specify the BigQuery project and provide authentication. 
You can generate this JSON object and download it from the service account details page in Google Cloud under \u2018KEYS\u2019. The service account must have proper permissions granted in BigQuery, including BigQuery User and BigQuery Data Viewer. The following is an example. \n```\n{\n\"type\": \"service_account\",\n\"project_id\": \"PROJECT_ID\",\n\"private_key_id\": \"KEY_ID\",\n\"private_key\": \"-----BEGIN PRIVATE KEY-----\\nPRIVATE_KEY\\n-----END PRIVATE KEY-----\\n\",\n\"client_email\": \"SERVICE_ACCOUNT_EMAIL\",\n\"client_id\": \"CLIENT_ID\",\n\"auth_uri\": \"https:\/\/accounts.google.com\/o\/oauth2\/auth\",\n\"token_uri\": \"https:\/\/accounts.google.com\/o\/oauth2\/token\",\n\"auth_provider_x509_cert_url\": \"https:\/\/www.googleapis.com\/oauth2\/v1\/certs\",\n\"client_x509_cert_url\": \"https:\/\/www.googleapis.com\/robot\/v1\/metadata\/x509\/SERVICE_ACCOUNT_EMAIL\",\n\"universe_domain\": \"googleapis.com\"\n}\n\n```\n7. (Optional) Click **Test connection** to confirm network connectivity. This action does not test authentication.\n8. (Optional) Add a comment.\n9. Click **Create**. \nRun the following command in a notebook or the Databricks SQL query editor. Replace `<GoogleServiceAccountKeyJson>` with a raw JSON object that specifies the BigQuery project and provides authentication. You can generate this JSON object and download it from the service account details page in Google Cloud under \u2018KEYS\u2019. The service account needs to have proper permissions granted in BigQuery, including BigQuery User and BigQuery Data Viewer. For an example JSON object, view the **Catalog Explorer** tab on this page. \n```\nCREATE CONNECTION <connection-name> TYPE bigquery\nOPTIONS (\nGoogleServiceAccountKeyJson '<GoogleServiceAccountKeyJson>'\n);\n\n``` \nWe recommend that you use Databricks [secrets](https:\/\/docs.databricks.com\/security\/secrets\/index.html) instead of plaintext strings for sensitive values like credentials. 
For example: \n```\nCREATE CONNECTION <connection-name> TYPE bigquery\nOPTIONS (\nGoogleServiceAccountKeyJson secret ('<secret-scope>','<secret-key-user>')\n)\n\n``` \nFor information about setting up secrets, see [Secret management](https:\/\/docs.databricks.com\/security\/secrets\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/bigquery.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Google BigQuery\n##### Create a foreign catalog\n\nA foreign catalog mirrors a database in an external data system so that you can query and manage access to data in that database using Databricks and Unity Catalog. To create a foreign catalog, use a connection to the data source that has already been defined. \nTo create a foreign catalog, you can use Catalog Explorer or `CREATE FOREIGN CATALOG` in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** `CREATE CATALOG` permission on the metastore and either ownership of the connection or the `CREATE FOREIGN CATALOG` privilege on the connection. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Click the **Create Catalog** button.\n3. On the **Create a new catalog** dialog, enter a name for the catalog and select a **Type** of **Foreign**.\n4. Select the **Connection** that provides access to the database that you want to mirror as a Unity Catalog catalog.\n5. Click **Create.** \nRun the following SQL command in a notebook or the Databricks SQL editor. Items in brackets are optional. Replace the placeholder values. \n* `<catalog-name>`: Name for the catalog in Databricks.\n* `<connection-name>`: The [connection object](https:\/\/docs.databricks.com\/query-federation\/bigquery.html#connection) that specifies the data source, path, and access credentials. 
\n```\nCREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>;\n\n```\n\n#### Run federated queries on Google BigQuery\n##### Supported pushdowns\n\nThe following pushdowns are supported: \n* Filters\n* Projections\n* Limit\n* Functions: partial, only for filter expressions. (String functions, Mathematical functions, Date, Time, and Timestamp functions, and other miscellaneous functions, such as Alias, Cast, SortOrder)\n* Aggregates\n* Sorting, when used with limit \nThe following pushdowns are not supported: \n* Joins\n* Window functions\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/bigquery.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Google BigQuery\n##### Data type mappings\n\nThe following table shows the BigQuery to Spark data type mapping. \n| BigQuery type | Spark type |\n| --- | --- |\n| bignumeric, numeric | DecimalType |\n| int64 | LongType |\n| float64 | DoubleType |\n| array, geography, interval, json, string, struct | VarcharType |\n| bytes | BinaryType |\n| bool | BooleanType |\n| date | DateType |\n| datetime, time, timestamp | TimestampType\/TimestampNTZType | \nWhen you read from BigQuery, BigQuery `Timestamp` is mapped to Spark `TimestampType` if `preferTimestampNTZ = false` (default). BigQuery `Timestamp` is mapped to `TimestampNTZType` if `preferTimestampNTZ = true`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/bigquery.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n### Set up query federation for non-Unity-Catalog workspaces\n##### Query federation for PostgreSQL in Databricks SQL (Experimental)\n\nExperimental \nThe configurations described in this article are [Experimental](https:\/\/docs.databricks.com\/release-notes\/release-types.html). Experimental features are provided as-is and are not supported by Databricks through customer technical support. 
**To get full query federation support, you should instead use [Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html), which enables your Databricks users to take advantage of Unity Catalog syntax and data governance tools.** \nThis article describes how to configure read-only query federation to PostgreSQL on serverless and pro SQL warehouses. \nYou configure connections to PostgreSQL at the table level. You can use [secrets](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/secret.html) to store and access text credentials without displaying them in plaintext. See the following example: \n```\nDROP TABLE IF EXISTS postgresql_table;\nCREATE TABLE postgresql_table\nUSING postgresql\nOPTIONS (\ndbtable '<table-name>',\nhost '<database-host-url>',\nport '5432',\ndatabase '<database-name>',\nuser secret('postgres_creds', 'my_username'),\npassword secret('postgres_creds', 'my_password')\n);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/postgresql-no-uc.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### PyTorch\n##### Train a PyTorch model\n\n[PyTorch](https:\/\/docs.databricks.com\/machine-learning\/train-model\/pytorch.html) is a Python package that provides GPU-accelerated tensor computation and high level functionality for building deep learning networks. \nThe MLflow PyTorch notebook fits a neural network on MNIST handwritten digit recognition data and logs run results to an MLflow server. It logs training metrics and weights in TensorFlow event format locally and then uploads them to the MLflow run\u2019s artifact directory. Finally, it starts TensorBoard and reads the events logged locally. 
\nWhen you\u2019re ready you can deploy your model using [Model serving with Databricks](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html).\n\n##### Train a PyTorch model\n###### MLflow PyTorch model training notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-pytorch-training.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/tracking-ex-pytorch.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualizations in Databricks notebooks\n\nDatabricks has built-in support for charts and visualizations in both Databricks SQL and in notebooks. This page describes how to work with visualizations in a Databricks notebook. For information about using visualizations in Databricks SQL, see [Visualization in Databricks SQL](https:\/\/docs.databricks.com\/sql\/user\/visualizations\/index.html). \nTo view the types of visualizations, see [visualization types](https:\/\/docs.databricks.com\/visualizations\/visualization-types.html). \nImportant \nFor information about a preview version of Databricks charts, see [preview chart visualizations](https:\/\/docs.databricks.com\/visualizations\/preview-chart-visualizations.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/index.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualizations in Databricks notebooks\n##### Create a new visualization\n\nTo recreate the example in this section, use the following code: \n```\nsparkDF = spark.read.csv(\"\/databricks-datasets\/bikeSharing\/data-001\/day.csv\", header=\"true\", inferSchema=\"true\")\ndisplay(sparkDF)\n\n``` \nTo create a visualization, click **+** above a result and select **Visualization**. The visualization editor appears. 
\n![New visualization menu](https:\/\/docs.databricks.com\/_images\/new-visualization-menu.png) \n1. In the **Visualization Type** drop-down, choose a type. \n![Visualization editor](https:\/\/docs.databricks.com\/_images\/visualization-editor.png)\n2. Select the data to appear in the visualization. The fields available depend on the selected type.\n3. Click **Save**. \n### Visualization tools \nIf you hover over the top right of a chart in the visualization editor, a Plotly toolbar appears where you can perform operations such as select, zoom, and pan. \n![Notebook visualization editor toolbar](https:\/\/docs.databricks.com\/_images\/viz-plotly-bar.png) \nIf you hover over the top right of a chart outside the visualization editor a smaller subset of tools appears: \n![Notebook chart toolbar](https:\/\/docs.databricks.com\/_images\/nb-plotly-bar.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/index.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualizations in Databricks notebooks\n##### Create a new data profile\n\nNote \nAvailable in Databricks Runtime 9.1 LTS and above. \nData profiles display summary statistics of an Apache Spark DataFrame, a pandas DataFrame, or a SQL table in tabular and graphic format. To create a data profile from a results cell, click **+** and select **Data Profile**. \nDatabricks calculates and displays the summary statistics. 
\n![Data Profile](https:\/\/docs.databricks.com\/_images\/data-profile.png) \n* Numeric and categorical features are shown in separate tables.\n* At the top of the tab, you can sort or search for features.\n* At the top of the chart column, you can choose to display a histogram (**Standard**) or quantiles.\n* Check **expand** to enlarge the charts.\n* Check **log** to display the charts on a log scale.\n* You can hover your cursor over the charts for more detailed information, such as the boundaries of a histogram column and the number of rows in it, or the quantile value. \nYou can also generate data profiles programmatically; see [summarize command (dbutils.data.summarize)](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#summarize-command-dbutilsdatasummarize).\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/index.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualizations in Databricks notebooks\n##### Work with visualizations and data profiles\n\nNote \nData profiles are available in Databricks Runtime 9.1 LTS and above. \n### Rename, duplicate, or remove a visualization or data profile \nTo rename, duplicate, or remove a visualization or data profile, click the downward pointing arrow at the right of the tab name. \n![Notebook visualization drop down menu](https:\/\/docs.databricks.com\/_images\/nb-viz-work-with-menu.png) \nYou can also change the name by clicking directly on it and editing the name in place. \n### Edit a visualization \nClick ![Edit visualization button](https:\/\/docs.databricks.com\/_images\/edit-visualization-button.png) beneath the visualization to open the visualization editor. When you have finished making changes, click **Save**. \n#### Edit colors \nYou can customize a visualization\u2019s colors when you create the visualization or by editing it. \n1. Create or edit a visualization.\n2. Click **Colors**.\n3. 
To modify a color, click the square and select the new color by doing one of the following: \n* Click it in the color selector.\n* Enter a hex value.\n4. Click anywhere outside the color selector to close it and save changes. \n#### Temporarily hide or show a series \nTo hide a series in a visualization, click the series in the legend. To show the series again, click it again in the legend. \nTo show only a single series, double-click the series in the legend. To show other series, click each one. \n### Download a visualization \nTo download a visualization in .png format, click the camera icon ![camera icon](https:\/\/docs.databricks.com\/_images\/viz-camera-icon.png) in the notebook cell or in the visualization editor. \n* In a result cell, the camera icon appears at the upper right when you move the cursor over the cell. \n![camera in notebook cell](https:\/\/docs.databricks.com\/_images\/camera-in-nb-cell.png)\n* In the visualization editor, the camera icon appears when you move the cursor over the chart. See [Visualization tools](https:\/\/docs.databricks.com\/visualizations\/index.html#visualization-tools). \n### Add a visualization or data profile to a dashboard \n1. Click the downward pointing arrow at the right of the tab name.\n2. Select **Add to dashboard**. A list of available dashboard views appears, along with a menu option **Add to new dashboard**.\n3. Select a dashboard or select **Add to new dashboard**. 
The dashboard appears, including the newly added visualization or data profile.\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Log\n#### load\n##### register\n###### and deploy MLflow models\n######## Databricks Autologging\n\nDatabricks Autologging is a no-code solution that extends [MLflow automatic logging](https:\/\/mlflow.org\/docs\/latest\/tracking.html#automatic-logging) to deliver automatic experiment tracking for machine learning training sessions on Databricks. \nWith Databricks Autologging, model parameters, metrics, files, and lineage information are automatically captured when you train models from a variety of popular machine learning libraries. Training sessions are recorded as [MLflow tracking runs](https:\/\/docs.databricks.com\/mlflow\/tracking.html). Model files are also tracked so you can easily log them to the [MLflow Model Registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html) and deploy them for real-time scoring with [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html). \nThe following video shows Databricks Autologging with a scikit-learn model training session in an\ninteractive Python notebook. Tracking information is automatically captured and displayed in the\nExperiment Runs sidebar and in the MLflow UI. 
\n![Autologging example](https:\/\/docs.databricks.com\/_images\/autologging-example.gif)\n\n######## Databricks Autologging\n######### Requirements\n\n* Databricks Autologging is generally available in all regions with Databricks Runtime 10.4 LTS ML or above.\n* Databricks Autologging is available in select preview regions with Databricks Runtime 9.1 LTS ML or above.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Log\n#### load\n##### register\n###### and deploy MLflow models\n######## Databricks Autologging\n######### How it works\n\nWhen you attach an interactive Python notebook to a Databricks cluster, Databricks Autologging\ncalls [mlflow.autolog()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.autolog)\nto set up tracking for your model training sessions. When you train models in the notebook,\nmodel training information is automatically tracked with\n[MLflow Tracking](https:\/\/docs.databricks.com\/mlflow\/tracking.html). For information about how this model training\ninformation is secured and managed, see [Security and data management](https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html#security-and-data-management). 
\nThe default configuration for the\n[mlflow.autolog()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.autolog) call is: \n```\nmlflow.autolog(\n    log_input_examples=False,\n    log_model_signatures=True,\n    log_models=True,\n    disable=False,\n    exclusive=False,\n    disable_for_unsupported_versions=True,\n    silent=False\n)\n\n``` \nYou can [customize the autologging configuration](https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html#customize-logging-behavior).\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Log\n#### load\n##### register\n###### and deploy MLflow models\n######## Databricks Autologging\n######### Usage\n\nTo use Databricks Autologging, train a machine learning model in a\n[supported framework](https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html#supported-environments-and-frameworks) using an\ninteractive Databricks Python notebook. Databricks Autologging automatically records model lineage\ninformation, parameters, and metrics to [MLflow Tracking](https:\/\/docs.databricks.com\/mlflow\/tracking.html). You\ncan also [customize the behavior of Databricks Autologging](https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html#customize-logging-behavior). \nNote \nDatabricks Autologging is not applied to runs created using the\n[MLflow fluent API](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.html) with\n`mlflow.start_run()`. In these cases, you must call `mlflow.autolog()` to save autologged content\nto the MLflow run. See [Track additional content](https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html#track-additional-content). 
\n### Customize logging behavior \nTo customize logging, use [mlflow.autolog()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.autolog).\nThis function provides configuration parameters to enable model logging (`log_models`),\ncollect input examples (`log_input_examples`), configure warnings (`silent`), and more. \n### Track additional content \nTo track additional metrics, parameters, files, and metadata with MLflow runs created by\nDatabricks Autologging, follow these steps in a Databricks interactive Python notebook: \n1. Call [mlflow.autolog()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.autolog)\nwith `exclusive=False`.\n2. Start an MLflow run using [mlflow.start\\_run()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.start_run).\nYou can wrap this call in `with mlflow.start_run()`; when you do this, the run is ended automatically after it completes.\n3. Use [MLflow Tracking methods](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html), such as\n[mlflow.log\\_param()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.log_param),\nto track pre-training content.\n4. Train one or more machine learning models in a framework supported by Databricks Autologging.\n5. Use [MLflow Tracking methods](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html), such as\n[mlflow.log\\_metric()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.log_metric),\nto track post-training content.\n6. If you did not use `with mlflow.start_run()` in Step 2, end the MLflow run using\n[mlflow.end\\_run()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.end_run). 
\nFor example: \n```\nimport mlflow\nmlflow.autolog(exclusive=False)\n\nwith mlflow.start_run():\n    mlflow.log_param(\"example_param\", \"example_value\")\n    # <your model training code here>\n    mlflow.log_metric(\"example_metric\", 5)\n\n``` \n### Disable Databricks Autologging \nTo disable Databricks Autologging in a Databricks interactive Python notebook, call\n[mlflow.autolog()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.autolog) with\n`disable=True`: \n```\nimport mlflow\nmlflow.autolog(disable=True)\n\n``` \nAdministrators can also disable Databricks Autologging for all clusters in a workspace from\nthe **Advanced** tab of the [admin settings page](https:\/\/docs.databricks.com\/admin\/index.html#admin-settings). Clusters\nmust be restarted for this change to take effect.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Log\n#### load\n##### register\n###### and deploy MLflow models\n######## Databricks Autologging\n######### Supported environments and frameworks\n\nDatabricks Autologging is supported in interactive Python notebooks and is available for the\nfollowing ML frameworks: \n* scikit-learn\n* Apache Spark MLlib\n* TensorFlow\n* Keras\n* PyTorch Lightning\n* XGBoost\n* LightGBM\n* Gluon\n* Fast.ai (version 1.x)\n* statsmodels. 
\nFor more information about each of the supported frameworks,\nsee [MLflow automatic logging](https:\/\/mlflow.org\/docs\/latest\/tracking.html#automatic-logging).\n\n######## Databricks Autologging\n######### Security and data management\n\nAll model training information tracked with Databricks Autologging is stored in MLflow Tracking and\nis secured by [MLflow Experiment permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#experiments).\nYou can share, modify, or delete model training information using the [MLflow Tracking](https:\/\/docs.databricks.com\/mlflow\/tracking.html)\nAPI or UI.\n\n######## Databricks Autologging\n######### Administration\n\nAdministrators can enable or disable Databricks Autologging for all interactive notebook sessions\nacross their workspace in the **Advanced** tab of the [admin settings page](https:\/\/docs.databricks.com\/admin\/index.html#admin-settings).\nChanges do not take effect until the cluster is restarted.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Log\n#### load\n##### register\n###### and deploy MLflow models\n######## Databricks Autologging\n######### Limitations\n\n* Databricks Autologging is not supported in Databricks jobs. To use autologging from jobs, you\ncan explicitly call [mlflow.autolog()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.autolog).\n* Databricks Autologging is enabled only on the driver node of your Databricks cluster. 
To use\nautologging from worker nodes, you must explicitly call\n[mlflow.autolog()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.autolog) from\nwithin the code executing on each worker.\n* The XGBoost scikit-learn integration is not supported.\n\n######## Databricks Autologging\n######### Apache Spark MLlib, Hyperopt, and automated MLflow tracking\n\nDatabricks Autologging does not change the behavior of existing automated MLflow tracking\nintegrations for [Apache Spark MLlib](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/mllib-mlflow-integration.html)\nand [Hyperopt](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-spark-mlflow-integration.html). \nNote \nIn Databricks Runtime 10.1 ML, disabling the automated MLflow tracking integration for Apache Spark MLlib `CrossValidator` and `TrainValidationSplit` models also disables the Databricks Autologging feature for all Apache Spark MLlib models.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html"} +{"content":"# What is Databricks Marketplace?\n### Access data products in Databricks Marketplace using external platforms\n\nThis article describes how to access data products in Databricks Marketplace without a Unity Catalog-enabled Databricks workspace. You can use Delta Sharing open sharing connectors to access Marketplace data using a number of common platforms, including Microsoft Power BI, Microsoft Excel, pandas, Apache Spark, and non-Unity Catalog Databricks workspaces. Only tabular data sets are available on external platforms (not Databricks notebooks, volumes, or models, for example). \nImportant \nIf you have a Databricks workspace that is enabled for Unity Catalog, you should access shared data using Unity Catalog. 
See [Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html).\n\n### Access data products in Databricks Marketplace using external platforms\n#### Before you begin\n\nTo browse data product listings on Databricks Marketplace, you can use either of the following: \n* The [Open Marketplace](https:\/\/marketplace.databricks.com).\n* A Databricks workspace. \nTo request access to data products, regardless of platform, you must have a Databricks workspace. \nIf you don\u2019t have one, you can get a free trial. Click **Try for free** on the [Open Marketplace](https:\/\/marketplace.databricks.com) and follow the prompts to start your trial.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-consumer-open.html"} +{"content":"# What is Databricks Marketplace?\n### Access data products in Databricks Marketplace using external platforms\n#### Browse Databricks Marketplace listings that are accessible on external platforms\n\nTo find a data product you want, simply browse or search the data product listings in Databricks Marketplace. Only **Data set** (tabular data) product types are available to share using external platforms or non-Unity-Catalog Databricks workspaces. \n1. Go to [marketplace.databricks.com](https:\/\/marketplace.databricks.com) or log into your Databricks workspace and click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**. \nNote \nAlternatively, you can search for Marketplace listings using the global search bar at the top of your Databricks workspace. See [Search for workspace objects](https:\/\/docs.databricks.com\/search\/index.html).\n2. Browse or search for the data product that you want. \nYou can filter listings by provider name, product type, category, cost (free or paid), or keyword search. Only the **Data set** product type is available for access using external platforms. 
\nIf you are logged into a Databricks workspace, you can also choose to view only the private listings available to you as part of a private exchange. See [Participate in private exchanges](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html#private-exchange).\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-consumer-open.html"} +{"content":"# What is Databricks Marketplace?\n### Access data products in Databricks Marketplace using external platforms\n#### Get access to data products that are accessible on external platforms\n\nTo request access to data products, you must be logged into a Databricks workspace, even if you will use the shared data product on an external platform. Some data products are available immediately, and others require provider approval and transaction completion using provider interfaces. \n### Get access to data products that are instantly available \nData products that are available instantly require only that you request them and agree to terms. These data products are listed under the **Free and instantly available** heading on the Marketplace landing page, are identified on the listing tile as **Free**, and are identified as **Instantly available** on the listing detail page. \n1. When you\u2019ve found a listing you\u2019re interested in on the Marketplace landing page, click the listing to open the listing detail page.\n2. Click the **Get instant access** button.\n3. Under **More options** , select **On external platforms**.\n4. Accept the Databricks terms and conditions.\n5. Click **Get instant access**.\n6. Click the **Download credential file** button to get the credential file, which you and your team can use to gain access to shared data using third-party data platforms and non-Unity Catalog Databricks workspaces. \nImportant \nThe credential file can only be downloaded once. 
The download button remains active after you download the file, but subsequent downloads rotate to a new credential. The old credential expires after one day or its original expiration date, whichever is sooner. Only two credentials can be active at the same time.\n7. Store the credential file in a secure location. \nDon\u2019t share the credential file with anyone outside the group of users who should have access to the shared data. If you need to share it with someone in your organization, Databricks recommends using a password manager. \nTo learn how to access the shared data using your platform of choice, see [Access shared data using Delta Sharing open sharing connectors](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer-open.html#access-open). \n### Request data products that require provider approval \nSome data products require provider approval, typically because a commercial transaction is involved, or the provider might prefer to customize data products for you. These listings are identified on the listing detail page as **By request** and include a **Request access** button. \n1. When you\u2019ve found a listing you\u2019re interested in on the Marketplace landing page, click the listing to open the listing detail page.\n2. Click the **Request access** button.\n3. Enter your name, company, and a brief description of your intended use for the data product.\n4. Click **More options** and select **On external platforms**.\n5. Accept the Databricks terms and conditions and click **Request access**.\n6. You will be notified by email when the provider has completed their review of your request. \nYou can also monitor the progress of your request on the My Requests page in Marketplace. See [Manage shared Databricks Marketplace data products](https:\/\/docs.databricks.com\/marketplace\/manage-requests-consumer.html). However, any transactions that follow will use provider communications and payment platforms. 
No commercial transactions are handled directly on Databricks Marketplace.\n7. When your transaction is complete, you will receive a notification email from the data provider, and the listing will display a **Download credential file** button. Click this button to download the credential file, which you and your team can use to gain access to shared data using third-party data platforms and non-Unity Catalog Databricks workspaces. \nYou can also find the listing in Marketplace under **My requests**. When the credential is ready for download, the data product appears on the **Installed data products** tab. \nImportant \nThe credential file can only be downloaded once. The download button remains active after you download the file, but subsequent downloads rotate to a new credential. The old credential expires after one day or its original expiration date, whichever is sooner. Only two credentials can be active at the same time.\n8. Store the credential file in a secure location. \nDon\u2019t share the credential file with anyone outside the group of users who should have access to the shared data. If you need to share it with someone in your organization, Databricks recommends using a password manager. \nTo learn how to access the shared data using your platform of choice, see [Access shared data using Delta Sharing open sharing connectors](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer-open.html#access-open).\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-consumer-open.html"} +{"content":"# What is Databricks Marketplace?\n### Access data products in Databricks Marketplace using external platforms\n#### Access shared data using Delta Sharing open sharing connectors\n\nTo use external platforms or non-Unity-Catalog Databricks workspaces to access datasets that have been shared using Databricks Marketplace, you need the credential file that was downloaded from the Marketplace listing. 
You use this credential file to access the shared data using Delta Sharing open sharing connectors. \nFor full instructions for using non-Unity-Catalog Databricks workspaces, Apache Spark, pandas, and Power BI to access and read shared data, see [Read data shared using Delta Sharing open sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html). \nFor a full list of Delta Sharing connectors and information about how to use them, see the [Delta Sharing open source documentation](https:\/\/delta.io\/sharing).\n\n### Access data products in Databricks Marketplace using external platforms\n#### Limitations on sharing to external platforms using Marketplace\n\nSome tables require partition info from the consumer side (country, for example). In the open sharing protocol, this information is not available to the share, and therefore the table is not accessible. The following error is returned: \u201cRecipient authentication failure: the data is restricted by Recipient properties that do not apply to the current recipient in the session. Please contact the data provider to resolve the issue.\u201d\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-consumer-open.html"} +{"content":"# \n\nPlease activate JavaScript to enable the search\nfunctionality.\n\n","doc_uri":"https:\/\/docs.databricks.com\/search.html"} +{"content":"# Compute\n## Use compute\n#### Compute access mode limitations for Unity Catalog\n\nDatabricks recommends using Unity Catalog and shared access mode for most workloads. This article outlines various limitations for each access mode with Unity Catalog. For details on access modes, see [Access modes](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode). \nDatabricks recommends using compute policies to simplify configuration options for most users. See [Create and manage compute policies](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html). 
\nNote \nNo-isolation shared is a legacy access mode that does not support Unity Catalog. \nImportant \nInit scripts and libraries have different support across access modes and Databricks Runtime versions. See [Where can init scripts be installed?](https:\/\/docs.databricks.com\/init-scripts\/index.html#compatibility) and [Cluster-scoped libraries](https:\/\/docs.databricks.com\/libraries\/index.html#compatibility).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/access-mode-limitations.html"} +{"content":"# Compute\n## Use compute\n#### Compute access mode limitations for Unity Catalog\n##### Single user access mode limitations on Unity Catalog\n\nSingle user access mode on Unity Catalog has the following limitations. These are in addition to the general limitations for all Unity Catalog access modes. See [General limitations for Unity Catalog](https:\/\/docs.databricks.com\/compute\/access-mode-limitations.html#general-uc). \n### Fine-grained access control limitations for Unity Catalog single user access mode \n* Dynamic views are not supported.\n* To read from a view, you must have `SELECT` on all referenced tables and views.\n* You cannot access a table that has a [row filter or column mask](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html). \n* You cannot use a single user compute to query tables created by a Unity Catalog-enabled Delta Live Tables pipeline, including [streaming tables](https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html) and [materialized views](https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html) created in Databricks SQL. To query tables created by a Delta Live Tables pipeline, you must use a shared compute using Databricks Runtime 13.3 LTS and above. 
\n### Streaming limitations for Unity Catalog single user access mode \n* Asynchronous checkpointing is not supported in Databricks Runtime 11.3 LTS and below.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/access-mode-limitations.html"} +{"content":"# Compute\n## Use compute\n#### Compute access mode limitations for Unity Catalog\n##### Shared access mode limitations on Unity Catalog\n\nShared access mode on Unity Catalog has the following limitations. These are in addition to the general limitations for all Unity Catalog access modes. See [General limitations for Unity Catalog](https:\/\/docs.databricks.com\/compute\/access-mode-limitations.html#general-uc). \n* Databricks Runtime ML and Spark Machine Learning Library (MLlib) are not supported.\n* Spark-submit jobs are not supported.\n* On Databricks Runtime 13.3 and above, individual rows must not exceed the maximum size of 128MB. \n* When used with credential passthrough, Unity Catalog features are disabled.\n* Custom containers are not supported. \n### Language support for Unity Catalog shared access mode \n* R is not supported.\n* Scala is supported on Databricks Runtime 13.3 and above. \n### Spark API limitations for Unity Catalog shared access mode \n* RDD APIs are not supported.\n* DBUtils and other clients that directly read the data from cloud storage are only supported when you use an external location to access the storage location. See [Create an external location to connect cloud storage to Databricks](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html).\n* Spark Context (`sc`), `spark.sparkContext`, and `sqlContext` are not supported for Scala in any Databricks Runtime and are not supported for Python in Databricks Runtime 14.0 and above. 
\n+ Databricks recommends using the `spark` variable to interact with the `SparkSession` instance.\n+ The following `sc` functions are also not supported: `emptyRDD`, `range`, `init_batched_serializer`, `parallelize`, `pickleFile`, `textFile`, `wholeTextFiles`, `binaryFiles`, `binaryRecords`, `sequenceFile`, `newAPIHadoopFile`, `newAPIHadoopRDD`, `hadoopFile`, `hadoopRDD`, `union`, `runJob`, `setSystemProperty`, `uiWebUrl`, `stop`, `setJobGroup`, `setLocalProperty`, `getConf`. \n### UDF limitations for Unity Catalog shared access mode \nPreview \nSupport for Scala UDFs on Unity Catalog-enabled compute with shared access mode is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nUser-defined functions (UDFs) have the following limitations with shared access mode: \n* Hive UDFs are not supported.\n* `applyInPandas` and `mapInPandas` are not supported in Databricks Runtime 14.2 and below.\n* In Databricks Runtime 14.2 and above, Scala scalar UDFs are supported. Other Scala UDFs and UDAFs are not supported. \n* In Databricks Runtime 13.3 LTS and above, Python scalar UDFs and Pandas UDFs are supported. Other Python UDFs, including UDAFs, UDTFs, and Pandas on Spark are not supported. \nSee [User-defined functions (UDFs) in Unity Catalog](https:\/\/docs.databricks.com\/udf\/unity-catalog.html). \n### Streaming limitations for Unity Catalog shared access mode \nNote \nSome of the listed Kafka options have limited support when used for supported configurations on Databricks. See [Stream processing with Apache Kafka and Databricks](https:\/\/docs.databricks.com\/connect\/streaming\/kafka.html). \n* For Scala, `foreach` and `foreachBatch` are not supported.\n* For Python, `foreachBatch` has new behavior in Databricks Runtime 14.0 and above. 
See [Behavior changes for foreachBatch in Databricks Runtime 14.0](https:\/\/docs.databricks.com\/structured-streaming\/foreach.html#spark-connect).\n* For Scala, `from_avro` requires Databricks Runtime 14.2 or above.\n* `applyInPandasWithState` is not supported.\n* Working with socket sources is not supported.\n* The `sourceArchiveDir` must be in the same external location as the source when you use `option(\"cleanSource\", \"archive\")` with a data source managed by Unity Catalog.\n* For Kafka sources and sinks, the following options are unsupported: \n+ `kafka.sasl.client.callback.handler.class`\n+ `kafka.sasl.login.callback.handler.class`\n+ `kafka.sasl.login.class`\n+ `kafka.partition.assignment.strategy`\n* The following Kafka options are supported in Databricks Runtime 13.3 LTS and above but unsupported in Databricks Runtime 12.2 LTS. You can only specify external locations managed by Unity Catalog for these options: \n+ `kafka.ssl.truststore.location`\n+ `kafka.ssl.keystore.location` \n* You cannot use instance profiles to configure access to external sources such as Kafka or Kinesis for streaming workloads in shared access mode. \n### Network and file system access limitations for Unity Catalog shared access mode \n* Commands run on compute nodes as a low-privilege user that is forbidden from accessing sensitive parts of the filesystem.\n* In Databricks Runtime 11.3 LTS and below, you can only create network connections to ports 80 and 443. \n* You cannot connect to the instance metadata service (IMDS), other EC2 instances, or any other services running in the Databricks VPC. This prevents access to any service that uses the IMDS, such as boto3 and the AWS CLI.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/access-mode-limitations.html"} +{"content":"# Compute\n## Use compute\n#### Compute access mode limitations for Unity Catalog\n##### General limitations for Unity Catalog\n\nThe following limitations apply to all Unity Catalog-enabled access modes. 
\n### UDFs \nGraviton instances do not support UDFs on Unity Catalog-enabled compute. Additional limitations exist for shared access mode. See [UDF limitations for Unity Catalog shared access mode](https:\/\/docs.databricks.com\/compute\/access-mode-limitations.html#udf-shared). \n### Streaming limitations for Unity Catalog \n* Apache Spark continuous processing mode is not supported. See [Continuous Processing](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html#continuous-processing) in the Spark Structured Streaming Programming Guide.\n* `StreamingQueryListener` cannot use credentials or interact with objects managed by Unity Catalog. \nSee also [Streaming limitations for Unity Catalog single user access mode](https:\/\/docs.databricks.com\/compute\/access-mode-limitations.html#streaming-single) and [Streaming limitations for Unity Catalog shared access mode](https:\/\/docs.databricks.com\/compute\/access-mode-limitations.html#streaming-shared). \nFor more on streaming with Unity Catalog, see [Using Unity Catalog with Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/unity-catalog.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/access-mode-limitations.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Load data using the add data UI\n#### Load data using a Unity Catalog external location\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes how to use the add data UI to create a managed table from data in Amazon S3 using a Unity Catalog external location. An external location is an object that combines a cloud storage path with a storage credential that authorizes access to the cloud storage path. 
\nFor other approaches to loading data using external locations, see [Create a table from files stored in your cloud tenant](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#create-a-table-from-files-stored-in-your-cloud-tenant). \nDatabricks recommends using Unity Catalog external locations to access data in cloud object storage. The legacy **S3 table import** page only supports creating tables in the legacy Hive metastore and requires that you select a compute resource that uses an instance profile.\n\n#### Load data using a Unity Catalog external location\n##### Before you begin\n\nBefore you begin, you must have the following: \n* A workspace with Unity Catalog enabled. For more information, see [Set up and manage Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html).\n* The `READ FILES` privilege on the external location. For more information, see [Create an external location to connect cloud storage to Databricks](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html).\n* The `CREATE TABLE` privilege on the schema in which you want to create the managed table, the `USE SCHEMA` privilege on the schema, and the `USE CATALOG` privilege on the parent catalog. For more information, see [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html).\n\n#### Load data using a Unity Catalog external location\n##### File types\n\nThe following file types are supported: \n* CSV\n* TSV\n* JSON\n* XML\n* AVRO\n* Parquet\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/add-data\/add-data-external-locations.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Load data using the add data UI\n#### Load data using a Unity Catalog external location\n##### Step 1: Confirm access to the external location\n\nTo confirm access to the external location, do the following: \n1. 
In the sidebar of your Databricks workspace, click **Catalog**.\n2. In Catalog Explorer, click **External Data** > **External Locations**.\n\n#### Load data using a Unity Catalog external location\n##### Step 2: Create the managed table\n\nTo create the managed table, do the following: \n1. In the sidebar of your workspace, click **+ New** > **Add data**.\n2. In the add data UI, click **Amazon S3**.\n3. Select an external location from the drop-down list.\n4. Select the folders and the files that you want to load into Databricks, and then click **Preview table**.\n5. Select a catalog and a schema from the drop-down lists.\n6. (Optional) Edit the table name.\n7. (Optional) To set advanced format options by file type, click **Advanced attributes**, turn off **Automatically detect file type**, and then select a file type. \nFor a list of format options, see the following section.\n8. (Optional) To edit the column name, click the input box at the top of the column. \nColumn names don\u2019t support commas, backslashes, or unicode characters (such as emojis).\n9. (Optional) To edit column types, click the icon with the type.\n10. Click **Create table**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/add-data\/add-data-external-locations.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Load data using the add data UI\n#### Load data using a Unity Catalog external location\n##### File type format options\n\nThe following format options are available, depending on the file type: \n| Format option | Description | Supported file types |\n| --- | --- | --- |\n| `Column delimiter` | The separator character between columns. Only a single character is allowed, and backslash is not supported. The default is a comma. | CSV |\n| `Escape character` | The escape character to use when parsing the data. The default is a quotation mark. | CSV |\n| `First row contains the header` | This option specifies whether the file contains a header. Enabled by default. 
| CSV |\n| `Automatically detect file type` | Automatically detect file type. Default is `true`. | XML |\n| `Automatically detect column types` | Automatically detect column types from file content. You can edit types in the preview table. If this is set to false, all column types are inferred as STRING. Enabled by default. | * CSV * JSON * XML |\n| `Rows span multiple lines` | Whether a column\u2019s value can span multiple lines in the file. Disabled by default. | * CSV * JSON |\n| `Merge the schema across multiple files` | Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default. | CSV |\n| `Allow comments` | Whether comments are allowed in the file. Enabled by default. | JSON |\n| `Allow single quotes` | Whether single quotes are allowed in the file. Enabled by default. | JSON |\n| `Infer timestamp` | Whether to try to infer timestamp strings as `TimestampType`. Enabled by default. | JSON |\n| `Rescued data column` | Whether to save columns that don\u2019t match the schema. For more information, see [What is the rescued data column?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html#what-is-the-rescued-data-column). Enabled by default. | * CSV * JSON * Avro * Parquet |\n| `Exclude attribute` | Whether to exclude attributes in elements. Default is `false`. | XML |\n| `Attribute prefix` | The prefix for attributes to differentiate attributes and elements. Default is `_`. | XML |\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/add-data\/add-data-external-locations.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Load data using the add data UI\n#### Load data using a Unity Catalog external location\n##### Column data types\n\nThe following column data types are supported. For more information about individual data types see [SQL data types](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-datatypes.html). 
\n| Data Type | Description |\n| --- | --- |\n| `BIGINT` | 8-byte signed integer numbers. |\n| `BOOLEAN` | Boolean (`true`, `false`) values. |\n| `DATE` | Values comprising values of fields year, month, and day, without a time-zone. |\n| `DECIMAL (P,S)` | Numbers with maximum precision `P` and fixed scale `S`. |\n| `DOUBLE` | 8-byte double-precision floating point numbers. |\n| `STRING` | Character string values. |\n| `TIMESTAMP` | Values comprising values of fields year, month, day, hour, minute, and second, with the session local timezone. |\n\n#### Load data using a Unity Catalog external location\n##### Known issues\n\n* You might experience issues with special characters in complex data types, such as a JSON object with a key containing a backtick or a colon.\n* Some JSON files might require that you manually select JSON for the file type. To manually select a file type after you select files, click **Advanced attributes**, turn off **Automatically detect file type**, and then select **JSON**.\n* Nested timestamps and decimals inside complex types might encounter issues.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/add-data\/add-data-external-locations.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Spark stage high I\/O\n\nNext, look at the I\/O stats of the longest stage again: \n![Long Stage I\/O](https:\/\/docs.databricks.com\/_images\/long-stage-io.jpeg)\n\n##### Spark stage high I\/O\n###### What is high I\/O?\n\nHow much data needs to be in an I\/O column to be considered high? To figure this out, first start with the highest number in any of the given columns. Then consider the total number of CPU cores you have across all your workers. Generally each core can read and write about 3 MB per second. \nDivide your biggest I\/O column by the number of cluster worker cores, then divide that by the duration in seconds. 
If the result is around 3 MB, then you\u2019re probably I\/O bound. That would be high I\/O.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/long-spark-stage-io.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Spark stage high I\/O\n###### High input\n\nIf you see a lot of input into your stage, that means you\u2019re spending a lot of time reading data. First, identify what data this stage is reading. See [Identifying an expensive read in Spark\u2019s DAG](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/spark-dag-expensive-read.html). \nAfter you identify the specific data, here are some approaches to speeding up your reads: \n* Use [Delta](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#lakehouse-format).\n* Try [Photon](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#photon). 
It can help a lot with read speed, especially for wide tables.\n* Make your query more selective so it doesn\u2019t need to read as much data.\n* [Reconsider your data layout](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#data-layout) so that [data skipping](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#data-skipping) is more effective.\n* If you\u2019re reading the same data multiple times, use the [Delta cache](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#delta-cache).\n* If you\u2019re doing a join, consider trying to get [DFP](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#dynamic-file) working.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/long-spark-stage-io.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Spark stage high I\/O\n###### High output\n\nIf you see a lot of output from your stage, that means you\u2019re spending a lot of time writing data. Here are some approaches to resolving this: \n* Are you rewriting a lot of data? See [How to determine if Spark is rewriting data](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/spark-rewriting-data.html) to check. If you are rewriting a lot of data: \n+ See if you have a [merge that needs to be optimized](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#delta-merge).\n+ Use [deletion vectors](https:\/\/docs.databricks.com\/delta\/deletion-vectors.html) to mark existing rows as removed or changed without rewriting the Parquet file.\n* Enable [Photon](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#photon) if it isn\u2019t already. 
Photon can help a lot with write speed.\n\n##### Spark stage high I\/O\n###### High shuffle\n\nIf you\u2019re not familiar with shuffle, this is the time to [learn](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#data-shuffling).\n\n##### Spark stage high I\/O\n###### No high I\/O\n\nIf you don\u2019t see high I\/O in any of the columns, then you need to dig deeper. See [Slow Spark stage with little I\/O](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/slow-spark-stage-low-io.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/long-spark-stage-io.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Skew and spill\n###### Spill\n\nThe first thing to look for in a long-running stage is whether there\u2019s [spill](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#data-spilling). \nAt the top of the stage\u2019s page you\u2019ll see the details, which may include stats about spill: \n![Spill Stats](https:\/\/docs.databricks.com\/_images\/spill-stats.png) \nSpill is what happens when Spark runs low on memory. It starts to move data from memory to disk, and this can be quite expensive. It is most common during [data shuffling](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#data-shuffling). \nIf you don\u2019t see any stats for spill, that means the stage doesn\u2019t have any spill. 
If the stage has some spill, see [this guide](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#data-spilling) on how to deal with spill caused by shuffle.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/long-spark-stage-page.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Skew and spill\n###### Skew\n\nThe next thing we want to look into is whether there\u2019s [skew](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#data-skewness). Skew is when one or just a few tasks take much longer than the rest. This results in poor cluster utilization and longer jobs. \nScroll down to the **Summary Metrics**. The main thing we\u2019re looking for is the **Max** duration being much higher than the 75th percentile duration. The screenshot below shows a healthy stage, where the 75th percentile and **Max** are the same: \n![Skew Stats](https:\/\/docs.databricks.com\/_images\/skew-stats.png) \nIf the Max duration is 50% more than the 75th percentile, you may be suffering from skew. \nIf you see skew, learn about skew remediation steps [here](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#data-skewness).\n\n##### Skew and spill\n###### No skew or spill\n\nIf you don\u2019t see skew or spill, go back to the job page to get an overview of what\u2019s going on. 
Scroll up to the top of the page and click **Associated Job Ids**: \n![Stage to Job](https:\/\/docs.databricks.com\/_images\/stage-to-job.png) \nIf the stage doesn\u2019t have spill or skew, see [Spark stage high I\/O](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/long-spark-stage-io.html) for the next steps.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/long-spark-stage-page.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Work with DataFrames and tables in R\n\nThis article describes how to use R packages such as [SparkR](https:\/\/docs.databricks.com\/sparkr\/overview.html), [sparklyr](https:\/\/docs.databricks.com\/sparkr\/sparklyr.html), and [dplyr](https:\/\/dplyr.tidyverse.org\/) to work with R `data.frame`s, [Spark DataFrames](https:\/\/spark.apache.org\/docs\/latest\/sparkr.html#sparkdataframe), and in-memory tables. \nNote that as you work with SparkR, sparklyr, and dplyr, you may find that you can complete a particular operation with all of these packages, and you can use the package that you are most comfortable with. For example, to run a query, you can call functions such as `SparkR::sql`, `sparklyr::sdf_sql`, and `dplyr::select`. At other times, you might be able to complete an operation with just one or two of these packages, and the operation you choose depends on your usage scenario. For example, the way you call `sparklyr::sdf_quantile` differs slightly from the way you call `dplyr::percentile_approx`, even though both functions calculate quantiles. \nYou can use SQL as a bridge between SparkR and sparklyr. For example, you can use `SparkR::sql` to query tables that you create with sparklyr. You can use `sparklyr::sdf_sql` to query tables that you create with SparkR. And `dplyr` code always gets translated to SQL in memory before it is run. 
See also [API interoperability](https:\/\/docs.databricks.com\/sparkr\/sparkr-vs-sparklyr.html#api-interoperability) and [SQL Translation](https:\/\/spark.rstudio.com\/guides\/dplyr.html#sql-translation).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/dataframes-tables.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Work with DataFrames and tables in R\n##### Load SparkR, sparklyr, and dplyr\n\nThe SparkR, sparklyr, and dplyr packages are included in the Databricks Runtime that is installed on Databricks [clusters](https:\/\/docs.databricks.com\/compute\/configure.html). Therefore, you do not need to call the usual `install.packages` before you can begin calling these packages. However, you must still load these packages with `library` first. For example, from within an R [notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) in a Databricks workspace, run the following code in a notebook cell to load SparkR, sparklyr, and dplyr: \n```\nlibrary(SparkR)\nlibrary(sparklyr)\nlibrary(dplyr)\n\n```\n\n#### Work with DataFrames and tables in R\n##### Connect sparklyr to a cluster\n\nAfter you load sparklyr, you must call `sparklyr::spark_connect` to connect to the cluster, specifying the `databricks` connection method. 
For example, run the following code in a notebook cell to connect to the cluster that hosts the notebook: \n```\nsc <- spark_connect(method = \"databricks\")\n\n``` \nIn contrast, a Databricks notebook already establishes a `SparkSession` on the cluster for use with SparkR, so you do not need to call `SparkR::sparkR.session` before you can begin calling SparkR.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/dataframes-tables.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Work with DataFrames and tables in R\n##### Upload a JSON data file to your workspace\n\nMany of the code examples in this article are based on data in a specific location in your Databricks workspace, with specific column names and data types. The data for these code examples originates in a JSON file named `books.json` on GitHub. To get this file and upload it to your workspace: \n1. Go to the [books.json](https:\/\/github.com\/benoitvallon\/100-best-books\/blob\/master\/books.json) file on GitHub and use a text editor to copy its contents to a file named `books.json` somewhere on your local machine.\n2. In your Databricks workspace sidebar, click **Catalog**.\n3. Click **Create Table**.\n4. On the **Upload File** tab, drop the `books.json` file from your local machine to the **Drop files to upload** box. Or select **click to browse**, and browse to the `books.json` file from your local machine. \nBy default, Databricks uploads your local `books.json` file to the [DBFS](https:\/\/docs.databricks.com\/sparkr\/dataframes-tables.html) location in your workspace with the path `\/FileStore\/tables\/books.json`. \nDo not click **Create Table with UI** or **Create Table in Notebook**. 
The code examples in this article use the data in the uploaded `books.json` file in this DBFS location.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/dataframes-tables.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Work with DataFrames and tables in R\n##### Read the JSON data into a DataFrame\n\nUse `sparklyr::spark_read_json` to read the uploaded JSON file into a DataFrame, specifying the connection, the path to the JSON file, and a name for the internal table representation of the data. For this example, you must specify that the `books.json` file contains multiple lines. Specifying the columns\u2019 schema here is optional; if you omit it, sparklyr infers the columns\u2019 schema. For example, run the following code in a notebook cell to read the uploaded JSON file\u2019s data into a DataFrame named `jsonDF`: \n```\njsonDF <- spark_read_json(\nsc = sc,\nname = \"jsonTable\",\npath = \"\/FileStore\/tables\/books.json\",\noptions = list(\"multiLine\" = TRUE),\ncolumns = c(\nauthor = \"character\",\ncountry = \"character\",\nimageLink = \"character\",\nlanguage = \"character\",\nlink = \"character\",\npages = \"integer\",\ntitle = \"character\",\nyear = \"integer\"\n)\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/dataframes-tables.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Work with DataFrames and tables in R\n##### Print the first few rows of a DataFrame\n\nYou can use `SparkR::head`, `SparkR::show`, or `sparklyr::collect` to print the first rows of a DataFrame. By default, `head` prints the first six rows. `show` and `collect` print the first 10 rows. For example, run the following code in a notebook cell to print the first rows of the DataFrame named `jsonDF`: \n```\nhead(jsonDF)\n\n# Source: spark<?> [?? 
x 8]\n# author country image\u2026\u00b9 langu\u2026\u00b2 link pages title year\n# <chr> <chr> <chr> <chr> <chr> <int> <chr> <int>\n# 1 Chinua Achebe Nigeria images\u2026 English \"htt\u2026 209 Thin\u2026 1958\n# 2 Hans Christian Andersen Denmark images\u2026 Danish \"htt\u2026 784 Fair\u2026 1836\n# 3 Dante Alighieri Italy images\u2026 Italian \"htt\u2026 928 The \u2026 1315\n# 4 Unknown Sumer and Akk\u2026 images\u2026 Akkadi\u2026 \"htt\u2026 160 The \u2026 -1700\n# 5 Unknown Achaemenid Em\u2026 images\u2026 Hebrew \"htt\u2026 176 The \u2026 -600\n# 6 Unknown India\/Iran\/Ir\u2026 images\u2026 Arabic \"htt\u2026 288 One \u2026 1200\n# \u2026 with abbreviated variable names \u00b9imageLink, \u00b2language\n\nshow(jsonDF)\n\n# Source: spark<jsonTable> [?? x 8]\n# author country image\u2026\u00b9 langu\u2026\u00b2 link pages title year\n# <chr> <chr> <chr> <chr> <chr> <int> <chr> <int>\n# 1 Chinua Achebe Nigeria images\u2026 English \"htt\u2026 209 Thin\u2026 1958\n# 2 Hans Christian Andersen Denmark images\u2026 Danish \"htt\u2026 784 Fair\u2026 1836\n# 3 Dante Alighieri Italy images\u2026 Italian \"htt\u2026 928 The \u2026 1315\n# 4 Unknown Sumer and Ak\u2026 images\u2026 Akkadi\u2026 \"htt\u2026 160 The \u2026 -1700\n# 5 Unknown Achaemenid E\u2026 images\u2026 Hebrew \"htt\u2026 176 The \u2026 -600\n# 6 Unknown India\/Iran\/I\u2026 images\u2026 Arabic \"htt\u2026 288 One \u2026 1200\n# 7 Unknown Iceland images\u2026 Old No\u2026 \"htt\u2026 384 Nj\u00e1l\u2026 1350\n# 8 Jane Austen United Kingd\u2026 images\u2026 English \"htt\u2026 226 Prid\u2026 1813\n# 9 Honor\u00e9 de Balzac France images\u2026 French \"htt\u2026 443 Le P\u2026 1835\n# 10 Samuel Beckett Republic of \u2026 images\u2026 French\u2026 \"htt\u2026 256 Moll\u2026 1952\n# \u2026 with more rows, and abbreviated variable names \u00b9imageLink, \u00b2language\n# \u2139 Use `print(n = ...)` to see more rows\n\ncollect(jsonDF)\n\n# A tibble: 100 \u00d7 8\n# author country image\u2026\u00b9 
langu\u2026\u00b2 link pages title year\n# <chr> <chr> <chr> <chr> <chr> <int> <chr> <int>\n# 1 Chinua Achebe Nigeria images\u2026 English \"htt\u2026 209 Thin\u2026 1958\n# 2 Hans Christian Andersen Denmark images\u2026 Danish \"htt\u2026 784 Fair\u2026 1836\n# 3 Dante Alighieri Italy images\u2026 Italian \"htt\u2026 928 The \u2026 1315\n# 4 Unknown Sumer and Ak\u2026 images\u2026 Akkadi\u2026 \"htt\u2026 160 The \u2026 -1700\n# 5 Unknown Achaemenid E\u2026 images\u2026 Hebrew \"htt\u2026 176 The \u2026 -600\n# 6 Unknown India\/Iran\/I\u2026 images\u2026 Arabic \"htt\u2026 288 One \u2026 1200\n# 7 Unknown Iceland images\u2026 Old No\u2026 \"htt\u2026 384 Nj\u00e1l\u2026 1350\n# 8 Jane Austen United Kingd\u2026 images\u2026 English \"htt\u2026 226 Prid\u2026 1813\n# 9 Honor\u00e9 de Balzac France images\u2026 French \"htt\u2026 443 Le P\u2026 1835\n# 10 Samuel Beckett Republic of \u2026 images\u2026 French\u2026 \"htt\u2026 256 Moll\u2026 1952\n# \u2026 with 90 more rows, and abbreviated variable names \u00b9imageLink, \u00b2language\n# \u2139 Use `print(n = ...)` to see more rows\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/dataframes-tables.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Work with DataFrames and tables in R\n##### Run SQL queries, and write to and read from a table\n\nYou can use dplyr functions to run SQL queries on a DataFrame. For example, run the following code in a notebook cell to use `dplyr::group_by` and `dplyr::count` to get counts by author from the DataFrame named `jsonDF`. Use `dplyr::arrange` and `dplyr::desc` to sort the result in descending order by counts. Then print the first 10 rows by default. \n```\ngroup_by(jsonDF, author) %>%\ncount() %>%\narrange(desc(n))\n\n# Source: spark<?> [?? 
x 2]\n# Ordered by: desc(n)\n# author n\n# <chr> <dbl>\n# 1 Fyodor Dostoevsky 4\n# 2 Unknown 4\n# 3 Leo Tolstoy 3\n# 4 Franz Kafka 3\n# 5 William Shakespeare 3\n# 6 William Faulkner 2\n# 7 Gustave Flaubert 2\n# 8 Homer 2\n# 9 Gabriel Garc\u00eda M\u00e1rquez 2\n# 10 Thomas Mann 2\n# \u2026 with more rows\n# \u2139 Use `print(n = ...)` to see more rows\n\n``` \nYou could then use `sparklyr::spark_write_table` to write the result to a table in Databricks. For example, run the following code in a notebook cell to rerun the query and then write the result to a table named `json_books_agg`: \n```\ngroup_by(jsonDF, author) %>%\ncount() %>%\narrange(desc(n)) %>%\nspark_write_table(\nname = \"json_books_agg\",\nmode = \"overwrite\"\n)\n\n``` \nTo verify that the table was created, you could then use `sparklyr::sdf_sql` along with `SparkR::showDF` to display the table\u2019s data. For example, run the following code in a notebook cell to query the table into a DataFrame and then use `sparklyr::collect` to print the first 10 rows of the DataFrame by default: \n```\ncollect(sdf_sql(sc, \"SELECT * FROM json_books_agg\"))\n\n# A tibble: 82 \u00d7 2\n# author n\n# <chr> <dbl>\n# 1 Fyodor Dostoevsky 4\n# 2 Unknown 4\n# 3 Leo Tolstoy 3\n# 4 Franz Kafka 3\n# 5 William Shakespeare 3\n# 6 William Faulkner 2\n# 7 Homer 2\n# 8 Gustave Flaubert 2\n# 9 Gabriel Garc\u00eda M\u00e1rquez 2\n# 10 Thomas Mann 2\n# \u2026 with 72 more rows\n# \u2139 Use `print(n = ...)` to see more rows\n\n``` \nYou could also use `sparklyr::spark_read_table` to do something similar. 
For example, run the following code in a notebook cell to read the preceding table named `json_books_agg` into a DataFrame and then use `sparklyr::collect` to print the first 10 rows of the DataFrame by default: \n```\nfromTable <- spark_read_table(\nsc = sc,\nname = \"json_books_agg\"\n)\n\ncollect(fromTable)\n\n# A tibble: 82 \u00d7 2\n# author n\n# <chr> <dbl>\n# 1 Fyodor Dostoevsky 4\n# 2 Unknown 4\n# 3 Leo Tolstoy 3\n# 4 Franz Kafka 3\n# 5 William Shakespeare 3\n# 6 William Faulkner 2\n# 7 Homer 2\n# 8 Gustave Flaubert 2\n# 9 Gabriel Garc\u00eda M\u00e1rquez 2\n# 10 Thomas Mann 2\n# \u2026 with 72 more rows\n# \u2139 Use `print(n = ...)` to see more rows\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/dataframes-tables.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Work with DataFrames and tables in R\n##### Add columns and compute column values in a DataFrame\n\nYou can use dplyr functions to add columns to DataFrames and to compute columns\u2019 values. \nFor example, run the following code in a notebook cell to get the contents of the DataFrame named `jsonDF`. Use `dplyr::mutate` to add a column named `today`, and fill this new column with the current timestamp. Then write these contents to a new DataFrame named `withDate` and use `dplyr::collect` to print the new DataFrame\u2019s first 10 rows by default. \nNote \n`dplyr::mutate` only accepts arguments that conform to Hive\u2019s built-in functions (also known as UDFs) and built-in aggregate functions (also known as UDAFs). For general information, see [Hive Functions](https:\/\/spark.rstudio.com\/guides\/dplyr.html#hive-functions). For information about the date-related functions in this section, see [Date Functions](https:\/\/cwiki.apache.org\/confluence\/display\/Hive\/LanguageManual+UDF#LanguageManualUDF-DateFunctions). 
\n```\nwithDate <- jsonDF %>%\nmutate(today = current_timestamp())\n\ncollect(withDate)\n\n# A tibble: 100 \u00d7 9\n# author country image\u2026\u00b9 langu\u2026\u00b2 link pages title year today\n# <chr> <chr> <chr> <chr> <chr> <int> <chr> <int> <dttm>\n# 1 Chinua A\u2026 Nigeria images\u2026 English \"htt\u2026 209 Thin\u2026 1958 2022-09-27 21:32:59\n# 2 Hans Chr\u2026 Denmark images\u2026 Danish \"htt\u2026 784 Fair\u2026 1836 2022-09-27 21:32:59\n# 3 Dante Al\u2026 Italy images\u2026 Italian \"htt\u2026 928 The \u2026 1315 2022-09-27 21:32:59\n# 4 Unknown Sumer \u2026 images\u2026 Akkadi\u2026 \"htt\u2026 160 The \u2026 -1700 2022-09-27 21:32:59\n# 5 Unknown Achaem\u2026 images\u2026 Hebrew \"htt\u2026 176 The \u2026 -600 2022-09-27 21:32:59\n# 6 Unknown India\/\u2026 images\u2026 Arabic \"htt\u2026 288 One \u2026 1200 2022-09-27 21:32:59\n# 7 Unknown Iceland images\u2026 Old No\u2026 \"htt\u2026 384 Nj\u00e1l\u2026 1350 2022-09-27 21:32:59\n# 8 Jane Aus\u2026 United\u2026 images\u2026 English \"htt\u2026 226 Prid\u2026 1813 2022-09-27 21:32:59\n# 9 Honor\u00e9 d\u2026 France images\u2026 French \"htt\u2026 443 Le P\u2026 1835 2022-09-27 21:32:59\n# 10 Samuel B\u2026 Republ\u2026 images\u2026 French\u2026 \"htt\u2026 256 Moll\u2026 1952 2022-09-27 21:32:59\n# \u2026 with 90 more rows, and abbreviated variable names \u00b9imageLink, \u00b2language\n# \u2139 Use `print(n = ...)` to see more rows\n\n``` \nNow use `dplyr::mutate` to add two more columns to the contents of the `withDate` DataFrame. The new `month` and `year` columns contain the numeric month and year from the `today` column. 
Then write these contents to a new DataFrame named `withMMyyyy`, and use `dplyr::select` along with `dplyr::collect` to print the `author`, `title`, `month`, and `year` columns of the new DataFrame\u2019s first ten rows by default: \n```\nwithMMyyyy <- withDate %>%\nmutate(month = month(today),\nyear = year(today))\n\ncollect(select(withMMyyyy, c(\"author\", \"title\", \"month\", \"year\")))\n\n# A tibble: 100 \u00d7 4\n# author title month year\n# <chr> <chr> <int> <int>\n# 1 Chinua Achebe Things Fall Apart 9 2022\n# 2 Hans Christian Andersen Fairy tales 9 2022\n# 3 Dante Alighieri The Divine Comedy 9 2022\n# 4 Unknown The Epic Of Gilgamesh 9 2022\n# 5 Unknown The Book Of Job 9 2022\n# 6 Unknown One Thousand and One Nights 9 2022\n# 7 Unknown Nj\u00e1l's Saga 9 2022\n# 8 Jane Austen Pride and Prejudice 9 2022\n# 9 Honor\u00e9 de Balzac Le P\u00e8re Goriot 9 2022\n# 10 Samuel Beckett Molloy, Malone Dies, The Unnamable, the \u2026 9 2022\n# \u2026 with 90 more rows\n# \u2139 Use `print(n = ...)` to see more rows\n\n``` \nNow use `dplyr::mutate` to add two more columns to the contents of the `withMMyyyy` DataFrame. The new `formatted_date` column contains the `yyyy-MM-dd` portion from the `today` column, while the new `day` column contains the numeric day from the new `formatted_date` column. 
Then write these contents to a new DataFrame named `withUnixTimestamp`, and use `dplyr::select` along with `dplyr::collect` to print the `title`, `formatted_date`, and `day` columns of the new DataFrame\u2019s first ten rows by default: \n```\nwithUnixTimestamp <- withMMyyyy %>%\nmutate(formatted_date = date_format(today, \"yyyy-MM-dd\"),\nday = dayofmonth(formatted_date))\n\ncollect(select(withUnixTimestamp, c(\"title\", \"formatted_date\", \"day\")))\n\n# A tibble: 100 \u00d7 3\n# title formatted_date day\n# <chr> <chr> <int>\n# 1 Things Fall Apart 2022-09-27 27\n# 2 Fairy tales 2022-09-27 27\n# 3 The Divine Comedy 2022-09-27 27\n# 4 The Epic Of Gilgamesh 2022-09-27 27\n# 5 The Book Of Job 2022-09-27 27\n# 6 One Thousand and One Nights 2022-09-27 27\n# 7 Nj\u00e1l's Saga 2022-09-27 27\n# 8 Pride and Prejudice 2022-09-27 27\n# 9 Le P\u00e8re Goriot 2022-09-27 27\n# 10 Molloy, Malone Dies, The Unnamable, the trilogy 2022-09-27 27\n# \u2026 with 90 more rows\n# \u2139 Use `print(n = ...)` to see more rows\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/dataframes-tables.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Work with DataFrames and tables in R\n##### Create a temporary view\n\nYou can create named temporary views in memory that are based on existing DataFrames. For example, run the following code in a notebook cell to use `SparkR::createOrReplaceTempView` to get the contents of the preceding DataFrame named `jsonTable` and make a temporary view out of it named `timestampTable`. Then, use `sparklyr::spark_read_table` to read the temporary view\u2019s contents. 
Use `sparklyr::collect` to print the first 10 rows of the temporary table by default: \n```\ncreateOrReplaceTempView(withTimestampDF, viewName = \"timestampTable\")\n\nspark_read_table(\nsc = sc,\nname = \"timestampTable\"\n) %>% collect()\n\n# A tibble: 100 \u00d7 10\n# author country image\u2026\u00b9 langu\u2026\u00b2 link pages title year today\n# <chr> <chr> <chr> <chr> <chr> <int> <chr> <int> <dttm>\n# 1 Chinua A\u2026 Nigeria images\u2026 English \"htt\u2026 209 Thin\u2026 1958 2022-09-27 21:11:56\n# 2 Hans Chr\u2026 Denmark images\u2026 Danish \"htt\u2026 784 Fair\u2026 1836 2022-09-27 21:11:56\n# 3 Dante Al\u2026 Italy images\u2026 Italian \"htt\u2026 928 The \u2026 1315 2022-09-27 21:11:56\n# 4 Unknown Sumer \u2026 images\u2026 Akkadi\u2026 \"htt\u2026 160 The \u2026 -1700 2022-09-27 21:11:56\n# 5 Unknown Achaem\u2026 images\u2026 Hebrew \"htt\u2026 176 The \u2026 -600 2022-09-27 21:11:56\n# 6 Unknown India\/\u2026 images\u2026 Arabic \"htt\u2026 288 One \u2026 1200 2022-09-27 21:11:56\n# 7 Unknown Iceland images\u2026 Old No\u2026 \"htt\u2026 384 Nj\u00e1l\u2026 1350 2022-09-27 21:11:56\n# 8 Jane Aus\u2026 United\u2026 images\u2026 English \"htt\u2026 226 Prid\u2026 1813 2022-09-27 21:11:56\n# 9 Honor\u00e9 d\u2026 France images\u2026 French \"htt\u2026 443 Le P\u2026 1835 2022-09-27 21:11:56\n# 10 Samuel B\u2026 Republ\u2026 images\u2026 French\u2026 \"htt\u2026 256 Moll\u2026 1952 2022-09-27 21:11:56\n# \u2026 with 90 more rows, 1 more variable: month <chr>, and abbreviated variable\n# names \u00b9imageLink, \u00b2language\n# \u2139 Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/dataframes-tables.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Work with DataFrames and tables in R\n##### Perform statistical analysis on a DataFrame\n\nYou can use sparklyr along with dplyr for statistical analyses. 
\nFor example, create a DataFrame to run statistics on. To do this, run the following code in a notebook cell to use `sparklyr::sdf_copy_to` to write the contents of the `iris` dataset that is built into R to a DataFrame named `iris`. Use `sparklyr::sdf_collect` to print the first 10 rows of the temporary table by default: \n```\nirisDF <- sdf_copy_to(\nsc = sc,\nx = iris,\nname = \"iris\",\noverwrite = TRUE\n)\n\nsdf_collect(irisDF, \"row-wise\")\n\n# A tibble: 150 \u00d7 5\n# Sepal_Length Sepal_Width Petal_Length Petal_Width Species\n# <dbl> <dbl> <dbl> <dbl> <chr>\n# 1 5.1 3.5 1.4 0.2 setosa\n# 2 4.9 3 1.4 0.2 setosa\n# 3 4.7 3.2 1.3 0.2 setosa\n# 4 4.6 3.1 1.5 0.2 setosa\n# 5 5 3.6 1.4 0.2 setosa\n# 6 5.4 3.9 1.7 0.4 setosa\n# 7 4.6 3.4 1.4 0.3 setosa\n# 8 5 3.4 1.5 0.2 setosa\n# 9 4.4 2.9 1.4 0.2 setosa\n# 10 4.9 3.1 1.5 0.1 setosa\n# \u2026 with 140 more rows\n# \u2139 Use `print(n = ...)` to see more rows\n\n``` \nNow use `dplyr::group_by` to group rows by the `Species` column. Use `dplyr::summarize` along with `dplyr::percentile_approx` to calculate summary statistics by the 25th, 50th, 75th, and 100th quantiles of the `Sepal_Length` column by `Species`. Use `sparklyr::collect` to print the results: \nNote \n`dplyr::summarize` only accepts arguments that conform to Hive\u2019s built-in functions (also known as UDFs) and built-in aggregate functions (also known as UDAFs). For general information, see [Hive Functions](https:\/\/spark.rstudio.com\/guides\/dplyr.html#hive-functions). For information about `percentile_approx`, see [Built-in Aggregate Functions(UDAF)](https:\/\/cwiki.apache.org\/confluence\/display\/hive\/languagemanual+udf#LanguageManualUDF-Built-inAggregateFunctions(UDAF)). 
\n```\nquantileDF <- irisDF %>%\ngroup_by(Species) %>%\nsummarize(\nquantile_25th = percentile_approx(\nSepal_Length,\n0.25\n),\nquantile_50th = percentile_approx(\nSepal_Length,\n0.50\n),\nquantile_75th = percentile_approx(\nSepal_Length,\n0.75\n),\nquantile_100th = percentile_approx(\nSepal_Length,\n1.0\n)\n)\n\ncollect(quantileDF)\n\n# A tibble: 3 \u00d7 5\n# Species quantile_25th quantile_50th quantile_75th quantile_100th\n# <chr> <dbl> <dbl> <dbl> <dbl>\n# 1 virginica 6.2 6.5 6.9 7.9\n# 2 versicolor 5.6 5.9 6.3 7\n# 3 setosa 4.8 5 5.2 5.8\n\n``` \nSimilar results can be calculated, for example, by using `sparklyr::sdf_quantile`: \n```\nprint(sdf_quantile(\nx = irisDF %>%\nfilter(Species == \"virginica\"),\ncolumn = \"Sepal_Length\",\nprobabilities = c(0.25, 0.5, 0.75, 1.0)\n))\n\n# 25% 50% 75% 100%\n# 6.2 6.5 6.9 7.9\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/dataframes-tables.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/async-checkpointing.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n##### Asynchronous state checkpointing for stateful queries\n\nNote \nAvailable in Databricks Runtime 10.4 LTS and above. \nAsynchronous state checkpointing maintains exactly-once guarantees for streaming queries but can reduce overall latency for some Structured Streaming stateful workloads bottlenecked on state updates. This is accomplished by beginning to process the next micro-batch as soon as the computation of the previous micro-batch has been completed without waiting for state checkpointing to complete. 
The following table compares the tradeoffs for synchronous and asynchronous checkpointing: \n| Characteristic | Synchronous checkpointing | Asynchronous checkpointing |\n| --- | --- | --- |\n| Latency | Higher latency for each micro-batch. | Reduced latency as micro-batches can overlap. |\n| Restart | Fast recovery as only the last batch needs to be re-run. | Higher restart delay as more than one micro-batch might need to be re-run. | \nThe following are streaming job characteristics that might benefit from asynchronous state checkpointing: \n* Job has one or more stateful operations (e.g., aggregation, `flatMapGroupsWithState`, `mapGroupsWithState`, stream-stream joins)\n* State checkpoint latency is one of the major contributors to overall batch execution latency. This information can be found in the [StreamingQueryProgress](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html#monitoring-streaming-queries) events. These events are found in log4j logs on the Spark driver as well. Here is an example of streaming query progress and how to find the state checkpoint impact on the overall batch execution latency. \n+ ```\n{\n\"id\" : \"2e3495a2-de2c-4a6a-9a8e-f6d4c4796f19\",\n\"runId\" : \"e36e9d7e-d2b1-4a43-b0b3-e875e767e1fe\",\n\"...\",\n\"batchId\" : 0,\n\"durationMs\" : {\n\"...\",\n\"triggerExecution\" : 547730,\n\"...\"\n},\n\"stateOperators\" : [ {\n\"...\",\n\"commitTimeMs\" : 3186626,\n\"numShufflePartitions\" : 64,\n\"...\"\n}]\n} \n```\n+ State checkpoint latency analysis of the above query progress event\n\n\n- Batch duration (`durationMs.triggerExecution`) is around 547 secs.\n- State store commit latency (`stateOperators[0].commitTimeMs`) is around 3,186 secs. Commit latency is aggregated across tasks containing a state store. In this case there are 64 such tasks (`stateOperators[0].numShufflePartitions`).\n- Each task containing a state operator took an average of 50 secs (3,186\/64) to checkpoint. 
This latency is added to the batch duration. Assuming all 64 tasks are running concurrently, the checkpoint step contributed around 9% (50 secs \/ 547 secs) of the batch duration. The percentage gets even higher when the maximum number of concurrent tasks is less than 64.\n\n## Enabling asynchronous state checkpointing\n\nYou must use the [RocksDB based state store](https:\/\/docs.databricks.com\/structured-streaming\/rocksdb-state-store.html) for asynchronous state checkpointing. Set the following configurations:\n\n```\nspark.conf.set(\n\"spark.databricks.streaming.statefulOperator.asyncCheckpoint.enabled\",\n\"true\"\n) \nspark.conf.set(\n\"spark.sql.streaming.stateStore.providerClass\",\n\"com.databricks.sql.streaming.state.RocksDBStateStoreProvider\"\n) \n```\n\n## Limitations and requirements for asynchronous checkpointing\n\nNote\n\nCompute auto-scaling has limitations scaling down cluster size for Structured Streaming workloads. Databricks recommends using Delta Live Tables with Enhanced Autoscaling for streaming workloads. See [Optimize the cluster utilization of Delta Live Tables pipelines with Enhanced Autoscaling](https:\/\/docs.databricks.com\/delta-live-tables\/auto-scaling.html).\n\n* Any failure in an asynchronous checkpoint at any one or more stores fails the query. In synchronous checkpointing mode, the checkpoint is executed as part of the task and Spark retries the task multiple times before failing the query. This mechanism is not present with asynchronous state checkpointing. However, you can use Databricks [job retries](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#retry-policies) to retry such failures automatically.\n* Asynchronous checkpointing works best when the state store locations are not changed between micro-batch executions. 
Cluster resizing, in combination with asynchronous state checkpointing, might not work well because state store instances might be redistributed as nodes are added or removed as part of the cluster resizing event.\n* Asynchronous state checkpointing is supported only in the RocksDB state store provider implementation. The default in-memory state store implementation does not support it.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/async-checkpointing.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Table options\n\nWith Databricks table visualizations you can manually reorder, hide, and format data. This article describes how you can control data presentation in table visualizations. \nA table visualization can be manipulated independently of the original cell results table. \nYou can: \n* Reorder columns by dragging them up or down using the ![Column Handle](https:\/\/docs.databricks.com\/_images\/column-handle.png) handle\n* Hide columns by toggling the ![Visibility Icon](https:\/\/docs.databricks.com\/_images\/visibility-icon.png) icon\n* Format columns using the format settings\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/tables.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Table options\n###### Format columns\n\nYou can format some common data types\u2013text, numbers, dates, and Booleans. Databricks also has special support for non-standard column types like images, JSON documents, and links. To display type-specific formatting options, select the type in the **Display as** field. \n### Conditionally format column colors \nYou can configure the font colors for column results to a static color or a range of colors based on comparing the column value to a threshold. \n1. Edit the visualization.\n2. 
Optionally, set the default font color to a non-default value.\n3. Under **Font Conditions**, click **+ Add condition**.\n4. Select the column, the threshold, the comparator, and the font color if the comparison succeeds. \nThe threshold can be a numeric type, string, or date. For the comparison to succeed, the threshold must be the same data type as the column. For example, to colorize results whose values exceed the numeric value `500000`, create the threshold `> 500000`, rather than `> 500,000`. Numeric types, strings, and dates are supported.\n5. Optionally add more conditions. \n### Common data types \nYou can control how data is displayed. For example, you can: \n* Display floats out to three decimal places\n* Show only the month and year of a date column\n* Zero-pad integers\n* Prepend or append text to your number fields \n#### Text \nThe **Allow HTML content** field has the following behavior: \n* **Enabled**: HTML content is run through an HTML sanitizer and the column is rendered as HTML.\n* **Disabled**: The content is displayed without rendering the HTML. \n#### Numeric and date-time \nFor reference information on formatting numeric and date and time data types, see: \n* [Numeric types](https:\/\/docs.databricks.com\/visualizations\/format-numeric-types.html)\n* [Date-time types](https:\/\/momentjs.com\/docs\/#\/displaying\/format\/) \n### Special data types \nDatabricks supports the following special data types: image, JSON, and link. \n#### Image \nIf a field in your database contains links to images, select **Image** to display the images inline with your table results. This is especially useful for dashboards. In the following dashboard, the **Customer Image** field is a link to an image that Databricks displays in-place. \n![Dashboard with images](https:\/\/docs.databricks.com\/_images\/dashboard-with-images.png) \n#### JSON \nIf your data returns JSON formatted text, select **JSON**. This lets you collapse and expand elements in a clean format. 
\n#### Link \nTo make HTML links from your dashboard clickable, select **Link**. Three fields appear: **URL template**, **Text template**, **Title template**. These template fields control how an HTML link is rendered. The fields accept mustache-style parameters (double curly braces) for the column names in the table. The mustache parameters support the pipe filter `urlEscape`, which URL-escapes the value. \n![Parameter in URL](https:\/\/docs.databricks.com\/_images\/parameter-in-url.png) \nThe query in the following screenshot generates a result set containing the search URL. If you hover over the link in the result set, you see the URL in the lower left corner. Click the URL in the result set to view the search results in your browser. \n![Parameter in URL query](https:\/\/docs.databricks.com\/_images\/parameter-in-url-query.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/tables.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Monitoring and observability options for Delta Live Tables pipelines\n##### Define custom monitoring of Delta Live Tables pipelines with event hooks\n\nPreview \nSupport for event hooks is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nYou can use *event hooks* to add custom Python callback functions that run when events are persisted to a Delta Live Tables pipeline\u2019s [event log](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html#event-log). You can use event hooks to implement custom monitoring and alerting solutions. For example, you can use event hooks to send emails or write to a log when specific events occur or to integrate with third-party solutions to monitor pipeline events. \nYou define an event hook with a Python function that accepts a single argument, where the argument is a dictionary representing an event. You then include the event hooks as part of the source code for a pipeline. 
Any event hooks defined in a pipeline will attempt to process all events generated during each pipeline update. If your pipeline is composed of multiple source code artifacts, for example, multiple notebooks, any defined event hooks are applied to the entire pipeline. Although event hooks are included in the source code for your pipeline, they are not included in the pipeline graph. \nYou can use event hooks with pipelines that publish to the Hive metastore or Unity Catalog. \nNote \n* Python is the only language supported for defining event hooks.\n* Event hooks are triggered only for events where the [maturity\\_level](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html#schema) is `STABLE`.\n* Event hooks are executed asynchronously from pipeline updates but synchronously with other event hooks. This means that only a single event hook runs at a time, and other event hooks wait to run until the currently running event hook completes. If an event hook runs indefinitely, it blocks all other event hooks.\n* Delta Live Tables attempts to run each event hook on every event emitted during a pipeline update. To help ensure that lagging event hooks have time to process all queued events, Delta Live Tables waits a non-configurable fixed period before terminating the compute running the pipeline. However, it is not guaranteed that all hooks are triggered on all events before the compute is terminated.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/event-hooks.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Monitoring and observability options for Delta Live Tables pipelines\n##### Define custom monitoring of Delta Live Tables pipelines with event hooks\n###### Monitor event hook processing\n\nUse the `hook_progress` event type in the Delta Live Tables event log to monitor the state of an update\u2019s event hooks. 
To prevent circular dependencies, event hooks are not triggered for `hook_progress` events.\n\n##### Define custom monitoring of Delta Live Tables pipelines with event hooks\n###### Define an event hook\n\nTo define an event hook, use the `on_event_hook` decorator: \n```\n@dlt.on_event_hook(max_allowable_consecutive_failures=None)\ndef user_event_hook(event):\n# Python code defining the event hook\n\n``` \nThe `max_allowable_consecutive_failures` describes the maximum number of consecutive times an event hook can fail before it is disabled. An event hook failure is defined as any time the event hook throws an exception. If an event hook is disabled, it does not process new events until the pipeline is restarted. \n`max_allowable_consecutive_failures` must be an integer greater than or equal to `0` or `None`. A value of `None` (assigned by default) means there is no limit to the number of consecutive failures allowed for the event hook, and the event hook is never disabled. \nEvent hook failures and disabling of event hooks can be monitored in the event log as `hook_progress` events. \nThe event hook function must be a Python function that accepts exactly one parameter, a dictionary representation of the event that triggered this event hook. Any return value from the event hook function is ignored.\n\n##### Define custom monitoring of Delta Live Tables pipelines with event hooks\n###### Example: Select specific events for processing\n\nThe following example demonstrates an event hook that selects specific events for processing. Specifically, this example waits until pipeline `STOPPING` events are received and then outputs a message to the driver logs `stdout`. 
\n```\nfrom dlt import on_event_hook\n\n@on_event_hook\ndef my_event_hook(event):\n    if (\n        event['event_type'] == 'update_progress' and\n        event['details']['update_progress']['state'] == 'STOPPING'\n    ):\n        print('Received notification that update is stopping: ', event)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/event-hooks.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Monitoring and observability options for Delta Live Tables pipelines\n##### Define custom monitoring of Delta Live Tables pipelines with event hooks\n###### Example: Send all events to a Slack channel\n\nThe following example implements an event hook that sends all events received to a Slack channel using the Slack API. \nThis example uses a Databricks [secret](https:\/\/docs.databricks.com\/security\/secrets\/index.html) to securely store a token required to authenticate to the Slack API. \n```\nfrom dlt import on_event_hook\nimport requests\n\n# Get a Slack API token from a Databricks secret scope.\nAPI_TOKEN = dbutils.secrets.get(scope=\"<secret-scope>\", key=\"<token-key>\")\n\n@on_event_hook\ndef write_events_to_slack(event):\n    # The event is a dictionary, so convert it to a string before concatenating.\n    res = requests.post(\n        url='https:\/\/slack.com\/api\/chat.postMessage',\n        headers={\n            'Content-Type': 'application\/json',\n            'Authorization': 'Bearer ' + API_TOKEN,\n        },\n        json={\n            'channel': '<channel-id>',\n            'text': 'Received event:\\n' + str(event),\n        }\n    )\n\n```\n\n##### Define custom monitoring of Delta Live Tables pipelines with event hooks\n###### Example: Configure an event hook to disable after four consecutive failures\n\nThe following example demonstrates how to configure an event hook that is disabled if it fails consecutively four times. \n```\nfrom dlt import on_event_hook\nimport random\n\ndef run_failing_operation():\n    raise Exception('Operation has failed')\n\n# Allow up to 3 consecutive failures. 
After a 4th consecutive\n# failure, this hook is disabled.\n@on_event_hook(max_allowable_consecutive_failures=3)\ndef non_reliable_event_hook(event):\n    run_failing_operation()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/event-hooks.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Monitoring and observability options for Delta Live Tables pipelines\n##### Define custom monitoring of Delta Live Tables pipelines with event hooks\n###### Example: A Delta Live Tables pipeline with an event hook\n\nThe following example demonstrates adding an event hook to the source code for a pipeline. This is a simple but complete example of using event hooks with a pipeline. \n```\nfrom dlt import table, on_event_hook, read\nimport requests\nimport json\nimport time\n\nAPI_TOKEN = dbutils.secrets.get(scope=\"<secret-scope>\", key=\"<token-key>\")\nSLACK_POST_MESSAGE_URL = 'https:\/\/slack.com\/api\/chat.postMessage'\nDEV_CHANNEL = 'CHANNEL'\nSLACK_HTTPS_HEADER_COMMON = {\n    'Content-Type': 'application\/json',\n    'Authorization': 'Bearer ' + API_TOKEN\n}\n\n# Create a single dataset.\n@table\ndef test_dataset():\n    return spark.range(5)\n\n# Definition of event hook to send events to a Slack channel.\n@on_event_hook\ndef write_events_to_slack(event):\n    res = requests.post(url=SLACK_POST_MESSAGE_URL, headers=SLACK_HTTPS_HEADER_COMMON, json = {\n        'channel': DEV_CHANNEL,\n        'text': 'Event hook triggered by event: ' + event['event_type'] + ' event.'\n    })\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/event-hooks.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n#### Use Apache Spark MLlib on Databricks\n\nThis page provides example notebooks showing how to use MLlib on Databricks. 
\nApache Spark MLlib is the Apache Spark machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives. For reference information about MLlib features, Databricks recommends the following Apache Spark API references: \n* [MLlib Programming Guide](https:\/\/spark.apache.org\/docs\/latest\/ml-guide.html)\n* [Python API Reference](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/)\n* [Scala API Reference](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/ml\/index.html)\n* [Java API](https:\/\/spark.apache.org\/docs\/latest\/api\/java\/org\/apache\/spark\/ml\/package-summary.html) \nFor information about using Apache Spark MLlib from R, see the [R machine learning](https:\/\/docs.databricks.com\/sparkr\/overview.html#r-ml) documentation.\n\n#### Use Apache Spark MLlib on Databricks\n##### Binary classification example notebook\n\nThis notebook shows you how to build a binary classification application using the Apache Spark MLlib Pipelines API. \n### Binary classification notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/binary-classification.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/mllib.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n#### Use Apache Spark MLlib on Databricks\n##### Decision trees example notebooks\n\nThese examples demonstrate various applications of decision trees using the Apache Spark MLlib Pipelines API. \n### Decision trees \nThese notebooks show you how to perform classifications with decision trees. 
\n#### Decision trees for digit recognition notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/decision-trees.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n#### Decision trees for SFO survey notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/decision-trees-sfo-airport-survey-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n### GBT regression using MLlib pipelines \nThis notebook shows you how to use MLlib pipelines to perform a regression using gradient boosted trees to predict bike rental counts (per hour) from information such as day of the week, weather, season, and so on. \n#### Bike sharing regression notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/gbt-regression.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n#### Use Apache Spark MLlib on Databricks\n##### Advanced Apache Spark MLlib notebook example\n\nThis notebook illustrates how to create a custom transformer. \n### Custom transformer notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/flat-map-transformer-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/mllib.html"} +{"content":"# Query data\n## Data format options\n#### Hive table\n\nThis article shows how to import a [Hive table](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-hive-tables.html) from cloud storage into Databricks using an external table.\n\n#### Hive table\n##### Step 1: Show the `CREATE TABLE` statement\n\nIssue a `SHOW CREATE TABLE <tablename>` command on your Hive command line to see the statement that created the table. 
\n```\nhive> SHOW CREATE TABLE wikicc;\nOK\nCREATE TABLE `wikicc`(\n`country` string,\n`count` int)\nROW FORMAT SERDE\n'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'\nSTORED AS INPUTFORMAT\n'org.apache.hadoop.mapred.TextInputFormat'\nOUTPUTFORMAT\n'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\nLOCATION\n'<path-to-table>'\nTBLPROPERTIES (\n'totalSize'='2335',\n'numRows'='240',\n'rawDataSize'='2095',\n'COLUMN_STATS_ACCURATE'='true',\n'numFiles'='1',\n'transient_lastDdlTime'='1418173653')\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/hive-tables.html"} +{"content":"# Query data\n## Data format options\n#### Hive table\n##### Step 2: Issue a `CREATE EXTERNAL TABLE` statement\n\nIf the statement that is returned uses a `CREATE TABLE` command, copy the statement and replace `CREATE TABLE` with `CREATE EXTERNAL TABLE`. \n* `EXTERNAL` ensures that Spark SQL does not delete your data if you drop the table.\n* You can omit the `TBLPROPERTIES` field. \n```\nDROP TABLE wikicc\n\n``` \n```\nCREATE EXTERNAL TABLE `wikicc`(\n`country` string,\n`count` int)\nROW FORMAT SERDE\n'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'\nSTORED AS INPUTFORMAT\n'org.apache.hadoop.mapred.TextInputFormat'\nOUTPUTFORMAT\n'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\nLOCATION\n'<path-to-table>'\n\n```\n\n#### Hive table\n##### Step 3: Issue SQL commands on your data\n\n```\nSELECT * FROM wikicc\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/hive-tables.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Monitoring and observability options for Delta Live Tables pipelines\n##### Monitor Delta Live Tables pipelines\n\nThis article describes how you can use built-in monitoring and observability features for Delta Live Tables pipelines, including data lineage, update history, and data quality reporting. \nYou can review most monitoring data manually through the pipeline details UI. 
Some tasks are easier to accomplish by querying the event log metadata. See [What is the Delta Live Tables event log?](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html#event-log).\n\n##### Monitor Delta Live Tables pipelines\n###### What pipeline details are available in the UI?\n\nThe pipeline graph displays as soon as an update to a pipeline has successfully started. Arrows represent dependencies between datasets in your pipeline. By default, the pipeline details page shows the most recent update for the table, but you can select older updates from a drop-down menu. \nDetails displayed include the pipeline ID, source libraries, compute cost, product edition, and the channel configured for the pipeline. \nTo see a tabular view of datasets, click the **List** tab. The **List** view allows you to see all datasets in your pipeline represented as a row in a table and is useful when your pipeline DAG is too large to visualize in the **Graph** view. You can control the datasets displayed in the table using multiple filters such as dataset name, type, and status. To switch back to the DAG visualization, click **Graph**. \nThe **Run as** user is the pipeline owner, and pipeline updates run with this user\u2019s permissions. To change the `run as` user, click **Permissions** and change the pipeline owner.\n\n##### Monitor Delta Live Tables pipelines\n###### How can you view dataset details?\n\nClicking on a dataset in the pipeline graph or dataset list displays details about the dataset. 
Details include the dataset schema, data quality metrics, and a link back to the source code that defines the dataset.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/observability.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Monitoring and observability options for Delta Live Tables pipelines\n##### Monitor Delta Live Tables pipelines\n###### View update history\n\nTo view the history and status of pipeline updates, click the update history drop-down menu in the top bar. \nTo view the graph, details, and events for an update, select the update in the drop-down menu. To return to the latest update, click **Show the latest update**.\n\n##### Monitor Delta Live Tables pipelines\n###### Get notifications for pipeline events\n\nTo receive real-time notifications for pipeline events such as successful completion of a pipeline update or failure of a pipeline update, add [email notifications for pipeline events](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#email-notifications) when you create or edit a pipeline.\n\n##### Monitor Delta Live Tables pipelines\n###### What is the Delta Live Tables event log?\n\nThe Delta Live Tables event log contains all information related to a pipeline, including audit logs, data quality checks, pipeline progress, and data lineage. You can use the event log to track, understand, and monitor the state of your data pipelines. \nYou can view event log entries in the Delta Live Tables user interface, the Delta Live Tables [API](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#list-events), or by directly querying the event log. This section focuses on querying the event log directly. 
\nYou can also define custom actions to run when events are logged, for example, sending alerts, with [event hooks](https:\/\/docs.databricks.com\/delta-live-tables\/event-hooks.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/observability.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Monitoring and observability options for Delta Live Tables pipelines\n##### Monitor Delta Live Tables pipelines\n###### Event log schema\n\nThe following table describes the event log schema. Some of these fields, such as the `details` field, contain JSON data that requires parsing to perform some queries. Databricks supports the `:` operator to parse JSON fields. See [: (colon sign) operator](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/colonsign.html). \n| Field | Description |\n| --- | --- |\n| `id` | A unique identifier for the event log record. |\n| `sequence` | A JSON document containing metadata to identify and order events. |\n| `origin` | A JSON document containing metadata for the origin of the event, for example, the cloud provider, the cloud provider region, `user_id`, `pipeline_id`, or `pipeline_type` to show where the pipeline was created, either `DBSQL` or `WORKSPACE`. |\n| `timestamp` | The time the event was recorded. |\n| `message` | A human-readable message describing the event. |\n| `level` | The warning level, for example, `INFO`, `WARN`, `ERROR`, or `METRICS`. |\n| `error` | If an error occurred, details describing the error. |\n| `details` | A JSON document containing structured details of the event. This is the primary field used for analyzing events. |\n| `event_type` | The event type. |\n| `maturity_level` | The stability of the event schema. The possible values are:* `STABLE`: The schema is stable and will not change. * `NULL`: The schema is stable and will not change. 
The value may be `NULL` if the record was created before the `maturity_level` field was added ([release 2022.37](https:\/\/docs.databricks.com\/release-notes\/delta-live-tables\/2022\/37\/index.html)). * `EVOLVING`: The schema is not stable and may change. * `DEPRECATED`: The schema is deprecated and the Delta Live Tables runtime may stop producing this event at any time. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/observability.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Monitoring and observability options for Delta Live Tables pipelines\n##### Monitor Delta Live Tables pipelines\n###### Querying the event log\n\nThe location of the event log and the interface to query the event log depend on whether your pipeline is configured to use the Hive metastore or Unity Catalog. \n### Hive metastore \nIf your pipeline [publishes tables to the Hive metastore](https:\/\/docs.databricks.com\/delta-live-tables\/publish.html), the event log is stored in `\/system\/events` under the `storage` location. For example, if you have configured your pipeline `storage` setting as `\/Users\/username\/data`, the event log is stored in the `\/Users\/username\/data\/system\/events` path in DBFS. \nIf you have not configured the `storage` setting, the default event log location is `\/pipelines\/<pipeline-id>\/system\/events` in DBFS. For example, if the ID of your pipeline is `91de5e48-35ed-11ec-8d3d-0242ac130003`, the storage location is `\/pipelines\/91de5e48-35ed-11ec-8d3d-0242ac130003\/system\/events`. \nYou can create a view to simplify querying the event log. The following example creates a temporary view called `event_log_raw`. This view is used in the example event log queries included in this article: \n```\nCREATE OR REPLACE TEMP VIEW event_log_raw AS SELECT * FROM delta.`<event-log-path>`;\n\n``` \nReplace `<event-log-path>` with the event log location. \nEach instance of a pipeline run is called an *update*. 
You often want to extract information for the most recent update. Run the following query to find the identifier for the most recent update and save it in the `latest_update` temporary view. This view is used in the example event log queries included in this article: \n```\nCREATE OR REPLACE TEMP VIEW latest_update AS SELECT origin.update_id AS id FROM event_log_raw WHERE event_type = 'create_update' ORDER BY timestamp DESC LIMIT 1;\n\n``` \nYou can query the event log in a Databricks notebook or the [SQL editor](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html). Use a notebook or the SQL editor to run the example event log queries. \n### Unity Catalog \nIf your pipeline [publishes tables to Unity Catalog](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html), you must use the `event_log` *table-valued function* (TVF) to fetch the event log for the pipeline. You retrieve the event log for a pipeline by passing the pipeline ID or a table name to the TVF. For example, to retrieve the event log records for the pipeline with ID `04c78631-3dd7-4856-b2a6-7d84e9b2638b`: \n```\nSELECT * FROM event_log(\"04c78631-3dd7-4856-b2a6-7d84e9b2638b\")\n\n``` \nTo retrieve the event log records for the pipeline that created or owns the table `my_catalog.my_schema.table1`: \n```\nSELECT * FROM event_log(TABLE(my_catalog.my_schema.table1))\n\n``` \nTo call the TVF, you must use a shared cluster or a SQL warehouse. For example, you can use a notebook attached to a shared cluster or use the [SQL editor](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html) connected to a SQL warehouse. \nTo simplify querying events for a pipeline, the owner of the pipeline can create a view over the `event_log` TVF. The following example creates a view over the event log for a pipeline. This view is used in the example event log queries included in this article. 
\nNote \nThe `event_log` TVF can be called only by the pipeline owner, and a view created over the `event_log` TVF can be queried only by the pipeline owner. The view cannot be shared with other users. \n```\nCREATE VIEW event_log_raw AS SELECT * FROM event_log(\"<pipeline-ID>\");\n\n``` \nReplace `<pipeline-ID>` with the unique identifier for the Delta Live Tables pipeline. You can find the ID in the **Pipeline details** panel in the Delta Live Tables UI. \nEach instance of a pipeline run is called an *update*. You often want to extract information for the most recent update. Run the following query to find the identifier for the most recent update and save it in the `latest_update` temporary view. This view is used in the example event log queries included in this article: \n```\nCREATE OR REPLACE TEMP VIEW latest_update AS SELECT origin.update_id AS id FROM event_log_raw WHERE event_type = 'create_update' ORDER BY timestamp DESC LIMIT 1;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/observability.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Monitoring and observability options for Delta Live Tables pipelines\n##### Monitor Delta Live Tables pipelines\n###### Query lineage information from the event log\n\nEvents containing information about lineage have the event type `flow_definition`. The `details:flow_definition` object contains the `output_dataset` and `input_datasets` defining each relationship in the graph. 
\nYou can use the following query to extract the input and output datasets to see lineage information: \n```\nSELECT\ndetails:flow_definition.output_dataset as output_dataset,\ndetails:flow_definition.input_datasets as input_datasets\nFROM\nevent_log_raw,\nlatest_update\nWHERE\nevent_type = 'flow_definition'\nAND\norigin.update_id = latest_update.id\n\n``` \n| | `output_dataset` | `input_datasets` |\n| --- | --- | --- |\n| 1 | `customers` | `null` |\n| 2 | `sales_orders_raw` | `null` |\n| 3 | `sales_orders_cleaned` | `[\"customers\", \"sales_orders_raw\"]` |\n| 4 | `sales_order_in_la` | `[\"sales_orders_cleaned\"]` |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/observability.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Monitoring and observability options for Delta Live Tables pipelines\n##### Monitor Delta Live Tables pipelines\n###### Query data quality from the event log\n\nIf you define expectations on datasets in your pipeline, the data quality metrics are stored in the `details:flow_progress.data_quality.expectations` object. Events containing information about data quality have the event type `flow_progress`. 
The following example queries the data quality metrics for the last pipeline update: \n```\nSELECT\nrow_expectations.dataset as dataset,\nrow_expectations.name as expectation,\nSUM(row_expectations.passed_records) as passing_records,\nSUM(row_expectations.failed_records) as failing_records\nFROM\n(\nSELECT\nexplode(\nfrom_json(\ndetails :flow_progress :data_quality :expectations,\n\"array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>\"\n)\n) row_expectations\nFROM\nevent_log_raw,\nlatest_update\nWHERE\nevent_type = 'flow_progress'\nAND origin.update_id = latest_update.id\n)\nGROUP BY\nrow_expectations.dataset,\nrow_expectations.name\n\n``` \n| | `dataset` | `expectation` | `passing_records` | `failing_records` |\n| --- | --- | --- | --- | --- |\n| 1 | `sales_orders_cleaned` | `valid_order_number` | 4083 | 0 |\n\n##### Monitor Delta Live Tables pipelines\n###### Monitor data backlog by querying the event log\n\nDelta Live Tables tracks how much data is present in the backlog in the `details:flow_progress.metrics.backlog_bytes` object. Events containing backlog metrics have the event type `flow_progress`. 
The following example queries backlog metrics for the last pipeline update: \n```\nSELECT\ntimestamp,\nDouble(details :flow_progress.metrics.backlog_bytes) as backlog\nFROM\nevent_log_raw,\nlatest_update\nWHERE\nevent_type ='flow_progress'\nAND\norigin.update_id = latest_update.id\n\n``` \nNote \nThe backlog metrics may not be available depending on the pipeline\u2019s data source type and Databricks Runtime version.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/observability.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Monitoring and observability options for Delta Live Tables pipelines\n##### Monitor Delta Live Tables pipelines\n###### Monitor Enhanced Autoscaling events from the event log\n\nThe event log captures cluster resizes when Enhanced Autoscaling is enabled in your pipelines. Events containing information about Enhanced Autoscaling have the event type `autoscale`. The cluster resizing request information is stored in the `details:autoscale` object. 
The following example queries the Enhanced Autoscaling cluster resize requests for the last pipeline update: \n```\nSELECT\ntimestamp,\nDouble(\ncase\nwhen details :autoscale.status = 'RESIZING' then details :autoscale.requested_num_executors\nelse null\nend\n) as starting_num_executors,\nDouble(\ncase\nwhen details :autoscale.status = 'SUCCEEDED' then details :autoscale.requested_num_executors\nelse null\nend\n) as succeeded_num_executors,\nDouble(\ncase\nwhen details :autoscale.status = 'PARTIALLY_SUCCEEDED' then details :autoscale.requested_num_executors\nelse null\nend\n) as partially_succeeded_num_executors,\nDouble(\ncase\nwhen details :autoscale.status = 'FAILED' then details :autoscale.requested_num_executors\nelse null\nend\n) as failed_num_executors\nFROM\nevent_log_raw,\nlatest_update\nWHERE\nevent_type = 'autoscale'\nAND\norigin.update_id = latest_update.id\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/observability.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Monitoring and observability options for Delta Live Tables pipelines\n##### Monitor Delta Live Tables pipelines\n###### Monitor compute resource utilization\n\n`cluster_resources` events provide metrics on the number of task slots in the cluster, how much those task slots are utilized, and how many tasks are waiting to be scheduled. \nWhen Enhanced Autoscaling is enabled, `cluster_resources` events also contain metrics for the autoscaling algorithm, including `latest_requested_num_executors`, and `optimal_num_executors`. The events also show the status of the algorithm as different states such as `CLUSTER_AT_DESIRED_SIZE`, `SCALE_UP_IN_PROGRESS_WAITING_FOR_EXECUTORS`, and `BLOCKED_FROM_SCALING_DOWN_BY_CONFIGURATION`.\nThis information can be viewed in conjunction with the autoscaling events to provide an overall picture of Enhanced Autoscaling. 
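\nAs a minimal sketch of how the algorithm states listed above might be summarized, the following query counts `cluster_resources` events per autoscaling state for the last pipeline update. It assumes the `event_log_raw` and `latest_update` views created earlier in this article; the column aliases are illustrative only: \n```\nSELECT\ndetails :cluster_resources.state as autoscaling_state,\ncount(*) as num_events\nFROM\nevent_log_raw,\nlatest_update\nWHERE\nevent_type = 'cluster_resources'\nAND\norigin.update_id = latest_update.id\nGROUP BY\ndetails :cluster_resources.state\n\n``` 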
\nThe following example queries the task queue size history for the last pipeline update: \n```\nSELECT\ntimestamp,\nDouble(details :cluster_resources.avg_num_queued_tasks) as queue_size\nFROM\nevent_log_raw,\nlatest_update\nWHERE\nevent_type = 'cluster_resources'\nAND\norigin.update_id = latest_update.id\n\n``` \nThe following example queries the utilization history for the last pipeline update: \n```\nSELECT\ntimestamp,\nDouble(details :cluster_resources.avg_task_slot_utilization) as utilization\nFROM\nevent_log_raw,\nlatest_update\nWHERE\nevent_type = 'cluster_resources'\nAND\norigin.update_id = latest_update.id\n\n``` \nThe following example queries the executor count history, accompanied by metrics available only for Enhanced Autoscaling pipelines, including the number of executors requested by the algorithm in the latest request, the optimal number of executors recommended by the algorithm based on the most recent metrics, and the autoscaling algorithm state: \n```\nSELECT\ntimestamp,\nDouble(details :cluster_resources.num_executors) as current_executors,\nDouble(details :cluster_resources.latest_requested_num_executors) as latest_requested_num_executors,\nDouble(details :cluster_resources.optimal_num_executors) as optimal_num_executors,\ndetails :cluster_resources.state as autoscaling_state\nFROM\nevent_log_raw,\nlatest_update\nWHERE\nevent_type = 'cluster_resources'\nAND\norigin.update_id = latest_update.id\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/observability.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Monitoring and observability options for Delta Live Tables pipelines\n##### Monitor Delta Live Tables pipelines\n###### Audit Delta Live Tables pipelines\n\nYou can use Delta Live Tables event log records and other Databricks audit logs to get a complete picture of how data is being updated in Delta Live Tables. 
\nDelta Live Tables uses the credentials of the pipeline owner to run updates. You can change the credentials used by updating the pipeline owner. Delta Live Tables records the user for actions on the pipeline, including pipeline creation, edits to configuration, and triggering updates. \nSee [Unity Catalog events](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html#uc) for a reference of Unity Catalog audit events.\n\n##### Monitor Delta Live Tables pipelines\n###### Query user actions in the event log\n\nYou can use the event log to audit events, for example, user actions. Events containing information about user actions have the event type `user_action`. \nInformation about the action is stored in the `user_action` object in the `details` field. Use the following query to construct an audit log of user events. To create the `event_log_raw` view used in this query, see [Querying the event log](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html#query-event-log). 
\n```\nSELECT timestamp, details:user_action:action, details:user_action:user_name FROM event_log_raw WHERE event_type = 'user_action'\n\n``` \n| | `timestamp` | `action` | `user_name` |\n| --- | --- | --- | --- |\n| 1 | 2021-05-20T19:36:03.517+0000 | `START` | `user@company.com` |\n| 2 | 2021-05-20T19:35:59.913+0000 | `CREATE` | `user@company.com` |\n| 3 | 2021-05-27T00:35:51.971+0000 | `START` | `user@company.com` |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/observability.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Monitoring and observability options for Delta Live Tables pipelines\n##### Monitor Delta Live Tables pipelines\n###### Runtime information\n\nYou can view runtime information for a pipeline update, for example, the Databricks Runtime version for the update: \n```\nSELECT details:create_update:runtime_version:dbr_version FROM event_log_raw WHERE event_type = 'create_update'\n\n``` \n| | `dbr_version` |\n| --- | --- |\n| 1 | 11.0 |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/observability.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create a storage credential for connecting to AWS S3\n\nThis article describes how to create a storage credential in Unity Catalog to connect to AWS S3. \nTo manage access to the underlying cloud storage that holds tables and volumes, Unity Catalog uses the following object types: \n* **Storage credentials** encapsulate a long-term cloud credential that provides access to cloud storage.\n* **External locations** contain a reference to a storage credential and a cloud storage path. \nFor more information, see [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html). \nUnity Catalog supports two cloud storage options for Databricks on AWS: AWS S3 buckets and Cloudflare R2 buckets. 
Cloudflare R2 is intended primarily for Delta Sharing use cases in which you want to avoid data egress fees. S3 is appropriate for most other use cases. This article focuses on creating storage credentials for S3. For Cloudflare R2, see [Create a storage credential for connecting to Cloudflare R2](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials-r2.html). \nTo create a storage credential for access to an S3 bucket, you create an IAM role that authorizes access (read, or read and write) to the S3 bucket path and reference that IAM role in the storage credential definition.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create a storage credential for connecting to AWS S3\n##### Requirements\n\nIn Databricks: \n* Databricks workspace enabled for Unity Catalog.\n* `CREATE STORAGE CREDENTIAL` privilege on the Unity Catalog metastore attached to the workspace. Account admins and metastore admins have this privilege by default. \nIn your AWS account: \n* An S3 bucket in the same region as the workspaces you want to access the data from. \nThe bucket name cannot include dot notation (for example, `incorrect.bucket.name.notation`). For more bucket naming guidance, see the [AWS bucket naming rules](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/bucketnamingrules.html).\n* The ability to create IAM roles.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create a storage credential for connecting to AWS S3\n##### Step 1: Create an IAM role\n\nIn AWS, create an IAM role that gives access to the S3 bucket that you want your users to access. This IAM role must be defined in the same account as the S3 bucket. 
\nTip \nIf you have already created an IAM role that provides this access, you can skip this step and go straight to [Step 2: Give Databricks the IAM role details](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html#create-storage-credentials-2). \n1. Create an IAM role that will allow access to the S3 bucket. \nRole creation is a two-step process. In this step you create the role, adding a *temporary* trust relationship policy and a placeholder external ID that you then modify after creating the storage credential in Databricks. \nYou must modify the trust policy *after* you create the role because your role must be self-assuming (that is, it must be configured to trust itself). The role must therefore exist before you add the self-assumption statement. For information about self-assuming roles, see this [Amazon blog article](https:\/\/aws.amazon.com\/blogs\/security\/announcing-an-update-to-iam-role-trust-policy-behavior\/). \nTo create the policy, you must use a placeholder external ID. An external ID is required in AWS to grant access to your AWS resources to a third party. \n1. Create the IAM role with a **Custom Trust Policy**.\n2. In the **Custom Trust Policy** field, paste the following policy JSON. \nThis policy establishes a cross-account trust relationship so that Unity Catalog can assume the role to access the data in the bucket on behalf of Databricks users. This is specified by the ARN in the `Principal` section. It is a static value that references a role created by Databricks. The policy uses the Databricks AWS account ID `414351767826`. If you are using [Databricks on AWS GovCloud](https:\/\/docs.databricks.com\/security\/privacy\/gov-cloud.html), use the Databricks account ID `044793339203`. \nThe policy sets the external ID to `0000` as a placeholder. You update this to the external ID of your storage credential in a later step. 
\n```\n{\n\"Version\": \"2012-10-17\",\n\"Statement\": [{\n\"Effect\": \"Allow\",\n\"Principal\": {\n\"AWS\": [\n\"arn:aws:iam::414351767826:role\/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL\"\n]\n},\n\"Action\": \"sts:AssumeRole\",\n\"Condition\": {\n\"StringEquals\": {\n\"sts:ExternalId\": \"0000\"\n}\n}\n}]\n}\n\n```\n3. Skip the permissions policy configuration. You\u2019ll go back to add that in a later step.\n4. Save the IAM role.\n2. Create the following IAM policy in the same account as the S3 bucket, replacing the following values: \n* `<BUCKET>`: The name of the S3 bucket.\n* `<KMS-KEY>`: Optional. If encryption is enabled, provide the name of the KMS key that encrypts the S3 bucket contents. **If encryption is disabled, remove the entire KMS section of the IAM policy.**\n* `<AWS-ACCOUNT-ID>`: The Account ID of your AWS account (not your Databricks account).\n* `<AWS-IAM-ROLE-NAME>`: The name of the AWS IAM role that you created in the previous step. \nThis IAM policy grants read and write access. You can also create a policy that grants read access only. However, this may be unnecessary, because you can mark the storage credential as read-only, and any write access granted by this IAM role will be ignored. \n```\n{\n\"Version\": \"2012-10-17\",\n\"Statement\": [\n{\n\"Action\": [\n\"s3:GetObject\",\n\"s3:PutObject\",\n\"s3:DeleteObject\",\n\"s3:ListBucket\",\n\"s3:GetBucketLocation\"\n],\n\"Resource\": [\n\"arn:aws:s3:::<BUCKET>\/*\",\n\"arn:aws:s3:::<BUCKET>\"\n],\n\"Effect\": \"Allow\"\n},\n{\n\"Action\": [\n\"kms:Decrypt\",\n\"kms:Encrypt\",\n\"kms:GenerateDataKey*\"\n],\n\"Resource\": [\n\"arn:aws:kms:<KMS-KEY>\"\n],\n\"Effect\": \"Allow\"\n},\n{\n\"Action\": [\n\"sts:AssumeRole\"\n],\n\"Resource\": [\n\"arn:aws:iam::<AWS-ACCOUNT-ID>:role\/<AWS-IAM-ROLE-NAME>\"\n],\n\"Effect\": \"Allow\"\n}\n]\n}\n\n``` \nNote \nIf you need a more restrictive IAM policy for Unity Catalog, contact your Databricks account team for assistance.\n3. 
Attach the IAM policy to the IAM role. \nIn the Role\u2019s **Permission** tab, attach the IAM Policy you just created.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create a storage credential for connecting to AWS S3\n##### Step 2: Give Databricks the IAM role details\n\n1. In Databricks, log in to a workspace that is linked to the metastore. \nYou must have the `CREATE STORAGE CREDENTIAL` privilege. The metastore admin and account admin roles both include this privilege.\n2. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. Click the **+Add** button and select **Add a storage credential** from the menu. \nThis option does not appear if you don\u2019t have the `CREATE STORAGE CREDENTIAL` privilege.\n4. Select a **Credential Type** of **AWS IAM Role**.\n5. Enter a name for the credential, the IAM Role ARN that authorizes Unity Catalog to access the storage location on your cloud tenant, and an optional comment. \nTip \nIf you have already defined an [instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html) in Databricks, you can click **Copy instance profile** to copy over the IAM role ARN for that instance profile. The instance profile\u2019s IAM role must have a cross-account trust relationship that enables Databricks to assume the role in order to access the bucket on behalf of Databricks users. For more information about the IAM role policy and trust relationship requirements, see [Step 1: Create an IAM role](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html#create-storage-credentials-1).\n6. (Optional) If you want users to have read-only access to the external locations that use this storage credential, in **Advanced options** select **Read only**. 
For more information, see [Mark a storage credential as read-only](https:\/\/docs.databricks.com\/connect\/unity-catalog\/manage-storage-credentials.html#read-only).\n7. Click **Create**.\n8. In the **Storage credential created** dialog, copy the **External ID**.\n9. Click **Done**.\n10. (Optional) Bind the storage credential to specific workspaces. \nBy default, any privileged user can use the storage credential on any workspace attached to the metastore. If you want to allow access only from specific workspaces, go to the **Workspaces** tab and assign workspaces. See [(Optional) Assign a storage credential to specific workspaces](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html#workspace-binding).\n11. [Create an external location](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html) that references this storage credential. \nYou can also create a storage credential by using [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_storage\\_credential](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/storage_credential).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create a storage credential for connecting to AWS S3\n##### Step 3: Update the IAM role policy\n\nIn AWS, modify the trust relationship policy to add your storage credential\u2019s external ID and make it self-assuming. \n1. Return to your saved IAM role and go to the **Trust Relationships** tab.\n2. Edit the trust relationship policy as follows: \nAdd the following ARN to the \u201cAllow\u201d statement. Replace `<YOUR-AWS-ACCOUNT-ID>` and `<THIS-ROLE-NAME>` with your actual account ID and IAM role values. 
\n```\n\"arn:aws:iam::<YOUR-AWS-ACCOUNT-ID>:role\/<THIS-ROLE-NAME>\"\n\n``` \nIn the `\"sts:AssumeRole\"` statement, update the placeholder external ID to your storage credential\u2019s external ID that you copied in the previous step. \n```\n\"sts:ExternalId\": \"<STORAGE-CREDENTIAL-EXTERNAL-ID>\"\n\n``` \nYour policy should now look like the following, with the replacement text updated to use your storage credential\u2019s external ID, account ID, and IAM role values: \n```\n{\n\"Version\": \"2012-10-17\",\n\"Statement\": [\n{\n\"Effect\": \"Allow\",\n\"Principal\": {\n\"AWS\": [\n\"arn:aws:iam::414351767826:role\/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL\",\n\"arn:aws:iam::<YOUR-AWS-ACCOUNT-ID>:role\/<THIS-ROLE-NAME>\"\n]\n},\n\"Action\": \"sts:AssumeRole\",\n\"Condition\": {\n\"StringEquals\": {\n\"sts:ExternalId\": \"<STORAGE-CREDENTIAL-EXTERNAL-ID>\"\n}\n}\n}\n]\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create a storage credential for connecting to AWS S3\n##### (Optional) Assign a storage credential to specific workspaces\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nBy default, a storage credential is accessible from all of the workspaces in the metastore. This means that if a user has been granted a privilege (such as `CREATE EXTERNAL LOCATION`) on that storage credential, they can exercise that privilege from any workspace attached to the metastore. If you use workspaces to isolate user data access, you may want to allow access to a storage credential only from specific workspaces. This feature is known as workspace binding or storage credential isolation. 
\nA typical use case for binding a storage credential to specific workspaces is the scenario in which a cloud admin configures a storage credential using a production cloud account credential, and you want to ensure that Databricks users use this credential to create external locations only in the production workspace. \nFor more information about workspace binding, see [(Optional) Assign an external location to specific workspaces](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#workspace-binding) and [Workspace-catalog binding example](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#catalog-binding-example). \nNote \nWorkspace bindings are referenced when privileges against storage credentials are exercised. For example, if a user creates an external location using a storage credential, the workspace binding on the storage credential is checked only when the external location is created. After the external location is created, it will function independently of the workspace bindings configured on the storage credential. \n### Bind a storage credential to one or more workspaces \nTo assign a storage credential to specific workspaces, you can use Catalog Explorer or the Unity Catalog REST API. \n**Permissions required**: Metastore admin or storage credential owner. \nNote \nMetastore admins can see all storage credentials in a metastore using Catalog Explorer\u2014and storage credential owners can see all storage credentials that they own in a metastore\u2014regardless of whether the storage credential is assigned to the current workspace. Storage credentials that are not assigned to the workspace appear grayed out. \n1. Log in to a workspace that is linked to the metastore.\n2. In the sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. At the bottom of the screen, click **External Data > Storage credentials**.\n4. 
Select the storage credential and go to the **Workspaces** tab.\n5. On the **Workspaces** tab, clear the **All workspaces have access** checkbox. \nIf your storage credential is already bound to one or more workspaces, this checkbox is already cleared.\n6. Click **Assign to workspaces** and enter or find the workspaces you want to assign. \nTo revoke access, go to the **Workspaces** tab, select the workspace, and click **Revoke**. To allow access from all workspaces, select the **All workspaces have access** checkbox. \nThere are two APIs and two steps required to assign a storage credential to a workspace. In the following examples, replace `<workspace-url>` with your workspace instance name. To learn how to get the workspace instance name and workspace ID, see [Get identifiers for workspace objects](https:\/\/docs.databricks.com\/workspace\/workspace-details.html). To learn about getting access tokens, see [Authentication for Databricks automation - overview](https:\/\/docs.databricks.com\/dev-tools\/auth\/index.html). \n1. Use the `storage-credentials` API to set the storage credential\u2019s `isolation_mode` to `ISOLATED`: \n```\ncurl -L -X PATCH 'https:\/\/<workspace-url>\/api\/2.1\/unity-catalog\/storage-credentials\/<my-storage-credential>' \\\n-H 'Authorization: Bearer <my-token>' \\\n-H 'Content-Type: application\/json' \\\n--data-raw '{\n\"isolation_mode\": \"ISOLATED\"\n}'\n\n``` \nThe default `isolation_mode` is `OPEN`, which allows access from all workspaces attached to the metastore.\n2. Use the update `bindings` API to assign the workspaces to the storage credential: \n```\ncurl -L -X PATCH 'https:\/\/<workspace-url>\/api\/2.1\/unity-catalog\/bindings\/storage-credentials\/<my-storage-credential>' \\\n-H 'Authorization: Bearer <my-token>' \\\n-H 'Content-Type: application\/json' \\\n--data-raw '{\n\"add\": [{\"workspace_id\": <workspace-id>}, ...],\n\"remove\": [{\"workspace_id\": <workspace-id>}, ...]\n}'\n\n``` \nUse the `\"add\"` and `\"remove\"` properties to add or remove workspace bindings. 
\nNote \nRead-only binding (`BINDING_TYPE_READ_ONLY`) is not available for storage credentials. Therefore, there is no reason to set a binding type for the storage credential binding. \nTo list all workspace assignments for a storage credential, use the list `bindings` API: \n```\ncurl -L -X GET 'https:\/\/<workspace-url>\/api\/2.1\/unity-catalog\/bindings\/storage-credentials\/<my-storage-credential>' \\\n-H 'Authorization: Bearer <my-token>'\n\n``` \n### Unbind a storage credential from a workspace \nInstructions for revoking workspace access to a storage credential using Catalog Explorer or the `bindings` API are included in [Bind a storage credential to one or more workspaces](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html#bind).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create a storage credential for connecting to AWS S3\n##### Next steps\n\nYou can view, update, delete, and grant other users permission to use storage credentials. See [Manage storage credentials](https:\/\/docs.databricks.com\/connect\/unity-catalog\/manage-storage-credentials.html). \nYou can define external locations using storage credentials. See [Create an external location to connect cloud storage to Databricks](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Google BigQuery\n\nThis article describes how to read from and write to Google BigQuery tables in Databricks. \nNote \nYou may prefer Lakehouse Federation for managing queries on BigQuery data. See [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). 
\nYou must connect to BigQuery using key-based authentication.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/bigquery.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Google BigQuery\n##### Permissions\n\nYour projects must have specific Google permissions to read and write using BigQuery. \nNote \nThis article discusses BigQuery materialized views. For details, see the Google article [Introduction to materialized views](https:\/\/cloud.google.com\/bigquery\/docs\/materialized-views-intro). To learn other BigQuery terminology and the BigQuery security model, see the [Google BigQuery documentation](https:\/\/cloud.google.com\/bigquery\/docs). \nReading and writing data with BigQuery depends on two Google Cloud projects: \n* Project (`project`): The ID for the Google Cloud project from which Databricks reads or writes the BigQuery table.\n* Parent project (`parentProject`): The ID for the parent project, which is the Google Cloud Project ID to bill for reading and writing. Set this to the Google Cloud project associated with the Google service account for which you will generate keys. \nYou must explicitly provide the `project` and `parentProject` values in the code that accesses BigQuery. Use code similar to the following: \n```\nspark.read.format(\"bigquery\") \\\n.option(\"table\", table) \\\n.option(\"project\", <project-id>) \\\n.option(\"parentProject\", <parent-project-id>) \\\n.load()\n\n``` \nThe required permissions for the Google Cloud projects depend on whether `project` and `parentProject` are the same. The following sections list the required permissions for each scenario. 
\n### Permissions required if `project` and `parentProject` match \nIf the IDs for your `project` and `parentProject` are the same, use the following table to determine minimum permissions: \n| Databricks task | Google permissions required in the project |\n| --- | --- |\n| Read a BigQuery table without materializedview | In the `project` project:* BigQuery Read Session User * BigQuery Data Viewer (Optionally grant this at dataset\/table level instead of project level) |\n| Read a BigQuery table [with materializedview](https:\/\/cloud.google.com\/bigquery\/docs\/materialized-views-intro) | In the `project` project:* BigQuery Job User * BigQuery Read Session User * BigQuery Data Viewer (Optionally grant this at dataset\/table level instead of project level) In the materialization project:* BigQuery Data Editor |\n| Write a BigQuery table | In the `project` project:* BigQuery Job User * BigQuery Data Editor | \n### Permissions required if `project` and `parentProject` are different \nIf the IDs for your `project` and `parentProject` are different, use the following table to determine minimum permissions: \n| Databricks task | Google permissions required |\n| --- | --- |\n| Read a BigQuery table without materializedview | In the `parentProject` project:* BigQuery Read Session User In the `project` project:* BigQuery Data Viewer (Optionally grant this at dataset\/table level instead of project level) |\n| Read a BigQuery table [with materializedview](https:\/\/cloud.google.com\/bigquery\/docs\/materialized-views-intro) | In the `parentProject` project:* BigQuery Read Session User * BigQuery Job User In the `project` project:* BigQuery Data Viewer (Optionally grant this at dataset\/table level instead of project level) In the materialization project:* BigQuery Data Editor |\n| Write a BigQuery table | In the `parentProject` project:* BigQuery Job User In the `project` project:* BigQuery Data Editor 
|\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/bigquery.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Google BigQuery\n##### Step 1: Set up Google Cloud\n\n### Enable the BigQuery Storage API \nThe BigQuery Storage API is enabled by default in new Google Cloud projects in which BigQuery is enabled. However, if you have an existing project and the BigQuery Storage API isn\u2019t enabled, follow the steps in this section to enable it. \nYou can enable the BigQuery Storage API using the Google Cloud CLI or the Google Cloud Console. \n#### Enable the BigQuery Storage API using Google Cloud CLI \n```\ngcloud services enable bigquerystorage.googleapis.com\n\n``` \n#### Enable the BigQuery Storage API using Google Cloud Console \n1. Click **APIs & Services** in the left navigation pane.\n2. Click the **ENABLE APIS AND SERVICES** button. \n![Google Enable Services](https:\/\/docs.databricks.com\/_images\/google-enable-services.png)\n3. Type `bigquery storage api` in the search bar and select the first result. \n![Google BigQuery Storage](https:\/\/docs.databricks.com\/_images\/google-bigquery-storage.png)\n4. Ensure that the BigQuery Storage API is enabled. \n![Google BigQuery](https:\/\/docs.databricks.com\/_images\/google-bigquery.png) \n### Create a Google service account for Databricks \nCreate a service account for the Databricks cluster. Databricks recommends giving this service account the least privileges needed to perform its tasks. See [BigQuery Roles and Permissions](https:\/\/cloud.google.com\/bigquery\/docs\/access-control). \nYou can create a service account using the Google Cloud CLI or the Google Cloud Console. 
\n#### Create a Google service account using Google Cloud CLI \n```\ngcloud iam service-accounts create <service-account-name>\n\ngcloud projects add-iam-policy-binding <project-name> \\\n--role roles\/bigquery.user \\\n--member=\"serviceAccount:<service-account-name>@<project-name>.iam.gserviceaccount.com\"\n\ngcloud projects add-iam-policy-binding <project-name> \\\n--role roles\/bigquery.dataEditor \\\n--member=\"serviceAccount:<service-account-name>@<project-name>.iam.gserviceaccount.com\"\n\n``` \nCreate the keys for your service account: \n```\ngcloud iam service-accounts keys create --iam-account \\\n\"<service-account-name>@<project-name>.iam.gserviceaccount.com\" \\\n<project-name>-xxxxxxxxxxx.json\n\n``` \n#### Create a Google service account using Google Cloud Console \nTo create the account: \n1. Click **IAM and Admin** in the left navigation pane.\n2. Click **Service Accounts**.\n3. Click **+ CREATE SERVICE ACCOUNT**.\n4. Enter the service account name and description. \n![Google create service account](https:\/\/docs.databricks.com\/_images\/google-create-service-account.png)\n5. Click **CREATE**.\n6. Specify roles for your service account. In the **Select a role** drop-down, type `BigQuery` and add the following roles: \n![Google Permissions](https:\/\/docs.databricks.com\/_images\/google-permissions.png)\n7. Click **CONTINUE**.\n8. Click **DONE**. \nTo create keys for your service account: \n1. In the service accounts list, click your newly created account.\n2. In the **Keys** section, click **ADD KEY > Create new key**. \n![Google Create Key](https:\/\/docs.databricks.com\/_images\/google-create-key.png)\n3. Accept the **JSON** key type.\n4. Click **CREATE**. The JSON key file is downloaded to your computer. \nImportant \nThe JSON key file you generate for the service account is a private key. Share it only with authorized users, because it controls access to datasets and resources in your Google Cloud account. 
\n### Create a Google Cloud Storage (GCS) bucket for temporary storage \nTo write data to BigQuery, the data source needs access to a GCS bucket. \n1. Click **Storage** in the left navigation pane.\n2. Click **CREATE BUCKET**. \n![Google Create Bucket](https:\/\/docs.databricks.com\/_images\/google-create-bucket.png)\n3. Configure the bucket details. \n![Google Bucket Details](https:\/\/docs.databricks.com\/_images\/google-bucket-details.png)\n4. Click **CREATE**.\n5. Click the **Permissions** tab and **Add members**.\n6. Provide the following permissions to the service account on the bucket. \n![Google Bucket Permissions](https:\/\/docs.databricks.com\/_images\/google-bucket-permissions.png)\n7. Click **SAVE**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/bigquery.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Google BigQuery\n##### Step 2: Set up Databricks\n\nTo configure a cluster to access BigQuery tables, you must provide your JSON key file as a Spark configuration. Use a local tool to Base64-encode your JSON key file. For security purposes do not use a web-based or remote tool that could access your keys. \nWhen you [configure your cluster](https:\/\/docs.databricks.com\/compute\/configure.html): \nIn the **Spark Config** tab, add the following Spark config. Replace `<base64-keys>` with the string of your Base64-encoded JSON key file. Replace the other items in brackets (such as `<client-email>`) with the values of those fields from your JSON key file. 
\n```\ncredentials <base64-keys>\n\nspark.hadoop.google.cloud.auth.service.account.enable true\nspark.hadoop.fs.gs.auth.service.account.email <client-email>\nspark.hadoop.fs.gs.project.id <project-id>\nspark.hadoop.fs.gs.auth.service.account.private.key <private-key>\nspark.hadoop.fs.gs.auth.service.account.private.key.id <private-key-id>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/bigquery.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Google BigQuery\n##### Read and write to a BigQuery table\n\nTo read a BigQuery table, specify: \n```\ndf = spark.read.format(\"bigquery\") \\\n.option(\"table\", <table-name>) \\\n.option(\"project\", <project-id>) \\\n.option(\"parentProject\", <parent-project-id>) \\\n.load()\n\n``` \nTo write to a BigQuery table, specify: \n```\ndf.write.format(\"bigquery\") \\\n.mode(\"<mode>\") \\\n.option(\"temporaryGcsBucket\", \"<bucket-name>\") \\\n.option(\"table\", <table-name>) \\\n.option(\"project\", <project-id>) \\\n.option(\"parentProject\", <parent-project-id>) \\\n.save()\n\n``` \nwhere `<bucket-name>` is the name of the bucket you created in [Create a Google Cloud Storage (GCS) bucket for temporary storage](https:\/\/docs.databricks.com\/connect\/external-systems\/bigquery.html#gcs-bucket). See [Permissions](https:\/\/docs.databricks.com\/connect\/external-systems\/bigquery.html#permissions) to learn about requirements for the `<project-id>` and `<parent-project-id>` values.\n\n#### Google BigQuery\n##### Create an external table from BigQuery\n\nImportant \nThis feature is not supported by Unity Catalog. 
\nYou can declare an unmanaged table in Databricks that will read data directly from BigQuery: \n```\nCREATE TABLE chosen_dataset.test_table\nUSING bigquery\nOPTIONS (\nparentProject 'gcp-parent-project-id',\nproject 'gcp-project-id',\ntemporaryGcsBucket 'some-gcp-bucket',\nmaterializationDataset 'some-bigquery-dataset',\ntable 'some-bigquery-dataset.table-to-copy'\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/bigquery.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Google BigQuery\n##### Python notebook example: Load a Google BigQuery table into a DataFrame\n\nThe following Python notebook loads a Google BigQuery table into a Databricks DataFrame. \n### Google BigQuery Python sample notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/big-query-python.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n#### Google BigQuery\n##### Scala notebook example: Load a Google BigQuery table into a DataFrame\n\nThe following Scala notebook loads a Google BigQuery table into a Databricks DataFrame. \n### Google BigQuery Scala sample notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/big-query-scala.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/bigquery.html"} +{"content":"# \n### What is Databricks?\n\nDatabricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale. 
The Databricks Data Intelligence Platform integrates with cloud storage and security in your cloud account, and manages and deploys cloud infrastructure on your behalf.\n\n### What is Databricks?\n#### How does a data intelligence platform work?\n\nDatabricks uses generative AI with the [data lakehouse](https:\/\/docs.databricks.com\/lakehouse\/index.html) to understand the unique semantics of your data. Then, it automatically optimizes performance and manages infrastructure to match your business needs. \nNatural language processing learns your business\u2019s language, so you can search and discover data by asking a question in your own words. Natural language assistance helps you write code, troubleshoot errors, and find answers in documentation. \nFinally, your data and AI applications can rely on strong governance and security. You can integrate APIs such as OpenAI without compromising data privacy and IP control.\n\n### What is Databricks?\n#### What is Databricks used for?\n\nDatabricks provides tools that help you connect your sources of data to one platform to process, store, share, analyze, model, and monetize datasets with solutions from BI to generative AI. \nThe Databricks workspace provides a unified interface and tools for most data tasks, including: \n* Data processing scheduling and management, in particular ETL\n* Generating dashboards and visualizations\n* Managing security, governance, high availability, and disaster recovery\n* Data discovery, annotation, and exploration\n* Machine learning (ML) modeling, tracking, and model serving\n* Generative AI solutions\n\n","doc_uri":"https:\/\/docs.databricks.com\/introduction\/index.html"} +{"content":"# \n### What is Databricks?\n#### Managed integration with open source\n\nDatabricks has a strong commitment to the open source community. Databricks manages updates of open source integrations in the Databricks Runtime releases. 
The following technologies are open source projects originally created by Databricks employees: \n* [Delta Lake](https:\/\/delta.io\/) and [Delta Sharing](https:\/\/delta.io\/sharing)\n* [MLflow](https:\/\/mlflow.org\/)\n* [Apache Spark](https:\/\/spark.apache.org\/) and [Structured Streaming](https:\/\/spark.apache.org\/streaming\/)\n* [Redash](https:\/\/redash.io\/)\n\n### What is Databricks?\n#### Tools and programmatic access\n\nDatabricks maintains a number of proprietary tools that integrate and expand these technologies to add optimized performance and ease of use, such as the following: \n* [Workflows](https:\/\/docs.databricks.com\/workflows\/index.html)\n* [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html)\n* [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html)\n* [Databricks SQL](https:\/\/docs.databricks.com\/sql\/index.html)\n* [Photon compute clusters](https:\/\/docs.databricks.com\/compute\/photon.html) \nIn addition to the workspace UI, you can interact with Databricks programmatically with the following tools: \n* REST API\n* CLI\n* Terraform\n\n","doc_uri":"https:\/\/docs.databricks.com\/introduction\/index.html"} +{"content":"# \n### What is Databricks?\n#### How does Databricks work with AWS?\n\nThe Databricks platform architecture comprises two primary parts: \n* The infrastructure used by Databricks to deploy, configure, and manage the platform and services.\n* The customer-owned infrastructure managed in collaboration by Databricks and your company. \nUnlike many enterprise data companies, Databricks does not force you to migrate your data into proprietary storage systems to use the platform. 
Instead, you configure a Databricks workspace by configuring secure integrations between the Databricks platform and your cloud account, and then Databricks deploys compute clusters using cloud resources in your account to process and store data in object storage and other integrated services you control. \nUnity Catalog further extends this relationship, allowing you to manage permissions for accessing data using familiar SQL syntax from within Databricks. \nDatabricks workspaces meet the security and networking requirements of [some of the world\u2019s largest and most security-minded companies](https:\/\/www.databricks.com\/customers). Databricks makes it easy for new users to get started on the platform. It removes many of the burdens and concerns of working with cloud infrastructure, without limiting the customizations and control experienced data, operations, and security teams require.\n\n### What is Databricks?\n#### What are common use cases for Databricks?\n\nUse cases on Databricks are as varied as the data processed on the platform and the many personas of employees that work with data as a core part of their job. The following use cases highlight how users throughout your organization can leverage Databricks to accomplish tasks essential to processing, storing, and analyzing the data that drives critical business functions and decisions.\n\n### What is Databricks?\n#### Build an enterprise data lakehouse\n\nThe data lakehouse combines the strengths of enterprise data warehouses and data lakes to accelerate, simplify, and unify enterprise data solutions. Data engineers, data scientists, analysts, and production systems can all use the data lakehouse as their single source of truth, allowing timely access to consistent data and reducing the complexities of building, maintaining, and syncing many distributed data systems. 
See [What is a data lakehouse?](https:\/\/docs.databricks.com\/lakehouse\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/introduction\/index.html"} +{"content":"# \n### What is Databricks?\n#### ETL and data engineering\n\nWhether you\u2019re generating dashboards or powering artificial intelligence applications, data engineering provides the backbone for data-centric companies by making sure data is available, clean, and stored in data models that allow for efficient discovery and use. Databricks combines the power of Apache Spark with Delta Lake and custom tools to provide an unrivaled ETL (extract, transform, load) experience. You can use SQL, Python, and Scala to compose ETL logic and then orchestrate scheduled job deployment with just a few clicks. \n[Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html) simplifies ETL even further by intelligently managing dependencies between datasets and automatically deploying and scaling production infrastructure to ensure timely and accurate delivery of data per your specifications. \nDatabricks provides a number of custom tools for [data ingestion](https:\/\/docs.databricks.com\/ingestion\/index.html), including [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html), an efficient and scalable tool for incrementally and idempotently loading data from cloud object storage and data lakes into the data lakehouse.\n\n","doc_uri":"https:\/\/docs.databricks.com\/introduction\/index.html"} +{"content":"# \n### What is Databricks?\n#### Machine learning, AI, and data science\n\nDatabricks machine learning expands the core functionality of the platform with a suite of tools tailored to the needs of data scientists and ML engineers, including [MLflow](https:\/\/docs.databricks.com\/mlflow\/index.html) and [Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html). 
\n### Large language models and generative AI \nDatabricks Runtime for Machine Learning includes libraries like [Hugging Face Transformers](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/index.html) that allow you to integrate existing pre-trained models or other open-source libraries into your workflow. The Databricks MLflow integration makes it easy to use the MLflow tracking service with transformer pipelines, models, and processing components. In addition, you can integrate [OpenAI](https:\/\/platform.openai.com\/docs\/introduction) models or solutions from partners like [John Snow Labs](https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/natural-language-processing.html#john-snow-labs) in your Databricks workflows. \nWith Databricks, you can customize an LLM on your data for your specific task. With the support of open source tooling, such as Hugging Face and DeepSpeed, you can efficiently take a foundation LLM and continue training it on your own data to improve accuracy for your domain and workload. \nIn addition, Databricks provides AI functions that SQL data analysts can use to access LLMs, including those from OpenAI, directly within their data pipelines and workflows. See [AI Functions on Databricks](https:\/\/docs.databricks.com\/large-language-models\/ai-functions.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/introduction\/index.html"} +{"content":"# \n### What is Databricks?\n#### Data warehousing, analytics, and BI\n\nDatabricks combines user-friendly UIs with cost-effective compute resources and infinitely scalable, affordable storage to provide a powerful platform for running analytic queries. Administrators configure scalable compute clusters as [SQL warehouses](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html), allowing end users to execute queries without worrying about any of the complexities of working in the cloud. 
SQL users can run queries against data in the lakehouse using the [SQL query editor](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html) or in notebooks. [Notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html) support Python, R, and Scala in addition to SQL, and allow users to embed the same [visualizations](https:\/\/docs.databricks.com\/visualizations\/index.html) available in [legacy dashboards](https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html) alongside links, images, and commentary written in markdown.\n\n","doc_uri":"https:\/\/docs.databricks.com\/introduction\/index.html"} +{"content":"# \n### What is Databricks?\n#### Data governance and secure data sharing\n\nUnity Catalog provides a unified data governance model for the data lakehouse. Cloud administrators configure and integrate coarse access control permissions for Unity Catalog, and then Databricks administrators can manage permissions for teams and individuals. Privileges are managed with access control lists (ACLs) through either user-friendly UIs or SQL syntax, making it easier for database administrators to secure access to data without needing to scale on cloud-native identity access management (IAM) and networking. \nUnity Catalog makes running secure analytics in the cloud simple, and provides a division of responsibility that helps limit the reskilling or upskilling necessary for both administrators and end users of the platform. See [What is Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). \nThe lakehouse makes data sharing within your organization as simple as granting query access to a table or view. 
For sharing outside of your secure environment, Unity Catalog features a managed version of [Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html).\n\n### What is Databricks?\n#### DevOps, CI\/CD, and task orchestration\n\nThe development lifecycles for ETL pipelines, ML models, and analytics dashboards each present their own unique challenges. Databricks allows all of your users to leverage a single data source, which reduces duplicate efforts and out-of-sync reporting. By additionally providing a suite of common tools for versioning, automating, scheduling, deploying code and production resources, you can simplify your overhead for monitoring, orchestration, and operations. [Workflows](https:\/\/docs.databricks.com\/workflows\/index.html) schedule Databricks notebooks, SQL queries, and other arbitrary code. [Git folders](https:\/\/docs.databricks.com\/repos\/index.html) let you sync Databricks projects with a number of popular git providers. For a complete overview of tools, see [Developer tools and guidance](https:\/\/docs.databricks.com\/dev-tools\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/introduction\/index.html"} +{"content":"# \n### What is Databricks?\n#### Real-time and streaming analytics\n\nDatabricks leverages Apache Spark Structured Streaming to work with streaming data and incremental data changes. Structured Streaming integrates tightly with Delta Lake, and these technologies provide the foundations for both Delta Live Tables and Auto Loader. 
See [Streaming on Databricks](https:\/\/docs.databricks.com\/structured-streaming\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/introduction\/index.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Run Delta Live Tables pipelines\n##### Run a Delta Live Tables pipeline in a workflow\n\nYou can run a Delta Live Tables pipeline as part of a data processing workflow with Databricks jobs, Apache Airflow, or Azure Data Factory.\n\n##### Run a Delta Live Tables pipeline in a workflow\n###### Jobs\n\nYou can orchestrate multiple tasks in a Databricks job to implement a data processing workflow. To include a Delta Live Tables pipeline in a job, use the **Pipeline** task when you [create a job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#job-create).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/workflows.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Run Delta Live Tables pipelines\n##### Run a Delta Live Tables pipeline in a workflow\n###### Apache Airflow\n\n[Apache Airflow](https:\/\/airflow.apache.org\/) is an open source solution for managing and scheduling data workflows. Airflow represents workflows as directed acyclic graphs (DAGs) of operations. You define a workflow in a Python file and Airflow manages the scheduling and execution. For information on installing and using Airflow with Databricks, see [Orchestrate Databricks jobs with Apache Airflow](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-airflow-with-jobs.html). \nTo run a Delta Live Tables pipeline as part of an Airflow workflow, use the [DatabricksSubmitRunOperator](https:\/\/airflow.apache.org\/docs\/apache-airflow-providers-databricks\/stable\/_api\/airflow\/providers\/databricks\/operators\/databricks\/index.html#airflow.providers.databricks.operators.databricks.DatabricksSubmitRunOperator). 
\n### Requirements \nThe following are required to use the Airflow support for Delta Live Tables: \n* Airflow version 2.1.0 or later.\n* The [Databricks provider](https:\/\/pypi.org\/project\/apache-airflow-providers-databricks\/) package version 2.1.0 or later. \n### Example \nThe following example creates an Airflow DAG that triggers an update for the Delta Live Tables pipeline with the identifier `8279d543-063c-4d63-9926-dae38e35ce8b`: \n```\nfrom airflow import DAG\nfrom airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator\nfrom airflow.utils.dates import days_ago\n\ndefault_args = {\n'owner': 'airflow'\n}\n\nwith DAG('dlt',\nstart_date=days_ago(2),\nschedule_interval=\"@once\",\ndefault_args=default_args\n) as dag:\n\nopr_run_now=DatabricksSubmitRunOperator(\ntask_id='run_now',\ndatabricks_conn_id='CONNECTION_ID',\npipeline_task={\"pipeline_id\": \"8279d543-063c-4d63-9926-dae38e35ce8b\"}\n)\n\n``` \nReplace `CONNECTION_ID` with the identifier for an [Airflow connection](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-airflow-with-jobs.html) to your workspace. \nSave this example in the `airflow\/dags` directory and use the Airflow UI to [view and trigger](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-airflow-with-jobs.html) the DAG. Use the Delta Live Tables UI to view the details of the pipeline update.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/workflows.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Run Delta Live Tables pipelines\n##### Run a Delta Live Tables pipeline in a workflow\n###### Azure Data Factory\n\nAzure Data Factory is a cloud-based ETL service that lets you orchestrate data integration and transformation workflows. 
Azure Data Factory directly supports running Databricks tasks in a workflow, including [notebooks](https:\/\/learn.microsoft.com\/azure\/data-factory\/transform-data-using-databricks-notebook), JAR tasks, and Python scripts. You can also include a pipeline in a workflow by calling the Delta Live Tables [API](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html) from an Azure Data Factory [Web activity](https:\/\/learn.microsoft.com\/azure\/data-factory\/control-flow-web-activity). For example, to trigger a pipeline update from Azure Data Factory: \n1. [Create a data factory](https:\/\/learn.microsoft.com\/azure\/data-factory\/quickstart-create-data-factory-portal#create-a-data-factory) or open an existing data factory.\n2. When creation completes, open the page for your data factory and click the **Open Azure Data Factory Studio** tile. The Azure Data Factory user interface appears.\n3. Create a new Azure Data Factory pipeline by selecting **Pipeline** from the **New** drop-down menu in the Azure Data Factory Studio user interface.\n4. In the **Activities** toolbox, expand **General** and drag the **Web** activity to the pipeline canvas. Click the **Settings** tab and enter the following values: \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \n* **URL**: `https:\/\/<databricks-instance>\/api\/2.0\/pipelines\/<pipeline-id>\/updates`. 
\nReplace `<databricks-instance>` with your Databricks workspace instance name. \nReplace `<pipeline-id>` with the pipeline identifier.\n* **Method**: Select **POST** from the drop-down menu.\n* **Headers**: Click **+ New**. In the **Name** text box, enter `Authorization`. In the **Value** text box, enter `Bearer <personal-access-token>`. \nReplace `<personal-access-token>` with a Databricks [personal access token](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement).\n* **Body**: To pass additional request parameters, enter a JSON document containing the parameters. For example, to start an update and reprocess all data for the pipeline: `{\"full_refresh\": \"true\"}`. If there are no additional request parameters, enter empty braces (`{}`). \nTo test the Web activity, click **Debug** on the pipeline toolbar in the Data Factory UI. The output and status of the run, including errors, are displayed in the **Output** tab of the Azure Data Factory pipeline. Use the Delta Live Tables UI to view the details of the pipeline update. \nTip \nA common workflow requirement is to start a task after completion of a previous task. Because the Delta Live Tables `updates` request is asynchronous (the request returns after starting the update but before the update completes), tasks in your Azure Data Factory pipeline with a dependency on the Delta Live Tables update must wait for the update to complete. One way to wait for update completion is to add an [Until activity](https:\/\/learn.microsoft.com\/azure\/data-factory\/control-flow-until-activity) following the Web activity that triggers the Delta Live Tables update. In the Until activity: \n1. Add a [Wait activity](https:\/\/learn.microsoft.com\/azure\/data-factory\/control-flow-wait-activity) to wait a configured number of seconds for update completion.\n2. 
Add a Web activity following the Wait activity that uses the Delta Live Tables [Get update details](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html#update-details) request to get the status of the update. The `state` field in the response returns the current state of the update, including if it has completed.\n3. Use the value of the `state` field to set the terminating condition for the Until activity. You can also use a [Set Variable activity](https:\/\/learn.microsoft.com\/azure\/data-factory\/control-flow-set-variable-activity) to add a pipeline variable based on the `state` value and use this variable for the terminating condition.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/workflows.html"} +{"content":"# Develop on Databricks\n### Databricks for Scala developers\n\nThis article provides a guide to developing notebooks and jobs in Databricks using the Scala language. The first section provides links to tutorials for common workflows and tasks. The second section provides links to APIs, libraries, and key tools. \nA basic workflow for getting started is: \n* [Import code and run it using an interactive Databricks notebook](https:\/\/docs.databricks.com\/languages\/scala.html#manage-code-with-notebooks-and-databricks-git-folders): Either import your own code from files or Git repos or try a tutorial listed below.\n* [Run your code on a cluster](https:\/\/docs.databricks.com\/languages\/scala.html#clusters-and-libraries): Either create a cluster of your own or ensure that you have permissions to use a shared cluster. Attach your notebook to the cluster and run the notebook. 
\nBeyond this, you can branch out into more specific topics: \n* [Work with larger data sets](https:\/\/docs.databricks.com\/languages\/scala.html#scala-api) using Apache Spark\n* [Add visualizations](https:\/\/docs.databricks.com\/languages\/scala.html#visualizations)\n* [Automate your workload](https:\/\/docs.databricks.com\/languages\/scala.html#jobs) as a job\n* [Develop in IDEs](https:\/\/docs.databricks.com\/languages\/scala.html#ides-tools-sdks)\n\n","doc_uri":"https:\/\/docs.databricks.com\/languages\/scala.html"} +{"content":"# Develop on Databricks\n### Databricks for Scala developers\n#### Tutorials\n\nThe tutorials below provide example code and notebooks to learn about common workflows. See [Import a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html#import-notebook) for instructions on importing notebook examples into your workspace. \n* [Tutorial: Load and transform data using Apache Spark DataFrames](https:\/\/docs.databricks.com\/getting-started\/dataframes.html)\n* [Tutorial: Delta Lake](https:\/\/docs.databricks.com\/delta\/tutorial.html) provides Scala examples.\n* [Quickstart Java and Scala](https:\/\/docs.databricks.com\/mlflow\/quick-start-java-scala.html) helps you learn the basics of tracking machine learning training runs using MLflow in Scala.\n* [Use XGBoost on Databricks](https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost.html) provides a Scala example.\n\n","doc_uri":"https:\/\/docs.databricks.com\/languages\/scala.html"} +{"content":"# Develop on Databricks\n### Databricks for Scala developers\n#### Reference\n\nThe below subsections list key features and tips to help you begin developing in Databricks with Scala. \n### Scala API \nThese links provide an introduction to and reference for the Apache Spark Scala API. 
\n* [Tutorial: Load and transform data using Apache Spark DataFrames](https:\/\/docs.databricks.com\/getting-started\/dataframes.html)\n* [Query semi-structured data in Databricks](https:\/\/docs.databricks.com\/optimizations\/semi-structured.html)\n* [Introduction to Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/examples.html)\n* [Apache Spark Core API reference](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/index.html)\n* [Apache Spark ML API reference](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/ml\/index.html) \n### Manage code with notebooks and Databricks Git folders \n[Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html) support Scala. These notebooks provide functionality similar to that of Jupyter, but with additions such as built-in visualizations using big data, Apache Spark integrations for debugging and performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by [importing a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html#import-notebook). Once you have access to a cluster, you can [attach a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) to the cluster and [run the notebook](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html). \nTip \nTo completely reset the state of your notebook, it can be useful to restart the kernel. For Jupyter users, the \u201crestart kernel\u201d option in Jupyter corresponds to detaching and re-attaching a notebook in Databricks. To restart the kernel in a notebook, click the [compute selector](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#notebook-toolbar) in the notebook toolbar and hover over the attached cluster or SQL warehouse in the list to display a side menu. Select **Detach & re-attach**. 
This detaches the notebook from your cluster and reattaches it, which restarts the process. \n[Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html) allow users to synchronize notebooks and other files with Git repositories. Databricks Git folders help with code versioning and collaboration, and they can simplify importing a full repository of code into Databricks, viewing past notebook versions, and integrating with IDE development. Get started by [cloning a remote Git repository](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html). You can then open or create notebooks with the repository clone, [attach the notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) to a cluster, and [run the notebook](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html). \n### Clusters and libraries \nDatabricks [Compute](https:\/\/docs.databricks.com\/compute\/index.html) provides compute management for clusters of any size: from single node clusters up to large clusters. You can customize cluster hardware and libraries according to your needs. Data scientists generally begin work either by [creating a cluster](https:\/\/docs.databricks.com\/compute\/configure.html) or using an existing [shared cluster](https:\/\/docs.databricks.com\/compute\/use-compute.html). Once you have access to a cluster, you can [attach a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) to the cluster or [run a job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#create-a-job) on the cluster. 
\n* For small workloads which only require single nodes, data scientists can use [single node compute](https:\/\/docs.databricks.com\/compute\/configure.html#single-node) for cost savings.\n* For detailed tips, see [Compute configuration best practices](https:\/\/docs.databricks.com\/compute\/cluster-config-best-practices.html)\n* Administrators can set up [cluster policies](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html) to simplify and guide cluster creation. \nDatabricks clusters use a Databricks Runtime, which provides many popular libraries out-of-the-box, including Apache Spark, Delta Lake, and more. You can also install additional third-party or custom libraries to use with notebooks and jobs. \n* Start with the default libraries in the [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). For full lists of pre-installed libraries, see [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html).\n* You can also [install Scala libraries in a cluster](https:\/\/docs.databricks.com\/libraries\/package-repositories.html#maven-libraries).\n* For more details, see [Libraries](https:\/\/docs.databricks.com\/libraries\/index.html). \n### Visualizations \nDatabricks Scala notebooks have built-in support for many types of [visualizations](https:\/\/docs.databricks.com\/visualizations\/index.html). You can also use legacy visualizations: \n* [Visualization overview](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#visualizations-in-scala)\n* [Visualization deep dive in Scala](https:\/\/docs.databricks.com\/visualizations\/charts-and-graphs-scala.html) \n### Interoperability \nThis section describes features that support interoperability between Scala and SQL. 
\n* [User-defined functions](https:\/\/docs.databricks.com\/udf\/scala.html)\n* [User-defined aggregate functions](https:\/\/docs.databricks.com\/udf\/aggregate-scala.html) \n### Jobs \nYou can automate Scala workloads as scheduled or triggered [jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) in Databricks. Jobs can run notebooks and JARs. \n* For details on creating a job via the UI, see [Create a job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#create-a-job).\n* The [Databricks SDKs](https:\/\/docs.databricks.com\/dev-tools\/index-sdk.html) allow you to create, edit, and delete jobs programmatically.\n* The [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) provides a convenient command line interface for automating jobs. \n### IDEs, developer tools, and SDKs \nIn addition to developing Scala code within Databricks notebooks, you can develop externally using integrated development environments (IDEs) such as IntelliJ IDEA. To synchronize work between external development environments and Databricks, there are several options: \n* **Code**: You can synchronize code using Git. See [Git integration with Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html).\n* **Libraries and jobs**: You can create libraries externally and upload them to Databricks. Those libraries may be imported within Databricks notebooks, or they can be used to create jobs. See [Libraries](https:\/\/docs.databricks.com\/libraries\/index.html) and [Create and run Databricks Jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html).\n* **Remote machine execution**: You can run code from your local IDE for interactive development and testing. The IDE can communicate with Databricks to execute large computations on Databricks clusters. For example, you can use IntelliJ IDEA with [Databricks Connect](https:\/\/docs.databricks.com\/dev-tools\/databricks-connect\/index.html). 
\nDatabricks provides a set of SDKs which support automation and integration with external tooling. You can use the Databricks SDKs to manage resources like clusters and libraries, code and other workspace objects, workloads and jobs, and more. See the [Databricks SDKs](https:\/\/docs.databricks.com\/dev-tools\/index-sdk.html). \nFor more information on IDEs, developer tools, and SDKs, see [Developer tools and guidance](https:\/\/docs.databricks.com\/dev-tools\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/languages\/scala.html"} +{"content":"# Develop on Databricks\n### Databricks for Scala developers\n#### Additional resources\n\n* The [Databricks Academy](https:\/\/databricks.com\/learn\/training\/home) offers self-paced and instructor-led courses on many topics.\n\n","doc_uri":"https:\/\/docs.databricks.com\/languages\/scala.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n### Configure data access for ingestion\n##### Generate temporary credentials for ingestion\n\nThis article describes how to create an IAM user in your AWS account that has just enough access to read data in an Amazon S3 (S3) bucket.\n\n##### Generate temporary credentials for ingestion\n###### Create an IAM policy\n\n1. Open the AWS IAM console in your AWS account, typically at <https:\/\/console.aws.amazon.com\/iam>.\n2. Click **Policies**.\n3. Click **Create Policy**.\n4. Click the **JSON** tab.\n5. Replace the existing JSON code with the following code. In the code, replace: \n* `<s3-bucket>` with the name of your S3 bucket.\n* `<folder>` with the name of the folder within your S3 bucket.\n```\n{\n\"Version\": \"2012-10-17\",\n\"Statement\": [\n{\n\"Sid\": \"ReadOnlyAccessToTrips\",\n\"Effect\": \"Allow\",\n\"Action\": [\n\"s3:GetObject\",\n\"s3:ListBucket\"\n],\n\"Resource\": [\n\"arn:aws:s3:::<s3-bucket>\",\n\"arn:aws:s3:::<s3-bucket>\/<folder>\/*\"\n]\n}\n]\n}\n\n```\n6. Click **Next: Tags**.\n7. 
Click **Next: Review**.\n8. Enter a name for the policy and click **Create policy**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/generate-temporary-credentials.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n### Configure data access for ingestion\n##### Generate temporary credentials for ingestion\n###### Create an IAM user\n\n1. In the sidebar, click **Users**.\n2. Click **Add users**.\n3. Enter a name for the user.\n4. Select the **Access key - Programmatic access** box, and then click **Next: Permissions**.\n5. Click **Attach existing policies directly**.\n6. Select the box next to the policy, and then click **Next: Tags**.\n7. Click **Next: Review**.\n8. Click **Create user**.\n9. Copy the **Access key ID** and **Secret access key** values that appear to a secure location, as you need them to get the AWS STS session token.\n\n##### Generate temporary credentials for ingestion\n###### Create a named profile\n\n1. On your local development machine, use the AWS CLI to create a named profile with the AWS credentials that you copied in the previous step. See [Named profiles for the AWS CLI](https:\/\/docs.aws.amazon.com\/cli\/latest\/userguide\/cli-configure-profiles.html) on the AWS website.\n2. Test your AWS credentials. To do this, use the AWS CLI to run the following command, which displays the contents of the folder that contains your data. In the command, replace: \n* `<s3-bucket>` with the name of your S3 bucket.\n* `<folder>` with the name of the folder within your S3 bucket.\n* `<named-profile>` with the name of your named profile.\n```\naws s3 ls s3:\/\/<s3-bucket>\/<folder>\/ --profile <named-profile>\n\n```\n3. To get the session token, run the following command: \n```\naws sts get-session-token --profile <named-profile>\n\n``` \nReplace `<named-profile>` with the name of your named profile.\n4. 
Copy the **AccessKeyId**, **SecretAccessKey**, and **SessionToken** values that appear to a secure location.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/generate-temporary-credentials.html"} +{"content":"# Develop on Databricks\n### Databricks for Python developers\n\nThis section provides a guide to developing notebooks and jobs in Databricks using the Python language. The first subsection provides links to tutorials for common workflows and tasks. The second subsection provides links to APIs, libraries, and key tools. \nA basic workflow for getting started is: \n* [Import code](https:\/\/docs.databricks.com\/languages\/python.html#manage-code-with-notebooks-and-databricks-git-folders): Either import your own code from files or Git repos or try a tutorial listed below. Databricks recommends learning using interactive Databricks Notebooks.\n* [Run your code on a cluster](https:\/\/docs.databricks.com\/languages\/python.html#clusters-and-libraries): Either create a cluster of your own, or ensure you have permissions to use a shared cluster. Attach your notebook to the cluster, and run the notebook.\n* Beyond this, you can branch out into more specific topics: \n+ [Work with larger data sets](https:\/\/docs.databricks.com\/languages\/python.html#python-apis) using Apache Spark\n+ [Add visualizations](https:\/\/docs.databricks.com\/languages\/python.html#visualizations)\n+ [Automate your workload](https:\/\/docs.databricks.com\/languages\/python.html#jobs) as a job\n+ [Use machine learning](https:\/\/docs.databricks.com\/languages\/python.html#machine-learning) to analyze your data\n+ [Develop in IDEs](https:\/\/docs.databricks.com\/languages\/python.html#ides-tools-sdks)\n\n","doc_uri":"https:\/\/docs.databricks.com\/languages\/python.html"} +{"content":"# Develop on Databricks\n### Databricks for Python developers\n#### Tutorials\n\nThe below tutorials provide example code and notebooks to learn about common workflows. 
See [Import a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html#import-a-notebook) for instructions on importing notebook examples into your workspace. \n### Interactive data science and machine learning \n* Getting started with Apache Spark DataFrames for data preparation and analytics: [Tutorial: Load and transform data using Apache Spark DataFrames](https:\/\/docs.databricks.com\/getting-started\/dataframes.html)\n* [Tutorial: End-to-end ML models on Databricks](https:\/\/docs.databricks.com\/mlflow\/end-to-end-example.html). For additional examples, see [Tutorials: Get started with ML](https:\/\/docs.databricks.com\/machine-learning\/ml-tutorials.html) and the MLflow guide\u2019s [Quickstart Python](https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html).\n* [Databricks AutoML](https:\/\/docs.databricks.com\/machine-learning\/automl\/index.html) lets you get started quickly with developing machine learning models on your own datasets. Its glass-box approach generates notebooks with the complete machine learning workflow, which you may clone, modify, and rerun. \n### Data engineering \n* [Tutorial: Load and transform data using Apache Spark DataFrames](https:\/\/docs.databricks.com\/getting-started\/dataframes.html) provides a walkthrough to help you learn about Apache Spark DataFrames for data preparation and analytics.\n* [Tutorial: Delta Lake](https:\/\/docs.databricks.com\/delta\/tutorial.html).\n* [Tutorial: Run your first Delta Live Tables pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html). 
\n### Production machine learning and machine learning operations \n* [Manage model lifecycle in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html)\n* [Tutorial: End-to-end ML models on Databricks](https:\/\/docs.databricks.com\/mlflow\/end-to-end-example.html) \n### Debug in Python notebooks \nThe example notebook illustrates how to use the Python debugger (pdb) in Databricks notebooks. To use the Python debugger, you must be running Databricks Runtime 11.3 LTS or above. \nWith Databricks Runtime 12.2 LTS and above, you can use [variable explorer](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#variable-explorer) to track the current value of Python variables in the notebook UI. You can use variable explorer to observe the values of Python variables as you step through breakpoints. \n#### Python debugger example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/python-debugger.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nNote \n`breakpoint()` is [not supported in IPython](https:\/\/github.com\/ipython\/ipykernel\/issues\/897) and thus does not work in Databricks notebooks. You can use `import pdb; pdb.set_trace()` instead of `breakpoint()`. \n### Python APIs \nPython code that runs outside of Databricks can generally run within Databricks, and vice versa. If you have existing code, just import it into Databricks to get started. See [Manage code with notebooks and Databricks Git folders](https:\/\/docs.databricks.com\/languages\/python.html#manage-code-with-notebooks-and-databricks-git-folders) below for details. \nDatabricks can run both single-machine and distributed Python workloads. 
For single-machine computing, you can use Python APIs and libraries as usual; for example, pandas and scikit-learn will \u201cjust work.\u201d For distributed Python workloads, Databricks offers two popular APIs out of the box: PySpark and Pandas API on Spark. \n#### PySpark API \n[PySpark](https:\/\/docs.databricks.com\/pyspark\/index.html) is the official Python API for Apache Spark and combines the power of Python and Apache Spark. PySpark is more flexible than the Pandas API on Spark and provides extensive support and features for data science and engineering functionality such as Spark SQL, Structured Streaming, MLlib, and GraphX. \n#### Pandas API on Spark \nNote \nThe [Koalas open-source project](https:\/\/koalas.readthedocs.io\/) now recommends switching to the Pandas API on Spark. The Pandas API on Spark is available on clusters that run [Databricks Runtime 10.0 (unsupported)](https:\/\/docs.databricks.com\/archive\/runtime-release-notes\/10.0.html) and above. For clusters that run [Databricks Runtime 9.1 LTS](https:\/\/docs.databricks.com\/release-notes\/runtime\/9.1lts.html) and below, use [Koalas](https:\/\/docs.databricks.com\/archive\/legacy\/koalas.html) instead. \n[pandas](https:\/\/pandas.pydata.org) is a Python package commonly used by data scientists for data analysis and manipulation. However, pandas does not scale out to big data. [Pandas API on Spark](https:\/\/docs.databricks.com\/pandas\/pandas-on-spark.html) fills this gap by providing pandas-equivalent APIs that work on Apache Spark. This [open-source API](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/user_guide\/pandas_on_spark\/index.html) is an ideal choice for data scientists who are familiar with pandas but not Apache Spark. \n### Manage code with notebooks and Databricks Git folders \n[Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html) support Python. 
These notebooks provide functionality similar to that of Jupyter, but with additions such as built-in visualizations using big data, Apache Spark integrations for debugging and performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by [importing a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html#import-a-notebook). Once you have access to a cluster, you can [attach a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) to the cluster and [run the notebook](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html). \nTip \nTo completely reset the state of your notebook, it can be useful to restart the IPython kernel. For Jupyter users, the \u201crestart kernel\u201d option in Jupyter corresponds to detaching and re-attaching a notebook in Databricks. To restart the kernel in a Python notebook, click the [compute selector](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#notebook-toolbar) in the notebook toolbar and hover over the attached cluster or SQL warehouse in the list to display a side menu. Select **Detach & re-attach**. This detaches the notebook from your cluster and reattaches it, which restarts the Python process. \n[Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html) allow users to synchronize notebooks and other files with Git repositories. Databricks Git folders help with code versioning and collaboration, and they can simplify importing a full repository of code into Databricks, viewing past notebook versions, and integrating with IDE development. Get started by [cloning a remote Git repository](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html). You can then open or create notebooks with the repository clone, [attach the notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) to a cluster, and [run the notebook](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html). 
\n### Clusters and libraries \nDatabricks [Compute](https:\/\/docs.databricks.com\/compute\/index.html) provides compute management for clusters of any size: from single node clusters up to large clusters. You can customize cluster hardware and libraries according to your needs. Data scientists will generally begin work either by [creating a cluster](https:\/\/docs.databricks.com\/compute\/configure.html) or using an existing [shared cluster](https:\/\/docs.databricks.com\/compute\/use-compute.html). Once you have access to a cluster, you can [attach a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) to the cluster or [run a job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#create-a-job) on the cluster. \n* For small workloads which only require single nodes, data scientists can use [single node compute](https:\/\/docs.databricks.com\/compute\/configure.html#single-node) for cost savings.\n* For detailed tips, see [Compute configuration best practices](https:\/\/docs.databricks.com\/compute\/cluster-config-best-practices.html).\n* Administrators can set up [cluster policies](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html) to simplify and guide cluster creation. \nDatabricks clusters use a Databricks Runtime, which provides many popular libraries out-of-the-box, including Apache Spark, Delta Lake, pandas, and more. You can also install additional third-party or custom Python libraries to use with notebooks and jobs. \n* Start with the default libraries in the [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). Use [Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html) for machine learning workloads. 
For full lists of pre-installed libraries, see [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html).\n* Customize your environment using [Notebook-scoped Python libraries](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html), which allow you to modify your notebook or job environment with libraries from PyPI or other repositories. The `%pip install my_library` magic command installs `my_library` to all nodes in your currently attached cluster, yet does not interfere with other workloads on shared clusters.\n* Install non-Python libraries as [Cluster libraries](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html) as needed.\n* For more details, see [Libraries](https:\/\/docs.databricks.com\/libraries\/index.html). \n### Visualizations \nDatabricks Python notebooks have built-in support for many types of [visualizations](https:\/\/docs.databricks.com\/visualizations\/index.html). You can also use [legacy visualizations](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#visualizations-in-python). \nYou can also visualize data using third-party libraries; some are pre-installed in the Databricks Runtime, but you can install custom libraries as well. Popular options include: \n* [Bokeh](https:\/\/docs.databricks.com\/visualizations\/bokeh.html)\n* [Matplotlib](https:\/\/docs.databricks.com\/visualizations\/matplotlib.html)\n* [Plotly](https:\/\/docs.databricks.com\/visualizations\/plotly.html) \n### Jobs \nYou can automate Python workloads as scheduled or triggered [jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) in Databricks. Jobs can run notebooks, Python scripts, and Python wheel files. 
\n* For details on creating a job via the UI, see [Create a job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#create-a-job).\n* The [Databricks SDKs](https:\/\/docs.databricks.com\/dev-tools\/index-sdk.html) allow you to create, edit, and delete jobs programmatically.\n* The [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) provides a convenient command line interface for automating jobs. \nTip \nTo schedule a Python script instead of a notebook, use the `spark_python_task` field under `tasks` in the body of a create job request. \n### Machine learning \nDatabricks supports a wide variety of machine learning (ML) workloads, including traditional ML on tabular data, deep learning for computer vision and natural language processing, recommendation systems, graph analytics, and more. For general information about machine learning on Databricks, see [AI and Machine Learning on Databricks](https:\/\/docs.databricks.com\/machine-learning\/index.html). \nFor ML algorithms, you can use pre-installed libraries in Databricks Runtime for Machine Learning, which includes popular Python tools such as scikit-learn, TensorFlow, Keras, PyTorch, Apache Spark MLlib, and XGBoost. You can also [install custom libraries](https:\/\/docs.databricks.com\/libraries\/index.html). \nFor machine learning operations (MLOps), Databricks provides a managed service for the open source library MLflow. With [MLflow Tracking](https:\/\/docs.databricks.com\/mlflow\/tracking.html) you can record model development and save models in reusable formats. You can use the [MLflow Model Registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html) to manage and automate the promotion of models towards production. 
[Jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) and [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) allow hosting models as batch and streaming jobs and as REST endpoints. For more information and examples, see the [ML lifecycle management using MLflow](https:\/\/docs.databricks.com\/mlflow\/index.html) or the [MLflow Python API docs](https:\/\/mlflow.org\/docs\/latest\/python_api\/index.html). \nTo get started with common machine learning workloads, see the following pages: \n* Training scikit-learn and tracking with MLflow: [10-minute tutorial: machine learning on Databricks with scikit-learn](https:\/\/docs.databricks.com\/mlflow\/end-to-end-example.html)\n* Training deep learning models: [Deep learning](https:\/\/docs.databricks.com\/machine-learning\/train-model\/deep-learning.html)\n* Hyperparameter tuning: [Parallelize hyperparameter tuning with scikit-learn and MLflow](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-spark-mlflow-integration.html)\n* Graph analytics: [GraphFrames user guide - Python](https:\/\/docs.databricks.com\/integrations\/graphframes\/user-guide-python.html) \n### IDEs, developer tools, and SDKs \nIn addition to developing Python code within Databricks notebooks, you can develop externally using integrated development environments (IDEs) such as PyCharm, Jupyter, and Visual Studio Code. To synchronize work between external development environments and Databricks, there are several options: \n* **Code**: You can synchronize code using Git. See [Git integration with Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html).\n* **Libraries and Jobs**: You can create libraries (such as Python wheel files) externally and upload them to Databricks. Those libraries may be imported within Databricks notebooks, or they can be used to create jobs. 
See [Libraries](https:\/\/docs.databricks.com\/libraries\/index.html) and [Create and run Databricks Jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html).\n* **Remote machine execution**: You can run code from your local IDE for interactive development and testing. The IDE can communicate with Databricks to execute Apache Spark and large computations on Databricks clusters. See [Databricks Connect](https:\/\/docs.databricks.com\/dev-tools\/databricks-connect\/index.html). \nDatabricks provides a set of SDKs which support automation and integration with external tooling. You can use the Databricks SDKs to manage resources like clusters and libraries, code and other workspace objects, workloads and jobs, and more. See the [Databricks SDKs](https:\/\/docs.databricks.com\/dev-tools\/index-sdk.html). \nFor more information on IDEs, developer tools, and SDKs, see [Developer tools and guidance](https:\/\/docs.databricks.com\/dev-tools\/index.html). \n### Additional resources \n* The [Databricks Academy](https:\/\/databricks.com\/learn\/training\/home) offers self-paced and instructor-led courses on many topics.\n* Features that support interoperability between PySpark and pandas \n+ [pandas function APIs](https:\/\/docs.databricks.com\/pandas\/pandas-function-apis.html)\n+ [pandas user-defined functions](https:\/\/docs.databricks.com\/udf\/pandas.html)\n+ [Convert between PySpark and pandas DataFrames](https:\/\/docs.databricks.com\/pandas\/pyspark-pandas-conversion.html)\n* Python and SQL database connectivity \n+ The [Databricks SQL Connector for Python](https:\/\/docs.databricks.com\/dev-tools\/python-sql-connector.html) allows you to use Python code to run SQL commands on Databricks resources.\n+ [pyodbc](https:\/\/docs.databricks.com\/dev-tools\/pyodbc.html) allows you to connect from your local Python code through ODBC to data stored in the Databricks lakehouse.\n* FAQs and tips for moving Python workloads to Databricks \n+ [Knowledge 
Base](https:\/\/kb.databricks.com\/python-aws)\n\n","doc_uri":"https:\/\/docs.databricks.com\/languages\/python.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Run Delta Live Tables pipelines\n##### Run an update on a Delta Live Tables pipeline\n\nThis article explains what a Delta Live Tables pipeline update is and how to run one. \nAfter you create a pipeline and are ready to run it, you start an *update*. A pipeline update does the following: \n* Starts a cluster with the correct configuration.\n* Discovers all the tables and views defined, and checks for any analysis errors such as invalid column names, missing dependencies, and syntax errors.\n* Creates or updates tables and views with the most recent data available. \nYou can check for problems in a pipeline\u2019s source code without waiting for tables to be created or updated using a [Validate update](https:\/\/docs.databricks.com\/delta-live-tables\/updates.html#validate-update). The `Validate` feature is useful when developing or testing pipelines by allowing you to quickly find and fix errors in your pipeline, such as incorrect table or column names. 
\nTo learn how to create a pipeline, see [Tutorial: Run your first Delta Live Tables pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/updates.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Run Delta Live Tables pipelines\n##### Run an update on a Delta Live Tables pipeline\n###### Start a pipeline update\n\nDatabricks provides several options to start pipeline updates, including the following: \n* In the Delta Live Tables UI, you have the following options: \n+ Click the ![Delta Live Tables Start Icon](https:\/\/docs.databricks.com\/_images\/dlt-start-button.png) button on the pipeline details page.\n+ From the pipelines list, click ![Right Arrow Icon](https:\/\/docs.databricks.com\/_images\/right-arrow.png) in the **Actions** column.\n* To start an update in a notebook, click **Delta Live Tables > Start** in the notebook toolbar. See [Open or run a Delta Live Tables pipeline from a notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-dlt-pipeline.html).\n* You can trigger pipelines programmatically using the API or CLI. See [Delta Live Tables API guide](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html).\n* You can schedule the pipeline as a job using the Delta Live Tables UI or the jobs UI. See [Schedule a pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/updates.html#pipeline-schedule).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/updates.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Run Delta Live Tables pipelines\n##### Run an update on a Delta Live Tables pipeline\n###### How Delta Live Tables updates tables and views\n\nThe tables and views updated, and how those tables and views are updated, depend on the update type: \n* **Refresh all**: All live tables are updated to reflect the current state of their input data sources. 
For all streaming tables, new rows are appended to the table.\n* **Full refresh all**: All live tables are updated to reflect the current state of their input data sources. For all streaming tables, Delta Live Tables attempts to clear all data from each table and then load all data from the streaming source.\n* **Refresh selection**: The behavior of `refresh selection` is identical to `refresh all`, but allows you to refresh only selected tables. Selected live tables are updated to reflect the current state of their input data sources. For selected streaming tables, new rows are appended to the table.\n* **Full refresh selection**: The behavior of `full refresh selection` is identical to `full refresh all`, but allows you to perform a full refresh of only selected tables. Selected live tables are updated to reflect the current state of their input data sources. For selected streaming tables, Delta Live Tables attempts to clear all data from each table and then load all data from the streaming source. \nFor existing live tables, an update has the same behavior as a SQL `REFRESH` on a materialized view. For new live tables, the behavior is the same as a SQL `CREATE` operation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/updates.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Run Delta Live Tables pipelines\n##### Run an update on a Delta Live Tables pipeline\n###### Start a pipeline update for selected tables\n\nYou may want to reprocess data for only selected tables in your pipeline. For example, during development, you only change a single table and want to reduce testing time, or a pipeline update fails and you want to refresh only the [failed tables](https:\/\/docs.databricks.com\/delta-live-tables\/updates.html#refresh-failed). \nNote \nYou can use selective refresh with only triggered pipelines. \nTo start an update that refreshes selected tables only, on the **Pipeline details** page: \n1. 
Click **Select tables for refresh**. The **Select tables for refresh** dialog appears. \nIf you do not see the **Select tables for refresh** button, make sure the **Pipeline details** page displays the latest update, and the update is complete. If a DAG is not displayed for the latest update, for example, because the update failed, the **Select tables for refresh** button is not displayed.\n2. To select the tables to refresh, click on each table. The selected tables are highlighted and labeled. To remove a table from the update, click on the table again.\n3. Click **Refresh selection**. \nNote \nThe **Refresh selection** button displays the number of selected tables in parentheses. \nTo reprocess data that has already been ingested for the selected tables, click ![Blue Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret-blue.png) next to the **Refresh selection** button and click **Full Refresh selection**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/updates.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Run Delta Live Tables pipelines\n##### Run an update on a Delta Live Tables pipeline\n###### Start a pipeline update for failed tables\n\nIf a pipeline update fails because of errors in one or more tables in the pipeline graph, you can start an update of only failed tables and any downstream dependencies. \nNote \nExcluded tables are not refreshed, even if they depend on a failed table. \nTo update failed tables, on the **Pipeline details** page, click **Refresh failed tables**. \nTo update only selected failed tables: \n1. Click ![Button Down](https:\/\/docs.databricks.com\/_images\/button-down.png) next to the **Refresh failed tables** button and click **Select tables for refresh**. The **Select tables for refresh** dialog appears.\n2. To select the tables to refresh, click on each table. The selected tables are highlighted and labeled. 
To remove a table from the update, click on the table again.\n3. Click **Refresh selection**. \nNote \nThe **Refresh selection** button displays the number of selected tables in parentheses. \nTo reprocess data that has already been ingested for the selected tables, click ![Blue Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret-blue.png) next to the **Refresh selection** button and click **Full Refresh selection**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/updates.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Run Delta Live Tables pipelines\n##### Run an update on a Delta Live Tables pipeline\n###### Check a pipeline for errors without waiting for tables to update\n\nPreview \nThe Delta Live Tables `Validate` update feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nTo check whether a pipeline\u2019s source code is valid without running a full update, use *Validate*. A `Validate` update resolves the definitions of datasets and flows defined in the pipeline but does not materialize or publish any datasets. Errors found during validation, such as incorrect table or column names, are reported in the UI. \nTo run a `Validate` update, on the pipeline details page click ![Blue Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret-blue.png) next to **Start** and click **Validate**. \nAfter the `Validate` update completes, the event log shows events related only to the `Validate` update, and no metrics are displayed in the DAG. If errors are found, details are available in the event log. \nYou can see results for only the most recent `Validate` update. If the `Validate` update was the most recently run update, you can see the results by selecting it in the [update history](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html#update-history). 
If another update is run after the `Validate` update, the results are no longer available in the UI.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/updates.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Run Delta Live Tables pipelines\n##### Run an update on a Delta Live Tables pipeline\n###### Continuous vs. triggered pipeline execution\n\nIf the pipeline uses the **triggered** execution mode, the system stops processing after successfully refreshing all tables or selected tables in the pipeline once, ensuring each table that is part of the update is updated based on the data available when the update started. \nIf the pipeline uses **continuous** execution, Delta Live Tables processes new data as it arrives in data sources to keep tables throughout the pipeline fresh. \nThe execution mode is independent of the type of table being computed. Both materialized views and streaming tables can be updated in either execution mode. To avoid unnecessary processing in continuous execution mode, pipelines automatically monitor dependent Delta tables and perform an update only when the contents of those dependent tables have changed. \n### Table comparing data pipeline execution modes \nThe following table highlights differences between these execution modes: \n| | Triggered | Continuous |\n| --- | --- | --- |\n| When does the update stop? | Automatically once complete. | Runs continuously until manually stopped. |\n| What data is processed? | Data available when the update is started. | All data as it arrives at configured sources. |\n| What data freshness requirements is this best for? | Data updates run every 10 minutes, hourly, or daily. | Data updates desired between every 10 seconds and a few minutes. | \nTriggered pipelines can reduce resource consumption and expense since the cluster runs only long enough to execute the pipeline. However, new data won\u2019t be processed until the pipeline is triggered. 
Continuous pipelines require an always-running cluster, which is more expensive but reduces processing latency. \nYou can configure execution mode with the **Pipeline mode** option in the settings.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/updates.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Run Delta Live Tables pipelines\n##### Run an update on a Delta Live Tables pipeline\n###### How to choose pipeline boundaries\n\nA Delta Live Tables pipeline can process updates to a single table, many tables with dependent relationships, many tables without relationships, or multiple independent flows of tables with dependent relationships. This section contains considerations to help determine how to break up your pipelines. \nLarger Delta Live Tables pipelines have a number of benefits. These include the following: \n* More efficiently use cluster resources.\n* Reduce the number of pipelines in your workspace.\n* Reduce the complexity of workflow orchestration. \nSome common recommendations on how processing pipelines should be split include the following: \n* Split functionality at team boundaries. For example, your data team may maintain pipelines to transform data while your data analysts maintain pipelines that analyze the transformed data.\n* Split functionality at application-specific boundaries to reduce coupling and facilitate the re-use of common functionality.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/updates.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Run Delta Live Tables pipelines\n##### Run an update on a Delta Live Tables pipeline\n###### Development and production modes\n\nYou can optimize pipeline execution by switching between development and production modes. 
Use the ![Delta Live Tables Environment Toggle Icon](https:\/\/docs.databricks.com\/_images\/dlt-env-toggle.png) buttons in the Pipelines UI to switch between these two modes. By default, pipelines run in development mode. \nWhen you run your pipeline in development mode, the Delta Live Tables system does the following: \n* Reuses a cluster to avoid the overhead of restarts. By default, clusters run for two hours when development mode is enabled. You can change this with the `pipelines.clusterShutdown.delay` setting in the [Configure your compute settings](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#cluster-config).\n* Disables pipeline retries so you can immediately detect and fix errors. \nIn production mode, the Delta Live Tables system does the following: \n* Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.\n* Retries execution in the event of specific errors, for example, a failure to start a cluster. \nNote \nSwitching between development and production modes only controls cluster and pipeline execution behavior. Storage locations and target schemas in the catalog for publishing tables must be configured as part of pipeline settings and are not affected when switching between modes.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/updates.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Run Delta Live Tables pipelines\n##### Run an update on a Delta Live Tables pipeline\n###### Schedule a pipeline\n\nYou can start a triggered pipeline manually or run the pipeline on a schedule with a Databricks [job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html). You can create and schedule a job with a single pipeline task directly in the Delta Live Tables UI or add a pipeline task to a multi-task workflow in the jobs UI. \nTo create a single-task job and a schedule for the job in the Delta Live Tables UI: \n1. 
Click **Schedule > Add a schedule**. The **Schedule** button is updated to show the number of existing schedules if the pipeline is included in one or more scheduled jobs, for example, **Schedule (5)**.\n2. Enter a name for the job in the **Job name** field.\n3. Set the **Schedule** to **Scheduled**.\n4. Specify the period, starting time, and time zone.\n5. Configure one or more email addresses to receive alerts on pipeline start, success, or failure.\n6. Click **Create**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/updates.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Load data using COPY INTO with Unity Catalog volumes or external locations\n\nLearn how to use `COPY INTO` to ingest data to Unity Catalog managed or external tables from any source and file format supported by COPY INTO. Unity Catalog adds new options for configuring secure access to raw data. You can use Unity Catalog volumes or external locations to access data in cloud object storage. \nDatabricks recommends using volumes to access files in cloud storage as part of the ingestion process using `COPY INTO`. For more information about recommendations for using volumes and external locations, see [Unity Catalog best practices](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html). \nThis article describes how to use the `COPY INTO` command to load data from an Amazon S3 (S3) bucket in your AWS account into a table in Databricks SQL. \nThe steps in this article assume that your admin has configured a Unity Catalog volume or external location so that you can access your source files in S3. 
If your admin configured a compute resource to use an AWS instance profile, see [Load data using COPY INTO with an instance profile](https:\/\/docs.databricks.com\/ingestion\/copy-into\/tutorial-dbsql.html) or [Tutorial: COPY INTO with Spark SQL](https:\/\/docs.databricks.com\/ingestion\/copy-into\/tutorial-notebook.html) instead. If your admin gave you temporary credentials (an AWS access key ID, a secret key, and a session token), see [Load data using COPY INTO with temporary credentials](https:\/\/docs.databricks.com\/ingestion\/copy-into\/temporary-credentials.html) instead.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/unity-catalog.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Load data using COPY INTO with Unity Catalog volumes or external locations\n##### Before you begin\n\nBefore you use `COPY INTO` to load data from a Unity Catalog volume or from a cloud object storage path that\u2019s defined as a Unity Catalog external location, you must have the following: \n* The `READ VOLUME` privilege on a volume or the `READ FILES` privilege on an external location. \nFor more information about creating volumes, see [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html). \nFor more information about creating external locations, see [Create an external location to connect cloud storage to Databricks](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html).\n* The path to your source data in the form of a cloud object storage URL or a volume path. \nExample cloud object storage URL: `s3:\/\/landing-bucket\/raw-data\/json`. \nExample volume path: `\/Volumes\/quickstart_catalog\/quickstart_schema\/quickstart_volume\/raw_data\/json`.\n* The `USE SCHEMA` privilege on the schema that contains the target table.\n* The `USE CATALOG` privilege on the parent catalog. 
\nFor more information about Unity Catalog privileges, see [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/unity-catalog.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Load data using COPY INTO with Unity Catalog volumes or external locations\n##### Load data from a volume\n\nTo load data from a Unity Catalog volume, you must have the `READ VOLUME` privilege. Volume privileges apply to all nested directories under the specified volume. \nFor example, if you have access to a volume with the path `\/Volumes\/quickstart_catalog\/quickstart_schema\/quickstart_volume\/`, the following commands are valid: \n```\nCOPY INTO landing_table\nFROM '\/Volumes\/quickstart_catalog\/quickstart_schema\/quickstart_volume\/raw_data'\nFILEFORMAT = PARQUET;\n\nCOPY INTO json_table\nFROM '\/Volumes\/quickstart_catalog\/quickstart_schema\/quickstart_volume\/raw_data\/json'\nFILEFORMAT = JSON;\n\n``` \nOptionally, you can also use a volume path with the dbfs scheme. For example, the following commands are also valid: \n```\nCOPY INTO landing_table\nFROM 'dbfs:\/Volumes\/quickstart_catalog\/quickstart_schema\/quickstart_volume\/raw_data'\nFILEFORMAT = PARQUET;\n\nCOPY INTO json_table\nFROM 'dbfs:\/Volumes\/quickstart_catalog\/quickstart_schema\/quickstart_volume\/raw_data\/json'\nFILEFORMAT = JSON;\n\n```\n\n#### Load data using COPY INTO with Unity Catalog volumes or external locations\n##### Load data using an external location\n\nThe following example loads data from S3 into a table using Unity Catalog external locations to provide access to the source data. 
\n```\nCOPY INTO my_json_data\nFROM 's3:\/\/landing-bucket\/json-data'\nFILEFORMAT = JSON;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/unity-catalog.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Load data using COPY INTO with Unity Catalog volumes or external locations\n##### External location privilege inheritance\n\nExternal location privileges apply to all nested directories under the specified location. \nFor example, if you have access to an external location defined with the URL `s3:\/\/landing-bucket\/raw-data`, the following commands are valid: \n```\nCOPY INTO landing_table\nFROM 's3:\/\/landing-bucket\/raw-data'\nFILEFORMAT = PARQUET;\n\nCOPY INTO json_table\nFROM 's3:\/\/landing-bucket\/raw-data\/json'\nFILEFORMAT = JSON;\n\n``` \nPermissions on this external location do not grant any privileges on directories above or parallel to the location specified. For example, **neither of the following commands is valid**: \n```\nCOPY INTO parent_table\nFROM 's3:\/\/landing-bucket'\nFILEFORMAT = PARQUET;\n\nCOPY INTO sibling_table\nFROM 's3:\/\/landing-bucket\/json-data'\nFILEFORMAT = JSON;\n\n```\n\n#### Load data using COPY INTO with Unity Catalog volumes or external locations\n##### Three-level namespace for target tables\n\nYou can target a Unity Catalog table using a three-level identifier (`<catalog_name>.<database_name>.<table_name>`). You can use the `USE CATALOG <catalog_name>` and `USE <database_name>` commands to set the default catalog and database for your current query or notebook.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/unity-catalog.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Configure Databricks S3 commit service-related settings\n\nDatabricks runs a commit service that coordinates writes to Amazon S3 from multiple clusters. 
This service runs in the Databricks [control plane](https:\/\/docs.databricks.com\/getting-started\/overview.html). For additional security, you can disable the service\u2019s direct upload optimization as described in [Disable the direct upload optimization](https:\/\/docs.databricks.com\/security\/network\/classic\/s3-commit-service.html#direct-upload-optimization). To further restrict access to your S3 buckets, see [Additional bucket security restrictions](https:\/\/docs.databricks.com\/security\/network\/classic\/s3-commit-service.html#additional-security). \nIf you receive AWS GuardDuty alerts related to the S3 commit service, see [AWS GuardDuty alerts related to S3 commit service](https:\/\/docs.databricks.com\/security\/network\/classic\/s3-commit-service.html#guardduty-alerts).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/s3-commit-service.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Configure Databricks S3 commit service-related settings\n###### About the commit service\n\nThe S3 commit service helps guarantee consistency of writes across multiple clusters on a single table in specific cases. For example, the commit service helps [Delta Lake](https:\/\/docs.databricks.com\/delta\/index.html) implement ACID transactions. \nIn the default configuration, Databricks sends temporary AWS credentials from the compute plane to the control plane in the commit service API call. Instance profile credentials are valid for six hours. \nThe compute plane writes data directly to S3, and then the S3 commit service in the control plane provides concurrency control by finalizing the commit log upload (completing the multipart upload described below). The commit service does not read any data from S3. It puts a new file in S3 if it does not exist. 
\nThe most common data that is written to S3 by the Databricks commit service is the Delta log, which contains statistical aggregates from your data, such as the column\u2019s minimum and maximum values. Most Delta log data is sent to S3 from the [control plane](https:\/\/docs.databricks.com\/getting-started\/overview.html) using an [Amazon S3 multipart upload](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/mpuoverview.html). \nAfter the cluster stages the multipart data to write the Delta log to S3, the S3 commit service in the Databricks control plane finishes the S3 multipart upload by letting S3 know that it is complete. As a performance optimization for very small updates, by default the commit service sometimes pushes small updates directly from the control plane to S3. This direct upload optimization can be disabled. See [Disable the direct upload optimization](https:\/\/docs.databricks.com\/security\/network\/classic\/s3-commit-service.html#direct-upload-optimization). \nIn addition to Delta Lake, the following Databricks features use the same S3 commit service: \n* [Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/index.html)\n* [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html)\n* [The SQL command COPY INTO](https:\/\/docs.databricks.com\/ingestion\/copy-into\/index.html) \nThe commit service is necessary because Amazon doesn\u2019t provide an operation that puts an object only if it does not yet exist. Amazon S3 is a distributed system. If S3 receives multiple write requests for the same object simultaneously, it overwrites all but the last object written. 
Without the ability to centrally verify commits, simultaneous commits from different clusters would corrupt tables.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/s3-commit-service.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Configure Databricks S3 commit service-related settings\n###### AWS GuardDuty alerts related to S3 commit service\n\nImportant \nCommits to tables managed by [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) do not trigger GuardDuty alerts. \nIf you use [AWS GuardDuty](https:\/\/aws.amazon.com\/guardduty\/) and you access data using AWS IAM instance profiles, GuardDuty may create alerts for default Databricks behavior related to Delta Lake, Structured Streaming, Auto Loader, or `COPY INTO`. These alerts are related to instance credential exfiltration detection, which is enabled by default. These alerts include the title `UnauthorizedAccess:IAMUser\/InstanceCredentialExfiltration.InsideAWS`. \nYou can configure your Databricks deployment to address GuardDuty alerts related to the S3 commit service by creating an [AWS instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html) that assumes the role of your original S3 data access IAM role. \nAs an alternative to using instance profile credentials, this new instance profile can configure clusters to assume a role with short duration tokens. This capability already exists in all recent Databricks Runtime versions and can be enforced globally via [cluster policies](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html). \n1. If you have not already done so, create a normal [instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html) to access the S3 data. This instance profile uses instance profile credentials to directly access the S3 data. 
\nThis section refers to the role ARN in this instance profile as the `<data-role-arn>`.\n2. Create a new instance profile that uses tokens and references your instance profile that directly accesses the data. Your cluster will reference this new token-based instance profile. See [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html). \nThis instance profile does not need any direct S3 access. Instead it needs only the permissions to assume the IAM role that you use for data access. This section refers to the role ARN in this instance profile as the `<cluster-role-arn>`. \n1. Add an attached IAM policy on the new cluster instance profile IAM role (`<cluster-role-arn>`). Add the following policy statement to your new cluster instance profile IAM role and replace `<data-role-arn>` with the role ARN of your original instance profile that accesses your bucket. \n```\n{\n\"Effect\": \"Allow\",\n\"Action\": \"sts:AssumeRole\",\n\"Resource\": \"<data-role-arn>\"\n}\n\n```\n2. Add a trust policy statement to your existing data access IAM role and replace `<cluster-role-arn>` with the role ARN of the new token-based instance profile that your cluster references. \n```\n{\n\"Effect\": \"Allow\",\n\"Principal\": {\n\"AWS\": \"<cluster-role-arn>\"\n},\n\"Action\": \"sts:AssumeRole\"\n}\n\n```\n3. To use notebook code that makes a direct connection to S3 without using DBFS, configure your clusters to use the new token-based instance profile and to assume the data access role. \n* Configure a cluster for S3 access to all buckets. 
Add the following to the cluster\u2019s Spark configuration: \n```\nfs.s3a.credentialsType AssumeRole\nfs.s3a.stsAssumeRole.arn <data-role-arn>\n\n```\n* You can configure this for a specific bucket: \n```\nfs.s3a.bucket.<bucket-name>.aws.credentials.provider org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider\nfs.s3a.bucket.<bucket-name>.assumed.role.arn <data-role-arn>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/s3-commit-service.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Configure Databricks S3 commit service-related settings\n###### Disable the direct upload optimization\n\nAs a performance optimization for very small updates, by default the commit service sometimes pushes small updates directly from the control plane to S3. To disable this optimization, set the Spark parameter `spark.hadoop.fs.s3a.databricks.s3commit.directPutFileSizeThreshold` to `0`. You can apply this setting in the cluster\u2019s Spark config or set it using cluster policies. \nDisabling this feature may result in a small performance impact for near real-time Structured Streaming queries with constant small updates. Consider testing the performance impact with your data before disabling this feature in production.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/s3-commit-service.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Configure Databricks S3 commit service-related settings\n###### Additional bucket security restrictions\n\nThe following bucket policy configurations further restrict access to your S3 buckets. \nNeither of these changes affects GuardDuty alerts. \n* **Limit the bucket access to specific IP addresses and S3 operations.** If you are interested in additional controls on your bucket, you can limit specific S3 buckets to be accessible only from specific IP addresses. 
For example, you can restrict access to only your own environment and the IP addresses for the Databricks control plane, including the S3 commit service. See [Restrict access to your S3 buckets](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#s3-restrict-access). This configuration limits the risk that credentials are used from other locations.\n* **Limit S3 operation types outside the required directories**. You can deny access from the Databricks control plane to your S3 bucket outside the required directories for the S3 commit service. You also can limit the operations in those directories to just the required S3 operations `put` and `list` from Databricks IP addresses. The Databricks control plane (including the S3 commit service) does not require `get` access on the bucket. \n```\n{\n\"Sid\": \"LimitCommitServiceActions\",\n\"Effect\": \"Deny\",\n\"Principal\": \"*\",\n\"NotAction\": [\n\"s3:ListBucket\",\n\"s3:GetBucketLocation\",\n\"s3:PutObject\"\n],\n\"Resource\": [\n\"arn:aws:s3:::<bucket-name>\/*\",\n\"arn:aws:s3:::<bucket-name>\"\n],\n\"Condition\": {\n\"IpAddress\": {\n\"aws:SourceIp\": \"<control-plane-ip>\"\n}\n}\n},\n{\n\"Sid\": \"LimitCommitServicePut\",\n\"Effect\": \"Deny\",\n\"Principal\": \"*\",\n\"Action\": \"s3:PutObject\",\n\"NotResource\": [\n\"arn:aws:s3:::<bucket-name>\/*_delta_log\/*\",\n\"arn:aws:s3:::<bucket-name>\/*_spark_metadata\/*\",\n\"arn:aws:s3:::<bucket-name>\/*offsets\/*\",\n\"arn:aws:s3:::<bucket-name>\/*sources\/*\",\n\"arn:aws:s3:::<bucket-name>\/*sinks\/*\",\n\"arn:aws:s3:::<bucket-name>\/*_schemas\/*\"\n],\n\"Condition\": {\n\"IpAddress\": {\n\"aws:SourceIp\": \"<control-plane-ip>\"\n}\n}\n},\n{\n\"Sid\": \"LimitCommitServiceList\",\n\"Effect\": \"Deny\",\n\"Principal\": \"*\",\n\"Action\": \"s3:ListBucket\",\n\"Resource\": \"arn:aws:s3:::<bucket-name>\",\n\"Condition\": {\n\"StringNotLike\": {\n\"s3:Prefix\": 
[\n\"*_delta_log\/*\",\n\"*_spark_metadata\/*\",\n\"*offsets\/*\",\n\"*sources\/*\",\n\"*sinks\/*\",\n\"*_schemas\/*\"\n]\n},\n\"IpAddress\": {\n\"aws:SourceIp\": \"<control-plane-ip>\"\n}\n}\n}\n\n``` \nReplace `<control-plane-ip>` with your [regional IP address for the Databricks control plane](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#required-ips-and-storage-buckets). Replace `<bucket-name>` with your S3 bucket name.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/s3-commit-service.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Troubleshoot common sharing issues in Delta Sharing\n\nThe following sections describe common errors that might occur when you try to access data in a share.\n\n### Troubleshoot common sharing issues in Delta Sharing\n#### Resource limit exceeded errors\n\n**Issue**: Your query on a shared table returns the error `RESOURCE_LIMIT_EXCEEDED`. \nYou may see either of these errors: \n* `\"RESOURCE_LIMIT_EXCEEDED\",\"message\":\"The table metadata size exceeded limits\"`\n* `\"RESOURCE_LIMIT_EXCEEDED\",\"message\":\"The number of files in the table to return exceeded limits, consider contact your provider to optimize the table\"` \n**Possible causes**: There are limits on the number of files in metadata allowed for a shared table. \n**Recommended fix**: To learn how to resolve either of these issues, see [RESOURCE\\_LIMIT\\_EXCEEDED error when querying a Delta Sharing table](https:\/\/kb.databricks.com\/resource_limit_exceeded-error-when-querying-a-delta-sharing-table) in the Databricks Knowledge Base.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/troubleshooting.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Troubleshoot common sharing issues in Delta Sharing\n#### AWS S3 bucket name issue\n\n**Issue**: You see an error message that throws a file not found or certificate exception. 
\nSpark error example: \n```\nFileReadException: Error while reading file delta-sharing:\/%252Ftmp%252Fexample.share%2523example.tpc_ds.example\/XXXXXXXXXXXXX\/XXXXXXXX.\n\nCaused by: SSLPeerUnverifiedException: Certificate for - <[workspace name].cloud.databricks.com.s3.us-east-1.amazonaws.com> doesn't match any of the subject alternative names [s3.amazonaws.com, *.s3.amazonaws.com\u2026]:\n\n``` \nPandas error example: \n```\nFileNotFoundError(path)\nFileNotFoundError: https:\/\/xxxx.xxxxxx.s3.xx-xxxx-1.amazonaws.com\/xxxxxx\/part-00000-xxxxx-Amz-Algorithm=Axxxxxx-Amz-Date=xxxxxxxx&X-Amz-SignedHeaders=host&X-Amz-Expires=xxx&X-Amz-Credential=xxxxxxx_request&X-Amz-Signature=xxxxx\n\n``` \nPower BI error example: \n```\nDataSource.Error: The underlying connection was closed: Could not establish trust relationship for the SSL\/TLS secure channel.\nDetails:\nhttps:\/\/xxxx.xxxxxxxxx.s3.xx-xxxx-1.amazonaws.com\/xxxxxxxx\/part-00000-xxxxxxx.snappy.parquet\n\n``` \n**Possible cause**: Typically you see this error because your bucket name uses dot or period notation (for example, `incorrect.bucket.name.notation`). This is an AWS limitation. See the [AWS bucket naming rules](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/bucketnamingrules.html). \nYou may get this error even if your bucket name is formatted correctly. For example, you may encounter an SSL error (`SSLCertVerificationError`) when you execute code on PyCharm. \n**Recommended fix**: If your bucket name uses invalid AWS bucket naming notation, use a different bucket for Unity Catalog and Delta Sharing. 
\nIf your bucket uses valid naming conventions and you still face a `FileNotFoundError` in Python, enable debug logging to help isolate the issue: \n```\nimport logging\nlogging.basicConfig(level=logging.DEBUG)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/troubleshooting.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Troubleshoot common sharing issues in Delta Sharing\n#### Vacuumed data file issue\n\n**Issue**: You see an error message that throws a \u201c404 The specified [path|key] does not exist\u201d exception. \nSpark error examples: \n```\njava.lang.Throwable: HTTP request failed with status: HTTP\/1.1 404 The specified path does not exist.\n\n``` \nor \n```\nHTTP request failed with status: HTTP\/1.1 404 Not Found <?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>NoSuchKey<\/Code><Message>The specified key does not exist.<\/Message>\n\n``` \n**Possible cause**: Typically you see this error because the data file corresponding to the pre-signed URL is vacuumed in the shared table and the data file belongs to a historical table version. \n**Workaround**: Query the latest snapshot.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/troubleshooting.html"} +{"content":"# Compute\n## Use compute\n#### Run shell commands in Databricks web terminal\n\nDatabricks web terminal provides a convenient and highly interactive way for you to run shell commands and use editors, such as Vim or Emacs, on the Spark driver node. Unlike using [SSH](https:\/\/docs.databricks.com\/archive\/compute\/configure.html#ssh-access), web terminal can be used by many users on one compute and does not require setting up keys. Example uses of the web terminal include monitoring resource usage and installing Linux packages. \nWeb terminal is disabled by default for all workspace users. \nEnabling [Docker Container Services](https:\/\/docs.databricks.com\/compute\/custom-containers.html) disables web terminal. 
\nWarning \nDatabricks proxies the web terminal service from port 7681 on the compute\u2019s Spark driver. This web proxy is intended for use only with the web terminal. If the port is occupied when the compute starts or if there is otherwise a conflict, the web terminal may not work as expected. If other web services are launched on port 7681, compute users may be exposed to potential security exploits. Databricks is not responsible for any issues that result from the installation of unsupported software on a compute.\n\n#### Run shell commands in Databricks web terminal\n##### Requirements\n\n* [CAN ATTACH TO](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions) permission on a compute.\n* Your Databricks workspace must have web terminal [enabled](https:\/\/docs.databricks.com\/admin\/clusters\/web-terminal.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/web-terminal.html"} +{"content":"# Compute\n## Use compute\n#### Run shell commands in Databricks web terminal\n##### Launch the web terminal\n\nYou can launch the web terminal from the compute detail page or from a notebook. \n* To launch web terminal from the compute detail page, click the **Apps** tab and then click **Web Terminal**. A new tab opens with the web terminal UI and the Bash prompt.\n* To launch web terminal from a notebook, click the attached compute drop-down, hover over the attached compute, then click **Web Terminal**. The web terminal opens in a panel at the bottom of the screen. 
\nIn the web terminal panel in the notebook, you can use the buttons at the upper-right of the panel to do the following: \n+ Open a new terminal session in a new tab ![open a new terminal session](https:\/\/docs.databricks.com\/_images\/new-terminal-session.png).\n+ Reload a terminal session ![reload terminal session](https:\/\/docs.databricks.com\/_images\/reload-terminal-session.png).\n+ Close the bottom panel ![close bottom panel](https:\/\/docs.databricks.com\/_images\/close-bottom-panel.png). To reopen the panel, click ![reopen bottom panel](https:\/\/docs.databricks.com\/_images\/reopen-bottom-panel.png) at the bottom of the right sidebar.\n\n#### Run shell commands in Databricks web terminal\n##### Use web terminal\n\nIn the web terminal, you can run commands as root inside the container of the compute driver node. \nEach user can have up to 100 active web terminal sessions (tabs) open. Idle web terminal sessions may time out and the web terminal web application will reconnect, resulting in a new shell process. If you want to keep your Bash session, Databricks recommends using [tmux](https:\/\/www.man7.org\/linux\/man-pages\/man1\/tmux.1.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/web-terminal.html"} +{"content":"# Compute\n## Use compute\n#### Run shell commands in Databricks web terminal\n##### Limitations\n\n* Databricks does not support running Spark jobs from the web terminal. In addition, Databricks web terminal is not available in the following compute types: \n+ Job compute\n+ Compute launched with the `DISABLE_WEB_TERMINAL=true` environment variable set.\n+ Compute launched with [access mode](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode) set to **Shared**.\n+ Compute launched with the Spark configuration `spark.databricks.pyspark.enableProcessIsolation` set to `true`.\n* There is a hard limit of 12 hours since the initial page load, after which any connection, even if active, will be terminated. 
You can refresh the web terminal to reconnect. Databricks recommends using [tmux](https:\/\/www.man7.org\/linux\/man-pages\/man1\/tmux.1.html) to preserve your shell session. \n* Enabling [Docker Container Services](https:\/\/docs.databricks.com\/compute\/custom-containers.html) disables web terminal.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/web-terminal.html"} +{"content":"# Security and compliance guide\n","doc_uri":"https:\/\/docs.databricks.com\/security\/secrets\/index.html"} +{"content":"# Security and compliance guide\n### Secret management\n\nSometimes accessing data requires that you authenticate to external data sources through JDBC.\nInstead of directly entering your credentials into a notebook, use Databricks secrets to store your\ncredentials and reference them in notebooks and jobs. To manage secrets, you can use the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) to access the [Secrets API](https:\/\/docs.databricks.com\/api\/workspace\/secrets). \nWarning \nAdministrators, secret creators, and users granted [permission](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#secrets) can read Databricks secrets. While\nDatabricks makes an effort to redact secret values that might be displayed in notebooks, it is not possible to prevent such users from reading secrets. For more information, see [Secret redaction](https:\/\/docs.databricks.com\/security\/secrets\/redaction.html). \nTo set up secrets you: \n1. [Create a secret scope](https:\/\/docs.databricks.com\/security\/secrets\/secret-scopes.html). Secret scope names are case insensitive.\n2. [Add secrets to the scope](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html). Secret names are case insensitive.\n3. 
If you have the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons), [assign access control](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#secrets) to the secret scope. \nThis guide shows you how to perform these setup tasks and manage secrets. For more information, see: \n* An end-to-end [example](https:\/\/docs.databricks.com\/security\/secrets\/example-secret-workflow.html) of how to use secrets in your workflows.\n* Reference for the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html).\n* Reference for the [Secrets API](https:\/\/docs.databricks.com\/api\/workspace\/secrets).\n* How to use [Secrets utility (dbutils.secrets)](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#dbutils-secrets) to reference secrets in notebooks and jobs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/secrets\/index.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Optimize stateful processing in Delta Live Tables with watermarks\n\nTo effectively manage the data kept in state, use watermarks when performing stateful stream processing in Delta Live Tables, including aggregations, joins, and deduplication. This article describes how to use watermarks in your Delta Live Tables queries and includes examples of the recommended operations. \nNote \nTo ensure queries that perform aggregations are processed incrementally and not fully recomputed with each update, you must use watermarks.\n\n##### Optimize stateful processing in Delta Live Tables with watermarks\n###### What is a watermark?\n\nIn stream processing, a *watermark* is an Apache Spark feature that can define a time-based threshold for processing data when performing stateful operations such as aggregations. 
Data arriving is processed until the threshold is reached, at which point the time window defined by the threshold is closed. Watermarks can be used to avoid problems during query processing, mainly when processing larger datasets or long-running processing. These problems can include high latency in producing results and even out-of-memory (OOM) errors because of the amount of data kept in state during processing. Because streaming data is inherently unordered, watermarks also support correctly calculating operations like time-window aggregations. \nTo learn more about using watermarks in stream processing, see [Watermarking in Apache Spark Structured Streaming](https:\/\/www.databricks.com\/blog\/2022\/08\/22\/feature-deep-dive-watermarking-apache-spark-structured-streaming.html) and [Apply watermarks to control data processing thresholds](https:\/\/docs.databricks.com\/structured-streaming\/watermarks.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/stateful-processing.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Optimize stateful processing in Delta Live Tables with watermarks\n###### How do you define a watermark?\n\nYou define a watermark by specifying a timestamp field and a value representing the time threshold for *late data* to arrive. Data is considered late if it arrives after the defined time threshold. For example, if the threshold is defined as 10 minutes, records arriving after the 10-minute threshold might be dropped. \nBecause records that arrive after the defined threshold might be dropped, selecting a threshold that meets your latency vs. correctness requirements is important. Choosing a smaller threshold results in records being emitted sooner but also means late records are more likely to be dropped. A larger threshold means a longer wait but possibly more completeness of data. 
Because of the larger state size, a larger threshold might also require additional computing resources. Because the threshold value depends on your data and processing requirements, testing and monitoring your processing is important to determine an optimal threshold. \nYou use the `withWatermark()` function in Python to define a watermark. In SQL, use the `WATERMARK` clause to define a watermark: \n```\nwithWatermark(\"timestamp\", \"3 minutes\")\n\n``` \n```\nWATERMARK timestamp DELAY OF INTERVAL 3 MINUTES\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/stateful-processing.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Optimize stateful processing in Delta Live Tables with watermarks\n###### Use watermarks with stream-stream joins\n\nFor stream-stream joins, you must define a watermark on both sides of the join and a time interval clause. Because each join source has an incomplete view of the data, the time interval clause is required to tell the streaming engine when no further matches can be made. The time interval clause must use the same fields used to define the watermarks. \nBecause there might be times when each stream requires different thresholds for watermarks, the streams do not need to have the same thresholds. To avoid missing data, the streaming engine maintains one global watermark based on the slowest stream. \nThe following example joins a stream of ad impressions and a stream of user clicks on ads. In this example, a click must occur within 3 minutes of the impression. After the 3-minute time interval passes, rows from the state that can no longer be matched are dropped. 
\n```\nimport dlt\nfrom pyspark.sql.functions import expr\n\ndlt.create_streaming_table(\"adImpressionClicks\")\n\n@dlt.append_flow(target = \"adImpressionClicks\")\ndef joinClicksAndImpressions():\n    # Watermark both streams so state that can no longer match is dropped.\n    clicksDf = (dlt.read_stream(\"rawClicks\")\n        .withWatermark(\"clickTimestamp\", \"3 minutes\")\n    )\n    impressionsDf = (dlt.read_stream(\"rawAdImpressions\")\n        .withWatermark(\"impressionTimestamp\", \"3 minutes\")\n    )\n    joinDf = impressionsDf.alias(\"imp\").join(\n        clicksDf.alias(\"click\"),\n        expr(\"\"\"\n            imp.userId = click.userId AND\n            clickAdId = impressionAdId AND\n            clickTimestamp >= impressionTimestamp AND\n            clickTimestamp <= impressionTimestamp + interval 3 minutes\n        \"\"\"),\n        \"inner\"\n    ).select(\"imp.userId\", \"impressionAdId\", \"clickTimestamp\", \"impressionSeconds\")\n\n    return joinDf\n\n``` \n```\nCREATE OR REFRESH STREAMING TABLE\nsilver.adImpressionClicks\nAS SELECT\nimp.userId, impressionAdId, clickTimestamp, impressionSeconds\nFROM STREAM\n(LIVE.bronze.rawAdImpressions)\nWATERMARK\nimpressionTimestamp DELAY OF INTERVAL 3 MINUTES imp\nINNER JOIN STREAM\n(LIVE.bronze.rawClicks)\nWATERMARK clickTimestamp DELAY OF INTERVAL 3 MINUTES click\nON\nimp.userId = click.userId\nAND\nclickAdId = impressionAdId\nAND\nclickTimestamp >= impressionTimestamp\nAND\nclickTimestamp <= impressionTimestamp + interval 3 minutes\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/stateful-processing.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Optimize stateful processing in Delta Live Tables with watermarks\n###### Perform windowed aggregations with watermarks\n\nA common stateful operation on streaming data is a windowed aggregation. Windowed aggregations are similar to grouped aggregations, except that aggregate values are returned for the set of rows that are part of the defined window. \nA window can be defined as a certain length, and an aggregation operation can be performed on all rows that are part of that window. 
Spark Streaming supports several types of windows, including: \n* **Tumbling (fixed) windows**: A series of fixed-sized, non-overlapping, and contiguous time intervals. An input record belongs to only a single window.\n* **Sliding windows**: Similar to tumbling windows, sliding windows are fixed-sized, but windows can overlap, and a record can fall into multiple windows. \nWhen data arrives past the end of the window plus the length of the watermark, no new data is accepted for the window, the result of the aggregation is emitted, and the state for the window is dropped. \nThe following example calculates a sum of impressions every 5 minutes using a fixed window. In this example, the select clause uses the alias `impressions_window`, and then the window itself is defined as part of the `GROUP BY` clause. The window must be based on the same timestamp column as the watermark, the `clickTimestamp` column in this example. \n```\nCREATE OR REFRESH STREAMING TABLE\ngold.adImpressionSeconds\nAS SELECT\nimpressionAdId, impressions_window, sum(impressionSeconds) as totalImpressionSeconds\nFROM STREAM\n(LIVE.silver.adImpressionClicks)\nWATERMARK\nclickTimestamp DELAY OF INTERVAL 3 MINUTES\nGROUP BY\nimpressionAdId, window(clickTimestamp, \"5 minutes\")\n\n``` \nA similar example in Python to calculate profit over hourly fixed windows: \n```\nimport dlt\nfrom pyspark.sql.functions import expr, window\n\n@dlt.table()\ndef profit_by_hour():\n    return (\n        dlt.read_stream(\"sales\")\n        .withWatermark(\"timestamp\", \"1 hour\")\n        # Group rows into non-overlapping one-hour windows.\n        .groupBy(window(\"timestamp\", \"1 hour\").alias(\"time\"))\n        .agg(expr(\"sum(profit) AS profit\"))\n    )\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/stateful-processing.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Optimize stateful processing in Delta Live Tables with watermarks\n###### Deduplicate streaming records\n\nStructured Streaming has exactly-once processing guarantees but does not 
automatically de-duplicate records from data sources. For example, because many message queues have at-least-once guarantees, duplicate records should be expected when reading from one of these message queues. You can use the `dropDuplicatesWithinWatermark()` function to de-duplicate records on any specified field, removing duplicates from a stream even if some fields differ (such as event time or arrival time). You must specify a watermark to use the `dropDuplicatesWithinWatermark()` function. All duplicate data that arrives within the time range specified by the watermark is dropped. \nOrdered data is important because out-of-order data causes the watermark value to jump ahead incorrectly. Then, when older data arrives, it is considered late and dropped. Use the `withEventTimeOrder` option to process the initial snapshot in order based on the timestamp specified in the watermark. The `withEventTimeOrder` option can be declared in the code defining the dataset or in the [pipeline settings](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html) using `spark.databricks.delta.withEventTimeOrder.enabled`. For example: \n```\n{\n\"spark_conf\": {\n\"spark.databricks.delta.withEventTimeOrder.enabled\": \"true\"\n}\n}\n\n``` \nNote \nThe `withEventTimeOrder` option is supported only with Python. \nIn the following example, data is processed in order by `clickTimestamp`, and records arriving within 5 seconds of each other that contain duplicate `userId` and `clickAdId` columns are dropped. 
\n```\nclicksDedupDf = (\n    spark.readStream\n    .option(\"withEventTimeOrder\", \"true\")\n    .table(\"rawClicks\")\n    .withWatermark(\"clickTimestamp\", \"5 seconds\")\n    .dropDuplicatesWithinWatermark([\"userId\", \"clickAdId\"])\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/stateful-processing.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Optimize stateful processing in Delta Live Tables with watermarks\n###### Optimize pipeline configuration for stateful processing\n\nTo help prevent production issues and excessive latency, Databricks recommends enabling RocksDB-based state management for your stateful stream processing, particularly if your processing requires saving a large amount of intermediate state. To enable the RocksDB state store, see [Enable RocksDB state store for Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#rocksdb).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/stateful-processing.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write protocol buffers\n\nDatabricks provides native support for serialization and deserialization between Apache Spark structs and protocol buffers (protobuf). Protobuf support is implemented as an Apache Spark DataFrame transformer and can be used with Structured Streaming or for batch operations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/protocol-buffers.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write protocol buffers\n##### How to deserialize and serialize protocol buffers\n\nIn Databricks Runtime 12.2 LTS and above, you can use the `from_protobuf` and `to_protobuf` functions to deserialize and serialize data. Protobuf serialization is commonly used in streaming workloads. 
\nThe basic syntax for protobuf functions is similar for read and write functions. You must import these functions before use. \n`from_protobuf` casts a binary column to a struct, and `to_protobuf` casts a struct column to binary. You must provide either a schema registry specified with the `options` argument or a descriptor file identified by the `descFilePath` argument. \n```\nfrom_protobuf(data: 'ColumnOrName', messageName: Optional[str] = None, descFilePath: Optional[str] = None, options: Optional[Dict[str, str]] = None)\n\nto_protobuf(data: 'ColumnOrName', messageName: Optional[str] = None, descFilePath: Optional[str] = None, options: Optional[Dict[str, str]] = None)\n\n``` \n```\n\/\/ While using with Schema registry:\nfrom_protobuf(data: Column, options: Map[String, String])\n\n\/\/ Or with Protobuf descriptor file:\nfrom_protobuf(data: Column, messageName: String, descFilePath: String, options: Map[String, String])\n\n\/\/ While using with Schema registry:\nto_protobuf(data: Column, options: Map[String, String])\n\n\/\/ Or with Protobuf descriptor file:\nto_protobuf(data: Column, messageName: String, descFilePath: String, options: Map[String, String])\n\n``` \nThe following examples illustrate processing binary protobuf records with `from_protobuf()` and converting Spark SQL struct to binary protobuf with `to_protobuf()`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/protocol-buffers.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write protocol buffers\n##### Use protobuf with Confluent Schema Registry\n\nDatabricks supports using the [Confluent Schema Registry](https:\/\/docs.confluent.io\/platform\/current\/schema-registry\/index.html) to define Protobuf. 
\n```\nfrom pyspark.sql.protobuf.functions import to_protobuf, from_protobuf\n\nschema_registry_options = {\n\"schema.registry.subject\" : \"app-events-value\",\n\"schema.registry.address\" : \"https:\/\/schema-registry:8081\/\"\n}\n\n# Convert binary Protobuf to SQL struct with from_protobuf():\nproto_events_df = (\ninput_df\n.select(\nfrom_protobuf(\"proto_bytes\", options = schema_registry_options)\n.alias(\"proto_event\")\n)\n)\n\n# Convert SQL struct to binary Protobuf with to_protobuf():\nprotobuf_binary_df = (\nproto_events_df\n.selectExpr(\"struct(name, id, context) as event\")\n.select(\nto_protobuf(\"event\", options = schema_registry_options)\n.alias(\"proto_bytes\")\n)\n)\n\n``` \n```\nimport org.apache.spark.sql.protobuf.functions._\nimport scala.collection.JavaConverters._\n\nval schemaRegistryOptions = Map(\n\"schema.registry.subject\" -> \"app-events-value\",\n\"schema.registry.address\" -> \"https:\/\/schema-registry:8081\/\"\n)\n\n\/\/ Convert binary Protobuf to SQL struct with from_protobuf():\nval protoEventsDF = inputDF\n.select(\nfrom_protobuf($\"proto_bytes\", options = schemaRegistryOptions.asJava)\n.as(\"proto_event\")\n)\n\n\/\/ Convert SQL struct to binary Protobuf with to_protobuf():\nval protobufBinaryDF = protoEventsDF\n.selectExpr(\"struct(name, id, context) as event\")\n.select(\nto_protobuf($\"event\", options = schemaRegistryOptions.asJava)\n.as(\"proto_bytes\")\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/protocol-buffers.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write protocol buffers\n##### Authenticate to an external Confluent Schema Registry\n\nTo authenticate to an external Confluent Schema Registry, update your schema registry options to include auth credentials and API keys. 
\n```\nschema_registry_options = {\n\"schema.registry.subject\" : \"app-events-value\",\n\"schema.registry.address\" : \"https:\/\/remote-schema-registry-endpoint\",\n\"confluent.schema.registry.basic.auth.credentials.source\" : \"USER_INFO\",\n\"confluent.schema.registry.basic.auth.user.info\" : \"confluentApiKey:confluentApiSecret\"\n}\n\n``` \n```\nval schemaRegistryOptions = Map(\n\"schema.registry.subject\" -> \"app-events-value\",\n\"schema.registry.address\" -> \"https:\/\/remote-schema-registry-endpoint\",\n\"confluent.schema.registry.basic.auth.credentials.source\" -> \"USER_INFO\",\n\"confluent.schema.registry.basic.auth.user.info\" -> \"confluentApiKey:confluentApiSecret\"\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/protocol-buffers.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write protocol buffers\n##### Use truststore and keystore files in Unity Catalog volumes\n\nIn Databricks Runtime 14.3 LTS and above, you can use truststore and keystore files in Unity Catalog volumes to authenticate to a Confluent Schema Registry. 
Update your schema registry options according to the following example: \n```\nschema_registry_options = {\n\"schema.registry.subject\" : \"app-events-value\",\n\"schema.registry.address\" : \"https:\/\/remote-schema-registry-endpoint\",\n\"confluent.schema.registry.ssl.truststore.location\" : \"\/Volumes\/<catalog_name>\/<schema_name>\/<volume_name>\/kafka.client.truststore.jks\",\n\"confluent.schema.registry.ssl.truststore.password\" : \"<password>\",\n\"confluent.schema.registry.ssl.keystore.location\" : \"\/Volumes\/<catalog_name>\/<schema_name>\/<volume_name>\/kafka.client.keystore.jks\",\n\"confluent.schema.registry.ssl.keystore.password\" : \"<password>\",\n\"confluent.schema.registry.ssl.key.password\" : \"<password>\"\n}\n\n``` \n```\nval schemaRegistryOptions = Map(\n\"schema.registry.subject\" -> \"app-events-value\",\n\"schema.registry.address\" -> \"https:\/\/remote-schema-registry-endpoint\",\n\"confluent.schema.registry.ssl.truststore.location\" -> \"\/Volumes\/<catalog_name>\/<schema_name>\/<volume_name>\/kafka.client.truststore.jks\",\n\"confluent.schema.registry.ssl.truststore.password\" -> \"<password>\",\n\"confluent.schema.registry.ssl.keystore.location\" -> \"\/Volumes\/<catalog_name>\/<schema_name>\/<volume_name>\/kafka.client.keystore.jks\",\n\"confluent.schema.registry.ssl.keystore.password\" -> \"<password>\",\n\"confluent.schema.registry.ssl.key.password\" -> \"<password>\"\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/protocol-buffers.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write protocol buffers\n##### Use Protobuf with a descriptor file\n\nYou can also reference a protobuf descriptor file that is available to your compute cluster. Make sure you have proper permissions to read the file, depending on its location. 
\n```\nfrom pyspark.sql.protobuf.functions import to_protobuf, from_protobuf\n\ndescriptor_file = \"\/path\/to\/proto_descriptor.desc\"\n\nproto_events_df = (\ninput_df.select(\nfrom_protobuf(input_df.value, \"BasicMessage\", descFilePath=descriptor_file).alias(\"proto\")\n)\n)\n\nproto_binary_df = (\nproto_events_df\n.select(\nto_protobuf(proto_events_df.proto, \"BasicMessage\", descriptor_file).alias(\"bytes\")\n)\n)\n\n``` \n```\nimport org.apache.spark.sql.protobuf.functions._\n\nval descriptorFile = \"\/path\/to\/proto_descriptor.desc\"\n\nval protoEventsDF = inputDF\n.select(\nfrom_protobuf($\"value\", \"BasicMessage\", descFilePath=descriptorFile).as(\"proto\")\n)\n\nval protoBytesDF = protoEventsDF\n.select(\nto_protobuf($\"proto\", \"BasicMessage\", descriptorFile).as(\"bytes\")\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/protocol-buffers.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write protocol buffers\n##### Supported options in Protobuf functions\n\nThe following options are supported in Protobuf functions. \n* **mode**: Determines how errors while deserializing Protobuf records are handled. The errors might be caused by various types of malformed records including a mismatch between the actual schema of the record and the expected schema provided in `from_protobuf()`. \n+ **Values**: \n- `FAILFAST`(default): An error is thrown when a malformed record is encountered and the task fails.\n- `PERMISSIVE`: A NULL is returned for malformed records. Use this option carefully since it can result in dropping many records. This is useful when a small fraction of the records in the source are incorrect.\n* **recursive.fields.max.depth**: Adds support for recursive fields. Spark SQL schemas do not support recursive fields. When this option is not specified, recursive fields are not permitted. 
To support recursive fields in Protobufs, they need to be expanded to a specified depth. \n+ **Values**: \n- -1 (default): Recursive fields are not allowed.\n- 0: Recursive fields are dropped.\n- 1: Allows a single level of recursion.\n- [2-10]: Specify a threshold for multiple recursion, up to 10 levels. \nSetting a value greater than 0 allows recursive fields by expanding the nested fields to the configured depth. Values larger than 10 are not allowed in order to avoid inadvertently creating very large schemas. If a Protobuf message has depth beyond the configured limit, the Spark struct returned is truncated after the recursion limit.\n+ **Example**: Consider a Protobuf with the following recursive field: \n```\nmessage Person { string name = 1; Person friend = 2; }\n\n``` \nThe following lists the end schema with different values for this setting: \n- Option set to 1: `STRUCT<name: STRING>`\n- Option set to 2: `STRUCT<name: STRING, friend: STRUCT<name: STRING>>`\n- Option set to 3: `STRUCT<name: STRING, friend: STRUCT<name: STRING, friend: STRUCT<name: STRING>>>`\n* **convert.any.fields.to.json**: This option enables converting Protobuf [Any](https:\/\/protobuf.dev\/programming-guides\/proto3\/#any) fields to JSON. This feature should be enabled carefully. JSON conversion and processing are inefficient. In addition, the JSON string field loses Protobuf schema safety, making downstream processing prone to errors. \n+ **Values**: \n- False (default): At runtime, such wildcard fields can contain arbitrary Protobuf messages as binary data. By default such fields are handled like a normal Protobuf message. It has two fields with schema `(STRUCT<type_url: STRING, value: BINARY>)`. By default, the binary `value` field is not interpreted in any way. However, binary data can be inconvenient to work with in some applications.\n- True: Setting this value to True enables converting `Any` fields to JSON strings at runtime. 
With this option, the binary is parsed and the Protobuf message is deserialized into a JSON string.\n+ **Example**: Consider two Protobuf types defined as follows: \n```\nmessage ProtoWithAny {\nstring event_name = 1;\ngoogle.protobuf.Any details = 2;\n}\n\nmessage Person {\nstring name = 1;\nint32 id = 2;\n}\n\n``` \nWith this option enabled, the schema for `from_protobuf(\"col\", messageName =\"ProtoWithAny\")` would be: `STRUCT<event_name: STRING, details: STRING>`. \nAt run time, if `details` field contains `Person` Protobuf message, the returned value looks like this: `('click', '{\"@type\":\"type.googleapis.com\/...ProtoWithAny\",\"name\":\"Mario\",\"id\":100}')`.\n+ **Requirements**: \n- The definitions for all the possible Protobuf types that are used in `Any` fields should be available in the Protobuf descriptor file passed to `from_protobuf()`.\n- If `Any` Protobuf is not found, it will result in an error for that record.\n- This feature is currently not supported with schema-registry.\n* **emit.default.values**: Enables rendering fields with zero values when deserializing Protobuf to a Spark struct. This option should be used sparingly. It is usually not advisable to depend on such finer differences in semantics. \n+ **Values** \n- False (default): When a field is empty in the serialized Protobuf, the resulting field in the Spark struct is by default null. 
It is simpler to not enable this option and treat `null` as the default value.\n- True: When this option is enabled, such fields are filled with corresponding default values.\n+ **Example**: Consider the following Protobuf definition, with a message constructed like `Person(age=0, middle_name=\"\")`: \n```\nsyntax = \"proto3\";\n\nmessage Person {\nstring name = 1;\nint64 age = 2;\noptional string middle_name = 3;\noptional int64 salary = 4;\n}\n\n``` \n- With this option set to False, the Spark struct after calling `from_protobuf()` would be mostly nulls: `{\"name\": null, \"age\": null, \"middle_name\": \"\", \"salary\": null}`. Even though `age` had a value set, Protobuf does not include it in the wire format because it is the default value. The `middle_name` field is preserved because it is declared `optional` and was explicitly set.\n- With this option set to True, the Spark struct after calling `from_protobuf()` would be: `{\"name\": \"\", \"age\": 0, \"middle_name\": \"\", \"salary\": null}`. The `salary` field remains null since it is explicitly declared `optional` and it is not set in the input record.\n* **enums.as.ints**: When enabled, enum fields in Protobuf are rendered as integer fields in Spark. \n+ **Values** \n- False (default)\n- True: When enabled, enum fields in Protobuf are rendered as integer fields in Spark.\n+ **Example**: Consider the following Protobuf: \n```\nsyntax = \"proto3\";\n\nmessage Person {\nenum Job {\nNONE = 0;\nENGINEER = 1;\nDOCTOR = 2;\nNURSE = 3;\n}\nJob job = 1;\n}\n\n``` \nGiven a Protobuf message like `Person(job = ENGINEER)`: \n- With this option disabled, the corresponding Spark struct would be `{\"job\": \"ENGINEER\"}`.\n- With this option enabled, the corresponding Spark struct would be `{\"job\": 1}`. Notice that the schema for these fields is different in each case (integer rather than default string). 
Such a change can affect the schema of downstream tables.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/protocol-buffers.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write protocol buffers\n##### Schema Registry Options\n\nThe following schema registry options are relevant while using schema registry with Protobuf functions. \n* **schema.registry.subject** \n+ Required\n+ Specifies subject for schema in Schema Registry, such as \u201cclient-event\u201d\n* **schema.registry.address** \n+ Required\n+ URL for schema registry, such as `https:\/\/schema-registry.example.com:8081`\n* **schema.registry.protobuf.name** \n+ Optional\n+ Default: `<NONE>`.\n+ A schema-registry entry for a subject can contain multiple Protobuf definitions, just like a single `proto` file. When this option is not specified, the first Protobuf is used for the schema. Specify the name of the Protobuf message when it is not the first one in the entry. For example, consider an entry with two Protobuf definitions: \u201cPerson\u201d and \u201cLocation\u201d in that order. If the stream corresponds to \u201cLocation\u201d rather than \u201cPerson\u201d, set this option to \u201cLocation\u201d (or its full name including package \u201ccom.example.protos.Location\u201d).\n* **schema.registry.schema.evolution.mode** \n+ Default: \u201crestart\u201d.\n+ Supported modes: \n- \u201crestart\u201d\n- \u201cnone\u201d\n+ This option sets schema-evolution mode for `from_protobuf()`. At the start of a query, Spark records the latest schema-id for the given subject. This determines the schema for `from_protobuf()`. A new schema might be published to the schema registry after the query starts. When a newer schema-id is noticed in an incoming record, it indicates a change to the schema. This option determines how such a change to schema is handled: \n- **restart** (default): Triggers an `UnknownFieldException` when a newer schema-id is noticed. 
This terminates the query. Databricks recommends configuring workflows to restart on query failure to pick up schema changes.\n- **none**: Schema-id changes are ignored. The records with newer schema-id are parsed with the same schema that was observed at the start of the query. Newer Protobuf definitions are expected to be backward compatible, and new fields are ignored.\n* **confluent.schema.registry.`<schema-registy-client-option>`** \n+ Optional\n+ Schema-registry connects to Confluent schema-registry using the [Confluent Schema Registry client](https:\/\/docs.confluent.io\/platform\/current\/schema-registry\/develop\/using.html). Any configuration options supported by the client can be specified with the prefix \u201cconfluent.schema.registry\u201d. For example, the following two settings provide \u201cUSER\\_INFO\u201d authentication credentials: \n- \u201cconfluent.schema.registry.basic.auth.credentials.source\u201d: \u2018USER\\_INFO\u2019\n- \u201cconfluent.schema.registry.basic.auth.user.info\u201d: \u201c`<KEY>` : `<SECRET>`\u201d\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/protocol-buffers.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization in Databricks SQL\n\nDatabricks has built-in support for charts and visualizations in both Databricks SQL and in notebooks. This page describes how to work with visualizations in Databricks SQL. For information about using visualizations in notebooks, see [Visualizations in Databricks notebooks](https:\/\/docs.databricks.com\/visualizations\/index.html). \nTo view the types of visualizations, see [visualization types](https:\/\/docs.databricks.com\/visualizations\/visualization-types.html). 
\nImportant \nFor information about a preview version of Databricks charts, see [preview chart visualizations](https:\/\/docs.databricks.com\/visualizations\/preview-chart-visualizations.html).\n\n#### Visualization in Databricks SQL\n##### Create a visualization\n\n1. Run the following query in SQL editor. \n```\nUSE CATALOG SAMPLES;\nSELECT\nhour(tpep_dropoff_datetime) as dropoff_hour,\nCOUNT(*) AS num\nFROM samples.nyctaxi.trips\nWHERE pickup_zip IN ({{pickupzip}})\nGROUP BY 1\n\n``` \n![Add visualization](https:\/\/docs.databricks.com\/_images\/add-visualization.png)\n2. After running a query, in the **Results** panel, click **+** and then select **Visualization**.\n3. In the **Visualization Type** drop-down, choose **Bar**.\n4. Enter a visualization name, such as **Dropoff Rates**.\n5. Review the visualization properties. \n![Configure chart](https:\/\/docs.databricks.com\/_images\/configure-chart.png)\n6. Click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/visualizations\/index.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization in Databricks SQL\n##### Visualization tools\n\nIf you hover over the top right of a chart in the visualization editor, a Plotly toolbar displays where you can perform operations such as select, zoom, and pan. \n![Plotly toolbar](https:\/\/docs.databricks.com\/_images\/plotly-bar.png) \nIf you do not see the toolbar, your administrator has [disabled toolbar display](https:\/\/docs.databricks.com\/admin\/workspace-settings\/appearance.html) for your Databricks SQL instance.\n\n#### Visualization in Databricks SQL\n##### Temporarily hide or show only a series\n\nTo hide a series in a visualization, click the series in the legend. To show the series again, click it again in the legend. \nTo show only a single series, double-click the series in the legend. 
To show other series, click each one.\n\n#### Visualization in Databricks SQL\n##### Clone a visualization\n\nTo clone a visualization: \n1. Open the visualization in the SQL editor.\n2. Click ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) in the visualization\u2019s tab (not ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) next to **Edit Visualization**).\n3. Click **Duplicate**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/visualizations\/index.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization in Databricks SQL\n##### Enable aggregation in a visualization\n\nFor bar, line, area, pie and heatmap charts, you add aggregation directly in the visualization rather than modifying the query to add an aggregation column. This approach has the following advantages: \n* You don\u2019t need to modify the underlying SQL.\n* You can quickly perform scenario-based data analysis on the fly by modifying the aggregation.\n* The aggregation applies to the entire data set, not just the first 64,000 rows displayed in a table. \nAggregation is available in the following visualizations: \n* Line\n* Bar\n* Area\n* Pie\n* Heatmap\n* Histogram \nAggregations do not support combination visualizations, such as displaying a line and bars in the same chart. To create a new combination chart, [clone](https:\/\/docs.databricks.com\/sql\/user\/visualizations\/index.html#clone-a-visualization) a legacy visualization. \nTable visualizations display only the first 64,000 rows. \nTo aggregate Y-axis columns for a visualization: \n1. From the SQL editor, create a new visualization or edit an existing one. \nIf you see the message `This visualization uses an old configuration. New visualizations support aggregating data directly within the editor`, you must re-create the visualization before you can use aggregation.\n2. 
Next to the Y-axis columns, select the aggregation type from the following for numeric types: \n* Sum (the default)\n* Average\n* Count\n* Count Distinct\n* Max\n* Min\n* Median \nOr from the following for string types: \n* Count\n* Count Distinct \nYour changes are applied to the preview of the visualization.\n3. Click **Save**.\n4. The visualization displays the number of rows that it aggregates. \nIn some cases, you may not want to use aggregation on Y-axis columns. To disable aggregation, click on the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) next to **Y columns** and uncheck **Use aggregation**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/visualizations\/index.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization in Databricks SQL\n##### Customize colors for a visualization\n\nNote \nBy default, if a legacy dashboard uses a custom color palette, visualization color choices are ignored. To override this setting, see [Use a different color palette for a visualization](https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html#override-colors-legacy). \nYou can customize a visualization\u2019s colors when you create the visualization or by editing it. \n1. Create or edit a visualization.\n2. Click **Colors**.\n3. To modify a color, click the square and select the new color by doing one of the following: \n* Click it in the color selector.\n* Enter a hex value.\n4. Click anywhere outside the color selector to close it and save changes.\n\n#### Visualization in Databricks SQL\n##### Add a visualization to a dashboard\n\n1. Click the vertical ellipsis ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) button beneath the visualization.\n2. Select **+ Add to Dashboard**.\n3. Enter a dashboard name. A list of matching dashboards displays.\n4. Select a dashboard. 
\n![Choose dashboard](https:\/\/docs.databricks.com\/_images\/choose-dashboard.png)\n5. Click **OK**. A pop-up displays with a link to the dashboard. \n![Added to dashboard](https:\/\/docs.databricks.com\/_images\/added-to-dashboard.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/visualizations\/index.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization in Databricks SQL\n##### Download a visualization as a CSV, TSV, or Excel file\n\nTo download a visualization as a CSV, TSV, or Excel file, click the vertical ellipsis ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) button next to the visualization name and select the type of download desired. If the visualization uses aggregations, the downloaded results are also aggregated. The downloaded results are from the most recent execution of the query that created the visualization. \n![download tab delimited](https:\/\/docs.databricks.com\/_images\/download-visualization-tab-delimited.png)\n\n#### Visualization in Databricks SQL\n##### Download a chart visualization as an image file\n\nTo download a local image file of a chart visualization, display the [visualization tools](https:\/\/docs.databricks.com\/sql\/user\/visualizations\/index.html#visualization-tools) and click the camera icon. \n![Download visualization as image](https:\/\/docs.databricks.com\/_images\/download-visualization.png) \nA png file is downloaded to your device.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/visualizations\/index.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n#### LangChain on Databricks for LLM development\n\nImportant \nThese are experimental features and the API definitions might change. 
\nThis article describes the LangChain integrations that facilitate the development and deployment of large language models (LLMs) on Databricks. \nWith these LangChain integrations you can: \n* Seamlessly load data from a PySpark DataFrame with the PySpark DataFrame loader.\n* Interactively query your data using natural language with the Spark DataFrame Agent or Databricks SQL Agent.\n* Wrap your Databricks served model as a large language model (LLM) in LangChain.\n\n#### LangChain on Databricks for LLM development\n##### What is LangChain?\n\nLangChain is a software framework designed to help create applications that utilize large language models (LLMs). LangChain\u2019s strength lies in its wide array of integrations and capabilities. It includes API wrappers, web scraping subsystems, code analysis tools, document summarization tools, and more. It also supports large language models from OpenAI, Anthropic, HuggingFace, etc. out of the box along with various data sources and types. \nLangChain is available as an experimental MLflow flavor which allows LangChain customers to leverage the robust tools and experiment tracking capabilities of MLflow directly from the Databricks environment. See the [LangChain flavor MLflow documentation](https:\/\/mlflow.org\/docs\/latest\/models.html#langchain-langchain-experimental).\n\n#### LangChain on Databricks for LLM development\n##### Requirements\n\n* Databricks Runtime 13.3 ML and above.\n* Databricks recommends pip installing the latest version of LangChain to ensure you have the most recent updates. 
\n+ `%pip install --upgrade langchain`\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/langchain.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n#### LangChain on Databricks for LLM development\n##### Load data with the PySpark DataFrame loader\n\nThe [PySpark DataFrame loader](https:\/\/python.langchain.com\/docs\/integrations\/document_loaders\/pyspark_dataframe) in LangChain simplifies loading data from a PySpark DataFrame with a single method. \n```\nfrom langchain.document_loaders import PySparkDataFrameLoader\n\nloader = PySparkDataFrameLoader(spark, wikipedia_dataframe, page_content_column=\"text\")\ndocuments = loader.load()\n\n``` \nThe following notebook showcases an example where the PySpark DataFrame loader is used to create a retrieval based chatbot that is logged with MLflow, which in turn allows the model to be interpreted as a generic Python function for inference with `mlflow.pyfunc.load_model()`. \n### PySpark DataFrame loader and MLFlow in Langchain notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/large-language-models\/pyspark-dataframe-loader-langchain.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/langchain.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n#### LangChain on Databricks for LLM development\n##### Spark DataFrame Agent\n\nThe Spark DataFrame Agent in LangChain allows interaction with a Spark DataFrame, optimized for question answering. LangChain\u2019s [Spark DataFrame Agent documentation](https:\/\/python.langchain.com\/docs\/integrations\/toolkits\/spark) provides a detailed example of how to create and use the Spark DataFrame Agent with a DataFrame. 
\n```\nfrom langchain.agents import create_spark_dataframe_agent\n\ndf = spark.read.csv(\"\/databricks-datasets\/COVID\/coronavirusdataset\/Region.csv\", header=True, inferSchema=True)\ndisplay(df)\n\nagent = create_spark_dataframe_agent(llm=OpenAI(temperature=0), df=df, verbose=True)\n...\n\n``` \nThe following notebook demonstrates how to create and use the Spark DataFrame Agent to help you gain insights on your data. \n### Use LangChain to interact with a Spark DataFrame notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/large-language-models\/spark-dataframe-agent-langchain.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/langchain.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n#### LangChain on Databricks for LLM development\n##### Databricks SQL Agent\n\nThe [Databricks SQL Agent](https:\/\/python.langchain.com\/docs\/integrations\/providers\/databricks#sql-database-agent-example) is a variant of the standard SQL Database Agent that LangChain provides and is considered a more powerful variant of the Spark DataFrame Agent. \nWith the Databricks SQL Agent any Databricks users can interact with a specified schema in Unity Catalog and generate insights on their data. \nImportant \nThe Databricks SQL Agent can only query tables, and does not create tables. \nIn the following example the database instance is created within the `SQLDatabase.from_databricks(catalog=\"...\", schema=\"...\")` command and the agent and required tools are created by `SQLDatabaseToolkit(db=db, llm=llm)` and `create_sql_agent(llm=llm, toolkit=toolkit, **kwargs)`, respectively. 
\n```\nfrom langchain.agents import create_sql_agent\nfrom langchain.agents.agent_toolkits import SQLDatabaseToolkit\nfrom langchain.sql_database import SQLDatabase\nfrom langchain import OpenAI\n\ndb = SQLDatabase.from_databricks(catalog=\"samples\", schema=\"nyctaxi\")\nllm = OpenAI(model_name=\"gpt-3.5-turbo-instruct\", temperature=.7)\ntoolkit = SQLDatabaseToolkit(db=db, llm=llm)\nagent = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)\n\nagent.run(\"What is the longest trip distance and how long did it take?\")\n\n``` \nNote \nOpenAI models require a paid subscription, if the free subscription hits a rate limit. \nThe following notebook demonstrates how to create and use the Databricks SQL Agent to help you better understand the data in your database. \n### Use LangChain to interact with a SQL database notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/large-language-models\/sql-database-agent-langchain.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/langchain.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n#### LangChain on Databricks for LLM development\n##### Wrap Databricks served models as LLMs\n\nIf you have an LLM that you created on Databricks, you can use it directly within LangChain in the place of OpenAI, HuggingFace, or any other LLM provider. \nThis integration supports two endpoint types: \n* [Model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) recommended for production and development.\n* Cluster driver proxy app, recommended for interactive development. \n### Wrap a model serving endpoint \nYou can wrap Databricks endpoints as LLMs in LangChain. 
To wrap a model serving endpoint as an LLM in LangChain you need: \n* A registered LLM deployed to a Databricks model serving endpoint.\n* CAN QUERY permission to the endpoint. \nModels often require or recommend important parameters, like `temperature` or `max_tokens`. The following example shows how to input those parameters with a deployed model named `falcon-7b-instruct`. Additional details can be found in the [Wrapping a serving endpoint](https:\/\/python.langchain.com\/docs\/integrations\/llms\/databricks#wrapping-a-serving-endpoint) LangChain documentation. \n```\nfrom langchain.llms import Databricks\n\nllm = Databricks(endpoint_name=\"falcon-7b-instruct\", model_kwargs={\"temperature\": 0.1, \"max_tokens\": 100})\nllm(\"How are you?\")\n\n``` \n### Wrap a cluster driver proxy application \nTo wrap a cluster driver proxy application as an LLM in LangChain you need: \n* An LLM loaded on a Databricks interactive cluster in \u201csingle user\u201d or \u201cno isolation shared\u201d mode.\n* A local HTTP server running on the driver node to serve the model at \u201c\/\u201d using HTTP POST with JSON input\/output.\n* An app that uses a port number between [3000, 8000] and listens to the driver IP address or simply 0.0.0.0 instead of `localhost` only.\n* The CAN ATTACH TO permission to the cluster. \nSee the LangChain documentation for [Wrapping a cluster driver proxy app](https:\/\/python.langchain.com\/docs\/integrations\/llms\/databricks#wrapping-a-cluster-driver-proxy-app) for an example.\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/langchain.html"} +{"content":"# \n### Environments\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. 
\n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html)\n\n### Environments\n#### Context: RAG development workflows\n\nThe RAG development workflow requires having users interact with early versions of the application to provide feedback. Without this feedback, it is difficult to iterate on and improve the application\u2019s quality. \nAllowing users to interact with a RAG Application requires you to deploy infrastructure, such as a web-based chat-app and the corresponding REST API endpoint behind the app, and allow that infrastructure to access your production data. Although this can create a blurry line between development and production, it is critically important that developers maintain a clean separation between these environments. \nFurther, it is ideal if the same code base can be used in development and production (e.g., no code changes, only configuration changes). This is the standard paradigm in full stack application development. Without a shared code base, developers are forced to maintain and sync separate code bases - leading to the risk that the application exhibits online\/offline skew, and, in the worst case, fails to deliver high quality answers in production.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/environments.html"} +{"content":"# \n### Environments\n#### How environments work in RAG Studio\n\nRAG Studio addresses this challenge through the concept of `Environments`. \n`Environments` enable you to have a **single code base** that is by-default deployable via CI\/CD with only configuration changes between dev and prod. RAG Studio\u2019s `Environments` are based on Databricks Asset Bundles concept of [targets](https:\/\/docs.databricks.com\/dev-tools\/bundles\/settings.html#targets). \nTo understand how `Environments` work, let\u2019s review the components of a RAG Studio application. 
\nNote \nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** RAG Studio only supports a single Workspace for all environments. Support for RAG Studio applications that use multiple Workspaces is on our roadmap. \nEach RAG Application comes with 3 pre-configured environments: \n1. **End Users Environment** `end_users` Used for production traffic\n2. **Reviewers Environment** `reviewers` Used for testing versions with end users\/expert reviewers\n3. **Development Environment** `dev` Used for the inner development loop e.g., testing and rapid iteration \nOptionally, to support multiple developers working on the same application, you can configure more than 1 `dev` environment. \n| Category | Item | `dev` | `reviewers` | `end_users` |\n| --- | --- | --- | --- | --- |\n| Global Configuration | Databricks Workspace | Shared | Shared | Shared |\n| | Workspace Folder to host deployed code + config | Shared | Shared | Shared |\n| | MLflow Experiment | Shared | Shared | Shared |\n| | Unity Catalog Schema\\* | Shared | Shared | Shared |\n| `Version` | Code + Config\\*\\* | Shared | Shared | Shared |\n| Infrastructure to run an App Version | Vector Search Endpoint | Shared | Shared | Shared |\n| | `\ud83d\udd17 Chain` Model Serving Endpoint | Unique | Unique | Unique |\n| | <review-ui> Application | Unique | Unique | Unique | \n\\* Stores all Delta Tables, Vector Indexes, Models. Naming conventions differentiate items between environments (e.g., `logs_prod` vs. 
`logs_dev`)\n\\*\\* Logged as MLflow runs & model versions\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/environments.html"} +{"content":"# \n### Environments\n#### Configure environments in RAG Studio\n\nEnvironments are configured in `rag-config.yml` - view the `rag-config.yml` [configuration docs](https:\/\/docs.databricks.com\/rag-studio\/details\/request-log.html) for how-to.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/environments.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n### Set up query federation for non-Unity-Catalog workspaces\n##### Query federation for MySQL in Databricks SQL without Unity Catalog (Experimental)\n\nExperimental \nThe configurations described in this article are [Experimental](https:\/\/docs.databricks.com\/release-notes\/release-types.html). Experimental features are provided as-is and are not supported by Databricks through customer technical support. **To get full query federation support, you should instead use [Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html), which enables your Databricks users to take advantage of Unity Catalog syntax and data governance tools.** \nThis article describes how to configure read-only query federation to MySQL on serverless and pro SQL warehouses. \nYou configure connections to MySQL at the table level. You can use [secrets](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/secret.html) to store and access text credentials without displaying them in plaintext. 
See the following example: \n```\nDROP TABLE IF EXISTS mysql_table;\nCREATE TABLE mysql_table\nUSING mysql\nOPTIONS (\ndbtable '<table-name>',\nhost '<database-host-url>',\nport '3306',\ndatabase '<database-name>',\nuser secret('mysql_creds', 'my_username'),\npassword secret('mysql_creds', 'my_password')\n);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/mysql-no-uc.html"} +{"content":"# Get started: Account and workspace setup\n## Best practice articles\n","doc_uri":"https:\/\/docs.databricks.com\/cheat-sheet\/administration.html"} +{"content":"# Get started: Account and workspace setup\n## Best practice articles\n#### Platform administration cheat sheet\n\nThis article aims to provide clear and opinionated guidance for account and workspace admins on recommended best practices. The following practices should be implemented by account or workspace admins to help optimize cost, observability, data governance, and security in their Databricks account. \nFor in-depth security best practices, see this PDF: [Databricks AWS Security Best Practices and Threat Model](https:\/\/cms.databricks.com\/sites\/default\/files\/2023-03\/Databricks-AWS-Security-Best-Practices-and-Threat-Model.pdf). \n| Best practice | Impact | Docs |\n| --- | --- | --- |\n| Enable Unity Catalog | **Data governance**: Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces. | * [Set up and manage Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html) |\n| Use cluster policies | **Cost**: Control costs with auto-termination (for all-purpose clusters), max cluster sizes, and instance type restrictions. **Observability**: Set `custom_tags` in your cluster policy to enforce tagging. **Security**: Restrict cluster access mode to only allow users to create Unity Catalog-enabled clusters to enforce data permissions. 
| * [Create and manage cluster policies](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html) * [Monitor cluster usage with tags](https:\/\/docs.databricks.com\/admin\/account-settings\/usage-detail-tags.html) |\n| Use Service Principals to connect to third-party software | **Security**: A service principal is a Databricks identity type that allows third-party services to authenticate directly to Databricks, not through an individual user\u2019s credentials. If something happens to an individual user\u2019s credentials, the third-party service won\u2019t be interrupted. | * [Create and manage service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) |\n| Set up SSO | **Security**: Instead of having users type their email and password to log into a workspace, set up Databricks SSO so users can authenticate via your identity provider. | * [Set up SSO for your workspace](https:\/\/docs.databricks.com\/security\/auth-authz\/index.html#sso) |\n| Set up SCIM integration | **Security**: Instead of adding users to Databricks manually, integrate with your identity provider to automate user provisioning and deprovisioning. When a user is removed from the identity provider, they are automatically removed from Databricks too. | * [Sync users and groups from your identity provider](https:\/\/docs.databricks.com\/admin\/users-groups\/scim\/index.html) |\n| Manage access control with account-level groups | **Data governance**: Create account-level groups so you can bulk control access to workspaces, resources, and data. This saves you from having to grant all users access to everything or grant individual users specific permissions. You can also sync groups from your identity provider to Databricks groups. 
| * [Manage groups](https:\/\/docs.databricks.com\/admin\/users-groups\/groups.html) * [Control access to resources](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html) * [Sync groups from your IdP to Databricks](https:\/\/docs.databricks.com\/admin\/users-groups\/best-practices.html#scim-provisioning) * [Data governance guide](https:\/\/docs.databricks.com\/data-governance\/index.html) |\n| Set up IP access for IP whitelisting | **Security**: IP access lists prevent users from accessing Databricks resources in unsecured networks. Accessing a cloud service from an unsecured network can pose security risks to an enterprise, especially when the user may have authorized access to sensitive or personal data. Make sure to set up IP access lists for your account console and workspaces. | * [Create IP access lists for workspaces](https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-workspace.html) * [Create IP access lists for the account console](https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-account.html) |\n| Configure a customer-managed VPC with regional endpoints | **Security**: You can use a customer-managed VPC to exercise more control over your network configurations to comply with specific cloud security and governance standards your organization might require. **Cost**: Regional VPC endpoints to AWS services have more direct connections and reduced cost compared to AWS global endpoints. | * [Customer-managed VPC](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html) |\n| Use Databricks Secrets or a cloud provider secrets manager | **Security**: Using Databricks secrets allows you to securely store credentials for external data sources. Instead of entering credentials directly into a notebook, you can simply reference a secret to authenticate to a data source. 
| * [Manage Databricks secrets](https:\/\/docs.databricks.com\/security\/secrets\/index.html) |\n| Set expiration dates on personal access tokens (PATs) | **Security**: Workspace admins can manage PATs for users, groups, and service principals. Setting expiration dates for PATs reduces the risk of lost tokens or long-lasting tokens that could lead to data exfiltration from the workspace. | * [Manage personal access tokens](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html) |\n| Use system tables to monitor account usage | **Observability**: System tables are a Databricks-hosted analytical store of your account\u2019s operational data, including audit logs, data lineage, and billable usage. You can use system tables for observability across your account. | * [Monitor usage with system tables](https:\/\/docs.databricks.com\/admin\/system-tables\/index.html) |\n\n","doc_uri":"https:\/\/docs.databricks.com\/cheat-sheet\/administration.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Configure access to resources from model serving endpoints\n\nThis article describes how to configure access to external and private resources from model serving endpoints. Model Serving supports plain text environment variables and secrets-based environment variables using Databricks [secrets](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html).\n\n#### Configure access to resources from model serving endpoints\n##### Requirements\n\nFor secrets-based environment variables, \n* The endpoint creator must have READ access to the Databricks secrets being referenced in the configs.\n* You must store credentials like your API key or other tokens as a Databricks secret.\n\n#### Configure access to resources from model serving endpoints\n##### Add plain text environment variables\n\nUse plain text environment variables to set variables that don\u2019t need to be hidden. 
You can set variables in the Serving UI or the REST API when you create or update an endpoint. \nFrom the Serving UI, you can add an environment variable in **Advanced configurations**: \n![Create a model serving endpoint](https:\/\/docs.databricks.com\/_images\/add-env-variable.png) \nThe following is an example for creating a serving endpoint using the REST API and the `environment_vars` field to configure your environment variable. \n```\nPOST \/api\/2.0\/serving-endpoints\n\n{\n\"name\": \"endpoint-name\",\n\"config\":{\n\"served_entities\": [{\n\"entity_name\": \"model-name\",\n\"entity_version\": \"1\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": \"true\",\n\"environment_vars\":{\n\"TEXT_ENV_VAR_NAME\": \"plain-text-env-value\"\n}\n}]\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/store-env-variable-model-serving.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Configure access to resources from model serving endpoints\n##### Add secrets-based environment variables\n\nYou can securely store credentials using Databricks secrets and reference those secrets in model serving using secrets-based environment variables. This allows credentials to be fetched from model serving endpoints at serving time. \nFor example, you can pass credentials to call OpenAI and other external model endpoints or access external data storage locations directly from model serving. \nDatabricks recommends this feature for deploying [OpenAI](https:\/\/mlflow.org\/docs\/latest\/python_api\/openai\/index.html) and [LangChain](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.langchain.html) MLflow model flavors to serving. It is also applicable to other SaaS models requiring credentials with the understanding that the access pattern is based on using environment variables and API keys and tokens. 
\n### Step 1: Create a secret scope \nDuring model serving, the secrets are retrieved from Databricks secrets by the secret scope and key. These get assigned to the secret environment variable names that can be used inside the model. \nFirst, create a secret scope. See [Secret scopes](https:\/\/docs.databricks.com\/security\/secrets\/secret-scopes.html). \nThe following are CLI commands: \n```\ndatabricks secrets create-scope my_secret_scope\n\n``` \nYou can then add your secret to a desired secret scope and key as shown below: \n```\ndatabricks secrets put-secret my_secret_scope my_secret_key\n\n``` \nThe secret information and the name of the environment variable can then be passed to your endpoint configuration during endpoint creation or as an update to the configuration of an existing endpoint. \n### Step 2: Add secret scopes to endpoint configuration \nYou can add the secret scope to an environment variable and pass that variable to your endpoint during endpoint creation or configuration updates. See [Create custom model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html). \nFrom the Serving UI, you can add an environment variable in **Advanced configurations**. The secrets-based environment variable must be provided using the following syntax: `{{secrets\/scope\/key}}`. Otherwise, the environment variable is considered a plain text environment variable. \n![Create a model serving endpoint](https:\/\/docs.databricks.com\/_images\/add-env-variable.png) \nThe following is an example for creating a serving endpoint using the REST API. During model serving endpoint creation and configuration updates, you are able to provide a list of secret environment variable specifications for each served model inside the API request using the `environment_vars` field. \nThe following example assigns the value from the secret created in the provided code to the environment variable `OPENAI_API_KEY`. 
\n```\nPOST \/api\/2.0\/serving-endpoints\n\n{\n\"name\": \"endpoint-name\",\n\"config\":{\n\"served_entities\": [{\n\"entity_name\": \"model-name\",\n\"entity_version\": \"1\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": \"true\",\n\"environment_vars\":{\n\"OPENAI_API_KEY\": \"{{secrets\/my_secret_scope\/my_secret_key}}\"\n}\n}]\n}\n}\n\n``` \nYou can also update a serving endpoint, as in the following REST API example: \n```\nPUT \/api\/2.0\/serving-endpoints\/{name}\/config\n\n{\n\"served_entities\": [{\n\"entity_name\": \"model-name\",\n\"entity_version\": \"2\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": \"true\",\n\"environment_vars\":{\n\"OPENAI_API_KEY\": \"{{secrets\/my_secret_scope\/my_secret_key}}\"\n}\n}]\n}\n\n``` \nAfter the endpoint is created or updated, model serving automatically fetches the secret key from the Databricks secrets scope and populates the environment variable for your model inference code to use.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/store-env-variable-model-serving.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Configure access to resources from model serving endpoints\n##### Notebook example\n\nSee the following notebook for an example of how to configure an OpenAI API key for a LangChain Retrieval QA Chain deployed behind the model serving endpoints with secret-based environment variables. 
\n### Configure access to resources from model serving endpoints notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/configure-access-resources-from-model-serving.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n#### Configure access to resources from model serving endpoints\n##### Additional resource\n\n* [Add an instance profile to a model serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/add-model-serving-instance-profile.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/store-env-variable-model-serving.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Develop Delta Live Tables pipelines with Databricks Asset Bundles\n\n*Databricks Asset Bundles*, also known simply as *bundles*, enable you to programmatically validate, deploy, and run Databricks resources such as Delta Live Tables pipelines. You can also use bundles to programmatically manage Databricks jobs and to work with MLOps Stacks. See [What are Databricks Asset Bundles?](https:\/\/docs.databricks.com\/dev-tools\/bundles\/index.html). \nThis article describes a set of steps that you can complete from your local development machine to use a bundle that programmatically manages a Delta Live Tables pipeline.\n\n##### Develop Delta Live Tables pipelines with Databricks Asset Bundles\n###### Requirements\n\n* Databricks CLI version 0.218 or above. To check your installed version of the Databricks CLI, run the command `databricks -v`. To install the Databricks CLI, see [Install or update the Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/install.html).\n* The remote workspace must have workspace files enabled. 
See [What are workspace files?](https:\/\/docs.databricks.com\/files\/workspace.html).\n\n##### Develop Delta Live Tables pipelines with Databricks Asset Bundles\n###### (Optional) Install a Python module to support local pipeline development\n\nDatabricks provides a Python module to assist your local development of Delta Live Tables pipeline code by providing syntax checking, autocomplete, and data type checking as you write code in your IDE. \nThe Python module for local development is available on PyPI. To install the module, see [Python stub for Delta Live Tables](https:\/\/pypi.org\/project\/databricks-dlt\/).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-bundles.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Develop Delta Live Tables pipelines with Databricks Asset Bundles\n###### Decision: Create the bundle by using a template or manually\n\nDecide whether you want to create the bundle using a template or manually: \n* [Create the bundle by using a template](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-bundles.html#create-the-bundle-by-using-a-template)\n* [Create the bundle manually](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-bundles.html#create-the-bundle-manually)\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-bundles.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Develop Delta Live Tables pipelines with Databricks Asset Bundles\n###### Create the bundle by using a template\n\nIn these steps, you create the bundle by using the Databricks default bundle template for Python. These steps guide you to create a bundle that consists of a notebook that defines a Delta Live Tables pipeline, which filters data from the original dataset. 
You then validate, deploy, and run the deployed pipeline within your Databricks workspace. \n### Step 1: Set up authentication \nIn this step, you set up authentication between the Databricks CLI on your development machine and your Databricks workspace. This article assumes that you want to use OAuth user-to-machine (U2M) authentication and a corresponding Databricks configuration profile named `DEFAULT` for authentication. \nNote \nU2M authentication is appropriate for trying out these steps in real time. For fully automated workflows, Databricks recommends that you use OAuth machine-to-machine (M2M) authentication instead. See the M2M authentication setup instructions in [Authentication](https:\/\/docs.databricks.com\/dev-tools\/bundles\/index.html#authentication). \n1. Use the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/install.html) to initiate OAuth token management locally by running the following command for each target workspace. \nIn the following command, replace `<workspace-url>` with your Databricks [workspace instance URL](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `https:\/\/dbc-a1b2345c-d6e7.cloud.databricks.com`. \n```\ndatabricks auth login --host <workspace-url>\n\n```\n2. The Databricks CLI prompts you to save the information that you entered as a Databricks [configuration profile](https:\/\/docs.databricks.com\/dev-tools\/auth\/index.html#config-profiles). Press `Enter` to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces. \nTo get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command `databricks auth profiles`. 
To view a specific profile\u2019s existing settings, run the command `databricks auth env --profile <profile-name>`.\n3. In your web browser, complete the on-screen instructions to log in to your Databricks workspace.\n4. To view a profile\u2019s current OAuth token value and the token\u2019s upcoming expiration timestamp, run one of the following commands: \n* `databricks auth token --host <workspace-url>`\n* `databricks auth token -p <profile-name>`\n* `databricks auth token --host <workspace-url> -p <profile-name>` \nIf you have multiple profiles with the same `--host` value, you might need to specify the `--host` and `-p` options together to help the Databricks CLI find the correct matching OAuth token information. \n### Step 2: Create the bundle \nA bundle contains the artifacts you want to deploy and the settings for the workflows you want to run. \n1. Use your terminal or command prompt to switch to a directory on your local development machine that will contain the template\u2019s generated bundle.\n2. Use the Databricks CLI to run the `bundle init` command: \n```\ndatabricks bundle init\n\n```\n3. For `Template to use`, leave the default value of `default-python` by pressing `Enter`.\n4. For `Unique name for this project`, leave the default value of `my_project`, or type a different value, and then press `Enter`. This determines the name of the root directory for this bundle. This root directory is created within your current working directory.\n5. For `Include a stub (sample) notebook`, select `no` and press `Enter`. This instructs the Databricks CLI to not add a sample notebook at this point, as the sample notebook that is associated with this option has no Delta Live Tables code in it.\n6. For `Include a stub (sample) DLT pipeline`, leave the default value of `yes` by pressing `Enter`. This instructs the Databricks CLI to add a sample notebook that has Delta Live Tables code in it.\n7. 
For `Include a stub (sample) Python package`, select `no` and press `Enter`. This instructs the Databricks CLI to not add sample Python wheel package files or related build instructions to your bundle. \n### Step 3: Explore the bundle \nTo view the files that the template generated, switch to the root directory of your newly created bundle and open this directory with your preferred IDE, for example Visual Studio Code. Files of particular interest include the following: \n* `databricks.yml`: This file specifies the bundle\u2019s programmatic name, includes a reference to the pipeline definition, and specifies settings about the target workspace.\n* `resources\/<project-name>_job.yml` and `resources\/<project-name>_pipeline.yml`: These files specify the job\u2019s and pipeline\u2019s settings.\n* `src\/dlt_pipeline.ipynb`: This file is a notebook that, when run, executes the pipeline. \nFor customizing pipelines, the mappings within a pipeline declaration correspond to the create pipeline operation\u2019s request payload as defined in [POST \/api\/2.0\/pipelines](https:\/\/docs.databricks.com\/api\/workspace\/pipelines\/create) in the REST API reference, expressed in YAML format. \n### Step 4: Validate the project\u2019s bundle configuration file \nIn this step, you check whether the bundle configuration is valid. \n1. From the root directory, use the Databricks CLI to run the `bundle validate` command, as follows: \n```\ndatabricks bundle validate\n\n```\n2. If a summary of the bundle configuration is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step. \nIf you make any changes to your bundle after this step, you should repeat this step to check whether your bundle configuration is still valid. \n### Step 5: Deploy the local project to the remote workspace \nIn this step, you deploy the local notebook to your remote Databricks workspace and create the Delta Live Tables pipeline within your workspace. \n1. 
Use the Databricks CLI to run the `bundle deploy` command as follows: \n```\ndatabricks bundle deploy -t dev\n\n```\n2. Check whether the local notebook was deployed: In your Databricks workspace\u2019s sidebar, click **Workspace**.\n3. Click into the **Users > `<your-username>` > .bundle > `<project-name>` > dev > files > src** folder. The notebook should be in this folder.\n4. Check whether the pipeline was created: In your Databricks workspace\u2019s sidebar, click **Delta Live Tables**.\n5. On the **Delta Live Tables** tab, click **[dev `<your-username>`] `<project-name>`\\_pipeline**. \nIf you make any changes to your bundle after this step, you should repeat steps 4-5 to check whether your bundle configuration is still valid and then redeploy the project. \n### Step 6: Run the deployed project \nIn this step, you run the Delta Live Tables pipeline in your workspace. \n1. From the root directory, use the Databricks CLI to run the `bundle run` command, as follows, replacing `<project-name>` with the name of your project from Step 2: \n```\ndatabricks bundle run -t dev <project-name>_pipeline\n\n```\n2. Copy the value of `Update URL` that appears in your terminal and paste this value into your web browser to open your Databricks workspace.\n3. In your Databricks workspace, after the pipeline completes successfully, click the **taxi\\_raw** view and the **filtered\\_taxis** materialized view to see the details. \nIf you make any changes to your bundle after this step, you should repeat steps 4-6 to check whether your bundle configuration is still valid, redeploy the project, and run the redeployed project. \n### Step 7: Clean up \nIn this step, you delete the deployed notebook and the pipeline from your workspace. \n1. From the root directory, use the Databricks CLI to run the `bundle destroy` command, as follows: \n```\ndatabricks bundle destroy -t dev\n\n```\n2. 
Confirm the pipeline deletion request: When prompted to permanently destroy resources, type `y` and press `Enter`.\n3. Confirm the notebook deletion request: When prompted to permanently destroy the previously deployed folder and all of its files, type `y` and press `Enter`.\n4. If you also want to delete the bundle from your development machine, you can now delete the local directory from Step 2. \nYou have reached the end of the steps for creating a bundle by using a template.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-bundles.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Develop Delta Live Tables pipelines with Databricks Asset Bundles\n###### Create the bundle manually\n\nIn these steps, you create the bundle from the beginning. These steps guide you to create a bundle that consists of a notebook with embedded Delta Live Tables directives and the definition of a Delta Live Tables pipeline to run this notebook. You then validate, deploy, and run the deployed notebook from the pipeline within your Databricks workspace. \n### Step 1: Create the bundle \nA bundle contains the artifacts you want to deploy and the settings for the workflows you want to run. \n1. Create or identify an empty directory on your development machine.\n2. Switch to the empty directory in your terminal, or open the empty directory in your IDE. \nTip \nYour empty directory could be associated with a cloned repository managed by a Git provider. This enables you to manage your bundle with external version control and to collaborate more easily with other developers and IT professionals on your project. However, to help simplify this demonstration, a cloned repo is not used here. \nIf you choose to clone a repo for this demo, Databricks recommends that the repo is empty or has only basic files in it such as `README` and `.gitignore`. 
Otherwise, any pre-existing files in the repo might be unnecessarily synchronized to your Databricks workspace. \n### Step 2: Add a notebook to the project \nIn this step, you add a notebook to your project. This notebook does the following: \n* Reads raw JSON clickstream data from [Databricks datasets](https:\/\/docs.databricks.com\/discover\/databricks-datasets.html#dbfs-datasets) into a raw Delta table in the `pipelines` folder inside of your Databricks workspace\u2019s DBFS root folder.\n* Reads records from the raw Delta table and uses a Delta Live Tables query and expectations to create a new Delta table with cleaned and prepared data.\n* Performs an analysis of the prepared data in the new Delta table with a Delta Live Tables query. \n1. From the directory\u2019s root, create a file with the name `dlt-wikipedia-python.py`.\n2. Add the following code to the `dlt-wikipedia-python.py` file: \n```\n# Databricks notebook source\nimport dlt\nfrom pyspark.sql.functions import *\n\n# COMMAND ----------\njson_path = \"\/databricks-datasets\/wikipedia-datasets\/data-001\/clickstream\/raw-uncompressed-json\/2015_2_clickstream.json\"\n\n# COMMAND ----------\n@dlt.table(\n    comment=\"The raw wikipedia clickstream dataset, ingested from \/databricks-datasets.\"\n)\ndef clickstream_raw():\n    return (spark.read.format(\"json\").load(json_path))\n\n# COMMAND ----------\n@dlt.table(\n    comment=\"Wikipedia clickstream data cleaned and prepared for analysis.\"\n)\n@dlt.expect(\"valid_current_page_title\", \"current_page_title IS NOT NULL\")\n@dlt.expect_or_fail(\"valid_count\", \"click_count > 0\")\ndef clickstream_prepared():\n    return (\n        dlt.read(\"clickstream_raw\")\n        .withColumn(\"click_count\", expr(\"CAST(n AS INT)\"))\n        .withColumnRenamed(\"curr_title\", \"current_page_title\")\n        .withColumnRenamed(\"prev_title\", \"previous_page_title\")\n        .select(\"current_page_title\", \"click_count\", \"previous_page_title\")\n    )\n\n# COMMAND ----------\n@dlt.table(\n    comment=\"A table containing 
the top pages linking to the Apache Spark page.\"\n)\ndef top_spark_referrers():\n    return (\n        dlt.read(\"clickstream_prepared\")\n        .filter(expr(\"current_page_title == 'Apache_Spark'\"))\n        .withColumnRenamed(\"previous_page_title\", \"referrer\")\n        .sort(desc(\"click_count\"))\n        .select(\"referrer\", \"click_count\")\n        .limit(10)\n    )\n\n``` \n### Step 3: Add a bundle configuration schema file to the project \nIf you are using an IDE such as Visual Studio Code, PyCharm Professional, or IntelliJ IDEA Ultimate that provides support for YAML files and JSON schema files, you can use your IDE to not only create the bundle configuration schema file but to check your project\u2019s bundle configuration file syntax and formatting and provide code completion hints, as follows. Note that while the bundle configuration file that you will create later in Step 5 is YAML-based, the bundle configuration schema file in this step is JSON-based. \n1. Add YAML language server support to Visual Studio Code, for example by installing the [YAML extension](https:\/\/marketplace.visualstudio.com\/items?itemName=redhat.vscode-yaml) from the Visual Studio Code Marketplace.\n2. Generate the Databricks Asset Bundle configuration JSON schema file by using the Databricks CLI to run the `bundle schema` command and redirect the output to a JSON file. For example, generate a file named `bundle_config_schema.json` within the current directory, as follows: \n```\ndatabricks bundle schema > bundle_config_schema.json\n\n```\n3. Note that later in Step 5, you will add the following comment to the beginning of your bundle configuration file, which associates your bundle configuration file with the specified JSON schema file: \n```\n# yaml-language-server: $schema=bundle_config_schema.json\n\n``` \nNote \nIn the preceding comment, if your Databricks Asset Bundle configuration JSON schema file is in a different path, replace `bundle_config_schema.json` with the full path to your schema file. \n1. 
Generate the Databricks Asset Bundle configuration JSON schema file by using the Databricks CLI to run the `bundle schema` command and redirect the output to a JSON file. For example, generate a file named `bundle_config_schema.json` within the current directory, as follows: \n```\ndatabricks bundle schema > bundle_config_schema.json\n\n```\n2. Configure PyCharm to recognize the bundle configuration JSON schema file, and then complete the JSON schema mapping, by following the instructions in [Configure a custom JSON schema](https:\/\/www.jetbrains.com\/help\/pycharm\/json.html#ws_json_schema_add_custom_procedure).\n3. Note that later in Step 5, you will use PyCharm to create or open a bundle configuration file. By convention, this file is named `databricks.yml`. \n1. Generate the Databricks Asset Bundle configuration JSON schema file by using the Databricks CLI to run the `bundle schema` command and redirect the output to a JSON file. For example, generate a file named `bundle_config_schema.json` within the current directory, as follows: \n```\ndatabricks bundle schema > bundle_config_schema.json\n\n```\n2. Configure IntelliJ IDEA to recognize the bundle configuration JSON schema file, and then complete the JSON schema mapping, by following the instructions in [Configure a custom JSON schema](https:\/\/www.jetbrains.com\/help\/idea\/json.html#ws_json_schema_add_custom_procedure).\n3. Note that later in Step 5, you will use IntelliJ IDEA to create or open a bundle configuration file. By convention, this file is named `databricks.yml`. \n### Step 4: Set up authentication \nIn this step, you set up authentication between the Databricks CLI on your development machine and your Databricks workspace. This article assumes that you want to use OAuth user-to-machine (U2M) authentication and a corresponding Databricks configuration profile named `DEFAULT` for authentication. \nNote \nU2M authentication is appropriate for trying out these steps in real time. 
For fully automated workflows, Databricks recommends that you use OAuth machine-to-machine (M2M) authentication instead. See the M2M authentication setup instructions in [Authentication](https:\/\/docs.databricks.com\/dev-tools\/bundles\/index.html#authentication). \n1. Use the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/install.html) to initiate OAuth token management locally by running the following command for each target workspace. \nIn the following command, replace `<workspace-url>` with your Databricks [workspace instance URL](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `https:\/\/dbc-a1b2345c-d6e7.cloud.databricks.com`. \n```\ndatabricks auth login --host <workspace-url>\n\n```\n2. The Databricks CLI prompts you to save the information that you entered as a Databricks [configuration profile](https:\/\/docs.databricks.com\/dev-tools\/auth\/index.html#config-profiles). Press `Enter` to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces. \nTo get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command `databricks auth profiles`. To view a specific profile\u2019s existing settings, run the command `databricks auth env --profile <profile-name>`.\n3. In your web browser, complete the on-screen instructions to log in to your Databricks workspace.\n4. 
To view a profile\u2019s current OAuth token value and the token\u2019s upcoming expiration timestamp, run one of the following commands: \n* `databricks auth token --host <workspace-url>`\n* `databricks auth token -p <profile-name>`\n* `databricks auth token --host <workspace-url> -p <profile-name>` \nIf you have multiple profiles with the same `--host` value, you might need to specify the `--host` and `-p` options together to help the Databricks CLI find the correct matching OAuth token information. \n### Step 5: Add a bundle configuration file to the project \nIn this step, you define how you want to deploy and run this notebook. For this demo, you want to use a Delta Live Tables pipeline to run the notebook. You model this objective within a bundle configuration file in your project. \n1. From the directory\u2019s root, use your favorite text editor or your IDE to create the bundle configuration file. By convention, this file is named `databricks.yml`.\n2. Add the following code to the `databricks.yml` file, replacing `<workspace-url>` with your [workspace URL](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `https:\/\/dbc-a1b2345c-d6e7.cloud.databricks.com`. This URL must match the one in your `.databrickscfg` file: \nTip \nThe first line, starting with `# yaml-language-server`, is required only if your IDE supports it. See Step 3 earlier for details. 
\n```\n# yaml-language-server: $schema=bundle_config_schema.json\nbundle:\n  name: dlt-wikipedia\n\nresources:\n  pipelines:\n    dlt-wikipedia-pipeline:\n      name: dlt-wikipedia-pipeline\n      development: true\n      continuous: false\n      channel: \"CURRENT\"\n      photon: false\n      libraries:\n        - notebook:\n            path: .\/dlt-wikipedia-python.py\n      edition: \"ADVANCED\"\n      clusters:\n        - label: \"default\"\n          num_workers: 1\n\ntargets:\n  development:\n    workspace:\n      host: <workspace-url>\n\n``` \nFor customizing pipelines, the mappings within a pipeline declaration correspond to the create pipeline operation\u2019s request payload as defined in [POST \/api\/2.0\/pipelines](https:\/\/docs.databricks.com\/api\/workspace\/pipelines\/create) in the REST API reference, expressed in YAML format. \n### Step 6: Validate the project\u2019s bundle configuration file \nIn this step, you check whether the bundle configuration is valid. \n1. Use the Databricks CLI to run the `bundle validate` command, as follows: \n```\ndatabricks bundle validate\n\n```\n2. If a summary of the bundle configuration is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step. \nIf you make any changes to your bundle after this step, you should repeat this step to check whether your bundle configuration is still valid. \n### Step 7: Deploy the local project to the remote workspace \nIn this step, you deploy the local notebook to your remote Databricks workspace and create the Delta Live Tables pipeline in your workspace. \n1. Use the Databricks CLI to run the `bundle deploy` command, as follows: \n```\ndatabricks bundle deploy -t development\n\n```\n2. Check whether the local notebook was deployed: In your Databricks workspace\u2019s sidebar, click **Workspace**.\n3. Click into the **Users > `<your-username>` > .bundle > dlt-wikipedia > development > files** folder. The notebook should be in this folder.\n4. 
Check whether the Delta Live Tables pipeline was created: In your Databricks workspace\u2019s sidebar, click **Workflows**.\n5. On the **Delta Live Tables** tab, click **dlt-wikipedia-pipeline**. \nIf you make any changes to your bundle after this step, you should repeat steps 6-7 to check whether your bundle configuration is still valid and then redeploy the project. \n### Step 8: Run the deployed project \nIn this step, you run the Delta Live Tables pipeline in your workspace. \n1. Use the Databricks CLI to run the `bundle run` command, as follows: \n```\ndatabricks bundle run -t development dlt-wikipedia-pipeline\n\n```\n2. Copy the value of `Update URL` that appears in your terminal and paste this value into your web browser to open your Databricks workspace.\n3. In your Databricks workspace, after the Delta Live Tables pipeline completes successfully and shows green title bars across the various materialized views, click the **clickstream\\_raw**, **clickstream\\_prepared**, or **top\\_spark\\_referrers** materialized views to see more details.\n4. Before you start the next step to clean up, note the location of the Delta tables created in DBFS as follows. You will need this information if you want to manually clean up these Delta tables later: \n1. With the Delta Live Tables pipeline still open, click the **Settings** button (next to the **Permissions** and **Schedule** buttons).\n2. In the **Destination** area, note the value of the **Storage location** field. This is where the Delta tables were created in DBFS. \nIf you make any changes to your bundle after this step, you should repeat steps 6-8 to check whether your bundle configuration is still valid, redeploy the project, and run the redeployed project. \n### Step 9: Clean up \nIn this step, you delete the deployed notebook and the Delta Live Tables pipeline from your workspace. \n1. Use the Databricks CLI to run the `bundle destroy` command, as follows: \n```\ndatabricks bundle destroy\n\n```\n2. 
Confirm the Delta Live Tables pipeline deletion request: When prompted to permanently destroy resources, type `y` and press `Enter`.\n3. Confirm the notebook deletion request: When prompted to permanently destroy the previously deployed folder and all of its files, type `y` and press `Enter`. \nRunning the `bundle destroy` command deletes only the deployed Delta Live Tables pipeline and the folder containing the deployed notebook. This command does not delete any side effects, such as the Delta tables that the notebook created in DBFS. If you need to delete these Delta tables, you must do so manually.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-bundles.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Develop Delta Live Tables pipelines with Databricks Asset Bundles\n###### Add an existing pipeline definition to a bundle\n\nYou can use an existing Delta Live Tables pipeline definition as a basis to define a new pipeline in a bundle configuration file. To do this, complete the following steps. \nNote \nThe following steps create a new pipeline that has the same settings as the existing pipeline. However, the new pipeline has a different pipeline ID than the existing pipeline. You cannot automatically import an existing pipeline ID into a bundle. \n### Step 1: Get the existing pipeline definition in JSON format \nIn this step, you use the Databricks workspace user interface to get the JSON representation of the existing pipeline definition. \n1. In your Databricks workspace\u2019s sidebar, click **Workflows**.\n2. On the **Delta Live Tables** tab, click your pipeline\u2019s **Name** link.\n3. Between the **Permissions** and **Schedule** buttons, click the **Settings** button.\n4. Click the **JSON** button.\n5. Copy the pipeline definition\u2019s JSON. 
\n### Step 2: Convert the pipeline definition from JSON to YAML format \nThe pipeline definition that you copied from the previous step is in JSON format. Bundle configurations are in YAML format. You must convert the pipeline definition from JSON to YAML format. Databricks recommends the following resources for converting JSON to YAML: \n* [Convert JSON to YAML online](https:\/\/www.json2yaml.com\/).\n* For Visual Studio Code, the [json2yaml](https:\/\/marketplace.visualstudio.com\/items?itemName=tuxtina.json2yaml) extension. \n### Step 3: Add the pipeline definition YAML to a bundle configuration file \nAdd the YAML that you converted in the previous step to one of the locations labelled `<pipeline-yaml-can-go-here>` in your bundle configuration file, as follows: \n```\nresources:\n  pipelines:\n    <some-unique-programmatic-identifier-for-this-pipeline>:\n      <pipeline-yaml-can-go-here>\n\ntargets:\n  <some-unique-programmatic-identifier-for-this-target>:\n    resources:\n      pipelines:\n        <some-unique-programmatic-identifier-for-this-pipeline>:\n          <pipeline-yaml-can-go-here>\n\n``` \n### Step 4: Add notebooks, Python files, and other artifacts to the bundle \nAny Python files and notebooks that are referenced in the existing pipeline should be moved into the bundle\u2019s sources. \nFor better compatibility with bundles, notebooks should use the IPython notebook format (`.ipynb`). If you develop the bundle locally, you can export an existing notebook from a Databricks workspace into the `.ipynb` format by clicking **File > Export > IPython Notebook** from the Databricks notebook user interface. By convention, you should then put the downloaded notebook into the `src\/` directory in your bundle. \nAfter you add your notebooks, Python files, and other artifacts to the bundle, make sure that your pipeline definition references them. 
For example, for a notebook named `hello.ipynb` in a `src\/` directory that is in the same folder as the bundle configuration file, the pipeline definition might be expressed as follows: \n```\nresources:\n  pipelines:\n    hello-pipeline:\n      name: hello-pipeline\n      libraries:\n        -\n          notebook:\n            path: .\/src\/hello.ipynb\n\n``` \n### Step 5: Validate, deploy, and run the new pipeline \n1. Validate that the bundle\u2019s configuration files are syntactically correct, by running the following command: \n```\ndatabricks bundle validate\n\n```\n2. Deploy the bundle by running the following command. In this command, replace `<target-identifier>` with the unique programmatic identifier for the target from the bundle configuration: \n```\ndatabricks bundle deploy -t <target-identifier>\n\n```\n3. Run the pipeline by running the following command. In this command, replace the following: \n* Replace `<target-identifier>` with the unique programmatic identifier for the target from the bundle configuration.\n* Replace `<pipeline-identifier>` with the unique programmatic identifier for the pipeline from the bundle configuration.\n```\ndatabricks bundle run -t <target-identifier> <pipeline-identifier>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-bundles.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Unity Catalog best practices\n\nThis document provides recommendations for using Unity Catalog and Delta Sharing to meet your data governance needs. \nUnity Catalog is a fine-grained governance solution for data and AI on the Databricks platform. It helps simplify security and governance of your data by providing a central place to administer and audit data access. Delta Sharing is a secure data sharing platform that lets you share data in Databricks with users outside your organization. 
It uses Unity Catalog to manage and audit sharing behavior.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Unity Catalog best practices\n##### Data governance and data isolation building blocks\n\nTo develop a data governance model and a data isolation plan that works for your organization, it helps to understand the primary building blocks that are available to you when you create your data governance solution in Databricks. \nThe following diagram illustrates the primary data hierarchy in Unity Catalog (some securable objects are grayed out to emphasize the hierarchy of objects managed under catalogs). \n![Unity Catalog object model diagram](https:\/\/docs.databricks.com\/_images\/object-model.png) \nThe objects in that hierarchy include the following: \n* **Metastore:** A metastore is the top-level container of objects in Unity Catalog. Metastores live at the account level and function as the top of the pyramid in the Databricks data governance model. \nMetastores manage data assets (tables, views, and volumes) and the permissions that govern access to them. Databricks account admins can create one metastore for each region in which they operate, and assign them to multiple Databricks workspaces in the same region. Metastore admins can manage all objects in the metastore. They don\u2019t have direct access to read and write to tables registered in the metastore, but they do have indirect access through their ability to transfer data object ownership. \nPhysical storage for any given metastore is, by default, isolated from storage for any other metastore in your account. \nMetastores provide regional isolation but are not intended as units of data isolation. 
Data isolation should begin at the catalog level.\n* **Catalog:** Catalogs are the highest level in the data hierarchy (catalog > schema > table\/view\/volume) managed by the Unity Catalog metastore. They are intended as the primary unit of data isolation in a typical Databricks data governance model. \nCatalogs represent a logical grouping of schemas, usually bounded by data access requirements. Catalogs often mirror organizational units or software development lifecycle scopes. You may choose, for example, to have a catalog for production data and a catalog for development data, or a catalog for non-customer data and one for sensitive customer data. \nCatalogs can be stored at the metastore level, or you can configure a catalog to be stored separately from the rest of the parent metastore. If your workspace was enabled for Unity Catalog automatically, there is no metastore-level storage, and you must specify a storage location when you create a catalog. \nIf the catalog is the primary unit of data isolation in the Databricks data governance model, the *workspace* is the primary environment for working with data assets. Metastore admins and catalog owners can manage access to catalogs independently of workspaces, or they can bind catalogs to specific workspaces to ensure that certain kinds of data are processed only in those workspaces. You might want separate production and development workspaces, for example, or a separate workspace for processing personal data. \nBy default, access permissions for a securable object are inherited by the children of that object, with catalogs at the top of the hierarchy. This makes it easier to set up default access rules for your data and to specify different rules at each level of the hierarchy only where you need them.\n* **Schema (Database):** Schemas, also known as databases, are logical groupings of tabular data (tables and views), non-tabular data (volumes), functions, and machine learning models. 
They give you a way to organize and control access to data that is more granular than catalogs. Typically they represent a single use case, project, or team sandbox. \nSchemas can be stored in the same physical storage as the parent catalog, or you can configure a schema to be stored separately from the rest of the parent catalog. \nMetastore admins, parent catalog owners, and schema owners can manage access to schemas.\n* **Tables:** Tables reside in the third layer of Unity Catalog\u2019s three-level namespace. They contain rows of data. \nUnity Catalog lets you create *managed tables* and *external tables*. \nFor managed tables, Unity Catalog fully manages the lifecycle and file layout. By default, managed tables are stored in the root storage location that you configure when you create a metastore. You can choose instead to isolate storage for managed tables at the catalog or schema levels. \nExternal tables are tables whose data lifecycle and file layout are managed using your cloud provider and other data platforms, not Unity Catalog. Typically you use external tables to register large amounts of your existing data, or if you also require write access to the data using tools outside of Databricks clusters and Databricks SQL warehouses. Once an external table is registered in a Unity Catalog metastore, you can manage and audit Databricks access to it just like you can with managed tables. \nParent catalog owners and schema owners can manage access to tables, as can metastore admins (indirectly).\n* **Views:** A view is a read-only object derived from one or more tables and views in a metastore.\n* **Rows and columns:** Row and column-level access, along with data masking, is granted using either dynamic views or row filters and column masks. Dynamic views are read-only.\n* **Volumes:** Volumes reside in the third layer of Unity Catalog\u2019s three-level namespace. They manage non-tabular data. 
You can use volumes to store, organize, and access files in any format, including structured, semi-structured, and unstructured data. Files in volumes cannot be registered as tables. \n* **Models and functions:** Although they are not, strictly speaking, data assets, registered models and user-defined functions can also be managed in Unity Catalog and reside at the lowest level in the object hierarchy. See [Manage model lifecycle in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html) and [User-defined functions (UDFs) in Unity Catalog](https:\/\/docs.databricks.com\/udf\/unity-catalog.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Unity Catalog best practices\n##### Plan your data isolation model\n\nWhen an organization uses a data platform like Databricks, there is often a need to have data isolation boundaries between environments (such as development and production) or between organizational operating units. \nIsolation standards might vary for your organization, but typically they include the following expectations: \n* Users can only gain access to data based on specified access rules.\n* Data can be managed only by designated people or teams.\n* Data is physically separated in storage.\n* Data can be accessed only in designated environments. \nThe need for data isolation can lead to siloed environments that can make both data governance and collaboration unnecessarily difficult. Databricks solves this problem using Unity Catalog, which provides a number of data isolation options while maintaining a unified data governance platform. This section discusses the data isolation options available in Databricks and how to use them, whether you prefer a centralized data governance model or a distributed one. 
\n### Users can only gain access to data based on specified access rules \nMost organizations have strict requirements around data access based on internal or regulatory requirements. Typical examples of data that must be kept secure include employee salary information or credit card payment information. Access to this type of information is typically tightly controlled and audited periodically. Unity Catalog provides you with granular control over data assets within the catalog to meet these industry standards. With the controls that Unity Catalog provides, users can see and query only the data that they are entitled to see and query. \n### Data can be managed only by designated people or teams \nUnity Catalog gives you the ability to choose between centralized and distributed governance models. \nIn the centralized governance model, your governance administrators are owners of the metastore and can take ownership of any object and grant and revoke permissions. \nIn a distributed governance model, the catalog or a set of catalogs is the data domain. The owner of that catalog can create and own all assets and manage governance within that domain. The owners of any given domain can operate independently of the owners of other domains. \nRegardless of whether you choose the metastore or catalogs as your data domain, Databricks strongly recommends that you set a group as the metastore admin or catalog owner. \n![Unity Catalog ownership and access](https:\/\/docs.databricks.com\/_images\/ownership.png) \n### Data is physically separated in storage \nAn organization can require that data of certain types be stored within specific accounts or buckets in their cloud tenant. \nUnity Catalog gives the ability to configure storage locations at the metastore, catalog, or schema level to satisfy such requirements. 
\nFor example, let\u2019s say your organization has a company compliance policy that requires production data relating to human resources to reside in the bucket s3:\/\/mycompany-hr-prod. In Unity Catalog, you can achieve this requirement by setting a location on a catalog level, creating a catalog called, for example `hr_prod`, and assigning the location s3:\/\/mycompany-hr-prod\/unity-catalog to it. This means that managed tables or volumes created in the `hr_prod` catalog (for example, using `CREATE TABLE hr_prod.default.table \u2026`) store their data in s3:\/\/mycompany-hr-prod\/unity-catalog. Optionally, you can choose to provide schema-level locations to organize data within the `hr_prod` catalog at a more granular level. \nIf such a storage isolation is not required, you can set a storage location at the metastore level. The result is that this location serves as a default location for storing managed tables and volumes across catalogs and schemas in the metastore. \nThe system evaluates the hierarchy of storage locations from schema to catalog to metastore. \nFor example, if a table `myCatalog.mySchema.myTable` is created in `my-region-metastore`, the table storage location is determined according to the following rule: \n1. If a location has been provided for `mySchema`, it will be stored there.\n2. If not, and a location has been provided on `myCatalog`, it will be stored there.\n3. Finally, if no location has been provided on `myCatalog`, it will be stored in the location associated with the `my-region-metastore`. \n![Unity Catalog storage hierarchy](https:\/\/docs.databricks.com\/_images\/managed-storage.png) \n### Data can be accessed only in designated environments \nOrganizational and compliance requirements often specify that you keep certain data, like personal data, accessible only in certain environments. 
You may also want to keep production data isolated from development environments or ensure that certain data sets and domains are never joined together. \nIn Databricks, the workspace is the primary data processing environment, and catalogs are the primary data domain. Unity Catalog lets metastore admins and catalog owners assign, or \u201cbind,\u201d catalogs to specific workspaces. These environment-aware bindings give you the ability to ensure that only certain catalogs are available within a workspace, regardless of the specific privileges on data objects granted to a user. \nNow let\u2019s take a deeper look at the process of setting up Unity Catalog to meet your needs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Unity Catalog best practices\n##### Configure a Unity Catalog metastore\n\nA metastore is the top-level container of objects in Unity Catalog. Metastores manage data assets (tables, views, and volumes) as well as other securable objects managed by Unity Catalog. For the complete list of securable objects, see [Securable objects in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html#securable-objects). \nThis section provides tips for creating and configuring metastores. If your workspace was automatically enabled for Unity Catalog, you do not need to create a metastore, but the information presented in this section might still be useful. See [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement). \nTips for configuring metastores: \n* You should set up one metastore for each region in which you have Databricks workspaces. \nEvery workspace attached to a single regional metastore has access to the data managed by the metastore. 
If you want to share data between metastores, use [Delta Sharing](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html#delta-sharing).\n* Each metastore can be configured with a managed storage location (also called root storage) in your cloud tenant that can be used to store managed tables and managed volumes. \nIf you choose to create a metastore-level managed location, you must ensure that no users have direct access to it (that is, through the cloud account that contains it). Giving access to this storage location could allow a user to bypass access controls in a Unity Catalog metastore and disrupt auditability. For these reasons, your metastore managed storage should be a dedicated bucket. You should not reuse a bucket that is also your DBFS root file system or has previously been a DBFS root file system. \nYou also have the option of defining managed storage at the catalog and schema levels, overriding the metastore\u2019s root storage location. In most scenarios, Databricks recommends storing managed data at the catalog level.\n* You should understand the privileges of workspace admins in workspaces that are enabled for Unity Catalog, and review your existing workspace admin assignments. \nWorkspace admins can manage operations for their workspace including adding users and service principals, creating clusters, and delegating other users to be workspace admins. If your workspace was enabled for Unity Catalog automatically, workspace admins have the ability to create catalogs and many other Unity Catalog objects by default. 
See [Workspace admin privileges when workspaces are enabled for Unity Catalog automatically](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html#workspace-admins-auto). \nWorkspace admins also have the ability to perform workspace management tasks such as managing job ownership and viewing notebooks, which may give indirect access to data registered in Unity Catalog. Workspace admin is a privileged role that you should distribute carefully.\n* If you use workspaces to isolate user data access, you might want to use workspace-catalog bindings. Workspace-catalog bindings enable you to limit catalog access by workspace boundaries. For example, you can ensure that workspace admins and users can only access production data in `prod_catalog` from a production workspace environment, `prod_workspace`. The default is to share the catalog with all workspaces attached to the current metastore. See [(Optional) Assign a catalog to specific workspaces](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#catalog-binding). \nIf your workspace was enabled for Unity Catalog automatically, the pre-provisioned workspace catalog is bound to your workspace by default. \nSee [Create a Unity Catalog metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Unity Catalog best practices\n##### Configure external locations and storage credentials\n\n*External locations* allow Unity Catalog to read and write data on your cloud tenant on behalf of users. External locations are defined as a path to cloud storage, combined with a *storage credential* that can be used to access that location. \nYou can use external locations to register external tables and external volumes in Unity Catalog. 
The content of these entities is physically located on a sub-path in an external location that is referenced when a user creates the volume or the table. \nA *storage credential* encapsulates a long-term cloud credential that provides access to cloud storage. For example, in AWS you can configure an IAM role to access S3 buckets. \nFor increased data isolation, you can bind storage credentials and external locations to specific workspaces. See [(Optional) Assign an external location to specific workspaces](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#workspace-binding) and [(Optional) Assign a storage credential to specific workspaces](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html#workspace-binding). \nTip \nExternal locations, by combining storage credentials and storage paths, provide strong control and auditability of storage access. To prevent users from bypassing the access control provided by Unity Catalog, you should ensure that you limit the number of users with direct access to any bucket that is being used as an external location. For the same reason, you should not mount storage accounts to DBFS if they are also being used as external locations. Databricks recommends that you migrate mounts on cloud storage locations to external locations in Unity Catalog using [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html). \nFor a list of best practices for managing external locations, see [Manage external locations, external tables, and external volumes](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html#manage-external). 
See also [Create an external location to connect cloud storage to Databricks](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Unity Catalog best practices\n##### Organize your data\n\nDatabricks recommends using catalogs to provide segregation across your organization\u2019s information architecture. Often this means that catalogs correspond to a software development environment scope, team, or business unit. If you use workspaces as a data isolation tool (for example, using different workspaces for production and development environments, or a specific workspace for working with highly sensitive data), you can also bind a catalog to specific workspaces. This ensures that all processing of specified data is handled in the appropriate workspace. See [(Optional) Assign a catalog to specific workspaces](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#catalog-binding). \n![Unity Catalog catalogs](https:\/\/docs.databricks.com\/_images\/uc-catalogs.png) \nA schema (also called a database) is the second layer of Unity Catalog\u2019s three-level namespace and organizes tables, views, and volumes. You can use schemas to organize and define permissions for your assets. \nObjects governed by Unity Catalog can be *managed* or *external*: \n* **Managed objects** are the default way to create data objects in Unity Catalog. \nUnity Catalog manages the lifecycle and file layout for these securables. You should not use tools outside of Databricks to manipulate files in managed tables or volumes directly. \nManaged tables and volumes are stored in *managed storage*, which can exist at the metastore, catalog, or schema level for any given table or volume. 
See [Data is physically separated in storage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html#physically-separate). \nManaged tables and volumes are a convenient solution when you want to provision a governed location for your content without the overhead of creating and managing external locations and storage credentials. \nManaged tables always use the [Delta](https:\/\/docs.databricks.com\/delta\/index.html) table format.\n* **External objects** are securables whose data lifecycle and file layout are not managed by Unity Catalog. \nExternal volumes and tables are registered on an external location to provide access to large numbers of files that already exist in cloud storage without requiring data copy activity. Use external objects when you have files that are produced by other systems and want them staged for access from within Databricks, or when tools outside of Databricks require direct access to these files. \nExternal tables support Delta Lake and many other data formats, including Parquet, JSON, and CSV. Both managed and external volumes can be used to access and store files of arbitrary formats: data can be structured, semi-structured, or unstructured. \nFor more information about creating tables and volumes, see [Create tables in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html) and [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Unity Catalog best practices\n##### Manage external locations, external tables, and external volumes\n\nThe diagram below represents the filesystem hierarchy of a single cloud storage bucket, with four external locations that share one storage credential. 
\n![External locations](https:\/\/docs.databricks.com\/_images\/external-locations.png) \nOnce you have external locations configured in Unity Catalog, you can create external tables and volumes on directories inside the external locations. You can then use Unity Catalog to manage user and group access to these tables and volumes. This allows you to provide specific users or groups access to specific directories and files in the cloud storage bucket. \nNote \nWhen you define a volume, cloud URI access to data under the volume path is governed by the permissions of the volume. \n### Recommendations for using external locations \nRecommendations for granting permissions on external locations: \n* Grant the ability to create external locations only to an administrator who is tasked with setting up connections between Unity Catalog and cloud storage, or to trusted data engineers. \nExternal locations provide access from within Unity Catalog to a broadly encompassing location in cloud storage\u2014for example, an entire bucket or container (s3:\/\/mybucket) or a broad subpath (s3:\/\/mybucket\/alotofdata). The intention is that a cloud administrator can be involved in setting up a few external locations and then delegate the responsibility of managing those locations to a Databricks administrator in your organization. The Databricks administrator can then further organize the external location into areas with more granular permissions by registering external volumes or external tables at specific prefixes under the external location. \nBecause external locations are so encompassing, Databricks recommends giving the `CREATE EXTERNAL LOCATION` permission only to an administrator who is tasked with setting up connections between Unity Catalog and cloud storage, or to trusted data engineers. 
To provide other users with more granular access, Databricks recommends registering external tables or volumes on top of external locations and granting users access to data using volumes or tables. Since tables and volumes are children of a catalog and schema, catalog or schema administrators have the ultimate control over access permissions. \nYou can also control access to an external location by binding it to specific workspaces. See [(Optional) Assign an external location to specific workspaces](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#workspace-binding).\n* Don\u2019t grant general `READ FILES` or `WRITE FILES` permissions on external locations to end users. \nWith the availability of volumes, users shouldn\u2019t use external locations for anything but creating tables, volumes, or managed locations. They should not use external locations for path-based access for data science or other non-tabular data use cases. \nVolumes provide support for working with files using SQL commands, dbutils, Spark APIs, REST APIs, Terraform, and a user interface for browsing, uploading, and downloading files. Moreover, volumes offer a FUSE mount that is accessible on the local file system under `\/Volumes\/<catalog_name>\/<schema_name>\/<volume_name>\/`. The FUSE mount allows data scientists and ML engineers to access files as if they were in a local filesystem, as required by many machine learning or operating system libraries. \nIf you must grant direct access to files in an external location (for exploring files in cloud storage before a user creates an external table or volume, for example), you can grant `READ FILES`. Use cases for granting `WRITE FILES` are rare. \nYou should use external locations to do the following: \n* Register external tables and volumes using the `CREATE EXTERNAL VOLUME` or `CREATE TABLE` commands.\n* Explore existing files in cloud storage before you create an external table or volume at a specific prefix. 
The `READ FILES` privilege is a precondition.\n* Register a location as managed storage for catalogs and schemas instead of the metastore root bucket. The `CREATE MANAGED STORAGE` privilege is a precondition. \nMore recommendations for using external locations: \n* Avoid path overlap conflicts: never create external volumes or tables at the root of an external location. \nIf you do create external volumes or tables at the external location root, you can\u2019t create any additional external volumes or tables on the external location. Instead, create external volumes or tables on a sub-directory inside the external location. \n### Recommendations for using external volumes \nYou should use external volumes to do the following: \n* Register landing areas for raw data produced by external systems to support its processing in the early stages of ETL pipelines and other data engineering activities.\n* Register staging locations for ingestion, for example, using Auto Loader, `COPY INTO`, or CTAS (`CREATE TABLE AS`) statements.\n* Provide file storage locations for data scientists, data analysts, and machine learning engineers to use as parts of their exploratory data analysis and other data science tasks, when managed volumes are not an option.\n* Give Databricks users access to arbitrary files produced and deposited in cloud storage by other systems, for example, large collections of unstructured data (such as image, audio, video, and PDF files) captured by surveillance systems or IoT devices, or library files (JARs and Python wheel files) exported from local dependency management systems or CI\/CD pipelines.\n* Store operational data, such as logging or checkpointing files, when managed volumes are not an option. \nMore recommendations for using external volumes: \n* Databricks recommends that you create external volumes from one external location within one schema. 
\nTip \nFor ingestion use cases in which the data is copied to another location\u2014for example using Auto Loader or `COPY INTO`\u2014use external volumes. Use external tables when you want to query data in place as a table, with no copy involved. \n### Recommendations for using external tables \nYou should use external tables to support normal querying patterns on top of data stored in cloud storage, when creating managed tables is not an option. \nMore recommendations for using external tables: \n* Databricks recommends that you create external tables using one external location per schema.\n* Databricks strongly recommends against registering a table as an external table in more than one metastore due to the risk of consistency issues. For example, a change to the schema in one metastore will not register in the second metastore. Use Delta Sharing for sharing data between metastores. See [Share data securely using Delta Sharing](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html#delta-sharing).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Unity Catalog best practices\n##### Configure access control\n\nEach securable object in Unity Catalog has an owner. The principal that creates an object becomes its initial owner. An object\u2019s owner has all privileges on the object, such as SELECT and MODIFY on a table, as well as the permission to grant privileges on the securable object to other principals. Only owners of a securable object have the permission to grant privileges on that object to other principals. Therefore, it is best practice to configure ownership on all objects to the **group** responsible for administration of grants on the object. Both the owner and metastore admins can transfer ownership of a securable object to a group. 
Additionally, if the object is contained within a catalog (like a table or view), the catalog and schema owner can change the ownership of the object. \nSecurable objects in Unity Catalog are hierarchical and privileges are inherited downward. This means that granting a privilege on a catalog or schema automatically grants the privilege to all current and future objects within the catalog or schema. For more information, see [Inheritance model](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html#inheritance). \nIn order to read data from a table or view a user must have the following privileges: \n* `SELECT` on the table or view\n* `USE SCHEMA` on the schema that owns the table\n* `USE CATALOG` on the catalog that owns the schema \n`USE CATALOG` enables the grantee to traverse the catalog in order to access its child objects and `USE SCHEMA` enables the grantee to traverse the schema in order to access its child objects. For example, to select data from a table, users need to have the `SELECT` privilege on that table and the `USE CATALOG` privilege on its parent catalog, along with the `USE SCHEMA` privilege on its parent schema. Therefore, you can use this privilege to restrict access to sections of your data namespace to specific groups. A common scenario is to set up a schema per team where only that team has `USE SCHEMA` and `CREATE` on the schema. This means that any tables produced by team members can only be shared within the team. 
\nYou can secure access to a table using the following SQL syntax: \n```\nGRANT USE CATALOG ON CATALOG < catalog_name > TO < group_name >;\nGRANT USE SCHEMA ON SCHEMA < catalog_name >.< schema_name >\nTO < group_name >;\nGRANT\nSELECT\nON < catalog_name >.< schema_name >.< table_name >\nTO < group_name >;\n\n``` \nYou can secure access to columns using a dynamic view in a secondary schema as shown in the following SQL syntax: \n```\nCREATE VIEW < catalog_name >.< schema_name >.< view_name > as\nSELECT\nid,\nCASE WHEN is_account_group_member(< group_name >) THEN email ELSE 'REDACTED' END AS email,\ncountry,\nproduct,\ntotal\nFROM\n< catalog_name >.< schema_name >.< table_name >;\nGRANT USE CATALOG ON CATALOG < catalog_name > TO < group_name >;\nGRANT USE SCHEMA ON SCHEMA < catalog_name >.< schema_name >\nTO < group_name >;\nGRANT\nSELECT\nON < catalog_name >.< schema_name >.< view_name >\nTO < group_name >;\n\n``` \nYou can secure access to rows using a dynamic view in a secondary schema as shown in the following SQL syntax: \n```\nCREATE VIEW < catalog_name >.< schema_name >.< view_name > as\nSELECT\n*\nFROM\n< catalog_name >.< schema_name >.< table_name >\nWHERE\nCASE WHEN is_account_group_member('managers') THEN TRUE ELSE total <= 1000000 END;\nGRANT USE CATALOG ON CATALOG < catalog_name > TO < group_name >;\nGRANT USE SCHEMA ON SCHEMA < catalog_name >.< schema_name >\nTO < group_name >;\nGRANT\nSELECT\nON < catalog_name >.< schema_name >.< view_name >\nTO < group_name >;\n\n``` \nYou can also grant users secure access to tables using row filters and column masks. For more information, see [Filter sensitive table data using row filters and column masks](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html). 
\nFor more information on all privileges in Unity Catalog, see [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Unity Catalog best practices\n##### Manage cluster configurations\n\nDatabricks recommends using cluster policies to limit the ability to configure clusters based on a set of rules. Cluster policies let you restrict access to only create clusters which are Unity Catalog-enabled. Using cluster policies reduces available choices, which will greatly simplify the cluster creation process for users and ensure that they are able to access data seamlessly. Cluster policies also enable you to control cost by limiting per cluster maximum cost. \nTo ensure the integrity of access controls and enforce strong isolation guarantees, Unity Catalog imposes security requirements on compute resources. For this reason, Unity Catalog introduces the concept of a cluster\u2019s access mode. Unity Catalog is secure by default; if a cluster is not configured with an appropriate access mode, the cluster can\u2019t access data in Unity Catalog. See [Supported compute and cluster access modes for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#cluster-security-mode). \nDatabricks recommends using the shared access mode when sharing a cluster and the Single User access mode for automated jobs and machine learning workloads. 
\nThe JSON below provides a policy definition for a cluster with the shared access mode: \n```\n{\n\"spark_version\": {\n\"type\": \"regex\",\n\"pattern\": \"1[0-1]\\\\.[0-9]*\\\\.x-scala.*\",\n\"defaultValue\": \"10.4.x-scala2.12\"\n},\n\"access_mode\": {\n\"type\": \"fixed\",\n\"value\": \"USER_ISOLATION\",\n\"hidden\": true\n}\n}\n\n``` \nThe JSON below provides a policy definition for an automated job cluster with the Single User access mode: \n```\n{\n\"spark_version\": {\n\"type\": \"regex\",\n\"pattern\": \"1[0-1]\\\\.[0-9].*\",\n\"defaultValue\": \"10.4.x-scala2.12\"\n},\n\"access_mode\": {\n\"type\": \"fixed\",\n\"value\": \"SINGLE_USER\",\n\"hidden\": true\n},\n\"single_user_name\": {\n\"type\": \"regex\",\n\"pattern\": \".*\",\n\"hidden\": true\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Unity Catalog best practices\n##### Audit access\n\nA complete data governance solution requires auditing access to data and providing alerting and monitoring capabilities. Unity Catalog captures an audit log of actions performed against the metastore and these logs are delivered as part of Databricks audit logs. \nYou can access your account\u2019s audit logs using [system tables](https:\/\/docs.databricks.com\/admin\/system-tables\/index.html). For more information on the audit log system table, see [Audit log system table reference](https:\/\/docs.databricks.com\/admin\/system-tables\/audit-logs.html). 
\nSee [Monitoring Your Databricks Data Intelligence Platform with Audit Logs](https:\/\/www.databricks.com\/blog\/2022\/05\/02\/monitoring-your-databricks-lakehouse-platform-with-audit-logs.html) for details on how to get complete visibility into critical events relating to your Databricks Data Intelligence Platform.\n\n#### Unity Catalog best practices\n##### Share data securely using Delta Sharing\n\n[Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html) is an open protocol developed by Databricks for secure data sharing with other organizations or other departments within your organization, regardless of which computing platforms they use. When Delta Sharing is enabled on a metastore, Unity Catalog runs a Delta Sharing server. \nTo share data between metastores, you can leverage [Databricks-to-Databricks Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html#d-to-d). This allows you to register tables from metastores in different regions. These tables will appear as read-only objects in the consuming metastore. These tables can be granted access like any other object within Unity Catalog. \nWhen you use Databricks-to-Databricks Delta Sharing to share between metastores, keep in mind that access control is limited to one metastore. If a securable object, like a table, has grants on it and that resource is shared to an intra-account metastore, then the grants from the source will not apply to the destination share. 
The destination share will have to set its own grants.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Unity Catalog best practices\n##### Learn more\n\n* [Set up and manage Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html)\n* [Share data and AI assets securely using Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html)\n* [Upgrade Hive tables and views to Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### What happened to Databricks Repos?\n\nDatabricks rolled out new user interface elements that allow users to work directly with Git repo-backed folders from the Workspace UI, effectively replacing the prior, separate \u201cRepos\u201d feature functionality.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/what-happened-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### What happened to Databricks Repos?\n##### What does this change mean for me?\n\nIf you are a user of the Databricks Repos feature for co-versioned Git-based source control of project assets, the core functionality has not changed. The most notable difference is that many contextual UI operations now refer to \u201cGit folders\u201d rather than \u201cRepos\u201d. \nFor example, a Databricks folder backed by a Git repo could be created by selecting **New** and then **Repo** from the UI: \n![The \"New\" menu option used to refer to a \"Repo\"](https:\/\/docs.databricks.com\/_images\/repos-ui-old.png) \nNow, you select **New** and choose **Git folder**. Same thing, different name! 
\n![The \"New\" menu option now asks you to create a \"Git folder\"](https:\/\/docs.databricks.com\/_images\/repos-ui-new.png) \nThis change provides some improvements that simplify working with version-controlled folders: \n1. **Better folder organization**: Git folders can be created at any level of the workspace file tree, allowing you to organize your Git folders in a way that works best for your project. For example, you can create Git folders at `\/Workspace\/Users\/<user email>\/level_1\/level_2\/level_3\/<Git folder name>`. Repos can only be created at a fixed directory level, such as the root of the Repos user folder like `\/Workspace\/Repos\/<user email>\/<Repo name>`. \n* Note: Git folders can contain or collocate with other assets that are not supported by Repos today. Unsupported asset types like DBSQL assets and MLflow experiments can be moved into Git folders. Serialization support for additional assets will be added over time.\n2. **Simplified UI behaviors**: This change brings a common workspace interaction\u2013working with Git\u2013directly into your Databricks workspace, and reduces time spent navigating between your workspace and your version-controlled Git folders.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/what-happened-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### What happened to Databricks Repos?\n##### What has changed, specifically?\n\n1. Git folders can be created outside of the `\/Repos` directory.\n2. Git folders are created by selecting **New** > **Git folder** in a Databricks workspace. This creates a new Git folder under `\/Workspace\/Users\/<user-email>\/`.\n3. Git folders can be created at various depths of the workspace file tree as long as they are under `\/Workspace\/Users\/<user-email>`. For example, you can create Git folders at `\/Workspace\/Users\/<user-email>\/level_1\/level_2\/level_3\/<git-folder-name>`. 
You can have multiple Git folders under `\/Workspace\/Users\/<user-email>`.\n4. [Unsupported assets](https:\/\/docs.databricks.com\/repos\/manage-assets.html) are allowed in Git folders. Serialization support for other asset types will be added over time.\n5. Unlike Repos, you cannot create a new Git folder in Databricks without a remote repository URL.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/what-happened-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### What happened to Databricks Repos?\n##### Additional details\n\nExisting Repos that users have created are not going away. Users are not required to migrate existing Repos to Git folders. Repos have been integrated into the Workspace UI and are no longer a separate top-level experience in the UI. \n* Existing `\/Repos` references will continue to work: `jobs`, `dbutils.notebook.run` and `%run` references that use notebooks located under `\/Repos` paths will continue to work.\n* The existing `\/Repos` folder will be converted to a normal folder under `\/Workspace` as `\/Workspace\/Repos`, and any special handling may be removed. In rare cases, you may need to make some modification in your workspace for this redirection to work. For more details, see [References to workspace objects](https:\/\/docs.databricks.com\/workspace\/workspace-assets.html#references-to-workspace-objects). \nDatabricks recommends that users create new Git folders instead of Repos if they need to connect to Git source control from the Databricks workspace. Colocating Git repos and other workspace assets makes Git folders more discoverable and easier to manage than Repos. \n**Git folder permissions**\nGit folders have the same [workspace folder permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#folders) as other workspace folders. 
Users must have the `CAN_MANAGE` permission in order to perform most Git operations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/what-happened-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### What happened to Databricks Repos?\n##### Which DBR I should use for running code in Git folders?\n\nFor consistent code execution between Git folders and legacy Repos, Databricks recommends users run code only in Git folders with DBR 15+. \n**Current working directory (CWD) behavior** \nDatabricks Runtime (DBR) version 14 or greater allows for the use of relative paths and provides the same [current working directory (CWD) experience](https:\/\/docs.databricks.com\/files\/cwd-dbr-14.html) for all notebooks, where you run the notebook from the current working directory. [Current working directory (CWD)](https:\/\/docs.databricks.com\/files\/cwd-dbr-14.html) behaviors might be inconsistent between notebooks in a Git folder and a non-Git folder for older versions of the Databricks Runtime (DBR). \n**Python sys.path behavior** \nDatabricks Runtime (DBR) version 14.3 or greater provides the same `sys.path` behavior in Git folders as in legacy Repos. With earlier DBR versions, Git folder behavior differs from legacy Repos as the root repo directory is not automatically added to `sys.path` for Git folders. For Python, `sys.path` contains a list of directories that the interpreter searches when importing modules. If you cannot use DBR 15 or above, you can manually append a folder path to `sys.path` as a workaround. \nFor examples on how to add directories to `sys.path` using relative paths, see [Import Python and R modules](https:\/\/docs.databricks.com\/files\/workspace-modules.html#import-python-and-r-modules). 
\n**Python library precedence** \nDatabricks Runtime (DBR) version 14.3 or greater provides the same [Python library precedence](https:\/\/docs.databricks.com\/libraries\/index.html#python-library-precedence) in Git folders as in legacy Repos.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/what-happened-repos.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use a JAR in a Databricks job\n\nThe Java archive or [JAR](https:\/\/en.wikipedia.org\/wiki\/JAR_(file_format)) file format is based on the popular ZIP file format and is used for aggregating many Java or Scala files into one. Using the JAR task, you can ensure fast and reliable installation of Java or Scala code in your Databricks jobs. This article provides an example of creating a JAR and a job that runs the application packaged in the JAR. In this example, you will: \n* Create the JAR project defining an example application.\n* Bundle the example files into a JAR.\n* Create a job to run the JAR.\n* Run the job and view the results.\n\n##### Use a JAR in a Databricks job\n###### Before you begin\n\nYou need the following to complete this example: \n* For Java JARs, the Java Development Kit (JDK).\n* For Scala JARs, the JDK and sbt.\n\n##### Use a JAR in a Databricks job\n###### Step 1: Create a local directory for the example\n\nCreate a local directory to hold the example code and generated artifacts, for example, `databricks_jar_test`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-jars-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use a JAR in a Databricks job\n###### Step 2: Create the JAR\n\nComplete the following instructions to use Java or Scala to create the JAR. \n### Create a Java JAR \n1. 
From the `databricks_jar_test` folder, create a file named `PrintArgs.java` with the following contents: \n```\nimport java.util.Arrays;\n\npublic class PrintArgs {\npublic static void main(String[] args) {\nSystem.out.println(Arrays.toString(args));\n}\n}\n\n```\n2. Compile the `PrintArgs.java` file, which creates the file `PrintArgs.class`: \n```\njavac PrintArgs.java\n\n```\n3. (Optional) Run the compiled program: \n```\njava PrintArgs Hello World!\n\n# [Hello, World!]\n\n```\n4. In the same folder as the `PrintArgs.java` and `PrintArgs.class` files, create a folder named `META-INF`.\n5. In the `META-INF` folder, create a file named `MANIFEST.MF` with the following contents. Be sure to add a newline at the end of this file: \n```\nMain-Class: PrintArgs\n\n```\n6. From the root of the `databricks_jar_test` folder, create a JAR named `PrintArgs.jar`: \n```\njar cvfm PrintArgs.jar META-INF\/MANIFEST.MF *.class\n\n```\n7. (Optional) To test it, from the root of the `databricks_jar_test` folder, run the JAR: \n```\njava -jar PrintArgs.jar Hello World!\n\n# [Hello, World!]\n\n``` \nNote \nIf you get the error `no main manifest attribute, in PrintArgs.jar`, be sure to add a newline to the end of the `MANIFEST.MF` file, and then try creating and running the JAR again.\n8. Upload `PrintArgs.jar` to a volume. See [Upload files to a Unity Catalog volume](https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-to-volume.html). \n### Create a Scala JAR \n1. From the `databricks_jar_test` folder, create a file named `build.sbt` with the following contents: \n```\nThisBuild \/ scalaVersion := \"2.12.14\"\nThisBuild \/ organization := \"com.example\"\n\nlazy val PrintArgs = (project in file(\".\"))\n.settings(\nname := \"PrintArgs\"\n)\n\n```\n2. From the `databricks_jar_test` folder, create the folder structure `src\/main\/scala\/example`.\n3. 
In the `example` folder, create a file named `PrintArgs.scala` with the following contents: \n```\npackage example\n\nobject PrintArgs {\ndef main(args: Array[String]): Unit = {\nprintln(args.mkString(\", \"))\n}\n}\n\n```\n4. Compile the program: \n```\nsbt compile\n\n```\n5. (Optional) Run the compiled program: \n```\nsbt \"run Hello World\\!\"\n\n# Hello, World!\n\n```\n6. In the `databricks_jar_test\/project` folder, create a file named `assembly.sbt` with the following contents: \n```\naddSbtPlugin(\"com.eed3si9n\" % \"sbt-assembly\" % \"2.0.0\")\n\n```\n7. From the root of the `databricks_jar_test` folder, run the `assembly` command, which generates a JAR under the `target` folder: \n```\nsbt assembly\n\n```\n8. (Optional) To test it, from the root of the `databricks_jar_test` folder, run the JAR: \n```\njava -jar target\/scala-2.12\/PrintArgs-assembly-0.1.0-SNAPSHOT.jar Hello World!\n\n# Hello, World!\n\n```\n9. Upload `PrintArgs-assembly-0.1.0-SNAPSHOT.jar` to a volume. See [Upload files to a Unity Catalog volume](https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-to-volume.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-jars-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use a JAR in a Databricks job\n###### Step 3. Create a Databricks job to run the JAR\n\n1. Go to your Databricks landing page and do one of the following: \n* In the sidebar, click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** and click ![Create Job Button](https:\/\/docs.databricks.com\/_images\/create-job.png).\n* In the sidebar, click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** and select **Job** from the menu.\n2. 
In the task dialog box that appears on the **Tasks** tab, replace **Add a name for your job\u2026** with your job name, for example `JAR example`.\n3. For **Task name**, enter a name for the task, for example `java_jar_task` for Java, or `scala_jar_task` for Scala.\n4. For **Type**, select **JAR**.\n5. For **Main class**, for this example, enter `PrintArgs` for Java, or `example.PrintArgs` for Scala.\n6. For **Cluster**, select a compatible cluster. See [Java and Scala library support](https:\/\/docs.databricks.com\/libraries\/index.html#jar-library-support).\n7. For **Dependent libraries**, click **+ Add**.\n8. In the **Add dependent library** dialog, with **Volumes** selected, enter the location where you uploaded the JAR (`PrintArgs.jar` or `PrintArgs-assembly-0.1.0-SNAPSHOT.jar`) in the previous step into **Volumes File Path**, or filter or browse to find the JAR. Select it.\n9. Click **Add**.\n10. For **Parameters**, for this example, enter `[\"Hello\", \"World!\"]`.\n11. Click **Add**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-jars-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use a JAR in a Databricks job\n###### Step 4: Run the job and view the job run details\n\nClick ![Run Now Button](https:\/\/docs.databricks.com\/_images\/run-now-button.png) to run the workflow. To view [details for the run](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#job-run-details), click **View run** in the **Triggered run** pop-up or click the link in the **Start time** column for the run in the [job runs](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#view-job-run-list) view. 
\nWhen the run completes, the output displays in the **Output** panel, including the arguments passed to the task.\n\n##### Use a JAR in a Databricks job\n###### Output size limits for JAR jobs\n\nJob output, such as log output emitted to stdout, is subject to a 20MB size limit. If the total output has a larger size, the run is canceled and marked as failed. \nTo avoid encountering this limit, you can prevent stdout from being returned from the driver to Databricks by setting the `spark.databricks.driver.disableScalaOutput` Spark configuration to `true`. By default, the flag value is `false`. The flag controls cell output for Scala JAR jobs and Scala notebooks. If the flag is enabled, Spark does not return job execution results to the client. The flag does not affect the data that is written in the cluster\u2019s log files. Databricks recommends setting this flag only for job clusters for JAR jobs because it disables notebook results.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-jars-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use a JAR in a Databricks job\n###### Recommendation: Use the shared `SparkContext`\n\nBecause Databricks is a managed service, some code changes might be necessary to ensure that your Apache Spark jobs run correctly. JAR job programs must use the shared `SparkContext` API to get the `SparkContext`. Because Databricks initializes the `SparkContext`, programs that invoke `new SparkContext()` will fail. To get the `SparkContext`, use only the shared `SparkContext` created by Databricks: \n```\nval goodSparkContext = SparkContext.getOrCreate()\nval goodSparkSession = SparkSession.builder().getOrCreate()\n\n``` \nThere are also several methods you should avoid when using the shared `SparkContext`. 
\n* Do not call `SparkContext.stop()`.\n* Do not call `System.exit(0)` or `sc.stop()` at the end of your `Main` program. This can cause undefined behavior.\n\n##### Use a JAR in a Databricks job\n###### Recommendation: Use `try-finally` blocks for job clean up\n\nConsider a JAR that consists of two parts: \n* `jobBody()`, which contains the main part of the job.\n* `jobCleanup()`, which must run after `jobBody()`, whether that function succeeded or threw an exception. \nFor example, `jobBody()` creates tables and `jobCleanup()` drops those tables. \nThe safe way to ensure that the clean-up method is called is to put a `try-finally` block in the code: \n```\ntry {\njobBody()\n} finally {\njobCleanup()\n}\n\n``` \nYou *should not* try to clean up using `sys.addShutdownHook(jobCleanup)` or the following code: \n```\nval cleanupThread = new Thread { override def run = jobCleanup() }\nRuntime.getRuntime.addShutdownHook(cleanupThread)\n\n``` \nBecause of the way the lifetime of Spark containers is managed in Databricks, the shutdown hooks are not run reliably.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-jars-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use a JAR in a Databricks job\n###### Configuring JAR job parameters\n\nYou pass parameters to JAR jobs with a JSON string array. See the `spark_jar_task` object in the request body passed to the [Create a new job](https:\/\/docs.databricks.com\/api\/workspace\/jobs) operation (`POST \/jobs\/create`) in the Jobs API. To access these parameters, inspect the `String` array passed into your `main` function.\n\n##### Use a JAR in a Databricks job\n###### Manage library dependencies\n\nThe Spark driver has certain library dependencies that cannot be overridden. If your job adds conflicting libraries, the Spark driver library dependencies take precedence. 
\nTo get the full list of the driver library dependencies, run the following command in a notebook attached to a cluster configured with the same Spark version (or the cluster with the driver you want to examine): \n```\n%sh\nls \/databricks\/jars\n\n``` \nWhen you define library dependencies for JARs, Databricks recommends listing Spark and Hadoop as `provided` dependencies. In Maven, add Spark and Hadoop as provided dependencies: \n```\n<dependency>\n<groupId>org.apache.spark<\/groupId>\n<artifactId>spark-core_2.11<\/artifactId>\n<version>2.3.0<\/version>\n<scope>provided<\/scope>\n<\/dependency>\n<dependency>\n<groupId>org.apache.hadoop<\/groupId>\n<artifactId>hadoop-core<\/artifactId>\n<version>1.2.1<\/version>\n<scope>provided<\/scope>\n<\/dependency>\n\n``` \nIn `sbt`, add Spark and Hadoop as provided dependencies: \n```\nlibraryDependencies += \"org.apache.spark\" %% \"spark-core\" % \"2.3.0\" % \"provided\"\nlibraryDependencies += \"org.apache.hadoop\" % \"hadoop-core\" % \"1.2.1\" % \"provided\"\n\n``` \nTip \nSpecify the correct Scala version for your dependencies based on the version you are running.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-jars-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use a JAR in a Databricks job\n###### Next steps\n\nTo learn more about creating and running Databricks jobs, see [Create and run Databricks Jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-jars-in-workflows.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Databricks Vector Search\n#### How to create and query a Vector Search index\n\nThis article describes how to create and query a vector search index using [Databricks Vector 
Search](https:\/\/docs.databricks.com\/generative-ai\/vector-search.html). \nYou can create and manage Vector Search components, like a vector search endpoint and vector search indexes, using the UI, the [Python SDK](https:\/\/api-docs.databricks.com\/python\/vector-search\/databricks.vector_search.html), or the [REST API](https:\/\/docs.databricks.com\/api\/workspace\/vectorsearchendpoints).\n\n#### How to create and query a Vector Search index\n##### Requirements\n\n* Unity Catalog enabled workspace.\n* Serverless compute enabled.\n* Source table must have Change Data Feed enabled.\n* To create an index, you must have CREATE TABLE privileges on the catalog schema(s) in which you create indexes. To query an index that is owned by another user, you must have additional privileges. See [Query a Vector Search endpoint](https:\/\/docs.databricks.com\/generative-ai\/create-query-vector-search.html#query).\n* If you want to use personal access tokens (not recommended for production workloads), check that [Personal access tokens are enabled](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html). To use a service principal token instead, pass it explicitly using SDK or API calls. \nTo use the SDK, you must install it in your notebook. Use the following code: \n```\n%pip install databricks-vectorsearch\n\ndbutils.library.restartPython()\n\nfrom databricks.vector_search.client import VectorSearchClient\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/create-query-vector-search.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Databricks Vector Search\n#### How to create and query a Vector Search index\n##### Create a vector search endpoint\n\nYou can create a vector search endpoint using the Databricks UI, Python SDK, or the API. \n### Create a vector search endpoint using the UI \nFollow these steps to create a vector search endpoint using the UI. \n1. In the left sidebar, click **Compute**.\n2. 
Click the **Vector Search** tab and click **Create**. \n![Create endpoint form](https:\/\/docs.databricks.com\/_images\/create-endpoint.png)\n3. The **Create endpoint form** opens. Enter a name for this endpoint.\n4. Click **Confirm**. \n### Create a vector search endpoint using the Python SDK \nThe following example uses the [create\\_endpoint()](https:\/\/api-docs.databricks.com\/python\/vector-search\/databricks.vector_search.html#databricks.vector_search.client.VectorSearchClient.create_endpoint) SDK function to create a Vector Search endpoint. \n```\n# The following line automatically generates a PAT for authentication\nclient = VectorSearchClient()\n\n# The following line uses the service principal token for authentication\n# client = VectorSearchClient(service_principal_client_id=<CLIENT_ID>,service_principal_client_secret=<CLIENT_SECRET>)\n\nclient.create_endpoint(\nname=\"vector_search_endpoint_name\",\nendpoint_type=\"STANDARD\"\n)\n\n``` \n### Create a vector search endpoint using the REST API \nSee [POST \/api\/2.0\/vector-search\/endpoints](https:\/\/docs.databricks.com\/api\/workspace\/vectorsearchendpoints\/createendpoint).\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/create-query-vector-search.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Databricks Vector Search\n#### How to create and query a Vector Search index\n##### (Optional) Create and configure an endpoint to serve the embedding model\n\nIf you choose to have Databricks compute the embeddings, you must set up a model serving endpoint to serve the embedding model. See [Create foundation model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-foundation-model-endpoints.html) for instructions. For example notebooks, see [Notebook examples for calling an embeddings model](https:\/\/docs.databricks.com\/generative-ai\/create-query-vector-search.html#embedding-model-notebooks). 
\nWhen you configure an embedding endpoint, Databricks recommends that you remove the default selection of **Scale to zero**. Serving endpoints can take a couple of minutes to warm up, and the initial query on an index with a scaled-down endpoint might time out. \nNote \nThe vector search index initialization might time out if the embedding endpoint isn\u2019t configured appropriately for the dataset. You should only use CPU endpoints for small datasets and tests. For larger datasets, use a GPU endpoint for optimal performance.\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/create-query-vector-search.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Databricks Vector Search\n#### How to create and query a Vector Search index\n##### Create a vector search index\n\nYou can create a vector search index using the UI, the Python SDK, or the REST API. The UI is the simplest approach. \nThere are two types of indexes: \n* **Delta Sync Index** automatically syncs with a source Delta Table, automatically and incrementally updating the index as the underlying data in the Delta Table changes.\n* **Direct Vector Access Index** supports direct read and write of vectors and metadata. The user is responsible for updating this table using the REST API or the Python SDK. This type of index cannot be created using the UI. You must use the REST API or the SDK. \n### Create index using the UI \n1. In the left sidebar, click **Catalog** to open the Catalog Explorer UI.\n2. Navigate to the Delta table you want to use.\n3. Click the **Create** button at the upper-right, and select **Vector search index** from the drop-down menu. \n![Create index button](https:\/\/docs.databricks.com\/_images\/create-index-button.png)\n4. Use the selectors in the dialog to configure the index. \n![create index dialog](https:\/\/docs.databricks.com\/_images\/create-index-form.png) \n**Name**: Name to use for the index in Unity Catalog. 
The name requires a three-level namespace, `<catalog>.<schema>.<name>`. Only alphanumeric characters and underscores are allowed. \n**Primary key**: Column to use as a primary key. \n**Endpoint**: Select the model serving endpoint that you want to use. \n**Embedding source**: Indicate if you want Databricks to compute embeddings for a text column in the Delta table (**Compute embeddings**), or if your Delta table contains precomputed embeddings (**Use existing embedding column**). \n* If you selected **Compute embeddings**, select the column that you want embeddings computed for and the endpoint that is serving the embedding model. Only text columns are supported.\n* If you selected **Use existing embedding column**, select the column that contains the precomputed embeddings and the embedding dimension. The format of the precomputed embedding column should be `array[float]`. \n**Sync computed embeddings**: Toggle this setting to save the generated embeddings to a Unity Catalog table. For more information, see [Save generated embedding table](https:\/\/docs.databricks.com\/generative-ai\/create-query-vector-search.html#sync-embeddings-table). \n**Sync mode**: **Continuous** keeps the index in sync with seconds of latency. However, it has a higher cost associated with it since a compute cluster is provisioned to run the continuous sync streaming pipeline. **Triggered** is more cost-effective, but must be started manually using the API. For both **Continuous** and **Triggered**, the update is incremental \u2014 only data that has changed since the last sync is processed.\n5. When you have finished configuring the index, click **Create**. \n### Create index using the Python SDK \nThe following example creates a Delta Sync Index with embeddings computed by Databricks. 
\n```\nclient = VectorSearchClient()\n\nindex = client.create_delta_sync_index(\nendpoint_name=\"vector_search_demo_endpoint\",\nsource_table_name=\"vector_search_demo.vector_search.en_wiki\",\nindex_name=\"vector_search_demo.vector_search.en_wiki_index\",\npipeline_type='TRIGGERED',\nprimary_key=\"id\",\nembedding_source_column=\"text\",\nembedding_model_endpoint_name=\"e5-small-v2\"\n)\n\n``` \nThe following example creates a Direct Vector Access Index. \n```\n\nclient = VectorSearchClient()\n\nindex = client.create_direct_access_index(\nendpoint_name=\"storage_endpoint\",\nindex_name=\"{catalog_name}.{schema_name}.{index_name}\",\nprimary_key=\"id\",\nembedding_dimension=1024,\nembedding_vector_column=\"text_vector\",\nschema={\n\"id\": \"int\",\n\"field2\": \"str\",\n\"field3\": \"float\",\n\"text_vector\": \"array<float>\"}\n)\n\n``` \n### Create index using the REST API \nSee [POST \/api\/2.0\/vector-search\/indexes](https:\/\/docs.databricks.com\/api\/workspace\/vectorsearchindexes\/createindex). \n### Save generated embedding table \nIf Databricks generates the embeddings, you can save the generated embeddings to a table in Unity Catalog. This table is created in the same schema as the vector index and is linked from the vector index page. \nThe name of the table is the name of the vector search index, appended by `_writeback_table`. The name is not editable. \nYou can access and query the table like any other table in Unity Catalog. However, you should not drop or modify the table, as it is not intended to be manually updated. 
The table is deleted automatically if the index is deleted.\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/create-query-vector-search.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Databricks Vector Search\n#### How to create and query a Vector Search index\n##### Update a vector search index\n\n### Update a Delta Sync Index \nIndexes created with **Continuous** sync mode automatically update when the source Delta table changes. If you are using **Triggered** sync mode, you can use the Python SDK or the REST API to start the sync. \n```\nindex.sync()\n\n``` \nSee [REST API](https:\/\/docs.databricks.com\/api\/workspace\/vectorsearchindexes\/syncindex) (POST \/api\/2.0\/vector-search\/indexes\/{index\\_name}\/sync). \n### Update a Direct Vector Access Index \nYou can use the Python SDK or the REST API to insert, update, or delete data from a Direct Vector Access Index. \n```\nindex.upsert([{\"id\": 1,\n\"field2\": \"value2\",\n\"field3\": 3.0,\n\"text_vector\": [1.0, 2.0, 3.0]\n},\n{\"id\": 2,\n\"field2\": \"value2\",\n\"field3\": 3.0,\n\"text_vector\": [1.1, 2.1, 3.0]\n}\n])\n\n``` \nSee [REST API](https:\/\/docs.databricks.com\/api\/workspace\/vectorsearchindexes) (POST \/api\/2.0\/vector-search\/indexes).\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/create-query-vector-search.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Databricks Vector Search\n#### How to create and query a Vector Search index\n##### Query a Vector Search endpoint\n\nYou can only query the Vector Search endpoint using the Python SDK or the REST API. \nNote \nIf the user querying the endpoint is not the owner of the vector search index, the user must have the following UC privileges: \n* USE CATALOG on the catalog that contains the vector search index.\n* USE SCHEMA on the schema that contains the vector search index.\n* SELECT on the vector search index. 
\n```\nresults = index.similarity_search(\nquery_text=\"Greek myths\",\ncolumns=[\"id\", \"text\"],\nnum_results=2\n)\n\nresults\n\n``` \nSee [POST \/api\/2.0\/vector-search\/indexes\/{index\\_name}\/query](https:\/\/docs.databricks.com\/api\/workspace\/vectorsearchindexes\/queryindex). \n### Use filters on queries \nA query can define filters based on any column in the Delta table. `similarity_search` returns only rows that match the specified filters. The following filters are supported: \n| Filter operator | Behavior | Examples |\n| --- | --- | --- |\n| `NOT` | Negates the filter. The key must end with \u201cNOT\u201d. For example, \u201ccolor NOT\u201d with value \u201cred\u201d matches documents where the color is not red. | `{\"id NOT\": 2}` `{\"color NOT\": \"red\"}` |\n| `<` | Checks if the field value is less than the filter value. The key must end with \u201c <\u201d. For example, \u201cprice <\u201d with value 100 matches documents where the price is less than 100. | `{\"id <\": 200}` |\n| `<=` | Checks if the field value is less than or equal to the filter value. The key must end with \u201c <=\u201d. For example, \u201cprice <=\u201d with value 100 matches documents where the price is less than or equal to 100. | `{\"id <=\": 200}` |\n| `>` | Checks if the field value is greater than the filter value. The key must end with \u201c >\u201d. For example, \u201cprice >\u201d with value 100 matches documents where the price is greater than 100. | `{\"id >\": 200}` |\n| `>=` | Checks if the field value is greater than or equal to the filter value. The key must end with \u201c >=\u201d. For example, \u201cprice >=\u201d with value 100 matches documents where the price is greater than or equal to 100. | `{\"id >=\": 200}` |\n| `OR` | Checks if the field value matches any of the filter values. The key must contain `OR` to separate multiple subkeys. 
For example, `color1 OR color2` with value `[\"red\", \"blue\"]` matches documents where either `color1` is `red` or `color2` is `blue`. | `{\"color1 OR color2\": [\"red\", \"blue\"]}` |\n| `LIKE` | Matches partial strings. | `{\"column LIKE\": \"hello\"}` |\n| No filter operator specified | Filter checks for an exact match. If multiple values are specified, it matches any of the values. | `{\"id\": 200}` `{\"id\": [200, 300]}` | \nSee the following code examples: \n```\n# Match rows where `title` exactly matches `Athena` or `Ares`\nresults = index.similarity_search(\nquery_text=\"Greek myths\",\ncolumns=[\"id\", \"text\"],\nfilters={\"title\": [\"Ares\", \"Athena\"]},\nnum_results=2\n)\n\n# Match rows where `title` is `Ares` or `id` is `Athena`\nresults = index.similarity_search(\nquery_text=\"Greek myths\",\ncolumns=[\"id\", \"text\"],\nfilters={\"title OR id\": [\"Ares\", \"Athena\"]},\nnum_results=2\n)\n\n# Match only rows where `title` is not `Hercules`\nresults = index.similarity_search(\nquery_text=\"Greek myths\",\ncolumns=[\"id\", \"text\"],\nfilters={\"title NOT\": \"Hercules\"},\nnum_results=2\n)\n\n``` \nSee [POST \/api\/2.0\/vector-search\/indexes\/{index\\_name}\/query](https:\/\/docs.databricks.com\/api\/workspace\/vectorsearchindexes\/queryindex).\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/create-query-vector-search.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Databricks Vector Search\n#### How to create and query a Vector Search index\n##### Example notebooks\n\nThe examples in this section demonstrate usage of the Vector Search Python SDK. \n### LangChain examples \nSee [How to use LangChain with Databricks Vector Search](https:\/\/python.langchain.com\/docs\/integrations\/vectorstores\/databricks_vector_search) for using Databricks Vector Search as an integration with LangChain packages. 
\nThe following notebook shows how to convert your similarity search results to LangChain documents. \n#### Vector Search with the Python SDK notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/generative-ai\/vector-search-python-sdk-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n### Notebook examples for calling an embeddings model \nThe following notebooks demonstrate how to configure a Databricks Model Serving endpoint for embeddings generation. \n#### Call an OpenAI embeddings model using Databricks Model Serving notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/generative-ai\/vector-search-external-embedding-model-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n#### Call a BGE embeddings model using Databricks Model Serving notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/generative-ai\/vector-search-foundation-embedding-model-bge-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n#### Register and serve an OSS embedding model notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/generative-ai\/embedding-with-oss-models.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/create-query-vector-search.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Authentication settings for the Databricks ODBC Driver\n\nThis article describes how to configure Databricks authentication settings for the [Databricks ODBC 
Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/index.html). \nThe Databricks ODBC Driver supports the following Databricks authentication types: \n* [Databricks personal access token](https:\/\/docs.databricks.com\/integrations\/odbc\/authentication.html#authentication-pat)\n* [Databricks username and password](https:\/\/docs.databricks.com\/integrations\/odbc\/authentication.html#authentication-username-password)\n* [OAuth 2.0 tokens](https:\/\/docs.databricks.com\/integrations\/odbc\/authentication.html#authentication-pass-through)\n* [OAuth user-to-machine (U2M) authentication](https:\/\/docs.databricks.com\/integrations\/odbc\/authentication.html#authentication-u2m)\n* [OAuth machine-to-machine (M2M) authentication](https:\/\/docs.databricks.com\/integrations\/odbc\/authentication.html#authentication-m2m)\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/authentication.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Authentication settings for the Databricks ODBC Driver\n########## Databricks personal access token\n\nTo create a Databricks personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**. \nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. 
If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. See your workspace administrator or the following: \n* [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n* [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nTo authenticate using a Databricks personal access token, add the following configurations to your [compute settings](https:\/\/docs.databricks.com\/integrations\/odbc\/compute.html) and any special or advanced [driver capability settings](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html): \n| Setting | Value |\n| --- | --- |\n| `AuthMech` | 3 |\n| `UID` | `token` |\n| `PWD` | The Databricks personal access token for your workspace user. | \nTo create a DSN for non-Windows systems, use the following format: \n```\n[Databricks]\nDriver=<path-to-driver>\nHost=<server-hostname>\nPort=443\nHTTPPath=<http-path>\nSSL=1\nThriftTransport=2\nAuthMech=3\nUID=token\nPWD=<personal-access-token>\n\n``` \nTo create a DSN-less connection string, use the following format. Line breaks have been added for readability. 
The string must not contain these line breaks: \n```\nDriver=<path-to-driver>;\nHost=<server-hostname>;\nPort=443;\nHTTPPath=<http-path>;\nSSL=1;\nThriftTransport=2;\nAuthMech=3;\nUID=token;\nPWD=<personal-access-token>\n\n``` \n* To get the value for `<path-to-driver>`, see [Download and install the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html).\n* To get the values for `<server-hostname>` and `<http-path>`, see [Compute settings for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/compute.html).\n* You can also add special or advanced [driver capability settings](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/authentication.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Authentication settings for the Databricks ODBC Driver\n########## Databricks username and password\n\nDatabricks username and password authentication is also known as Databricks *basic* authentication. \nUsername and password authentication is possible only if [single sign-on](https:\/\/docs.databricks.com\/admin\/users-groups\/single-sign-on\/index.html) is disabled. \nTo authenticate using a Databricks username and password, add the following configurations to your [compute settings](https:\/\/docs.databricks.com\/integrations\/odbc\/compute.html) and any special or advanced [driver capability settings](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html): \n| Setting | Value |\n| --- | --- |\n| `AuthMech` | 3 |\n| `UID` | The username. |\n| `PWD` | The password. 
| \nTo create a DSN for non-Windows systems, use the following format: \n```\n[Databricks]\nDriver=<path-to-driver>\nHost=<server-hostname>\nPort=443\nHTTPPath=<http-path>\nSSL=1\nThriftTransport=2\nAuthMech=3\nUID=<username>\nPWD=<password>\n\n``` \nTo create a DSN-less connection string, use the following format. Line breaks have been added for readability. The string must not contain these line breaks: \n```\nDriver=<path-to-driver>;\nHost=<server-hostname>;\nPort=443;\nHTTPPath=<http-path>;\nSSL=1;\nThriftTransport=2;\nAuthMech=3;\nUID=<username>;\nPWD=<password>\n\n``` \n* To get the value for `<path-to-driver>`, see [Download and install the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html).\n* To get the values for `<server-hostname>` and `<http-path>`, see [Compute settings for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/compute.html).\n* You can also add special or advanced [driver capability settings](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/authentication.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Authentication settings for the Databricks ODBC Driver\n########## OAuth 2.0 tokens\n\nODBC driver 2.7.5 and above supports an OAuth 2.0 token for a Databricks user or service principal. This is also known as OAuth 2.0 *token pass-through* authentication. \nTo create an OAuth 2.0 token for token pass-through authentication, do the following: \n* For a user, you can use the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/install.html) to generate the OAuth 2.0 token by initiating the OAuth U2M process, and then get the generated OAuth 2.0 token by running the `databricks auth token` command. 
See [OAuth user-to-machine (U2M) authentication](https:\/\/docs.databricks.com\/dev-tools\/cli\/authentication.html#u2m-auth). OAuth 2.0 tokens have a default lifetime of 1 hour. To generate a new OAuth 2.0 token, repeat this process.\n* For a service principal, follow Steps 1-3 in [Manually generate and use access tokens for OAuth machine-to-machine (M2M) authentication](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html#oauth-m2m-manual). Make a note of the service principal\u2019s OAuth `access_token` value. OAuth 2.0 tokens have a default lifetime of 1 hour. To generate a new OAuth 2.0 token, repeat this process. \nTo authenticate using OAuth 2.0 token pass-through authentication, add the following configurations to your [compute settings](https:\/\/docs.databricks.com\/integrations\/odbc\/compute.html) and any special or advanced [driver capability settings](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html): \n| Setting | Value |\n| --- | --- |\n| `AuthMech` | 11 |\n| `Auth_Flow` | 0 |\n| `Auth_AccessToken` | The OAuth 2.0 token. | \nTo create a DSN for non-Windows systems, use the following format: \n```\n[Databricks]\nDriver=<path-to-driver>\nHost=<server-hostname>\nPort=443\nHTTPPath=<http-path>\nSSL=1\nThriftTransport=2\nAuthMech=11\nAuth_Flow=0\nAuth_AccessToken=<oauth-token>\n\n``` \nTo create a DSN-less connection string, use the following format. Line breaks have been added for readability. 
The string must not contain these line breaks: \n```\nDriver=<path-to-driver>;\nHost=<server-hostname>;\nPort=443;\nHTTPPath=<http-path>;\nSSL=1;\nThriftTransport=2;\nAuthMech=11;\nAuth_Flow=0;\nAuth_AccessToken=<oauth-token>\n\n``` \n* To get the value for `<path-to-driver>`, see [Download and install the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html).\n* To get the values for `<server-hostname>` and `<http-path>`, see [Compute settings for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/compute.html).\n* You can also add special or advanced [driver capability settings](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html). \nFor more information, see the `Token Pass-through` sections in the [Databricks ODBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Simba-Apache-Spark-ODBC-Connector-Install-and-Configuration-Guide.pdf).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/authentication.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Authentication settings for the Databricks ODBC Driver\n########## OAuth user-to-machine (U2M) authentication\n\nODBC driver 2.7.5 and above supports OAuth user-to-machine (U2M) authentication for a Databricks user. This is also known as OAuth 2.0 *browser-based* authentication. \nOAuth U2M or OAuth 2.0 browser-based authentication has no prerequisites. OAuth 2.0 tokens have a default lifetime of 1 hour. OAuth U2M or OAuth 2.0 browser-based authentication should refresh expired OAuth 2.0 tokens for you automatically. \nNote \nOAuth U2M or OAuth 2.0 browser-based authentication works only with applications that run locally. It does not work with server-based or cloud-based applications. 
\nTo authenticate using OAuth user-to-machine (U2M) or OAuth 2.0 browser-based authentication, add the following configurations to your [compute settings](https:\/\/docs.databricks.com\/integrations\/odbc\/compute.html) and any special or advanced [driver capability settings](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html): \n| Configuration | Value |\n| --- | --- |\n| `AuthMech` | 11 |\n| `Auth_Flow` | 2 |\n| `PWD` | A password of your choice. The driver uses this key for refresh token encryption. |\n| `OAuth2ClientId` (optional) | `power-bi,tableau-desktop,databricks-cli,` `databricks-sql-python,databricks-sql-jdbc,` `databricks-sql-odbc,databricks-dbt-adapter,` `databricks-sql-connector` (default) |\n| `Auth_Scope` (optional) | `sql,offline_access` (default) | \nTo create a DSN for non-Windows systems, use the following format: \n```\n[Databricks]\nDriver=<path-to-driver>\nHost=<server-hostname>\nPort=443\nHTTPPath=<http-path>\nSSL=1\nThriftTransport=2\nAuthMech=11\nAuth_Flow=2\nPWD=<password>\n\n``` \nTo create a DSN-less connection string, use the following format. Line breaks have been added for readability. The string must not contain these line breaks: \n```\nDriver=<path-to-driver>;\nHost=<server-hostname>;\nPort=443;\nHTTPPath=<http-path>;\nSSL=1;\nThriftTransport=2;\nAuthMech=11;\nAuth_Flow=2;\nPWD=<password>\n\n``` \n* To get the value for `<path-to-driver>`, see [Download and install the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html).\n* To get the values for `<server-hostname>` and `<http-path>`, see [Compute settings for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/compute.html).\n* You can also add special or advanced [driver capability settings](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html). 
\nFor more information, see the `Browser Based` sections in the [Databricks ODBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Simba-Apache-Spark-ODBC-Connector-Install-and-Configuration-Guide.pdf).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/authentication.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Authentication settings for the Databricks ODBC Driver\n########## OAuth machine-to-machine (M2M) authentication\n\nODBC driver 2.7.5 and above supports OAuth machine-to-machine (M2M) authentication for a Databricks service principal. This is also known as OAuth 2.0 *client credentials* authentication. \nTo configure OAuth M2M or OAuth 2.0 client credentials authentication, do the following: \n1. Create a Databricks service principal in your Databricks workspace, and create an OAuth secret for that service principal. \nTo create the service principal and its OAuth secret, see [OAuth machine-to-machine (M2M) authentication](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). Make a note of the service principal\u2019s **UUID** or **Application ID** value, and the **Secret** value for the service principal\u2019s OAuth secret.\n2. Give the service principal access to your cluster or warehouse. See [Compute permissions](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions) or [Manage a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html#manage). 
\nTo authenticate using OAuth machine-to-machine (M2M) or OAuth 2.0 client credentials authentication, add the following configurations to your [compute settings](https:\/\/docs.databricks.com\/integrations\/odbc\/compute.html) and any special or advanced [driver capability settings](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html): \n| Setting | Value |\n| --- | --- |\n| `AuthMech` | 11 |\n| `Auth_Flow` | 1 |\n| `Auth_Client_Id` | The service principal\u2019s **UUID**\/**Application ID** value. |\n| `Auth_Client_Secret` | The service principal\u2019s OAuth **Secret** value. |\n| `OAuth2ClientId` (optional) | `power-bi,tableau-desktop,databricks-cli,` `databricks-sql-python,databricks-sql-jdbc,` `databricks-sql-odbc,databricks-dbt-adapter,` `databricks-sql-connector` (default) |\n| `Auth_Scope` (optional) | `all-apis` (default) | \nTo create a DSN for non-Windows systems, use the following format: \n```\n[Databricks]\nDriver=<path-to-driver>\nHost=<server-hostname>\nPort=443\nHTTPPath=<http-path>\nSSL=1\nThriftTransport=2\nAuthMech=11\nAuth_Flow=1\nAuth_Client_Id=<service-principal-application-ID>\nAuth_Client_Secret=<service-principal-secret>\nAuth_Scope=all-apis\n\n``` \nTo create a DSN-less connection string, use the following format. Line breaks have been added for readability. 
The string must not contain these line breaks: \n```\nDriver=<path-to-driver>;\nHost=<server-hostname>;\nPort=443;\nHTTPPath=<http-path>;\nSSL=1;\nThriftTransport=2;\nAuthMech=11;\nAuth_Flow=1;\nAuth_Client_Id=<service-principal-application-ID>;\nAuth_Client_Secret=<service-principal-secret>;\nAuth_Scope=all-apis\n\n``` \n* To get the value for `<path-to-driver>`, see [Download and install the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html).\n* To get the values for `<server-hostname>` and `<http-path>`, see [Compute settings for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/compute.html).\n* You can also add special or advanced [driver capability settings](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html). \nFor more information, see the `Client Credentials` sections in the [Databricks ODBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Simba-Apache-Spark-ODBC-Connector-Install-and-Configuration-Guide.pdf).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/authentication.html"} +{"content":"# Connect to data sources\n### What is Lakehouse Federation\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article introduces Lakehouse Federation, the query federation platform that enables you to use Databricks to run queries against multiple external data sources. It also describes how to set up Lakehouse Federation *connections* and create *foreign catalogs* in your Unity Catalog metastore.\n\n### What is Lakehouse Federation\n#### What is Lakehouse Federation?\n\nLakehouse Federation is the query federation platform for Databricks. The term *query federation* describes a collection of features that enable users and systems to run queries against multiple data sources without needing to migrate all data to a unified system. 
\nDatabricks uses Unity Catalog to manage query federation. You configure read-only connections to popular database solutions using drivers that are included on Pro SQL Warehouses, Serverless SQL Warehouses, and Databricks Runtime clusters. Unity Catalog\u2019s data governance and data lineage tools ensure that data access is managed and audited for all federated queries made by the users in your Databricks workspaces.\n\n### What is Lakehouse Federation\n#### Why use Lakehouse Federation?\n\nThe lakehouse emphasizes central storage of data to reduce data redundancy and isolation. Your organization may have numerous data systems in production, and you might want to query data in connected systems for a number of reasons: \n* Ad hoc reporting.\n* Proof-of-concept work.\n* The exploratory phase of new ETL pipelines or reports.\n* Supporting workloads during incremental migration. \nIn each of these scenarios, query federation gets you to insights faster, because you can query the data in place and avoid complex and time-consuming ETL processing. 
\nLakehouse Federation is meant for use cases when: \n* You don\u2019t want to ingest data into Databricks.\n* You want your queries to take advantage of compute in the external database system.\n* You want the advantages of Unity Catalog interfaces and data governance, including fine-grained access control, data lineage, and search.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/index.html"} +{"content":"# Connect to data sources\n### What is Lakehouse Federation\n#### Overview of Lakehouse Federation setup\n\nTo make a dataset available for read-only querying using Lakehouse Federation, you create the following: \n* A *connection*, a securable object in Unity Catalog that specifies a path and credentials for accessing an external database system.\n* A *foreign catalog*, a securable object in Unity Catalog that mirrors a database in an external data system, enabling you to perform read-only queries on that data system in your Databricks workspace, managing access using Unity Catalog.\n\n### What is Lakehouse Federation\n#### Supported data sources\n\nLakehouse Federation supports connections to the following database types: \n* [MySQL](https:\/\/docs.databricks.com\/query-federation\/mysql.html)\n* [PostgreSQL](https:\/\/docs.databricks.com\/query-federation\/postgresql.html)\n* [Amazon Redshift](https:\/\/docs.databricks.com\/query-federation\/redshift.html)\n* [Snowflake](https:\/\/docs.databricks.com\/query-federation\/snowflake.html)\n* [Microsoft SQL Server](https:\/\/docs.databricks.com\/query-federation\/sql-server.html)\n* [Azure Synapse (SQL Data Warehouse)](https:\/\/docs.databricks.com\/query-federation\/sqldw.html)\n* [Google BigQuery](https:\/\/docs.databricks.com\/query-federation\/bigquery.html)\n* [Databricks](https:\/\/docs.databricks.com\/query-federation\/databricks.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/index.html"} +{"content":"# Connect to data sources\n### What is Lakehouse Federation\n#### 
Connection requirements\n\nWorkspace requirements: \n* Workspace enabled for Unity Catalog. \nCompute requirements: \n* Network connectivity from your Databricks Runtime cluster or SQL warehouse to the target database systems. See [Networking recommendations for Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/networking.html).\n* Databricks clusters must use Databricks Runtime 13.3 LTS or above and shared or single-user access mode.\n* SQL warehouses must be Pro or Serverless. \nPermissions required: \n* To create a connection, you must be a metastore admin or a user with the `CREATE CONNECTION` privilege on the Unity Catalog metastore attached to the workspace.\n* To create a foreign catalog, you must have the `CREATE CATALOG` permission on the metastore and be either the owner of the connection or have the `CREATE FOREIGN CATALOG` privilege on the connection. \nAdditional permission requirements are specified in each task-based section that follows.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/index.html"} +{"content":"# Connect to data sources\n### What is Lakehouse Federation\n#### Create a connection\n\nA connection specifies a path and credentials for accessing an external database system. To create a connection, you can use Catalog Explorer or the `CREATE CONNECTION` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** Metastore admin or user with the `CREATE CONNECTION` privilege. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **External Data** menu and select **Connections**.\n3. Click **Create connection**.\n4. Enter a user-friendly **Connection name**.\n5. Select the **Connection type** (database provider, like MySQL or PostgreSQL).\n6. Enter the connection properties (such as host information, path, and access credentials). 
\nEach connection type requires different connection information. See the article for your connection type, listed in the table of contents to the left.\n7. (Optional) Click **Test connection** to confirm that it works.\n8. (Optional) Add a comment.\n9. Click **Create**. \nRun the following command in a notebook or the Databricks SQL query editor. This example is for connections to a PostgreSQL database. The options differ by connection type. See the article for your connection type, listed in the table of contents to the left. \n```\nCREATE CONNECTION <connection-name> TYPE postgresql\nOPTIONS (\nhost '<hostname>',\nport '<port>',\nuser '<user>',\npassword '<password>'\n);\n\n``` \nWe recommend that you use Databricks [secrets](https:\/\/docs.databricks.com\/security\/secrets\/index.html) instead of plaintext strings for sensitive values like credentials. For example: \n```\nCREATE CONNECTION <connection-name> TYPE postgresql\nOPTIONS (\nhost '<hostname>',\nport '<port>',\nuser secret ('<secret-scope>','<secret-key-user>'),\npassword secret ('<secret-scope>','<secret-key-password>')\n)\n\n``` \nFor information about setting up secrets, see [Secret management](https:\/\/docs.databricks.com\/security\/secrets\/index.html). \nFor information about managing existing connections, see [Manage connections for Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/connections.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/index.html"} +{"content":"# Connect to data sources\n### What is Lakehouse Federation\n#### Create a foreign catalog\n\nA foreign catalog mirrors a database in an external data system so that you can query and manage access to data in that database using Databricks and Unity Catalog. To create a foreign catalog, you use a connection to the data source that has already been defined. 
\nTo create a foreign catalog, you can use Catalog Explorer or the `CREATE FOREIGN CATALOG` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** `CREATE CATALOG` permission on the metastore and either ownership of the connection or the `CREATE FOREIGN CATALOG` privilege on the connection. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Click the **Create Catalog** button.\n3. On the **Create a new catalog** dialog, enter a name for the catalog and select a **Type** of **Foreign**.\n4. Select the **Connection** that provides access to the database that you want to mirror as a Unity Catalog catalog.\n5. Enter the name of the **Database** that you want to mirror as a catalog. \nRequirements differ depending on the data source: \n* MySQL uses a two-layer namespace and therefore does not require a database name.\n* For connections to a catalog in another Databricks workspace, enter the Databricks **Catalog** name instead of a database name.\n6. Click **Create**. \n1. Run the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional. Replace the placeholder values: \n* `<catalog-name>`: Name for the catalog in Databricks.\n* `<connection-name>`: The [connection object](https:\/\/docs.databricks.com\/query-federation\/index.html#connection) that specifies the data source, path, and access credentials.\n* `<database-name>`: Name of the database you want to mirror as a catalog in Databricks. Not required for MySQL, which uses a two-layer namespace.\n* `<external-catalog-name>`: *Databricks-to-Databricks* only: Name of the catalog in the external Databricks workspace that you are mirroring. 
See [Create a foreign catalog](https:\/\/docs.databricks.com\/query-federation\/databricks.html#foreign-catalog).\n```\nCREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>\nOPTIONS (database '<database-name>');\n\n``` \nFor information about managing and working with foreign catalogs, see [Manage and work with foreign catalogs](https:\/\/docs.databricks.com\/query-federation\/foreign-catalogs.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/index.html"} +{"content":"# Connect to data sources\n### What is Lakehouse Federation\n#### Lakehouse federation and materialized views\n\nDatabricks recommends loading external data using Lakehouse Federation when you are creating materialized views. See [Use materialized views in Databricks SQL](https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html). \nWhen you use Lakehouse Federation, users can reference the federated data as follows: \n```\nCREATE MATERIALIZED VIEW xyz AS SELECT * FROM federated_catalog.federated_schema.federated_table;\n\n```\n\n### What is Lakehouse Federation\n#### Limitations\n\n* Queries are read-only.\n* Throttling of connections is determined using the Databricks SQL concurrent query limit. There is no limit across warehouses per connection. See [Queueing and autoscaling for pro and classic SQL warehouses](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/warehouse-behavior.html#scaling).\n* Tables and schemas with names that are invalid in Unity Catalog are not supported and are ignored by Unity Catalog upon creation of a foreign catalog. See the list of naming rules and limitations in [Unity Catalog limitations](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#limitations).\n* Table names and schema names are converted to lowercase in Unity Catalog. Lookups must also use lowercase names. 
If there are tables or schemas with duplicate lowercase names, only one of the tables or schemas is imported into the foreign catalog. \n* Private Link and static IP range support on Serverless SQL warehouses is not available. \n* For each foreign table referenced, Databricks schedules a subquery in the remote system to return a subset of data from that table and then returns the result to one Databricks executor task over a single stream.\n* Single-user access mode is only available for users that own the connection.\n* Lakehouse Federation cannot federate foreign tables with case-sensitive identifiers for Azure Synapse connections or Redshift connections.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/index.html"} +{"content":"# \n### Visualization deep dive in Python\n#### Charts and graphs Python notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/charts-and-graphs-python.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/charts-and-graphs-python.html"} +{"content":"# What is Delta Lake?\n### Enrich Delta Lake tables with custom metadata\n\nDatabricks recommends always providing comments for tables and columns in tables. You can generate these comments using AI. See [Add AI-generated comments to a table](https:\/\/docs.databricks.com\/catalog-explorer\/ai-comments.html). \nUnity Catalog also provides the ability to tag data. See [Apply tags to Unity Catalog securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/tags.html). 
\nYou can also log messages for individual commits to tables in a field in the Delta Lake transaction log.\n\n### Enrich Delta Lake tables with custom metadata\n#### Set user-defined commit metadata\n\nYou can specify user-defined strings as metadata in commits, either using the DataFrameWriter option `userMetadata` or the SparkSession configuration `spark.databricks.delta.commitInfo.userMetadata`. If both are specified, the option takes precedence. This user-defined metadata is readable in the `DESCRIBE HISTORY` operation. See [Work with Delta Lake table history](https:\/\/docs.databricks.com\/delta\/history.html). \n```\nSET spark.databricks.delta.commitInfo.userMetadata=overwritten-for-fixing-incorrect-data;\nINSERT OVERWRITE default.people10m SELECT * FROM morePeople;\n\n``` \n```\ndf.write.format(\"delta\") \\\n.mode(\"overwrite\") \\\n.option(\"userMetadata\", \"overwritten-for-fixing-incorrect-data\") \\\n.save(\"\/tmp\/delta\/people10m\")\n\n``` \n```\ndf.write.format(\"delta\")\n.mode(\"overwrite\")\n.option(\"userMetadata\", \"overwritten-for-fixing-incorrect-data\")\n.save(\"\/tmp\/delta\/people10m\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/custom-metadata.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Create custom model serving endpoints\n\nThis article describes how to create model serving endpoints that serve [custom models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/custom-models.html) using Databricks [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html). 
\nModel Serving provides the following options for serving endpoint creation: \n* The Serving UI\n* REST API\n* MLflow Deployments SDK \nFor creating endpoints that serve generative AI foundation models, see [Create foundation model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-foundation-model-endpoints.html).\n\n#### Create custom model serving endpoints\n##### Requirements\n\n* Your workspace must be in a [supported region](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html#regions).\n* If you use custom libraries or libraries from a private mirror server with your model, see [Use custom Python libraries with Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/private-libraries-model-serving.html) before you create the model endpoint.\n* To create endpoints using the MLflow Deployments SDK, you must create an MLflow Deployments client. To create it, run: \n```\nimport mlflow.deployments\n\nclient = mlflow.deployments.get_deploy_client(\"databricks\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Create custom model serving endpoints\n##### Access control\n\nTo understand the access control options for managing model serving endpoints, see [Manage permissions on your model serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/manage-serving-endpoints.html#permissions). 
\nYou can also: \n* [Add an instance profile to a model serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/add-model-serving-instance-profile.html)\n* [Configure access to resources from model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/store-env-variable-model-serving.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Create custom model serving endpoints\n##### Create an endpoint\n\nYou can create an endpoint for model serving with the **Serving** UI. \n1. Click **Serving** in the sidebar to display the Serving UI.\n2. Click **Create serving endpoint**. \n![Model serving pane in Databricks UI](https:\/\/docs.databricks.com\/_images\/serving-pane.png) \nFor models registered in the Workspace model registry or models in Unity Catalog: \n1. In the **Name** field provide a name for your endpoint.\n2. In the **Served entities** section \n1. Click into the **Entity** field to open the **Select served entity** form.\n2. Select the type of model you want to serve. The form dynamically updates based on your selection.\n3. Select which model and model version you want to serve.\n4. Select the percentage of traffic to route to your served model.\n5. Select what size compute to use. You can use CPU or GPU computes for your workloads. Support for model serving on GPU is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). See [GPU workload types](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html#gpu) for more information on available GPU computes.\n6. Under **Compute Scale-out**, select the size of the compute scale out that corresponds with the number of requests this served model can process at the same time. This number should be roughly equal to QPS x model run time. \n1. 
Available sizes are **Small** for 0-4 requests, **Medium** for 8-16 requests, and **Large** for 16-64 requests.\n7. Specify if the endpoint should scale to zero when not in use.\n8. Under Advanced configuration, you can [add an instance profile](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/add-model-serving-instance-profile.html) to connect to AWS resources from your endpoint.\n3. Click **Create**. The **Serving endpoints** page appears with **Serving endpoint state** shown as Not Ready. \n![Create a model serving endpoint](https:\/\/docs.databricks.com\/_images\/create-endpoint1.png) \nYou can create endpoints using the REST API. See [POST \/api\/2.0\/serving-endpoints](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/create) for endpoint configuration parameters. \nThe following example creates an endpoint that serves versions 3 and 4 of the `my-ads-model` model that is registered in the model registry, splitting traffic between them. To specify a model from Unity Catalog, provide the full model name including parent catalog and schema, such as `catalog.schema.example-model`. \n```\nPOST \/api\/2.0\/serving-endpoints\n\n{\n\"name\": \"workspace-model-endpoint\",\n\"config\":{\n\"served_entities\": [\n{\n\"name\": \"ads-entity\",\n\"entity_name\": \"my-ads-model\",\n\"entity_version\": \"3\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": true\n},\n{\n\"entity_name\": \"my-ads-model\",\n\"entity_version\": \"4\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": true\n}\n],\n\"traffic_config\":{\n\"routes\": [\n{\n\"served_model_name\": \"my-ads-model-3\",\n\"traffic_percentage\": 80\n},\n{\n\"served_model_name\": \"my-ads-model-4\",\n\"traffic_percentage\": 20\n}\n]\n}\n},\n\"tags\": [\n{\n\"key\": \"team\",\n\"value\": \"data science\"\n}\n]\n}\n\n``` \nThe following is an example response. The endpoint\u2019s `config_update` state is `NOT_UPDATING` and the served model is in a `READY` state. 
\n```\n{\n\"name\": \"workspace-model-endpoint\",\n\"creator\": \"user@email.com\",\n\"creation_timestamp\": 1700089637000,\n\"last_updated_timestamp\": 1700089760000,\n\"state\": {\n\"ready\": \"READY\",\n\"config_update\": \"NOT_UPDATING\"\n},\n\"config\": {\n\"served_entities\": [\n{\n\"name\": \"ads-entity\",\n\"entity_name\": \"my-ads-model-3\",\n\"entity_version\": \"3\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": true,\n\"workload_type\": \"CPU\",\n\"state\": {\n\"deployment\": \"DEPLOYMENT_READY\",\n\"deployment_state_message\": \"\"\n},\n\"creator\": \"user@email.com\",\n\"creation_timestamp\": 1700089760000\n}\n],\n\"traffic_config\": {\n\"routes\": [\n{\n\"served_model_name\": \"my-ads-model-3\",\n\"traffic_percentage\": 100\n}\n]\n},\n\"config_version\": 1\n},\n\"tags\": [\n{\n\"key\": \"team\",\n\"value\": \"data science\"\n}\n],\n\"id\": \"e3bd3e471d6045d6b75f384279e4b6ab\",\n\"permission_level\": \"CAN_MANAGE\",\n\"route_optimized\": false\n}\n\n``` \n[MLflow Deployments](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.deployments.html?highlight=deployments#mlflow.deployments.DatabricksDeploymentClient) provides an API for create, update and deletion tasks. The APIs for these tasks accept the same parameters as the REST API for serving endpoints. See [POST \/api\/2.0\/serving-endpoints](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/create) for endpoint configuration parameters. 
\n```\n\nfrom mlflow.deployments import get_deploy_client\n\nclient = get_deploy_client(\"databricks\")\nendpoint = client.create_endpoint(\nname=\"workspace-model-endpoint\",\nconfig={\n\"served_entities\": [\n{\n\"name\": \"ads-entity\",\n\"entity_name\": \"my-ads-model\",\n\"entity_version\": \"3\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": True\n}\n],\n\"traffic_config\": {\n\"routes\": [\n{\n\"served_model_name\": \"my-ads-model-3\",\n\"traffic_percentage\": 100\n}\n]\n}\n}\n)\n\n``` \nYou can also: \n* [Configure your endpoint to serve multiple models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/serve-multiple-models-to-serving-endpoint.html).\n* [Configure your endpoint for route optimization](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/route-optimization.html).\n* [Configure your endpoint to access external resources using Databricks Secrets](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/store-env-variable-model-serving.html).\n* [Add an instance profile to your model serving endpoint to access AWS resources](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/add-model-serving-instance-profile.html).\n* [Enable inference tables](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/enable-model-serving-inference-tables.html) to automatically capture incoming requests and outgoing responses to your model serving endpoints. \n### GPU workload types \nGPU deployment is compatible with the following package versions: \n* PyTorch 1.13.0 - 2.0.1\n* TensorFlow 2.5.0 - 2.13.0\n* MLflow 2.4.0 and above \nTo deploy your models using GPUs, include the `workload_type` field in your endpoint configuration during [endpoint creation](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html#create) or as an endpoint configuration update using the API. 
To configure your endpoint for GPU workloads with the **Serving** UI, select the desired GPU type from the **Compute Type** dropdown. \n```\n{\n\"served_entities\": [{\n\"name\": \"ads1\",\n\"entity_version\": \"2\",\n\"workload_type\": \"GPU_MEDIUM\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": false\n}]\n}\n\n``` \nThe following table summarizes the supported GPU workload types. \n| GPU workload type | GPU instance | GPU memory |\n| --- | --- | --- |\n| `GPU_SMALL` | 1xT4 | 16GB |\n| `GPU_MEDIUM` | 1xA10G | 24GB |\n| `MULTIGPU_MEDIUM` | 4xA10G | 96GB |\n| `GPU_MEDIUM_8` | 8xA10G | 192GB |\n| `GPU_LARGE_8` | 8xA100-80GB | 320GB |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Create custom model serving endpoints\n##### Modify a custom model endpoint\n\nAfter enabling a custom model endpoint, you can update the compute configuration as desired. This configuration is particularly helpful if you need additional resources for your model. Workload size and compute configuration play a key role in what resources are allocated for serving your model. \nUntil the new configuration is ready, the old configuration keeps serving prediction traffic. While there is an update in progress, another update cannot be made. However, you can cancel an in-progress update from the Serving UI. \nAfter you enable a model endpoint, select **Edit endpoint** to modify the compute configuration of your endpoint. \nYou can do the following: \n* Choose from a few workload sizes, and autoscaling is automatically configured within the workload size.\n* Specify if your endpoint should scale down to zero when not in use.\n* Modify the percent of traffic to route to your served model. \nYou can cancel an in-progress configuration update by selecting **Cancel update** on the top right of the endpoint\u2019s details page. 
This functionality is only available in the Serving UI. \nThe following is an endpoint configuration update example using the REST API. See [PUT \/api\/2.0\/serving-endpoints\/{name}\/config](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/updateconfig). \n```\nPUT \/api\/2.0\/serving-endpoints\/{name}\/config\n\n{\n\"name\": \"workspace-model-endpoint\",\n\"config\":{\n\"served_entities\": [\n{\n\"name\": \"ads-entity\",\n\"entity_name\": \"my-ads-model\",\n\"entity_version\": \"5\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": true\n}\n],\n\"traffic_config\":{\n\"routes\": [\n{\n\"served_model_name\": \"my-ads-model-5\",\n\"traffic_percentage\": 100\n}\n]\n}\n}\n}\n\n``` \nThe MLflow Deployments SDK uses the same parameters as the REST API. See [PUT \/api\/2.0\/serving-endpoints\/{name}\/config](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/updateconfig) for request and response schema details. \nThe following code sample uses a model from the Unity Catalog model registry: \n```\nimport mlflow\nfrom mlflow.deployments import get_deploy_client\n\nmlflow.set_registry_uri(\"databricks-uc\")\nclient = get_deploy_client(\"databricks\")\n\nendpoint = client.create_endpoint(\nname=f\"{endpointname}\",\nconfig={\n\"served_entities\": [\n{\n\"entity_name\": f\"{catalog}.{schema}.{model_name}\",\n\"entity_version\": \"1\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": True\n}\n],\n\"traffic_config\": {\n\"routes\": [\n{\n\"served_model_name\": f\"{model_name}-1\",\n\"traffic_percentage\": 100\n}\n]\n}\n}\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Create custom model serving endpoints\n##### Scoring a model endpoint\n\nTo score your model, send requests to the model serving endpoint. 
\n* See [Query serving endpoints for custom models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-custom-model-endpoints.html).\n* See [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html).\n\n#### Create custom model serving endpoints\n##### Additional resources\n\n* [Manage model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/manage-serving-endpoints.html).\n* [Query serving endpoints for custom models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-custom-model-endpoints.html).\n* [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html).\n* [External models in Databricks Model Serving](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html).\n* [Inference tables for monitoring and debugging models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html).\n* If you prefer to use Python, you can use the [Databricks real-time serving Python SDK](https:\/\/databricks-sdk-py.readthedocs.io\/en\/latest\/dbdataclasses\/serving.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Create custom model serving endpoints\n##### Notebook examples\n\nThe following notebooks include different Databricks registered models that you can use to get up and running with model serving endpoints. \nThe model examples can be imported into the workspace by following the directions in [Import a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html#import-notebook). 
After you choose and create a model from one of the examples, [register it in the MLflow Model Registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html), and then follow the [UI workflow](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html#create) steps for model serving. \n### Train and register a scikit-learn model for model serving notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/train-register-scikit-model-serving.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n### Train and register a HuggingFace model for model serving notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/train-register-hugging-face-model-serving.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### How Databricks AutoML works\n\nThis article details how Databricks AutoML works and its implementation of concepts like missing value imputation and [large data sampling](https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html#automl-sampling). \nDatabricks AutoML performs the following: \n1. Prepares the dataset for model training. For example, AutoML carries out [imbalanced data detection](https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html#automl-imbalanced-dataset-support) for classification problems prior to model training.\n2. 
Iterates to train and [tune](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/index.html) multiple models, where each model is constructed from open source components and can easily be edited and integrated into your machine learning pipelines. \n* AutoML automatically distributes hyperparameter tuning trials across the worker nodes of a cluster.\n* With Databricks Runtime 9.1 LTS ML or above, AutoML automatically samples your dataset if it is too large to fit into the memory of a single worker node. See [Sampling large datasets](https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html#automl-sampling).\n3. Evaluates models based on algorithms from the [scikit-learn](https:\/\/scikit-learn.org\/stable\/), [xgboost](https:\/\/xgboost.readthedocs.io\/en\/latest\/), [LightGBM](https:\/\/lightgbm.readthedocs.io\/en\/latest\/index.html), [Prophet](https:\/\/facebook.github.io\/prophet\/docs\/), and [ARIMA](https:\/\/pypi.org\/project\/pmdarima\/) packages.\n4. Displays the results and provides a Python notebook with the source code for each trial run so you can review, reproduce, and modify the code. AutoML also calculates summary statistics on your dataset and saves this information in a notebook that you can review later.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### How Databricks AutoML works\n##### AutoML algorithms\n\nDatabricks AutoML trains and evaluates models based on the algorithms in the following table. \nNote \nFor classification and regression models, the decision tree, random forests, logistic regression and linear regression with stochastic gradient descent algorithms are based on scikit-learn. 
\n| Classification models | Regression models | Forecasting models |\n| --- | --- | --- |\n| [Decision trees](https:\/\/scikit-learn.org\/stable\/modules\/tree.html#classification) | [Decision trees](https:\/\/scikit-learn.org\/stable\/modules\/tree.html#regression) | [Prophet](https:\/\/facebook.github.io\/prophet\/docs\/quick_start.html#python-api) |\n| [Random forests](https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) | [Random forests](https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor) | [Auto-ARIMA](https:\/\/pypi.org\/project\/pmdarima\/) (Available in Databricks Runtime 10.3 ML and above.) |\n| [Logistic regression](https:\/\/scikit-learn.org\/stable\/modules\/linear_model.html#logistic-regression) | [Linear regression with stochastic gradient descent](https:\/\/scikit-learn.org\/stable\/modules\/sgd.html#regression) | |\n| [XGBoost](https:\/\/xgboost.readthedocs.io\/en\/latest\/python\/python_api.html#xgboost.XGBClassifier) | [XGBoost](https:\/\/xgboost.readthedocs.io\/en\/latest\/python\/python_api.html#xgboost.XGBRegressor) | |\n| [LightGBM](https:\/\/lightgbm.readthedocs.io\/en\/latest\/pythonapi\/lightgbm.LGBMClassifier.html) | [LightGBM](https:\/\/lightgbm.readthedocs.io\/en\/latest\/pythonapi\/lightgbm.LGBMRegressor.html) | |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### How Databricks AutoML works\n##### Supported data feature types\n\nFeature types not listed below are not supported. For example, images are not supported. 
\nThe following feature types are supported: \n* Numeric (`ByteType`, `ShortType`, `IntegerType`, `LongType`, `FloatType`, and `DoubleType`)\n* Boolean\n* String (categorical or English text)\n* Timestamps (`TimestampType`, `DateType`)\n* ArrayType[Numeric] (Databricks Runtime 10.4 LTS ML and above)\n* DecimalType (Databricks Runtime 11.3 LTS ML and above)\n\n#### How Databricks AutoML works\n##### Split data into train\/validation\/test sets\n\nThere are two methods available for dividing data into training, validation, and test sets: \n(**Default**) **Random split**: If the data split strategy is not specified, the dataset is split into train, validate, and test sets. The ratio is 60% train split, 20% validate split, 20% test split. This division is done randomly behind the scenes. In the case of classification tasks, a stratified random split is used to ensure that each class is adequately represented in the training, validation, and test sets. \n**Chronological split**: In Databricks Runtime 10.4 LTS ML and above, you can specify a time column to use for the train, validate, and test data split for classification and regression problems. If you specify this column, the dataset is split into training, validation, and test sets by time. The earliest points are used for training, the next earliest for validation, and the latest points are used as a test set. The time column can be a timestamp, integer, or string column.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### How Databricks AutoML works\n##### Sampling large datasets\n\nNote \nSampling is not applied to forecasting problems. \nAlthough AutoML distributes hyperparameter tuning trials across the worker nodes of a cluster, each model is trained on a single worker node. 
\nAutoML automatically estimates the memory required to load and train your dataset and samples the dataset if necessary. \nIn Databricks Runtime 9.1 LTS ML through Databricks Runtime 10.4 LTS ML, the sampling fraction does not depend on the cluster\u2019s node type or the amount of memory on each node. \nIn Databricks Runtime 11.x ML: \n* The sampling fraction increases for worker nodes that have more memory per core. You can increase the sample size by choosing a **memory optimized** instance type.\n* You can further increase the sample size by choosing a larger value for `spark.task.cpus` in the Spark configuration for the cluster. The default setting is 1; the maximum value is the number of CPUs on the worker node. When you increase this value, the sample size is larger, but fewer trials run in parallel. For example, in a machine with 4 cores and 64GB total RAM, the default `spark.task.cpus=1` runs 4 trials per worker with each trial limited to 16GB RAM. If you set `spark.task.cpus=4`, each worker runs only one trial but that trial can use 64GB RAM. \nIn Databricks Runtime 12.2 LTS ML and above, AutoML can train on larger datasets by allocating more CPU cores per training task. You can increase the sample size by choosing an instance size with larger total memory. \nIn Databricks Runtime 11.3 LTS ML and above, if AutoML sampled the dataset, the sampling fraction is shown in the **Overview** tab in the UI. \nFor classification problems, AutoML uses the PySpark `sampleBy` [method](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrameStatFunctions.sampleBy.html) for stratified sampling to preserve the target label distribution. 
\nFor regression problems, AutoML uses the PySpark `sample` [method](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.sample.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### How Databricks AutoML works\n##### Imbalanced dataset support for classification problems\n\nIn Databricks Runtime 11.3 LTS ML and above, if AutoML detects that a dataset is imbalanced, it tries to reduce the imbalance of the training dataset by downsampling the major class(es) and adding class weights. AutoML only balances the training dataset and does not balance the test and validation datasets. Doing so ensures that the model performance is always evaluated on the non-enriched dataset with the true input class distribution. \nTo balance an imbalanced training dataset, AutoML uses class weights that are inversely related to the degree by which a given class is downsampled. For example, if a training dataset with 100 samples has 95 samples belonging to class A and 5 samples belonging to class B, AutoML reduces this imbalance by downsampling class A to 70 samples, that is downsampling class A by a ratio of 70\/95 or 0.736, while keeping the number of samples in class B at 5. To ensure that the final model is correctly calibrated and the probability distribution of the model output is the same as that of the input, AutoML scales up the class weight for class A by the ratio 1\/0.736, or 1.358, while keeping the weight of class B as 1. 
AutoML then uses these class weights in model training as a parameter to ensure that the samples from each class are weighted appropriately when training the model.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### How Databricks AutoML works\n##### Semantic type detection\n\nNote \n* Semantic type detection is not applied to forecasting problems.\n* AutoML does not perform semantic type detection for columns that have [custom imputation methods](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html#impute-missing-values) specified. \nWith Databricks Runtime 9.1 LTS ML and above, AutoML tries to detect whether columns have a semantic type that is different from the Spark or pandas data type in the table schema. AutoML treats these columns as the detected semantic type. These detections are best effort and might miss the existence of semantic types in some cases. You can also manually set the semantic type of a column or tell AutoML not to apply semantic type detection to a column [using annotations](https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html#semantic-type-annotations). \nSpecifically, AutoML makes these adjustments: \n* String and integer columns that represent date or timestamp data are treated as a timestamp type.\n* String columns that represent numeric data are treated as a numeric type. \nWith Databricks Runtime 10.1 ML and above, AutoML also makes these adjustments: \n* Numeric columns that contain categorical IDs are treated as a categorical feature.\n* String columns that contain English text are treated as a text feature. \n### Semantic type annotations \nWith Databricks Runtime 10.1 ML and above, you can manually control the assigned semantic type by placing a semantic type annotation on a column. 
To manually annotate the semantic type of column `<column-name>` as `<semantic-type>`, use the following syntax: \n```\nmetadata_dict = df.schema[\"<column-name>\"].metadata\nmetadata_dict[\"spark.contentAnnotation.semanticType\"] = \"<semantic-type>\"\ndf = df.withMetadata(\"<column-name>\", metadata_dict)\n\n``` \n`<semantic-type>` can be one of the following: \n* `categorical`: The column contains categorical values (for example, numerical values that should be treated as IDs).\n* `numeric`: The column contains numeric values (for example, string values that can be parsed into numbers).\n* `datetime`: The column contains timestamp values (string, numerical, or date values that can be converted into timestamps).\n* `text`: The string column contains English text. \nTo disable semantic type detection on a column, use the special keyword annotation `native`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### How Databricks AutoML works\n##### Shapley values (SHAP) for model explainability\n\nNote \nFor Databricks Runtime 11.1 ML and below, SHAP plots are not generated if the dataset contains a `datetime` column. \nThe notebooks produced by AutoML regression and classification runs include code to calculate [Shapley values](https:\/\/shap.readthedocs.io\/en\/latest\/example_notebooks\/overviews\/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html). Shapley values are based on game theory and estimate the importance of each feature to a model\u2019s predictions. \nAutoML notebooks use the [SHAP package](https:\/\/shap.readthedocs.io\/en\/latest\/overviews.html) to calculate Shapley values. Because these calculations are very memory-intensive, they are not performed by default. \nTo calculate and display Shapley values: \n1. Go to the **Feature importance** section in an AutoML generated trial notebook.\n2. 
Set `shap_enabled = True`.\n3. Re-run the notebook.\n\n#### How Databricks AutoML works\n##### Time series aggregation\n\nFor forecasting problems, when there are multiple values for a timestamp in a time series, AutoML uses the average of the values. \nTo use the sum instead, edit the source code notebook. In the **Aggregate data by \u2026** cell, change `.agg(y=(target_col, \"avg\"))` to `.agg(y=(target_col, \"sum\"))`, as shown: \n```\ngroup_cols = [time_col] + id_cols\ndf_aggregation = df_loaded \\\n.groupby(group_cols) \\\n.agg(y=(target_col, \"sum\")) \\\n.reset_index() \\\n.rename(columns={ time_col : \"ds\" })\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### How Databricks AutoML works\n##### Feature Store integration\n\nWith Databricks Runtime 11.3 LTS ML and above, you can use existing feature tables in [Feature Store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html) to augment the original input dataset for your classification and regression problems. \nWith Databricks Runtime 12.2 LTS ML and above, you can use existing feature tables in [Feature Store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html) to augment the original input dataset for all of your AutoML problems: classification, regression, and forecasting. \nTo create a feature table, see [What is a feature store?](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html). \nTo use existing feature tables, you can [select feature tables with the AutoML UI](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html#feature-store) or set the `feature_store_lookups` parameter in your [AutoML run specification](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html). 
\n```\nfeature_store_lookups = [\n{\n\"table_name\": \"example.trip_pickup_features\",\n\"lookup_key\": [\"pickup_zip\", \"rounded_pickup_datetime\"],\n},\n{\n\"table_name\": \"example.trip_dropoff_features\",\n\"lookup_key\": [\"dropoff_zip\", \"rounded_dropoff_datetime\"],\n}\n]\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### How Databricks AutoML works\n##### Trial notebook generation\n\nFor forecasting experiments, AutoML generated notebooks are automatically imported to your workspace for all trials of your experiment. \nFor classification and regression experiments, AutoML generated notebooks for data exploration and the best trial in your experiment are automatically imported to your workspace. Generated notebooks for other experiment trials are saved as MLflow artifacts on DBFS, instead of auto-imported into your workspace. For all trials besides the best trial, the `notebook_path` and `notebook_url` in the `TrialInfo` Python API are not set. If you need to use these notebooks, you can manually import them into your workspace with the AutoML experiment UI or the `databricks.automl.import_notebook` [Python API](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html#import-notebook). \nIf you only use the data exploration notebook or best trial notebook generated by AutoML, the **Source** column in the AutoML experiment UI contains the link to the generated notebook for the best trial. \nIf you use other generated notebooks in the AutoML experiment UI, these are not automatically imported into the workspace. You can find the notebooks by clicking into each MLflow run. The IPython notebook is saved in the **Artifacts** section of the run page. 
You can download this notebook and import it into the workspace, if downloading artifacts is enabled by your workspace administrators.\n\n#### How Databricks AutoML works\n##### Notebook example: AutoML experiment with Feature Store\n\nThe following notebook shows how to train an ML model with AutoML and Feature Store feature tables. \n### AutoML experiment with Feature Store example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/automl-feature-store-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html"} +{"content":"# Databricks administration introduction\n## Manage users\n### service principals\n#### and groups\n##### Manage service principals\n####### Roles for managing service principals\n\nThis article describes how to manage roles on service principals in your Databricks account. \nA service principal is an identity that you create in Databricks for use with automated tools, jobs, and applications. Service principals give automated tools and scripts API-only access to Databricks resources, providing greater security than using users or groups. \nYou can grant Databricks users, service principals, and account groups access to use a service principal. This allows users to run jobs as the service principal, instead of as their identity. This prevents jobs from failing if a user leaves your organization or a group is modified. 
\nFor an overview of service principals see [Manage service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/service-principal-acl.html"} +{"content":"# Databricks administration introduction\n## Manage users\n### service principals\n#### and groups\n##### Manage service principals\n####### Roles for managing service principals\n######## Service principal roles\n\nService principal roles are account-level roles. This means that they only need to be defined once, in your account, and apply across all workspaces. There are two roles that you can grant on a service principal: **Service Principal Manager** and **Service Principal User**. \n* **Service Principal Manager** allows you to manage roles on a service principal. The creator of a service principal has the **Service Principal Manager** role on the service principal. Account admins have the **Service Principal Manager** role on all service principals in an account. \nNote \nIf a service principal was created before June 13, 2023, the creator of the service principal does not have the **Service Principal Manager** role by default. If you need to be a manager, ask an account admin to grant you the **Service Principal Manager** role. \n* **Service Principal User** allows workspace users to run jobs as the service principal. The job will run with the identity of the service principal, instead of the identity of the job owner. The **Service Principal User** role also allows workspace admins to create tokens on behalf of the service principal. \nNote \nWhen the `RestrictWorkspaceAdmins` setting on a workspace is set to `ALLOW ALL`, workspace admins can create a personal access token on behalf of any service principal in their workspace. 
To enforce the **Service Principal User** role for workspace admins to create a personal access token for a service principal, see [Restrict workspace admins](https:\/\/docs.databricks.com\/admin\/workspace-settings\/restrict-workspace-admins.html). \nUsers with the **Service Principal Manager** role do not inherit the **Service Principal User** role. If you want to use the service principal to execute jobs, you need to explicitly assign yourself the service principal user role, even after creating the service principal.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/service-principal-acl.html"} +{"content":"# Databricks administration introduction\n## Manage users\n### service principals\n#### and groups\n##### Manage service principals\n####### Roles for managing service principals\n######## Manage service principal roles using the account console\n\nAccount admins can manage service principals roles using the account console. \n### View roles on a service principal \n1. As an account admin, log in to the [account console](https:\/\/accounts.cloud.databricks.com).\n2. In the sidebar, click **User management**.\n3. On the **Service principals** tab, find and click the name.\n4. Click the **Permissions** tab. \nYou can see the list of principals and the roles that they are granted on the service principal. You can also use the search bar to search for a specific principal or role. \n### Grant roles on a service principal \n1. As an account admin, log in to the [account console](https:\/\/accounts.cloud.databricks.com).\n2. In the sidebar, click **User management**.\n3. On the **Service principals** tab, find and click the name.\n4. Click the **Permissions** tab.\n5. Click **Grant access**.\n6. Search for and select the user, service principal, or group and choose the role or roles (**Service principal: Manager** or **Service principal: User**) to assign. 
\nNote \nUsers with the **Service Principal Manager** role do not inherit the **Service Principal User** role. If you want the user to use the service principal to execute jobs, you will need to explicitly assign the **Service Principal User** role.\n7. Click **Save**. \n### Revoke roles on a service principal \n1. As an account admin, log in to the [account console](https:\/\/accounts.cloud.databricks.com).\n2. In the sidebar, click **User management**.\n3. On the **Service principals** tab, find and click the name.\n4. Click the **Permissions** tab.\n5. Search for the user, service principal, or group to edit their roles.\n6. On the row with the principal, click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) and then select **Edit**. Alternatively, select **Delete** to revoke all of the roles for the principal.\n7. Click **Edit**.\n8. Click the **X** next to the roles that you want to revoke.\n9. Click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/service-principal-acl.html"} +{"content":"# Databricks administration introduction\n## Manage users\n### service principals\n#### and groups\n##### Manage service principals\n####### Roles for managing service principals\n######## Manage service principal roles using the workspace admin settings page\n\nWorkspace admins can manage service principals roles for service principals that they have the **Service Principal Manager** role on using the admin settings page. \n### View roles on a service principal \n1. As a workspace admin, log in to the Databricks workspace.\n2. Click your username in the top bar of the Databricks workspace and select **Settings**.\n3. Click on the **Identity and access** tab.\n4. Next to **Service principals**, click **Manage**.\n5. Find and click the name.\n6. Click the **Permissions** tab. \nYou can see the list of principals and the roles that they are granted on the service principal. 
You can also use the search bar to search for a specific principal or role. \n### Grant roles on a service principal \nYou must have the **Service Principal Manager** role on a service principal in order to grant roles. \n1. As a workspace admin, log in to the Databricks workspace.\n2. Click your username in the top bar of the Databricks workspace and select **Settings**.\n3. Click on the **Identity and access** tab.\n4. Next to **Service principals**, click **Manage**.\n5. Find and click the name.\n6. Click the **Permissions** tab.\n7. Click **Grant access**.\n8. Search for and select the user, service principal, or group and choose the role or roles (**Service principal: Manager** or **Service principal: User**) to assign. \nNote \nRoles can be granted to any account-level user, service principal, or group, even if they are not a member of the workspace. Roles cannot be granted to workspace-local groups. \nUsers with the **Service Principal Manager** role do not inherit the **Service Principal User** role. If you want the user to use the service principal to execute jobs, you will need to explicitly assign the **Service Principal User** role.\n9. Click **Save**. \n### Revoke roles on a service principal \nYou must have the **Service Principal Manager** role on a service principal in order to revoke roles. \n1. As a workspace admin, log in to the Databricks workspace.\n2. Click your username in the top bar of the Databricks workspace and select **Settings**.\n3. Click on the **Identity and access** tab.\n4. Next to **Service principals**, click **Manage**.\n5. Find and click the name.\n6. Click the **Permissions** tab.\n7. Search for the user, service principal, or group to edit their roles.\n8. On the row with the principal, click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) and then select **Edit**. Alternatively, select **Delete** to revoke all of the roles for the principal.\n9. Click **Edit**.\n10. 
Click the **X** next to the roles that you want to revoke.\n11. Click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/service-principal-acl.html"} +{"content":"# Databricks administration introduction\n## Manage users\n### service principals\n#### and groups\n##### Manage service principals\n####### Roles for managing service principals\n######## Manage service principal roles using the Databricks CLI\n\nYou must have the **Service Principal Manager** role to manage roles on a service principal. You can use the Databricks CLI to manage roles. For information on installing and authenticating to the Databricks CLI, see [What is the Databricks CLI?](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). \nYou can also manage service principal roles using the [Accounts Access Control API](https:\/\/docs.databricks.com\/api\/account\/accountaccesscontrol). The Accounts Access Control API is supported through the Databricks account and workspaces. \nAccount admins call the API on accounts.cloud.databricks.com (`{account-domain}\/api\/2.0\/preview\/accounts\/{account_id}\/access-control`). \nUsers with the **Service Principal Manager** role who are not account admins call the API on the workspace domain (`{workspace-domain}\/api\/2.0\/preview\/accounts\/access-control\/`). \n### Grant roles on a service principal using the Databricks CLI \nThe Accounts Access Control API and CLI use an `etag` field to ensure consistency. To grant or revoke service principal roles through the API, first issue a `GET` rule set command and receive an `etag` in response. You can then apply changes locally, and finally issue a `PUT` rule set with the `etag`. 
\nFor example, issue a `GET` rule set on the service principal that you want to grant access to by running the following command: \n```\ndatabricks account access-control get-rule-set accounts\/<account-id>\/servicePrincipals\/<application-id>\/ruleSets\/default <etag>\n\n``` \nReplace: \n* `<account-id>` with the account ID.\n* `<application-id>` with the service principal application ID.\n* `<etag>` with \u201c\u201d \nExample response: \n```\n{\n\"etag\":\"<etag>\",\n\"grant_rules\": [\n{\n\"principals\": [\n\"users\/user@example.com\"\n],\n\"role\":\"roles\/servicePrincipal.manager\"\n},\n{\n\"principals\": [\n\"users\/user@example.com\"\n],\n\"role\":\"roles\/servicePrincipal.user\"\n}\n],\n\"name\":\"<name>\"\n}\n\n``` \nCopy the `etag` field from the response body for later use. \nThen, you can make updates locally when you decide on the final state of the rules and then update the rule set using the etag. To grant the **Service principal: User** role to the user `user2@example.com`, run the following: \n```\ndatabricks account access-control update-rule-set --json '{\n\"name\": \"accounts\/<account-id>\/servicePrincipals\/<application-id>\/ruleSets\/default\",\n\"rule_set\": {\n\"name\": \"accounts\/<account-id>\/servicePrincipals\/<application-id>\/ruleSets\/default\",\n\"grant_rules\": [\n{\n\"role\": \"roles\/servicePrincipal.user\",\n\"principals\": [\"users\/user2@example.com\"]\n}\n],\n\"etag\": \"<etag>\"\n}\n}'\n\n``` \nReplace: \n* `<account-id>` with the account ID.\n* `<application-id>` with the service principal application ID.\n* `<etag>` with the etag that you copied from the last response. 
\nExample response: \n```\n{\n\"etag\":\"<new-etag>\",\n\"grant_rules\": [\n{\n\"principals\": [\n\"users\/user2@example.com\"\n],\n\"role\":\"roles\/servicePrincipal.user\"\n}\n],\n\"name\":\"accounts\/<account-id>\/servicePrincipals\/<application-id>\/ruleSets\/default\"\n}\n\n``` \nImportant \nBecause this is a `PUT` method, all existing roles are overwritten. To keep any existing roles, you must add them to the `grant_rules` array. \n### List the service principals that you can use \nUsing the Workspace Service Principals API, you can list the service principals that you have the user role on by filtering on `servicePrincipal\/use`. \nTo list the service principals that you have the **Service Principal User** role on, run the following command: \n```\ndatabricks service-principals list -p WORKSPACE --filter \"permission eq 'servicePrincipal\/use'\"\n\n``` \nYou can also list service principals using the [Workspace Service Principals API](https:\/\/docs.databricks.com\/api\/workspace\/serviceprincipals\/list).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/service-principal-acl.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Tutorials: Implement ETL workflows with Delta Live Tables\n\nDelta Live Tables provides a simple declarative approach to build ETL and machine learning pipelines on batch or streaming data, while automating operational complexities such as infrastructure management, task orchestration, error handling and recovery, and performance optimization. 
You can use the following tutorials to get started with Delta Live Tables, perform common data transformation tasks, and implement more advanced data processing workflows.\n\n#### Tutorials: Implement ETL workflows with Delta Live Tables\n##### Create your first pipeline with Delta Live Tables\n\nTo help you learn about the features of the Delta Live Tables framework and how to implement pipelines, this tutorial walks you through creating and running your first pipeline. The tutorial includes an end-to-end example of a pipeline that ingests data, cleans and prepares the data, and performs transformations on the prepared data. See [Tutorial: Run your first Delta Live Tables pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/dlt-tutorials.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Tutorials: Implement ETL workflows with Delta Live Tables\n##### Programmatically create multiple tables with Python\n\nNote \nPatterns shown in this article cannot be easily completed with only SQL. Because Python datasets can be defined against any query that returns a DataFrame, you can use `spark.sql()` as necessary to use SQL syntax within Python functions. \nYou can use Python user-defined functions (UDFs) in your SQL queries, but you must define these UDFs in Python files in the same pipeline before calling them in SQL source files. See [User-defined scalar functions - Python](https:\/\/docs.databricks.com\/udf\/python.html). \nMany workflows require the implementation of multiple data processing flows or dataset definitions that are identical or differ by only a few parameters. This redundancy can result in pipelines that are error-prone and difficult to maintain. To address this redundancy, you can use a metaprogramming pattern with Python. 
For an example demonstrating how to use this pattern to call a function invoked multiple times to create different tables, see [Programmatically create multiple tables](https:\/\/docs.databricks.com\/delta-live-tables\/create-multiple-tables.html).\n\n#### Tutorials: Implement ETL workflows with Delta Live Tables\n##### Include a Delta Live Tables pipeline in a Databricks workflow\n\nIn addition to creating end-to-end data processing workflows with Delta Live Tables, you can also use a Delta Live Tables pipeline as a task in a workflow that implements complex data processing and analysis tasks. The tutorial in [Use Databricks SQL in a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html) walks through creating an end-to-end Databricks workflow that includes a Delta Live Tables pipeline to prepare data for analysis and visualization with Databricks SQL.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/dlt-tutorials.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Read and write data in Delta Live Tables pipelines\n\nYou can make the data created in your Delta Live Tables pipelines available for discovery and querying outside of Delta Live Tables by publishing the datasets to external data governance systems. You can also use data managed by external data governance systems as source data for your pipelines. This article is an introduction to supported data governance solutions and has links to articles that provide more information for using these solutions with Delta Live Tables. \nAll tables and views created in Delta Live Tables are local to the pipeline by default. To make output datasets available outside the pipeline, you must *publish* the datasets. To persist output data from a pipeline and make it discoverable and available to query, Delta Live Tables supports Unity Catalog and the Hive metastore. 
You can also use data stored in Unity Catalog or the Hive metastore as source data for Delta Live Tables pipelines. \nThe articles in this section detail how to use data governance solutions to read and write data with your pipelines.\n\n#### Read and write data in Delta Live Tables pipelines\n##### Use Unity Catalog to read and write data with Delta Live Tables pipelines (Public Preview)\n\nUnity Catalog is the data governance solution for the Databricks Platform and is the recommended way to manage the output datasets from Delta Live Tables pipelines. You can also use Unity Catalog as a data source for pipelines. To learn how to use Unity Catalog with your pipelines, see [Use Unity Catalog with your Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html).\n\n#### Read and write data in Delta Live Tables pipelines\n##### Publish data to the Hive metastore from Delta Live Tables pipelines\n\nYou can also use the Hive metastore to read source data into a Delta Live Tables pipeline and publish output data from a pipeline to make it available to external systems. See [Publish data from Delta Live Tables to the Hive metastore](https:\/\/docs.databricks.com\/delta-live-tables\/publish.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/publish-data.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Compare Auto Loader file detection modes\n\nAuto Loader supports two modes for detecting new files: directory listing and file notification. You can switch file discovery modes across stream restarts and still obtain exactly-once data processing guarantees.\n\n#### Compare Auto Loader file detection modes\n##### Directory listing mode\n\nIn directory listing mode, Auto Loader identifies new files by listing the input directory. 
Directory listing mode allows you to quickly start Auto Loader streams without any permission configurations other than access to your data on cloud storage. \nIn Databricks Runtime 9.1 and above, Auto Loader can automatically detect whether files are arriving in your cloud storage in lexical order and significantly reduce the number of API calls needed to detect new files. See [What is Auto Loader directory listing mode?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/directory-listing-mode.html) for more details.\n\n#### Compare Auto Loader file detection modes\n##### File notification mode\n\nFile notification mode leverages file notification and queue services in your cloud infrastructure account. Auto Loader can automatically set up a notification service and queue service that subscribe to file events from the input directory. \nFile notification mode is more performant and scalable for large input directories or a high volume of files but requires additional cloud permissions to set up. For more information, see [What is Auto Loader file notification mode?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-detection-modes.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Compare Auto Loader file detection modes\n##### Cloud storage supported by modes\n\nThe availability of these modes is listed below. \nIf you migrate from an external location or a DBFS mount to a Unity Catalog volume, Auto Loader continues to provide exactly-once guarantees. 
\n| Cloud Storage | Directory listing | File notifications |\n| --- | --- | --- |\n| AWS S3 | All versions | All versions |\n| ADLS Gen2 | All versions | All versions |\n| GCS | All versions | Databricks Runtime 9.1 and above |\n| Azure Blob Storage | All versions | All versions |\n| ADLS Gen1 | All versions | Unsupported |\n| DBFS | All versions | For mount points only |\n| Unity Catalog volume | Databricks Runtime 13.3 LTS and above | Unsupported |\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-detection-modes.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/dbt.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to dbt Core\n\nNote \nThis article covers dbt Core, a version of dbt for your local development machine that interacts with Databricks SQL warehouses and Databricks clusters within your Databricks workspaces. To use the hosted version of dbt (called *dbt Cloud*) instead, or to use Partner Connect to quickly create a SQL warehouse within your workspace and then connect it to dbt Cloud, see [Connect to dbt Cloud](https:\/\/docs.databricks.com\/partners\/prep\/dbt-cloud.html). \ndbt (data build tool) is a development environment that enables data analysts and data engineers to transform data by simply writing select statements. dbt handles turning these select statements into tables and views. dbt compiles your code into raw SQL and then runs that code on the specified database in Databricks. dbt supports collaborative coding patterns and best practices such as version control, documentation, modularity, and more. \ndbt does not extract or load data. dbt focuses on the transformation step only, using a \u201ctransform after load\u201d architecture. dbt assumes that you already have a copy of your data in your database. \nThis article focuses on using dbt Core. 
dbt Core enables you to write dbt code in the text editor or IDE of your choice on your local development machine and then run dbt from the command line. dbt Core includes the dbt Command Line Interface (CLI). The [dbt CLI](https:\/\/docs.getdbt.com\/dbt-cli\/cli-overview) is free to use and [open source](https:\/\/github.com\/dbt-labs\/dbt). \nA hosted version of dbt called dbt Cloud is also available. dbt Cloud comes equipped with turnkey support for scheduling jobs, CI\/CD, serving documentation, monitoring and alerting, and an integrated development environment (IDE). For more information, see [Connect to dbt Cloud](https:\/\/docs.databricks.com\/partners\/prep\/dbt-cloud.html). The dbt Cloud Developer plan provides one free developer seat; Team and Enterprise paid plans are also available. For more information, see [dbt Pricing](https:\/\/www.getdbt.com\/pricing\/) on the dbt website. \nBecause dbt Core and dbt Cloud can use hosted git repositories (for example, on GitHub, GitLab or BitBucket), you can use dbt Core to create a dbt project and then make it available to your dbt Cloud users. For more information, see [Creating a dbt project](https:\/\/docs.getdbt.com\/docs\/building-a-dbt-project\/projects#creating-a-dbt-project) and [Using an existing project](https:\/\/docs.getdbt.com\/docs\/building-a-dbt-project\/projects#using-an-existing-project) on the dbt website. 
\nFor a general overview of dbt, watch the following YouTube video (26 minutes).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/dbt.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to dbt Core\n##### Requirements\n\nBefore you install dbt Core, you must install the following on your local development machine: \n* [Python](https:\/\/www.python.org\/downloads\/) 3.7 or higher\n* A utility for creating Python virtual environments (such as [pipenv](https:\/\/docs.python-guide.org\/dev\/virtualenvs\/)) \nYou also need one of the following to authenticate: \n* (Recommended) dbt Core enabled as an OAuth application in your account. This is enabled by default. \n(Optional) To use a custom IdP for dbt login, see [SSO in your Databricks account console](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/single-sign-on\/index.html).\n* A personal access token \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use OAuth tokens. \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/dbt.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to dbt Core\n##### Step 1: Create and activate a Python virtual environment\n\nIn this step, you use `pipenv` to create a *[Python virtual environment](https:\/\/realpython.com\/python-virtual-environments-a-primer\/)*. 
We recommend using a Python virtual environment as it isolates package versions and code dependencies to that specific environment, regardless of the package versions and code dependencies within other environments. This helps reduce unexpected package version mismatches and code dependency collisions. \n1. From your terminal, switch to an empty directory, creating that directory first if necessary. This procedure creates an empty directory named `dbt_demo` in the root of your user home directory. \n```\nmkdir ~\/dbt_demo\ncd ~\/dbt_demo\n\n``` \n```\nmkdir %USERPROFILE%\\dbt_demo\ncd %USERPROFILE%\\dbt_demo\n\n```\n2. In this empty directory, create a file named `Pipfile` with the following content. This *[Pipfile](https:\/\/realpython.com\/pipenv-guide\/#the-pipfile)* instructs `pipenv` to use Python version 3.8.6. If you use a different version, replace `3.8.6` with your version number. \n```\n[[source]]\nurl = \"https:\/\/pypi.org\/simple\"\nverify_ssl = true\nname = \"pypi\"\n\n[packages]\ndbt-databricks = \"*\"\n\n[requires]\npython_version = \"3.8.6\"\n\n``` \nNote \nThe preceding line `dbt-databricks = \"*\"` instructs `pipenv` to use the latest version of the `dbt-databricks` package. In production scenarios, you should replace `*` with the specific version of the package that you want to use. Databricks recommends version 1.6.0 or greater of the dbt-databricks package. See [dbt-databricks Release history](https:\/\/pypi.org\/project\/dbt-databricks\/#history) on the Python Package Index (PyPI) website.\n3. Create a Python virtual environment in this directory by running `pipenv` and specifying the Python version to use. This command specifies Python version 3.8.6. If you use a different version, replace `3.8.6` with your version number: \n```\npipenv --python 3.8.6\n\n```\n4. Install the dbt Databricks adapter by running `pipenv` with the `install` option. 
This installs the packages in your `Pipfile`, which includes the dbt Databricks adapter package, `dbt-databricks`, from PyPI. The dbt Databricks adapter package automatically installs dbt Core and other dependencies. \nImportant \nIf your local development machine uses any of the following operating systems, you must complete additional steps first: CentOS, MacOS, Ubuntu, Debian, and Windows. See the \u201cDoes my operating system have prerequisites\u201d section of [Use pip to install dbt](https:\/\/docs.getdbt.com\/dbt-cli\/install\/pip) on the dbt Labs website. \n```\npipenv install\n\n```\n5. Activate this virtual environment by running `pipenv shell`. To confirm the activation, the terminal displays `(dbt_demo)` before the prompt. The virtual environment begins using the specified version of Python and isolates all package versions and code dependencies within this new environment. \n```\npipenv shell\n\n``` \nNote \nTo deactivate this virtual environment, run `exit`. `(dbt_demo)` disappears from before the prompt. If you run `python --version` or `pip list` with this virtual environment deactivated, you might see a different version of Python, a different list of available packages or package versions, or both.\n6. Confirm that your virtual environment is running the expected version of Python by running `python` with the `--version` option. \n```\npython --version\n\n``` \nIf an unexpected version of Python displays, make sure you have activated your virtual environment by running `pipenv shell`.\n7. Confirm that your virtual environment is running the expected versions of dbt and the dbt Databricks adapter by running `dbt` with the `--version` option. \n```\ndbt --version\n\n``` \nIf an unexpected version of dbt or the dbt Databricks adapter displays, make sure you have activated your virtual environment by running `pipenv shell`. 
If an unexpected version still displays, try installing dbt or the dbt Databricks adapter again after you activate your virtual environment.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/dbt.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to dbt Core\n##### Step 2: Create a dbt project and specify and test connection settings\n\nIn this step, you create a dbt *project*, which is a collection of related directories and files that are required to use dbt. You then configure your connection *profiles*, which contain connection settings to a Databricks [cluster](https:\/\/docs.databricks.com\/compute\/configure.html), a [SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html), or both. To increase security, dbt projects and profiles are stored in separate locations by default. \nTip \nYou can connect to an existing cluster or SQL warehouse, or you can create a new one. \n* An existing cluster or SQL warehouse can be efficient for multiple dbt projects, for using dbt in a team, or for development use cases.\n* A new cluster or SQL warehouse allows you to run a single dbt project in isolation for production use cases, as well as leverage automatic termination to save costs when that dbt project is not running. \nUse Databricks to create a new cluster or SQL warehouse, and then reference the newly-created or existing cluster or SQL warehouse from your dbt profile. \n1. With the virtual environment still activated, run the [dbt init](https:\/\/docs.getdbt.com\/reference\/commands\/init) command with a name for your project. This procedure creates a project named `my_dbt_demo`. \n```\ndbt init my_dbt_demo\n\n```\n2. When you are prompted to choose a `databricks` or `spark` database, enter the number that corresponds to `databricks`.\n3. 
When prompted for a `host` value, do the following: \n* For a cluster, enter the **Server Hostname** value from the [Advanced Options, JDBC\/ODBC](https:\/\/docs.databricks.com\/integrations\/compute-details.html) tab for your Databricks cluster.\n* For a SQL warehouse, enter the **Server Hostname** value from the [Connection Details](https:\/\/docs.databricks.com\/integrations\/compute-details.html) tab for your SQL warehouse.\n4. When prompted for an `http_path` value, do the following: \n* For a cluster, enter the **HTTP Path** value from the [Advanced Options, JDBC\/ODBC](https:\/\/docs.databricks.com\/integrations\/compute-details.html) tab for your Databricks cluster.\n* For a SQL warehouse, enter the **HTTP Path** value from the [Connection Details](https:\/\/docs.databricks.com\/integrations\/compute-details.html) tab for your SQL warehouse.\n5. To choose an authentication type, enter the number that corresponds with `use oauth` (recommended) or `use access token`.\n6. If you chose `use access token` for your authentication type, enter the value of your Databricks [personal access token](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement). \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n7. When prompted for the `desired Unity Catalog option` value, enter the number that corresponds with `use Unity Catalog` or `not use Unity Catalog`.\n8. 
If you chose to use Unity Catalog, enter the desired value for `catalog` when prompted.\n9. Enter the desired values for `schema` and `threads` when prompted.\n10. dbt writes your entries to a `profiles.yml` file. The location of this file is listed in the output of the `dbt init` command. You can also list this location later by running the `dbt debug --config-dir` command. You can open this file now to examine and verify its contents. \nIf you chose `use oauth` for your authentication type, add your machine-to-machine (M2M) or user-to-machine (U2M) authentication profile to `profiles.yml`. \nThe following is an example `profiles.yml` file with the profile `aws-oauth-u2m` specified. Specifying `aws-oauth-u2m` for `target` sets the U2M profile as the default run profile used by dbt. \n```\nmy_dbt_demo:\n  outputs:\n    aws-oauth-u2m:\n      catalog: uc_demos\n      host: \"xxx.cloud.databricks.com\"\n      http_path: \"\/sql\/1.0\/warehouses\/9196548d010cf14d\"\n      schema: databricks_demo\n      threads: 1\n      type: databricks\n      auth_type: oauth\n  target: aws-oauth-u2m\n\n``` \nDatabricks does not recommend specifying secrets in `profiles.yml` directly. Instead, set the client ID and client secret as environment variables.\n11. Confirm that the connection details are correct by changing to the `my_dbt_demo` directory and running the `dbt debug` command. \nIf you chose `use oauth` for your authentication type, you\u2019re prompted to sign in with your identity provider. \nImportant \nBefore you begin, verify that your cluster or SQL warehouse is running. 
\nYou should see output similar to the following: \n```\ncd my_dbt_demo\ndbt debug\n\n``` \n```\n...\nConfiguration:\nprofiles.yml file [OK found and valid]\ndbt_project.yml file [OK found and valid]\n\nRequired dependencies:\n- git [OK found]\n\nConnection:\n...\nConnection test: OK connection ok\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/dbt.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to dbt Core\n##### Next steps\n\n* Create, run, and test dbt Core models locally. See the [dbt Core tutorial](https:\/\/docs.databricks.com\/integrations\/dbt-core-tutorial.html).\n* Run dbt Core projects as Databricks job tasks. See [Use dbt transformations in a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbt-in-workflows.html).\n\n#### Connect to dbt Core\n##### Additional resources\n\n* [What, exactly, is dbt?](https:\/\/www.getdbt.com\/blog\/what-exactly-is-dbt)\n* [Analytics Engineering for Everyone: Databricks in dbt Cloud](https:\/\/blog.getdbt.com\/analytics-engineering-for-everyone-databricks-in-dbt-cloud\/) on the dbt website.\n* [dbt Getting Started tutorial](https:\/\/docs.getdbt.com\/tutorial\/setting-up)\n* [dbt documentation](https:\/\/docs.getdbt.com\/docs)\n* [dbt CLI documentation](https:\/\/docs.getdbt.com\/dbt-cli\/cli-overview)\n* [dbt + Databricks Demo](https:\/\/github.com\/dbt-labs\/dbt-databricks-demo)\n* [dbt Discourse community](https:\/\/discourse.getdbt.com\/)\n* [dbt blog](https:\/\/blog.getdbt.com\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/dbt.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Use scikit-learn on Databricks\n##### Track scikit-learn model training with MLflow\n\nThis notebook is based on the MLflow [scikit-learn diabetes tutorial](https:\/\/github.com\/mlflow\/mlflow\/tree\/master\/examples\/sklearn_elasticnet_diabetes). 
\nThe notebook shows how to use MLflow to track the model training process, including logging model parameters, metrics, the model itself, and other artifacts like plots to a Databricks hosted tracking server. It also includes instructions for viewing the logged results in the MLflow tracking UI. \nThe following guides describe deployment options for your trained model: \n* Deploy your model using [Model serving with Databricks](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html)\n* [scikit-learn model deployment on SageMaker](https:\/\/docs.databricks.com\/mlflow\/scikit-learn-model-deployment-on-sagemaker.html)\n\n##### Track scikit-learn model training with MLflow\n###### MLflow scikit-learn model training notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-quick-start-training.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/tracking-ex-scikit.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Chart options\n\nThe following visualization types are defined as **charts** within Databricks: \n* Bar\n* Line\n* Area\n* Pie\n* Histogram\n* Heatmap\n* Scatter\n* Bubble\n* Box\n* Combo \nThese charts share a similar set of configuration options. Below is the list of configuration options available for chart types. Not every configuration option is available for every chart type. To see which configuration options are available for your desired chart type, see [Configuration options by chart type](https:\/\/docs.databricks.com\/visualizations\/charts.html#options_by_type). 
\nFor examples of chart visualizations, see [Visualizations](https:\/\/docs.databricks.com\/visualizations\/visualization-types.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/charts.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Chart options\n###### General\n\nTo configure general options, click **General** and configure each of the following required settings: \n* **X column**: A field to be displayed on the x axis.\n* **Y column**: Fields to be displayed on the y axis. Fields can optionally be aggregated by `SUM`, `COUNT`, `COUNT DISTINCT`, `AVERAGE`, `MEDIAN`, `MIN`, `MAX`, `STANDARD DEVIATION`, `VARIANCE`.\n* **Disable aggregations**: If you\u2019ve written an aggregation in your SQL query and do not want to apply an aggregation on your Y column, select the **Disable aggregations** button under the y axis kebab. This will ensure no aggregation is applied, and ensure any sort order in your query is maintained in the visualization.\n* **Group by**: Also known as color in some tools, select a dimension by which to group all other values. Each group will appear in the legend as a separate entry, and the chart will display each group in its own color.\n* **Error column**: Displays an error boundary for upper and lower ranges of values.\n* **Legend placement**: Options are automatic, flexible, right, bottom, or hidden.\n* **Legend items order**: Options are normal or reversed.\n* **Stacking**: Options are stacked or grouped. When stacked, values belonging to multiple series will be layered on top of each other to show cumulative values. 
Grouped values will show series placed side by side.\n* **Normalize values to percentage**: When multiple series are used, each series will show a percentage of the total value of the underlying grouping, rather than a raw number.\n* **Missing and NULL values**: If a value is missing or null, you can choose to either convert the value to 0 and display it on the chart, or hide the value.\n* **Horizontal chart**: Swaps the X and Y axes on the chart.\n* **Number of bins (histogram only)**: Defines the number of bins for a histogram.\n* **Color column (heatmap only)**: Defines the color scale to use for the heatmap.\n* **Bubble configurations (bubble only)**: Defines the size of each bubble on a bubble chart. \n### X and Y axis tabs \nTo configure formatting options for the X axis or Y axis, click **X axis** or **Y axis** and configure the following optional settings: \n* **Scale Type**: Options are categorical, linear, or logarithmic. Categorical scale types should be selected when each value belongs to a discrete category, for example, a geographical region. Linear or logarithmic should be chosen when each value is continuous, for example, temperatures. By default, a scale type will be chosen based on the data type of the field selected on that axis.\n* **Axis rename**: Enter text to rename the axis.\n* **Sort X axis values**: Enables sorting by the X axis field values in alphabetical order.\n* **Show \/ hide labels**: Shows \/ hides the axis labels.\n* **Show \/ hide axis**: Shows \/ hides the axis lines.\n* **Custom Y axis range**: Defines a custom range for the Y axis. 
\n### Series \nTo configure series options, click **Options** and configure the following optional settings: \n* **Series order**: Allows you to reorder series by clicking and dragging.\n* **Series label**: Enter text to rename the series.\n* **Y axis assignment**: Specifies if the particular series values should be assigned to the left or right axis.\n* **Series type**: Specifies if the series should be displayed as a bar or line. \n### Colors \nTo configure colors, click **Colors** and optionally override automatic colors and configure custom colors. \n* **Custom colors**: Allows users to choose custom colors for each series. Defaults to **automatic**, in which case workspace theme colors are chosen.\n* **Predefined color scheme(s)**: Certain chart types have predefined color schemes which must be chosen from. \n### Data labels \nTo configure labels for each data point in the visualization, click **Data labels** and configure the following optional settings: \n* **Show data labels**: Show data labels.\n* **Number values format**: Formats any number values on the data label and tooltips.\n* **Percent values format**: Formats any percentage values on the data label and tooltips.\n* **Date values format**: Formats any date\/time values on the data label and tooltips.\n* **Data labels**: Allows you to customize what is shown on the data labels and tooltips.\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/charts.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Chart options\n###### Configuration options by chart type\n\n| | | Bar | Line | Area | Pie | Histogram | Heatmap | Scatter | Bubble | Box | Combo |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| General | | | | | | | | | | | |\n| | X column | x | x | x | x | x | x | x | x | x | x |\n| | Y column | x | x | x | x | | x | x | x | x | x |\n| | Disable aggregations | x | x | x | x | | 
| | | | x |\n| | Group by | x | x | x | x | | | x | x | x | |\n| | Error column | x | x | x | | | | x | x | | |\n| | Legend placement | x | x | x | x | | | | | | x |\n| | Legend items order | x | x | x | x | | | | | | x |\n| | Stacking | x | | x | | | | | | | |\n| | Normalize values to percentages | | x | x | | | | | | | |\n| | Missing and NULL values | x | x | x | x | | | | | x | x |\n| | Horizontal chart | x | x | | | | | | | | x |\n| | Number of bins | | | | | x | | | | | |\n| | Color column | | | | | | x | | | | |\n| | Bubble configurations | | | | | | | | x | | |\n| X and Y axis | | | | | | | | | | | |\n| | Scale Type | x | x | x | | x | x | x | x | x | x |\n| | Axis Rename | x | x | x | | x | x | x | x | x | x |\n| | Sort X axis values | x | x | x | | | x | x | x | x | x |\n| | Show \/ hide labels | x | x | x | | x | x | x | x | x | x |\n| | Show \/ hide axis | x | x | x | | x | x | x | x | x | x |\n| | Custom Y axis range | x | x | x | | x | x | x | x | x | x |\n| Series | | | | | | | | | | | |\n| | Series order | x | x | x | x | | | | x | x | x |\n| | Series label | x | x | x | x | | | x | x | x | x |\n| | Y axis assignment | x | x | x | | | | x | x | x | x |\n| | Series type | x | x | | | | | | | | x |\n| Colors | | | | | | | | | | | |\n| | Custom colors | x | x | x | x | x | | x | x | x | x |\n| | Predefined color scheme(s) | | | | | | x | | | | |\n| Data labels | | | | | | | | | | | |\n| | Show data labels | x | x | x | x | x | x | x | x | x | x |\n| | Number values format | x | x | x | x | x | x | x | x | x | x |\n| | Percent values format | x | x | x | x | x | x | x | x | x | x |\n| | Date values format | x | x | x | x | x | x | x | x | x | x |\n| | Data labels | x | x | x | x | x | x | x | x | x | x |\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/charts.html"} +{"content":"# AI and Machine Learning on Databricks\n### GraphFrames\n\nGraphFrames is a package for Apache Spark that provides DataFrame-based graphs. 
It provides high-level APIs in Java, Python, and Scala. It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark [DataFrames](https:\/\/docs.databricks.com\/getting-started\/dataframes.html). This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries. \nThe GraphFrames package is included in [Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html). \n* [Graph analysis tutorial with GraphFrames](https:\/\/docs.databricks.com\/integrations\/graphframes\/graph-analysis-tutorial.html)\n* [GraphFrames user guide - Python](https:\/\/docs.databricks.com\/integrations\/graphframes\/user-guide-python.html)\n* [GraphFrames user guide - Scala](https:\/\/docs.databricks.com\/integrations\/graphframes\/user-guide-scala.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/graphframes\/index.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### Onboard data from Amazon S3\n\nThis article describes how to onboard data to a new Databricks workspace from Amazon S3. You\u2019ll learn how to securely access source data in a cloud object storage location that corresponds with a Unity Catalog volume (recommended) or a Unity Catalog external location. Then, you\u2019ll learn how to ingest the data incrementally into a Unity Catalog managed table using Auto Loader with Delta Live Tables. 
\nNote \nTo onboard data in Databricks SQL instead of in a notebook, see [Load data using streaming tables in Databricks SQL](https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/onboard-data.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### Onboard data from Amazon S3\n#### Before you begin\n\nIf you\u2019re not an admin, this article assumes that an admin has provided you with the following: \n* Access to a Databricks workspace with Unity Catalog enabled. For more information, see [Set up and manage Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html).\n* The `READ FILES` permission on the Unity Catalog external volume or the Unity Catalog external location that corresponds with the cloud storage location that contains your source data. For more information, see [Create an external location to connect cloud storage to Databricks](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html).\n* The path to your source data. \nVolume path example: `\/Volumes\/<catalog>\/<schema>\/<volume>\/<path>\/<folder>` \nExternal location path example: `s3:\/\/<bucket>\/<folder>\/`\n* The `USE SCHEMA` and `CREATE TABLE` privileges on the schema you want to load data into.\n* [Cluster creation permission](https:\/\/docs.databricks.com\/compute\/use-compute.html) or access to a [cluster policy](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html) that defines a Delta Live Tables pipeline cluster (`cluster_type` field set to `dlt`). \nIf the path to your source data is a volume path, your cluster must run Databricks Runtime 13.3 LTS or above. \nImportant \nIf you have questions about these prerequisites, contact your account admin.\n\n### Onboard data from Amazon S3\n#### Step 1: Create a cluster\n\nTo create a cluster, do the following: \n1. Sign in to your Databricks workspace.\n2. 
In the sidebar, click **New** > **Cluster**.\n3. In the clusters UI, specify a unique name for your cluster.\n4. If the path to your source data is a volume path, for **Databricks runtime version**, select 13.3 LTS or above.\n5. Click **Create cluster**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/onboard-data.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### Onboard data from Amazon S3\n#### Step 2: Create a data exploration notebook\n\nThis section describes how to create a data exploration notebook so you can understand your data before you create your data pipeline. \n1. In the sidebar, click **New** > **Notebook**. \nThe notebook is automatically attached to the last cluster you used (in this case, the cluster you created in **Step 1: Create a cluster**).\n2. Enter a name for the notebook.\n3. Click the language button, and then select `Python` or `SQL` from the dropdown menu. `Python` is selected by default.\n4. To confirm data access to your source data in S3, paste the following code into a notebook cell, click ![Run Menu](https:\/\/docs.databricks.com\/_images\/run-menu.png), and then click **Run Cell**. \n```\nLIST '<path-to-source-data>'\n\n``` \n```\n%fs ls '<path-to-source-data>'\n\n``` \nReplace `<path-to-source-data>` with the path to the directory that contains your data. \nThis displays the contents of the directory that contains the dataset.\n5. To view a sample of the records to better understand the contents and format of each record, paste the following into a notebook cell, click ![Run Menu](https:\/\/docs.databricks.com\/_images\/run-menu.png), and then click **Run Cell**. \n```\nSELECT * FROM read_files('<path-to-source-data>', format => '<file-format>') LIMIT 10\n\n``` \n```\nspark.read.format('<file-format>').load('<path-to-source-data>').limit(10).display()\n\n``` \nReplace the following values: \n* `<file-format>`: A supported file format. 
See [File format options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#format-options).\n* `<path-to-source-data>`: The path to a file in the directory that contains your data. \nThis displays the first ten records from the specified file.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/onboard-data.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### Onboard data from Amazon S3\n#### Step 3: Ingest raw data\n\nTo ingest raw data, do the following: \n1. In the sidebar, click **New** > **Notebook**. \nThe notebook is automatically attached to the last cluster you used (in this case, the cluster you created earlier in this article).\n2. Enter a name for the notebook.\n3. Click the language button, and then select `Python` or `SQL` from the dropdown menu. `Python` is selected by default.\n4. Paste the following code into a notebook cell: \n```\nCREATE OR REFRESH STREAMING TABLE\n  <table-name>\nAS SELECT\n  *\nFROM\n  STREAM read_files(\n    '<path-to-source-data>',\n    format => '<file-format>'\n  )\n\n``` \n```\nimport dlt\n\n@dlt.table(table_properties={'quality': 'bronze'})\ndef <table-name>():\n    return (\n        spark.readStream.format('cloudFiles')\n        .option('cloudFiles.format', '<file-format>')\n        .load('<path-to-source-data>')\n    )\n\n``` \nReplace the following values: \n* `<table-name>`: A name for the table that will contain the ingested records.\n* `<path-to-source-data>`: The path to your source data.\n* `<file-format>`: A supported file format. See [File format options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#format-options). \nNote \nDelta Live Tables isn\u2019t designed to run interactively in notebook cells. Running a cell that contains Delta Live Tables syntax in a notebook returns a message about whether the query is syntactically valid, but does not run query logic. 
The following step describes how to create a pipeline from the ingestion notebook you just created.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/onboard-data.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### Onboard data from Amazon S3\n#### Step 4: Create and publish a pipeline\n\nTo create a pipeline and publish it to Unity Catalog, do the following: \n1. In the sidebar, click **Workflows**, click the **Delta Live Tables** tab, and then click **Create pipeline**.\n2. Enter a name for your pipeline.\n3. For **Pipeline mode**, select **Triggered**.\n4. For **Source code**, select the notebook that contains your pipeline source code.\n5. For **Destination**, select **Unity Catalog**.\n6. To ensure that your table is managed by Unity Catalog and any user with access to the parent schema can query it, select a **Catalog** and a **Target schema** from the drop-down lists.\n7. If you don\u2019t have cluster creation permission, select a **Cluster policy** that supports Delta Live Tables from the drop-down list.\n8. For **Advanced**, set the **Channel** to **Preview**.\n9. Accept all other default values and click **Create**.\n\n### Onboard data from Amazon S3\n#### Step 5: Schedule the pipeline\n\nTo schedule the pipeline, do the following: \n1. In the sidebar, click **Delta Live Tables**.\n2. Click the name of the pipeline you want to schedule.\n3. Click **Schedule** > **Add a schedule**.\n4. For **Job name**, enter a name for the job.\n5. Set the **Schedule** to **Scheduled**.\n6. Specify the period, starting time, and time zone.\n7. Configure one or more email addresses to receive alerts on pipeline start, success, or failure.\n8. Click **Create**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/onboard-data.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### Onboard data from Amazon S3\n#### Next steps\n\n* Grant users access to the new table. 
For more information, see [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html).\n* Users with access to the new table can now query the table in a [notebook](https:\/\/docs.databricks.com\/notebooks\/index.html) or use the [Databricks SQL editor](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/onboard-data.html"} +{"content":"# Technology partners\n## Connect to reverse ETL partners using Partner Connect\n#### Connect to Hightouch\n\nHightouch syncs your data in Databricks to the tools that your business teams rely on. \nYou can integrate your Databricks SQL warehouses (formerly Databricks SQL endpoints) and Databricks clusters with Hightouch.\n\n#### Connect to Hightouch\n##### Connect to Hightouch using Partner Connect\n\nTo connect your Databricks workspace to Hightouch using Partner Connect, see [Connect to reverse ETL partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/reverse-etl.html). \nNote \nPartner Connect only supports SQL warehouses for Hightouch. To connect a cluster to Hightouch, connect to Hightouch manually.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/reverse-etl\/hightouch.html"} +{"content":"# Technology partners\n## Connect to reverse ETL partners using Partner Connect\n#### Connect to Hightouch\n##### Connect to Hightouch manually\n\nThis section describes how to connect an existing SQL warehouse or cluster in your Databricks workspace to Hightouch manually. \nNote \nFor Databricks SQL warehouses, you can connect to Hightouch using Partner Connect to simplify the experience. \n### Requirements \nBefore you connect to Hightouch manually, you need the following: \n* A cluster or SQL warehouse in your Databricks workspace. 
\n+ [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n+ [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* The connection details for your cluster or SQL warehouse, specifically the **Server Hostname**, **Port**, and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). To create a personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**.\nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. 
See your workspace administrator or the following: \n+ [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n+ [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \n### Steps to connect \nTo connect to Hightouch manually, do the following: \n1. [Sign up](https:\/\/app.hightouch.com\/signup) for a new Hightouch account, or [sign in](https:\/\/app.hightouch.com\/login) to your existing Hightouch account.\n2. Create a workspace by clicking **Create a workspace**, or select an existing workspace.\n3. If you chose to create a workspace, enter a name for the workspace, and then click **Create workspace**.\n4. In the workspace navigation pane, click **Sources**.\n5. Click **Add source**.\n6. Click **Databricks**, and then click **Continue**.\n7. For **Server Hostname**, enter the **Server Hostname** value from the requirements.\n8. For **Port**, enter the **Port** value from the requirements.\n9. For **HTTP Path**, enter the **HTTP Path** value from the requirements.\n10. For **Access Token**, enter the token value from the requirements.\n11. For **Default Schema**, enter the name of the target database within the workspace.\n12. Click **Test Connection**.\n13. 
After the connection attempt succeeds, click **Continue**.\n14. Enter a name for the connection, and then click **Finish**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/reverse-etl\/hightouch.html"} +{"content":"# Technology partners\n## Connect to reverse ETL partners using Partner Connect\n#### Connect to Hightouch\n##### Next steps\n\nCreate a destination, a model, and a sync. See [Create your first sync](https:\/\/hightouch.com\/docs\/get-started\/create-a-sync\/) in the Hightouch documentation.\n\n#### Connect to Hightouch\n##### Additional resources\n\n* [Hightouch website](https:\/\/www.hightouch.com)\n* [Hightouch documentation](https:\/\/hightouch.com\/docs\/)\n* [Databricks connection details](https:\/\/hightouch.com\/docs\/sources\/databricks\/)\n* [Hightouch support](mailto:hello%40hightouch.com)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/reverse-etl\/hightouch.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Identifying an expensive read in Spark\u2019s DAG\n###### Getting to the DAG\n\nAssuming you\u2019re looking at an expensive job, first we need the ID of the stage that\u2019s doing the read. Here we can see the Stage ID is 194: \n![Stage ID](https:\/\/docs.databricks.com\/_images\/stage-id.png) \nNow we need to get to the SQL DAG. Scroll up to the top of the job\u2019s page and click on the **Associated SQL Query**: \n![SQL ID](https:\/\/docs.databricks.com\/_images\/stage-to-sql.png) \nYou should now see the DAG. If not, scroll around a bit and you should see it: \n![SQL DAG](https:\/\/docs.databricks.com\/_images\/sql-dag.png) \nIn some cases, you can follow the DAG and see where the data is coming from. In other cases, look for the Stage ID you noted: \n![SQL Stage in DAG](https:\/\/docs.databricks.com\/_images\/stage-in-dag.png) \nThen you need to look for the \u201cScan\u201d node. 
In this case it\u2019s pretty simple to tell that we\u2019re reading a table named `transactions`: \n![Scan in DAG](https:\/\/docs.databricks.com\/_images\/scan-node.png) \nIn some cases you may need to click on or roll over the node to get the location of the data you\u2019re reading.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/spark-dag-expensive-read.html"} +{"content":"# What is Databricks Marketplace?\n### Manage existing Databricks Marketplace listings\n\nThis article describes how to edit, unpublish, delete, and revoke access to Databricks Marketplace listings and listed data products. It is intended for data providers.\n\n### Manage existing Databricks Marketplace listings\n#### Before you begin\n\nYou must be a Marketplace admin to manage provider listings. See [Assign the Marketplace admin role](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#marketplace-admin).\n\n### Manage existing Databricks Marketplace listings\n#### Edit a Marketplace listing\n\nTo edit an existing listing: \n1. Log into your Databricks workspace.\n2. In the sidebar, click the ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace** icon.\n3. On the upper-right corner of the Marketplace page, click **Provider console**.\n4. On the **Listings** tab, find the listing you want to modify, click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu (also known as the three-dot menu) at the end of the listing row, and select **Edit**. \nYou cannot change an instantly available listing to one that requires approval, or vice versa.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/manage-listings.html"} +{"content":"# What is Databricks Marketplace?\n### Manage existing Databricks Marketplace listings\n#### Unpublish a data product\n\nIf you want to remove a listing from the Marketplace UI and disable consumer access requests, you can unpublish it. 
The listing remains in your workspace in unpublished status, available to be relisted. Any pending requests are deleted. \nTo unpublish a listing: \n1. Log into your Databricks workspace.\n2. In the sidebar, click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**.\n3. On the upper-right corner of the Marketplace page, click **Provider console**.\n4. On the **Listings** tab, find the listing you want to unpublish, click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu (also known as the three-dot menu) at the end of the listing row, and select **Unpublish listing**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/manage-listings.html"} +{"content":"# What is Databricks Marketplace?\n### Manage existing Databricks Marketplace listings\n#### Delete a data product\n\nIf you want to remove a listing from the Marketplace UI, disable consumer access requests, and remove it from your own workspace, you can delete it. You will not be able to relist it. Underlying shares in your workspace are not deleted. Consumers who already have access to the data product in their workspace will continue to have access to the shared data product unless you revoke access to the underlying Delta Sharing share. \nTo delete a listing: \n1. Log into your Databricks workspace.\n2. In the sidebar, click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**.\n3. On the upper-right corner of the Marketplace page, click **Provider console**.\n4. 
On the **Listings** tab, find the listing you want to delete, click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu (also known as the three-dot menu) at the end of the listing row, and select **Delete**.\n\n### Manage existing Databricks Marketplace listings\n#### Revoke a consumer\u2019s access to a data product\n\nTo revoke a consumer\u2019s access to a data product that you have shared with them, you can use Delta Sharing interfaces. See [Revoke recipient access to a share](https:\/\/docs.databricks.com\/data-sharing\/grant-access.html#revoke). \nNote \nIf you revoke a consumer\u2019s access to a data product that you have already shared with them, and your listing for that product is still live, the consumer will continue to see the data product as installed on the **My requests** page of the Databricks Marketplace in their workspace, but they will be unable to access the data and unable to request that data product from that listing from that workspace again.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/manage-listings.html"} +{"content":"# What is data warehousing on Databricks?\n### Legacy dashboards\n\nYou can use the [SQL editor](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html) to build a legacy dashboard that combines [visualizations](https:\/\/docs.databricks.com\/visualizations\/index.html) and text boxes that provide context with your data. \nNote \nDashboards (formerly Lakeview dashboards) are now generally available. \n* Databricks recommends authoring new dashboards using the latest tooling. See [Dashboards](https:\/\/docs.databricks.com\/dashboards\/index.html).\n* Original Databricks SQL dashboards are now called **legacy dashboards**. They will continue to be supported and updated with critical bug fixes, but new functionality will be limited. 
You can continue to use legacy dashboards for both authoring and consumption.\n* Convert legacy dashboards using the migration tool or REST API. See [Clone a legacy dashboard to a Lakeview dashboard](https:\/\/docs.databricks.com\/dashboards\/clone-legacy-to-lakeview.html) for instructions on using the built-in migration tool. See [Use Databricks APIs to manage dashboards](https:\/\/docs.databricks.com\/dashboards\/tutorials\/index.html#apis) for tutorials on creating and managing dashboards using the REST API.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Legacy dashboards\n#### View and organize dashboards\n\nYou can access dashboards from the workspace browser along with other Databricks objects. \n* Click ![Workspace Icon](https:\/\/docs.databricks.com\/_images\/workspace-icon.png) **Workspace** in the sidebar to view dashboards from the workspace browser. Dashboards are stored in the `\/Workspace\/Users\/<username>` directory by default. Users can organize dashboards into folders in the workspace browser along with other Databricks objects.\n* To view the dashboard listing page, click ![Dashboards Icon](https:\/\/docs.databricks.com\/_images\/dashboards-icon.png) **Dashboards** in the sidebar.\n* Click the **Legacy dashboards** tab to view legacy dashboards. \nBy default, the **My dashboards** tab is selected and shows dashboards that you own sorted in reverse chronological order. Reorder the list by clicking the **Created at** heading. Or, use the tabs near the top of the page to view **Favorites**, **Trash**, or **All dashboards**. Use the **Tags** tab to filter by tag. \n### Organize dashboards into folders in the workspace browser \nOrganize new and existing dashboards into folders in the workspace browser along with other Databricks objects. See [Workspace browser](https:\/\/docs.databricks.com\/workspace\/workspace-browser\/index.html). 
\n### Filter the list of saved dashboards \nFilter the list of all dashboards by dashboards that you created (**My Dashboards**), by [favorites](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html#favorite-and-tag-queries), and by [tags](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html#favorite-and-tag-queries).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Legacy dashboards\n#### Create a legacy dashboard\n\nFollow these steps to create a new legacy dashboard. To clone an existing dashboard, see [Clone a legacy dashboard](https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html#clone-legacy). \n1. Click ![Dashboards Icon](https:\/\/docs.databricks.com\/_images\/dashboards-icon.png) **Dashboards** in the sidebar, then click the **Legacy dashboards** tab.\n2. Click **Create legacy dashboard**.\n3. Enter a name for the dashboard.\n4. When you create a dashboard, you have the option to specify a warehouse to be used for manual refresh. If you do not select and save a warehouse to the dashboard, it will fall back to using the warehouse saved for individual queries. \nNote \nIf you select and save a warehouse to the dashboard and then it is deleted or otherwise becomes unavailable, a manual refresh will fail until a new warehouse is assigned. \n1. Add content to the dashboard by clicking **Add** and selecting the type of content to add: \n* Click **Textbox** to add commentary. \nEnter text. Style the text boxes using [Markdown](https:\/\/daringfireball.net\/projects\/markdown\/syntax). \n+ To add a static image in a text box, use markdown image syntax with a description and publicly available URL: `![description](url)`. For example, the following markdown inserts an image of the Databricks logo: `![The Databricks Logo](https:\/\/upload.wikimedia.org\/wikipedia\/commons\/6\/63\/Databricks_Logo.png)`. 
To resize the image, resize the widget dimensions.\n+ To add an image from DBFS, add markdown image syntax with a desired description and FileStore path: `![description](files\/path_to_dbfs_image)`. To resize the image, resize the widget dimensions.\nImportant \nImages stored in DBFS do not render in on-demand PDFs or subscriptions.\n* Click **Visualization** to add a query visualization. \n1. Select a query. Search existing queries or pick a recent one from the pre-populated list. If a query was saved with the **Limit 1000** setting, the query in the dashboard limits results to 1000 rows.\n2. In the **Choose Visualization** drop-down, select the visualization type. \n![Add to dashboard](https:\/\/docs.databricks.com\/_images\/add-chart-widget.png)\n2. Click **Add to legacy dashboard**.\n3. Drag and drop content blocks on the dashboard.\n4. Click **Done Editing**. \n![Complete dashboard](https:\/\/docs.databricks.com\/_images\/dashboard.png) \nYou can also create a dashboard with the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_sql\\_dashboard](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/sql_dashboard). You can create a [widget](https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html#add-content-to-a-dashboard) for a legacy dashboard with [databricks\\_sql\\_widget](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/sql_widget). 
You can create a sample legacy dashboard with [dbsql-nyc-taxi-trip-analysis](https:\/\/github.com\/databricks\/terraform-databricks-examples\/tree\/main\/modules\/dbsql-nyc-taxi-trip-analysis).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Legacy dashboards\n#### Clone a legacy dashboard\n\nYou can clone the legacy dashboard and all upstream queries if you have the CAN RUN, CAN EDIT, and CAN MANAGE permissions on the dashboard and each of its upstream queries. You become the owner of the new dashboard and queries. \nImportant \nSharing settings, alerts, and subscriptions are not copied to the new dashboard. \nTo clone a legacy dashboard: \n1. Open the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu at the top-right of the dashboard and select **Clone**.\n2. Enter a name for the new dashboard, then click **Confirm**. \nNote \nCloning is guaranteed to work reliably with fewer than 50 visualizations and fewer than 30 queries, including queries that are used to generate query-based dropdown list parameters. Attempting to clone a dashboard with visualizations or queries that exceed these limits may fail. \nFor more information about query-based dropdown list parameters, see [Query-Based Dropdown List](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#query-based-dropdown-list).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Legacy dashboards\n#### Using query parameters in legacy dashboards\n\nQueries can optionally leverage parameters or static values. 
When a visualization based on a parameterized query is added to a dashboard, the visualization can be configured to use one of the following: \n* **Widget parameter**: Widget parameters are specific to a single visualization within a dashboard, appear within the visualization panel, and the parameter values specified apply only to the query underlying the visualization.\n* **Dashboard parameter**: Legacy dashboard parameters can apply to multiple visualizations. When you add a visualization based on a parameterized query to a dashboard, the parameter is treated as a dashboard parameter by default. Dashboard parameters are configured for one or more visualizations within a dashboard and appear at the top of the dashboard. The parameter values specified for a dashboard parameter apply to visualizations reusing that particular dashboard parameter. A dashboard can have multiple parameters, each of which may apply to some visualizations and not others.\n* **Static value**: Static values are used instead of a parameter that responds to changes. Static values allow you to hard code a value in place of a parameter and will make the parameter \u201cdisappear\u201d from the dashboard or widget where it previously appeared. \nWhen you add a visualization containing a parameterized query, you can choose the title and the source for the parameter in the visualization query by clicking the appropriate pencil icon ![Pencil Icon](https:\/\/docs.databricks.com\/_images\/pencil-icon.png). You can also select the keyword and a default value. See [Parameter properties](https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html#parameter-properties-legacy). 
\n![Parameter mapping](https:\/\/docs.databricks.com\/_images\/dashboard_parameter_mapping.png) \nAfter adding a visualization to a dashboard, you can access the parameter mapping interface by clicking the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu on the upper-right of a dashboard widget and then clicking **Change widget settings**. \n![Open dashboard parameter mapping](https:\/\/docs.databricks.com\/_images\/dashboard_parameter_mapping_settings.png) \n![Change parameter mapping view](https:\/\/docs.databricks.com\/_images\/dashboard_parameter_mapping_view.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Legacy dashboards\n#### Parameter properties\n\nThe dashboard widget parameter properties are: \n* **Title**: The display name that appears beside the value selector on your dashboard. It defaults to the title set in the query editor. To edit how it appears within the dashboard, click the pencil icon ![Pencil Icon](https:\/\/docs.databricks.com\/_images\/pencil-icon.png). Titles are not displayed for static dashboard parameters because the value selector is hidden. If you select **Static value** as your Value Source then the Title field is grayed out.\n* **Keyword**: The string literal for this parameter in the underlying query. This is useful for debugging if your dashboard does not return the expected results.\n* **Default Value**: The value set for that parameter on dashboard load until another is selected and changes applied. To change this default, open the underlying query in the SQL editor, change the parameter to your desired value, and click the **Save** button.\n* **Value Source**: The source of the parameter value. Click the pencil icon ![Pencil Icon](https:\/\/docs.databricks.com\/_images\/pencil-icon.png) to choose a source. \n+ **New dashboard parameter**: Create a new dashboard-level parameter. 
This lets you set a parameter value in one place on your dashboard and map it to one or more visualizations. Parameters must have unique names within the dashboard.\n+ **Existing dashboard parameter**: Map this visualization\u2019s parameter to an existing dashboard parameter. You must specify which pre-existing dashboard parameter.\n+ **Widget parameter**: Displays a value selector inside your dashboard widget. This is useful for one-off parameters that are not shared between widgets.\n+ **Static value**: Choose a static value for the widget, regardless of the values used on other widgets. Statically mapped parameter values do not display a value selector anywhere on the dashboard, which is more compact. This lets you take advantage of the flexibility of query parameters without cluttering the user interface on a dashboard when certain parameters are not expected to change frequently.\n![Change parameter mapping](https:\/\/docs.databricks.com\/_images\/dashboard_parameter_mapping_change.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Legacy dashboards\n#### Edit a legacy dashboard\n\nTo open the dashboard for editing, open the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu at the top-right of the dashboard and select **Edit**. \n![Edit dashboard](https:\/\/docs.databricks.com\/_images\/edit-dashboard.png) \nWhile editing, you can add and remove content, edit visualizations, and apply filters. To change the order in which parameters are shown, you can click and drag each parameter to the desired position. \n### Filter across multiple queries \nTo filter across multiple queries on a dashboard: \n1. Go to your legacy dashboard.\n2. In **Edit** mode, click **Add**, and then click **Filter**.\n3. Select **New dashboard filter** and choose the queries and columns to filter. 
You can also choose to import filters from existing queries by selecting **Existing query filters** and choosing to import a filter from a SQL query editor. The queries you choose must belong to the same catalog and schema.\n4. Click **Save**. This creates a filter that contains the union of all dropdown options. \nNote \nThe queries you choose must belong to the same catalog and schema. Some old queries may not be compatible with filtering across multiple queries.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Legacy dashboards\n#### Edit a dashboard visualization\n\nTo edit a visualization on the dashboard while in edit mode, select the visualization you wish to edit and then click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu at the top-right of the visualization. In the displayed list, select **Edit visualization**. \n![Edit visualization](https:\/\/docs.databricks.com\/_images\/edit-visualization-dashboard.png) \n### Add content to a dashboard \n1. Open the dashboard for [editing](https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html#edit-a-dashboard-legacy).\n2. Click **Add Textbox** or **Add Widget**.\n3. Click **Add to legacy dashboard**.\n4. Click **Done Editing**. \nYou can also [add a visualization to a dashboard](https:\/\/docs.databricks.com\/sql\/user\/visualizations\/index.html#add-to-dashboard) in the SQL editor. \n### Remove content from a dashboard \n1. Click the ![SQL Delete Icon](https:\/\/docs.databricks.com\/_images\/delete-icon1.png) or hover over the object, click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu at the top-right of the widget and select **Remove from Dashboard**.\n2. Click **Delete**. 
\n### Dashboard filters \nWhen queries [have filters](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-filters.html), you must also apply filters at the dashboard level. Select the **Use Dashboard Level Filters** checkbox to apply the filter to all queries. \n### Customize dashboard colors \nYou can customize the dashboard color palette, including creating a color palette. \n#### Create a color palette \nTo create a custom color palette for a dashboard: \n1. Click ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu at the upper-right, and click **Edit**.\n2. Click **Colors**.\n3. To import an existing color palette, click **Import** and select the palette. You can customize the imported palette.\n4. To create a new palette, or to customize an imported palette, do the following: \n1. To add a new color, click **Add**.\n2. For a newly added color or an existing color, specify the color by doing either of the following: \n* Click the square and select the new color by clicking it in the color selector or using the eyedropper.\n* Click the text field next to the square and enter a hexadecimal value.\n5. Click **Apply**. \n#### Stop using a custom color palette \nTo remove a custom color palette: \n1. Click ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu at the upper-right, and click **Edit**.\n2. Click **Colors**.\n3. Click **Clear**.\n4. Click **Apply**. \n#### Use a different color palette for a visualization \nBy default, if a color palette has been applied in a dashboard, all visualizations will use that color palette. If you\u2019d like to use different colors for a visualization, you can override this behavior: \n1. Click ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu for the visualization in the dashboard and click **Edit**.\n2. Click the checkbox next to **Retain colors specified on visualization**.\n3. 
Click **OK**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Legacy dashboards\n#### Refresh a dashboard\n\nLegacy dashboards are designed for efficient loading as they retrieve data from a cache that renews each time a query runs. However, your dashboards can become outdated if you don\u2019t run the associated queries regularly. To prevent your dashboards from becoming stale, you can refresh the dashboard to rerun the associated queries. \nEach time a dashboard is refreshed, either manually or on a schedule, all queries referenced in the dashboard are refreshed. When an individual visualization is refreshed, the upstream query is refreshed. Manually refreshing the dashboard or individual visualization does not refresh queries used in Query Based Dropdown Lists. For details on Query Based Dropdown Lists, see [Query-Based Dropdown List](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html#query-based-dropdown-list). \n### Refresh behavior and execution context \nWhen a dashboard is \u201cRun as Owner\u201d and a schedule is added, the owner\u2019s credential is used for execution, and anyone with at least CAN RUN permission sees the results of those refreshed queries. \nWhen a dashboard is \u201cRun as Viewer\u201d and a schedule is added, the owner\u2019s credential is used for execution, but only the owner sees the results of the refreshed queries; all other viewers must manually refresh to see updated query results. \n### Manually refresh a dashboard \nTo force a refresh, click **Refresh** on the upper-right of the dashboard. This runs all the dashboard queries and updates its visualizations. \n### Automatically refresh a dashboard \nA dashboard\u2019s owner and users with the CAN EDIT permission can configure a dashboard to automatically refresh on a schedule. To automatically refresh a dashboard: \n1. 
Click **Schedule** at the upper-right corner of the dashboard. Then, click **Add schedule**.\n2. Use the dropdown pickers to specify the frequency, period, starting time, and time zone. Optionally, select the **Show cron syntax** checkbox to edit the schedule in [Quartz Cron Syntax](http:\/\/www.quartz-scheduler.org\/documentation\/quartz-2.3.0\/tutorials\/crontrigger.html).\n3. Choose **More options** to show optional settings. You can choose: \n* A name for the schedule.\n* A SQL warehouse to power the query. By default, the SQL warehouse used for ad hoc query execution is also used for a scheduled job. Use this optional setting to select a different warehouse to run the scheduled query.\nNote \nThis warehouse can be different than the one used for manual refresh.\n4. Optional: In the **Subscribers** tab, enter a list of email addresses to notify when the dashboard is automatically updated. Each email address must be associated with a Databricks account with workspace access or defined as a notification destination in the workspace settings. Notification destinations are configured by a workspace admin.\n5. Click **Create**. The **Schedule** label changes to **Schedule(1)**.\n6. Edit sharing settings. \nNote \nDashboard permissions are not linked to schedule permissions. After creating your scheduled refresh interval, edit the schedule permissions to provide access to other users. Only users with CAN MANAGE permission can edit the schedule or edit the subscriber list. \n* Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu.\n* Click **Edit schedule permissions**.\n* Choose a user or group from the drop-down menu in the dialog.\n* Choose CAN VIEW to allow the selected users to view the schedule configuration. \nCAN VIEW or CAN RUN permission allows the assigned recipient to see that a schedule exists, as well as other properties like report cadence and number of subscribers. 
CAN MANAGE allows the recipient to modify the schedule, subscriber list, and schedule permission. CAN MANAGE permission also allows the recipient to pause or unpause the schedule. \n### Refresh behavior on an open dashboard \nWhen you open a dashboard set to `Run as Owner`, it displays data from the latest dashboard update, regardless of whether it was scheduled or manually refreshed. If a dashboard is open in a browser window, and a query is modified or a scheduled run updates dashboard results, the changes won\u2019t be reflected immediately. The updated results will appear the next time you open the dashboard or refresh the open browser window.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Legacy dashboards\n#### Dashboard snapshot subscriptions\n\nYou can periodically export and email dashboard snapshots. Dashboard snapshots are taken from the default dashboard state, meaning that any interaction with the filters and visualizations is not included in the snapshot. \nIf you have at least CAN EDIT permission, you can create a refresh schedule and subscribe other users, who will receive email snapshots of the dashboard every time it\u2019s refreshed. To add subscribers, enter users or groups in the **Subscribers** tab as described above. Eligible subscribers include workspace users and notification destinations. \nNote \nNotification destinations are configured by a workspace admin. To learn how to configure a notification destination, see [Manage notification destinations](https:\/\/docs.databricks.com\/admin\/workspace-settings\/notification-destinations.html). \nThere is a 6 MB file size limit for email attachments. If a dashboard subscription email exceeds the 6 MB size limit, the email will omit the inline dashboard snapshot and include only a PDF of the dashboard snapshot. 
\nIf the PDF snapshot file exceeds 6 MB, the subscription email will omit the PDF and instead include a link to the refreshed dashboard. The email will have a warning note detailing the current dashboard size. (Users can test the PDF snapshot size by manually downloading a PDF of the dashboard.) \n### Temporarily pause scheduled dashboard updates \nIf a dashboard is configured for automatic updates, and you have at least CAN VIEW permission on the schedule, the label on the **Schedule** button reads **Schedule(#)**, where the # is the number of scheduled events that are visible to you. Additionally, if you have at least CAN MANAGE permission on the schedule, you can temporarily pause the schedule. This is helpful to avoid sending updates while testing changes to the dashboard. To temporarily pause scheduled dashboard updates without modifying the list of subscribers: \n1. Click **Schedule(#)**.\n2. Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu.\n3. Click **Pause**. \n### Stop automatically updating a dashboard \nTo stop automatically updating the dashboard and remove its subscriptions: \n1. Click **Schedule(#)**.\n2. Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu>**Delete**. \nNote \nYou must have at least CAN MANAGE permission on a schedule to delete it.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Legacy dashboards\n#### Dashboard size limits for subscriptions\n\nLegacy dashboard subscription emails include the following base64 encoded files: \n* PDF: A PDF file that includes the full dashboard.\n* DesktopImage: An image file optimized for viewing on desktop computers.\n* MobileImage: An image file optimized for viewing on mobile devices. \nA maximum limit of 6MB is imposed on the combined size of the three files. 
The following descriptions outline the expected behavior when the combined file size exceeds the limit: \n* **If the PDF file is greater than 6MB:** The subscription email does not include the PDF attachment or any images. It includes a note that says the dashboard has exceeded the size limit and shows the actual file size of the current dashboard.\n* **If the combined size of the PDF and DesktopImage files is greater than 6MB:** Only the PDF is attached to the email. The inline message includes a link to the dashboard but no inline image for mobile or desktop viewing.\n* **If the combined file size of all files is greater than 6MB:** The MobileImage is excluded from the email.\n\n### Legacy dashboards\n#### Download as PDF\n\nTo download a dashboard as a PDF file, click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu at the top-right of the dashboard and select **Download as PDF**.\n\n### Legacy dashboards\n#### Move a dashboard to Trash\n\nTo move a dashboard to Trash, click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu at the top-right of the dashboard and select **Move to Trash**. Confirm by clicking **Move to Trash**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Legacy dashboards\n#### Restore a dashboard from Trash\n\n1. Click ![Dashboards Icon](https:\/\/docs.databricks.com\/_images\/dashboards-icon.png) **Dashboards** in the sidebar.\n2. Click the **Dashboards** tab to view legacy dashboards.\n3. Click the **Trash** tab.\n4. Click a dashboard.\n5. Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu at the top-right of the dashboard and select **Restore**.\n\n### Legacy dashboards\n#### Permanently delete a dashboard\n\n1. In the All Dashboards list, click ![Trash Button](https:\/\/docs.databricks.com\/_images\/trash-button.png).\n2. 
Click a dashboard.\n3. Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu at the top-right of the dashboard and select **Delete**.\n\n### Legacy dashboards\n#### Open a query\n\nTo open the query displayed in a widget in the SQL editor, click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu at the top-right of the widget and select **View Query**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Legacy dashboards\n#### Configure dashboard permissions and dashboard ownership\n\nYou must have CAN MANAGE on a dashboard to configure permissions. For dashboard permission levels, see [Legacy dashboard ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#dashboards). \n1. In the sidebar, click **Dashboards**.\n2. Click a dashboard.\n3. Click the ![Share Button](https:\/\/docs.databricks.com\/_images\/share-button.png) button at the top right to open the **Sharing** dialog.\n![Manage dashboard permissions](https:\/\/docs.databricks.com\/_images\/manage-permissions.png)\n4. Search for and select the groups or users and assign the permission level. \n1. Set the credentials to **Run as viewer** to assign the CAN EDIT or CAN MANAGE permissions.\n5. Click **Add**. \nYou can quickly share all queries associated with your dashboard by clicking the gear icon and selecting **Share all queries**. Queries referenced by the dashboard have separate permissions and are not shared by default when you share the dashboard. \n### Transfer ownership of a dashboard \nIf a dashboard\u2019s owner is removed from a workspace, the dashboard no longer has an owner. A workspace admin user can transfer ownership of any dashboard, including one without an owner, to a different user. Groups cannot be assigned ownership of a dashboard. 
You can also transfer ownership using the [Permissions API](https:\/\/docs.databricks.com\/api\/workspace\/permissions). \n1. As a workspace admin, log in to your Databricks workspace.\n2. In the sidebar, click **Dashboards**.\n3. Click a dashboard.\n4. Click the **Share** button at the top right to open the **Sharing** dialog.\n5. Click on the gear icon at the top right and click **Assign new owner**. \n![Assign new owner](https:\/\/docs.databricks.com\/_images\/assign-new-owner.png)\n6. Select the user to assign ownership to.\n7. Click **Confirm**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Legacy dashboards\n#### Access admin view\n\nA Databricks workspace admin user has view access to all dashboards in the workspace. In this view a workspace admin can view and delete any dashboard. However, a workspace admin can\u2019t edit a dashboard when sharing setting credentials are set to **Run as owner**. \nTo view all dashboards: \n1. Click ![Dashboards Icon](https:\/\/docs.databricks.com\/_images\/dashboards-icon.png) **Dashboards** in the sidebar.\n2. Click the **All queries** tab near the top of the screen.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html"} +{"content":"# \n### Tutorial index\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
To try it, reach out to your Databricks contact.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/index.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Mode\n\nThis article describes how to use Mode with a Databricks cluster or a Databricks SQL warehouse (formerly Databricks SQL endpoint).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/mode.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Mode\n##### Requirements\n\nBefore you connect to Mode manually, you need the following: \n* A cluster or SQL warehouse in your Databricks workspace. \n+ [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n+ [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* The connection details for your cluster or SQL warehouse, specifically the **Server Hostname**, **Port**, and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). To create a personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**.\nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. 
If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. See your workspace administrator or the following: \n+ [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n+ [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/mode.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Mode\n##### Connect to Mode manually\n\nTo connect to Mode manually, do the following: \n1. [Sign in to Mode](https:\/\/app.mode.com\/).\n2. Click **Definitions > Data > Manage Connections > Connect a database**.\n3. 
In the list labelled **Or choose from a list of databases we support**, choose **Databricks**.\n4. On the **Enter your Databricks credentials** page, enter the following information: \n1. For **Host**, enter the **Server Hostname** value from the requirements.\n2. For **Port**, enter the **Port** value from the requirements.\n3. For **Database name**, enter the name of the database that you want to use.\n4. For **Token**, enter the personal access token from the requirements.\n5. For **HTTP Path**, enter the **HTTP Path** value from the requirements.\n5. Click **Connect**.\n\n#### Connect to Mode\n##### Next steps\n\nTo continue using Mode, see the following resources on the Mode website: \n* [Quick reference guide](https:\/\/mode.com\/help\/articles\/quick-reference-guide\/)\n* [Getting started with Mode](https:\/\/mode.com\/help\/articles\/getting-started-with-mode\/)\n* [Connect your database](https:\/\/mode.com\/help\/connect-your-database)\n* [Navigate and organize content](https:\/\/mode.com\/help\/navigate-and-organize-content)\n* [Query and analyze data](https:\/\/mode.com\/help\/query-and-analyze-data)\n* [Visualize and present data](https:\/\/mode.com\/help\/visualize-and-present-data)\n* [Explore and share data](https:\/\/mode.com\/help\/explore-and-share-data)\n* [Mode Help](https:\/\/mode.com\/help\/)\n* [Contact Mode](https:\/\/mode.com\/help\/articles\/contact-us\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/mode.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use version-controlled source code in a Databricks job\n\nYou can run jobs using notebooks or Python code located in a remote Git repository or a Databricks Git folder. 
This feature simplifies the creation and management of production jobs and automates continuous deployment: \n* You don\u2019t need to create a separate production repo in Databricks, manage its permissions, and keep it updated.\n* You can prevent unintentional changes to a production job, such as local edits in the production repo or changes from switching a branch.\n* The job definition process has a single source of truth in the remote repository, and each job run is linked to a commit hash. \nTo use source code in a remote Git repository, you must [Set up Databricks Git folders (Repos)](https:\/\/docs.databricks.com\/repos\/repos-setup.html). \nImportant \nNotebooks created by Databricks jobs run from remote Git repositories are ephemeral. You can\u2019t rely on them to track MLflow runs, experiments, and models. In this case, use standalone MLflow experiments instead. \nNote \nIf your job runs using a service principal as the identity, you can configure the service principal on the Git folder containing the job\u2019s source code. See [Use a service principal with Databricks Git folders](https:\/\/docs.databricks.com\/repos\/ci-cd-techniques-with-repos.html#use-sp-repos).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-repos.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use version-controlled source code in a Databricks job\n###### Use a notebook from a remote Git repository\n\nTo create a task with a notebook located in a remote Git repository: \n1. Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar and click ![Create Job Button](https:\/\/docs.databricks.com\/_images\/create-job.png) or go to an existing job and add a new task.\n2. If this is a new job, replace **Add a name for your job\u2026** with your job name.\n3. 
Enter a name for the task in the **Task name** field.\n4. In the **Type** drop-down menu, select **Notebook**.\n5. In the **Source** drop-down menu, select **Git provider** and click **Edit** or **Add a git reference**. The **Git information** dialog appears.\n6. In the **Git Information** dialog, enter details for the repository, including the repository URL, the Git Provider, and the Git reference. This Git reference can be a branch, a tag, or a commit. \nFor **Path**, enter a relative path to the notebook location, such as `etl\/notebooks\/`. \nWhen you enter the relative path, don\u2019t begin it with `\/` or `.\/`, and don\u2019t include the notebook file extension, such as `.py`. For example, if the absolute path for the notebook you want to access is `\/notebooks\/covid_eda_raw.py`, enter `notebooks\/covid_eda_raw` in the Path field.\n7. Click **Create**. \nImportant \nIf you work with a Python notebook directly from a source Git repository, the first line of the notebook source file must be `# Databricks notebook source`. For a Scala notebook, the first line of the source file must be `\/\/ Databricks notebook source`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-repos.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use version-controlled source code in a Databricks job\n###### Use Python code from a remote Git repository\n\nTo create a task with Python code located in a remote Git repository: \n1. Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar and click ![Create Job Button](https:\/\/docs.databricks.com\/_images\/create-job.png) or go to an existing job and add a new task.\n2. If this is a new job, replace **Add a name for your job\u2026** with your job name.\n3. Enter a name for the task in the **Task name** field.\n4. 
In the **Type** drop-down menu, select **Python script**.\n5. In the **Source** drop-down menu, select **Git provider** and click **Edit** or **Add a git reference**. The **Git information** dialog appears.\n6. In the **Git Information** dialog, enter details for the repository, including the repository URL, the Git Provider, and the Git reference. This Git reference can be a branch, a tag, or a commit. \nFor **Path**, enter a relative path to the source location, such as `etl\/python\/python_etl.py`. \nWhen you enter the relative path, don\u2019t begin it with `\/` or `.\/`. For example, if the absolute path for the Python code you want to access is `\/python\/covid_eda_raw.py`, enter `python\/covid_eda_raw.py` in the Path field.\n7. Click **Create**. \nWhen you view the [run history](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#task-history) of a task that runs Python code stored in a remote Git repository, the **Task run details** panel includes Git details, including the commit SHA associated with the run.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-repos.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use version-controlled source code in a Databricks job\n###### Use SQL queries from a remote Git repository\n\nNote \nOnly one SQL statement is supported in a file. Multiple SQL statements separated by semicolons (;) are not permitted. \nTo run queries stored in `.sql` files located in a remote Git repository: \n1. Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar and click ![Create Job Button](https:\/\/docs.databricks.com\/_images\/create-job.png) or go to an existing job and add a new task.\n2. If this is a new job, replace **Add a name for your job\u2026** with your job name.\n3. 
Enter a name for the task in the **Task name** field.\n4. In the **Type** drop-down menu, select **SQL**.\n5. In the **SQL task** drop-down menu, select **File**.\n6. In the **Source** drop-down menu, select **Git provider** and click **Edit** or **Add a git reference**. The **Git information** dialog appears.\n7. In the **Git Information** dialog, enter details for the repository, including the repository URL, the Git Provider, and the Git reference. This Git reference can be a branch, a tag, or a commit. \nFor **Path**, enter a relative path to the source location, such as `queries\/sql\/myquery.sql`. \nWhen you enter the relative path, don\u2019t begin it with `\/` or `.\/`. For example, if the absolute path for the SQL query you want to access is `\/sql\/myquery.sql`, enter `sql\/myquery.sql` in the Path field.\n8. Select a SQL warehouse. You must select a serverless SQL warehouse or a pro SQL warehouse.\n9. Click **Create**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-repos.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use version-controlled source code in a Databricks job\n###### Adding additional tasks from a remote Git repository\n\nAdditional tasks in a multitask job can reference the same commit in the remote repository in one of the following ways: \n* `sha` of `$branch\/head` when `git_branch` is set\n* `sha` of `$tag` when `git_tag` is set\n* the value of `git_commit` \nYou can mix notebook and Python tasks in a Databricks job, but they must use the same Git reference.\n\n##### Use version-controlled source code in a Databricks job\n###### Use a Databricks Git folder\n\nIf you prefer to use the Databricks UI to version control your source code, [clone your repository](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html) into a Databricks Git folder. 
For more information, see [Option 2: Set up a production Git folder and Git automation](https:\/\/docs.databricks.com\/repos\/ci-cd-techniques-with-repos.html#automate-production). \nTo add a notebook or Python code from a Git folder in a job task, in the **Source** drop-down menu, select **Workspace** and enter the path to the notebook or Python code in **Path**.\n\n##### Use version-controlled source code in a Databricks job\n###### Access notebooks from an IDE\n\nIf you need to access notebooks from an integrated development environment, make sure you have the comment `# Databricks notebook source` at the top of the notebook source code file. To distinguish between a regular Python file and a Databricks Python-language notebook exported in source-code format, Databricks adds the line `# Databricks notebook source` at the top of the notebook source code file. When you import the notebook, Databricks recognizes it and imports it as a notebook, not as a Python module.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-repos.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use version-controlled source code in a Databricks job\n###### Troubleshooting\n\nNote \nGit-based jobs do not support write access to workspace files. To write data to a temporary storage location, use driver storage. To write persistent data from a Git job, use a UC volume or DBFS. 
\n**Error message**: \n```\nRun result unavailable: job failed with error message Notebook not found: path-to-your-notebook\n\n``` \n**Possible causes**: \nYour notebook is missing the comment `# Databricks notebook source` at the top of the notebook source code file, or in the comment, `notebook` is capitalized when it must start with lowercase `n`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-repos.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query snippets\n\nIt\u2019s often easier to duplicate prior work and modify than to write something from scratch. This is particularly true for common `JOIN` statements or complex `CASE` expressions. As your list of queries grows, it can be difficult to remember which queries contain the statement you need. You can also create a query snippet that contains an insertion point with placeholder text that a user can replace at runtime. \nQuery snippets are segments of queries that you can share and trigger using auto complete. Use query snippets for: \n* Frequent `JOIN` statements\n* Complicated clauses like `WITH` or `CASE`.\n* Conditional formatting \nHere are examples of snippets: \n```\n--Simple snippet\nWHERE fare_amount > 100\n\n--Snippet with an insertion point for a value to be provided at runtime\nWHERE fare_amount > ${1:value}\n\n--Snippet with an insertion point for a value to be provided at runtime and containing a default value\nWHERE fare_amount > ${1:100}\n\n--Snippet with multiple insertion points\nWHERE fare_amount > ${2:min_value} AND fare_amount < ${1:max_value} AND trip_distance < ${0:max_distance}\n\n```\n\n#### Query snippets\n##### Create query snippets\n\nUse the following steps to create snippets using these snippet examples: \n1. Click your username in the top bar of the workspace and select **Settings** from the drop down.\n2. Click the **Developer** tab.\n3. Next to **SQL query snippets** click **Manage**.\n4. 
Click **Create query snippet**.\n5. In the **Replace** field, enter the snippet name. You will use this name when writing a query that uses the snippet.\n6. Optionally enter a description.\n7. In the **Snippet** field, enter the snippet.\n8. Click **Create**. \n![Query snippet](https:\/\/docs.databricks.com\/_images\/snippet-simple.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-snippets.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query snippets\n##### Use a query snippet in a query\n\nHere\u2019s an example of a simple query with which you can use these query snippets: \n```\n--Simple query\nSELECT * FROM samples.nyctaxi.trips\n\n``` \nUse the following steps to use a query snippet with this query: \n1. Open **SQL Editor**.\n2. Type your query in the SQL editor query pane.\n3. Type the first 3 letters of the snippet name and then select a snippet from the autocomplete window. You can also manually open the window by pressing `Option` + `Space` and select a snippet. \n![Query selecting a snippet](https:\/\/docs.databricks.com\/_images\/query-with-simple-snippet.png)\n4. Execute the query with the `WHERE` clause from the query snippet. \n![query showing a snippet used in a query](https:\/\/docs.databricks.com\/_images\/query-with-simple-snippet-2.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-snippets.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query snippets\n##### Working with insertion points in query snippets\n\nYou designate insertion points by wrapping an integer tab order with a single dollar sign and curly braces `${}`. A text placeholder preceded by a colon `:` is optional but useful for users unfamiliar with your snippet. 
In the query snippets with insertion points that you created previously, `${1:value}` is an insertion point with a placeholder and `${1:100}` is an insertion point with a default value for the placeholder that you can override at runtime. When Databricks SQL renders the snippet, the dollar sign `$` and curly braces `{}` are stripped away and the word `value` or the default of `100` is highlighted for replacement. \nWhen there are multiple insertion points, the text insertion caret jumps to the first insertion point to prompt for the desired value. When you press `Tab`, the caret jumps to the next insertion point for the next value. Each subsequent press of `Tab` moves the caret to the next insertion point until it reaches the final insertion point. \nNote \nAn insertion point of zero `${0}` is always the last point in the tab order. \nUse the following steps to use the insertion point query snippets with the query: \n1. Open **SQL Editor**.\n2. Type your query in the SQL editor query pane.\n3. Type the first 3 letters of the name of your query snippet and then select a query snippet with the insertion point without a default value. \nThe query snippet is added to the query and the text insertion caret jumps to the insertion point. \n![Query using insertion point query snippet with no default value](https:\/\/docs.databricks.com\/_images\/query-with-insertion-point.png)\n4. Enter a value for the `WHERE` clause, such as `200`.\n5. Optionally, execute the query with the `WHERE` clause from the query snippet.\n6. Repeat the previous steps but select the query snippet with the insertion point using a default value. \n![Query using insertion point query snippet containing a default value](https:\/\/docs.databricks.com\/_images\/query-with-insertion-point-with-default-value.png)\n7. Repeat the previous steps but select the query snippet with multiple insertion points. 
\n![Query using insertion point query snippet containing multiple insertion points](https:\/\/docs.databricks.com\/_images\/query-with-multiple-insertion-points.png)\n8. Enter a value for the first insertion point, tab to the next insertion point and enter a value, and then tab to the final insertion point and enter a value.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-snippets.html"} +{"content":"# Discover data\n### Exploratory data analysis on Databricks: Tools and techniques\n\nThis article describes tools and techniques for exploratory data analysis (EDA) on Databricks.\n\n### Exploratory data analysis on Databricks: Tools and techniques\n#### What is EDA and why is it useful?\n\nExploratory data analysis (EDA) includes methods for exploring data sets to summarize their main characteristics and identify any problems with the data. Using statistical methods and visualizations, you can learn about a data set to determine its readiness for analysis and inform what techniques to apply for data preparation. EDA can also influence which algorithms you choose to apply for training ML models.\n\n","doc_uri":"https:\/\/docs.databricks.com\/exploratory-data-analysis\/index.html"} +{"content":"# Discover data\n### Exploratory data analysis on Databricks: Tools and techniques\n#### What are the EDA tools in Databricks?\n\nDatabricks has built-in analysis and visualization tools in both Databricks SQL and in Databricks Runtime. For an illustrated list of the types of visualizations available in Databricks, see [Visualization types](https:\/\/docs.databricks.com\/visualizations\/visualization-types.html). 
\n### EDA in Databricks SQL \nHere are some helpful articles about data visualization and exploration tools in Databricks SQL: \n* [Visualize queries and create a dashboard](https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-dashboard.html)\n* [Create data visualizations in Databricks SQL](https:\/\/docs.databricks.com\/sql\/user\/visualizations\/index.html) \n### EDA in Databricks Runtime \nDatabricks Runtime provides a pre-built environment that has popular data exploration libraries already installed. You can see the list of the built-in libraries in the [release notes](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). \nIn addition, the following articles show examples of visualization tools in Databricks Runtime: \n* [Create data visualizations in Databricks notebooks](https:\/\/docs.databricks.com\/visualizations\/index.html)\n* [Do no-code EDA with bamboolib](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html) \nIn a Databricks Python notebook, you can combine SQL and Python to explore data. When you run code in a SQL language cell in a Python notebook, the table results are automatically made available as a Python DataFrame. For details, see [Explore SQL cell results in Python notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#implicit-sql-df).\n\n","doc_uri":"https:\/\/docs.databricks.com\/exploratory-data-analysis\/index.html"} +{"content":"# Get started: Account and workspace setup\n### Build an end-to-end data pipeline in Databricks\n\nThis article shows you how to create and deploy an end-to-end data processing pipeline, including how to ingest raw data, transform the data, and run analyses on the processed data. 
\nNote \nAlthough this article demonstrates how to create a complete data pipeline using Databricks [notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html) and a Databricks [job](https:\/\/docs.databricks.com\/workflows\/index.html#what-is-jobs) to orchestrate a workflow, Databricks recommends using [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html), a declarative interface for building reliable, maintainable, and testable data processing pipelines.\n\n### Build an end-to-end data pipeline in Databricks\n#### What is a data pipeline?\n\nA data pipeline implements the steps required to move data from source systems, transform that data based on requirements, and store the data in a target system. A data pipeline includes all the processes necessary to turn raw data into prepared data that users can consume. For example, a data pipeline might prepare data so data analysts and data scientists can extract value from the data through analysis and reporting. \nAn extract, transform, and load (ETL) workflow is a common example of a data pipeline. 
In ETL processing, data is ingested from source systems and written to a staging area, transformed based on requirements (ensuring data quality, deduplicating records, and so forth), and then written to a target system such as a data warehouse or data lake.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/data-pipeline-get-started.html"} +{"content":"# Get started: Account and workspace setup\n### Build an end-to-end data pipeline in Databricks\n#### Data pipeline steps\n\nTo help you get started building data pipelines on Databricks, the example included in this article walks through creating a data processing workflow: \n* Use Databricks features to explore a raw dataset.\n* Create a Databricks notebook to ingest raw source data and write the raw data to a target table.\n* Create a Databricks notebook to transform the raw source data and write the transformed data to a target table.\n* Create a Databricks notebook to query the transformed data.\n* Automate the data pipeline with a Databricks job.\n\n### Build an end-to-end data pipeline in Databricks\n#### Requirements\n\n* You\u2019re logged into Databricks and in the Data Science & Engineering workspace.\n* You have [permission to create a cluster](https:\/\/docs.databricks.com\/compute\/use-compute.html) or [access to a cluster](https:\/\/docs.databricks.com\/compute\/use-compute.html#permissions).\n* (Optional) To publish tables to Unity Catalog, you must create a [catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html) and [schema](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-schemas.html) in Unity Catalog.\n\n### Build an end-to-end data pipeline in Databricks\n#### Example: Million Song dataset\n\nThe dataset used in this example is a subset of the [Million Song Dataset](http:\/\/labrosa.ee.columbia.edu\/millionsong\/), a collection of features and metadata for contemporary music tracks. 
This dataset is available in the [sample datasets](https:\/\/docs.databricks.com\/discover\/databricks-datasets.html#databricks-datasets-databricks-datasets) included in your Databricks workspace.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/data-pipeline-get-started.html"} +{"content":"# Get started: Account and workspace setup\n### Build an end-to-end data pipeline in Databricks\n#### Step 1: Create a cluster\n\nTo perform the data processing and analysis in this example, create a cluster to provide the compute resources needed to run commands. \nNote \nBecause this example uses a sample dataset stored in DBFS and recommends persisting tables to [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html), you create a cluster configured with *single user access* mode. Single user access mode provides full access to DBFS while also enabling access to Unity Catalog. See [Best practices for DBFS and Unity Catalog](https:\/\/docs.databricks.com\/dbfs\/unity-catalog.html). \n1. Click **Compute** in the sidebar.\n2. On the Compute page, click **Create Cluster**.\n3. On the New Cluster page, enter a unique name for the cluster.\n4. In **Access mode**, select **Single User**.\n5. In **Single user or service principal access**, select your user name.\n6. Leave the remaining values in their default state, and click **Create Cluster**. \nTo learn more about Databricks clusters, see [Compute](https:\/\/docs.databricks.com\/compute\/index.html).\n\n### Build an end-to-end data pipeline in Databricks\n#### Step 2: Explore the source data\n\nTo learn how to use the Databricks interface to explore the raw source data, see [Explore the source data for a data pipeline](https:\/\/docs.databricks.com\/getting-started\/data-pipeline-explore-data.html). 
If you want to go directly to ingesting and preparing the data, continue to [Step 3: Ingest the raw data](https:\/\/docs.databricks.com\/getting-started\/data-pipeline-get-started.html#ingest-prepare-data).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/data-pipeline-get-started.html"} +{"content":"# Get started: Account and workspace setup\n### Build an end-to-end data pipeline in Databricks\n#### Step 3: Ingest the raw data\n\nIn this step, you load the raw data into a table to make it available for further processing. To manage data assets on the Databricks platform such as tables, Databricks recommends [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). However, if you don\u2019t have permissions to create the required catalog and schema to publish tables to Unity Catalog, you can still complete the following steps by publishing tables to the Hive metastore. \nTo ingest data, Databricks recommends using [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html). Auto Loader automatically detects and processes new files as they arrive in cloud object storage. \nYou can configure Auto Loader to automatically detect the schema of loaded data, allowing you to initialize tables without explicitly declaring the data schema and evolve the table schema as new columns are introduced. This eliminates the need to manually track and apply schema changes over time. Databricks recommends schema inference when using Auto Loader. However, as seen in the data exploration step, the songs data does not contain header information. Because the header is not stored with the data, you\u2019ll need to explicitly define the schema, as shown in the next example. \n1. In the sidebar, click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** and select **Notebook** from the menu. The **Create Notebook** dialog appears.\n2. Enter a name for the notebook, for example, `Ingest songs data`. 
By default: \n* **Python** is the selected language.\n* The notebook is attached to the last cluster you used. In this case, the cluster you created in [Step 1: Create a cluster](https:\/\/docs.databricks.com\/getting-started\/data-pipeline-get-started.html#create-a-cluster).\n3. Enter the following into the first cell of the notebook: \n```\nfrom pyspark.sql.types import DoubleType, IntegerType, StringType, StructType, StructField\n\n# Define variables used in the code below\nfile_path = \"\/databricks-datasets\/songs\/data-001\/\"\ntable_name = \"<table-name>\"\ncheckpoint_path = \"\/tmp\/pipeline_get_started\/_checkpoint\/song_data\"\n\nschema = StructType(\n[\nStructField(\"artist_id\", StringType(), True),\nStructField(\"artist_lat\", DoubleType(), True),\nStructField(\"artist_long\", DoubleType(), True),\nStructField(\"artist_location\", StringType(), True),\nStructField(\"artist_name\", StringType(), True),\nStructField(\"duration\", DoubleType(), True),\nStructField(\"end_of_fade_in\", DoubleType(), True),\nStructField(\"key\", IntegerType(), True),\nStructField(\"key_confidence\", DoubleType(), True),\nStructField(\"loudness\", DoubleType(), True),\nStructField(\"release\", StringType(), True),\nStructField(\"song_hotnes\", DoubleType(), True),\nStructField(\"song_id\", StringType(), True),\nStructField(\"start_of_fade_out\", DoubleType(), True),\nStructField(\"tempo\", DoubleType(), True),\nStructField(\"time_signature\", DoubleType(), True),\nStructField(\"time_signature_confidence\", DoubleType(), True),\nStructField(\"title\", StringType(), True),\nStructField(\"year\", IntegerType(), True),\nStructField(\"partial_sequence\", IntegerType(), True)\n]\n)\n\n(spark.readStream\n.format(\"cloudFiles\")\n.schema(schema)\n.option(\"cloudFiles.format\", \"csv\")\n.option(\"sep\",\"\\t\")\n.load(file_path)\n.writeStream\n.option(\"checkpointLocation\", checkpoint_path)\n.trigger(availableNow=True)\n.toTable(table_name)\n)\n\n``` \nIf you are using Unity 
Catalog, replace `<table-name>` with a catalog, schema, and table name to contain the ingested records (for example, `data_pipelines.songs_data.raw_song_data`). Otherwise, replace `<table-name>` with the name of a table to contain the ingested records, for example, `raw_song_data`. \nThe `checkpoint_path` variable points to a directory in DBFS that maintains checkpoint files. To use a different directory, change its value from the default, `\/tmp\/pipeline_get_started\/_checkpoint\/song_data`.\n4. Click ![Run Menu](https:\/\/docs.databricks.com\/_images\/run-menu.png), and select **Run Cell**. This example defines the data schema using the information from the `README`, ingests the songs data from all of the files contained in `file_path`, and writes the data to the table specified by `table_name`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/data-pipeline-get-started.html"} +{"content":"# Get started: Account and workspace setup\n### Build an end-to-end data pipeline in Databricks\n#### Step 4: Prepare the raw data\n\nTo prepare the raw data for analysis, the following steps transform the raw songs data by filtering out unneeded columns and adding a new field containing a timestamp for the creation of the new record. \n1. In the sidebar, click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** and select **Notebook** from the menu. The **Create Notebook** dialog appears.\n2. Enter a name for the notebook. For example, `Prepare songs data`. Change the default language to **SQL**.\n3. 
Enter the following in the first cell of the notebook: \n```\nCREATE OR REPLACE TABLE\n<table-name> (\nartist_id STRING,\nartist_name STRING,\nduration DOUBLE,\nrelease STRING,\ntempo DOUBLE,\ntime_signature DOUBLE,\ntitle STRING,\nyear DOUBLE,\nprocessed_time TIMESTAMP\n);\n\nINSERT INTO\n<table-name>\nSELECT\nartist_id,\nartist_name,\nduration,\nrelease,\ntempo,\ntime_signature,\ntitle,\nyear,\ncurrent_timestamp()\nFROM\n<raw-songs-table-name>\n\n``` \nIf you are using Unity Catalog, replace `<table-name>` with a catalog, schema, and table name to contain the filtered and transformed records (for example, `data_pipelines.songs_data.prepared_song_data`). Otherwise, replace `<table-name>` with the name of a table to contain the filtered and transformed records (for example, `prepared_song_data`). \nReplace `<raw-songs-table-name>` with the name of the table containing the raw songs records ingested in the previous step.\n4. Click ![Run Menu](https:\/\/docs.databricks.com\/_images\/run-menu.png), and select **Run Cell**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/data-pipeline-get-started.html"} +{"content":"# Get started: Account and workspace setup\n### Build an end-to-end data pipeline in Databricks\n#### Step 5: Query the transformed data\n\nIn this step, you extend the processing pipeline by adding queries to analyze the songs data. These queries use the prepared records created in the previous step. \n1. In the sidebar, click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** and select **Notebook** from the menu. The **Create Notebook** dialog appears.\n2. Enter a name for the notebook. For example, `Analyze songs data`. Change the default language to **SQL**.\n3. 
Enter the following in the first cell of the notebook: \n```\n-- Which artists released the most songs each year?\nSELECT\nartist_name,\ncount(artist_name)\nAS\nnum_songs,\nyear\nFROM\n<prepared-songs-table-name>\nWHERE\nyear > 0\nGROUP BY\nartist_name,\nyear\nORDER BY\nnum_songs DESC,\nyear DESC\n\n``` \nReplace `<prepared-songs-table-name>` with the name of the table containing prepared data. For example, `data_pipelines.songs_data.prepared_song_data`.\n4. Click ![Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret.png) in the cell actions menu, select **Add Cell Below** and enter the following in the new cell: \n```\n-- Find songs for your DJ list\nSELECT\nartist_name,\ntitle,\ntempo\nFROM\n<prepared-songs-table-name>\nWHERE\ntime_signature = 4\nAND\ntempo between 100 and 140;\n\n``` \nReplace `<prepared-songs-table-name>` with the name of the prepared table created in the previous step. For example, `data_pipelines.songs_data.prepared_song_data`.\n5. To run the queries and view the output, click **Run all**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/data-pipeline-get-started.html"} +{"content":"# Get started: Account and workspace setup\n### Build an end-to-end data pipeline in Databricks\n#### Step 6: Create a Databricks job to run the pipeline\n\nYou can create a workflow to automate running the data ingestion, processing, and analysis steps using a Databricks job. \n1. In your Data Science & Engineering workspace, do one of the following: \n* Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar and click ![Create Job Button](https:\/\/docs.databricks.com\/_images\/create-job.png).\n* In the sidebar, click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** and select **Job**.\n2. In the task dialog box on the **Tasks** tab, replace **Add a name for your job\u2026** with your job name. For example, \u201cSongs workflow\u201d.\n3. 
In **Task name**, enter a name for the first task, for example, `Ingest_songs_data`.\n4. In **Type**, select the **Notebook** task type.\n5. In **Source**, select **Workspace**.\n6. Use the file browser to find the data ingestion notebook, click the notebook name, and click **Confirm**.\n7. In **Cluster**, select **Shared\\_job\\_cluster** or the cluster you created in the `Create a cluster` step.\n8. Click **Create**.\n9. Click ![Add Task Button](https:\/\/docs.databricks.com\/_images\/add-task.png) below the task you just created and select **Notebook**.\n10. In **Task name**, enter a name for the task, for example, `Prepare_songs_data`.\n11. In **Type**, select the **Notebook** task type.\n12. In **Source**, select **Workspace**.\n13. Use the file browser to find the data preparation notebook, click the notebook name, and click **Confirm**.\n14. In **Cluster**, select **Shared\\_job\\_cluster** or the cluster you created in the `Create a cluster` step.\n15. Click **Create**.\n16. Click ![Add Task Button](https:\/\/docs.databricks.com\/_images\/add-task.png) below the task you just created and select **Notebook**.\n17. In **Task name**, enter a name for the task, for example, `Analyze_songs_data`.\n18. In **Type**, select the **Notebook** task type.\n19. In **Source**, select **Workspace**.\n20. Use the file browser to find the data analysis notebook, click the notebook name, and click **Confirm**.\n21. In **Cluster**, select **Shared\\_job\\_cluster** or the cluster you created in the `Create a cluster` step.\n22. Click **Create**.\n23. To run the workflow, click ![Run Now Button](https:\/\/docs.databricks.com\/_images\/run-now-button.png). To view [details for the run](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#job-run-details), click the link in the **Start time** column for the run in the [job runs](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#view-job-run-list) view. 
Click each task to view details for the task run.\n24. To view the results when the workflow completes, click the final data analysis task. The **Output** page appears and displays the query results.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/data-pipeline-get-started.html"} +{"content":"# Get started: Account and workspace setup\n### Build an end-to-end data pipeline in Databricks\n#### Step 7: Schedule the data pipeline job\n\nNote \nTo demonstrate using a Databricks job to orchestrate a scheduled workflow, this getting started example separates the ingestion, preparation, and analysis steps into separate notebooks, and each notebook is then used to create a task in the job. If all of the processing is contained in a single notebook, you can easily schedule the notebook directly from the Databricks notebook UI. See [Create and manage scheduled notebook jobs](https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html). \nA common requirement is to run a data pipeline on a scheduled basis. To define a schedule for the job that runs the pipeline: \n1. Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar.\n2. In the **Name** column, click the job name. The side panel displays the **Job details**.\n3. Click **Add trigger** in the **Job details** panel and select **Scheduled** in **Trigger type**.\n4. Specify the period, starting time, and time zone. Optionally select the **Show Cron Syntax** checkbox to display and edit the schedule in [Quartz Cron Syntax](http:\/\/www.quartz-scheduler.org\/documentation\/quartz-2.3.0\/tutorials\/crontrigger.html).\n5. 
Click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/data-pipeline-get-started.html"} +{"content":"# Get started: Account and workspace setup\n### Build an end-to-end data pipeline in Databricks\n#### Learn more\n\n* To learn more about Databricks notebooks, see [Introduction to Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html).\n* To learn more about Databricks Jobs, see [What is Databricks Jobs?](https:\/\/docs.databricks.com\/workflows\/index.html#what-is-jobs).\n* To learn more about Delta Lake, see [What is Delta Lake?](https:\/\/docs.databricks.com\/delta\/index.html).\n* To learn more about data processing pipelines with Delta Live Tables, see [What is Delta Live Tables?](https:\/\/docs.databricks.com\/delta-live-tables\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/data-pipeline-get-started.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Retrieval Augmented Generation (RAG) on Databricks\n\nThis article provides an overview of retrieval augmented generation (RAG) and describes RAG application support in Databricks.\n\n### Retrieval Augmented Generation (RAG) on Databricks\n#### What is Retrieval Augmented Generation?\n\nRAG is a [generative AI design pattern](https:\/\/docs.databricks.com\/generative-ai\/generative-ai.html) that involves combining a large language model (LLM) with external knowledge retrieval. RAG is required to connect real-time data to your generative AI applications. Doing so improves the accuracy and quality of the application, by providing your data as context to the LLM at inference time. \nThe Databricks platform provides an integrated set of tools that supports the following RAG scenarios. \n| Type of RAG | Description | Example use case |\n| --- | --- | --- |\n| Unstructured data | Use of documents - PDFs, wikis, website contents, Google or Microsoft Office documents, and so on. 
| Chatbot over product documentation |\n| Structured data | Use of tabular data - Delta Tables, data from existing application APIs. | Chatbot to check order status |\n| Tools & function calling | Call third-party or internal APIs to perform specific tasks or update statuses. For example, performing calculations or triggering a business workflow. | Chatbot to place an order |\n| Agents | Dynamically decide how to respond to a user\u2019s query by using an LLM to choose a sequence of actions. | Chatbot that replaces a customer service agent |\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/retrieval-augmented-generation.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Retrieval Augmented Generation (RAG) on Databricks\n#### RAG application architecture\n\nThe following diagram illustrates the components that make up a RAG application. \n![RAG application architecture all up](https:\/\/docs.databricks.com\/_images\/rag-app-architecture-all-up.png) \nRAG applications require a pipeline and a chain component to perform the following: \n* **Indexing** A pipeline that ingests data from a source and indexes it. This data can be structured or unstructured.\n* **Retrieval and generation** This is the actual RAG chain. It takes the user query and retrieves similar data from the index, then passes the data, along with the query, to the LLM. \nThe following diagram shows these core components: \n![RAG application architecture for just the indexing pipeline and retrieval and generation, the RAG chain, pieces of RAG. The top section shows the RAG chain consuming the query and the subsequent steps of query processing, query expansion, retrieval and re-ranking, prompt engineering, initial response generation and post-processing, all before generating a response to the query. The bottom portion shows the RAG chain connected to separate data pipelines for 1. 
unstructured data, which includes data parsing, chunking and embedding and storing that data in a vector search database or index. Unstructured data pipelines require interaction with embedding and foundational models to feed into the RAG chain and 2. structured data pipelines, which includes consuming already embedded data chunks and performing ETL tasks and feature engineering before serving this data to the RAG chain](https:\/\/docs.databricks.com\/_images\/rag-app-architecture.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/retrieval-augmented-generation.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Retrieval Augmented Generation (RAG) on Databricks\n#### Unstructured data RAG example\n\nThe following sections describe the details of the indexing pipeline and RAG chain in the context of an unstructured data RAG example. \n### Indexing pipeline in a RAG app \nThe following steps describe the indexing pipeline: \n1. Ingest data from your proprietary data source.\n2. Split the data into chunks that can fit into the context window of the foundational LLM. This step also includes parsing the data and extracting metadata. This data is commonly referred to as a knowledge base, from which the RAG chain retrieves context for the foundational LLM.\n3. Use an embedding model to create vector embeddings for the data chunks.\n4. Store the embeddings and metadata in a vector database to make them accessible for querying by the RAG chain. \n### Retrieval using the RAG chain \nAfter the index is prepared, the RAG chain of the application can be served to respond to questions. The following steps and diagram describe how the RAG application responds to an incoming request. \n1. Embed the request using the same embedding model that was used to embed the data in the knowledge base.\n2. Query the vector database to do a similarity search between the embedded request and the embedded data chunks in the vector database.\n3. 
Retrieve the data chunks that are most relevant to the request.\n4. Feed the relevant data chunks and the request to a customized LLM. The data chunks provide context that helps the LLM generate an appropriate response. Often, the LLM has a template for how to format the response.\n5. Generate a response. \nThe following diagram illustrates this process: \n![RAG workflow after a request](https:\/\/docs.databricks.com\/_images\/rag-workflow.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/retrieval-augmented-generation.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Retrieval Augmented Generation (RAG) on Databricks\n#### Develop RAG applications with Databricks\n\nDatabricks provides the following capabilities to help you develop RAG applications. \n* [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) for governance, discovery, versioning, and access control for data, features, models, and functions.\n* [Notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html) and [workflows](https:\/\/docs.databricks.com\/workflows\/index.html) for data pipeline creation and orchestration.\n* [Delta tables](https:\/\/docs.databricks.com\/introduction\/delta-comparison.html) for storing structured data and unstructured data chunks and embeddings.\n* [Vector search](https:\/\/docs.databricks.com\/generative-ai\/vector-search.html) provides a queryable vector database that stores embedding vectors and can be configured to automatically sync to your knowledge base.\n* [Databricks model serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) for deploying LLMs and hosting your RAG chain. 
You can configure a dedicated model serving endpoint specifically for accessing state-of-the-art open LLMs with [Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html) or third-party models with [External models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html).\n* [MLflow](https:\/\/docs.databricks.com\/mlflow\/tracking.html) for RAG chain development tracking and [LLM evaluation](https:\/\/docs.databricks.com\/mlflow\/llm-evaluate.html).\n* [Feature engineering and serving](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html). This typically applies for structured data RAG scenarios.\n* [Online Tables](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html). You can serve online tables as a low-latency API to include the data in RAG applications.\n* [Lakehouse Monitoring](https:\/\/docs.databricks.com\/lakehouse-monitoring\/index.html) for data monitoring and tracking model prediction quality and drift using [automatic payload logging with inference tables](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html).\n* [AI Playground](https:\/\/docs.databricks.com\/large-language-models\/ai-playground.html). A chat-based UI to test and compare LLMs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/retrieval-augmented-generation.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Retrieval Augmented Generation (RAG) on Databricks\n#### RAG architecture with Databricks\n\nThe following architecture diagrams demonstrate where each Databricks feature fits in the RAG workflow. For an example, see the [Deploy Your LLM Chatbot With Retrieval Augmented Generation demo](https:\/\/www.databricks.com\/resources\/demos\/tutorials\/data-science-and-ai\/lakehouse-ai-deploy-your-llm-chatbot). 
\n### Process unstructured data and Databricks-managed embeddings \nFor processing unstructured data and Databricks-managed embeddings, the following steps and diagram show: \n1. Data ingestion from your proprietary data source. You can store this data in a Delta table or Unity Catalog Volume.\n2. The data is then split into chunks that can fit into the context window of the foundational LLM. This step also includes parsing the data and extracting metadata. You can use Databricks Workflows, Databricks notebooks and Delta Live Tables to perform these tasks. This data is commonly referred to as a knowledge base, from which the RAG chain retrieves context for the foundational LLM.\n3. The parsed and chunked data is then consumed by an embedding model to create vector embeddings. In this scenario, Databricks computes the embeddings for you as part of the Vector Search functionality, which uses Model Serving to provide an embedding model.\n4. After Vector Search computes embeddings, Databricks stores them in a Delta table.\n5. Also as part of Vector Search, the embeddings and metadata are indexed and stored in a vector database to make them accessible for querying by the RAG chain. Vector Search automatically computes embeddings for new data that is added to the source data table and updates the vector search index. \n![RAG indexing pipeline processing unstructured data and Databricks managed embeddings. This diagram shows the RAG application architecture for just the indexing pipeline. ](https:\/\/docs.databricks.com\/_images\/rag-unstructured-data-managed.png) \n### Process unstructured data and customer-managed embeddings \nFor processing unstructured data and customer-managed embeddings, the following steps and diagram show: \n1. Data ingestion from your proprietary data source. You can store this data in a Delta table or Unity Catalog Volume.\n2. You can then split the data into chunks that can fit into the context window of the foundational LLM. 
This step also includes parsing the data and extracting metadata. You can use Databricks Workflows, Databricks notebooks and Delta Live Tables to perform these tasks. This data is commonly referred to as a knowledge base, from which the RAG chain retrieves context for the foundational LLM.\n3. Next, the parsed and chunked data can be consumed by an embedding model to create vector embeddings. In this scenario, you compute the embeddings yourself and can use Model Serving to serve an embedding model.\n4. After you compute embeddings, you can store them in a Delta table that can be synced with Vector Search.\n5. As part of Vector Search, the embeddings and metadata are indexed and stored in a vector database to make them accessible for querying by the RAG chain. Vector Search automatically syncs new embeddings that are added to your Delta table and updates the vector search index. \n![RAG with Databricks unstructured data and self managed embeddings](https:\/\/docs.databricks.com\/_images\/rag-unstructured-data-self-managed.png) \n### Process structured data \nFor processing structured data, the following steps and diagram show: \n1. Data ingestion from your proprietary data source. You can store this data in a Delta table or Unity Catalog Volume.\n2. For feature engineering you can use Databricks notebooks, Databricks workflows, and Delta Live Tables.\n3. Create a feature table. A feature table is a Delta table in Unity Catalog that has a primary key.\n4. Create an online table and host it on a feature serving endpoint. The endpoint automatically stays synced with the feature table. \nFor an example notebook illustrating the use of online tables and feature serving for RAG applications, see the [Databricks online tables and feature serving endpoints for RAG example notebook](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html#notebook-examples). 
\n![RAG with Databricks structured data](https:\/\/docs.databricks.com\/_images\/rag-structured-data.png) \n### RAG chain \nAfter the index is prepared, the RAG chain of the application can be served to respond to questions. The following steps and diagram describe how the RAG chain operates in response to an incoming question. \n1. The incoming question gets embedded using the same embedding model that was used to embed the data in the knowledge base. Model Serving is used to serve the embedding model.\n2. After the question is embedded, you can use Vector Search to do a similarity search between the embedded question and the embedded data chunks in the vector database.\n3. After Vector Search retrieves the data chunks that are most relevant to the request, those data chunks, along with relevant features from Feature Serving and the embedded question, are consumed by a customized LLM for post-processing before a response is generated.\n4. The data chunks and features provide context that helps the LLM generate an appropriate response. Often, the LLM has a template for how to format the response. Once again, Model Serving is used to serve the LLM. You can also use Unity Catalog and Lakehouse Monitoring to store logs and monitor the chain workflow, respectively.\n5. Generate a response. \n![Running the chain](https:\/\/docs.databricks.com\/_images\/rag-chain.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/retrieval-augmented-generation.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Retrieval Augmented Generation (RAG) on Databricks\n#### Region availability\n\nThe features that support RAG application development on Databricks are available in the [same regions as model serving](https:\/\/docs.databricks.com\/resources\/supported-regions.html). 
\nIf you plan on using Foundation Model APIs as part of your RAG application development, you are limited to the [supported regions for Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#required).\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/retrieval-augmented-generation.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### CI\/CD techniques with Git and Databricks Git folders (Repos)\n\nLearn techniques for using Databricks Git folders in CI\/CD workflows. By configuring Databricks Git folders in the workspace, you can use source control for project files in Git repositories and you can integrate them into your data engineering pipelines. \nThe following figure shows an overview of the techniques and workflow. \n![Overview of CI\/CD techniques for Git folders.](https:\/\/docs.databricks.com\/_images\/repos-cicd-techniques.png) \nFor an overview of CI\/CD with Databricks, see [What is CI\/CD on Databricks?](https:\/\/docs.databricks.com\/dev-tools\/index-ci-cd.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/ci-cd-techniques-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### CI\/CD techniques with Git and Databricks Git folders (Repos)\n##### Development flow\n\nDatabricks Git folders have user-level folders. User-level folders are automatically created when users first clone a remote repository. You can think of Databricks Git folders in user folders as \u201clocal checkouts\u201d that are individual for each user and where users make changes to their code. \nIn your user folder in Databricks Git folders, clone your remote repository. 
A best practice is to [create a new feature branch](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html#create-a-new-branch) or select a previously created branch for your work, instead of directly committing and pushing changes to the main branch. You can make changes, commit, and push changes in that branch. When you are ready to merge your code, you can do so in the Git folders UI. \n### Requirements \nThis workflow requires that you have already [set up your Git integration](https:\/\/docs.databricks.com\/repos\/repos-setup.html). \nNote \nDatabricks recommends that each developer works on their own feature branch. For information about how to resolve merge conflicts, see [Resolve merge conflicts](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html#merge-conflicts). \n### Collaborate in Git folders \nThe following workflow uses a branch called `feature-b` that is based on the main branch. \n1. [Clone your existing Git repository to your Databricks workspace](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html#clone-repo).\n2. Use the Git folders UI to [create a feature branch](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html#create-a-new-branch) from the main branch. This example uses a single feature branch `feature-b` for simplicity. You can create and use multiple feature branches to do your work.\n3. Make your modifications to Databricks notebooks and other files in the repo.\n4. [Commit and push your changes to your Git provider](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html#commit-push).\n5. Contributors can now clone the Git repository into their own user folder. \n1. Working on a new branch, a coworker makes changes to the notebooks and other files in the Git folder.\n2. The contributor [commits and pushes their changes to the Git provider](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html#commit-push).\n6. 
To merge changes from other branches or rebase the `feature-b` branch in Databricks, in the Git folders UI use one of the following workflows: \n* [Merge branches](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html#merge-branches). If there\u2019s no conflict, the merge is pushed to the remote Git repository using `git push`.\n* [Rebase on another branch](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html#rebase).\n7. When you are ready to merge your work to the remote Git repository and `main` branch, use the Git folders UI to merge the changes from `feature-b`. If you prefer, you can instead merge changes directly to the Git repository backing your Git folder.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/ci-cd-techniques-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### CI\/CD techniques with Git and Databricks Git folders (Repos)\n##### Production job workflow\n\nDatabricks Git folders provide two options for running your production jobs: \n* **Option 1**: Provide a remote Git reference in the job definition. For example, run a specific notebook in the `main` branch of a Git repository.\n* **Option 2**: Set up a production Git repository and call [Repos APIs](https:\/\/docs.databricks.com\/api\/workspace\/repos) to update it programmatically. Run jobs against the Databricks Git folder that clones this remote repository. The Repos API call should be the first task in the job.\n\n#### CI\/CD techniques with Git and Databricks Git folders (Repos)\n##### Option 1: Run jobs using notebooks in a remote repository\n\nSimplify the job definition process and keep a single source of truth by running a Databricks job using notebooks located in a remote Git repository. You provide the Git reference, which can be a Git commit, tag, or branch, in the job definition. 
\nThis helps prevent unintentional changes to your production job, such as when a user makes local edits in a production repository or switches branches. It also automates the CD step as you do not need to create a separate production Git folder in Databricks, manage permissions for it, and keep it updated. \nSee [Use version-controlled source code in a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-repos.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/ci-cd-techniques-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### CI\/CD techniques with Git and Databricks Git folders (Repos)\n##### Option 2: Set up a production Git folder and Git automation\n\nIn this option, you set up a production Git folder and automation to update the Git folder on merge. \n### Step 1: Set up top-level folders \nThe admin creates non-user top-level folders. The most common use case for these top-level folders is to create development, staging, and production folders that contain Databricks Git folders for the appropriate versions or branches for development, staging, and production. For example, if your company uses the `main` branch for production, the \u201cproduction\u201d Git folder must have the `main` branch checked out in it. \nTypically permissions on these top-level folders are read-only for all non-admin users within the workspace. For such top-level folders we recommend you only provide service principal(s) with CAN EDIT and CAN MANAGE permissions to avoid accidental edits to your production code by workspace users. \n![Top-level Git folders.](https:\/\/docs.databricks.com\/_images\/top-level-repo-folders.png) \n### Step 2: Set up automated updates to Databricks Git folders with the Git folders API \nTo keep a Git folder in Databricks at the latest version, you can set up Git automation to call the [Repos API](https:\/\/docs.databricks.com\/api\/workspace\/repos). 
In your Git provider, set up automation that, after every successful merge of a PR into the main branch, calls the Repos API endpoint on the appropriate Git folder to update it to the latest version. \nFor example, on GitHub this can be achieved with [GitHub Actions](https:\/\/github.com\/features\/actions). \nTo call any Databricks REST API from within a Databricks notebook cell, first install the Databricks SDK with `%pip install databricks-sdk --upgrade` (for the latest Databricks REST APIs) and then import `ApiClient` from `databricks.sdk.core`. \nNote \nIf `%pip install databricks-sdk --upgrade` returns an error that \u201cThe package could not be found\u201d, then the `databricks-sdk` package has not been previously installed. Re-run the command without the `--upgrade` flag: `%pip install databricks-sdk`. \nYou can also run Databricks SDK APIs from a notebook to retrieve the service principals for your workspace. Here\u2019s an example using Python and the [Databricks SDK for Python](https:\/\/databricks-sdk-py.readthedocs.io\/en\/latest\/workspace\/iam\/service_principals.html). \nYou can also use tools such as `curl`, Postman, or Terraform. You cannot use the Databricks user interface. \nTo learn more about service principals on Databricks, see [Manage service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html). For information about service principals and CI\/CD, see [Service principals for CI\/CD](https:\/\/docs.databricks.com\/dev-tools\/ci-cd\/ci-cd-sp.html). 
For more details on using the Databricks SDK from a notebook, read [Use the Databricks SDK for Python from within a Databricks notebook](https:\/\/docs.databricks.com\/dev-tools\/sdk-python.html#notebook).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/ci-cd-techniques-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### CI\/CD techniques with Git and Databricks Git folders (Repos)\n##### Use a service principal with Databricks Git folders\n\nTo run the above mentioned workflows with service principals: \n1. Create a service principal with Databricks.\n2. Add the git credentials: Your Git provider PAT for the service principal. \nTo set up service principals and then add Git provider credentials: \n1. Create a Databricks service principal in your workspace with the [Service Principals API](https:\/\/docs.databricks.com\/api\/workspace\/serviceprincipals).\n2. Create a Databricks access token for a Databricks service principal with the [Token management API](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement).\n3. 
Add your Git provider credentials to your workspace with your Databricks access token and the [Git Credentials API](https:\/\/docs.databricks.com\/api\/workspace\/gitcredentials).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/ci-cd-techniques-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### CI\/CD techniques with Git and Databricks Git folders (Repos)\n##### Terraform integration\n\nYou can also manage Databricks Git folders in a fully automated setup using [Terraform](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_repo](https:\/\/registry.terraform.io\/providers\/databrickslabs\/databricks\/latest\/docs\/resources\/repo): \n```\nresource \"databricks_repo\" \"this\" {\nurl = \"https:\/\/github.com\/user\/demo.git\"\n}\n\n``` \nTo use Terraform to add Git credentials to a service principal, add the following configuration: \n```\nprovider \"databricks\" {\n# Configuration options\n}\n\nprovider \"databricks\" {\nalias = \"sp\"\nhost = \"https:\/\/....cloud.databricks.com\"\ntoken = databricks_obo_token.this.token_value\n}\n\nresource \"databricks_service_principal\" \"sp\" {\ndisplay_name = \"service_principal_name_here\"\n}\n\nresource \"databricks_obo_token\" \"this\" {\napplication_id = databricks_service_principal.sp.application_id\ncomment = \"PAT on behalf of ${databricks_service_principal.sp.display_name}\"\nlifetime_seconds = 3600\n}\n\nresource \"databricks_git_credential\" \"sp\" {\nprovider = databricks.sp\ndepends_on = [databricks_obo_token.this]\ngit_username = \"myuser\"\ngit_provider = \"azureDevOpsServices\"\npersonal_access_token = \"sometoken\"\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/ci-cd-techniques-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### CI\/CD techniques with Git and Databricks Git folders (Repos)\n##### Configure an automated CI\/CD 
pipeline with Databricks Git folders\n\nHere is a simple automation that can be run as a GitHub Action. \n### Requirements \n1. You have created a Git folder in a Databricks workspace that is tracking the base branch being merged into.\n2. You have a Python package that creates the artifacts to place into a DBFS location. Your code must: \n* Update the repository associated with your preferred branch (such as `development`) to contain the latest versions of your notebooks.\n* Build any artifacts and copy them to the library path.\n* Replace the previous versions of build artifacts to avoid having to manually update artifact versions in your job. \n### Steps \nNote \nStep 1 must be performed by an admin of the Git repository. \n1. Set up secrets so your code can access the Databricks workspace. Add the following secrets to the GitHub repository: \n* **DEPLOYMENT\\_TARGET\\_URL**: Set it to the workspace URL, but do not include the `\/?o` substring.\n* **DEPLOYMENT\\_TARGET\\_TOKEN**: Provide a Databricks Personal Access Token (PAT) value. You can generate a Databricks PAT by following the instructions in [Configure Git credentials & connect a remote repo to Databricks](https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html).\n2. Navigate to the **Actions** tab of your Git repository and click the **New workflow** button. 
At the top of the page, select **Set up a workflow yourself** and paste in this script: \n![The \"set up a workflow yourself\" link in the GitHub Actions UI](https:\/\/docs.databricks.com\/_images\/github-set-up-new-action.png) \n```\n# This is a basic automation workflow to help you get started with GitHub Actions.\n\nname: CI\n\n# Controls when the workflow will run\non:\n  # Triggers the workflow on push to the base branch\n  push:\n    branches:\n      # Set your base branch name here\n      - your-base-branch-name\n\n# A workflow run is made up of one or more jobs that can run sequentially or in parallel\njobs:\n  # This workflow contains a single job called \"deploy\"\n  deploy:\n    # The type of runner that the job will run on\n    runs-on: ubuntu-latest\n    env:\n      DBFS_LIB_PATH: dbfs:\/path\/to\/libraries\/\n      REPO_PATH: \/Repos\/path\/here\n      LATEST_WHEEL_NAME: latest_wheel_name.whl\n\n    # Steps represent a sequence of tasks that will be executed as part of the job\n    steps:\n      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it\n      - uses: actions\/checkout@v2\n\n      - name: Setup Python\n        uses: actions\/setup-python@v2\n        with:\n          # Version range or exact version of a Python version to use, using SemVer's version range syntax.\n          python-version: 3.8\n\n      - name: Install mods\n        run: |\n          pip install databricks-cli\n          pip install pytest setuptools wheel\n\n      - name: Configure CLI\n        run: |\n          echo \"${{ secrets.DEPLOYMENT_TARGET_URL }} ${{ secrets.DEPLOYMENT_TARGET_TOKEN }}\" | databricks configure --token\n\n      - name: Extract branch name\n        shell: bash\n        run: echo \"##[set-output name=branch;]$(echo ${GITHUB_REF#refs\/heads\/})\"\n        id: extract_branch\n\n      - name: Update Databricks Git folder\n        run: |\n          databricks repos update --path ${{env.REPO_PATH}} --branch \"${{ steps.extract_branch.outputs.branch }}\"\n\n      - name: Build Wheel and send to Databricks workspace DBFS location\n        run: |\n          cd $GITHUB_WORKSPACE\n          python setup.py bdist_wheel\n          dbfs cp --overwrite .\/dist\/* ${{env.DBFS_LIB_PATH}}\n          # there is only 
one wheel file; this line copies it with the original version number in file name and overwrites if that version of wheel exists; it does not affect the other files in the path\n          dbfs cp --overwrite .\/dist\/* ${{env.DBFS_LIB_PATH}}${{env.LATEST_WHEEL_NAME}} # this line copies the wheel file and overwrites the latest version with it\n\n```\n3. Update the following environment variable values with your own: \n* **DBFS\\_LIB\\_PATH**: The path in DBFS to the libraries (wheels) you will use in this automation, which starts with `dbfs:`. For example, `dbfs:\/mnt\/myproject\/libraries`.\n* **REPO\\_PATH**: The path in your Databricks workspace to the Git folder where notebooks will be updated. For example, `\/Repos\/Develop`.\n* **LATEST\\_WHEEL\\_NAME**: The name of the last-compiled Python wheel file (`.whl`). This is used to avoid manually updating wheel versions in your Databricks jobs. For example, `your_wheel-latest-py3-none-any.whl`.\n4. Select **Commit changes\u2026** to commit the script as a GitHub Actions workflow. Once the pull request for this workflow is merged, go to the **Actions** tab of the Git repository and confirm the actions are successful.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/ci-cd-techniques-with-repos.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Fine-tune Hugging Face models for a single GPU\n\nThis article describes how to fine-tune a Hugging Face model with the Hugging Face `transformers` library on a single GPU. It also includes Databricks-specific recommendations for loading data from the lakehouse and logging models to MLflow, which enables you to use and govern your models on Databricks. 
\nThe Hugging Face `transformers` library provides the [Trainer](https:\/\/huggingface.co\/docs\/transformers\/main_classes\/trainer) utility and [Auto Model](https:\/\/huggingface.co\/docs\/transformers\/model_doc\/auto) classes that enable loading and fine-tuning Transformers models. \nThese tools are available for the following tasks with simple modifications: \n* Loading models to fine-tune.\n* Constructing the configuration for the Hugging Face Transformers Trainer utility.\n* Performing training on a single GPU. \nSee [What are Hugging Face Transformers?](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/index.html)\n\n##### Fine-tune Hugging Face models for a single GPU\n###### Requirements\n\n* A single-node [cluster](https:\/\/docs.databricks.com\/compute\/configure.html) with one GPU on the driver.\n* The GPU version of Databricks Runtime 13.0 ML and above. \n+ This example for fine-tuning requires the \ud83e\udd17 Transformers, \ud83e\udd17 Datasets, and \ud83e\udd17 Evaluate packages which are included in Databricks Runtime 13.0 ML and above.\n* MLflow 2.3.\n* [Data prepared and loaded for fine-tuning a model with transformers](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/load-data.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/fine-tune-model.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Fine-tune Hugging Face models for a single GPU\n###### Tokenize a Hugging Face dataset\n\nHugging Face Transformers models expect tokenized input, rather than the text in the downloaded data. To ensure compatibility with the base model, use an [AutoTokenizer](https:\/\/huggingface.co\/docs\/transformers\/v4.26.1\/en\/autoclass_tutorial#autotokenizer) loaded from the base model. 
Hugging Face `datasets` allows you to directly apply the tokenizer consistently to both the training and testing data. \nFor example: \n```\nfrom transformers import AutoTokenizer\n\n# Load the tokenizer that matches the base model\ntokenizer = AutoTokenizer.from_pretrained(base_model)\n\ndef tokenize_function(examples):\n    # Truncate long inputs; no padding is applied at this stage\n    return tokenizer(examples[\"text\"], padding=False, truncation=True)\n\n# Tokenize both the training and testing splits\ntrain_test_tokenized = train_test_dataset.map(tokenize_function, batched=True)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/fine-tune-model.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Fine-tune Hugging Face models for a single GPU\n###### Set up the training configuration\n\nHugging Face training configuration tools can be used to configure a [Trainer](https:\/\/huggingface.co\/docs\/transformers\/main_classes\/trainer). The Trainer classes require the user to provide: \n* Metrics\n* A base model\n* A training configuration \nYou can configure evaluation metrics in addition to the default `loss` metric that the `Trainer` computes. The following example demonstrates adding `accuracy` as a metric: \n```\nimport numpy as np\nimport evaluate\n\nmetric = evaluate.load(\"accuracy\")\n\ndef compute_metrics(eval_pred):\n    # eval_pred is a (logits, labels) pair\n    logits, labels = eval_pred\n    predictions = np.argmax(logits, axis=-1)\n    return metric.compute(predictions=predictions, references=labels)\n\n``` \nUse the [Auto Model classes for NLP](https:\/\/huggingface.co\/docs\/transformers\/v4.26.1\/en\/model_doc\/auto#natural-language-processing) to load the appropriate model for your task. \nFor text classification, use [AutoModelForSequenceClassification](https:\/\/huggingface.co\/docs\/transformers\/v4.26.1\/en\/model_doc\/auto#transformers.AutoModelForSequenceClassification) to load a base model. 
When creating the model, provide the number of classes and the label mappings created during dataset preparation. \n```\nfrom transformers import AutoModelForSequenceClassification\nmodel = AutoModelForSequenceClassification.from_pretrained(\nbase_model,\nnum_labels=len(label2id),\nlabel2id=label2id,\nid2label=id2label\n)\n\n``` \nNext, create the training configuration. The [TrainingArguments](https:\/\/huggingface.co\/docs\/transformers\/v4.26.1\/en\/main_classes\/trainer#transformers.TrainingArguments) class allows you to specify the output directory, evaluation strategy, learning rate, and other parameters. \n```\nfrom transformers import TrainingArguments, Trainer\ntraining_args = TrainingArguments(output_dir=training_output_dir, evaluation_strategy=\"epoch\")\n\n``` \nUsing a [data collator](https:\/\/huggingface.co\/docs\/transformers\/v4.26.1\/en\/main_classes\/data_collator) batches input in training and evaluation datasets. [DataCollatorWithPadding](https:\/\/huggingface.co\/docs\/transformers\/v4.26.1\/en\/main_classes\/data_collator#transformers.DataCollatorWithPadding) gives good baseline performance for text classification. \n```\nfrom transformers import DataCollatorWithPadding\ndata_collator = DataCollatorWithPadding(tokenizer)\n\n``` \nWith all of these parameters constructed, you can now create a `Trainer`. 
\n```\ntrainer = Trainer(\n    model=model,\n    args=training_args,\n    train_dataset=train_test_tokenized[\"train\"],\n    eval_dataset=train_test_tokenized[\"test\"],\n    compute_metrics=compute_metrics,\n    data_collator=data_collator,\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/fine-tune-model.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Fine-tune Hugging Face models for a single GPU\n###### Train and log to MLflow\n\nHugging Face interfaces well with MLflow and automatically logs metrics during model training using the [MLflowCallback](https:\/\/huggingface.co\/docs\/transformers\/main\/en\/main_classes\/callback#transformers.integrations.MLflowCallback). However, you must log the trained model yourself. \nWrap training in an MLflow run. Inside the run, train the model, save it to local disk, and construct a Transformers pipeline from the tokenizer and the trained model. Finally, log the model to MLflow with [mlflow.transformers.log\\_model](https:\/\/mlflow.org\/docs\/latest\/models.html#transformers-transformers-experimental). 
\n```\nimport mlflow\nfrom transformers import pipeline\n\nwith mlflow.start_run() as run:\n    # Metrics are logged automatically by the MLflowCallback during training.\n    trainer.train()\n    trainer.save_model(model_output_dir)\n    pipe = pipeline(\"text-classification\", model=AutoModelForSequenceClassification.from_pretrained(model_output_dir), batch_size=1, tokenizer=tokenizer)\n    # The trained model is not logged automatically, so log the pipeline explicitly.\n    model_info = mlflow.transformers.log_model(\n        transformers_model=pipe,\n        artifact_path=\"classification\",\n        input_example=\"Hi there!\",\n    )\n\n``` \nIf you don\u2019t need to create a pipeline, you can pass the components used in training as a dictionary: \n```\nmodel_info = mlflow.transformers.log_model(\n    transformers_model={\"model\": trainer.model, \"tokenizer\": tokenizer},\n    task=\"text-classification\",\n    artifact_path=\"text_classifier\",\n    input_example=[\"MLflow is great!\", \"MLflow on Databricks is awesome!\"],\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/fine-tune-model.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Fine-tune Hugging Face models for a single GPU\n###### Load the model for inference\n\nWhen your model is logged and ready, loading the model for inference is the same as loading the MLflow wrapped pre-trained model. \n```\nlogged_model = \"runs:\/{run_id}\/{model_artifact_path}\".format(run_id=run.info.run_id, model_artifact_path=model_artifact_path)\n\n# Load model as a Spark UDF. 
Override result_type if the model does not return double values.\nloaded_model_udf = mlflow.pyfunc.spark_udf(spark, model_uri=logged_model, result_type='string')\n\ntest = test.select(test.text, test.label, loaded_model_udf(test.text).alias(\"prediction\"))\ndisplay(test)\n\n``` \nSee [Model serving with Databricks](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) for more information.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/fine-tune-model.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Fine-tune Hugging Face models for a single GPU\n###### Troubleshoot common CUDA errors\n\nThis section describes common CUDA errors and guidance on how to resolve them. \n### OutOfMemoryError: CUDA out of memory \nWhen training large models, a common error you may encounter is the CUDA out of memory error. \nExample: \n```\nOutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 666.34 MiB already allocated; 17.75 MiB free; 720.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.\n\n``` \nTry the following recommendations to resolve this error: \n* Reduce the batch size for training. You can reduce the `per_device_train_batch_size` value in [TrainingArguments](https:\/\/huggingface.co\/docs\/transformers\/main\/en\/main_classes\/trainer#transformers.TrainingArguments).\n* Use lower precision training. 
You can set `fp16=True` in [TrainingArguments](https:\/\/huggingface.co\/docs\/transformers\/main\/en\/main_classes\/trainer#transformers.TrainingArguments).\n* Use `gradient_accumulation_steps` in [TrainingArguments](https:\/\/huggingface.co\/docs\/transformers\/main\/en\/main_classes\/trainer#transformers.TrainingArguments) to effectively increase overall batch size.\n* Use the [8-bit Adam optimizer](https:\/\/huggingface.co\/docs\/transformers\/main\/en\/perf_train_gpu_one#8bit-adam).\n* Clean up the GPU memory before training. Sometimes GPU memory is still held by code that is no longer in use. \n```\nfrom numba import cuda\ndevice = cuda.get_current_device()\ndevice.reset()\n\n``` \n### CUDA kernel errors \nWhen running the training, you may get CUDA kernel errors. \nExample: \n```\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\n\nFor debugging, consider passing CUDA_LAUNCH_BLOCKING=1.\n\n``` \nTo troubleshoot: \n* Try running the code on CPU to see if the error is reproducible.\n* Another option is to get a better traceback by setting `CUDA_LAUNCH_BLOCKING=1`: \n```\nimport os\nos.environ[\"CUDA_LAUNCH_BLOCKING\"] = \"1\"\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/fine-tune-model.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Fine-tune Hugging Face models for a single GPU\n###### Notebook: Fine-tune text classification on a single GPU\n\nTo get started quickly, the following notebook provides an end-to-end example of fine-tuning a model for text classification. The subsequent sections of this article go into more detail around using Hugging Face for fine-tuning on Databricks. 
\n### Fine-tuning Hugging Face text classification models notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/tune-classification-model-hugging-face-transformers.html)\n\n##### Fine-tune Hugging Face models for a single GPU\n###### Additional resources\n\nLearn more about Hugging Face on Databricks. \n* [What are Hugging Face Transformers?](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/index.html)\n* You can use Hugging Face Transformers models on Spark to scale out your NLP batch applications, see [Model inference using Hugging Face Transformers for NLP](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/model-inference-nlp.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/fine-tune-model.html"} +{"content":"# Databricks data engineering\n### Apache Spark on Databricks\n\nThis article describes how Apache Spark is related to Databricks and the Databricks Data Intelligence Platform. \nApache Spark is at the heart of the Databricks platform and is the technology powering compute clusters and SQL warehouses. Databricks is an optimized platform for Apache Spark, providing an efficient and simple platform for running Apache Spark workloads.\n\n### Apache Spark on Databricks\n#### What is the relationship of Apache Spark to Databricks?\n\nThe Databricks company was founded by the original creators of Apache Spark. As an open source software project, Apache Spark has [committers from many top companies](https:\/\/spark.apache.org\/committers.html), including Databricks. \nDatabricks continues to develop and release features to Apache Spark. 
The Databricks Runtime includes additional optimizations and proprietary features that build on and extend Apache Spark, including [Photon](https:\/\/docs.databricks.com\/compute\/photon.html), an optimized version of Apache Spark rewritten in C++.\n\n### Apache Spark on Databricks\n#### How does Apache Spark work on Databricks?\n\nWhen you deploy a compute cluster or SQL warehouse on Databricks, Apache Spark is configured and deployed to virtual machines. You don\u2019t need to configure or initialize a Spark context or Spark session, as these are managed for you by Databricks.\n\n### Apache Spark on Databricks\n#### Can I use Databricks without using Apache Spark?\n\nDatabricks supports a variety of workloads and includes open source libraries in the Databricks Runtime. Databricks SQL uses Apache Spark under the hood, but end users use standard SQL syntax to create and query database objects. \nDatabricks Runtime for Machine Learning is optimized for ML workloads, and many data scientists primarily use open source libraries such as TensorFlow and scikit-learn while working on Databricks. You can use [workflows](https:\/\/docs.databricks.com\/workflows\/index.html) to schedule arbitrary workloads against compute resources deployed and managed by Databricks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/spark\/index.html"} +{"content":"# Databricks data engineering\n### Apache Spark on Databricks\n#### Why use Apache Spark on Databricks?\n\nThe Databricks platform provides a secure, collaborative environment for developing and deploying enterprise solutions that scale with your business. Databricks employees include many of the world\u2019s most knowledgeable Apache Spark maintainers and users. 
The company continuously develops and releases new optimizations to ensure users can access the fastest environment for running Apache Spark.\n\n### Apache Spark on Databricks\n#### How can I learn more about using Apache Spark on Databricks?\n\nTo get started with Apache Spark on Databricks, dive right in! The Apache Spark DataFrames tutorial walks through loading and transforming data in Python, R, or Scala. See [Tutorial: Load and transform data using Apache Spark DataFrames](https:\/\/docs.databricks.com\/getting-started\/dataframes.html). \nAdditional information on Python, R, and Scala language support in Spark is found in the [PySpark on Databricks](https:\/\/docs.databricks.com\/pyspark\/index.html), [SparkR overview](https:\/\/docs.databricks.com\/sparkr\/overview.html), and [Databricks for Scala developers](https:\/\/docs.databricks.com\/languages\/scala.html) sections, as well as in [Reference for Apache Spark APIs](https:\/\/docs.databricks.com\/reference\/spark.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/spark\/index.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### What is asynchronous progress tracking?\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nAsynchronous progress tracking allows Structured Streaming pipelines to checkpoint progress asynchronously and in parallel to the actual data processing within a micro-batch, reducing latency associated with maintaining the `offsetLog` and `commitLog`. \n![Asynchronous Progress Tracking](https:\/\/docs.databricks.com\/_images\/async-progress.png) \nNote \nAsynchronous progress tracking does not work with `Trigger.once` or `Trigger.availableNow` triggers. 
Attempting to enable this feature with these triggers results in query failure.\n\n#### What is asynchronous progress tracking?\n##### How does asynchronous progress tracking work to reduce latency?\n\nStructured Streaming relies on persisting and managing offsets as progress indicators for query processing. Offset management operation directly impacts processing latency, because no data processing can occur until these operations are complete. Asynchronous progress tracking enables Structured Streaming pipelines to checkpoint progress without being impacted by these offset management operations.\n\n#### What is asynchronous progress tracking?\n##### When should you configure checkpoint frequency?\n\nUsers can configure the frequency at which progress is checkpointed. The default settings for checkpoint frequency provide good throughput for most queries. Configuring the frequency is helpful for scenarios in which offset management operations occur at a higher rate than they can be processed, which creates an ever increasing backlog of offset management operations. To stem this growing backlog, data processing is blocked or slowed, essentially reverting the processing behavior to eliminate the benefits of asynchronous progress tracking. \nNote \nFailure recovery time increases with the increase in checkpoint interval time. In case of failure, a pipeline has to reprocess all the data before the previous successful checkpoint. 
Users can consider this trade-off between lower latency during regular processing and recovery time in case of failure.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/async-progress-checking.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### What is asynchronous progress tracking?\n##### What configurations are associated with asynchronous progress tracking?\n\n| Option | Value | Default | Description |\n| --- | --- | --- | --- |\n| asyncProgressTrackingEnabled | true\/false | false | enable or disable asynchronous progress tracking |\n| asyncProgressTrackingCheckpointIntervalMs | milliseconds | 1000 | the interval in which we commit offsets and completion commits |\n\n#### What is asynchronous progress tracking?\n##### How can users enable asynchronous progress tracking?\n\nUsers can use code similar to the code below to enable this feature: \n```\nval stream = spark.readStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", \"host1:port1,host2:port2\")\n.option(\"subscribe\", \"in\")\n.load()\n\nval query = stream.writeStream\n.format(\"kafka\")\n.option(\"topic\", \"out\")\n.option(\"checkpointLocation\", \"\/tmp\/checkpoint\")\n.option(\"asyncProgressTrackingEnabled\", \"true\")\n.start()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/async-progress-checking.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### What is asynchronous progress tracking?\n##### Turning off asynchronous progress tracking\n\nWhen async progress tracking is enabled, the framework does not checkpoint progress for every batch. To address this, before you disable asynchronous progress tracking, process at least two micro-batches with the following settings: \n* `.option(\"asyncProgressTrackingEnabled\", \"true\")`\n* `.option(\"asyncProgressTrackingCheckpointIntervalMs\", 0)` \nStop the query after at least two micro-batches have finished processing. 
Now you can safely disable the async progress tracking and restart the query. \nIf you have disabled asynchronous progress tracking without completing this step, you may encounter the following error: \n```\njava.lang.IllegalStateException: batch x doesn't exist\n\n``` \nIn the driver logs, you might see the following error: \n```\nThe offset log for batch x doesn't exist, which is required to restart the query from the latest batch x from the offset log. Please ensure there are two subsequent offset logs available for the latest batch via manually deleting the offset file(s). Please also ensure the latest batch for commit log is equal or one batch earlier than the latest batch for offset log.\n\n``` \nFollowing the instructions in this section to disable asynchronous progress tracking allows you to address these errors and repair your streaming workload.\n\n#### What is asynchronous progress tracking?\n##### Limitations with asynchronous progress tracking\n\nThis feature has the following limitations: \n* Asynchronous progress tracking is only supported in stateless pipelines when using Kafka as a sink.\n* Exactly once end-to-end processing is not guaranteed with asynchronous progress tracking because offset ranges for batch can be changed in case of failure. Some sinks, such as Kafka, never provide exactly-once guarantees.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/async-progress-checking.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Archival support in Databricks\n\nPreview \nThis feature is in Public Preview for Databricks Runtime 13.3 LTS and above. \nArchival support in Databricks introduces a collection of capabilities that enable you to use cloud-based lifecycle policies on cloud object storage containing Delta tables. 
\nWithout archival support, operations against Delta tables can break because data files or transaction log files have moved to archived locations and are not available when queried. Archival support introduces optimizations to avoid querying archived data when possible and adds new syntax to identify files that must be restored from archive to complete queries. \nImportant \nDatabricks only has archival support for S3 Glacier Deep Archive and Glacier Flexible Retrieval. See [AWS docs on working with archived objects](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/archived-objects.html).\n\n#### Archival support in Databricks\n##### Queries optimized for archived data\n\nArchival support in Databricks optimizes the following queries against Delta tables: \n| Query | New behavior |\n| --- | --- |\n| `SELECT * FROM <table_name> LIMIT <limit> [WHERE <partition_predicate>]` | Automatically ignore archived files and return results from data in a non-archived storage tier. |\n| Delta Lake maintenance commands: `OPTIMIZE`, `ZORDER`, `ANALYZE`, `PURGE` | Automatically ignore archived files and run maintenance on rest of table. |\n| DDL and DML statements that overwrite data or delete data, including the following: `REPLACE TABLE`, `INSERT OVERWRITE`, `TRUNCATE TABLE`, `DROP TABLE` | Mark transaction log entries for target archived data files as deleted. |\n| `FSCK REPAIR TABLE` | Ignore archived files and only check for files that haven\u2019t reached life cycle policy. 
| \nSee [Limitations](https:\/\/docs.databricks.com\/optimizations\/archive-delta.html#limitations).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/archive-delta.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Archival support in Databricks\n##### Early failure and error messages\n\nFor queries that must scan archived files to generate correct results, configuring archival support for Delta Lake ensures the following: \n* Queries fail early if they attempt to access files in archive, reducing wasted compute and allowing users to quickly adapt and re-run queries.\n* Error messages inform users that a query has failed because the query attempted to access archived files. \nUsers can generate a report of files that need to be restored using the `SHOW ARCHIVED FILES` syntax. See [Show archived files](https:\/\/docs.databricks.com\/optimizations\/archive-delta.html#show).\n\n#### Archival support in Databricks\n##### Enable archival support\n\nYou enable archival support in Databricks for Delta tables by manually specifying the archival interval configured in the underlying cloud lifecycle management policy, as in the following example syntax: \n```\nALTER TABLE <table_name> SET TBLPROPERTIES(delta.timeUntilArchived = 'X days');\n\n``` \nDelta Lake does not directly interact with the lifecycle management policies configured in your cloud account. If you update the policy in your cloud account, you must also update the `delta.timeUntilArchived` property on your Delta table. See [Change the lifecycle management transition rule](https:\/\/docs.databricks.com\/optimizations\/archive-delta.html#change-rule). \nImportant \nArchival support relies entirely on compatible Databricks compute environments and only works for Delta tables. 
Configuring archival support does not change behavior, compatibility, or support in OSS Delta Lake clients or Databricks Runtime 12.2 LTS and below.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/archive-delta.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Archival support in Databricks\n##### Show archived files\n\nTo identify files that need to be restored to complete a given query, use `SHOW ARCHIVED FILES`, as in the following example: \n```\nSHOW ARCHIVED FILES FOR table_name [ WHERE predicate ];\n\n``` \nThis operation returns URIs for archived files as a Spark DataFrame. \nNote \nDelta Lake only has access to the data statistics contained within the transaction log during this operation (minimum value, maximum value, null counts, and total number of records for the first 32 columns). The files returned include all archived files that need to be read to determine whether or not records fulfilling a predicate exist in the file. Databricks recommends providing predicates that include fields on which data is partitioned, z-ordered, or clustered, if possible, to reduce the number of files that need to be restored.\n\n#### Archival support in Databricks\n##### Limitations\n\nThe following limitations exist: \n* No support exists for lifecycle management policies that are not based on file creation time. This includes access-time-based policies and tag-based policies.\n* You cannot use `DROP COLUMN` on a table with archived files.\n* `REORG TABLE APPLY PURGE` makes a best effort, but only works on deletion vector files and referenced data files that are not archived. `PURGE` cannot delete archived deletion vector files.\n* Extending the lifecycle management transition rule results in unexpected behavior. 
See [Extend the lifecycle management transition rule](https:\/\/docs.databricks.com\/optimizations\/archive-delta.html#extend-rule).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/archive-delta.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Archival support in Databricks\n##### Change the lifecycle management transition rule\n\nIf you change the time interval for your cloud lifecycle management transition rule, you must update the property `delta.timeUntilArchived`. \nIf the time interval before archival is shortened (less time since file creation), archival support for the Delta table continues functioning normally after the table property is updated. \n### Extend the lifecycle management transition rule \nIf the time interval before archival is extended (more time since file creation), updating the property `delta.timeUntilArchived` to the new value can lead to errors. Cloud providers do not restore files out of archived storage automatically when data retention policies are changed. This means that files that previously were eligible for archival but now are not considered eligible for archival are still archived. \nImportant \nTo avoid errors, never set the property `delta.timeUntilArchived` to a value greater than the actual age of the most recently archived data. \nConsider a scenario in which the time interval for archival is changed from 60 days to 90 days: \n1. When the policy changes, all records between 60 and 90 days old are already archived.\n2. For 30 days, no new files are archived (the oldest non-archived files are 60 days old at the time the policy is extended).\n3. After 30 days have passed, the life cycle policy correctly describes all archived data. \nThe `delta.timeUntilArchived` setting tracks the set time interval against the file creation time recorded by the Delta transaction log. It does not have explicit knowledge of the underlying policy. 
During the lag period between the old archival threshold and the new archival threshold, you can take one of the following approaches to avoid querying archived files: \n1. You can leave the setting `delta.timeUntilArchived` with the old threshold until enough time has passed that all files are archived. \n* Following with the example above, each day for the first 30 days another day\u2019s worth of data would be considered archived by Databricks but not yet archived by the cloud provider. This does not result in error, but ignores some data files that could be queried.\n* After 30 days, update the `delta.timeUntilArchived` to `90 days`.\n2. You can update the setting `delta.timeUntilArchived` each day to reflect the current interval during the lag period. \n* While the cloud policy is set to 90 days, the actual age of archived data changes in real time. For example, after 7 days, setting `delta.timeUntilArchived` to `67 days` accurately reflects the age of all data files in archive.\n* This approach is only necessary if you need access to all data in hot tiers. \nNote \nUpdating the value for `delta.timeUntilArchived` does not actually change which data is archived. It only changes which data Databricks treats as if it were archived.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/archive-delta.html"} +{"content":"# \n### Metrics\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html) \nTo evaluate your RAG Application, use `\ud83d\udcc8 Metrics`. Databricks provides a set of metrics that enable you to measure the quality, cost and latency of your RAG Application. These metrics are curated by Databricks\u2019 Research team as the most relevant (no pun intended) metrics for evaluating RAG applications. 
\n`\ud83d\udcc8 Metrics` are computed using either: \n1. User traffic: `\ud83d\udc4d Assessments` **and** `\ud83d\uddc2\ufe0f Request Log`\n2. `\ud83d\udcd6 Evaluation Set`: developer-curated `\ud83d\udc4d Assessments` **and** `\ud83d\uddc2\ufe0f Request Log` that represent common requests \nFor most metrics, `\ud83d\udc4d Assessments` come from either `\ud83e\udd16 LLM Judge`, `\ud83e\udde0 Expert Users`, or `\ud83d\udc64 End Users`. A small subset of the metrics, such as answer correctness, requires assessments annotated by `\ud83e\udde0 Expert Users` or `\ud83d\udc64 End Users`.\n\n### Metrics\n#### Collecting `\ud83d\udc4d Assessments`\n\n### From a `\ud83e\udd16 LLM Judge` \n* See [View logs & assessments](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/2-view-logs.html) for instructions on how to enable, disable, and configure the `\ud83e\udd16 LLM Judge`. \n### From `\ud83d\udc64 End Users` & `\ud83e\udde0 Expert Users` \n* See [Collect feedback from \ud83e\udde0 Expert Users](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/4-collect-feedback.html)\n* See [Create an \ud83d\udcd6 Evaluation Set](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/5-create-eval-set.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/metrics.html"} +{"content":"# \n### Metrics\n#### Compute metrics\n\nMetrics are computed as `\ud83d\udcc8 Evaluation Results` by RAG Studio and stored in the `\ud83d\udc4d Assessment & Evaluation Results Log`. \nThere are 2 ways to compute metrics: \n1. **Automatic** Metrics are automatically computed for all traffic that calls the `\ud83d\udd17 Chain`\u2019s REST API (hosted on Mosaic AI Model Serving). This includes traffic from the `\ud83d\udcac Review UI`, since this UI calls the REST API.\n2. 
**Manually** Metric computation for a `Version` using a `\ud83d\udcd6 Evaluation Set` can be triggered by following [Run offline evaluation with a \ud83d\udcd6 Evaluation Set](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/3-run-offline-eval.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/metrics.html"} +{"content":"# \n### Metrics\n#### Unstructured docs retrieval & generation metrics\n\n### Retriever \nRAG Studio supports the following metrics for evaluating the retriever. \n| Question to answer | Metric | Per trace value | Aggregated value | Requires human annotated assessment | Where it can be measured? |\n| --- | --- | --- | --- | --- | --- |\n| Are the retrieved chunks relevant to the user\u2019s query? | Precision of \u201crelevant chunk\u201d @ K | 0 to 100% | 0 to 100% | \u2714\ufe0f | Online, Offline Evaluation |\n| Are **ALL** chunks that are relevant to the user\u2019s query retrieved? | Recall of \u201crelevant chunk\u201d @ K | 0 to 100% | 0 to 100% | \u2714\ufe0f | Online, Offline Evaluation |\n| Are the retrieved chunks returned in the correct order of most to least relevant? | nDCG of \u201crelevant chunk\u201d @ K | 0 to 1 | 0 to 1 | \u2714\ufe0f | Online, Offline Evaluation |\n| What is the latency of retrieval? | Latency | milliseconds | average(milliseconds) | n\/a | Online, Offline Evaluation | \nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** [1] Cost [2] Do the retrieved chunks contain all the information required to answer the query? [3] Average Precision (AP) [4] Mean Average Precision (mAP) [5] Enabling `\ud83e\udd16 LLM Judge` for retrieval metrics so they do not require a ground-truth assessment. \n### Generation model (for retrieval) \nThese metrics measure the generation model\u2019s performance when the prompt is augmented with unstructured docs from a retrieval step. \n| Question to answer | Metric | Per trace value | Aggregated value | Requires human annotated assessment | Where it can be measured? 
|\n| --- | --- | --- | --- | --- | --- |\n| Is the LLM responding based ONLY on the context provided? *Aka not hallucinating & not using knowledge that is part of the model\u2019s pre-training* | Faithfulness (to context) | true\/false | 0 to 100% | \u2716\ufe0f | Online, Offline Evaluation |\n| Is the response on-topic given the query AND retrieved contexts? | Answer relevance (to query given the context) | true\/false | 0 to 100% | \u2716\ufe0f | Online, Offline Evaluation | \nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** [1] Did the LLM use the correct information from each provided context? [2] Does the response answer the entirety of the query? Aka if I ask \u201cwho are bob and sam?\u201d is the response about both bob and sam? \n### Data corpus \nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** [1] Does my corpus contain all the information needed to answer a query? aka is the index missing any documents required to answer a specific question?\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/metrics.html"} +{"content":"# \n### Metrics\n#### Generation model (any task) metrics\n\nThese metrics measure the generation model\u2019s performance. They work for any prompt, augmented or non-augmented. \n| Question to answer | Metric | Per trace value | Aggregated value | Requires human annotated assessment | Where it can be measured? |\n| --- | --- | --- | --- | --- | --- |\n| What is the cost of the generation? | Token Count | sum(tokens) | sum(tokens) | n\/a | Online, Offline Evaluation |\n| What is the latency of generation? | Latency | milliseconds | average(milliseconds) | n\/a | Online, Offline Evaluation |\n\n### Metrics\n#### RAG chain metrics\n\nThese metrics measure the chain\u2019s final response back to the user. \n| Question to answer | Metric | Per trace value | Aggregated value | Requires human annotated assessment | Where it can be measured? |\n| --- | --- | --- | --- | --- | --- |\n| Is the response accurate (correct)? 
| Answer correctness (vs. ground truth) | true\/false | 0 to 100% | \u2714\ufe0f | Offline Evaluation |\n| Does the response violate any of my company policies (racism, toxicity, etc)? | Toxicity | true\/false | 0 to 100% | \u2716\ufe0f | Online, Offline Evaluation | \nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** [1] Total cost [2] Total latency [3] Answer similarity (to ground truth) using Spearman correlation based on cosine distance [4] Metrics based on assessor-selected reason codes (e.g., helpful, too wordy, etc) [5] User retention rate & other traditional app engagement metrics [6] Is the response in line with my company standards (proper grammar, tone of voice, etc)? [7] Additional assessments for `Does the response violate any of my company policies (racism, toxicity, etc)?` based on LLaMa-Guard [8] % of conversations with no negative feedback signals\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/metrics.html"} +{"content":"# AI and Machine Learning on Databricks\n### Train recommender models\n\nThis article includes two examples of deep-learning-based recommendation models on Databricks. Compared to traditional recommendation models, deep learning models can achieve higher quality results and scale to larger amounts of data. As these models continue to evolve, Databricks provides a framework for effectively training large-scale recommendation models capable of handling hundreds of millions of users. \nA general recommendation system can be viewed as a funnel with the stages shown in the diagram. \n![recommender system architecture diagram](https:\/\/docs.databricks.com\/_images\/recommender-system-architecture.png) \nSome models, such as the two-tower model, perform better as retrieval models. These models are smaller and can effectively operate on millions of data points. Other models, such as DLRM or DeepFM, perform better as reranking models. 
These models can take in more data, are larger, and can provide fine-grained recommendations.\n\n### Train recommender models\n#### Requirements\n\nDatabricks Runtime 14.3 LTS ML\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-recommender-models.html"} +{"content":"# AI and Machine Learning on Databricks\n### Train recommender models\n#### Tools\n\nThe examples in this article illustrate the following tools: \n* [TorchDistributor](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/spark-pytorch-distributor.html): TorchDistributor is a framework that allows you to run large scale PyTorch model training on Databricks. It uses Spark for orchestration and can scale to as many GPUs as are available in your cluster.\n* [Mosaic StreamingDataset](https:\/\/docs.mosaicml.com\/projects\/streaming\/en\/stable\/): StreamingDataset improves performance and scalability of training on large datasets on Databricks using features like prefetching and interleaving.\n* [MLflow](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html): MLflow allows you to track parameters, metrics, and model checkpoints.\n* [TorchRec](https:\/\/pytorch.org\/torchrec\/): Modern recommender systems use embedding lookup tables to handle millions of users and items to generate high-quality recommendations. Larger embedding sizes improve model performance but require substantial GPU memory and multi-GPU setups. 
TorchRec provides a framework to scale recommendation models and lookup tables across multiple GPUs, making it ideal for large embeddings.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-recommender-models.html"} +{"content":"# AI and Machine Learning on Databricks\n### Train recommender models\n#### Example: Movie recommendations using a two-tower model architecture\n\nThe two-tower model is designed to handle large-scale personalization tasks by processing user and item data separately before combining them. It is capable of efficiently generating hundreds or thousands of decent quality recommendations. The model generally expects three inputs: A user\_id feature, a product\_id feature, and a binary label defining whether the <user, product> interaction was positive (the user purchased the product) or negative (the user gave the product a one-star rating). The outputs of the model are embeddings for both users and items, which are then generally combined (often using a dot product or cosine similarity) to predict user-item interactions. \nAs the two-tower model provides embeddings for both users and products, you can place these embeddings in a vector database, such as [Databricks Vector Store](https:\/\/docs.databricks.com\/generative-ai\/vector-search.html), and perform similarity-search-like operations on the users and items. For example, you could place all the items in a vector store, and for each user, query the vector store to find the top hundred items whose embeddings are similar to the user\u2019s. \nThe following example notebook implements the two-tower model training using the \u201cLearning from Sets of Items\u201d dataset to predict the likelihood that a user will rate a certain movie highly. It uses Mosaic StreamingDataset for distributed data loading, TorchDistributor for distributed model training, and MLflow for model tracking and logging. 
\n### Two-tower recommender model notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/two-tower-recommender-model.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nThis notebook is also available in the Databricks Marketplace: [Two-tower model notebook](https:\/\/marketplace.databricks.com\/details\/cc45e324-1523-4d8d-a2a0-b59eb7858e04\/Databricks_Two-Tower-Recommendation-Model-Training) \nNote \n* Inputs for the two-tower model are most often the categorical features user\\_id and product\\_id. The model can be modified to support multiple feature vectors for both users and products.\n* Outputs for the two-tower model are usually binary values indicating whether the user will have a positive or negative interaction with the product. The model can be modified for other applications such as regression, multi-class classification, and probabilities for multiple user actions (for example, dismiss or purchase). Complex outputs should be implemented carefully, as competing objectives can degrade the quality of the embeddings generated by the model.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-recommender-models.html"} +{"content":"# AI and Machine Learning on Databricks\n### Train recommender models\n#### Example: Train a DLRM architecture using a synthetic dataset\n\nDLRM is a state-of-the-art neural network architecture designed specifically for personalization and recommendation systems. It combines categorical and numerical inputs to effectively model user-item interactions and predict user preferences. DLRMs generally expect inputs that include both sparse features (such as user ID, item ID, geographic location, or product category) and dense features (such as user age or item price). The output of a DLRM is typically a prediction of user engagement, such as click-through rates or purchase likelihood. 
\nDLRMs offer a highly customizable framework that can handle large-scale data, making them suitable for complex recommendation tasks across various domains. Because it is a larger model than the two-tower architecture, this model is often used in the reranking stage. \nThe following example notebook builds a DLRM model to predict binary labels using dense (numerical) features and sparse (categorical) features. It uses a synthetic dataset to train the model, the Mosaic StreamingDataset for distributed data loading, TorchDistributor for distributed model training, and MLflow for model tracking and logging. \n### DLRM notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/dlrm-recommender-model.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nThis notebook is also available in the Databricks Marketplace: [DLRM notebook](https:\/\/marketplace.databricks.com\/details\/db8353e3-ea71-4437-a80b-6f584cffa42b\/Databricks_DLRM-Recommendation-Model-Training).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-recommender-models.html"} +{"content":"# AI and Machine Learning on Databricks\n### Train recommender models\n#### Comparison of two-tower and DLRM models\n\nThe table shows some guidelines for selecting which recommender model to use. 
\n| Model type | Dataset size needed for training | Model size | Supported input types | Supported output types | Use cases |\n| --- | --- | --- | --- | --- | --- |\n| Two-tower | Smaller | Smaller | Usually two features (user\\_id, product\\_id) | Mainly binary classification and embeddings generation | Generating hundreds or thousands of possible recommendations |\n| DLRM | Larger | Larger | Various categorical and dense features (user\\_id, gender, geographic\\_location, product\\_id, product\\_category, \u2026) | Multi-class classification, regression, others | Fine-grained retrieval (recommending tens of highly relevant items) | \nIn summary, the two-tower model is best used for generating thousands of good quality recommendations very efficiently. An example might be movie recommendations from a cable provider. The DLRM model is best used for generating very specific recommendations based on more data. An example might be a retailer who wants to present to a customer a smaller number of items that they are highly likely to purchase.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-recommender-models.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use dbt transformations in a Databricks job\n\nYou can run your dbt Core projects as a task in a Databricks job. By running your dbt Core project as a job task, you can benefit from the following Databricks Jobs features: \n* Automate your dbt tasks and schedule workflows that include dbt tasks.\n* Monitor your dbt transformations and send notifications on the status of the transformations.\n* Include your dbt project in a workflow with other tasks. 
For example, your workflow can ingest data with Auto Loader, transform the data with dbt, and analyze the data with a notebook task.\n* Automatic archiving of the artifacts from job runs, including logs, results, manifests, and configuration. \nTo learn more about dbt Core, see the [dbt documentation](https:\/\/docs.getdbt.com\/docs\/introduction).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbt-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use dbt transformations in a Databricks job\n###### Development and production workflow\n\nDatabricks recommends developing your dbt projects against a Databricks SQL warehouse. Using a Databricks SQL warehouse, you can test the SQL generated by dbt and use the SQL warehouse [query history](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html) to debug the queries generated by dbt. \nTo run your dbt transformations in production, Databricks recommends using the dbt task in a Databricks job. By default, the dbt task will run the dbt Python process using Databricks compute and the dbt generated SQL against the selected SQL warehouse. \nYou can run dbt transformations on a serverless SQL warehouse or pro SQL warehouse, Databricks compute, or any other [dbt-supported warehouse](https:\/\/docs.getdbt.com\/docs\/available-adapters). This article discusses the first two options with examples. \nIf your workspace is Unity Catalog-enabled and [Serverless Workflows](https:\/\/docs.databricks.com\/workflows\/jobs\/run-serverless-jobs.html) is enabled, by default, the job runs on Serverless compute. \nNote \nDeveloping dbt models against a SQL warehouse and running them in production on Databricks compute can lead to subtle differences in performance and SQL language support. 
Databricks recommends using the same Databricks Runtime version for the compute and the SQL warehouse.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbt-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use dbt transformations in a Databricks job\n###### Requirements\n\n* To learn how to use dbt Core and the `dbt-databricks` package to create and run dbt projects in your development environment, see [Connect to dbt Core](https:\/\/docs.databricks.com\/partners\/prep\/dbt.html). \nDatabricks recommends the [dbt-databricks](https:\/\/github.com\/databricks\/dbt-databricks) package, not the dbt-spark package. The dbt-databricks package is a fork of dbt-spark optimized for Databricks.\n* To use dbt projects in a Databricks job, you must set up [Git integration with Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html). You cannot run a dbt project from DBFS. \n* You must have [serverless or pro SQL warehouses](https:\/\/docs.databricks.com\/admin\/sql\/warehouse-types.html) enabled. \n* You must have the Databricks SQL [entitlement](https:\/\/docs.databricks.com\/security\/auth-authz\/entitlements.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbt-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use dbt transformations in a Databricks job\n###### Create and run your first dbt job\n\nThe following example uses the [jaffle\\_shop](https:\/\/github.com\/dbt-labs\/jaffle_shop) project, an example project that demonstrates core dbt concepts. To create a job that runs the jaffle shop project, perform the following steps. \n1. 
Go to your Databricks landing page and do one of the following: \n* Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar and click ![Create Job Button](https:\/\/docs.databricks.com\/_images\/create-job.png).\n* In the sidebar, click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** and select **Job**.\n2. In the task text box on the **Tasks** tab, replace **Add a name for your job\u2026** with your job name.\n3. In **Task name**, enter a name for the task.\n4. In **Type**, select the **dbt** task type. \n![Add a dbt task](https:\/\/docs.databricks.com\/_images\/dbt_task_create_task.png)\n5. In the **Source** drop-down menu, you can select **Workspace** to use a dbt project located in a Databricks workspace folder or **Git provider** for a project located in a remote Git repository. Because this example uses the jaffle shop project located in a Git repository, select **Git provider**, click **Edit**, and enter the details for the jaffle shop GitHub repository. \n![Configure dbt project repo](https:\/\/docs.databricks.com\/_images\/dbt_task_configure_repo_dbt.png) \n* In **Git repository URL**, enter the URL for the jaffle shop project.\n* In **Git reference (branch \/ tag \/ commit)**, enter `main`. You can also use a tag or SHA.\n6. Click **Confirm**.\n7. In the **dbt commands** text boxes, specify the dbt commands to run (**deps**, **seed**, and **run**). You must prefix every command with `dbt`. Commands are run in the specified order. \n![Configure dbt commands](https:\/\/docs.databricks.com\/_images\/dbt_task_configure_commands_serverless.png)\n8. In **SQL warehouse**, select a SQL warehouse to run the SQL generated by dbt. The **SQL warehouse** drop-down menu shows only serverless and pro SQL warehouses.\n9. (Optional) You can specify a schema for the task output. By default, the schema `default` is used.\n10. 
(Optional) If you want to change the compute configuration that runs dbt Core, click **dbt CLI compute**.\n11. (Optional) You can specify a dbt-databricks version for the task. For example, to pin your dbt task to a specific version for development and production: \n* Under **Dependent libraries**, click ![Delete Icon](https:\/\/docs.databricks.com\/_images\/delete-icon.png) next to the current dbt-databricks version.\n* Click **Add**.\n* In the **Add dependent library** dialog, select **PyPI** and enter the dbt-package version in the **Package** text box (for example, `dbt-databricks==1.6.0`).\n* Click **Add**.\n![Configure the dbt-databricks version](https:\/\/docs.databricks.com\/_images\/dbt_task_configure_package_version.png) \nNote \nDatabricks recommends pinning your dbt tasks to a specific version of the dbt-databricks package to ensure the same version is used for development and production runs. Databricks recommends version 1.6.0 or greater of the dbt-databricks package.\n12. Click **Create**.\n13. To run the job now, click ![Run Now Button](https:\/\/docs.databricks.com\/_images\/run-now-button.png).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbt-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use dbt transformations in a Databricks job\n###### View the results of your dbt job task\n\nWhen the job is complete, you can test the results by running SQL queries from a [notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) or by running [queries](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html#create-a-query) in your Databricks warehouse. 
For example, see the following sample queries: \n```\nSHOW tables IN <schema>;\n\n``` \n```\nSELECT * from <schema>.customers LIMIT 10;\n\n``` \nReplace `<schema>` with the schema name configured in the task configuration.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbt-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use dbt transformations in a Databricks job\n###### API example\n\nYou can also use the [Jobs API](https:\/\/docs.databricks.com\/api\/workspace\/jobs) to create and manage jobs that include dbt tasks. The following example creates a job with a single dbt task: \n```\n{\n\"name\": \"jaffle_shop dbt job\",\n\"max_concurrent_runs\": 1,\n\"git_source\": {\n\"git_url\": \"https:\/\/github.com\/dbt-labs\/jaffle_shop\",\n\"git_provider\": \"gitHub\",\n\"git_branch\": \"main\"\n},\n\"job_clusters\": [\n{\n\"job_cluster_key\": \"dbt_CLI\",\n\"new_cluster\": {\n\"spark_version\": \"10.4.x-photon-scala2.12\",\n\"node_type_id\": \"i3.xlarge\",\n\"num_workers\": 0,\n\"spark_conf\": {\n\"spark.master\": \"local[*, 4]\",\n\"spark.databricks.cluster.profile\": \"singleNode\"\n},\n\"custom_tags\": {\n\"ResourceClass\": \"SingleNode\"\n}\n}\n}\n],\n\"tasks\": [\n{\n\"task_key\": \"transform\",\n\"job_cluster_key\": \"dbt_CLI\",\n\"dbt_task\": {\n\"commands\": [\n\"dbt deps\",\n\"dbt seed\",\n\"dbt run\"\n],\n\"warehouse_id\": \"1a234b567c8de912\"\n},\n\"libraries\": [\n{\n\"pypi\": {\n\"package\": \"dbt-databricks>=1.0.0,<2.0.0\"\n}\n}\n]\n}\n]\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbt-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use dbt transformations in a Databricks job\n###### (Advanced) Run dbt with a custom profile\n\nTo run your 
dbt task with a SQL warehouse (recommended) or all-purpose compute, use a custom `profiles.yml` defining the warehouse or Databricks compute to connect to. To create a job that runs the jaffle shop project with a warehouse or all-purpose compute, perform the following steps. \nNote \nOnly a SQL warehouse or all-purpose compute can be used as the target for a dbt task. You cannot use job compute as a target for dbt. \n1. Create a fork of the [jaffle\\_shop](https:\/\/github.com\/dbt-labs\/jaffle_shop) repository.\n2. Clone the forked repository to your desktop. For example, you could run a command like the following: \n```\ngit clone https:\/\/github.com\/<username>\/jaffle_shop.git\n\n``` \nReplace `<username>` with your GitHub handle.\n3. Create a new file called `profiles.yml` in the `jaffle_shop` directory with the following content: \n```\njaffle_shop:\ntarget: databricks_job\noutputs:\ndatabricks_job:\ntype: databricks\nmethod: http\nschema: \"<schema>\"\nhost: \"<http-host>\"\nhttp_path: \"<http-path>\"\ntoken: \"{{ env_var('DBT_ACCESS_TOKEN') }}\"\n\n``` \n* Replace `<schema>` with a schema name for the project tables.\n* To run your dbt task with a SQL warehouse, replace `<http-host>` with the **Server Hostname** value from the [Connection Details](https:\/\/docs.databricks.com\/integrations\/compute-details.html) tab for your SQL warehouse. To run your dbt task with all-purpose compute, replace `<http-host>` with the **Server Hostname** value from the [Advanced Options, JDBC\/ODBC](https:\/\/docs.databricks.com\/integrations\/compute-details.html) tab for your Databricks compute.\n* To run your dbt task with a SQL warehouse, replace `<http-path>` with the **HTTP Path** value from the [Connection Details](https:\/\/docs.databricks.com\/integrations\/compute-details.html) tab for your SQL warehouse. 
To run your dbt task with all-purpose compute, replace `<http-path>` with the **HTTP Path** value from the [Advanced Options, JDBC\/ODBC](https:\/\/docs.databricks.com\/integrations\/compute-details.html) tab for your Databricks compute. You do not specify secrets, such as access tokens, in the file because you will check this file into source control. Instead, this file uses the dbt templating functionality to insert credentials dynamically at runtime. \nNote \nThe generated credentials are valid for the duration of the run, up to a maximum of 30 days, and are automatically revoked after completion.\n4. Check this file into Git and push it to your forked repository. For example, you could run commands like the following: \n```\ngit add profiles.yml\ngit commit -m \"adding profiles.yml for my Databricks job\"\ngit push\n\n```\n5. Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar of the Databricks UI.\n6. Select the dbt job and click the **Tasks** tab.\n7. In **Source**, click **Edit** and enter your forked jaffle shop GitHub repository details. \n![Configure forked project repo](https:\/\/docs.databricks.com\/_images\/dbt_task_configure_repo.png)\n8. In **SQL warehouse**, select **None (Manual)**.\n9. In **Profiles Directory**, enter the relative path to the directory containing the `profiles.yml` file. Leave the path value blank to use the default of the repository root.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbt-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use dbt transformations in a Databricks job\n###### (Advanced) Use dbt Python models in a workflow\n\nNote \ndbt support for Python models is in beta and requires dbt 1.3 or greater. 
\ndbt now supports [Python models](https:\/\/docs.getdbt.com\/docs\/building-a-dbt-project\/building-models\/python-models) on specific data warehouses, including Databricks. With dbt Python models, you can use tools from the Python ecosystem to implement transformations that are difficult to implement with SQL. You can create a Databricks job to run a single task with your dbt Python model, or you can include the dbt task as part of a workflow that includes multiple tasks. \nYou cannot run Python models in a dbt task using a SQL warehouse. For more information about using dbt Python models with Databricks, see [Specific data warehouses](https:\/\/docs.getdbt.com\/docs\/building-a-dbt-project\/building-models\/python-models#specific-data-warehouses) in the dbt documentation.\n\n##### Use dbt transformations in a Databricks job\n###### Errors and troubleshooting\n\n### *Profile file does not exist* error \n**Error message**: \n```\ndbt looked for a profiles.yml file in \/tmp\/...\/profiles.yml but did not find one.\n\n``` \n**Possible causes**: \nThe `profiles.yml` file was not found in the specified $PATH. Make sure the root of your dbt project contains the profiles.yml file.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbt-in-workflows.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Create a dashboard\n\nLearn how to use the dashboard UI to create and share insights. For information about dashboard features, see [Dashboards](https:\/\/docs.databricks.com\/dashboards\/index.html). 
\nThe steps in this tutorial demonstrate how to build and share the following dashboard: \n![A published dashboard, configured using the steps in this tutorial.](https:\/\/docs.databricks.com\/_images\/sample-lakeview-dash.png)\n\n##### Create a dashboard\n###### Requirements\n\n* You are logged into a Databricks workspace.\n* You have the SQL entitlement in that workspace.\n* You have at least CAN USE access to one or more SQL warehouses.\n\n##### Create a dashboard\n###### Step 1. Create a dashboard\n\nClick ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar and select **Dashboard**. \nBy default, your new dashboard is automatically named with its creation timestamp and stored in your `\/Workspace\/Users\/<username>` directory. \nNote \nYou can also create a new dashboard from the Dashboards listing page or the **Add** button ![Add button](https:\/\/docs.databricks.com\/_images\/add-button.png) in the Workspace menu.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-dashboard.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Create a dashboard\n###### Step 2. Define datasets\n\nThe **Canvas** tab is for creating and editing widgets like visualizations, text boxes, and filters. The **Data** tab is for defining the underlying datasets used in your dashboard. \nNote \nAll users can write SQL queries to define a dataset. Users in Unity Catalog-enabled workspaces can instead select a Unity Catalog table or view as a dataset. \n1. Click the **Data** tab.\n2. Click **Create from SQL** and paste in the following query. Then click **Run** to return a collection of records. 
\n```\nSELECT\nT.tpep_pickup_datetime,\nT.tpep_dropoff_datetime,\nT.fare_amount,\nT.pickup_zip,\nT.dropoff_zip,\nT.trip_distance,\nT.weekday,\nCASE\nWHEN T.weekday = 1 THEN 'Sunday'\nWHEN T.weekday = 2 THEN 'Monday'\nWHEN T.weekday = 3 THEN 'Tuesday'\nWHEN T.weekday = 4 THEN 'Wednesday'\nWHEN T.weekday = 5 THEN 'Thursday'\nWHEN T.weekday = 6 THEN 'Friday'\nWHEN T.weekday = 7 THEN 'Saturday'\nELSE 'N\/A'\nEND AS day_of_week\nFROM\n(\nSELECT\ndayofweek(tpep_pickup_datetime) as weekday,\n*\nFROM\n`samples`.`nyctaxi`.`trips`\nWHERE\ntrip_distance > 0\nAND trip_distance < 10\nAND fare_amount > 0\nAND fare_amount < 50\n) T\nORDER BY\nT.weekday\n\n```\n3. Inspect your results. The returned records appear under the editor when the query is finished running.\n4. Change the name of your query. Your newly defined dataset is autosaved with the name, \u201cUntitled dataset.\u201d Double click on the title to rename it, \u201cTaxicab data\u201d. \nNote \nThis query accesses data from the **samples** catalog on Databricks. The table includes publicly available taxicab data from New York City in 2016. Query results are limited to valid rides that are under 10 miles and cost less than fifty dollars.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-dashboard.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Create a dashboard\n###### Step 3. Create and place a visualization\n\nTo create your first visualization, complete the following steps: \n1. Click the **Canvas** tab.\n2. Click ![Create Icon](https:\/\/docs.databricks.com\/_images\/lakeview-create.png) **Create a visualization** to create a visualization widget and use your mouse to place it in the canvas. 
\n![A visualization moves from the canvas control panel to the canvas grid](https:\/\/docs.databricks.com\/_images\/lakeview-place-chart.gif)\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-dashboard.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Create a dashboard\n###### Step 4. Configure your visualization\n\nWhen a visualization widget is selected, you can use the configuration panel on the right side of the screen to display your data. As shown in the following image, only one **Dataset** has been defined, and it is selected automatically. \n![Configuration panel for a visualization](https:\/\/docs.databricks.com\/_images\/lakeview-config-panel.png) \n### Set up the X-axis \n1. If necessary, select **Bar** from the **Visualization** dropdown menu.\n2. Click the ![add field icon](https:\/\/docs.databricks.com\/_images\/lakeview-add-vis-field.png) to choose the data presented along the **X-axis**. You can use the search bar to search for a field by name. Select **tpep\_dropoff\_datetime**.\n3. Click the field name you selected to view additional configuration options. \n* As the **Scale Type**, select **Temporal**.\n* For the **Transform** selection, choose **HOURLY**. \n### Set up the Y-axis \n1. Click the ![add field icon](https:\/\/docs.databricks.com\/_images\/lakeview-add-vis-field.png) next to the **Y-axis** to select the **fare\_amount** for the data presented along the y-axis.\n2. Click the field name you selected to view additional configuration options. 
\n* As the **Scale Type**, select **Quantitative**.\n* For the **Transform** selection, choose **AVG**.\n![A chart configured with the provided specifications shows a bar chart with the axis titles \"fare amount\" and \"tpep_dropoff_datetime\"](https:\/\/docs.databricks.com\/_images\/lakeview-ex-chart.png) \n#### Optional: Create visualizations with Databricks Assistant \nYou can create visualizations using natural language with the Databricks Assistant. \nTo generate the same chart as above, choose one of the following options: \n* To create a new visualization widget: \n+ Click ![Create Icon](https:\/\/docs.databricks.com\/_images\/lakeview-create.png) **Create a visualization**. The widget appears with the prompt: **Describe a chart\u2026**.\n+ Type \u201cBar chart of average fare amount over hourly dropoff time\u201d\n* To edit an existing widget: \n+ Click the ![Databricks Assistant](https:\/\/docs.databricks.com\/_images\/lakeview-sparkle-assist.png)**Assistant** icon. An input prompt appears. Enter a new prompt for your chart. You can ask for a new chart entirely or ask for modifications. For example, you can type, \u201cSwitch to a line chart\u201d to modify the chart type.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-dashboard.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Create a dashboard\n###### Step 5. Clone and modify a visualization\n\nYou can clone an existing chart to create a new visualization. \n1. Right-click on your existing chart and then click **Clone**.\n2. With your new chart selected, use the configuration panel to change the **X-axis** field to **tpep\_pickup\_datetime**. If necessary, choose **HOURLY** under the **Transform** type.\n3. Use the **Color\/Group by** selector to choose a new color for your new bar chart.\n\n##### Create a dashboard\n###### Step 6. Create a scatterplot\n\nCreate a new scatterplot with colors differentiated by value. 
To create a scatterplot, complete the following steps: \n1. Click the **Create a visualization** icon ![Create Icon](https:\/\/docs.databricks.com\/_images\/lakeview-create.png) to create a new visualization widget.\n2. Configure your chart by making the following selections: \n* **Dataset**: **Taxicab data**\n* **Visualization**: **Scatter**\n* **X axis**: **trip\\_distance**\n* **Y axis**: **fare\\_amount**\n* **Color\/Group by**: **day\\_of\\_week**\nNote \nAfter colors have been auto-assigned by category, you can change the color associated with a particular value by clicking on the color in the configuration panel.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-dashboard.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Create a dashboard\n###### Step 7. Create dashboard filters\n\nYou can use filters to make your dashboards interactive. In this step, you create filters on three fields. \n### Create a date range filter \n1. Click ![Filter Icon](https:\/\/docs.databricks.com\/_images\/lakeview-filter.png) **Add a filter (field\/parameter)** to add a filter widget. Place it on the canvas.\n2. From the **Filter** dropdown menu in the configuration panel, select **Date range picker**.\n3. Select the **Title** checkbox to create a title field on your filter. Click the placeholder title and type **Date range** to retitle your filter.\n4. From the **Fields** menu, select **Taxicab\\_data.tpep\\_pickup\\_datetime**. \n### Create a single-select dropdown filter \n1. Click ![Filter Icon](https:\/\/docs.databricks.com\/_images\/lakeview-filter.png) **Add a filter (field\/parameter)** to add a filter widget. Place it on the canvas.\n2. From the **Filter** dropdown menu in the configuration panel, select **Dropdown (single-select)**.\n3. Select the **Title** checkbox to create a title field on your filter. 
Click on the placeholder title and type **Dropoff zip code** to retitle your filter.\n4. From the **Fields** menu, select **Taxicab\\_data.dropoff\\_zip**. \n### Clone a filter \n1. Right-click on your **Dropoff zip code** filter. Then, click **Clone**.\n2. Click the ![remove field icon](https:\/\/docs.databricks.com\/_images\/lakeview-remove-field.png) to remove the current field. Then, select **Taxicab\\_data.pickup\\_zip** to filter on that field.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-dashboard.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Create a dashboard\n###### Step 8. Resize and arrange charts and filters\n\nUse your mouse to arrange and resize your charts and filters. \nThe following image shows one possible arrangement for this dashboard. \n![3 filters take up the top two rows of the Canvas grid. 2 bar charts are stacked underneath. A scatterplot sits next to the two bar charts.](https:\/\/docs.databricks.com\/_images\/lakeview-move-widgets.gif)\n\n##### Create a dashboard\n###### Step 9. Publish and share\n\nWhile you develop a dashboard, your progress is saved as a draft. To create a clean copy for easy consumption, publish your dashboard. \n1. Click the ![V share icon](https:\/\/docs.databricks.com\/_images\/lakeview-v-share.png) next to the **Share** button. Then, click **Publish**.\n2. Review **People with access** and then click **Publish**. Individual users and groups with at least CAN VIEW permission can see your published dashboard.\n3. Follow the link provided in the Publish notification to view your published dashboard. \n![A notification message with a link appears in the top right corner of the screen.](https:\/\/docs.databricks.com\/_images\/lakeview-publish-message.png)\n4. To change the list of users or groups you want to share the dashboard with, return to your draft dashboard and click the **Share** button. 
Add users or groups who you want to share with. Set permission levels as appropriate. See [Dashboard ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#lakeview) to learn more about permissions and rights.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-dashboard.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Common data loading patterns using COPY INTO\n\nLearn common patterns for using `COPY INTO` to load data from file sources into Delta Lake. \nThere are many options for using `COPY INTO`. You can also [use temporary credentials with COPY INTO](https:\/\/docs.databricks.com\/ingestion\/copy-into\/temporary-credentials.html) in combination with these patterns. \nSee [COPY INTO](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-copy-into.html) for a full reference of all options.\n\n#### Common data loading patterns using COPY INTO\n##### Create target tables for COPY INTO\n\n`COPY INTO` must target an existing Delta table. 
In Databricks Runtime 11.3 LTS and above, setting the schema for these tables is optional for formats that support [schema evolution](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html): \n```\nCREATE TABLE IF NOT EXISTS my_table\n[(col_1 col_1_type, col_2 col_2_type, ...)]\n[COMMENT <table-description>]\n[TBLPROPERTIES (<table-properties>)];\n\n``` \nNote that to infer the schema with `COPY INTO`, you must pass additional options: \n```\nCOPY INTO my_table\nFROM '\/path\/to\/files'\nFILEFORMAT = <format>\nFORMAT_OPTIONS ('inferSchema' = 'true')\nCOPY_OPTIONS ('mergeSchema' = 'true');\n\n``` \nThe following example creates a schemaless Delta table called `my_pipe_data` and loads a pipe-delimited CSV with a header: \n```\nCREATE TABLE IF NOT EXISTS my_pipe_data;\n\nCOPY INTO my_pipe_data\nFROM 's3:\/\/my-bucket\/pipeData'\nFILEFORMAT = CSV\nFORMAT_OPTIONS ('mergeSchema' = 'true',\n'delimiter' = '|',\n'header' = 'true')\nCOPY_OPTIONS ('mergeSchema' = 'true');\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/examples.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Common data loading patterns using COPY INTO\n##### Load JSON data with COPY INTO\n\nThe following example loads JSON data from five files in Amazon S3 (S3) into the Delta table called `my_json_data`. This table must be created before `COPY INTO` can be executed. If any data was already loaded from one of the files, the data isn\u2019t reloaded for that file. 
\n```\nCOPY INTO my_json_data\nFROM 's3:\/\/my-bucket\/jsonData'\nFILEFORMAT = JSON\nFILES = ('f1.json', 'f2.json', 'f3.json', 'f4.json', 'f5.json')\n\n-- The second execution will not copy any data since the first command already loaded the data\nCOPY INTO my_json_data\nFROM 's3:\/\/my-bucket\/jsonData'\nFILEFORMAT = JSON\nFILES = ('f1.json', 'f2.json', 'f3.json', 'f4.json', 'f5.json')\n\n```\n\n#### Common data loading patterns using COPY INTO\n##### Load Avro data with COPY INTO\n\nThe following example loads Avro data in S3 using additional SQL expressions as part of the `SELECT` statement. \n```\nCOPY INTO my_delta_table\nFROM (SELECT to_date(dt) dt, event as measurement, quantity::double\nFROM 's3:\/\/my-bucket\/avroData')\nFILEFORMAT = AVRO\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/examples.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Common data loading patterns using COPY INTO\n##### Load CSV files with COPY INTO\n\nThe following example loads CSV files from S3 under `s3:\/\/bucket\/base\/path\/folder1` into a Delta table at `s3:\/\/bucket\/deltaTables\/target`. 
\n```\nCOPY INTO delta.`s3:\/\/bucket\/deltaTables\/target`\nFROM (SELECT key, index, textData, 'constant_value'\nFROM 's3:\/\/bucket\/base\/path')\nFILEFORMAT = CSV\nPATTERN = 'folder1\/file_[a-g].csv'\nFORMAT_OPTIONS('header' = 'true')\n\n-- The example below loads CSV files without headers in S3 using COPY INTO.\n-- By casting the data and renaming the columns, you can put the data in the schema you want\nCOPY INTO delta.`s3:\/\/bucket\/deltaTables\/target`\nFROM (SELECT _c0::bigint key, _c1::int index, _c2 textData\nFROM 's3:\/\/bucket\/base\/path')\nFILEFORMAT = CSV\nPATTERN = 'folder1\/file_[a-g].csv'\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/examples.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Common data loading patterns using COPY INTO\n##### Ignore corrupt files while loading data\n\nIf the data you\u2019re loading can\u2019t be read due to some corruption issue, those files can be skipped by setting `ignoreCorruptFiles` to `true` in the `FORMAT_OPTIONS`. \nThe result of the `COPY INTO` command returns how many files were skipped due to corruption in the `num_skipped_corrupt_files` column. This metric also shows up in the `operationMetrics` column under `numSkippedCorruptFiles` after running `DESCRIBE HISTORY` on the Delta table. \nCorrupt files aren\u2019t tracked by `COPY INTO`, so they can be reloaded in a subsequent run if the corruption is fixed. You can see which files are corrupt by running `COPY INTO` in `VALIDATE` mode. 
\n```\nCOPY INTO my_table\nFROM '\/path\/to\/files'\nFILEFORMAT = <format>\n[VALIDATE ALL]\nFORMAT_OPTIONS ('ignoreCorruptFiles' = 'true')\n\n``` \nNote \n`ignoreCorruptFiles` is available in Databricks Runtime 11.3 LTS and above.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/examples.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### View, manage, and analyze Foundation Model Training runs\n\nImportant \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). Reach out to your Databricks account team to enroll in the Public Preview. \nThis article describes how to view, manage, and analyze Foundation Model Training runs using APIs or using the Databricks UI. \nFor information on creating runs, see [Create a training run using the Foundation Model Training API](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/create-fine-tune-run.html) and [Create a training run using the Foundation Model Training UI](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/ui.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/view-manage-runs.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### View, manage, and analyze Foundation Model Training runs\n##### Use Foundation Model Training APIs to view and manage training runs\n\nThe Foundation Model Training APIs provide the following functions for managing your training runs. \n### Get a run \nUse the `get()` function to return a run by name or run object you have launched. \n```\nfrom databricks.model_training import foundation_model as fm\n\nfm.get('<your-run-name>')\n\n``` \n### List runs \nUse the `list()` function to see the runs you have launched. The following table lists the optional filters you can specify. 
\n| Optional filter | Definition |\n| --- | --- |\n| `finetuning_runs` | A list of runs to get. Defaults to selecting all runs. |\n| `user_emails` | If shared runs is enabled for your workspace, you can filter results by the user who submitted the training run. Defaults to no user filter. |\n| `before` | A datetime or datetime string to filter runs before. Defaults to all runs. |\n| `after` | A datetime or datetime string to filter runs after. Defaults to all runs. | \n```\nfrom databricks.model_training import foundation_model as fm\n\nfm.list()\n\n# filtering example\nfm.list(before='01012023', limit=50)\n\n``` \n### Cancel training runs \nTo cancel a run, use the `cancel()` function and pass the run or a list of the training runs. \n```\nfrom databricks.model_training import foundation_model as fm\n\nrun_to_cancel = '<name-of-run-to-cancel>'\nfm.cancel(run_to_cancel)\n\n``` \n### Delete training runs \nUse `delete()` to delete training runs by passing a single run or a list of runs. \n```\nfrom databricks.model_training import foundation_model as fm\n\nfm.delete('<name-of-run-to-delete>')\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/view-manage-runs.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### View, manage, and analyze Foundation Model Training runs\n##### Review status of training runs\n\nThe following table lists the events created by a training run. Use the `get_events()` function anytime during your run to see your run\u2019s progress. \n| Event type | Example event message | Definition |\n| --- | --- | --- |\n| `CREATED` | Run created. | Training run was created. If resources are available, the run starts. Otherwise, it enters the `Pending` state. |\n| `STARTED` | Run started. | Resources have been allocated, and the run has started. |\n| `DATA_VALIDATED` | Training data validated. 
| Validated that training data is correctly formatted. |\n| `MODEL_INITIALIZED` | Model data downloaded and initialized for base model `meta-llama\/Llama-2-7b-chat-hf`. | Weights for the base model have been downloaded, and training is ready to begin. |\n| `TRAIN_UPDATED` | [epoch=1\/1][batch=50\/56][ETA=5min] Train loss: 1.71 | Reports the current training batch, epoch, or token, estimated time for training to finish (not including checkpoint upload time) and train loss. This event is updated when each batch ends. If the run configuration specifies `max_duration` in `tok` units, progress is reported in tokens. |\n| `TRAIN_FINISHED` | Training completed. | Training has finished. Checkpoint uploading begins. |\n| `COMPLETED` | Run completed. Final weights uploaded. | Checkpoint has been uploaded, and the run has been completed. |\n| `CANCELED` | Run canceled. | The run is canceled if `fm.cancel()` is called on it. |\n| `FAILED` | One or more train dataset samples has unknown keys. Please check the documentation for supported data formats. | The run failed. Check `event_message` for actionable details, or contact support. | \n```\nfrom databricks.model_training import foundation_model as fm\n\nfm.get_events()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/view-manage-runs.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### View, manage, and analyze Foundation Model Training runs\n##### Use the UI to view and manage runs\n\nTo view runs in the UI: \n1. Click **Experiments** in the left nav bar to display the Experiments page.\n2. In the table, click the name of your experiment to display the experiment page. The experiment page lists all runs associated with the experiment. \n![experiment page](https:\/\/docs.databricks.com\/_images\/experiment.png)\n3. 
To display additional information or metrics in the table, click ![plus sign](https:\/\/docs.databricks.com\/_images\/plus-sign.png) and select the items to display from the menu: \n![add metrics to chart](https:\/\/docs.databricks.com\/_images\/menu.png)\n4. Additional run information is available in the **Chart** tab: \n![chart tab](https:\/\/docs.databricks.com\/_images\/chart.png)\n5. You can also click on the name of the run to display the run screen. This screen gives you access to additional details about the run. \n![run page](https:\/\/docs.databricks.com\/_images\/run-screen.png) \n### Checkpoint folder \nTo access the checkpoint folder, click the **Artifacts** tab on the run screen. Open the experiment name, and then open the **checkpoints** folder. \n![checkpoint folder on artifacts tab](https:\/\/docs.databricks.com\/_images\/checkpoint-folder.png) \nThe epoch folders (named `ep<n>-xxx`) contain the weights at each checkpoint and can be used to start another training run from those weights. \nYou can download the contents of the `huggingface` folder and use it as a Hugging Face model.\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/view-manage-runs.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Feature Engineering in Unity Catalog\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/upgrade-feature-table-to-uc.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Feature Engineering in Unity Catalog\n##### Upgrade a workspace feature table to Unity Catalog\n\nThis page describes how to upgrade an existing workspace feature table to Unity Catalog. \nFirst, you must upgrade the underlying workspace Delta table. Follow these instructions: [Upgrade tables and views to Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html). 
\nAfter the underlying table and data are available in Unity Catalog, use `upgrade_workspace_table` to upgrade the workspace feature table metadata to Unity Catalog, as illustrated in the following code. Databricks recommends always using the latest version of `databricks-feature-engineering` for this operation, regardless of the Databricks Runtime version you are using. \n```\n%pip install databricks-feature-engineering --upgrade\n\ndbutils.library.restartPython()\n\nfrom databricks.feature_engineering import UpgradeClient\nupgrade_client = UpgradeClient()\nupgrade_client.upgrade_workspace_table(\nsource_workspace_table='recommender_system.customer_features',\ntarget_uc_table='ml.recommender_system.customer_features'\n)\n\n``` \nThe following metadata is upgraded to Unity Catalog: \n* Primary keys\n* Timeseries columns\n* Table and column comments (descriptions)\n* Table and column tags\n* Notebook and job lineage \nIf the target table has existing table or column comments that are different from the source table, the upgrade method skips upgrading comments and logs a warning. If you are using version 0.1.2 or below of `databricks-feature-engineering`, an error is thrown and the upgrade does not run. For all other metadata, a mismatch between the target table and source table causes an error and prevents the upgrade. 
To bypass the error and overwrite any existing metadata on the target Unity Catalog table, pass `overwrite = True` to the API: \n```\nupgrade_client.upgrade_workspace_table(\nsource_workspace_table='recommender_system.customer_features',\ntarget_uc_table='ml.recommender_system.customer_features',\noverwrite=True\n)\n\n``` \nNote \n* Before calling this API, you must first upgrade the underlying workspace Delta table to Unity Catalog.\n* Upgrading tags and time series columns is not supported in Databricks Runtime 13.2 ML and below.\n* Remember to notify producers and consumers of the upgraded feature table to start using the new table name in Unity Catalog. If the target table in Unity Catalog was upgraded using `CREATE TABLE AS SELECT` or a similar way that cloned the source table, updates to the source table are not automatically synchronized in the target table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/upgrade-feature-table-to-uc.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Hyperparameter tuning\n##### Use distributed training algorithms with Hyperopt\n\nIn addition to single-machine training algorithms such as those from scikit-learn, you can use Hyperopt with distributed training algorithms. In this scenario, Hyperopt generates trials with different hyperparameter settings on the driver node. Each trial is executed from the driver node, giving it access to the full cluster resources. This setup works with any distributed machine learning algorithms or libraries, including Apache Spark MLlib and HorovodRunner. \nWhen you use Hyperopt with distributed training algorithms, do not pass a `trials` argument to `fmin()`, and specifically, do not use the `SparkTrials` class. `SparkTrials` is designed to distribute trials for algorithms that are not themselves distributed. With distributed training algorithms, use the default `Trials` class, which runs on the cluster driver. 
Hyperopt evaluates each trial on the driver node so that the ML algorithm itself can initiate distributed training. \nNote \nDatabricks does not support automatic logging to MLflow with the `Trials` class. When using distributed training algorithms, you must manually call MLflow to log trials for Hyperopt.\n\n##### Use distributed training algorithms with Hyperopt\n###### Notebook example: Use Hyperopt with MLlib algorithms\n\nThe example notebook shows how to use Hyperopt to tune MLlib\u2019s distributed training algorithms. \n### Hyperopt and MLlib distributed training notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/hyperopt-spark-ml.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-distributed-ml.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Hyperparameter tuning\n##### Use distributed training algorithms with Hyperopt\n###### Notebook example: Use Hyperopt with HorovodRunner\n\nHorovodRunner is a general API used to run distributed deep learning workloads on Databricks. HorovodRunner integrates [Horovod](https:\/\/github.com\/horovod\/horovod) with Spark\u2019s [barrier mode](https:\/\/issues.apache.org\/jira\/browse\/SPARK-24374) to provide higher stability for long-running deep learning training jobs on Spark. \nThe example notebook shows how to use Hyperopt to tune distributed training for deep learning based on HorovodRunner. 
\n### Hyperopt and HorovodRunner distributed training notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/hyperopt-distributed-ml-training.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-distributed-ml.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Secure cluster connectivity\n\nSecure cluster connectivity means that customer VPCs in the classic compute plane have no open ports and classic compute resources have no public IP addresses. \n* At a network level, each cluster initiates a connection to the control plane secure cluster connectivity relay during cluster creation. The cluster establishes this connection using port 443 (HTTPS) and uses a different IP address than is used for the Web application and REST API.\n* When the control plane logically starts new Databricks Runtime jobs or performs other cluster administration tasks, these requests are sent to the cluster through this tunnel.\n* The compute plane (the VPC) has no open ports, and classic compute plane resources have no public IP addresses. \nBenefits: \n* Easy network administration, with no need to configure ports on security groups or to configure network peering.\n* With enhanced security and simple network administration, information security teams can expedite approval of Databricks as a PaaS provider. \nNote \nAlthough the serverless compute plane does not use the secure cluster connectivity relay for the classic compute plane, serverless SQL warehouses do **not** have public IP addresses. 
\n![Secure cluster connectivity](https:\/\/docs.databricks.com\/_images\/secure-cluster-connectivity-aws.png)\n\n##### Secure cluster connectivity\n###### Use secure cluster connectivity\n\nTo use secure cluster connectivity for a workspace, create a new workspace. You cannot add secure cluster connectivity to an existing workspace.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/secure-cluster-connectivity.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Create and run Databricks Jobs\n\nThis article details how to create and run Databricks Jobs using the Jobs UI. \nTo learn about configuration options for jobs and how to edit your existing jobs, see [Configure settings for Databricks jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html). \nTo learn how to manage and monitor job runs, see [View and manage job runs](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html). \nTo create your first workflow with a Databricks job, see the [quickstart](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-quickstart.html). \nImportant \n* A workspace is limited to 1000 concurrent task runs. A `429 Too Many Requests` response is returned when you request a run that cannot start immediately.\n* The number of jobs a workspace can create in an hour is limited to 10000 (includes \u201cruns submit\u201d). 
This limit also affects jobs created by the REST API and notebook workflows.\n\n#### Create and run Databricks Jobs\n##### Create and run jobs using the CLI, API, or notebooks\n\n* To learn about using the Databricks CLI to create and run jobs, see [What is the Databricks CLI?](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html).\n* To learn about using the Jobs API to create and run jobs, see [Jobs](https:\/\/docs.databricks.com\/api\/workspace\/jobs) in the REST API reference.\n* To learn how to run and schedule jobs directly in a Databricks notebook, see [Create and manage scheduled notebook jobs](https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Create and run Databricks Jobs\n##### Create a job\n\n1. Do one of the following: \n* Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar and click ![Create Job Button](https:\/\/docs.databricks.com\/_images\/create-job.png).\n* In the sidebar, click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** and select **Job**. The **Tasks** tab appears with the create task dialog along with the **Job details** side panel containing job-level settings. \n![Create job screen](https:\/\/docs.databricks.com\/_images\/create-job-ui.png)\n2. Replace **New Job\u2026** with your job name.\n3. Enter a name for the task in the **Task name** field.\n4. In the **Type** drop-down menu, select the type of task to run. See [Task type options](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#task-types).\n5. Configure the cluster where the task runs. By default, serverless compute is selected if your workspace is Unity Catalog-enabled and you have selected a task supported by serverless compute for workflows. 
See [Run your Databricks job with serverless compute for workflows](https:\/\/docs.databricks.com\/workflows\/jobs\/run-serverless-jobs.html). If serverless compute is not available, or you want to use a different compute type, you can select a new job cluster or an existing all-purpose cluster in the **Compute** dropdown menu. \n* **New Job Cluster**: Click **Edit** in the **Cluster** drop-down menu and complete the [cluster configuration](https:\/\/docs.databricks.com\/compute\/configure.html).\n* **Existing All-Purpose Cluster**: Select an existing cluster in the **Cluster** drop-down menu. To open the cluster on a new page, click the ![External Link](https:\/\/docs.databricks.com\/_images\/external-link.png) icon to the right of the cluster name and description. To learn more about selecting and configuring clusters to run tasks, see [Use Databricks compute with your jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/use-compute.html).\n6. To add dependent libraries, click **+ Add** next to **Dependent libraries**. See [Configure dependent libraries](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#task-config-dependent-libraries).\n7. You can pass parameters for your task. For information on the requirements for formatting and passing parameters, see [Pass parameters to a Databricks job task](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#task-parameters).\n8. To optionally receive notifications for task start, success, or failure, click **+ Add** next to **Emails**. Failure notifications are sent on initial task failure and any subsequent retries. To filter notifications and reduce the number of emails sent, check **Mute notifications for skipped runs**, **Mute notifications for canceled runs**, or **Mute notifications until the last retry**.\n9. To optionally configure a retry policy for the task, click **+ Add** next to **Retries**. 
See [Configure a retry policy for a task](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#retry-policies).\n10. To optionally configure the task\u2019s expected duration or timeout, click **+ Add** next to **Duration threshold**. See [Configure an expected completion time or a timeout for a task](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#timeout-setting-task).\n11. Click **Create**. \nAfter creating the first task, you can configure job-level settings such as notifications, job triggers, and permissions. See [Edit a job](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#job-edit). \nTo add another task, click ![Add Task Button](https:\/\/docs.databricks.com\/_images\/add-task.png) in the DAG view. A shared cluster option is provided if you have selected **Serverless** compute or configured a **New Job Cluster** for a previous task. You can also configure a cluster for each task when you create or edit a task. To learn more about selecting and configuring clusters to run tasks, see [Use Databricks compute with your jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/use-compute.html). \nYou can optionally configure job-level settings such as notifications, job triggers, and permissions. See [Edit a job](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#job-edit). You can also configure job-level parameters that are shared with the job\u2019s tasks. 
See [Add parameters for all job tasks](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#job-parameters).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Create and run Databricks Jobs\n##### Task type options\n\nThe following are the task types you can add to your Databricks job and available options for the different task types: \n* **Notebook**: In the **Source** drop-down menu, select **Workspace** to use a notebook located in a Databricks workspace folder or **Git provider** for a notebook located in a remote Git repository. \n**Workspace**: Use the file browser to find the notebook, click the notebook name, and click **Confirm**. \n**Git provider**: Click **Edit** or **Add a git reference** and enter the Git repository information. See [Use a notebook from a remote Git repository](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-repos.html#notebook-from-repo). \nNote \nTotal notebook cell output (the combined output of all notebook cells) is subject to a 20MB size limit. Additionally, individual cell output is subject to an 8MB size limit. If total cell output exceeds 20MB in size, or if the output of an individual cell is larger than 8MB, the run is canceled and marked as failed. \nIf you need help finding cells near or beyond the limit, run the notebook against an all-purpose cluster and use this [notebook autosave technique](https:\/\/kb.databricks.com\/notebooks\/notebook-autosave.html).\n* **JAR**: Specify the **Main class**. Use the fully qualified name of the class containing the main method, for example, `org.apache.spark.examples.SparkPi`. Then click **Add** under **Dependent Libraries** to add libraries required to run the task. One of these libraries must contain the main class. 
\nTo learn more about JAR tasks, see [Use a JAR in a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-jars-in-workflows.html).\n* **Spark Submit**: In the **Parameters** text box, specify the main class, the path to the library JAR, and all arguments, formatted as a JSON array of strings. The following example configures a spark-submit task to run the `DFSReadWriteTest` from the Apache Spark examples: \n```\n[\"--class\",\"org.apache.spark.examples.DFSReadWriteTest\",\"dbfs:\/FileStore\/libraries\/spark_examples_2_12_3_1_1.jar\",\"\/discover\/databricks-datasets\/README.md\",\"\/FileStore\/examples\/output\/\"]\n\n``` \nImportant \nThere are several limitations for **spark-submit** tasks: \n+ You can run spark-submit tasks only on new clusters.\n+ Spark-submit does not support cluster autoscaling. To learn more about autoscaling, see [Cluster autoscaling](https:\/\/docs.databricks.com\/compute\/configure.html#autoscaling).\n+ Spark-submit does not support [Databricks Utilities (dbutils) reference](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html). To use Databricks Utilities, use JAR tasks instead.\n+ If you are using a Unity Catalog-enabled cluster, spark-submit is supported only if the cluster uses the assigned [access mode](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode). Shared access mode is not supported.\n+ Spark Streaming jobs should never have maximum concurrent runs set to greater than 1. Streaming jobs should be set to run using the cron expression `\"* * * * * ?\"` (every minute). Because a streaming task runs continuously, it should always be the final task in a job.\n* **Python script**: In the **Source** drop-down menu, select a location for the Python script, either **Workspace** for a script in the local workspace, **DBFS** for a script located on DBFS, or **Git provider** for a script located in a Git repository. 
In the **Path** textbox, enter the path to the Python script: \n**Workspace**: In the **Select Python File** dialog, browse to the Python script and click **Confirm**. \n**DBFS**: Enter the URI of a Python script on DBFS or cloud storage; for example, `dbfs:\/FileStore\/myscript.py`. \n**Git provider**: Click **Edit** and enter the Git repository information. See [Use Python code from a remote Git repository](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-repos.html#python-from-repo).\n* **Delta Live Tables Pipeline**: In the **Pipeline** drop-down menu, select an existing [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html) pipeline. \nImportant \nYou can use only triggered pipelines with the **Pipeline** task. Continuous pipelines are not supported as a job task. To learn more about triggered and continuous pipelines, see [Continuous vs. triggered pipeline execution](https:\/\/docs.databricks.com\/delta-live-tables\/updates.html#continuous-triggered).\n* **Python Wheel**: In the **Package name** text box, enter the package to import, for example, `myWheel-1.0-py2.py3-none-any.whl`. In the **Entry Point** text box, enter the function to call when starting the Python wheel file. Click **Add** under **Dependent Libraries** to add libraries required to run the task.\n* **SQL**: In the **SQL task** drop-down menu, select **Query**, **Legacy dashboard**, **Alert**, or **File**. \nNote \n+ The **SQL** task requires Databricks SQL and a [serverless or pro SQL warehouse](https:\/\/docs.databricks.com\/admin\/sql\/warehouse-types.html). \n**Query**: In the **SQL query** drop-down menu, select the query to run when the task runs. \n**Legacy dashboard**: In the **SQL dashboard** drop-down menu, select a dashboard to be updated when the task runs. \n**Alert**: In the **SQL alert** drop-down menu, select an alert to trigger for evaluation. 
\n**File**: To use a SQL file located in a Databricks workspace folder, in the **Source** drop-down menu, select **Workspace**, use the file browser to find the SQL file, click the filename, and click **Confirm**. To use a SQL file located in a remote Git repository, select **Git provider**, click **Edit** or **Add a git reference** and enter details for the Git repository. See [Use SQL queries from a remote Git repository](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-repos.html#sql-from-repo). \nIn the **SQL warehouse** drop-down menu, select a serverless or pro SQL warehouse to run the task.\n* **dbt**: See [Use dbt transformations in a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbt-in-workflows.html) for a detailed example of configuring a dbt task.\n* **Run Job**: In the **Job** drop-down menu, select a job to be run by the task. To search for the job to run, start typing the job name in the **Job** menu. \nImportant \nYou should not create jobs with circular dependencies when using the `Run Job` task or jobs that nest more than three `Run Job` tasks. Circular dependencies are `Run Job` tasks that directly or indirectly trigger each other. For example, Job A triggers Job B, and Job B triggers Job A. Databricks does not support jobs with circular dependencies or that nest more than three `Run Job` tasks and might not allow running these jobs in future releases.\n* **If\/else**: To learn how to use the `If\/else condition` task, see [Add branching logic to your job with the If\/else condition task](https:\/\/docs.databricks.com\/workflows\/jobs\/conditional-tasks.html#if-else-condition).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Create and run Databricks Jobs\n##### Pass parameters to a Databricks job task\n\nYou can pass parameters to many of the job task types. 
Each task type has different requirements for formatting and passing the parameters. \nTo access information about the current task, such as the task name, or pass context about the current run between job tasks, such as the start time of the job or the identifier of the current job run, use [dynamic value references](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html). To view a list of available dynamic value references, click **Browse dynamic values**. \nIf [job parameters](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#job-parameters) are configured on the job a task belongs to, those parameters are displayed when you add task parameters. If job and task parameters share a key, the job parameter takes precedence. A warning is shown in the UI if you attempt to add a task parameter with the same key as a job parameter. To pass job parameters to tasks that are not configured with key-value parameters such as `JAR` or `Spark Submit` tasks, format arguments as `{{job.parameters.[name]}}`, replacing `[name]` with the `key` that identifies the parameter. \n* **Notebook**: Click **Add** and specify the key and value of each parameter to pass to the task. You can override or add additional parameters when you manually run a task using the [Run a job with different parameters](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#job-run-with-different-params) option. Parameters set the value of the [notebook widget](https:\/\/docs.databricks.com\/notebooks\/widgets.html) specified by the key of the parameter.\n* **JAR**: Use a JSON-formatted array of strings to specify parameters. These strings are passed as arguments to the main method of the main class. See [Configuring JAR job parameters](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-jars-in-workflows.html#jar-job-parameters).\n* **Spark Submit**: Parameters are specified as a JSON-formatted array of strings. 
Conforming to the [Apache Spark spark-submit](https:\/\/spark.apache.org\/docs\/latest\/submitting-applications.html) convention, parameters after the JAR path are passed to the main method of the main class.\n* **Python Wheel**: In the **Parameters** drop-down menu, select **Positional arguments** to enter parameters as a JSON-formatted array of strings, or select **Keyword arguments > Add** to enter the key and value of each parameter. Both positional and keyword arguments are passed to the Python wheel task as command-line arguments. To see an example of reading arguments in a Python script packaged in a Python wheel file, see [Use a Python wheel file in a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-python-wheels-in-workflows.html).\n* **Run Job**: Enter the key and value of each job parameter to pass to the job. \n* **Python script**: Use a JSON-formatted array of strings to specify parameters. These strings are passed as arguments and can be read as positional arguments or parsed using the [argparse](https:\/\/docs.python.org\/3\/library\/argparse.html) module in Python. 
To see an example of reading positional arguments in a Python script, see [Step 2: Create a script to fetch GitHub data](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html#github-script).\n* **SQL**: If your task runs a [parameterized query](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-parameters.html) or a [parameterized dashboard](https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html#query-params-dashboards-legacy), enter values for the parameters in the provided text boxes.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Create and run Databricks Jobs\n##### Copy a task path\n\nCertain task types, for example, notebook tasks, allow you to copy the path to the task source code: \n1. Click the **Tasks** tab.\n2. Select the task containing the path to copy.\n3. Click ![Jobs Copy Icon](https:\/\/docs.databricks.com\/_images\/copy-icon1.png) next to the task path to copy the path to the clipboard.\n\n#### Create and run Databricks Jobs\n##### Create a job from an existing job\n\nYou can quickly create a new job by cloning an existing job. Cloning a job creates an identical copy of the job, except for the job ID. On the job\u2019s page, click **More \u2026** next to the job\u2019s name and select **Clone** from the drop-down menu.\n\n#### Create and run Databricks Jobs\n##### Create a task from an existing task\n\nYou can quickly create a new task by cloning an existing task: \n1. On the job\u2019s page, click the **Tasks** tab.\n2. Select the task to clone.\n3. 
Click ![Jobs Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/jobs-vertical-ellipsis.png) and select **Clone task**.\n\n#### Create and run Databricks Jobs\n##### Delete a job\n\nTo delete a job, on the job\u2019s page, click **More \u2026** next to the job\u2019s name and select **Delete** from the drop-down menu.\n\n#### Create and run Databricks Jobs\n##### Delete a task\n\nTo delete a task: \n1. Click the **Tasks** tab.\n2. Select the task to be deleted.\n3. Click ![Jobs Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/jobs-vertical-ellipsis.png) and select **Remove task**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Create and run Databricks Jobs\n##### Run a job\n\n1. Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar.\n2. Select a job and click the **Runs** tab. You can run a job immediately or schedule the job to run later. \nIf one or more tasks in a job with multiple tasks are unsuccessful, you can re-run the subset of unsuccessful tasks. See [Re-run failed and skipped tasks](https:\/\/docs.databricks.com\/workflows\/jobs\/repair-job-failures.html#repair-run). \n### Run a job immediately \nTo run the job immediately, click ![Run Now Button](https:\/\/docs.databricks.com\/_images\/run-now-button.png). \nTip \nYou can perform a test run of a job with a notebook task by clicking **Run Now**. If you need to make changes to the notebook, clicking **Run Now** again after editing the notebook will automatically run the new version of the notebook. \n### Run a job with different parameters \nYou can use **Run Now with Different Parameters** to re-run a job with different parameters or different values for existing parameters. 
\nNote \nYou cannot override job parameters if a job that was run before the introduction of job parameters overrode task parameters with the same key. \n1. Click ![Blue Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret-blue.png) next to **Run Now** and select **Run Now with Different Parameters** or, in the [Active Runs](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#view-job-run-list) table, click **Run Now with Different Parameters**. Enter the new parameters depending on the type of task. See [Pass parameters to a Databricks job task](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#task-parameters).\n2. Click **Run**. \n### Run a job as a service principal \nNote \nIf your job runs SQL queries using the SQL task, the identity used to run the queries is determined by the sharing settings of each query, even if the job runs as a service principal. If a query is configured to `Run as owner`, the query is always run using the owner\u2019s identity and not the service principal\u2019s identity. If the query is configured to `Run as viewer`, the query is run using the service principal\u2019s identity. To learn more about query sharing settings, see [Configure query permissions](https:\/\/docs.databricks.com\/sql\/user\/queries\/index.html#share). \nBy default, jobs run as the identity of the job owner. This means that the job assumes the permissions of the job owner. The job can only access data and Databricks objects that the job owner has permissions to access. You can change the identity that the job is running as to a [service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html). Then, the job assumes the permissions of that service principal instead of the owner. \nTo change the **Run as** setting, you must have either the CAN MANAGE or IS OWNER permission on the job. 
You can set the **Run as** setting to yourself or to any service principal in the workspace on which you have the **Service Principal User** role. For more information, see [Roles for managing service principals](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/service-principal-acl.html). \nNote \nWhen the `RestrictWorkspaceAdmins` setting on a workspace is set to `ALLOW ALL`, workspace admins can also change the **Run as** setting to any user in their workspace. To restrict workspace admins to only change the **Run as** setting to themselves or service principals that they have the **Service Principal User** role on, see [Restrict workspace admins](https:\/\/docs.databricks.com\/admin\/workspace-settings\/restrict-workspace-admins.html). \nTo change the run as field, do the following: \n1. In the sidebar, click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows**.\n2. In the **Name** column, click the job name.\n3. In the **Job details** side panel, click the pencil icon next to the **Run as** field.\n4. Search for and select the service principal.\n5. Click **Save**. \nYou can also list the service principals that you have the **User** role on using the Workspace Service Principals API. For more information, see [List the service principals that you can use](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/service-principal-acl.html#list-sps).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Create and run Databricks Jobs\n##### Run a job on a schedule\n\nYou can use a schedule to automatically run your Databricks job at specified times and periods. 
See [Add a job schedule](https:\/\/docs.databricks.com\/workflows\/jobs\/schedule-jobs.html#job-schedule).\n\n#### Create and run Databricks Jobs\n##### Run a continuous job\n\nYou can ensure there\u2019s always an active run of your job. See [Run a continuous job](https:\/\/docs.databricks.com\/workflows\/jobs\/schedule-jobs.html#continuous-jobs).\n\n#### Create and run Databricks Jobs\n##### Run a job when new files arrive\n\nTo trigger a job run when new files arrive in a Unity Catalog external location or volume, use a [file arrival trigger](https:\/\/docs.databricks.com\/workflows\/jobs\/file-arrival-triggers.html).\n\n#### Create and run Databricks Jobs\n##### View and run a job created with a Databricks Asset Bundle\n\nYou can use the Databricks Jobs UI to view and run jobs deployed by a [Databricks Asset Bundle](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-bundles-with-jobs.html). By default, these jobs are read-only in the Jobs UI. To edit a job deployed by a bundle, change the bundle configuration file and redeploy the job. Applying changes only to the bundle configuration ensures that the bundle source files always capture the current job configuration. \nHowever, if you must make immediate changes to a job, you can disconnect the job from the bundle configuration to enable editing the job settings in the UI. To disconnect the job, click **Disconnect from source**. In the **Disconnect from source** dialog, click **Disconnect** to confirm. \nAny changes you make to the job in the UI are not applied to the bundle configuration. To apply changes you make in the UI to the bundle, you must manually update the bundle configuration. 
To reconnect the job to the bundle configuration, redeploy the job using the bundle.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Create and run Databricks Jobs\n##### What if my job cannot run because of concurrency limits?\n\nNote \nQueueing is enabled by default when jobs are created in the UI. \nTo prevent runs of a job from being skipped because of concurrency limits, you can enable queueing for the job. When queueing is enabled, if resources are unavailable for a job run, the run is queued for up to 48 hours. When capacity is available, the job run is dequeued and run. Queued runs are displayed in the [runs list for the job](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#view-job-run-list) and the [recent job runs list](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#view-runs). \nA run is queued when one of the following limits is reached: \n* The maximum concurrent active runs in the workspace.\n* The maximum concurrent `Run Job` task runs in the workspace.\n* The maximum concurrent runs of the job. \nQueueing is a job-level property that queues runs only for that job. \nTo enable or disable queueing, click **Advanced settings** and click the **Queue** toggle button in the **Job details** side panel.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Microsoft Azure Synapse\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes how to set up Lakehouse Federation to run federated queries on Azure Synapse (SQL Data Warehouse) data that is not managed by Databricks. 
To learn more about Lakehouse Federation, see [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). \nTo connect to an Azure Synapse (SQL Data Warehouse) database using Lakehouse Federation, you must create the following in your Databricks Unity Catalog metastore: \n* A *connection* to your Azure Synapse (SQL Data Warehouse) database.\n* A *foreign catalog* that mirrors your Azure Synapse (SQL Data Warehouse) database in Unity Catalog so that you can use Unity Catalog query syntax and data governance tools to manage Databricks user access to the database.\n\n#### Run federated queries on Microsoft Azure Synapse\n##### Before you begin\n\nWorkspace requirements: \n* Workspace enabled for Unity Catalog. \nCompute requirements: \n* Network connectivity from your Databricks Runtime cluster or SQL warehouse to the target database systems. See [Networking recommendations for Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/networking.html).\n* Databricks clusters must use Databricks Runtime 13.3 LTS or above and shared or single-user access mode.\n* SQL warehouses must be Pro or Serverless. \nPermissions required: \n* To create a connection, you must be a metastore admin or a user with the `CREATE CONNECTION` privilege on the Unity Catalog metastore attached to the workspace.\n* To create a foreign catalog, you must have the `CREATE CATALOG` permission on the metastore and be either the owner of the connection or have the `CREATE FOREIGN CATALOG` privilege on the connection. \nAdditional permission requirements are specified in each task-based section that follows.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/sqldw.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Microsoft Azure Synapse\n##### Create a connection\n\nA connection specifies a path and credentials for accessing an external database system. 
To create a connection, you can use Catalog Explorer or the `CREATE CONNECTION` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** Metastore admin or user with the `CREATE CONNECTION` privilege. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **External Data** menu and select **Connections**.\n3. Click **Create connection**.\n4. Enter a user-friendly **Connection name**.\n5. Select a **Connection type** of **SQLDW**.\n6. Enter the following connection properties for your Azure Synapse instance. \n* **Host**: For example, `sqldws-demo.database.windows.net`.\n* **Port**: For example, `1433`\n* **trustServerCertificate**: Defaults to `false`. When set to `true`, the transport layer uses SSL to encrypt the channel and bypasses the certificate chain to validate trust. Leave this set to the default unless you have a specific need to bypass trust validation.\n* **User**\n* **Password**\n7. (Optional) Click **Test connection** to confirm that it works.\n8. (Optional) Add a comment.\n9. Click **Create**. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nCREATE CONNECTION <connection-name> TYPE sqldw\nOPTIONS (\nhost '<hostname>',\nport '<port>',\nuser '<user>',\npassword '<password>'\n);\n\n``` \nWe recommend that you use Databricks [secrets](https:\/\/docs.databricks.com\/security\/secrets\/index.html) instead of plaintext strings for sensitive values like credentials. 
For example: \n```\nCREATE CONNECTION <connection-name> TYPE sqldw\nOPTIONS (\nhost '<hostname>',\nport '<port>',\nuser secret ('<secret-scope>','<secret-key-user>'),\npassword secret ('<secret-scope>','<secret-key-password>')\n)\n\n``` \nFor information about setting up secrets, see [Secret management](https:\/\/docs.databricks.com\/security\/secrets\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/sqldw.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Microsoft Azure Synapse\n##### Create a foreign catalog\n\nA foreign catalog mirrors a database in an external data system so that you can query and manage access to data in that database using Databricks and Unity Catalog. To create a foreign catalog, you use a connection to the data source that has already been defined. \nTo create a foreign catalog, you can use Catalog Explorer or the `CREATE FOREIGN CATALOG` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** `CREATE CATALOG` permission on the metastore and either ownership of the connection or the `CREATE FOREIGN CATALOG` privilege on the connection. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Click the **Create Catalog** button.\n3. On the **Create a new catalog** dialog, enter a name for the catalog and select a **Type** of **Foreign**.\n4. Select the **Connection** that provides access to the database that you want to mirror as a Unity Catalog catalog.\n5. Enter the name of the **Database** that you want to mirror as a catalog.\n6. Click **Create.** \nRun the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional. 
Replace the placeholder values: \n* `<catalog-name>`: Name for the catalog in Databricks.\n* `<connection-name>`: The [connection object](https:\/\/docs.databricks.com\/query-federation\/index.html#connection) that specifies the data source, path, and access credentials.\n* `<database-name>`: Name of the database you want to mirror as a catalog in Databricks. \n```\nCREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>\nOPTIONS (database '<database-name>');\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/sqldw.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Microsoft Azure Synapse\n##### Supported pushdowns\n\nThe following pushdowns are supported: \n* Filters\n* Projections\n* Limit\n* Aggregates (Average, Count, Max, Min, StddevPop, StddevSamp, Sum, VarianceSamp)\n* Functions (Arithmetic and other miscellaneous functions, such as Alias, Cast, SortOrder)\n* Sorting \nThe following pushdowns are not supported: \n* Joins\n* Window functions\n\n#### Run federated queries on Microsoft Azure Synapse\n##### Data type mappings\n\nWhen you read from Synapse \/ SQL Data Warehouse to Spark, data types map as follows: \n| Synapse type | Spark type |\n| --- | --- |\n| decimal, money, numeric, smallmoney | DecimalType |\n| smallint | ShortType |\n| tinyint | ByteType |\n| int | IntegerType |\n| bigint | LongType |\n| real | FloatType |\n| float | DoubleType |\n| char, nchar, ntext, nvarchar, text, uniqueidentifier, varchar, xml | StringType |\n| binary, geography, geometry, image, timestamp, udt, varbinary | BinaryType |\n| bit | BooleanType |\n| date | DateType |\n| datetime, datetime, smalldatetime, time | TimestampType\/TimestampNTZType\\* | \n\\*When you read from Synapse \/ SQL Data Warehouse (SQLDW), SQLDW `datetimes` are mapped to Spark `TimestampType` if `preferTimestampNTZ = false` (default). 
SQLDW `datetimes` are mapped to `TimestampNTZType` if `preferTimestampNTZ = true`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/sqldw.html"} +{"content":"# Security and compliance guide\n## Authentication and access control\n#### Manage access to Databricks automation\n\nThis article describes how to configure permissions for Databricks credentials. To learn how to use credentials to authenticate to Databricks, see [Authentication for Databricks automation - overview](https:\/\/docs.databricks.com\/dev-tools\/auth\/index.html). \nNote \nDatabricks automation authentication permissions are available only in the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html"} +{"content":"# Security and compliance guide\n## Authentication and access control\n#### Manage access to Databricks automation\n##### Personal access token permissions\n\nWorkspace admins can set permissions on personal access tokens to control which users, service principals, and groups can create and use tokens. Before you can use token access control, a Databricks workspace admin must enable personal access tokens for the workspace. See [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens). \nA workspace user can have one of the following token permissions: \n* NO PERMISSIONS: User cannot create or use personal access tokens to authenticate to the Databricks workspace.\n* CAN USE: User can create a personal access token and use it to authenticate to the workspace.\n* CAN MANAGE (workspace admins only): User can manage all workspace users\u2019 personal access tokens and permission to use them. Users in the workspace `admins` group have this permission by default and you cannot revoke it. 
No other users, service principals, or groups can be granted this permission. \nThis table lists the permissions required for each token-related task: \n| Task | NO PERMISSIONS | CAN USE | CAN MANAGE |\n| --- | --- | --- | --- |\n| Create a token | | x | x |\n| Use a token for authentication | | x | x |\n| Revoke your own token | | x | x |\n| Revoke any user\u2019s or service principal\u2019s token | | | x |\n| List all tokens | | | x |\n| Modify token permissions | | | x | \n### Manage token permissions using the admin settings page \nThis section describes how to manage permissions using the workspace UI. You can also use the [Permissions API](https:\/\/docs.databricks.com\/api\/workspace\/permissions) or [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html). \n1. Go to the [settings page](https:\/\/docs.databricks.com\/admin\/index.html#admin-settings).\n2. Click the **Advanced** tab.\n3. Next to **Personal Access Tokens**, click the **Permissions** button to open the token permissions editor. \n![Manage token permissions](https:\/\/docs.databricks.com\/_images\/token-permissions.png)\n4. Search for and select the user, service principal, or group and choose the permission to assign. \nIf the `users` group has the CAN USE permission and you want to apply more fine-grained access for non-admin users, remove the CAN USE permission from the `users` group by clicking the **X** next to the permission drop-down menu in the **users** row.\n5. Click **+ Add**.\n6. Click **Save**. \nWarning \nAfter you save your changes, any users who previously had either the CAN USE or CAN MANAGE permission and no longer have either permission are denied access to personal access token authentication and their active tokens are immediately deleted (revoked). 
Deleted tokens cannot be retrieved.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html"} +{"content":"# Security and compliance guide\n## Authentication and access control\n#### Manage access to Databricks automation\n##### Password permissions\n\nWhen unified login is disabled, by default all workspace admin users can sign in to Databricks using either workspace-level SSO or their username and password, and all API users can authenticate to the Databricks REST APIs using their username and password. \nAs a workspace admin, when workspace-level SSO is enabled you can configure password access control to limit workspace admin users\u2019 and API users\u2019 ability to authenticate with their username and password. \nNote \nPassword access control can only be configured when unified login is disabled. Unified login is enabled for all accounts created after June 21, 2023. If unified login is enabled on your account and you require password access control, contact your Databricks account team. \nFor more information on the sign-in process when unified login is enabled, see [Workspace sign-in process](https:\/\/docs.databricks.com\/admin\/users-groups\/single-sign-on\/index.html#sign-in-process). \nThere are two permission levels for passwords: NO PERMISSIONS and CAN USE. CAN USE grants more abilities to workspace admins than to non-admin users. This table lists the abilities for each permission. \n| Task | NO PERMISSIONS | CAN USE |\n| --- | --- | --- |\n| Can authenticate to the API using password | | x |\n| Can authenticate to the Databricks UI using password | | x (Workspace admins only) | \nIf a non-admin user with no permissions attempts to make a REST API call using a password, authentication will fail. Databricks recommends personal access token REST authentication instead of username and password. \nWorkspace admin users with CAN USE permission see the **Admin Log In** tab on the sign-in page. 
They can choose to use that tab to log in to Databricks with username and password. \n![SSO admin login tab](https:\/\/docs.databricks.com\/_images\/admin-login.png) \nWorkspace admins with no permissions do not see this page and must log in using SSO. When workspace-level SSO is enabled, all non-admin users do not see this page and must log in using SSO. \n### Configure password permissions \nThis section describes how to manage permissions using the workspace admin settings page. \n1. As a workspace admin, log in to the Databricks workspace.\n2. Click your username in the top bar of the Databricks workspace and select **Settings**.\n3. Click the **Advanced** tab.\n4. Next to **Password Usage**, click **Permission Settings**.\n5. In the Permissions Settings dialog, assign password permission to users and groups using the drop-down menu next to the user or group. You can also configure permissions for the `Admins` group.\n6. Click **Save**. \nYou can also configure password permissions using the [Permissions API](https:\/\/docs.databricks.com\/api\/workspace\/permissions).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html"} +{"content":"# Query data\n## Data format options\n#### Image\n\nImportant \nDatabricks recommends that you use the [binary file](https:\/\/docs.databricks.com\/query\/formats\/binary.html) data source to load image data into the Spark DataFrame as raw bytes. See [Reference solution for image applications](https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/images-etl-inference.html) for the recommended workflow to handle image data. \nThe [image data source](https:\/\/spark.apache.org\/docs\/latest\/ml-datasource#image-data-source) abstracts from the details of image representations and provides a standard API to load image data. To read image files, specify the data source `format` as `image`. 
\n```\ndf = spark.read.format(\"image\").load(\"<path-to-image-data>\")\n\n``` \nSimilar APIs exist for Scala, Java, and R. \nYou can import a nested directory structure (for example, use a path like `\/path\/to\/dir\/`) and you can use partition discovery by specifying a path with a partition directory (that is, a path like `\/path\/to\/dir\/date=2018-01-02\/category=automobile`).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/image.html"} +{"content":"# Query data\n## Data format options\n#### Image\n##### Image structure\n\nImage files are loaded as a DataFrame containing a single struct-type column called `image` with the following fields: \n```\nimage: struct containing all the image data\n|-- origin: string representing the source URI\n|-- height: integer, image height in pixels\n|-- width: integer, image width in pixels\n|-- nChannels\n|-- mode\n|-- data\n\n``` \nwhere the fields are: \n* `nChannels`: The number of color channels. Typical values are 1 for grayscale images, 3 for colored images (for example, RGB), and 4 for colored images with alpha channel.\n* `mode`: Integer flag that indicates how to interpret the data field. It specifies the data type and channel order the data is stored in. The value of the field is expected (but not enforced) to map to one of the OpenCV types displayed in the following table. OpenCV types are defined for 1, 2, 3, or 4 channels and several data types for the pixel values. Channel order specifies the order in which the colors are stored. For example, if you have a typical three channel image with red, blue, and green components, there are six possible orderings. Most libraries use either RGB or BGR. Three (four) channel OpenCV types are expected to be in BGR(A) order. 
\n**Map of Type to Numbers in OpenCV (data types x number of channels)** \n| Type | C1 | C2 | C3 | C4 |\n| --- | --- | --- | --- | --- |\n| CV\\_8U | 0 | 8 | 16 | 24 |\n| CV\\_8S | 1 | 9 | 17 | 25 |\n| CV\\_16U | 2 | 10 | 18 | 26 |\n| CV\\_16S | 3 | 11 | 19 | 27 |\n| CV\\_32S | 4 | 12 | 20 | 28 |\n| CV\\_32F | 5 | 13 | 21 | 29 |\n| CV\\_64F | 6 | 14 | 22 | 30 |\n* `data`: Image data stored in a binary format. Image data is represented as a 3-dimensional array with the dimension shape (height, width, nChannels) and array values of type t specified by the mode field. The array is stored in row-major order.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/image.html"} +{"content":"# Query data\n## Data format options\n#### Image\n##### Display image data\n\nThe Databricks `display` function supports displaying image data. See [Images](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#display-image-type).\n\n#### Image\n##### Notebook example: Read and write data to image files\n\nThe following notebook shows how to read and write data to image files. \n### Image data source notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/image-data-source.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n#### Image\n##### Limitations of image data source\n\nThe image data source decodes the image files during the creation of the Spark DataFrame, increases the data size, and introduces limitations in the following scenarios: \n1. Persisting the DataFrame: If you want to persist the DataFrame into a Delta table for easier access, you should persist the raw bytes instead of the decoded data to save disk space.\n2. Shuffling the partitions: Shuffling the decoded image data takes more disk space and network bandwidth, which results in slower shuffling. You should delay decoding the image as much as possible.\n3. 
Choosing another decoding method: The image data source uses the Image IO library of javax to decode the image, which prevents you from choosing other image decoding libraries for better performance or implementing customized decoding logic. \nThese limitations can be avoided by using the [binary file](https:\/\/docs.databricks.com\/query\/formats\/binary.html) data source to load image data and decoding only as needed.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/image.html"} +{"content":"# Security and compliance guide\n## Networking\n#### Classic compute plane networking\n\nThis guide introduces features to customize network access between the Databricks control plane and the classic compute plane. Connectivity between the control plane and the serverless compute plane is always over the cloud network backbone and not the public internet. \nTo learn more about the control plane and the compute plane, see [Databricks architecture overview](https:\/\/docs.databricks.com\/security\/network\/index.html#architecture). \nThe features in this section focus on establishing and securing the connection between the Databricks control plane and classic compute plane. This connection is labeled as 2 in the diagram below: \n![Network connectivity overview diagram](https:\/\/docs.databricks.com\/_images\/networking-classic.png)\n\n#### Classic compute plane networking\n##### What is secure cluster connectivity?\n\nAll new workspaces are created with secure cluster connectivity by default. Secure cluster connectivity means that customer VPCs have no open ports and classic compute plane resources have no public IP addresses. This simplifies network administration by removing the need to configure ports on security groups or network peering. 
To learn more about deploying a workspace with secure cluster connectivity, see [Secure cluster connectivity](https:\/\/docs.databricks.com\/security\/network\/classic\/secure-cluster-connectivity.html).\n\n#### Classic compute plane networking\n##### Deploy a workspace in your own VPC\n\nAn AWS Virtual Private Cloud (VPC) lets you provision a logically isolated section of the AWS Cloud where you can launch AWS resources in a virtual network. The VPC is the network location for your Databricks clusters. By default, Databricks creates and manages a VPC for the Databricks workspace. \nYou can instead provide your own VPC to host your Databricks clusters, enabling you to maintain more control of your own AWS account and limit outgoing connections. To take advantage of a customer-managed VPC, you must specify a VPC when you first create the Databricks workspace. For more information, see [Configure a customer-managed VPC](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/index.html"} +{"content":"# Security and compliance guide\n## Networking\n#### Classic compute plane networking\n##### Peer the Databricks VPC with another AWS VPC\n\nBy default, Databricks creates and manages a VPC for the Databricks workspace. For additional security, workers that belong to a cluster can *only* communicate with other workers that belong to the same cluster. Workers cannot talk to any other EC2 instances or other AWS services running in the Databricks VPC. If you have any AWS service running on the same VPC as that of the Databricks cluster, you might not be able to talk to the service because of this firewall restriction. You can run such services outside of the Databricks VPC and peer with that VPC to connect to those services. 
See [VPC peering](https:\/\/docs.databricks.com\/security\/network\/classic\/vpc-peering.html).\n\n#### Classic compute plane networking\n##### Enable private connectivity from the control plane to the classic compute plane\n\nAWS PrivateLink provides private connectivity from AWS VPCs and on-premises networks to AWS services without exposing the traffic to the public network. You can enable private connectivity from the classic compute plane to the Databricks workspace\u2019s core services in the control plane by enabling AWS PrivateLink. \nFor more information, see [Enable AWS PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/index.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Adaptive query execution\n\nAdaptive query execution (AQE) is query re-optimization that occurs during query execution. \nThe motivation for runtime re-optimization is that Databricks has the most up-to-date, accurate statistics at the end of a shuffle and broadcast exchange (referred to as a query stage in AQE). As a result, Databricks can opt for a better physical strategy, pick an optimal post-shuffle partition size and number, or do optimizations that used to require hints, for example, skew join handling. \nThis can be very useful when statistics collection is not turned on or when statistics are stale. It is also useful in places where statically derived statistics are inaccurate, such as in the middle of a complicated query, or after the occurrence of data skew.\n\n#### Adaptive query execution\n##### Capabilities\n\nAQE is enabled by default. It has four major features: \n* Dynamically changes sort merge join into broadcast hash join.\n* Dynamically coalesces partitions (combines small partitions into reasonably sized partitions) after shuffle exchange. 
Very small tasks have worse I\/O throughput and tend to suffer more from scheduling overhead and task setup overhead. Combining small tasks saves resources and improves cluster throughput.\n* Dynamically handles skew in sort merge join and shuffle hash join by splitting (and replicating if needed) skewed tasks into roughly evenly sized tasks.\n* Dynamically detects and propagates empty relations.\n\n#### Adaptive query execution\n##### Application\n\nAQE applies to all queries that are: \n* Non-streaming\n* Contain at least one exchange (usually when there\u2019s a join, aggregate, or window), one sub-query, or both. \nNot all AQE-applied queries are necessarily re-optimized. The re-optimization might or might not come up with a different query plan than the one statically compiled. To determine whether a query\u2019s plan has been changed by AQE, see the following section, [Query plans](https:\/\/docs.databricks.com\/optimizations\/aqe.html#query-plans).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/aqe.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Adaptive query execution\n##### Query plans\n\nThis section discusses how you can examine query plans in different ways. 
\nIn this section: \n* [Spark UI](https:\/\/docs.databricks.com\/optimizations\/aqe.html#spark-ui)\n* [`DataFrame.explain()`](https:\/\/docs.databricks.com\/optimizations\/aqe.html#dataframeexplain)\n* [`SQL EXPLAIN`](https:\/\/docs.databricks.com\/optimizations\/aqe.html#sql-explain) \n### [Spark UI](https:\/\/docs.databricks.com\/optimizations\/aqe.html#id5) \n#### `AdaptiveSparkPlan` node \nAQE-applied queries contain one or more `AdaptiveSparkPlan` nodes, usually as the root node of each main query or sub-query.\nBefore the query runs or when it is running, the `isFinalPlan` flag of the corresponding `AdaptiveSparkPlan` node shows as `false`; after the query execution completes, the `isFinalPlan` flag changes to `true`. \n#### Evolving plan \nThe query plan diagram evolves as the execution progresses and reflects the most current plan that is being executed. Nodes that have already been executed (for which metrics are available) will not change, but those that haven\u2019t can change over time as the result of re-optimizations. \nThe following is a query plan diagram example: \n![Query plan diagram](https:\/\/docs.databricks.com\/_images\/query-plan-diagram.png) \n### [`DataFrame.explain()`](https:\/\/docs.databricks.com\/optimizations\/aqe.html#id6) \n#### `AdaptiveSparkPlan` node \nAQE-applied queries contain one or more `AdaptiveSparkPlan` nodes, usually as the root node of each main query or sub-query. Before the query runs or when it is running, the `isFinalPlan` flag of the corresponding `AdaptiveSparkPlan` node shows as `false`; after the query execution completes, the `isFinalPlan` flag changes to `true`. \n#### Current and initial plan \nUnder each `AdaptiveSparkPlan` node there will be both the initial plan (the plan before applying any AQE optimizations) and the current or the final plan, depending on whether the execution has completed. The current plan will evolve as the execution progresses. 
\n#### Runtime statistics \nEach shuffle and broadcast stage contains data statistics. \nBefore the stage runs or when the stage is running, the statistics are compile-time estimates, and the flag `isRuntime` is `false`, for example: `Statistics(sizeInBytes=1024.0 KiB, rowCount=4, isRuntime=false)`. \nAfter the stage execution completes, the statistics are those collected at runtime, and the flag `isRuntime` will become `true`, for example: `Statistics(sizeInBytes=658.1 KiB, rowCount=2.81E+4, isRuntime=true)`. \nThe following is a `DataFrame.explain` example: \n* Before the execution \n![Before execution](https:\/\/docs.databricks.com\/_images\/before-execution.png)\n* During the execution \n![During execution](https:\/\/docs.databricks.com\/_images\/during-execution.png)\n* After the execution \n![After execution](https:\/\/docs.databricks.com\/_images\/after-execution.png) \n### [`SQL EXPLAIN`](https:\/\/docs.databricks.com\/optimizations\/aqe.html#id7) \n#### `AdaptiveSparkPlan` node \nAQE-applied queries contain one or more `AdaptiveSparkPlan` nodes, usually as the root node of each main query or sub-query. \n#### No current plan \nBecause `SQL EXPLAIN` does not execute the query, the current plan is always the same as the initial plan and does not reflect what would eventually get executed by AQE. \nThe following is a `SQL EXPLAIN` example: \n![SQL explain](https:\/\/docs.databricks.com\/_images\/sql-explain.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/aqe.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Adaptive query execution\n##### Effectiveness\n\nThe query plan will change if one or more AQE optimizations take effect. The effect of these AQE optimizations shows up as differences between the current or final plan and the initial plan, and as specific plan nodes in the current or final plan. 
\n* Dynamically change sort merge join into broadcast hash join: different physical join nodes between the current\/final plan and the initial plan \n![Join strategy string](https:\/\/docs.databricks.com\/_images\/join-strategy-string.png)\n* Dynamically coalesce partitions: node `CustomShuffleReader` with property `Coalesced` \n![Custom shuffle reader](https:\/\/docs.databricks.com\/_images\/custom-shuffle-reader.png) \n![Custom shuffle reader string](https:\/\/docs.databricks.com\/_images\/custom-shuffle-reader-string.png)\n* Dynamically handle skew join: node `SortMergeJoin` with field `isSkew` as `true`. \n![Skew join plan](https:\/\/docs.databricks.com\/_images\/skew-join-plan.png) \n![Skew join string](https:\/\/docs.databricks.com\/_images\/skew-join-string.png)\n* Dynamically detect and propagate empty relations: part of (or the entire) plan is replaced by node `LocalTableScan` with the relation field as empty. \n![Local table scan](https:\/\/docs.databricks.com\/_images\/local-table-scan.png) \n![Local table scan string](https:\/\/docs.databricks.com\/_images\/local-table-scan-string.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/aqe.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Adaptive query execution\n##### Configuration\n\nIn this section: \n* [Enable and disable adaptive query execution](https:\/\/docs.databricks.com\/optimizations\/aqe.html#enable-and-disable-adaptive-query-execution)\n* [Enable auto-optimized shuffle](https:\/\/docs.databricks.com\/optimizations\/aqe.html#enable-auto-optimized-shuffle)\n* [Dynamically change sort merge join into broadcast hash join](https:\/\/docs.databricks.com\/optimizations\/aqe.html#dynamically-change-sort-merge-join-into-broadcast-hash-join)\n* [Dynamically coalesce partitions](https:\/\/docs.databricks.com\/optimizations\/aqe.html#dynamically-coalesce-partitions)\n* [Dynamically handle skew 
join](https:\/\/docs.databricks.com\/optimizations\/aqe.html#dynamically-handle-skew-join)\n* [Dynamically detect and propagate empty relations](https:\/\/docs.databricks.com\/optimizations\/aqe.html#dynamically-detect-and-propagate-empty-relations) \n### [Enable and disable adaptive query execution](https:\/\/docs.databricks.com\/optimizations\/aqe.html#id8) \n| Property |\n| --- |\n| **spark.databricks.optimizer.adaptive.enabled** Type: `Boolean` Whether to enable or disable adaptive query execution. Default value: `true` | \n### [Enable auto-optimized shuffle](https:\/\/docs.databricks.com\/optimizations\/aqe.html#id9) \n| Property |\n| --- |\n| **spark.sql.shuffle.partitions** Type: `Integer` The default number of partitions to use when shuffling data for joins or aggregations. Setting the value `auto` enables auto-optimized shuffle, which automatically determines this number based on the query plan and the query input data size. Note: For Structured Streaming, this configuration cannot be changed between query restarts from the same checkpoint location. Default value: 200 | \n### [Dynamically change sort merge join into broadcast hash join](https:\/\/docs.databricks.com\/optimizations\/aqe.html#id10) \n| Property |\n| --- |\n| **spark.databricks.adaptive.autoBroadcastJoinThreshold** Type: `Byte String` The threshold to trigger switching to broadcast join at runtime. Default value: `30MB` | \n### [Dynamically coalesce partitions](https:\/\/docs.databricks.com\/optimizations\/aqe.html#id11) \n| Property |\n| --- |\n| **spark.sql.adaptive.coalescePartitions.enabled** Type: `Boolean` Whether to enable or disable partition coalescing. Default value: `true` |\n| **spark.sql.adaptive.advisoryPartitionSizeInBytes** Type: `Byte String` The target size after coalescing. The coalesced partition sizes will be close to but no bigger than this target size. 
Default value: `64MB` |\n| **spark.sql.adaptive.coalescePartitions.minPartitionSize** Type: `Byte String` The minimum size of partitions after coalescing. The coalesced partition sizes will be no smaller than this size. Default value: `1MB` |\n| **spark.sql.adaptive.coalescePartitions.minPartitionNum** Type: `Integer` The minimum number of partitions after coalescing. Not recommended, because setting explicitly overrides `spark.sql.adaptive.coalescePartitions.minPartitionSize`. Default value: 2x no. of cluster cores | \n### [Dynamically handle skew join](https:\/\/docs.databricks.com\/optimizations\/aqe.html#id12) \n| Property |\n| --- |\n| **spark.sql.adaptive.skewJoin.enabled** Type: `Boolean` Whether to enable or disable skew join handling. Default value: `true` |\n| **spark.sql.adaptive.skewJoin.skewedPartitionFactor** Type: `Integer` A factor that when multiplied by the median partition size contributes to determining whether a partition is skewed. Default value: `5` |\n| **spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes** Type: `Byte String` A threshold that contributes to determining whether a partition is skewed. Default value: `256MB` | \nA partition is considered skewed when both `(partition size > skewedPartitionFactor * median partition size)` and `(partition size > skewedPartitionThresholdInBytes)` are `true`. \n### [Dynamically detect and propagate empty relations](https:\/\/docs.databricks.com\/optimizations\/aqe.html#id13) \n| Property |\n| --- |\n| **spark.databricks.adaptive.emptyRelationPropagation.enabled** Type: `Boolean` Whether to enable or disable dynamic empty relation propagation. 
Default value: `true` |\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/aqe.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Adaptive query execution\n##### Frequently asked questions (FAQ)\n\nIn this section: \n* [Why didn\u2019t AQE broadcast a small join table?](https:\/\/docs.databricks.com\/optimizations\/aqe.html#why-didnt-aqe-broadcast-a-small-join-table)\n* [Should I still use a broadcast join strategy hint with AQE enabled?](https:\/\/docs.databricks.com\/optimizations\/aqe.html#should-i-still-use-a-broadcast-join-strategy-hint-with-aqe-enabled)\n* [What is the difference between skew join hint and AQE skew join optimization? Which one should I use?](https:\/\/docs.databricks.com\/optimizations\/aqe.html#what-is-the-difference-between-skew-join-hint-and-aqe-skew-join-optimization-which-one-should-i-use)\n* [Why didn\u2019t AQE adjust my join ordering automatically?](https:\/\/docs.databricks.com\/optimizations\/aqe.html#why-didnt-aqe-adjust-my-join-ordering-automatically)\n* [Why didn\u2019t AQE detect my data skew?](https:\/\/docs.databricks.com\/optimizations\/aqe.html#why-didnt-aqe-detect-my-data-skew) \n### [Why didn\u2019t AQE broadcast a small join table?](https:\/\/docs.databricks.com\/optimizations\/aqe.html#id14) \nIf the size of the relation expected to be broadcast falls under the `spark.databricks.adaptive.autoBroadcastJoinThreshold` threshold but the relation is still not broadcast: \n* Check the join type. Broadcast is not supported for certain join types, for example, the left relation of a `LEFT OUTER JOIN` cannot be broadcast.\n* It can also be that the relation contains a lot of empty partitions, in which case the majority of the tasks can finish quickly with sort merge join or it can potentially be optimized with skew join handling. AQE avoids changing such sort merge joins to broadcast hash joins if the percentage of non-empty partitions is lower than `spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin`. 
\n### [Should I still use a broadcast join strategy hint with AQE enabled?](https:\/\/docs.databricks.com\/optimizations\/aqe.html#id15) \nYes. A statically planned broadcast join is usually more performant than a dynamically planned one by AQE as AQE might not switch to broadcast join until after performing shuffle for both sides of the join (by which time the actual relation sizes are obtained). So using a broadcast hint can still be a good choice if you know your query well. AQE will respect query hints the same way as static optimization does, but can still apply dynamic optimizations that are not affected by the hints. \n### [What is the difference between skew join hint and AQE skew join optimization? Which one should I use?](https:\/\/docs.databricks.com\/optimizations\/aqe.html#id16) \nIt is recommended to rely on AQE skew join handling rather than use the skew join hint, because AQE skew join is completely automatic and in general performs better than the hint counterpart. \n### [Why didn\u2019t AQE adjust my join ordering automatically?](https:\/\/docs.databricks.com\/optimizations\/aqe.html#id17) \nDynamic join reordering is not part of AQE. 
\n### [Why didn\u2019t AQE detect my data skew?](https:\/\/docs.databricks.com\/optimizations\/aqe.html#id18) \nThere are two size conditions that must be satisfied for AQE to detect a partition as a skewed partition: \n* The partition size is larger than the `spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes` (default 256MB)\n* The partition size is larger than the median size of all partitions times the skewed partition factor `spark.sql.adaptive.skewJoin.skewedPartitionFactor` (default 5) \nIn addition, skew handling support is limited for certain join types, for example, in `LEFT OUTER JOIN`, only skew on the left side can be optimized.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/aqe.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Adaptive query execution\n##### Legacy\n\nThe term \u201cAdaptive Execution\u201d has existed since Spark 1.6, but the new AQE in Spark 3.0 is fundamentally different. In terms of functionality, Spark 1.6 does only the \u201cdynamically coalesce partitions\u201d part. In terms of technical architecture, the new AQE is a framework of dynamic planning and replanning of queries based on runtime stats, which supports a variety of optimizations such as the ones we have described in this article and can be extended to enable more potential optimizations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/aqe.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Histogram options\n\nThis section covers the configuration options for histogram chart visualizations. 
For an example, see [histogram example](https:\/\/docs.databricks.com\/visualizations\/visualization-types.html#histogram).\n\n##### Histogram options\n###### General\n\nTo configure general options, click **General** and configure each of the following required settings: \n* **X Column**: Select the results column from the dataset to display.\n* **Number of Bins**: Number of bins in which to display the data.\n\n##### Histogram options\n###### X axis\n\nTo configure formatting options for the X axis, click **X axis** and configure each of the following optional settings: \n* **Scale**: Select Automatic, Datetime, Linear, Logarithmic, or Categorical.\n* **Name**: Specify a display name for the X axis column if different from the column name.\n* **Show labels**: Whether to show X axis labels.\n* **Hide axis**: Whether to hide the X axis labels and line.\n\n##### Histogram options\n###### Y axis\n\nTo configure formatting options for the Y axis, click **Y axis** and configure each of the following optional settings: \n* **Name**: Specify a display name for the Y axis column if different from the column name.\n* **Start Value**: Show only values higher than a given value, regardless of the query result.\n* **End Value**: Show only values lower than a given value, regardless of the query result.\n* **Hide axis**: If enabled, hides the Y axis labels and scale markers.\n\n##### Histogram options\n###### Colors\n\nTo configure colors, click **Colors** and optionally override automatic colors and configure custom colors.\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/histogram.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Histogram options\n###### Data labels\n\nTo configure labels for each data point in the visualization, click **Data labels** and configure the following optional settings: \n* **Number values format**: The format to use for labels for numeric values.\n* 
**Percent values format**: The format to use for labels for percentages.\n* **Date\/time values format**: The format to use for labels for date\/time values.\n* **Data labels**: The format to use for labels for other types of values.\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/histogram.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Feature Engineering in Unity Catalog\n##### Discover features in Unity Catalog\n\nWith Feature Engineering in Unity Catalog, you can: \n* Search for feature tables by feature table name, feature, comment, or tag.\n* Filter feature tables by tags.\n* Explore and manage feature tables with Catalog Explorer. \nTo access the Feature Engineering in Unity Catalog UI, click ![Feature Store Icon](https:\/\/docs.databricks.com\/_images\/feature-store-icon.png) **Features** in the sidebar. Select a catalog with the catalog selector to view all of the available feature tables in that catalog, along with the following metadata: \n* Who owns the feature table.\n* Online stores where the feature table has been published.\n* The last time a notebook or job wrote to the feature table.\n* Tags of the feature table.\n* Comment of the feature table. \n![Feature store page](https:\/\/docs.databricks.com\/_images\/feature-store-ui-uc.png) \nNote \nAny table managed by Unity Catalog that has a primary key is automatically a feature table and appears on this page. If you don\u2019t see a table on this page, see how to [add a primary key constraint on the table](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html#use-existing-uc-table).\n\n##### Discover features in Unity Catalog\n###### Search and browse for feature tables\n\nSelect a catalog with the catalog selector. Use the search box to search for feature tables in the catalog. You can enter all or part of the name of a feature table, a feature, a comment, or a tag of the feature table. 
Search text is case-insensitive. \n![Feature search example](https:\/\/docs.databricks.com\/_images\/feature-search-example-uc.png) \nYou can also use the tag selector to filter feature tables with a specific tag.\n\n##### Discover features in Unity Catalog\n###### Explore and manage feature tables with Catalog Explorer\n\nClick the feature table name to [explore and manage feature table in Catalog Explorer](https:\/\/docs.databricks.com\/discover\/database-objects.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/ui-uc.html"} +{"content":"# Model serving with Databricks\n### Deploy generative AI foundation models\n\nThis article describes support for serving and querying generative AI and LLM foundation models using [Databricks Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html). \nImportant \nFor a getting started tutorial on how to query a foundation model on Databricks, see [Get started querying LLMs on Databricks](https:\/\/docs.databricks.com\/large-language-models\/llm-serving-intro.html).\n\n### Deploy generative AI foundation models\n#### What are foundation models?\n\nFoundation models are large ML models pre-trained with the intention that they are to be fine-tuned for more specific language understanding and generation tasks. These models are utilized to discern patterns within the input data for generative AI and LLM workloads. \nDatabricks Model Serving supports serving and querying foundation models using the following capabilities: \n* [Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html). This functionality makes state-of-the-art open models available to your model serving endpoint. These models are curated foundation model architectures that support optimized inference. 
Base models, like DBRX Instruct, Llama-2-70B-chat, BGE-Large, and Mistral-7B, are available for immediate use with **pay-per-token** pricing, and workloads that require performance guarantees and fine-tuned model variants can be deployed with **provisioned throughput**.\n* [External models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html). These are models that are hosted outside of Databricks. Endpoints that serve external models can be centrally governed and customers can establish rate limits and access control for them. Examples include foundation models like OpenAI\u2019s GPT-4, Anthropic\u2019s Claude, and others.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/foundation-models.html"} +{"content":"# Model serving with Databricks\n### Deploy generative AI foundation models\n#### Requirements\n\nTo access and query foundation models using Databricks Model Serving, review the requirements for each functionality. \n* [Foundation Model API requirements](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#required).\n* [External models requirements](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html#required).\n\n### Deploy generative AI foundation models\n#### Create a foundation model serving endpoint\n\nSee [Create foundation model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-foundation-model-endpoints.html)\n\n### Deploy generative AI foundation models\n#### Query a foundation model\n\n* See [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html)\n* [Batch inference using Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/fmapi-batch-inference.html)\n\n### Deploy generative AI foundation models\n#### Additional resources\n\n* [Get started querying LLMs on 
Databricks](https:\/\/docs.databricks.com\/large-language-models\/llm-serving-intro.html)\n* [Create foundation model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-foundation-model-endpoints.html)\n* [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/foundation-models.html"} +{"content":"# What is Delta Lake?\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/merge.html"} +{"content":"# What is Delta Lake?\n### Upsert into a Delta Lake table using merge\n\nYou can upsert data from a source table, view, or DataFrame into a target Delta table by using the `MERGE` SQL operation. Delta Lake supports inserts, updates, and deletes in `MERGE`, and it supports extended syntax beyond the SQL standards to facilitate advanced use cases. \nSuppose you have a source table named `people10mupdates` or a source path at `\/tmp\/delta\/people-10m-updates` that contains new data for a target table named `people10m` or a target path at `\/tmp\/delta\/people-10m`. Some of these new records may already be present in the target data. To merge the new data, you want to update rows where the person\u2019s `id` is already present and insert the new rows where no matching `id` is present. 
You can run the following query: \n```\nMERGE INTO people10m\nUSING people10mupdates\nON people10m.id = people10mupdates.id\nWHEN MATCHED THEN\nUPDATE SET\nid = people10mupdates.id,\nfirstName = people10mupdates.firstName,\nmiddleName = people10mupdates.middleName,\nlastName = people10mupdates.lastName,\ngender = people10mupdates.gender,\nbirthDate = people10mupdates.birthDate,\nssn = people10mupdates.ssn,\nsalary = people10mupdates.salary\nWHEN NOT MATCHED\nTHEN INSERT (\nid,\nfirstName,\nmiddleName,\nlastName,\ngender,\nbirthDate,\nssn,\nsalary\n)\nVALUES (\npeople10mupdates.id,\npeople10mupdates.firstName,\npeople10mupdates.middleName,\npeople10mupdates.lastName,\npeople10mupdates.gender,\npeople10mupdates.birthDate,\npeople10mupdates.ssn,\npeople10mupdates.salary\n)\n\n``` \n```\nfrom delta.tables import *\n\ndeltaTablePeople = DeltaTable.forPath(spark, '\/tmp\/delta\/people-10m')\ndeltaTablePeopleUpdates = DeltaTable.forPath(spark, '\/tmp\/delta\/people-10m-updates')\n\ndfUpdates = deltaTablePeopleUpdates.toDF()\n\ndeltaTablePeople.alias('people') \\\n.merge(\ndfUpdates.alias('updates'),\n'people.id = updates.id'\n) \\\n.whenMatchedUpdate(set =\n{\n\"id\": \"updates.id\",\n\"firstName\": \"updates.firstName\",\n\"middleName\": \"updates.middleName\",\n\"lastName\": \"updates.lastName\",\n\"gender\": \"updates.gender\",\n\"birthDate\": \"updates.birthDate\",\n\"ssn\": \"updates.ssn\",\n\"salary\": \"updates.salary\"\n}\n) \\\n.whenNotMatchedInsert(values =\n{\n\"id\": \"updates.id\",\n\"firstName\": \"updates.firstName\",\n\"middleName\": \"updates.middleName\",\n\"lastName\": \"updates.lastName\",\n\"gender\": \"updates.gender\",\n\"birthDate\": \"updates.birthDate\",\n\"ssn\": \"updates.ssn\",\n\"salary\": \"updates.salary\"\n}\n) \\\n.execute()\n\n``` \n```\nimport io.delta.tables._\nimport org.apache.spark.sql.functions._\n\nval deltaTablePeople = DeltaTable.forPath(spark, \"\/tmp\/delta\/people-10m\")\nval deltaTablePeopleUpdates = 
DeltaTable.forPath(spark, \"\/tmp\/delta\/people-10m-updates\")\nval dfUpdates = deltaTablePeopleUpdates.toDF()\n\ndeltaTablePeople\n.as(\"people\")\n.merge(\ndfUpdates.as(\"updates\"),\n\"people.id = updates.id\")\n.whenMatched\n.updateExpr(\nMap(\n\"id\" -> \"updates.id\",\n\"firstName\" -> \"updates.firstName\",\n\"middleName\" -> \"updates.middleName\",\n\"lastName\" -> \"updates.lastName\",\n\"gender\" -> \"updates.gender\",\n\"birthDate\" -> \"updates.birthDate\",\n\"ssn\" -> \"updates.ssn\",\n\"salary\" -> \"updates.salary\"\n))\n.whenNotMatched\n.insertExpr(\nMap(\n\"id\" -> \"updates.id\",\n\"firstName\" -> \"updates.firstName\",\n\"middleName\" -> \"updates.middleName\",\n\"lastName\" -> \"updates.lastName\",\n\"gender\" -> \"updates.gender\",\n\"birthDate\" -> \"updates.birthDate\",\n\"ssn\" -> \"updates.ssn\",\n\"salary\" -> \"updates.salary\"\n))\n.execute()\n\n``` \nSee the [Delta Lake API documentation](https:\/\/docs.databricks.com\/delta\/index.html#delta-api) for Scala and Python syntax details. For SQL syntax details, see [MERGE INTO](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-merge-into.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/merge.html"} +{"content":"# What is Delta Lake?\n### Upsert into a Delta Lake table using merge\n#### Modify all unmatched rows using merge\n\nIn Databricks SQL and Databricks Runtime 12.2 LTS and above, you can use the `WHEN NOT MATCHED BY SOURCE` clause to `UPDATE` or `DELETE` records in the target table that do not have corresponding records in the source table. Databricks recommends adding an optional conditional clause to avoid fully rewriting the target table. \nThe following code example shows the basic syntax of using this for deletes, overwriting the target table with the contents of the source table and deleting unmatched records in the target table. 
For a more scalable pattern for tables where source updates and deletes are time-bound, see [Incrementally sync Delta table with source](https:\/\/docs.databricks.com\/delta\/merge.html#incremental-sync). \n```\n(targetDF\n.merge(sourceDF, \"source.key = target.key\")\n.whenMatchedUpdateAll()\n.whenNotMatchedInsertAll()\n.whenNotMatchedBySourceDelete()\n.execute()\n)\n\n``` \n```\ntargetDF\n.merge(sourceDF, \"source.key = target.key\")\n.whenMatched()\n.updateAll()\n.whenNotMatched()\n.insertAll()\n.whenNotMatchedBySource()\n.delete()\n.execute()\n\n``` \n```\nMERGE INTO target\nUSING source\nON source.key = target.key\nWHEN MATCHED THEN\nUPDATE SET *\nWHEN NOT MATCHED THEN\nINSERT *\nWHEN NOT MATCHED BY SOURCE THEN\nDELETE\n\n``` \nThe following example adds conditions to the `WHEN NOT MATCHED BY SOURCE` clause and specifies values to update in unmatched target rows. \n```\n(targetDF\n.merge(sourceDF, \"source.key = target.key\")\n.whenMatchedUpdate(\nset = {\"target.lastSeen\": \"source.timestamp\"}\n)\n.whenNotMatchedInsert(\nvalues = {\n\"target.key\": \"source.key\",\n\"target.lastSeen\": \"source.timestamp\",\n\"target.status\": \"'active'\"\n}\n)\n.whenNotMatchedBySourceUpdate(\ncondition=\"target.lastSeen >= (current_date() - INTERVAL '5' DAY)\",\nset = {\"target.status\": \"'inactive'\"}\n)\n.execute()\n)\n\n``` \n```\ntargetDF\n.merge(sourceDF, \"source.key = target.key\")\n.whenMatched()\n.updateExpr(Map(\"target.lastSeen\" -> \"source.timestamp\"))\n.whenNotMatched()\n.insertExpr(Map(\n\"target.key\" -> \"source.key\",\n\"target.lastSeen\" -> \"source.timestamp\",\n\"target.status\" -> \"'active'\",\n)\n)\n.whenNotMatchedBySource(\"target.lastSeen >= (current_date() - INTERVAL '5' DAY)\")\n.updateExpr(Map(\"target.status\" -> \"'inactive'\"))\n.execute()\n\n``` \n```\nMERGE INTO target\nUSING source\nON source.key = target.key\nWHEN MATCHED THEN\nUPDATE SET target.lastSeen = source.timestamp\nWHEN NOT MATCHED THEN\nINSERT (key, lastSeen, status) VALUES 
(source.key, source.timestamp, 'active')\nWHEN NOT MATCHED BY SOURCE AND target.lastSeen >= (current_date() - INTERVAL '5' DAY) THEN\nUPDATE SET target.status = 'inactive'\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/merge.html"} +{"content":"# What is Delta Lake?\n### Upsert into a Delta Lake table using merge\n#### Merge operation semantics\n\nThe following is a detailed description of the `merge` programmatic operation semantics. \n* There can be any number of `whenMatched` and `whenNotMatched` clauses.\n* `whenMatched` clauses are executed when a source row matches a target table row based on the match condition. These clauses have the following semantics. \n+ `whenMatched` clauses can have at most one `update` and one `delete` action. The `update` action in `merge` only updates the specified columns (similar to the `update` [operation](https:\/\/docs.databricks.com\/delta\/tutorial.html#update)) of the matched target row. The `delete` action deletes the matched row.\n+ Each `whenMatched` clause can have an optional condition. If this clause condition exists, the `update` or `delete` action is executed for any matching source-target row pair only when the clause condition is true.\n+ If there are multiple `whenMatched` clauses, then they are evaluated in the order they are specified. All `whenMatched` clauses, except the last one, must have conditions.\n+ If none of the `whenMatched` conditions evaluate to true for a source and target row pair that matches the merge condition, then the target row is left unchanged.\n+ To update all the columns of the target Delta table with the corresponding columns of the source dataset, use `whenMatched(...).updateAll()`. This is equivalent to: \n```\nwhenMatched(...).updateExpr(Map(\"col1\" -> \"source.col1\", \"col2\" -> \"source.col2\", ...))\n\n``` \nfor all the columns of the target Delta table. 
Therefore, this action assumes that the source table has the same columns as those in the target table; otherwise, the query throws an analysis error. \nNote \nThis behavior changes when automatic schema migration is enabled. See [automatic schema evolution](https:\/\/docs.databricks.com\/delta\/update-schema.html#merge-schema-evolution) for details.\n* `whenNotMatched` clauses are executed when a source row does not match any target row based on the match condition. These clauses have the following semantics. \n+ `whenNotMatched` clauses can have only the `insert` action. The new row is generated based on the specified column and corresponding expressions. You do not need to specify all the columns in the target table. For unspecified target columns, `NULL` is inserted.\n+ Each `whenNotMatched` clause can have an optional condition. If the clause condition is present, a source row is inserted only if that condition is true for that row. Otherwise, the source row is ignored.\n+ If there are multiple `whenNotMatched` clauses, then they are evaluated in the order they are specified. All `whenNotMatched` clauses, except the last one, must have conditions.\n+ To insert all the columns of the target Delta table with the corresponding columns of the source dataset, use `whenNotMatched(...).insertAll()`. This is equivalent to: \n```\nwhenNotMatched(...).insertExpr(Map(\"col1\" -> \"source.col1\", \"col2\" -> \"source.col2\", ...))\n\n``` \nfor all the columns of the target Delta table. Therefore, this action assumes that the source table has the same columns as those in the target table; otherwise, the query throws an analysis error. \nNote \nThis behavior changes when automatic schema migration is enabled. See [automatic schema evolution](https:\/\/docs.databricks.com\/delta\/update-schema.html#merge-schema-evolution) for details.\n* `whenNotMatchedBySource` clauses are executed when a target row does not match any source row based on the merge condition. 
These clauses have the following semantics. \n+ `whenNotMatchedBySource` clauses can specify `delete` and `update` actions.\n+ Each `whenNotMatchedBySource` clause can have an optional condition. If the clause condition is present, a target row is modified only if that condition is true for that row. Otherwise, the target row is left unchanged.\n+ If there are multiple `whenNotMatchedBySource` clauses, then they are evaluated in the order they are specified. All `whenNotMatchedBySource` clauses, except the last one, must have conditions.\n+ By definition, `whenNotMatchedBySource` clauses do not have a source row to pull column values from, and so source columns can\u2019t be referenced. For each column to be modified, you can either specify a literal or perform an action on the target column, such as `SET target.deleted_count = target.deleted_count + 1`. \nImportant \n* A `merge` operation can fail if multiple rows of the source dataset match and the merge attempts to update the same rows of the target Delta table. According to the SQL semantics of merge, such an update operation is ambiguous as it is unclear which source row should be used to update the matched target row. You can preprocess the source table to eliminate the possibility of multiple matches.\n* You can apply a SQL `MERGE` operation on a SQL VIEW only if the view has been defined as `CREATE VIEW viewName AS SELECT * FROM deltaTable`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/merge.html"} +{"content":"# What is Delta Lake?\n### Upsert into a Delta Lake table using merge\n#### Data deduplication when writing into Delta tables\n\nA common ETL use case is to collect logs into Delta table by appending them to a table. However, often the sources can generate duplicate log records and downstream deduplication steps are needed to take care of them. With `merge`, you can avoid inserting the duplicate records. 
\n```\nMERGE INTO logs\nUSING newDedupedLogs\nON logs.uniqueId = newDedupedLogs.uniqueId\nWHEN NOT MATCHED\nTHEN INSERT *\n\n``` \n```\ndeltaTable.alias(\"logs\").merge(\nnewDedupedLogs.alias(\"newDedupedLogs\"),\n\"logs.uniqueId = newDedupedLogs.uniqueId\") \\\n.whenNotMatchedInsertAll() \\\n.execute()\n\n``` \n```\ndeltaTable\n.as(\"logs\")\n.merge(\nnewDedupedLogs.as(\"newDedupedLogs\"),\n\"logs.uniqueId = newDedupedLogs.uniqueId\")\n.whenNotMatched()\n.insertAll()\n.execute()\n\n``` \n```\ndeltaTable\n.as(\"logs\")\n.merge(\nnewDedupedLogs.as(\"newDedupedLogs\"),\n\"logs.uniqueId = newDedupedLogs.uniqueId\")\n.whenNotMatched()\n.insertAll()\n.execute();\n\n``` \nNote \nThe dataset containing the new logs needs to be deduplicated within itself. By the SQL semantics of merge, it matches and deduplicates the new data with the existing data in the table, but if there is duplicate data within the new dataset, it is inserted. Hence, deduplicate the new data before merging into the table. \nIf you know that you may get duplicate records only for a few days, you can optimize your query further by partitioning the table by date, and then specifying the date range of the target table to match on. 
\n```\nMERGE INTO logs\nUSING newDedupedLogs\nON logs.uniqueId = newDedupedLogs.uniqueId AND logs.date > current_date() - INTERVAL 7 DAYS\nWHEN NOT MATCHED AND newDedupedLogs.date > current_date() - INTERVAL 7 DAYS\nTHEN INSERT *\n\n``` \n```\ndeltaTable.alias(\"logs\").merge(\nnewDedupedLogs.alias(\"newDedupedLogs\"),\n\"logs.uniqueId = newDedupedLogs.uniqueId AND logs.date > current_date() - INTERVAL 7 DAYS\") \\\n.whenNotMatchedInsertAll(\"newDedupedLogs.date > current_date() - INTERVAL 7 DAYS\") \\\n.execute()\n\n``` \n```\ndeltaTable.as(\"logs\").merge(\nnewDedupedLogs.as(\"newDedupedLogs\"),\n\"logs.uniqueId = newDedupedLogs.uniqueId AND logs.date > current_date() - INTERVAL 7 DAYS\")\n.whenNotMatched(\"newDedupedLogs.date > current_date() - INTERVAL 7 DAYS\")\n.insertAll()\n.execute()\n\n``` \n```\ndeltaTable.as(\"logs\").merge(\nnewDedupedLogs.as(\"newDedupedLogs\"),\n\"logs.uniqueId = newDedupedLogs.uniqueId AND logs.date > current_date() - INTERVAL 7 DAYS\")\n.whenNotMatched(\"newDedupedLogs.date > current_date() - INTERVAL 7 DAYS\")\n.insertAll()\n.execute();\n\n``` \nThis is more efficient than the previous command as it looks for duplicates only in the last 7 days of logs, not the entire table. Furthermore, you can use this insert-only merge with Structured Streaming to perform continuous deduplication of the logs. \n* In a streaming query, you can use merge operation in `foreachBatch` to continuously write any streaming data to a Delta table with deduplication. See the following [streaming example](https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html#merge-in-streaming) for more information on `foreachBatch`.\n* In another streaming query, you can continuously read deduplicated data from this Delta table. 
This is possible because an insert-only merge only appends new data to the Delta table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/merge.html"} +{"content":"# What is Delta Lake?\n### Upsert into a Delta Lake table using merge\n#### Slowly changing data (SCD) and change data capture (CDC) with Delta Lake\n\nDelta Live Tables has native support for tracking and applying SCD Type 1 and Type 2. Use `APPLY CHANGES INTO` with Delta Live Tables to ensure that out of order records are handled correctly when processing CDC feeds. See [APPLY CHANGES API: Simplify change data capture in Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html).\n\n### Upsert into a Delta Lake table using merge\n#### Incrementally sync Delta table with source\n\nIn Databricks SQL and Databricks Runtime 12.2 LTS and above, you can use `WHEN NOT MATCHED BY SOURCE` to create arbitrary conditions to atomically delete and replace a portion of a table. This can be especially useful when you have a source table where records may change or be deleted for several days after initial data entry, but eventually settle to a final state. \nThe following query shows using this pattern to select 5 days of records from the source, update matching records in the target, insert new records from the source to the target, and delete all unmatched records from the past 5 days in the target. \n```\nMERGE INTO target AS t\nUSING (SELECT * FROM source WHERE created_at >= (current_date() - INTERVAL '5' DAY)) AS s\nON t.key = s.key\nWHEN MATCHED THEN UPDATE SET *\nWHEN NOT MATCHED THEN INSERT *\nWHEN NOT MATCHED BY SOURCE AND created_at >= (current_date() - INTERVAL '5' DAY) THEN DELETE\n\n``` \nBy providing the same boolean filter on the source and target tables, you are able to dynamically propagate changes from your source to target tables, including deletes. 
\nNote \nWhile this pattern can be used without any conditional clauses, this would lead to fully rewriting the target table which can be expensive.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/merge.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Read and write data in Delta Live Tables pipelines\n##### Publish data from Delta Live Tables to the Hive metastore\n\nYou can make the output data of your pipeline discoverable and available to query by publishing datasets to the [Hive metastore](https:\/\/docs.databricks.com\/archive\/external-metastores\/external-hive-metastore.html). To publish datasets to the metastore, enter a schema name in the **Target** field when you create a pipeline. You can also add a target database to an existing pipeline. \nBy default, all tables and views created in Delta Live Tables are local to the pipeline. You must publish tables to a target schema to query or use Delta Live Tables datasets outside the pipeline in which they are declared. \nTo publish tables from your pipelines to Unity Catalog, see [Use Unity Catalog with your Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html).\n\n##### Publish data from Delta Live Tables to the Hive metastore\n###### How to publish Delta Live Tables datasets to a schema\n\nYou can declare a target schema for all tables in your Delta Live Tables pipeline using the **Target schema** field in the **Pipeline settings** and **Create pipeline** UIs. \nYou can also specify a schema in a JSON configuration by setting the `target` value. \nYou must run an update for the pipeline to publish results to the target schema. \nYou can use this feature with multiple environment configurations to publish to different schemas based on the environment. 
For example, you can publish to a `dev` schema for development and a `prod` schema for production data.\n\n##### Publish data from Delta Live Tables to the Hive metastore\n###### How to query datasets in Delta Live Tables\n\nAfter an update completes, you can view the schema and tables, query the data, or use the data in downstream applications. \nOnce published, Delta Live Tables tables can be queried from any environment with access to the target schema. This includes Databricks SQL, notebooks, and other Delta Live Tables pipelines. \nImportant \nWhen you create a `target` configuration, only tables and associated metadata are published. Views are not published to the metastore.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/publish.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Read and write data in Delta Live Tables pipelines\n##### Publish data from Delta Live Tables to the Hive metastore\n###### Exclude tables from target schema\n\nIf you need to calculate intermediate tables that are not intended for external consumption, you can prevent them from being published to a schema using the `TEMPORARY` keyword. Temporary tables still store and process data according to Delta Live Tables semantics, but should not be accessed outside of the current pipeline. A temporary table persists for the lifetime of the pipeline that creates it. Use the following syntax to declare temporary tables: \n```\nCREATE TEMPORARY LIVE TABLE temp_table\nAS SELECT ... ;\n\n``` \n```\n@dlt.table(\ntemporary=True)\ndef temp_table():\nreturn (\"...\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/publish.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Load data using the add data UI\n#### Upload files to a Unity Catalog volume\n\nThe **Upload to volume** UI allows you to upload files in any format to a Unity Catalog volume, including structured, semi-structured, and unstructured data. 
See [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html). \nUploaded files cannot exceed 5 gigabytes. \n* In Databricks Runtime 13.3 LTS and above, Databricks recommends using volumes to store `.whl` libraries for compute with assigned or shared access modes.\n* In Databricks Runtime 13.3 LTS and above, Databricks recommends using volumes to store JARs and init scripts for compute with assigned or shared access modes. \nYou can create a Unity Catalog managed table from an uploaded file. See [Create table from volumes](https:\/\/docs.databricks.com\/catalog-explorer\/manage-volumes.html#create-table). \nYou can also run various machine learning and data science workloads on files uploaded to a volume. Furthermore, you can upload libraries, certificates, and other configuration files of arbitrary formats, such as .whl or .txt, that you want to use to configure cluster libraries, notebook-scoped libraries, or job dependencies.\n\n#### Upload files to a Unity Catalog volume\n##### Where can you access the UI to upload files to a volume?\n\nYou can access this UI in the following ways: \n* In the sidebar, click **New** > **Add data** > **Upload files to volume**.\n* In Catalog Explorer, click **Add** > **Upload to volume**. You can also upload files directly to a volume or to a directory in a volume while browsing volumes in Catalog Explorer.\n* From within a notebook, by clicking **File** > **Upload files to volume**. \nNote \nVolumes are only supported in Databricks Runtime 13.3 LTS and above. 
In Databricks Runtime 12.2 LTS and below, operations against `\/Volumes` paths might succeed, but might write data to ephemeral storage disks attached to compute clusters rather than persisting data to Unity Catalog volumes as expected.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-to-volume.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Load data using the add data UI\n#### Upload files to a Unity Catalog volume\n##### Before you begin\n\nBefore you upload files to a Unity Catalog volume, you must have the following: \n* A workspace with Unity Catalog enabled. For more information, see [Set up and manage Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html).\n* The `WRITE VOLUME` privilege on the volume you want to upload files to.\n* The `USE SCHEMA` privilege on the parent schema\n* The `USE CATALOG` privilege on the parent catalog. \nFor more information, see [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html).\n\n#### Upload files to a Unity Catalog volume\n##### Steps to upload files to a volume\n\nTo upload files to a Unity Catalog volume, do the following: \n1. Click **New** > **Add Data**.\n2. Click **Upload files to volume**.\n3. Select a volume or a directory inside a volume, or paste a volume path. \n* If no volume exists in the target schema, you can use the dialog to create a new volume.\n* Optionally, you can create a new directory within the target volume by specifying the full path to the target directory.\n4. 
Click the browse button or drag and drop files directly into the drop zone.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-to-volume.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### MongoDB\n\n[MongoDB](https:\/\/www.mongodb.com\/) is a document database that stores data in flexible, JSON-like documents. \nThe following notebook shows you how to read and write data to MongoDB Atlas, the hosted version of MongoDB, using Apache Spark. The [MongoDB Connector for Spark](https:\/\/docs.mongodb.com\/spark-connector\/current\/) was developed by MongoDB.\n\n#### MongoDB\n##### MongoDB notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mongodb.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/mongodb.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Performance efficiency for the data lakehouse\n##### Best practices for performance efficiency\n\nThis article covers best practices for **performance efficiency**, organized by architectural principles listed in the following sections.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/performance-efficiency\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Performance efficiency for the data lakehouse\n##### Best practices for performance efficiency\n###### Vertical scaling, horizontal scaling, and linear scalability\n\nBefore discussing the best practices, let\u2019s first look at a few concepts around distributed computing: horizontal and vertical scaling, and linear scalability: \n* **Vertical scaling** by adding or removing resources from a single computer, 
typically CPUs, memory, or GPUs. Usually, this means stopping the workload, moving it to a bigger machine, and restarting it again. Vertical scaling has limits: There might not be a bigger machine, or the price for the next bigger machine is prohibitively high.\n* **Horizontal scaling** by adding or removing nodes from a distributed system: When the limits of vertical scaling are reached, scaling horizontally is the solution: Distributed computing uses systems with several machines (called [clusters](https:\/\/en.wikipedia.org\/wiki\/Computer_cluster)) to run the workloads. It is essential to understand that for this to be possible, the workloads must be prepared for parallel execution, as supported by the engines of the Databricks lakehouse, Apache Spark, and Photon. This allows combining multiple reasonably priced machines into a larger computing system. If one needs more compute resources, then horizontal scaling adds more nodes to the cluster and removes them when no longer needed. While technically there is no limit (and the Spark engine will take over the complex part of distributing the loads), large numbers of nodes do increase the management complexity.\n* **Linear scalability**, meaning that when you add more resources to a system, the relationship between throughput and used resources is linear. This is only possible if the parallel tasks are independent. If not, intermediate results on one set of nodes will be needed on another set of nodes in the cluster for further computation. This data exchange between nodes involves transporting the results over the network from one set of nodes to another set of nodes, which takes considerable time. In general, distributed computing will always have some overhead for managing the distribution and exchange of data. As a result, small data set workloads that can be analyzed on a single node may be even slower when run on a distributed system. 
The Databricks Data Intelligence Platform provides flexible computing (single node and distributed) to meet the unique needs of your workloads.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/performance-efficiency\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Performance efficiency for the data lakehouse\n##### Best practices for performance efficiency\n###### Use serverless architectures\n\n### Use serverless compute \nWith the [serverless compute](https:\/\/docs.databricks.com\/getting-started\/overview.html#serverless) on the Databricks Data Intelligence Platform, the compute layer runs in the customer\u2019s Databricks account. Workspace admins can create serverless SQL warehouses that enable instant compute and are managed by Databricks. A serverless SQL warehouse uses compute clusters hosted in the Databricks customer account. Use them with Databricks SQL queries just like you usually would with the original Databricks SQL warehouses. Serverless compute comes with a very fast starting time for SQL warehouses (10s and below), and the infrastructure is managed by Databricks. 
\nThis leads to improved productivity: \n* Cloud administrators no longer have to manage complex cloud environments, for example by adjusting quotas, creating and maintaining networking assets, and joining billing sources.\n* Users benefit from near-zero waiting times for cluster start and improved concurrency on their queries.\n* Cloud administrators can refocus their time on higher-value projects instead of managing low-level cloud components.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/performance-efficiency\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Performance efficiency for the data lakehouse\n##### Best practices for performance efficiency\n###### Design workloads for performance\n\n### Understand your data ingestion and access patterns \nFrom a performance perspective, data access patterns - such as \u201caggregations versus point access\u201d or \u201cscan versus search\u201d - behave differently depending on the data size. Large files are more efficient for scan queries, and smaller files are better for search because less data must be read to find the specific row(s). \nFor ingestion patterns, it\u2019s common to use DML statements. DML statements are most performant when the data is clustered, and you can simply isolate the section of data. Keeping the data clustered and isolatable on ingestion is important: Consider keeping a natural time sort order and applying as many filters as possible to the ingest target table. For append-only and overwrite ingestion workloads, there isn\u2019t much to consider, as this is a relatively cheap operation. \nThe ingestion and access patterns often point to an obvious data layout and clustering. If they do not, decide what is more important to your business and optimize for that goal. 
\n### Use parallel computation where it is beneficial \nTime to value is an important dimension when working with data. While many use cases can be easily implemented on a single machine (small data, few and simple computation steps), use cases often come up that: \n* Need to process large data sets.\n* Have long running times due to complicated algorithms.\n* Must be repeated hundreds or thousands of times. \nThe cluster environment of the Databricks platform is a great environment to distribute these workloads efficiently. It automatically parallelizes SQL queries across all nodes of a cluster and it provides libraries for [Python](https:\/\/docs.databricks.com\/getting-started\/dataframes.html) and [Scala](https:\/\/docs.databricks.com\/getting-started\/dataframes.html) to do the same. Under the hood, the engines Apache Spark and Photon analyze the queries, determine the optimal way of parallel execution, and manage the distributed execution in a resilient way. \nIn the same way as batch tasks, [Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/index.html) distributes streaming jobs across the cluster for best performance. \nOne of the easiest ways to use parallel computing is [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html). You declare tasks and dependencies of a job in SQL or Python, and then Delta Live Tables takes over the execution planning, efficient infrastructure setup, job execution, and monitoring. \nFor data scientists, [pandas](https:\/\/pandas.pydata.org\/) is a Python package that provides easy-to-use data structures and data analysis tools for the Python programming language. However, pandas does not scale out to big data. [Pandas API on Spark](https:\/\/docs.databricks.com\/pandas\/pandas-on-spark.html) fills this gap by providing pandas-equivalent APIs that work on Apache Spark. 
\nAdditionally, the platform comes with parallelized algorithms for machine learning called [MLlib](https:\/\/spark.apache.org\/docs\/latest\/ml-guide.html). It also supports multi-GPU and distributed deep learning compute out of the box, for example through HorovodRunner. See [HorovodRunner: distributed deep learning with Horovod](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-runner.html). Other libraries included with the platform help distribute massively repeated tasks to all cluster nodes, cutting time to value down in a near-linear fashion. For example, [Hyperopt](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/index.html) for parallel hyperparameter optimization in ML. \n### Analyze the whole chain of execution \nMost pipelines or consumption patterns use a chain of systems. For example, for BI tools the performance is impacted by several factors: \n* The BI tool itself.\n* The connector that connects the BI tool and the SQL engine.\n* The SQL engine where the BI tool sends the query. \nFor best-in-class performance, the whole chain needs to be taken into account and selected\/tuned for best performance. \n### Prefer larger clusters \nNote \nServerless compute manages clusters automatically, so this is not needed for serverless compute. \nPlan for larger clusters, especially when the workload scales linearly. In that case, it is not more expensive to use a large cluster for a workload than to use a smaller one. It\u2019s just faster. The key is that you\u2019re renting the cluster for the length of the workload. So, if you spin up a two-worker cluster and it takes an hour, you\u2019re paying for those workers for the full hour. Similarly, if you spin up a four-worker cluster and it takes only half an hour (this is where linear scalability comes into play), the costs are the same. 
If costs are the primary driver and the SLA is very flexible, an autoscaling cluster is almost always going to be the cheapest but not necessarily the fastest. \n### Use native Spark operations \nUser Defined Functions (UDFs) are a great way to extend the functionality of Spark SQL. However, don\u2019t use Python or Scala UDFs if a native function exists: \n* [Spark SQL](https:\/\/spark.apache.org\/docs\/latest\/api\/sql\/index.html)\n* [PySpark](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/user_guide\/pandas_on_spark\/index.html) \nReasons: \n* To transfer data between Python and Spark, serialization is needed. This drastically slows down queries.\n* Implementing and testing functionality that already exists in the platform takes extra effort. \nIf native functions are missing and must be implemented as Python UDFs, use [Pandas UDFs](https:\/\/docs.databricks.com\/udf\/pandas.html). [Apache Arrow](https:\/\/arrow.apache.org\/) ensures data moves efficiently back and forth between Spark and Python. \n### Use Photon \n[Photon](https:\/\/docs.databricks.com\/compute\/photon.html) is the engine on Databricks that provides fast query performance at low cost \u2013 from data ingestion, ETL, streaming, data science, and interactive queries \u2013 directly on your data lake. Photon is compatible with Apache Spark APIs, so getting started is as easy as turning it on \u2013 no code changes and no lock-in. \nPhoton is part of a high-performance runtime that runs your existing SQL and DataFrame API calls faster and reduces your total cost per workload. Photon is used by default in Databricks SQL warehouses. \n### Understand your hardware and workload type \nNote \nServerless compute manages clusters automatically, so this is not needed for serverless compute. \nNot all cloud VMs are created equal. The different families of machines offered by cloud providers are all different enough to matter. 
There are obvious differences - RAM and cores - and more subtle differences - processor type and generation, network bandwidth guarantees, and local high-speed storage versus local disk versus remote disk. There are also differences in the \u201cspot\u201d markets. These should be understood before deciding on the best VM type for your workload. \n### Use caching \nThere are two types of caching available in Databricks: disk caching and Spark caching. Here are the characteristics of each type: \n* **Use disk cache** \nThe [disk cache](https:\/\/docs.databricks.com\/optimizations\/disk-cache.html) (formerly known as \u201cDelta cache\u201d) stores copies of remote data on the local disks (for example, SSD) of the virtual machines. It can improve the performance of a wide range of queries but cannot be used to store the results of arbitrary subqueries. The disk cache automatically detects when data files are created or deleted and updates its content accordingly. The recommended (and easiest) way to use disk caching is to choose a worker type with SSD volumes when you configure your cluster. Such workers are enabled and configured for disk caching.\n* **Avoid Spark Caching** \nThe [Spark cache](https:\/\/spark.apache.org\/docs\/latest\/sql-performance-tuning.html#caching-data-in-memory) (by using `.persist()` and `.unpersist()`) can store the result of any subquery data and data stored in formats other than Parquet (such as CSV, JSON, and ORC). However, if used at the wrong locations in a query, it might eat up all memory and can even slow down queries substantially. As a rule of thumb, avoid Spark caching unless you know exactly the impact. See [Spark caching](https:\/\/docs.databricks.com\/delta\/best-practices.html#spark-caching).\n* **Query Result Cache** \nPer cluster [caching of query results](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-caching.html) for all queries through SQL warehouses. 
To benefit from query result caching, focus on deterministic queries that, for example, don\u2019t use predicates like `= NOW()`. When a query is deterministic, and the underlying data is in Delta format and unchanged, SQL warehouses return the result directly from the query result cache.\n* **Databricks SQL UI caching** \nPer-user caching of all query and legacy dashboard results in the [Databricks SQL UI](https:\/\/docs.databricks.com\/sql\/user\/queries\/index.html).\n* **Prewarm clusters** \nNote \nServerless compute manages clusters automatically, so this is not needed for serverless compute. \nIndependent of query and data format, the first query on a cluster will always be slower than subsequent queries. This is because the different subsystems must start up and read all the data they need. Take this into account for performance benchmarking. It is also possible to attach a cluster to a ready-to-use pool. See [Pool configuration reference](https:\/\/docs.databricks.com\/compute\/pools.html). \n### Use compaction \nDelta Lake on Databricks can improve the speed of read queries on a table. One way to improve this speed is to coalesce small files into larger ones. You trigger compaction by running the OPTIMIZE command. See [Compact data files with optimize on Delta Lake](https:\/\/docs.databricks.com\/delta\/optimize.html). \nYou can also compact small files automatically using Auto Optimize. See [Consider file size tuning](https:\/\/docs.databricks.com\/lakehouse-architecture\/performance-efficiency\/best-practices.html#consider-auto-optimize). \n### Use data skipping \n**Data skipping:** Data skipping information is collected automatically when you write data into a Delta table (by default Delta Lake on Databricks collects statistics on the first 32 columns defined in your table schema). Delta Lake on Databricks takes advantage of this information (minimum and maximum values) at query time to provide faster queries. 
See [Data skipping for Delta Lake](https:\/\/docs.databricks.com\/delta\/data-skipping.html). \nFor best results, apply [Z-ordering](https:\/\/docs.databricks.com\/delta\/data-skipping.html#delta-zorder), a technique to collocate related information in the same set of files. This co-locality is automatically used on Databricks by Delta Lake data-skipping algorithms. This behavior dramatically reduces the amount of data Delta Lake on Databricks needs to read. \n**Dynamic file pruning:** [Dynamic file pruning](https:\/\/docs.databricks.com\/optimizations\/dynamic-file-pruning.html) (DFP) can significantly improve the performance of many queries on Delta tables. DFP is especially efficient for non-partitioned tables or joins on non-partitioned columns. \n### Avoid over-partitioning \nIn the past, partitioning was the most common way to skip data. However, partitioning is static and manifests as a file system hierarchy. There is no easy way to change partitions if the access patterns change over time. Often, partitioning leads to over-partitioning - in other words, too many partitions with too small files, which results in bad query performance. See [Partitions](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-partition.html). \nIn the meantime, a much better choice than partitioning is Z-ordering. \n### Consider file size tuning \nThe term *auto optimize* is sometimes used to describe functionality controlled by the settings `delta.autoCompact` and `delta.optimizeWrite`. This term has been retired in favor of describing each setting individually. See [Configure Delta Lake to control data file size](https:\/\/docs.databricks.com\/delta\/tune-file-size.html). \nAuto Optimize is particularly useful in the following scenarios: \n* Streaming use cases where latency in the order of minutes is acceptable.\n* MERGE INTO is the preferred method of writing into Delta Lake.\n* CREATE TABLE AS SELECT or INSERT INTO are commonly used operations. 
\n### Optimize join performance \n* Consider range join optimization. See [Range join optimization](https:\/\/docs.databricks.com\/optimizations\/range-join.html). \nA range join occurs when two relations are joined using a point-in-interval or interval-overlap condition. The range join optimization support in Databricks Runtime can bring orders-of-magnitude improvement in query performance but requires careful manual tuning.\n* Consider skew join optimization. \nData skew is a condition in which a table\u2019s data is unevenly distributed among partitions in the cluster. Data skew can severely degrade the performance of queries, especially those with joins. Joins between big tables require shuffling data, and the skew can lead to an extreme imbalance of work in the cluster. It\u2019s likely that data skew is affecting a query if the query appears to be stuck finishing very few tasks. To ameliorate skew, Delta Lake on Databricks SQL accepts skew hints in queries. With the information from a skew hint, Databricks Runtime can construct a better query plan that does not suffer from data skew. There are two options: \n+ If the skew is known, manual skew hints can be provided. See [Skew join optimization using skew hints](https:\/\/docs.databricks.com\/archive\/legacy\/skew-join.html).\n+ Skew join hints are not required. Skew is automatically taken care of if [adaptive query execution](https:\/\/docs.databricks.com\/optimizations\/aqe.html) (AQE) and `spark.sql.adaptive.skewJoin.enabled` are both enabled. \n### Run analyze table to collect table statistics \nRun analyze table to collect statistics on the entire table for the query planner. See [ANALYZE TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-analyze-table.html). 
\n```\nANALYZE TABLE mytable COMPUTE STATISTICS FOR ALL COLUMNS;\n\n``` \nThis information is persisted in the metastore and helps the query optimizer by: \n* Choosing the proper join type.\n* Selecting the correct build side in a hash-join.\n* Calibrating the join order in a multi-way join. \nIt should be run alongside OPTIMIZE on a daily basis and is recommended on tables < 5TB. The only caveat is that analyze table is not incremental.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/performance-efficiency\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Performance efficiency for the data lakehouse\n##### Best practices for performance efficiency\n###### Run performance testing in the scope of development\n\n### Test on data representative of production data \nRun performance testing on production data (read-only) or similar data. When using similar data, characteristics like volume, file layout, and data skews should be like production data, since this has a significant impact on performance. \n### Take prewarming of resources into account \nThe first query on a new cluster is slower than all the others: \n* In general, cluster resources need to initialize on multiple layers.\n* When caching is part of the setup, the first run ensures that the data is in the cache, which speeds up subsequent jobs. \nPrewarming resources - running specific queries for the sake of initializing resources and filling caches (for example, after a cluster restart) - can significantly increase the performance of the first queries. So, to understand the behavior for the different scenarios, test the performance of the first execution (with and without prewarming) and subsequent executions. \nTip \nInteractive workloads like dashboard refreshes can significantly benefit from prewarming. 
However, this does not apply to job clusters, where the load by design is executed only once. \n### Identify bottlenecks \nBottlenecks are areas in your workload that might worsen the overall performance when the load in production increases. Identifying these at design time and testing against higher workloads will help to keep the workloads stable in production.\n\n##### Best practices for performance efficiency\n###### Monitor performance\n\nSee [Operational Excellence - Set up monitoring, alerting and logging](https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/best-practices.html#system-monitoring).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/performance-efficiency\/best-practices.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Enable a workspace for Unity Catalog\n\nThis article explains how to enable a workspace for Unity Catalog by assigning a Unity Catalog metastore. \nImportant \nOn November 8, 2023, Databricks started to enable new workspaces for Unity Catalog automatically, with a rollout proceeding gradually. If your workspace was enabled for Unity Catalog automatically, this article does not apply to you. 
\nTo determine if your workspace is already enabled for Unity Catalog, see [Step 1: Confirm that your workspace is enabled for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#workspace).\n\n#### Enable a workspace for Unity Catalog\n##### About enabling workspaces for Unity Catalog\n\nEnabling Unity Catalog for a workspace means that: \n* Users in that workspace can potentially access the same data that users in other workspaces in your account can access, and data stewards can manage that data access centrally, across workspaces\n* Data access is audited automatically\n* Identity federation is enabled for the workspace, allowing admins to [manage identities](https:\/\/docs.databricks.com\/admin\/users-groups\/index.html) centrally using the account console and other account-level interfaces. This includes [assigning users to workspaces](https:\/\/docs.databricks.com\/admin\/users-groups\/index.html#assign-users-to-workspaces). \nTo enable a Databricks workspace for Unity Catalog, you assign the workspace to a [Unity Catalog metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#metastore). A metastore is the top-level container for data in Unity Catalog. Each metastore exposes a 3-level namespace (`catalog`.`schema`.`table`) by which data can be organized. \nYou can share a single metastore across multiple Databricks workspaces in an account. Each linked workspace has the same view of the data in the metastore, and you can manage data access control across workspaces. 
You can create one metastore per region and attach it to any number of workspaces in that region.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Enable a workspace for Unity Catalog\n##### Considerations before you enable a workspace for Unity Catalog\n\nBefore you enable a workspace for Unity Catalog, you should: \n* Understand the privileges of workspace admins in workspaces that are enabled for Unity Catalog, and review your existing workspace admin assignments. \nWorkspace admin is a privileged role that you should distribute carefully. \nWorkspace admins can manage operations for their workspace including adding users and service principals, creating clusters, and delegating other users to be workspace admins. If your workspace was enabled for Unity Catalog automatically, the workspace admin also has a number of additional privileges by default, including the ability to create most Unity Catalog object types and grant access to the ones they create. See [Admin privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html). \nIf your workspace was not enabled for Unity Catalog automatically, then your workspace admins have no more access to Unity Catalog objects by default than any other user, but they do have the ability to perform workspace management tasks such as managing job ownership and viewing notebooks, which may give indirect access to data registered in Unity Catalog. \nAccount admins can restrict workspace admin privileges using the `RestrictWorkspaceAdmins` setting. See [Restrict workspace admins](https:\/\/docs.databricks.com\/admin\/workspace-settings\/restrict-workspace-admins.html). \nIf you use workspaces to isolate user data access, you might want to use workspace-catalog bindings. 
Workspace-catalog bindings enable you to limit catalog access by workspace boundaries. For example, you can ensure that workspace admins and users can only access production data in `prod_catalog` from a production workspace environment, `prod_workspace`. The default is to share the catalog with all workspaces attached to the current metastore. Likewise, you can bind access to external locations such that they are accessible only from specified workspaces. See [(Optional) Assign a catalog to specific workspaces](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#catalog-binding) and [(Optional) Assign an external location to specific workspaces](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#workspace-binding).\n* Update any automation that has been configured to manage users, groups, and service principals, such as SCIM provisioning connectors and Terraform automation, so that they refer to account endpoints instead of workspace endpoints. See [Account-level and workspace-level SCIM provisioning](https:\/\/docs.databricks.com\/admin\/users-groups\/scim\/index.html#account-workspace-scim).\n* Be aware that enabling a workspace for Unity Catalog cannot be reversed. Once you enable the workspace, you will manage users, groups, and service principals for this workspace using account-level interfaces.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Enable a workspace for Unity Catalog\n##### Requirements\n\nBefore you can enable your workspace for Unity Catalog, you must have a Unity Catalog metastore configured for your Databricks account. 
See [Create a Unity Catalog metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Enable a workspace for Unity Catalog\n##### Enable your workspace for Unity Catalog\n\nWhen you create a metastore, you are prompted to assign workspaces to that metastore, which enables those workspaces for Unity Catalog. You can also return to the account console to enable a workspace for Unity Catalog at any time, including when you create workspaces using the account console. Note that most workspaces are automatically enabled for Unity Catalog when you create them. See [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement). \nTo enable an existing workspace for Unity Catalog using the account console: \n1. As an account admin, log in to the [account console](https:\/\/accounts.cloud.databricks.com).\n2. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. Click the metastore name.\n4. Click the **Workspaces** tab.\n5. Click **Assign to workspace**.\n6. Select one or more workspaces. You can type part of the workspace name to filter the list.\n7. Scroll to the bottom of the dialog, and click **Assign**.\n8. On the confirmation dialog, click **Enable**. \nTo enable Unity Catalog when you create a workspace using the account console: \n1. As an account admin, log in to the [account console](https:\/\/accounts.cloud.databricks.com).\n2. Click ![Workspaces Icon](https:\/\/docs.databricks.com\/_images\/workspaces-icon-account.png) **Workspaces**.\n3. Click **Create workspace**.\n4. On the Create workspace page, click the **Enable Unity Catalog** toggle.\n5. On the confirmation dialog, click **Enable**.\n6. Select the **Metastore**.\n7. 
Complete the workspace creation configuration and click **Save**. \nWhen the assignment is complete, the workspace appears in the metastore\u2019s **Workspaces** tab, and the metastore appears on the workspace\u2019s **Configuration** tab. \n### Next steps \n* [Create and manage catalogs](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html)\n* [Create and manage schemas (databases)](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-schemas.html)\n* [Create tables in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html)\n* Learn more about Unity Catalog: [What is Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Enable a workspace for Unity Catalog\n##### Remove the metastore link from a workspace\n\nTo remove a workspace\u2019s access to data in a metastore, you can unlink the metastore from the workspace. \nWarning \nIf you break the link between a workspace and a Unity Catalog metastore: \n* Users in the workspace will no longer be able to access data in the metastore.\n* You will break any notebook, query, or job that references the data managed in the metastore. \n1. As an account admin, log in to the [account console](https:\/\/accounts.cloud.databricks.com).\n2. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. Click the metastore name.\n4. On the **Workspaces** tab, find the workspace you want to remove from the metastore.\n5. Click the three-button menu at the far right of the workspace row and select **Remove from this metastore**.\n6. On the confirmation dialog, click **Unassign**. 
\nWhen the removal is complete, the workspace no longer appears in the metastore\u2019s **Workspaces** tab.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n###### Enhanced security monitoring\n\nThis article describes the enhanced security monitoring feature and how to configure it on your Databricks workspace or account.\n\n###### Enhanced security monitoring\n####### Enhanced security monitoring overview\n\nDatabricks enhanced security monitoring provides an enhanced hardened disk image and additional security monitoring agents that generate log rows that you can review using audit logs. \nThe security enhancements apply only to compute resources in the [classic compute plane](https:\/\/docs.databricks.com\/getting-started\/overview.html), such as clusters and non-serverless SQL warehouses. \nServerless compute plane resources, such as serverless SQL warehouses, do not have extra monitoring when enhanced security monitoring is enabled. \nEnhanced security monitoring includes: \n* An enhanced hardened operating system image based on [Ubuntu Advantage](https:\/\/ubuntu.com\/advantage). 
\nUbuntu Advantage is a package of enterprise security and support for open source infrastructure and applications that includes the following: \n+ A [CIS Level 1](https:\/\/www.cisecurity.org\/cis-hardened-images) hardened image.\n+ [FIPS 140-2 Level 1](https:\/\/csrc.nist.gov\/publications\/detail\/fips\/140\/2\/final) validated encryption modules.\n* An antivirus monitoring agent that generates logs that you can review.\n* A file integrity monitoring agent that generates logs that you can review.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-monitoring.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n###### Enhanced security monitoring\n####### Monitoring agents in Databricks compute plane images\n\nWhile enhanced security monitoring is enabled, there are additional security monitoring agents, including two agents that are pre-installed in the enhanced compute plane image. You cannot disable the monitoring agents that are in the enhanced compute plane disk image. \n| Monitoring agent | Location | Description | How to get output |\n| --- | --- | --- | --- |\n| File integrity monitoring | Enhanced compute plane image | Monitors for file integrity and security boundary violations. This monitor agent runs on the worker VM in your cluster. | Enable the audit log [system table](https:\/\/docs.databricks.com\/admin\/system-tables\/index.html) and review logs for [new rows](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html#capsule8). |\n| Antivirus and malware detection | Enhanced compute plane image | Scans the filesystem for viruses daily. This monitor agent runs on the VMs in your compute resources such as clusters and pro or classic SQL warehouses. The antivirus and malware detection agent scans the entire host OS filesystem and the Databricks Runtime container filesystem. Anything outside the cluster VMs is outside of its scanning scope. 
| Enable the audit log [system table](https:\/\/docs.databricks.com\/admin\/system-tables\/index.html) and review logs for [new rows](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html#clamav). |\n| Vulnerability scanning | Scanning happens in representative images in the Databricks environments. | Scans the container host (VM) for certain known vulnerabilities and CVEs. | Request scan reports on the image from your Databricks account team. | \nTo get the latest versions of monitoring agents, you can restart your clusters. If your workspace uses [automatic cluster update](https:\/\/docs.databricks.com\/admin\/clusters\/automatic-cluster-update.html), by default clusters restart if needed during the scheduled maintenance windows. If the [compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html) is enabled on a workspace, automatic cluster update is permanently enabled on that workspace. \n### File integrity monitoring \nThe enhanced compute plane image includes a file integrity monitoring service that provides runtime visibility and threat detection for compute resources (cluster workers) in the classic compute plane in your workspace. \nThe file integrity monitor output is generated within your audit logs, which you can access with system tables. See [Monitor usage with system tables](https:\/\/docs.databricks.com\/admin\/system-tables\/index.html). For the JSON schema for new auditable events that are specific to file integrity monitoring, see [File integrity monitoring events](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html#capsule8). \nImportant \nIt is your responsibility to review these logs. Databricks may, in its sole discretion, review these logs but does not make a commitment to do so. 
If the agent detects malicious activity, it is your responsibility to triage these events and open a support ticket with Databricks if the resolution or remediation requires an action by Databricks. Databricks may take action on the basis of these logs, including suspending or terminating the resources, but does not make any commitment to do so. \n### Antivirus and malware detection \nThe enhanced compute plane image includes an antivirus engine for detecting trojans, viruses, malware, and other malicious threats. The antivirus monitor scans the entire host OS filesystem and the Databricks Runtime container filesystem. Anything outside the cluster VMs is outside of its scanning scope. \nThe antivirus monitor output is generated within audit logs, which you can access with [system tables (Public Preview)](https:\/\/docs.databricks.com\/admin\/system-tables\/index.html). For the JSON schema for new auditable events that are specific to antivirus monitoring, see [Antivirus monitoring events](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html#clamav). \nWhen a new virtual machine image is built, updated signature files are included within it. \nImportant \nIt is your responsibility to review these logs. Databricks may, in its sole discretion, review these logs but does not make a commitment to do so. If the agent detects malicious activity, it is your responsibility to triage these events and open a support ticket with Databricks if the resolution or remediation requires an action by Databricks. Databricks may take action on the basis of these logs, including suspending or terminating the resources, but does not make any commitment to do so. \n### Vulnerability scanning \nA vulnerability monitor agent performs vulnerability scans of the container host (VM) for certain known CVEs. 
The scanning happens in representative images in the Databricks environments. You can request the vulnerability scan reports from your Databricks account team. \nWhen vulnerabilities are found with this agent, Databricks tracks them against its Vulnerability Management SLA and releases an updated image when available.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-monitoring.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n###### Enhanced security monitoring\n####### Management and upgrade of monitoring agents\n\nThe additional monitoring agents that are on the disk images used for the compute resources in the classic compute plane are part of the standard Databricks process for upgrading systems: \n* The classic compute plane base disk image (AMI) is owned, managed, and patched by Databricks.\n* Databricks delivers and applies security patches by releasing new AMI disk images. The delivery schedule depends on new functionality and the SLA for discovered vulnerabilities. Typical delivery is every two to four weeks.\n* The base operating system for the compute plane is Ubuntu Advantage.\n* Databricks clusters and pro or classic SQL warehouses are ephemeral by default. Upon launch, clusters and pro or classic SQL warehouses use the latest available base image. Older versions that may have security vulnerabilities are unavailable for new clusters. \n+ You are responsible for [restarting clusters](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-start) (using the UI or API) regularly to ensure they use the latest patched host VM images. \n### Monitor agent termination \nIf a monitor agent on the worker VM is found to be not running due to crash or other termination, the system will attempt to restart the agent. 
\n### Data retention policy for monitor agent data \nMonitoring logs are sent to the audit log system table or your own Amazon S3 bucket if you configured [audit log delivery](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html). Retention, ingestion, and analysis of these logs is your responsibility. \nVulnerability scanning reports and logs are retained for at least one year by Databricks. You can request the vulnerability reports from your Databricks account team.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-monitoring.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n###### Enhanced security monitoring\n####### Enable Databricks enhanced security monitoring\n\n* Your Databricks workspace must be on the Enterprise pricing tier.\n* Your Databricks account must include the Enhanced Security and Compliance add-on. For details, see the [pricing page](https:\/\/databricks.com\/product\/aws-pricing). \nTo enable the enhanced security monitoring directly on a workspace, see [Enable enhanced security and compliance features on a workspace](https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-compliance.html#aws-workspace-config). \nYou can also set an account-level default for new workspaces to enable enhanced security monitoring initially. Alternatively, you can set an account-level default to enable the compliance security profile, which automatically enables enhanced security monitoring. See [Set account-level defaults for new workspaces](https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-compliance.html#aws-account-level-defaults). \nUpdates may take up to six hours to propagate to all environments and to downstream systems like billing. 
Workloads that are actively running continue with the settings that were active at the time of starting the cluster or other compute resource, and new settings will start applying the next time these workloads are started.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-monitoring.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### View and manage job runs\n\nThis article describes the features available in the Databricks UI to view jobs you have access to, view a history of runs for a job, and view details of job runs. To learn about using the Databricks CLI to view jobs and run jobs, run the CLI commands `databricks jobs list -h`, `databricks jobs get -h`, and `databricks jobs run-now -h`. To learn about using the Jobs API, see the [Jobs API](https:\/\/docs.databricks.com\/api\/workspace\/jobs).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### View and manage job runs\n##### View jobs\n\nTo view the list of jobs you have access to, click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar. The **Jobs** tab in the Workflows UI lists information about all available jobs, such as the creator of the job, the trigger for the job, if any, and the result of the last run. \nTo change the columns displayed in the jobs list, click ![Settings icon](https:\/\/docs.databricks.com\/_images\/settings-icon.png) and select or deselect columns. \nYou can filter jobs in the Jobs list: \n* Using keywords. 
If you have the increased jobs limit feature enabled for this workspace, searching by keywords is supported only for the name, job ID, and job tag fields.\n* Selecting only the jobs you own.\n* Selecting all jobs you have permissions to access.\n* Using [tags](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#job-tags). To search for a tag created with only a key, type the key into the search box. To search for a tag created with a key and value, you can search by the key, the value, or both the key and value. For example, for a tag with the key `department` and the value `finance`, you can search for `department` or `finance` to find matching jobs. To search by the key and value, enter the key and value separated by a colon; for example, `department:finance`. \nYou can also click any column header to sort the list of jobs (either descending or ascending) by that column. When the increased jobs limit feature is enabled, you can sort only by `Name`, `Job ID`, or `Created by`. The default sorting is by `Name` in ascending order. \nClick ![Jobs Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/jobs-vertical-ellipsis.png) to access actions for the job, for example, delete the job.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### View and manage job runs\n##### View runs for a job\n\nYou can view a list of currently running and recently completed runs for all jobs you have access to, including runs started by external orchestration tools such as Apache Airflow or Azure Data Factory. To view the list of recent job runs: \n1. Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar.\n2. In the **Name** column, click a job name. The **Runs** tab appears with matrix and list views of active and completed runs. 
\nThe matrix view shows a history of runs for the job, including each job task. \nThe **Run total duration** row of the matrix displays the run\u2019s total duration and the run\u2019s state. To view details of the run, including the start time, duration, and status, hover over the bar in the **Run total duration** row. \nEach cell in the **Tasks** row represents a task and the corresponding status of the task. To view details of each task, including the start time, duration, cluster, and status, hover over the cell for that task. \nThe job run and task run bars are color-coded to indicate the status of the run. Successful runs are green, unsuccessful runs are red, and skipped runs are pink. The height of the individual job run and task run bars visually indicate the run duration. \nIf you have configured an [expected completion time](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#timeout-setting-job), the matrix view displays a warning when the duration of a run exceeds the configured time. \nBy default, the runs list view displays: \n* The start time for the run.\n* The run identifier.\n* Whether the run was triggered by a job schedule or an API request, or was manually started.\n* The time elapsed for a currently running job or the total running time for a completed run. A warning is displayed if the duration exceeds a configured [expected completion time](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#timeout-setting-job).\n* Links to the Spark logs.\n* The status of the run, either `Queued`, `Pending`, `Running`, `Skipped`, `Succeeded`, `Failed`, `Terminating`, `Terminated`, `Internal Error`, `Timed Out`, `Canceled`, `Canceling`, or `Waiting for Retry`.\n* Click ![Jobs Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/jobs-vertical-ellipsis.png) to access context-specific actions for the run, for example, stop an active run or delete a completed run. 
\nTo change the columns displayed in the runs list view, click ![Settings icon](https:\/\/docs.databricks.com\/_images\/settings-icon.png) and select or deselect columns. \nTo view [details for a job run](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#job-run-details), click the link for the run in the **Start time** column in the runs list view. To view details for this job\u2019s most recent successful run, click **Go to the latest successful run**. \nDatabricks maintains a history of your job runs for up to 60 days. If you need to preserve job runs, Databricks recommends exporting results before they expire. For more information, see [Export job run results](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#export-job-runs).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### View and manage job runs\n##### View job run details\n\nThe job run details page contains job output and links to logs, including information about the success or failure of each task in the job run. You can access job run details from the **Runs** tab for the job. To view job run details from the **Runs** tab, click the link for the run in the **Start time** column in the runs list view. To return to the **Runs** tab for the job, click the **Job ID** value. \nIf the job contains multiple tasks, click a task to view task run details, including: \n* the cluster that ran the task \n+ the Spark UI for the task\n+ logs for the task\n+ metrics for the task \nClick the **Job ID** value to return to the **Runs** tab for the job.\n\n#### View and manage job runs\n##### View task run history\n\nTo view the run history of a task, including successful and unsuccessful runs: \n1. Click on a task on the **Job run details** page. The **Task run details** page appears.\n2. 
Select the task run in the run history drop-down menu.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### View and manage job runs\n##### View recent job runs\n\nYou can view a list of currently running and recently completed runs for all jobs in a workspace that you have access to, including runs started by external orchestration tools such as Apache Airflow or Azure Data Factory. To view the list of recent job runs: \n1. Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar.\n2. Click the **Job runs** tab to display the Job runs list. \nThe **Finished runs count** graph displays the number of job runs completed in the last 48 hours. By default, the graph displays the failed, skipped, and successful job runs. You can also filter the graph to show specific run statuses or restrict the graph to a specific time range. The **Job runs** tab also includes a table of job runs from the last 67 days. By default, the table includes details on failed, skipped, and successful job runs. \nNote \nThe **Finished runs count** graph is only displayed when you click **Owned by me**. \nYou can filter the **Finished runs count** by run status: \n* To update the graph to show jobs currently running or waiting to run, click **Active runs**.\n* To update the graph to show only completed runs, including failed, successful, and skipped runs, click **Completed runs**.\n* To update the graph to show only runs that completed successfully over the last 48 hours, click **Successful runs**.\n* To update the graph to show only skipped runs, click **Skipped runs**. 
Runs are skipped because you exceeded the maximum number of concurrent runs in your workspace or the job exceeded the maximum number of concurrent runs specified by the job configuration.\n* To update the graph to show only runs that completed in an error state, click **Failed runs**. \nWhen you click any of the filter buttons, the list of runs in the runs table also updates to show only job runs that match the selected status. \nTo limit the time range displayed in the **Finished runs count** graph, click and drag your cursor in the graph to select the time range. The graph and the runs table update to display runs from only the selected time range. \nBy default, the list of runs in the runs table displays: \n* The start time for the run.\n* The name of the job associated with the run.\n* The user name that the job runs as.\n* Whether the run was triggered by a job schedule or an API request, or was manually started.\n* The time elapsed for a currently running job or the total running time for a completed run. A warning is displayed if the duration exceeds a configured [expected completion time](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#timeout-setting-job).\n* The status of the run, either `Queued`, `Pending`, `Running`, `Skipped`, `Succeeded`, `Failed`, `Terminating`, `Terminated`, `Internal Error`, `Timed Out`, `Canceled`, `Canceling`, or `Waiting for Retry`.\n* Any parameters for the run.\n* Click ![Jobs Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/jobs-vertical-ellipsis.png) to access context-specific actions for the run, for example, stop an active run or delete a completed run. \nTo change the columns displayed in the runs list, click ![Settings icon](https:\/\/docs.databricks.com\/_images\/settings-icon.png) and select or deselect columns. 
\nThe **Top 5 error types** table displays a list of the most frequent error types from the selected time range, allowing you to quickly see the most common causes of job issues in your workspace. \nTo view [job run details](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#job-run-details), click the link in the **Start time** column for the run. To view job details, click the job name in the **Job** column.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### View and manage job runs\n##### View lineage information for a job\n\nIf Unity Catalog is enabled in your workspace, you can view [lineage information](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html) for any Unity Catalog tables in your workflow. If lineage information is available for your workflow, you will see a link with a count of upstream and downstream tables in the **Job details** panel for your job, the **Job run details** panel for a job run, or the **Task run details** panel for a task run. Click the link to show the list of tables. Click a table to see detailed information in [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### View and manage job runs\n##### Export job run results\n\nYou can export notebook run results and job run logs for all job types. \n### Export notebook run results \nYou can persist job runs by exporting their results. 
For notebook job runs, you can [export](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html#export-notebook) a rendered notebook that can later be [imported](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html#import-notebook) into your Databricks workspace. \nTo export notebook run results for a job with a single task: \n1. On the job detail page, click the **View Details** link for the run in the **Run** column of the **Completed Runs (past 60 days)** table.\n2. Click **Export to HTML**. \nTo export notebook run results for a job with multiple tasks: \n1. On the job detail page, click the **View Details** link for the run in the **Run** column of the **Completed Runs (past 60 days)** table.\n2. Click the notebook task to export.\n3. Click **Export to HTML**. \n### Export job run logs \nYou can also export the logs for your job run. You can set up your job to automatically deliver logs to DBFS or S3 through the Job API. See the `new_cluster.cluster_log_conf` object in the request body passed to the [Create a new job](https:\/\/docs.databricks.com\/api\/workspace\/jobs) operation (`POST \/jobs\/create`) in the Jobs API.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n### Query databases using JDBC\n##### Query PostgreSQL with Databricks\n\nThis example queries PostgreSQL using its JDBC driver. For more details on reading, writing, configuring parallelism, and query pushdown, see [Query databases using JDBC](https:\/\/docs.databricks.com\/connect\/external-systems\/jdbc.html). \nNote \nYou may prefer Lakehouse Federation for managing queries to PostgreSQL. 
See [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/postgresql.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n### Query databases using JDBC\n##### Query PostgreSQL with Databricks\n###### Using JDBC\n\n```\ndriver = \"org.postgresql.Driver\"\n\ndatabase_host = \"<database-host-url>\"\ndatabase_port = \"5432\" # update if you use a non-default port\ndatabase_name = \"<database-name>\"\ntable = \"<table-name>\"\nuser = \"<username>\"\npassword = \"<password>\"\n\nurl = f\"jdbc:postgresql:\/\/{database_host}:{database_port}\/{database_name}\"\n\nremote_table = (spark.read\n.format(\"jdbc\")\n.option(\"driver\", driver)\n.option(\"url\", url)\n.option(\"dbtable\", table)\n.option(\"user\", user)\n.option(\"password\", password)\n.load()\n)\n\n``` \n```\nval driver = \"org.postgresql.Driver\"\n\nval database_host = \"<database-host-url>\"\nval database_port = \"5432\" \/\/ update if you use a non-default port\nval database_name = \"<database-name>\"\nval table = \"<table-name>\"\nval user = \"<username>\"\nval password = \"<password>\"\n\nval url = s\"jdbc:postgresql:\/\/${database_host}:${database_port}\/${database_name}\"\n\nval remote_table = spark.read\n.format(\"jdbc\")\n.option(\"driver\", driver)\n.option(\"url\", url)\n.option(\"dbtable\", table)\n.option(\"user\", user)\n.option(\"password\", password)\n.load()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/postgresql.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n### Query databases using JDBC\n##### Query PostgreSQL with Databricks\n###### Using the PostgreSQL connector in Databricks Runtime\n\nIn Databricks Runtime 11.3 LTS and above, you can use the named connector to query PostgreSQL. 
See the following examples: \n```\nremote_table = (spark.read\n.format(\"postgresql\")\n.option(\"dbtable\", \"schema_name.table_name\") # if schema_name not provided, default to \"public\".\n.option(\"host\", \"database_hostname\")\n.option(\"port\", \"5432\") # Optional - will use default port 5432 if not specified.\n.option(\"database\", \"database_name\")\n.option(\"user\", \"username\")\n.option(\"password\", \"password\")\n.load()\n)\n\n``` \n```\nDROP TABLE IF EXISTS postgresql_table;\nCREATE TABLE postgresql_table\nUSING postgresql\nOPTIONS (\ndbtable '<schema-name>.<table-name>' \/* if schema_name not provided, default to \"public\". *\/,\nhost '<database-host-url>',\nport '5432', \/* Optional - will use default port 5432 if not specified. *\/\ndatabase '<database-name>',\nuser '<username>',\npassword '<password>'\n);\n\n``` \n```\nval remote_table = spark.read\n.format(\"postgresql\")\n.option(\"dbtable\", \"schema_name.table_name\") \/\/ if schema_name not provided, default to \"public\".\n.option(\"host\", \"database_hostname\")\n.option(\"port\", \"5432\") \/\/ Optional - will use default port 5432 if not specified.\n.option(\"database\", \"database_name\")\n.option(\"user\", \"username\")\n.option(\"password\", \"password\")\n.load()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/postgresql.html"} +{"content":"# What is Delta Lake?\n### Compact data files with optimize on Delta Lake\n\nSee [OPTIMIZE](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-optimize.html). \nDelta Lake on Databricks can improve the speed of read queries from a table. One way to improve this speed is to coalesce small files into larger ones. \nNote \nIn Databricks Runtime 13.3 and above, Databricks recommends using clustering for Delta table layout. See [Use liquid clustering for Delta tables](https:\/\/docs.databricks.com\/delta\/clustering.html). 
\nDatabricks recommends using predictive optimization to automatically run `OPTIMIZE` for Delta tables. See [Predictive optimization for Delta Lake](https:\/\/docs.databricks.com\/optimizations\/predictive-optimization.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/optimize.html"} +{"content":"# What is Delta Lake?\n### Compact data files with optimize on Delta Lake\n#### Syntax examples\n\nYou trigger compaction by running the `OPTIMIZE` command: \n```\nOPTIMIZE delta.`\/data\/events`\n\n``` \n```\nfrom delta.tables import *\ndeltaTable = DeltaTable.forPath(spark, \"\/data\/events\")\ndeltaTable.optimize().executeCompaction()\n\n``` \n```\nimport io.delta.tables._\nval deltaTable = DeltaTable.forPath(spark, \"\/data\/events\")\ndeltaTable.optimize().executeCompaction()\n\n``` \nor, alternatively: \n```\nOPTIMIZE events\n\n``` \n```\nfrom delta.tables import *\ndeltaTable = DeltaTable.forName(spark, \"events\")\ndeltaTable.optimize().executeCompaction()\n\n``` \n```\nimport io.delta.tables._\nval deltaTable = DeltaTable.forName(spark, \"events\")\ndeltaTable.optimize().executeCompaction()\n\n``` \nIf you have a large amount of data and only want to optimize a subset of it, you can specify an optional partition predicate using `WHERE`: \n```\nOPTIMIZE events WHERE date >= '2022-11-18'\n\n``` \n```\nfrom delta.tables import *\ndeltaTable = DeltaTable.forName(spark, \"events\")\ndeltaTable.optimize().where(\"date >= '2022-11-18'\").executeCompaction()\n\n``` \n```\nimport io.delta.tables._\nval deltaTable = DeltaTable.forName(spark, \"events\")\ndeltaTable.optimize().where(\"date >= '2022-11-18'\").executeCompaction()\n\n``` \nNote \n* Bin-packing optimization is *idempotent*, meaning that if it is run twice on the same dataset, the second run has no effect.\n* Bin-packing aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number of tuples per file. 
However, the two measures are most often correlated.\n* Python and Scala APIs for executing the `OPTIMIZE` operation are available in Databricks Runtime 11.3 LTS and above. \nReaders of Delta tables use snapshot isolation, which means that they are not interrupted when `OPTIMIZE` removes unnecessary files from the transaction log. `OPTIMIZE` makes no data-related changes to the table, so a read before and after an `OPTIMIZE` has the same results. Performing `OPTIMIZE` on a table that is a streaming source does not affect any current or future streams that treat this table as a source. `OPTIMIZE` returns the file statistics (min, max, total, and so on) for the files removed and the files added by the operation. The `OPTIMIZE` statistics also contain the Z-Ordering statistics, the number of batches, and the partitions optimized. \nYou can also compact small files automatically using auto compaction. See [Auto compaction for Delta Lake on Databricks](https:\/\/docs.databricks.com\/delta\/tune-file-size.html#auto-compact).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/optimize.html"} +{"content":"# What is Delta Lake?\n### Compact data files with optimize on Delta Lake\n#### How often should I run `OPTIMIZE`?\n\nWhen you choose how often to run `OPTIMIZE`, there is a trade-off between performance and cost. For better end-user query performance, run `OPTIMIZE` more often. This will incur a higher cost because of the increased resource usage. To optimize cost, run it less often. \nDatabricks recommends that you start by running `OPTIMIZE` on a daily basis (preferably at night when spot prices are low), and then adjust the frequency to balance cost and performance trade-offs.\n\n### Compact data files with optimize on Delta Lake\n#### What\u2019s the best instance type to run `OPTIMIZE` (bin-packing and Z-Ordering) on?\n\nBoth operations are CPU-intensive and perform large amounts of Parquet decoding and encoding. \nDatabricks recommends **Compute optimized** instance types. 
`OPTIMIZE` also benefits from attached SSDs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/optimize.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Jobs timeline\n\nThe jobs timeline is a great starting point for understanding your pipeline or query. It gives you an overview of what was running, how long each step took, and if there were any failures along the way.\n\n##### Jobs timeline\n###### How to open the jobs timeline\n\nIn the Spark UI, click on **Jobs** and **Event Timeline** as highlighted in red in the following screenshot. You will see the timeline. This example shows the driver and executor 0 being added: \n![Jobs Timeline](https:\/\/docs.databricks.com\/_images\/jobs-timeline.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/jobs-timeline.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Jobs timeline\n###### What to look for\n\nThe sections below explain how to read the event timeline to discover the possible cause of your performance or cost issue. If you notice any of these trends in your timeline, the end of each corresponding section contains a link to an article that provides guidance. \n### Failing jobs or failing executors \nHere\u2019s an example of a failed job and removed executors, indicated by a red status, in the event timeline. \n![Failing Jobs](https:\/\/docs.databricks.com\/_images\/failing-jobs.png) \nIf you see failing jobs or failing executors, see [Failing jobs or executors removed](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/failing-spark-jobs.html). 
\n### Gaps in execution \nLook for gaps of a minute or more, such as in this example: \n![Job Gaps](https:\/\/docs.databricks.com\/_images\/job-gaps.png) \nThis example has several gaps, a few of which are highlighted by the red arrows. If you see gaps in your timeline, are they a minute or more? Short gaps are to be expected as the driver coordinates work. If you do have longer gaps, are they in the middle of a pipeline? Or is this cluster constantly running and so the gaps are explained by pauses in activity? You might be able to determine this based on what time your workload started and ended. \nIf you see long unexplained gaps in the middle of a pipeline, see [Gaps between Spark jobs](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/spark-job-gaps.html). \n### Long jobs \nIs the timeline dominated by one or a few long jobs? These long jobs would be something to investigate. In the following example, the workload has one job that\u2019s much longer than the others. This is a good target for investigation. \n![Long Jobs](https:\/\/docs.databricks.com\/_images\/long-jobs.png) \nClick on the longest job to dig in. For information about investigating this long stage, see [Diagnosing a long stage in Spark](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/long-spark-stage.html). \n### Many small jobs \nWhat we\u2019re looking for here is a timeline dominated by tiny jobs. It might look something like this: \n![Small Jobs](https:\/\/docs.databricks.com\/_images\/small-jobs.png) \nNotice all the tiny blue lines. Each of those is a small job that took a few seconds or less. \nIf your timeline is mostly small jobs, see [Many small Spark jobs](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/small-spark-jobs.html). \n### None of the above \nIf your timeline doesn\u2019t look like any of the above, the next step is to identify the longest job. 
Sort the jobs by duration and click on the link in the description for the longest job: \n![Identifying Longest Job](https:\/\/docs.databricks.com\/_images\/find-long-job.png) \nOnce you\u2019re in the page for the longest job, additional information about investigating this long stage is in [Diagnosing a long stage in Spark](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/long-spark-stage.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/jobs-timeline.html"} +{"content":"# Technology partners\n### Connect to reverse ETL partners using Partner Connect\n\nTo connect your Databricks workspace to a reverse ETL partner solution using Partner Connect, you typically follow the steps in this article. \nImportant \nBefore you follow the steps in this article, see the appropriate partner article for important partner-specific information. There might be differences in the connection steps between partner solutions. Some partner solutions also allow you to integrate with Databricks SQL warehouses (formerly Databricks SQL endpoints) or Databricks clusters, but not both.\n\n### Connect to reverse ETL partners using Partner Connect\n#### Requirements\n\nSee the [requirements](https:\/\/docs.databricks.com\/partner-connect\/index.html#requirements) for using Partner Connect. \nImportant \nFor partner-specific requirements, see the appropriate partner article.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/reverse-etl.html"} +{"content":"# Technology partners\n### Connect to reverse ETL partners using Partner Connect\n#### Steps to connect to a reverse ETL partner\n\nTo connect your Databricks workspace to a reverse ETL partner solution, follow the steps in this section. \nTip \nIf you have an existing partner account, Databricks recommends that you follow the steps to connect to the partner solution manually in the appropriate partner article. 
This is because the connection experience in Partner Connect is optimized for new partner accounts. \n1. In the sidebar, click ![Partner Connect button](https:\/\/docs.databricks.com\/_images\/partner-connect.png) **Partner Connect**.\n2. Click the partner tile. \nNote \nIf the partner tile has a check mark icon inside it, an administrator has already used Partner Connect to connect the partner to your workspace. Skip to step 8. The partner will use the email address for your Databricks account to prompt you to sign in to your existing partner account.\n3. If there are SQL warehouses in your workspace, select a SQL warehouse from the drop-down list. If your SQL warehouse is stopped, click **Start**.\n4. If there are no SQL warehouses in your workspace, do the following: \n1. Click **Create warehouse**. A new tab opens in your browser that displays the **New SQL Warehouse** page in the Databricks SQL UI.\n2. Follow the steps in [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n3. Return to the Partner Connect tab in your browser, then close the partner tile.\n4. Re-open the partner tile.\n5. Select the SQL warehouse you just created from the drop-down list.\n5. Select a catalog and a schema from the drop-down lists, then click **Add**. You can repeat this step to add multiple schemas. \nNote \nIf a partner doesn\u2019t support Unity Catalog with Partner Connect, the workspace default catalog is used. If your workspace isn\u2019t Unity Catalog-enabled, `hive_metastore` is used.\n6. Click **Next**. 
\nPartner Connect creates the following resources in your workspace: \n* A Databricks [service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) named **`<PARTNER>_USER`**.\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) that is associated with the **`<PARTNER>_USER`** service principal. \nPartner Connect also grants the following privileges to the **`<PARTNER>_USER`** service principal: \n* (Unity Catalog) `USE CATALOG`: Required to interact with objects within the selected catalog.\n* (Unity Catalog) `USE SCHEMA`: Required to interact with objects within the selected schema.\n* (Hive metastore) `USAGE`: Required to grant the `SELECT` and `READ METADATA` privileges for the schemas you selected.\n* `SELECT`: Grants the ability to read the schemas you selected.\n* (Hive metastore) `READ METADATA`: Grants the ability to read metadata for the schemas you selected.\n* **CAN\_USE**: Grants permissions to use the SQL warehouse you selected.\n7. Click **Next**. \nThe **Email** box displays the email address for your Databricks account. The partner uses this email address to prompt you to either create a new partner account or sign in to your existing partner account.\n8. Click **Connect to `<Partner>`** or **Sign in**. \nA new tab opens in your web browser, which displays the partner website.\n9. Complete the on-screen instructions on the partner website to create your trial partner account or sign in to your existing partner account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/reverse-etl.html"} +{"content":"# Databricks reference documentation\n","doc_uri":"https:\/\/docs.databricks.com\/reference\/spark.html"} +{"content":"# Databricks reference documentation\n### Reference for Apache Spark APIs\n\nDatabricks is built on top of Apache Spark, a unified analytics engine for big data and machine learning.
For more information, see [Apache Spark on Databricks](https:\/\/docs.databricks.com\/spark\/index.html). \nApache Spark has DataFrame APIs for operating on large datasets, which include over 100 operators, in several languages. \n* [PySpark APIs](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/index.html) for Python developers. See [Tutorial: Load and transform data using Apache Spark DataFrames](https:\/\/docs.databricks.com\/getting-started\/dataframes.html). Key classes include: \n+ [SparkSession](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/spark_session.html) - The entry point to programming Spark with the Dataset and DataFrame API.\n+ [DataFrame](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/dataframe.html) - A distributed collection of data grouped into named columns. See [DataFrames](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.pandas\/frame.html) and [DataFrame-based MLlib](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.ml.html).\n* [SparkR APIs](https:\/\/spark.apache.org\/docs\/latest\/api\/R\/reference\/index.html) for R developers. Key classes include: \n+ [SparkSession](https:\/\/spark.apache.org\/docs\/latest\/api\/R\/reference\/sparkR.session.html) - SparkSession is the entry point into SparkR. See [Starting Point: SparkSession](https:\/\/spark.apache.org\/docs\/latest\/sql-getting-started.html#starting-point-sparksession).\n+ [SparkDataFrame](https:\/\/spark.apache.org\/docs\/latest\/api\/R\/reference\/SparkDataFrame.html) - A distributed collection of data grouped into named columns. 
See [Datasets and DataFrames](https:\/\/spark.apache.org\/docs\/latest\/sql-programming-guide.html#datasets-and-dataframes), [Creating DataFrames](https:\/\/spark.apache.org\/docs\/latest\/sql-getting-started.html#creating-dataframes), and [Creating SparkDataFrames](https:\/\/spark.apache.org\/docs\/latest\/sparkr.html#creating-sparkdataframes).\n* [Scala APIs](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/index.html) for Scala developers. Key classes include: \n+ [SparkSession](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/sql\/SparkSession.html) - The entry point to programming Spark with the Dataset and DataFrame API. See [Starting Point: SparkSession](https:\/\/spark.apache.org\/docs\/latest\/sql-getting-started.html#starting-point-sparksession).\n+ [Dataset](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/sql\/Dataset.html) - A strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each `Dataset` also has an untyped view called a DataFrame, which is a `Dataset` of [Row](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/sql\/Row.html). See [Datasets and DataFrames](https:\/\/spark.apache.org\/docs\/latest\/sql-programming-guide.html#datasets-and-dataframes), [Creating Datasets](https:\/\/spark.apache.org\/docs\/latest\/sql-getting-started.html#creating-datasets), [Creating DataFrames](https:\/\/spark.apache.org\/docs\/latest\/sql-getting-started.html#creating-dataframes), and [DataFrame functions](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/sql\/functions$.html).\n* [Java APIs](https:\/\/spark.apache.org\/docs\/latest\/api\/java\/index.html) for Java developers. 
Key classes include: \n+ [SparkSession](https:\/\/spark.apache.org\/docs\/latest\/api\/java\/org\/apache\/spark\/sql\/SparkSession.html) - The entry point to programming Spark with the Dataset and DataFrame API. See [Starting Point: SparkSession](https:\/\/spark.apache.org\/docs\/latest\/sql-getting-started.html#starting-point-sparksession).\n+ [Dataset](https:\/\/spark.apache.org\/docs\/latest\/api\/java\/index.html?org\/apache\/spark\/sql\/Dataset.html) - A strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each `Dataset` also has an untyped view called a DataFrame, which is a `Dataset` of [Row](https:\/\/spark.apache.org\/docs\/latest\/api\/java\/index.html?org\/apache\/spark\/sql\/Dataset.html). See [Datasets and DataFrames](https:\/\/spark.apache.org\/docs\/latest\/sql-programming-guide.html#datasets-and-dataframes), [Creating Datasets](https:\/\/spark.apache.org\/docs\/latest\/sql-getting-started.html#creating-datasets), [Creating DataFrames](https:\/\/spark.apache.org\/docs\/latest\/sql-getting-started.html#creating-dataframes), and [DataFrame functions](https:\/\/spark.apache.org\/docs\/latest\/api\/java\/org\/apache\/spark\/sql\/functions.html). 
\nTo learn how to use the Apache Spark APIs on Databricks, see: \n* [PySpark on Databricks](https:\/\/docs.databricks.com\/pyspark\/index.html)\n* [Databricks for R developers](https:\/\/docs.databricks.com\/sparkr\/index.html)\n* [Databricks for Scala developers](https:\/\/docs.databricks.com\/languages\/scala.html)\n* For Java, you can run Java code as a [JAR job](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-jars-in-workflows.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/reference\/spark.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### New chart visualizations in Databricks\n\nImportant \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nDatabricks has released a Public Preview of new charts for visualizing data in notebooks and in Databricks SQL. These new charts feature better performance, improved colors, and faster interactivity. These charts will replace the legacy rendering library currently used by Databricks charts. \nThe following chart types are affected by the preview: \n* Area charts\n* Bar charts\n* Box charts\n* Bubble charts\n* Combo charts\n* Heatmap charts\n* Histogram charts\n* Line charts\n* Pie charts\n* Scatter charts\n\n#### New chart visualizations in Databricks\n##### Colors\n\nThe new chart visualizations feature new default colors to improve aesthetics and readability. Color is very important in clarifying what signals are in the data. These new colors have been extensively tested for readability to ensure that chart elements remain distinguishable. \n![Color palette](https:\/\/docs.databricks.com\/_images\/color-palette.png)\n\n#### New chart visualizations in Databricks\n##### Series selection\n\nWhen analyzing visualizations with multiple series, you might want to select a specific series to analyze on a chart. 
The new charts support this behavior with the following commands: \n* Click on a single legend item to select that series\n* Cmd\/Ctrl + click on a legend item to select or deselect multiple series \n![Series selection](https:\/\/docs.databricks.com\/_images\/series-selection.gif)\n\n#### New chart visualizations in Databricks\n##### Sorted tooltips\n\nWhen analyzing time series visualizations, it can be helpful to understand the ranking of each series at a given point in time. With the new chart visualizations, tooltips on line charts and unstacked bar charts are now ordered by magnitude for easier analysis. \n![sorted tooltips](https:\/\/docs.databricks.com\/_images\/sorted-tooltips.gif)\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/preview-chart-visualizations.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### New chart visualizations in Databricks\n##### Zoom\n\nFor data-dense charts, zooming in on individual data points can be helpful to investigate details and to crop outliers. To zoom in on the new charts, click and drag on the canvas. To clear the zoom, hover over the canvas and click the **Clear zoom** button in the upper right corner of the visualization. \n![zoom in to see details](https:\/\/docs.databricks.com\/_images\/zoom.gif)\n\n#### New chart visualizations in Databricks\n##### Download as PNG file\n\nAfter creating a visualization, you may want to add the visualization to a presentation. If your presentation has a brand theme, you might want to export the visualization as a PNG to ensure a transparent background. The new visualizations support downloading images as PNGs. To download a visualization as a PNG file, hover over the canvas and click the download icon.
\n![export visual as png file](https:\/\/docs.databricks.com\/_images\/png.gif)\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/preview-chart-visualizations.html"} +{"content":"# AI and Machine Learning on Databricks\n### MLOps workflows on Databricks\n\nThis article describes how you can use MLOps on the Databricks platform to optimize the performance and long-term efficiency of your machine learning (ML) systems. It includes general recommendations for an MLOps architecture and describes a generalized workflow using the Databricks platform that you can use as a model for your ML development-to-production process. \nFor more details, see [The Big Book of MLOps](https:\/\/www.databricks.com\/resources\/ebook\/the-big-book-of-mlops).\n\n### MLOps workflows on Databricks\n#### What is MLOps?\n\nMLOps is a set of processes and automated steps to manage code, data, and models. It combines DevOps, DataOps, and ModelOps. \n![MLOps lakehouse](https:\/\/docs.databricks.com\/_images\/mlops-lakehouse.png) \nML assets such as code, data, and models are developed in stages that progress from early development stages that do not have tight access limitations and are not rigorously tested, through an intermediate testing stage, to a final production stage that is tightly controlled. The Databricks platform lets you manage these assets on a single platform with unified access control. You can develop data applications and ML applications on the same platform, reducing the risks and delays associated with moving data around.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html"} +{"content":"# AI and Machine Learning on Databricks\n### MLOps workflows on Databricks\n#### General recommendations for MLOps\n\nThis section includes some general recommendations for MLOps on Databricks with links for more information. 
\n### Create a separate environment for each stage \nAn execution environment is the place where models and data are created or consumed by code. Each execution environment consists of compute instances, their runtimes and libraries, and automated jobs. \nDatabricks recommends creating separate environments for the different stages of ML code and model development with clearly defined transitions between stages. The workflow described in this article follows this process, using the common names for the stages: \n* [Development](https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html#development-stage)\n* [Staging](https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html#staging-stage)\n* [Production](https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html#production-stage) \nOther configurations can also be used to meet the specific needs of your organization. \n### Access control and versioning \nAccess control and versioning are key components of any software operations process. Databricks recommends the following: \n* **Use Git for version control.** Pipelines and code should be stored in Git for version control. Moving ML logic between stages can then be interpreted as moving code from the development branch, to the staging branch, to the release branch. Use [Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html) to integrate with your Git provider and sync notebooks and source code with Databricks workspaces. Databricks also provides additional tools for Git integration and version control; see [Developer tools and guidance](https:\/\/docs.databricks.com\/dev-tools\/index.html).\n* **Store data in a lakehouse architecture using Delta tables.** Data should be stored in a [lakehouse architecture](https:\/\/docs.databricks.com\/lakehouse\/index.html) in your cloud account. 
Both raw data and feature tables should be stored as [Delta tables](https:\/\/docs.databricks.com\/delta\/index.html) with access controls to determine who can read and modify them.\n* **Manage model development with MLflow.** You can use [MLflow](https:\/\/docs.databricks.com\/mlflow\/index.html) to track the model development process and save code snapshots, model parameters, metrics, and other metadata.\n* **Use Models in Unity Catalog to manage the model lifecycle.** Use [Models in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html) to manage model versioning, governance, and deployment status. \n### Deploy code, not models \nIn most situations, Databricks recommends that during the ML development process, you promote *code*, rather than *models*, from one environment to the next. Moving project assets this way ensures that all code in the ML development process goes through the same code review and integration testing processes. It also ensures that the production version of the model is trained on production code. For a more detailed discussion of the options and trade-offs, see [Model deployment patterns](https:\/\/docs.databricks.com\/machine-learning\/mlops\/deployment-patterns.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html"} +{"content":"# AI and Machine Learning on Databricks\n### MLOps workflows on Databricks\n#### Recommended MLOps workflow\n\nThe following sections describe a typical MLOps workflow, covering each of the three stages: development, staging, and production. 
\nThis section uses the terms \u201cdata scientist\u201d and \u201cML engineer\u201d as archetypal personas; specific roles and responsibilities in the MLOps workflow will vary between teams and organizations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html"} +{"content":"# AI and Machine Learning on Databricks\n### MLOps workflows on Databricks\n#### Development stage\n\nThe focus of the development stage is experimentation. Data scientists develop features and models and run experiments to optimize model performance. The output of the development process is ML pipeline code that can include feature computation, model training, inference, and monitoring. \n![MLOps development stage diagram](https:\/\/docs.databricks.com\/_images\/mlops-dev-diagram.png) \nThe numbered steps correspond to the numbers shown in the diagram. \n### 1. Data sources \nThe development environment is represented by the dev catalog in Unity Catalog. Data scientists have read-write access to the dev catalog as they create temporary data and feature tables in the development workspace. Models created in the development stage are registered to the dev catalog. \nIdeally, data scientists working in the development workspace also have read-only access to production data in the prod catalog. Allowing data scientists read access to production data, inference tables, and metric tables in the prod catalog enables them to analyze current production model predictions and performance. Data scientists should also be able to load production models for experimentation and analysis. \nIf it is not possible to grant read-only access to the prod catalog, a snapshot of production data can be written to the dev catalog to enable data scientists to develop and evaluate project code. \n### 2. Exploratory data analysis (EDA) \nData scientists explore and analyze data in an interactive, iterative process using notebooks. 
The goal is to assess whether the available data has the potential to solve the business problem. In this step, the data scientist begins identifying data preparation and featurization steps for model training. This ad hoc process is generally not part of a pipeline that will be deployed in other execution environments. \n[Databricks AutoML](https:\/\/docs.databricks.com\/machine-learning\/automl\/index.html) accelerates this process by generating baseline models for a dataset. AutoML performs and records a set of trials and provides a Python notebook with the source code for each trial run, so you can review, reproduce, and modify the code. AutoML also calculates summary statistics on your dataset and saves this information in a notebook that you can review. \n### 3. Code \nThe code repository contains all of the pipelines, modules, and other project files for an ML project. Data scientists create new or updated pipelines in a development (\u201cdev\u201d) branch of the project repository. Starting from EDA and the initial phases of a project, data scientists should work in a repository to share code and track changes. \n### 4. Train model (development) \nData scientists develop the model training pipeline in the development environment using tables from the dev or prod catalogs. \nThis pipeline includes 2 tasks: \n* **Training and tuning.** The training process logs model parameters, metrics, and artifacts to the MLflow Tracking server. After training and tuning hyperparameters, the final model artifact is logged to the tracking server to record a link between the model, the input data it was trained on, and the code used to generate it.\n* **Evaluation.** Evaluate model quality by testing on held-out data. The results of these tests are logged to the MLflow Tracking server. The purpose of evaluation is to determine if the newly developed model performs better than the current production model. 
Given sufficient permissions, any production model registered to the prod catalog can be loaded into the development workspace and compared against a newly trained model. \nIf your organization\u2019s governance requirements include additional information about the model, you can save it using [MLflow tracking](https:\/\/docs.databricks.com\/mlflow\/tracking.html). Typical artifacts are plain text descriptions and model interpretations such as plots produced by SHAP. Specific governance requirements may come from a data governance officer or business stakeholders. \nThe output of the model training pipeline is an ML model artifact stored in the MLflow Tracking server for the development environment. If the pipeline is executed in the staging or production workspace, the model artifact is stored in the MLflow Tracking server for that workspace. \nWhen the model training is complete, register the model to Unity Catalog. Set up your pipeline code to register the model to the catalog corresponding to the environment that the model pipeline was executed in; in this example, the dev catalog. \nWith the recommended architecture, you deploy a multitask Databricks workflow in which the first task is the model training pipeline, followed by model validation and model deployment tasks. The model training task yields a model URI that the model validation task can use. You can use [task values](https:\/\/docs.databricks.com\/workflows\/jobs\/share-task-context.html) to pass this URI to the model validation task. \n### 5. Validate and deploy model (development) \nIn addition to the model training pipeline, other pipelines such as model validation and model deployment pipelines are developed in the development environment. \n* **Model validation.** The model validation pipeline takes the model URI from the model training pipeline, loads the model from Unity Catalog, and runs validation checks. \nValidation checks depend on the context.
They can include fundamental checks such as confirming format and required metadata, and more complex checks that might be required for highly regulated industries, such as predefined compliance checks and confirming model performance on selected data slices. \nThe primary function of the model validation pipeline is to determine whether a model should proceed to the deployment step. If the model passes pre-deployment checks, it can be assigned the \u201cChallenger\u201d alias in Unity Catalog. If the checks fail, the process ends. You can configure your workflow to notify users of a validation failure. See [Add email and system notifications for job events](https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html).\n* **Model deployment.** The model deployment pipeline typically either directly promotes the newly trained \u201cChallenger\u201d model to \u201cChampion\u201d status using an alias update, or facilitates a comparison between the existing \u201cChampion\u201d model and the new \u201cChallenger\u201d model. This pipeline can also set up any required inference infrastructure, such as Model Serving endpoints. For a detailed discussion of the steps involved in the model deployment pipeline, see [Production](https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html#production-stage). \n### 6. Commit code \nAfter developing code for training, validation, deployment and other pipelines, the data scientist or ML engineer commits the dev branch changes into source control.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html"} +{"content":"# AI and Machine Learning on Databricks\n### MLOps workflows on Databricks\n#### Staging stage\n\nThe focus of this stage is testing the ML pipeline code to ensure it is ready for production. All of the ML pipeline code is tested in this stage, including code for model training as well as feature engineering pipelines, inference code, and so on. 
\nML engineers create a CI pipeline to implement the unit and integration tests run in this stage. The output of the staging process is a release branch that triggers the CI\/CD system to start the production stage. \n![MLOps staging stage diagram](https:\/\/docs.databricks.com\/_images\/mlops-staging-diagram.png) \n### 1. Data \nThe staging environment should have its own catalog in Unity Catalog for testing ML pipelines and registering models to Unity Catalog. This catalog is shown as the \u201cstaging\u201d catalog in the diagram. Assets written to this catalog are generally temporary and only retained until testing is complete. The development environment may also require access to the staging catalog for debugging purposes. \n### 2. Merge code \nData scientists develop the model training pipeline in the development environment using tables from the development or production catalogs. \n* **Pull request.** The deployment process begins when a pull request is created against the main branch of the project in source control.\n* **Unit tests (CI).** The pull request automatically builds source code and triggers unit tests. If unit tests fail, the pull request is rejected. \nUnit tests are part of the software development process and are continuously executed and added to the codebase during the development of any code. Running unit tests as part of a CI pipeline ensures that changes made in a development branch do not break existing functionality. \n### 3. Integration tests (CI) \nThe CI process then runs the integration tests. Integration tests run all pipelines (including feature engineering, model training, inference, and monitoring) to ensure that they function correctly together. The staging environment should match the production environment as closely as is reasonable. \nIf you are deploying an ML application with real-time inference, you should create and test serving infrastructure in the staging environment. 
This involves triggering the model deployment pipeline, which creates a serving endpoint in the staging environment and loads a model. \nTo reduce the time required to run integration tests, some steps can trade off between fidelity of testing and speed or cost. For example, if models are expensive or time-consuming to train, you might use small subsets of data or run fewer training iterations. For model serving, depending on production requirements, you might do full-scale load testing in integration tests, or you might just test small batch jobs or requests to a temporary endpoint. \n### 4. Merge to staging branch \nIf all tests pass, the new code is merged into the main branch of the project. If tests fail, the CI\/CD system should notify users and post results on the pull request. \nYou can schedule periodic integration tests on the main branch. This is a good idea if the branch is updated frequently with concurrent pull requests from multiple users. \n### 5. Create a release branch \nAfter CI tests have passed and the dev branch is merged into the main branch, the ML engineer creates a release branch, which triggers the CI\/CD system to update production jobs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html"} +{"content":"# AI and Machine Learning on Databricks\n### MLOps workflows on Databricks\n#### Production stage\n\nML engineers own the production environment where ML pipelines are deployed and executed. These pipelines trigger model training, validate and deploy new model versions, publish predictions to downstream tables or applications, and monitor the entire process to avoid performance degradation and instability. \nData scientists typically do not have write or compute access in the production environment. However, it is important that they have visibility to test results, logs, model artifacts, production pipeline status, and monitoring tables. 
This visibility allows them to identify and diagnose problems in production and to compare the performance of new models to models currently in production. You can grant data scientists read-only access to assets in the production catalog for these purposes. \n![MLOps production stage diagram](https:\/\/docs.databricks.com\/_images\/mlops-prod-diagram.png) \nThe numbered steps correspond to the numbers shown in the diagram. \n### 1. Train model \nThis pipeline can be triggered by code changes or by automated retraining jobs. In this step, tables from the production catalog are used for the following steps. \n* **Training and tuning.** During the training process, logs are recorded to the production environment MLflow Tracking server. These logs include model metrics, parameters, tags, and the model itself. If you use feature tables, the model is logged to MLflow using the Databricks Feature Store client, which packages the model with feature lookup information that is used at inference time. \nDuring development, data scientists may test many algorithms and hyperparameters. In the production training code, it\u2019s common to consider only the top-performing options. Limiting tuning in this way saves time and can reduce the variance from tuning in automated retraining. \nIf data scientists have read-only access to the production catalog, they may be able to determine the optimal set of hyperparameters for a model. In this case, the model training pipeline deployed in production can be executed using the selected set of hyperparameters, typically included in the pipeline as a configuration file.\n* **Evaluation.** Model quality is evaluated by testing on held-out production data. The results of these tests are logged to the MLflow tracking server. This step uses the evaluation metrics specified by data scientists in the development stage. 
These metrics may include custom code.\n* **Register model.** When model training is complete, the model artifact is saved as a registered model version at the specified model path in the production catalog in Unity Catalog. The model training task yields a model URI that the model validation task can use. You can use [task values](https:\/\/docs.databricks.com\/workflows\/jobs\/share-task-context.html) to pass this URI to the model validation task. \n### 2. Validate model \nThis pipeline uses the model URI from Step 1 and loads the model from Unity Catalog. It then executes a series of validation checks. These checks depend on your organization and use case, and can include basic format and metadata validations, performance evaluations on selected data slices, and checks against organizational requirements such as required tags or documentation. \nIf the model successfully passes all validation checks, you can assign the \u201cChallenger\u201d alias to the model version in Unity Catalog. If the model does not pass all validation checks, the process exits and users can be automatically notified. You can use tags to add key-value attributes depending on the outcome of these validation checks. For example, you could create a tag \u201cmodel\\_validation\\_status\u201d and set the value to \u201cPENDING\u201d as the tests execute, and then update it to \u201cPASSED\u201d or \u201cFAILED\u201d when the pipeline is complete. \nBecause the model is registered to Unity Catalog, data scientists working in the development environment can load this model version from the production catalog to investigate if the model fails validation. Regardless of the outcome, results are recorded to the registered model in the production catalog using annotations to the model version. \n### 3. Deploy model \nLike the validation pipeline, the model deployment pipeline depends on your organization and use case.
This section assumes that you have assigned the newly validated model the \u201cChallenger\u201d alias, and that the existing production model has been assigned the \u201cChampion\u201d alias. The first step before deploying the new model is to confirm that it performs at least as well as the current production model. \n* **Compare \u201cCHALLENGER\u201d to \u201cCHAMPION\u201d model.** You can perform this comparison offline or online. An offline comparison evaluates both models against a held-out data set and tracks results using the MLflow Tracking server. For real-time model serving, you might want to perform longer running online comparisons, such as A\/B tests or a gradual rollout of the new model. If the \u201cChallenger\u201d model version performs better in the comparison, it replaces the current \u201cChampion\u201d alias. \nDatabricks Model Serving and Databricks Lakehouse Monitoring allow you to automatically collect and monitor inference tables that contain request and response data for an endpoint. \nIf there is no existing \u201cChampion\u201d model, you might compare the \u201cChallenger\u201d model to a business heuristic or other threshold as a baseline. \nThe process described here is fully automated. If manual approval steps are required, you can set those up using workflow notifications or CI\/CD callbacks from the model deployment pipeline.\n* **Deploy model.** Batch or streaming inference pipelines can be set up to use the model with the \u201cChampion\u201d alias. For real-time use cases, you must set up the infrastructure to deploy the model as a REST API endpoint. You can create and manage this endpoint using Databricks Model Serving. If an endpoint is already in use for the current model, you can update the endpoint with the new model. Databricks Model Serving executes a zero-downtime update by keeping the existing configuration running until the new one is ready. \n### 4. 
Model Serving \nWhen configuring a Model Serving endpoint, you specify the name of the model in Unity Catalog and the version to serve. If the model version was trained using features from tables in Unity Catalog, the model stores the dependencies for the features and functions. Model Serving automatically uses this dependency graph to look up features from appropriate online stores at inference time. This approach can also be used to apply functions for data preprocessing or to compute on-demand features during model scoring. \nYou can create a single endpoint with multiple models and specify the endpoint traffic split between those models, allowing you to conduct online \u201cChampion\u201d versus \u201cChallenger\u201d comparisons. \n### 5. Inference: batch or streaming \nThe inference pipeline reads the latest data from the production catalog, executes functions to compute on-demand features, loads the \u201cChampion\u201d model, scores the data, and returns predictions. Batch or streaming inference is generally the most cost-effective option for higher throughput, higher latency use cases. For scenarios where low-latency predictions are required, but predictions can be computed offline, these batch predictions can be published to an online key-value store such as DynamoDB or Cosmos DB. \nThe registered model in Unity Catalog is referenced by its alias. The inference pipeline is configured to load and apply the \u201cChampion\u201d model version. If the \u201cChampion\u201d version is updated to a new model version, the inference pipeline automatically uses the new version for its next execution. In this way the model deployment step is decoupled from inference pipelines. \nBatch jobs typically publish predictions to tables in the production catalog, to flat files, or over a JDBC connection. Streaming jobs typically publish predictions either to Unity Catalog tables or to message queues like Apache Kafka. \n### 6. 
Lakehouse Monitoring \nLakehouse Monitoring monitors statistical properties, such as data drift and model performance, of input data and model predictions. You can create alerts based on these metrics or publish them in dashboards. \n* **Data ingestion.** This pipeline reads in logs from batch, streaming, or online inference.\n* **Check accuracy and data drift.** The pipeline computes metrics about the input data, the model\u2019s predictions, and the infrastructure performance. Data scientists specify data and model metrics during development, and ML engineers specify infrastructure metrics. You can also define custom metrics with Lakehouse Monitoring.\n* **Publish metrics and set up alerts.** The pipeline writes to tables in the production catalog for analysis and reporting. You should configure these tables to be readable from the development environment so data scientists have access for analysis. You can use Databricks SQL to create monitoring dashboards to track model performance, and set up the monitoring job or the dashboard tool to issue a notification when a metric exceeds a specified threshold.\n* **Trigger model retraining.** When monitoring metrics indicate performance issues or changes in the input data, the data scientist may need to develop a new model version. You can set up SQL alerts to notify data scientists when this happens. \n### 7. Retraining \nThis architecture supports automatic retraining using the same model training pipeline above. Databricks recommends beginning with scheduled, periodic retraining and moving to triggered retraining when needed. \n* **Scheduled.** If new data is available on a regular basis, you can create a [scheduled job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) to run the model training code on the latest available data.\n* **Triggered.** If the monitoring pipeline can identify model performance issues and send alerts, it can also trigger retraining. 
For example, if the distribution of incoming data changes significantly or if the model performance degrades, automatic retraining and redeployment can boost model performance with minimal human intervention. This can be achieved through a SQL alert to check whether a metric is anomalous (for example, check drift or model quality against a threshold). The alert can be configured to use a webhook destination, which can subsequently trigger the training workflow. \nIf the retraining pipeline or other pipelines exhibit performance issues, the data scientist may need to return to the development environment for additional experimentation to address the issues.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### Distributed training with TorchDistributor\n\nThis article describes how to perform distributed training on PyTorch ML models using [TorchDistributor](https:\/\/github.com\/apache\/spark\/blob\/master\/python\/pyspark\/ml\/torch\/distributor.py). \nTorchDistributor is an open-source module in PySpark that helps users do distributed training with PyTorch on their Spark clusters, so it lets you launch PyTorch training jobs as Spark jobs. Under-the-hood, it initializes the environment and the communication channels between the workers and utilizes the CLI command `torch.distributed.run` to run distributed training across the worker nodes. \nThe [TorchDistributor API](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/reference\/api\/pyspark.ml.torch.distributor.TorchDistributor.html) supports the methods shown in the following table. \n| Method and signature | Description |\n| --- | --- |\n| **`init(self, num_processes, local_mode, use_gpu)`** | Create an instance of TorchDistributor. 
|\n| **`run(self, main, *args)`** | Runs distributed training by invoking `main(*args)` if main is a function and runs the CLI command `torchrun main *args` if main is a file path. |\n\n##### Distributed training with TorchDistributor\n###### Requirements\n\n* Spark 3.4\n* Databricks Runtime 13.0 ML or above\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/spark-pytorch-distributor.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### Distributed training with TorchDistributor\n###### Development workflow for notebooks\n\nIf the model creation and training process happens entirely from a notebook on your local machine or a [Databricks Notebook](https:\/\/docs.databricks.com\/notebooks\/index.html), you only have to make minor changes to get your code ready for distributed training. \n1. **Prepare single node code:** Prepare and test the single node code with PyTorch, PyTorch Lightning, or other frameworks that are based on PyTorch\/PyTorch Lightning, such as the Hugging Face Trainer API.\n2. **Prepare code for standard distributed training:** You need to [convert your single process training to distributed training](https:\/\/pytorch.org\/tutorials\/intermediate\/ddp_tutorial.html). Encapsulate all of this distributed code within one training function that you can use with the `TorchDistributor`.\n3. **Move imports within training function:** Add the necessary imports, such as `import torch`, within the training function. Doing so allows you to avoid common pickling errors. Furthermore, the `device_id` that models and data are tied to is determined by: \n```\ndevice_id = int(os.environ[\"LOCAL_RANK\"])\n\n```\n4. **Launch distributed training:** Instantiate the `TorchDistributor` with the desired parameters and call `.run(*args)` to launch training. 
\nThe following is a training code example: \n```\nfrom pyspark.ml.torch.distributor import TorchDistributor\n\ndef train(learning_rate, use_gpu):\n    # Imports live inside the function so they are available on every worker.\n    import os\n    import torch\n    import torch.distributed as dist\n    from torch.nn.parallel import DistributedDataParallel as DDP\n    from torch.utils.data import DistributedSampler, DataLoader\n\n    backend = \"nccl\" if use_gpu else \"gloo\"\n    dist.init_process_group(backend)\n    device = int(os.environ[\"LOCAL_RANK\"]) if use_gpu else \"cpu\"\n    # createModel, dataset, and run_training_loop are placeholders for your own code.\n    model = DDP(createModel().to(device), device_ids=[device] if use_gpu else None)\n    sampler = DistributedSampler(dataset)\n    loader = DataLoader(dataset, sampler=sampler)\n\n    output = run_training_loop(model, loader, learning_rate)\n    dist.destroy_process_group()\n    return output\n\ndistributor = TorchDistributor(num_processes=2, local_mode=False, use_gpu=True)\ndistributor.run(train, 1e-3, True)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/spark-pytorch-distributor.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### Distributed training with TorchDistributor\n###### Migrate training from external repositories\n\nIf you have an existing distributed training procedure stored in an external repository, you can easily migrate to Databricks by doing the following: \n1. **Import the repository:** Import the external repository as a [Databricks Git folder](https:\/\/docs.databricks.com\/repos\/index.html).\n2. **Create a new notebook** Initialize a new Databricks Notebook within the repository.\n3. 
**Launch distributed training** In a notebook cell, call `TorchDistributor` like the following: \n```\nfrom pyspark.ml.torch.distributor import TorchDistributor\n\ntrain_file = \"\/path\/to\/train.py\"\nargs = [\"--learning_rate=0.001\", \"--batch_size=16\"]\ndistributor = TorchDistributor(num_processes=2, local_mode=False, use_gpu=True)\ndistributor.run(train_file, *args)\n\n```\n\n##### Distributed training with TorchDistributor\n###### Troubleshooting\n\nA common error for the notebook workflow is that objects cannot be found or pickled when running distributed training. This can happen when the library import statements are not distributed to other executors. \nTo avoid this issue, include all import statements (for example, `import torch`) *both* at the top of the training function that is called with `TorchDistributor(...).run(<func>)` and inside any other user-defined functions called in the training method. \n### CUDA failure: peer access is not supported between these two devices \nThis is a potential error on the G5 suite of GPUs on AWS.\nTo resolve this error, add the following snippet in your training code: \n```\nimport os\nos.environ[\"NCCL_P2P_DISABLE\"] = \"1\"\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/spark-pytorch-distributor.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### Distributed training with TorchDistributor\n###### Example notebooks\n\nThe following notebook examples demonstrate how to perform distributed training with PyTorch. 
\n### End-to-end distributed training on Databricks notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/torch-distributor-notebook.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n### Distributed fine-tuning a Hugging Face model notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/distributed-fine-tuning-hugging-face.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n### Distributed training on a PyTorch File notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/torch-distributor-file.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n### Distributed training using PyTorch Lightning notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/torch-distributor-lightning.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n### Distributed data loading using Petastorm notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/distributed-data-loading-petastorm.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/spark-pytorch-distributor.html"} +{"content":"# \n### View logs & assessments\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. 
\n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html) \nThis tutorial walks you through the process of viewing logs from your application: \n* **`\ud83d\uddc2\ufe0f Request Log`**: Detailed traces of the `\ud83d\udd17 Chain` executions\n* **`\ud83d\udc4d Assessment & Evaluation Results Log`**: Assessments from `\ud83d\udc64 End Users` & `\ud83e\udde0 Expert Users` and `\ud83e\udd16 LLM Judge`s \nWe will use the `\ud83d\udcac Review UI` deployed in the previous tutorial to generate a few logs and view the data. \nIt assumes you have followed the steps in [Initialize a RAG Application](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1-create-sample-app.html).\n\n### View logs & assessments\n#### Step 1: Collect Assessments from humans using the Review App\n\n1. Open the `\ud83d\udcac Review UI` you deployed in the previous step.\n2. Interact with the application by asking questions. \nWe suggest the following: \n* Press `New Chat` on the left side. \n+ Ask `what is rag studio?` followed by `how do i set up the dev environment for it?`\n* Press `New Chat` on the left side. \n+ Ask `what is mlflow?`\n![RAG application](https:\/\/docs.databricks.com\/_images\/new-chat-ui.png)\n3. After asking a question, you will see the feedback widget appear below the bot\u2019s answer. At minimum, provide a thumbs up or thumbs down signal for \u201cIs this response correct for your question?\u201d. \nBefore providing feedback: \n![feedback ui before](https:\/\/docs.databricks.com\/_images\/feedback-ui.png) \nAfter providing feedback: \n![feedback ui completed](https:\/\/docs.databricks.com\/_images\/feedback-ui-after.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/2-view-logs.html"} +{"content":"# \n### View logs & assessments\n#### Step 2: Collect Assessments from a LLM judge\n\n1. 
The sample app is configured to automatically have an `\ud83e\udd16 LLM Judge` provide Assessments for every interaction with the application. \nNote \nFor more information on configuring LLM judges, see [\ud83e\udd16 LLM Judge](https:\/\/docs.databricks.com\/rag-studio\/details\/llm-judge.html)\n2. As such, an LLM judge has already provided assessments for the questions you asked in Step 1!\n\n### View logs & assessments\n#### Step 3: Run online evaluation ETL\n\nIn the Reviewers and End Users environments, the ETL job for processing logs and assessment automatically runs. In the development environment (where we are working now), you need to manually run the ETL job. \nWarning \nRAG Studio logging is based on [Inference Tables](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html) - logs can take 10 - 30 minutes before they are ready to be ETLd. If you run the below job and do not see any results, wait 10 minutes and try again. \n1. Run the following command to start the logs ETL process. This step will take approximately 5 minutes. \n```\n.\/rag run-online-eval -e dev\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/2-view-logs.html"} +{"content":"# \n### View logs & assessments\n#### Step 4. View the logs\n\nRAG Studio stores all logs within the Unity Catalog schema that you configured. \nNote \nThe logging schema is designed to enable measurement of metrics. For more information on how these logs are used to compute metrics, see [Metrics](https:\/\/docs.databricks.com\/rag-studio\/details\/metrics.html). \n1. Open the Catalog browser and navigate to your schema.\n2. In the schema, you will see the below tables \n1. **`\ud83d\uddc2\ufe0f Request Log`**: Detailed traces of the `\ud83d\udd17 Chain` executions; created by the ETL job\n2. 
**`\ud83d\udc4d Assessment & Evaluation Results Log`**: Assessments from `\ud83d\udc64 End Users` & `\ud83e\udde0 Expert Users` and `\ud83e\udd16 LLM Judge`s; created by the ETL job\n3. Raw payload logging table: Raw payload logs that are used by the ETL job.\n![logs](https:\/\/docs.databricks.com\/_images\/log-tables.png)\n3. Let\u2019s first explore the `\ud83d\uddc2\ufe0f Request Log`. \n```\nselect * from catalog.schema.`rag_studio_databricks-docs-bot_dev_request_log`\n\n``` \n* `request`: The user\u2019s input to the bot\n* `trace`: Logs of each step executed by the app\u2019s `\ud83d\udd17 Chain`\n* `output`: The bot\u2019s generated response that was returned to the user\n![logs](https:\/\/docs.databricks.com\/_images\/request-log-ui.png)\n4. Next, let\u2019s explore the `\ud83d\udc4d Assessment & Evaluation Results Log`. Each `request.request_id` has multiple assessments. \n```\nselect * from catalog.schema.`rag_studio_databricks-docs-bot_dev_assessment_log`\n\n``` \n* `request_id`: Maps to `request.request_id` in the `\ud83d\uddc2\ufe0f Request Log`\n* `source`: Who provided the feedback - the user ID of the human or the `\ud83e\udd16 LLM Judge` ID\n* `text_assessment`: The `source`\u2019s assessment of the request\n* `output`: The bot\u2019s generated response that was returned to the user\n![logs](https:\/\/docs.databricks.com\/_images\/assessment-log-ui.png) \nNote \nThere is an additional column called `retrieval_assessments` - this is used for assessments of the `\ud83d\udd0d Retriever`. In this release of RAG Studio, retrieval assessment is only possible using a `\ud83d\udcd6 Evaluation Set` and offline evaluation. 
Future releases will include support for capturing retrieval assessments from users in the `\ud83d\udcac Review UI` and from `\ud83e\udd16 LLM Judge`s.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/2-view-logs.html"} +{"content":"# \n### View logs & assessments\n#### Follow the next tutorial!\n\n[Run offline evaluation with a \ud83d\udcd6 Evaluation Set](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/3-run-offline-eval.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/2-view-logs.html"} +{"content":"# Security and compliance guide\n## Networking\n### Serverless compute plane networking\n##### Configure a firewall for serverless compute access\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To join this preview, contact your Databricks account team. \nThis article describes how to configure a firewall for serverless compute using the Databricks account console UI. You can also use the [Network Connectivity Configurations API](https:\/\/docs.databricks.com\/api\/account\/networkconnectivity). Firewall enablement is not supported for Amazon S3 or Amazon DynamoDB. \nNote \nThere are currently no networking charges for serverless features. In a later release, you might be charged. Databricks will provide advance notice for networking pricing changes.\n\n##### Configure a firewall for serverless compute access\n###### Overview of firewall enablement for serverless compute\n\nServerless network connectivity is managed with network connectivity configurations (NCCs). Account admins create NCCs in the account console, and an NCC can be attached to one or more workspaces. \nAn NCC contains a list of IPs. When an NCC is attached to a workspace, serverless compute in that workspace uses one of those IP addresses to connect to your resources. You can add those IP addresses to the allowlist on your resource firewalls. 
\nNCC firewall enablement is only supported from serverless SQL warehouses for data sources that you manage. It is not supported from other compute resources in the serverless compute plane. \nFor more information on NCCs, see [What is a network connectivity configuration (NCC)?](https:\/\/docs.databricks.com\/security\/network\/serverless-network-security\/index.html#ncc).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/serverless-network-security\/serverless-firewall.html"} +{"content":"# Security and compliance guide\n## Networking\n### Serverless compute plane networking\n##### Configure a firewall for serverless compute access\n###### Requirements\n\n* Your workspace must be on the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n* You must be a Databricks account admin.\n* Each NCC can be attached to up to 50 workspaces. \n* Each Databricks account can have up to 10 NCCs per supported region. For the list of supported regions, see [Databricks clouds and regions](https:\/\/docs.databricks.com\/resources\/supported-regions.html).\n* Your target resource must be publicly accessible.\n\n##### Configure a firewall for serverless compute access\n###### Step 1: Create a network connectivity configuration and copy the stable IPs\n\nDatabricks recommends sharing NCCs among workspaces in the same business unit and those sharing the same region. \n1. As an account admin, go to the account console.\n2. In the sidebar, click **Cloud Resources**.\n3. Click **Network**.\n4. Click **Network Connectivity Configuration**.\n5. Click **Add Network Connectivity Configuration**.\n6. Type a name for the NCC.\n7. Choose the region. This must match your workspace region.\n8. Click **Add**.\n9. Click the **Default Rules** tab.\n10. 
Under **Stable IPs**, click **Copy all IPs** and save the list of IPs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/serverless-network-security\/serverless-firewall.html"} +{"content":"# Security and compliance guide\n## Networking\n### Serverless compute plane networking\n##### Configure a firewall for serverless compute access\n###### Step 2: Attach an NCC to workspaces\n\nYou can attach an NCC to up to 50 workspaces in the same region as the NCC. \nTo use the API to attach an NCC to a workspace, see the [Account Workspaces API](https:\/\/docs.databricks.com\/api\/account\/workspaces\/update). \n1. In the account console sidebar, click **Workspaces**.\n2. Click your workspace\u2019s name.\n3. Click **Update workspace**.\n4. In the **Network Connectivity Config** field, select your NCC. If it\u2019s not visible, confirm that you\u2019ve selected the same region for both the workspace and the NCC.\n5. Click **Update**.\n6. Wait 10 minutes for the change to take effect.\n7. Restart any running serverless SQL warehouses in the workspace.\n\n##### Configure a firewall for serverless compute access\n###### Step 3: Update your resource access rules to allowlist the IPs\n\nAdd the stable IPs to your resource access rules. For more information, see [AWS global condition context keys](https:\/\/docs.aws.amazon.com\/IAM\/latest\/UserGuide\/reference_policies_condition-keys.html#condition-keys-sourceip) in the AWS documentation. \nCreating a storage firewall also affects connectivity from classic compute plane resources to your resources. You must also update your resource access rules to allowlist the IPs used to connect to them from classic compute resources. 
\nNCC firewall enablement is not supported for Amazon S3 or Amazon DynamoDB.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/serverless-network-security\/serverless-firewall.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n#### Track ML and deep learning training runs\n###### Manage training code with MLflow runs\n\nThis article describes MLflow runs for managing machine learning training. It also includes guidance on how to manage and compare runs across [experiments](https:\/\/docs.databricks.com\/mlflow\/experiments.html). \nAn MLflow *run* corresponds to a single execution of model code. Each run records the following information: \n* **Source**: Name of the notebook that launched the run or the project name and entry point for the run. \n+ **Version**: Git commit hash if notebook is stored in a [Databricks Git folder](https:\/\/docs.databricks.com\/repos\/index.html) or run from an [MLflow Project](https:\/\/docs.databricks.com\/mlflow\/projects.html). Otherwise, notebook revision.\n+ **Start & end time**: Start and end time of the run.\n+ **Parameters**: Model parameters saved as key-value pairs. Both keys and values are strings.\n+ **Metrics**: Model evaluation metrics saved as key-value pairs. The value is numeric. Each metric can be updated throughout the course of the run (for example, to track how your model\u2019s loss function is converging), and MLflow records and lets you visualize the metric\u2019s history.\n+ **Tags**: Run metadata saved as key-value pairs. You can update tags during and after a run completes. Both keys and values are strings.\n+ **Artifacts**: Output files in any format. For example, you can record images, models (for example, a pickled scikit-learn model), and data files (for example, a Parquet file) as an artifact. 
\nAll MLflow runs are logged to the [active experiment](https:\/\/docs.databricks.com\/mlflow\/tracking.html#where-mlflow-runs-are-logged). If you have not explicitly set an experiment as the active experiment, runs are logged to the notebook experiment.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/runs.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n#### Track ML and deep learning training runs\n###### Manage training code with MLflow runs\n####### View runs\n\nYou can access a run either from its parent experiment page or directly from the notebook that created the run. \nFrom the [experiment page](https:\/\/docs.databricks.com\/mlflow\/experiments.html#experiment-page), in the runs table, click the start time of a run. \nFrom the notebook, click ![External Link](https:\/\/docs.databricks.com\/_images\/external-link.png) next to the date and time of the run in the Experiment Runs sidebar. \nThe [run screen](https:\/\/docs.databricks.com\/mlflow\/runs.html#run-details-screen) shows the parameters used for the run, the metrics resulting from the run, and any tags or notes. To display **Notes**, **Parameters**, **Metrics**, or **Tags** for this run, click ![right-pointing arrow](https:\/\/docs.databricks.com\/_images\/right-arrow.png) to the left of the label. \nYou can also access artifacts saved from a run on this screen. \n![View run](https:\/\/docs.databricks.com\/_images\/quick-start-nb-run.png) \n### Code snippets for prediction \nIf you log a model from a run, the model appears in the **Artifacts** section of this page. To display code snippets illustrating how to load and use the model to make predictions on Spark and pandas DataFrames, click the model name. 
\n![predict code snippets](https:\/\/docs.databricks.com\/_images\/model-snippets.png) \n### View the notebook or Git project used for a run \nTo view the [version of the notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#version-control) that created a run: \n* On the experiment page, click the link in the **Source** column.\n* On the run page, click the link next to **Source**.\n* From the notebook, in the Experiment Runs sidebar, click the **Notebook** icon ![Notebook Version Icon](https:\/\/docs.databricks.com\/_images\/notebook-version.png) in the box for that Experiment Run. \nThe version of the notebook associated with the run appears in the main window with a highlight bar showing the date and time of the run. \nIf the run was launched remotely from a [Git project](https:\/\/docs.databricks.com\/mlflow\/projects.html#remote-run), click the link in the **Git Commit** field to open the specific version of the project used in the run. The link in the **Source** field opens the main branch of the Git project used in the run.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/runs.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n#### Track ML and deep learning training runs\n###### Manage training code with MLflow runs\n####### Add a tag to a run\n\nTags are key-value pairs that you can create and use later to [search for runs](https:\/\/docs.databricks.com\/mlflow\/runs.html#filter-runs). \n1. From the [run page](https:\/\/docs.databricks.com\/mlflow\/runs.html#run-details-screen), click ![Tag icon](https:\/\/docs.databricks.com\/_images\/tags1.png) if it is not already open. The tags table appears. \n![tag table](https:\/\/docs.databricks.com\/_images\/tags-open.png)\n2. Click in the **Name** and **Value** fields and type the key and value for your tag.\n3. Click **Add**. 
\n![add tag](https:\/\/docs.databricks.com\/_images\/tag-add.png) \n### Edit or delete a tag for a run \nTo edit or delete an existing tag, use the icons in the **Actions** column. \n![tag actions](https:\/\/docs.databricks.com\/_images\/tag-edit-or-delete.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/runs.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n#### Track ML and deep learning training runs\n###### Manage training code with MLflow runs\n####### Reproduce the software environment of a run\n\nYou can reproduce the exact software environment for the run by clicking **Reproduce Run**. The following dialog appears: \n![Reproduce run dialog](https:\/\/docs.databricks.com\/_images\/reproduce-run.png) \nWith the default settings, when you click **Confirm**: \n* The notebook is cloned to the location shown in the dialog.\n* If the original cluster still exists, the cloned notebook is attached to the original cluster and the cluster is started.\n* If the original cluster no longer exists, a new cluster with the same configuration, including any installed libraries, is created and started. The notebook is attached to the new cluster. \nYou can select a different location for the cloned notebook and inspect the cluster configuration and installed libraries: \n* To select a different folder to save the cloned notebook, click **Edit Folder**.\n* To see the cluster spec, click **View Spec**. To clone only the notebook and not the cluster, uncheck this option.\n* To see the libraries installed on the original cluster, click **View Libraries**. 
If you don\u2019t care about installing the same libraries as on the original cluster, uncheck this option.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/runs.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n#### Track ML and deep learning training runs\n###### Manage training code with MLflow runs\n####### Manage runs\n\n### Rename run \nTo rename a run, click ![three button icon](https:\/\/docs.databricks.com\/_images\/three-button-icon.png) at the upper right corner of the run page and select **Rename**. \n### Filter runs \nYou can search for runs based on parameter or metric values. You can also search for runs by tag. \n* To search for runs that match an expression containing parameter and metric values, enter a query in the search field and click **Search**. Some query syntax examples are: \n`metrics.r2 > 0.3` \n`params.elasticNetParam = 0.5` \n`params.elasticNetParam = 0.5 AND metrics.avg_areaUnderROC > 0.3`\n* To search for runs by tag, enter tags in the format: `tags.<key>=\"<value>\"`. String values must be enclosed in quotes as shown. \n`tags.estimator_name=\"RandomForestRegressor\"` \n`tags.color=\"blue\" AND tags.size=5` \nBoth keys and values can contain spaces. If the key includes spaces, you must enclose it in backticks as shown. \n```\ntags.`my custom tag` = \"my value\"\n\n``` \nYou can also filter runs based on their state (Active or Deleted) and based on whether a model version is associated with the run. To do this, make your selections from the **State** and **Time Created** drop-down menus respectively. \n![Filter runs](https:\/\/docs.databricks.com\/_images\/quick-start-nb-experiment.png) \n### Download runs \n1. Select one or more runs.\n2. Click **Download CSV**. 
A CSV file containing the following fields is downloaded: \n```\nRun ID,Name,Source Type,Source Name,User,Status,<parameter1>,<parameter2>,...,<metric1>,<metric2>,...\n\n``` \n### Delete runs \nYou can delete runs using the Databricks Machine Learning UI with the following steps: \n1. In the experiment, select one or more runs by clicking in the checkbox to the left of the run.\n2. Click **Delete**.\n3. If the run is a parent run, decide whether you also want to delete descendant runs. This option is selected by default.\n4. Click **Delete** to confirm. Deleted runs are saved for 30 days. To display deleted runs, select **Deleted** in the State field. \n#### Bulk delete runs based on the creation time \nYou can use Python to bulk delete runs of an experiment that were created prior to or at a UNIX timestamp.\nUsing Databricks Runtime 14.1 or later, you can call the `mlflow.delete_runs` API to delete runs and return the number of runs deleted. \nThe following are the `mlflow.delete_runs` parameters: \n* `experiment_id`: The ID of the experiment containing the runs to delete.\n* `max_timestamp_millis`: The maximum creation timestamp in milliseconds since the UNIX epoch for deleting runs. Only runs created prior to or at this timestamp are deleted.\n* `max_runs`: Optional. A positive integer that indicates the maximum number of runs to delete. The maximum allowed value for `max_runs` is 10000. If not specified, `max_runs` defaults to 10000. \n```\nimport mlflow\n\n# Replace <experiment_id>, <max_timestamp_ms>, and <max_runs> with your values.\nruns_deleted = mlflow.delete_runs(\n    experiment_id=<experiment_id>,\n    max_timestamp_millis=<max_timestamp_ms>,\n    max_runs=<max_runs>\n)\n# Example:\nruns_deleted = mlflow.delete_runs(\n    experiment_id=\"4183847697906956\",\n    max_timestamp_millis=1711990504000,\n    max_runs=10\n)\n\n``` \nUsing Databricks Runtime 13.3 LTS or earlier, you can run the following client code in a Databricks Notebook. 
\n```\nfrom typing import Optional\n\ndef delete_runs(experiment_id: str,\n                max_timestamp_millis: int,\n                max_runs: Optional[int] = None) -> int:\n    \"\"\"\n    Bulk delete runs in an experiment that were created prior to or at the specified timestamp.\n    Deletes at most max_runs per request.\n\n    :param experiment_id: The ID of the experiment containing the runs to delete.\n    :param max_timestamp_millis: The maximum creation timestamp in milliseconds\n        since the UNIX epoch for deleting runs. Only runs\n        created prior to or at this timestamp are deleted.\n    :param max_runs: Optional. A positive integer indicating the maximum number\n        of runs to delete. The maximum allowed value for max_runs\n        is 10000. If not specified, max_runs defaults to 10000.\n    :return: The number of runs deleted.\n    \"\"\"\n    from mlflow.utils.databricks_utils import get_databricks_host_creds\n    from mlflow.utils.request_utils import augmented_raise_for_status\n    from mlflow.utils.rest_utils import http_request\n\n    json_body = {\"experiment_id\": experiment_id, \"max_timestamp_millis\": max_timestamp_millis}\n    if max_runs is not None:\n        json_body[\"max_runs\"] = max_runs\n    response = http_request(\n        host_creds=get_databricks_host_creds(),\n        endpoint=\"\/api\/2.0\/mlflow\/databricks\/runs\/delete-runs\",\n        method=\"POST\",\n        json=json_body,\n    )\n    augmented_raise_for_status(response)\n    return response.json()[\"runs_deleted\"]\n\n``` \nSee the Databricks Experiments API documentation for parameters and return value specifications for [deleting runs based on creation time](https:\/\/docs.databricks.com\/api\/workspace\/experiments\/deleteruns). \n### Restore runs \nYou can restore previously deleted runs using the Databricks Machine Learning UI. \n1. On the **Experiment** page, select **Deleted** in the **State** field to display deleted runs.\n2. Select one or more runs by clicking in the checkbox to the left of the run.\n3. Click **Restore**.\n4. Click **Restore** to confirm. 
To display the restored runs, select **Active** in the State field. \n#### Bulk restore runs based on the deletion time \nYou can also use Python to bulk restore runs of an experiment that were deleted at or after a UNIX timestamp.\nUsing Databricks Runtime 14.1 or later, you can call the `mlflow.restore_runs` API to restore runs and return the number of restored runs. \nThe following are the `mlflow.restore_runs` parameters: \n* `experiment_id`: The ID of the experiment containing the runs to restore.\n* `min_timestamp_millis`: The minimum deletion timestamp in milliseconds since the UNIX epoch for restoring runs. Only runs deleted at or after this timestamp are restored.\n* `max_runs`: Optional. A positive integer that indicates the maximum number of runs to restore. The maximum allowed value for `max_runs` is 10000. If not specified, `max_runs` defaults to 10000. \n```\nimport mlflow\n\n# Replace <experiment_id>, <min_timestamp_ms>, and <max_runs> with your values.\nruns_restored = mlflow.restore_runs(\n    experiment_id=<experiment_id>,\n    min_timestamp_millis=<min_timestamp_ms>,\n    max_runs=<max_runs>\n)\n# Example:\nruns_restored = mlflow.restore_runs(\n    experiment_id=\"4183847697906956\",\n    min_timestamp_millis=1711990504000,\n    max_runs=10\n)\n\n``` \nUsing Databricks Runtime 13.3 LTS or earlier, you can run the following client code in a Databricks Notebook. \n```\nfrom typing import Optional\n\ndef restore_runs(experiment_id: str,\n                 min_timestamp_millis: int,\n                 max_runs: Optional[int] = None) -> int:\n    \"\"\"\n    Bulk restore runs in an experiment that were deleted at or after the specified timestamp.\n    Restores at most max_runs per request.\n\n    :param experiment_id: The ID of the experiment containing the runs to restore.\n    :param min_timestamp_millis: The minimum deletion timestamp in milliseconds\n        since the UNIX epoch for restoring runs. Only runs\n        deleted at or after this timestamp are restored.\n    :param max_runs: Optional. 
A positive integer indicating the maximum number\n        of runs to restore. The maximum allowed value for max_runs\n        is 10000. If not specified, max_runs defaults to 10000.\n    :return: The number of runs restored.\n    \"\"\"\n    from mlflow.utils.databricks_utils import get_databricks_host_creds\n    from mlflow.utils.request_utils import augmented_raise_for_status\n    from mlflow.utils.rest_utils import http_request\n\n    json_body = {\"experiment_id\": experiment_id, \"min_timestamp_millis\": min_timestamp_millis}\n    if max_runs is not None:\n        json_body[\"max_runs\"] = max_runs\n    response = http_request(\n        host_creds=get_databricks_host_creds(),\n        endpoint=\"\/api\/2.0\/mlflow\/databricks\/runs\/restore-runs\",\n        method=\"POST\",\n        json=json_body,\n    )\n    augmented_raise_for_status(response)\n    return response.json()[\"runs_restored\"]\n\n``` \nSee the Databricks Experiments API documentation for parameters and return value specifications for [restoring runs based on deletion time](https:\/\/docs.databricks.com\/api\/workspace\/experiments\/restoreruns).\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/runs.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n#### Track ML and deep learning training runs\n###### Manage training code with MLflow runs\n####### Compare runs\n\nYou can compare runs from a single experiment or from multiple experiments. The **Comparing Runs** page presents information about the selected runs in graphic and tabular formats. You can also create visualizations of run results and tables of run information, run parameters, and metrics. \nTo create a visualization: \n1. Select the plot type (**Parallel Coordinates Plot**, **Scatter Plot**, or **Contour Plot**). \n1. For a **Parallel Coordinates Plot**, select the parameters and metrics to plot. 
From here, you can identify relationships between the selected parameters and metrics, which helps you better define the hyperparameter tuning space for your models. \n![compare runs page visualization](https:\/\/docs.databricks.com\/_images\/mlflow-run-comparison-viz.png)\n2. For a **Scatter Plot** or **Contour Plot**, select the parameter or metric to display on each axis. \nThe **Parameters** and **Metrics** tables display the run parameters and metrics from all selected runs. The columns in these tables are identified by the **Run details** table immediately above. For simplicity, you can hide parameters and metrics that are identical in all selected runs by toggling ![Show diff only button](https:\/\/docs.databricks.com\/_images\/show-diff-only.png). \n![compare runs page tables](https:\/\/docs.databricks.com\/_images\/mlflow-run-comparison-table.png) \n### Compare runs from a single experiment \n1. On the [experiment page](https:\/\/docs.databricks.com\/mlflow\/experiments.html), select two or more runs by clicking in the checkbox to the left of the run, or select all runs by checking the box at the top of the column.\n2. Click **Compare**. The Comparing `<N>` Runs screen appears. \n### Compare runs from multiple experiments \n1. On the [experiments page](https:\/\/docs.databricks.com\/mlflow\/experiments.html), select the experiments you want to compare by clicking in the box at the left of the experiment name.\n2. Click **Compare (n)** (**n** is the number of experiments you selected). A screen appears showing all of the runs from the experiments you selected.\n3. Select two or more runs by clicking in the checkbox to the left of the run, or select all runs by checking the box at the top of the column.\n4. Click **Compare**. 
The Comparing `<N>` Runs screen appears.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/runs.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n#### Track ML and deep learning training runs\n###### Manage training code with MLflow runs\n####### Copy runs between workspaces\n\nTo import or export MLflow runs to or from your Databricks workspace, you can use the community-driven open source project [MLflow Export-Import](https:\/\/github.com\/mlflow\/mlflow-export-import#why-use-mlflow-export-import).\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/runs.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Structured Streaming patterns on Databricks\n\nThis article contains notebooks and code samples for common patterns for working with Structured Streaming on Databricks.\n\n#### Structured Streaming patterns on Databricks\n##### Getting started with Structured Streaming\n\nIf you are brand new to Structured Streaming, see [Run your first Structured Streaming workload](https:\/\/docs.databricks.com\/structured-streaming\/tutorial.html).\n\n#### Structured Streaming patterns on Databricks\n##### Write to Cassandra as a sink for Structured Streaming in Python\n\n[Apache Cassandra](https:\/\/cassandra.apache.org\/) is a distributed, low-latency, scalable, highly-available OLTP database. \nStructured Streaming works with Cassandra through the [Spark Cassandra Connector](https:\/\/github.com\/datastax\/spark-cassandra-connector). This connector supports both RDD and DataFrame APIs, and it has native support for writing streaming data.\n**Important**: You must use the corresponding version of the [spark-cassandra-connector-assembly](https:\/\/mvnrepository.com\/artifact\/com.datastax.spark\/spark-cassandra-connector-assembly). \nThe following example connects to one or more hosts in a Cassandra database cluster. 
It also specifies connection configurations such as the checkpoint location and the specific keyspace and table names: \n```\nspark.conf.set(\"spark.cassandra.connection.host\", \"host1,host2\")\n\ndf.writeStream \\\n.format(\"org.apache.spark.sql.cassandra\") \\\n.outputMode(\"append\") \\\n.option(\"checkpointLocation\", \"\/path\/to\/checkpoint\") \\\n.option(\"keyspace\", \"keyspace_name\") \\\n.option(\"table\", \"table_name\") \\\n.start()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/examples.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Structured Streaming patterns on Databricks\n##### Write to Azure Synapse Analytics using `foreachBatch()` in Python\n\n`streamingDF.writeStream.foreachBatch()` allows you to reuse existing batch data writers to write the\noutput of a streaming query to Azure Synapse Analytics. See the [foreachBatch documentation](https:\/\/docs.databricks.com\/structured-streaming\/foreach.html) for details. \nTo run this example, you need the Azure Synapse Analytics connector. For details on the Azure Synapse Analytics connector, see [Query data in Azure Synapse Analytics](https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html). 
\n```\nfrom pyspark.sql.functions import *\nfrom pyspark.sql import *\n\ndef writeToSQLWarehouse(df, epochId):\n    df.write \\\n        .format(\"com.databricks.spark.sqldw\") \\\n        .mode('overwrite') \\\n        .option(\"url\", \"jdbc:sqlserver:\/\/<the-rest-of-the-connection-string>\") \\\n        .option(\"forward_spark_azure_storage_credentials\", \"true\") \\\n        .option(\"dbtable\", \"my_table_in_dw_copy\") \\\n        .option(\"tempdir\", \"wasbs:\/\/<your-container-name>@<your-storage-account-name>.blob.core.windows.net\/<your-directory-name>\") \\\n        .save()\n\nspark.conf.set(\"spark.sql.shuffle.partitions\", \"1\")\n\nquery = (\n    spark.readStream.format(\"rate\").load()\n    .selectExpr(\"value % 10 as key\")\n    .groupBy(\"key\")\n    .count()\n    .toDF(\"key\", \"count\")\n    .writeStream\n    .foreachBatch(writeToSQLWarehouse)\n    .outputMode(\"update\")\n    .start()\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/examples.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Structured Streaming patterns on Databricks\n##### Write to Amazon DynamoDB using `foreach()` in Scala and Python\n\n`streamingDF.writeStream.foreach()` allows you to write the output of a streaming query to arbitrary locations. \n### Use Python \nThis example shows how to use `streamingDataFrame.writeStream.foreach()` in Python to write to DynamoDB. The first step gets the DynamoDB boto resource. This example is written to use `access_key` and `secret_key`, but Databricks recommends that you use instance profiles. See [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html). \n1. Define a few helper methods to create a DynamoDB table for running the example. 
\n```\ntable_name = \"PythonForeachTest\"\n\ndef get_dynamodb():\n    import boto3\n\n    access_key = \"<access key>\"\n    secret_key = \"<secret key>\"\n    region = \"<region name>\"\n    return boto3.resource('dynamodb',\n                          aws_access_key_id=access_key,\n                          aws_secret_access_key=secret_key,\n                          region_name=region)\n\ndef createTableIfNotExists():\n    '''\n    Create a DynamoDB table if it does not exist.\n    This must be run on the Spark driver, and not inside foreach.\n    '''\n    dynamodb = get_dynamodb()\n\n    existing_tables = dynamodb.meta.client.list_tables()['TableNames']\n    if table_name not in existing_tables:\n        print(\"Creating table %s\" % table_name)\n        table = dynamodb.create_table(\n            TableName=table_name,\n            KeySchema=[ { 'AttributeName': 'key', 'KeyType': 'HASH' } ],\n            AttributeDefinitions=[ { 'AttributeName': 'key', 'AttributeType': 'S' } ],\n            ProvisionedThroughput = { 'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5 }\n        )\n\n        print(\"Waiting for table to be ready\")\n\n        table.meta.client.get_waiter('table_exists').wait(TableName=table_name)\n\n```\n2. Define the classes and methods that write to DynamoDB and then call them from `foreach`. There are two ways to specify your custom logic in `foreach`. \n* Use a function: This is the simple approach that can be used to write one row at a time. However, client\/connection initialization to write a row will be done in every call. \n```\ndef sendToDynamoDB_simple(row):\n    '''\n    Function to send a row to DynamoDB.\n    When used with `foreach`, this method is going to be called in the executor\n    with the generated output rows.\n    '''\n    # Create client object in the executor,\n    # do not use client objects created in the driver\n    dynamodb = get_dynamodb()\n\n    dynamodb.Table(table_name).put_item(\n        Item = { 'key': str(row['key']), 'count': row['count'] })\n\n```\n* Use a class with `open`, `process`, and `close` methods: This allows for a more efficient implementation where a client\/connection is initialized once and multiple rows can be written out. 
\n```\nclass SendToDynamoDB_ForeachWriter:\n    '''\n    Class to send a set of rows to DynamoDB.\n    When used with `foreach`, copies of this class are going to be used to write\n    multiple rows in the executor. See the Python docs for `DataStreamWriter.foreach`\n    for more details.\n    '''\n\n    def open(self, partition_id, epoch_id):\n        # This is called first when preparing to send multiple rows.\n        # Put all the initialization code inside open() so that a fresh\n        # copy of this class is initialized in the executor where open()\n        # will be called.\n        self.dynamodb = get_dynamodb()\n        return True\n\n    def process(self, row):\n        # This is called for each row after open() has been called.\n        # This implementation sends one row at a time.\n        # For further enhancements, contact the Spark+DynamoDB connector\n        # team: https:\/\/github.com\/audienceproject\/spark-dynamodb\n        self.dynamodb.Table(table_name).put_item(\n            Item = { 'key': str(row['key']), 'count': row['count'] })\n\n    def close(self, err):\n        # This is called after all the rows have been processed.\n        if err:\n            raise err\n\n```\n3. Invoke `foreach` in your streaming query with the above function or object. \n```\nfrom pyspark.sql.functions import *\n\nspark.conf.set(\"spark.sql.shuffle.partitions\", \"1\")\n\nquery = (\n    spark.readStream.format(\"rate\").load()\n    .selectExpr(\"value % 10 as key\")\n    .groupBy(\"key\")\n    .count()\n    .toDF(\"key\", \"count\")\n    .writeStream\n    .foreach(SendToDynamoDB_ForeachWriter())\n    # .foreach(sendToDynamoDB_simple)  # alternative, use one or the other\n    .outputMode(\"update\")\n    .start()\n)\n\n``` \n### Use Scala \nThis example shows how to use `streamingDataFrame.writeStream.foreach()` in Scala to write to DynamoDB. \nTo run this you will have to create a DynamoDB table that has a single string key named \u201cvalue\u201d. \n1. Define an implementation of the `ForeachWriter` interface that performs the write. 
\n```\nimport org.apache.spark.sql.{ForeachWriter, Row}\nimport com.amazonaws.AmazonServiceException\nimport com.amazonaws.auth._\nimport com.amazonaws.services.dynamodbv2.AmazonDynamoDB\nimport com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder\nimport com.amazonaws.services.dynamodbv2.model.AttributeValue\nimport com.amazonaws.services.dynamodbv2.model.ResourceNotFoundException\nimport java.util.ArrayList\n\nimport scala.collection.JavaConverters._\n\nclass DynamoDbWriter extends ForeachWriter[Row] {\nprivate val tableName = \"<table name>\"\nprivate val accessKey = \"<aws access key>\"\nprivate val secretKey = \"<aws secret key>\"\nprivate val regionName = \"<region>\"\n\n\/\/ This will lazily be initialized only when open() is called\nlazy val ddb = AmazonDynamoDBClientBuilder.standard()\n.withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials(accessKey, secretKey)))\n.withRegion(regionName)\n.build()\n\n\/\/\n\/\/ This is called first when preparing to send multiple rows.\n\/\/ Put all the initialization code inside open() so that a fresh\n\/\/ copy of this class is initialized in the executor where open()\n\/\/ will be called.\n\/\/\ndef open(partitionId: Long, epochId: Long) = {\nddb \/\/ force the initialization of the client\ntrue\n}\n\n\/\/\n\/\/ This is called for each row after open() has been called.\n\/\/ This implementation sends one row at a time.\n\/\/ A more efficient implementation can be to send batches of rows at a time.\n\/\/\ndef process(row: Row) = {\nval rowAsMap = row.getValuesMap(row.schema.fieldNames)\nval dynamoItem = rowAsMap.mapValues {\nv: Any => new AttributeValue(v.toString)\n}.asJava\n\nddb.putItem(tableName, dynamoItem)\n}\n\n\/\/\n\/\/ This is called after all the rows have been processed.\n\/\/\ndef close(errorOrNull: Throwable) = {\nddb.shutdown()\n}\n}\n\n```\n2. Use the `DynamoDbWriter` to write a rate stream into DynamoDB. 
\n```\nspark.readStream\n.format(\"rate\")\n.load()\n.select(\"value\")\n.writeStream\n.foreach(new DynamoDbWriter)\n.start()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/examples.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Structured Streaming patterns on Databricks\n##### Stream-Stream joins\n\nThese two notebooks show how to use stream-stream joins in Python and Scala. \n### Stream-Stream joins Python notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/stream-stream-joins-python.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n### Stream-Stream joins Scala notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/stream-stream-joins-scala.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/examples.html"} +{"content":"# \n","doc_uri":"https:\/\/docs.databricks.com\/workspace-index.html"} +{"content":"# \n### Databricks data engineering\n\nDatabricks data engineering features are a robust environment for collaboration among data scientists, data engineers, and data analysts. Data engineering tasks are also the backbone of [Databricks machine learning](https:\/\/docs.databricks.com\/machine-learning\/index.html) solutions. \nNote \nIf you are a data analyst who works primarily with SQL queries and BI tools, you might prefer [Databricks SQL](https:\/\/docs.databricks.com\/sql\/index.html). \nThe data engineering documentation provides how-to guidance to help you get the most out of the Databricks collaborative analytics platform. 
For getting started tutorials and introductory information, see [Get started: Account and workspace setup](https:\/\/docs.databricks.com\/getting-started\/index.html) and [What is Databricks?](https:\/\/docs.databricks.com\/introduction\/index.html). \n* [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html)Learn how to build data pipelines for ingestion and transformation with Databricks Delta Live Tables.\n* [Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/index.html)Learn about streaming, incremental, and real-time workloads powered by Structured Streaming on Databricks.\n* [Apache Spark](https:\/\/docs.databricks.com\/spark\/index.html)Learn how Apache Spark works on Databricks and the Databricks platform.\n* [Notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html)Learn what a Databricks notebook is, and how to use and manage notebooks to process, analyze, and visualize your data.\n* [Workflows](https:\/\/docs.databricks.com\/workflows\/index.html)Learn how to orchestrate data processing, machine learning, and data analysis workflows on the Databricks Data Intelligence Platform.\n* [Libraries](https:\/\/docs.databricks.com\/libraries\/index.html)Learn how to make third-party or custom code available in Databricks using libraries. 
Learn about the different modes for installing libraries on Databricks.\n* [Init scripts](https:\/\/docs.databricks.com\/init-scripts\/index.html)Learn how to use initialization (init) scripts to install packages and libraries, set system properties and environment variables, modify Apache Spark config parameters, and set other configurations on Databricks clusters.\n* [Git folders](https:\/\/docs.databricks.com\/repos\/index.html)Learn how to use Git to version control your notebooks and other files for development in Databricks.\n* [DBFS](https:\/\/docs.databricks.com\/dbfs\/index.html)Learn about Databricks File System (DBFS), a distributed file system mounted into a Databricks workspace and available on Databricks clusters.\n* [Files](https:\/\/docs.databricks.com\/files\/index.html)Learn about options for working with files on Databricks.\n* [Migration](https:\/\/docs.databricks.com\/migration\/index.html)Learn how to migrate data applications such as ETL jobs, enterprise data warehouses, ML, data science, and analytics to Databricks.\n* [Optimization & performance](https:\/\/docs.databricks.com\/optimizations\/index.html)Learn about optimizations and performance recommendations on Databricks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace-index.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Use the Lakeview APIs to create and manage dashboards\n\nThe Lakeview APIs provide tools specifically for managing Lakeview dashboards. This article demonstrates how to create a new Lakeview dashboard from an existing legacy dashboard. Then, it shows how to use the Lakeview API to manage the dashboard.\n\n##### Use the Lakeview APIs to create and manage dashboards\n###### Prerequisites\n\n* You need a personal access token to connect with your workspace. 
See [Databricks personal access token authentication](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html).\n* You need the URL of each workspace that you want to access. See [Workspace instance names, URLs, and IDs](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-instance-names-urls-and-ids).\n* You need familiarity with the [Databricks REST API reference](https:\/\/docs.databricks.com\/api\/workspace\/introduction).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/lakeview-crud-api.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Use the Lakeview APIs to create and manage dashboards\n###### Migrate a dashboard\n\nYou can create a new Lakeview dashboard from an existing legacy dashboard. The **Migrate dashboard** endpoint in the Lakeview API requires the `source_dashboard_id`. Optionally, you can include a display name and a path where you want the new dashboard to be stored. \n### Get Databricks SQL dashboards \nTo get the `source_dashboard_id`, use the Databricks SQL dashboards API to get a list of all of the dashboards in your workspace. Each dashboard object in the `results` list includes a UUID that you can use to refer to the legacy dashboard across Databricks REST API services. \nThe following example shows a sample request and response for the **Get dashboard objects** endpoint. Some response details have been omitted for clarity. See [GET \/api\/2.0\/preview\/sql\/dashboards](https:\/\/docs.databricks.com\/api\/2.0\/preview\/sql\/dashboards) for a full description of this endpoint and sample response. \nThe UUID for a legacy dashboard is the `id` from the top level of the list of objects returned in `results`. For legacy dashboards, the UUID looks like `4e443c27-9f61-4f2e-a12d-ea5668460bf1`. 
\n```\nGET \/api\/2.0\/preview\/sql\/dashboards\n\nQuery Parameters:\n\n{\n\"page_size\": <optional>,\n\"page\": <optional>,\n\"order\": <optional>,\n\"q\": <optional>\n}\n\nResponse:\n\n{\n\"count\": 1,\n\"page\": 1,\n\"page_size\": 25,\n\"results\": [\n{\n\"id\": \"4e443c27-9f61-4f2e-a12d-ea5668460bf1\",\n\"slug\": \"sales-dashboard\",\n\"parent\": \"folders\/2025532471912059\",\n...\n}\n]\n}\n\n``` \n### Migrate legacy dashboard \nUse the UUID associated with the legacy dashboard to create a copy that is automatically converted to a new Lakeview dashboard. This works like the **Clone to Lakeview** tool available in the UI. See [Clone a legacy dashboard to a Lakeview dashboard](https:\/\/docs.databricks.com\/dashboards\/clone-legacy-to-lakeview.html) to learn how to perform this operation using the Databricks UI. \nThe UUID of the legacy dashboard you want to convert is required in the request body. Optionally, you can include a new `display_name` value and a `parent_path` that identifies the workspace path of the folder where you want the converted dashboard to be stored. \nThe response includes a `dashboard_id`, the UUID for the new dashboard. The UUID for a Lakeview dashboard is a 32-character alphanumeric value like `04aab30f99ea444490c10c85852f216c`. You can use it to identify this dashboard in the Lakeview namespace and across different Databricks REST API services. \nThe following example shows a sample request and response. See [POST \/api\/2.0\/lakeview\/dashboards\/migrate](https:\/\/docs.databricks.com\/api\/2.0\/lakeview\/dashboards\/migrate). 
\n```\nPOST \/api\/2.0\/lakeview\/dashboards\/migrate\n\nRequest body parameters:\n{\n\"source_dashboard_id\": \"4e443c27-9f61-4f2e-a12d-ea5668460bf1\",\n\"display_name\": \"Monthly Traffic Report\",\n\"parent_path\": \"\/path\/to\/dir\"\n}\n\nResponse:\n{\n\"dashboard_id\": \"04aab30f99ea444490c10c85852f216c\",\n\"display_name\": \"Monthly Traffic Report\",\n\"path\": \"\/path\/to\/dir\/Monthly Traffic Report.lvdash.json\",\n\"create_time\": \"2019-08-24T14:15:22Z\",\n\"update_time\": \"2019-08-24T14:15:22Z\",\n\"warehouse_id\": \"47bb1c472649e711\",\n\"etag\": \"80611980\",\n\"serialized_dashboard\": \"{\\\"pages\\\":[{\\\"name\\\":\\\"b532570b\\\",\\\"displayName\\\":\\\"New Page\\\"}]}\",\n\"lifecycle_state\": \"ACTIVE\",\n\"parent_path\": \"\/path\/to\/dir\"\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/lakeview-crud-api.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Use the Lakeview APIs to create and manage dashboards\n###### Get a draft dashboard\n\nYou can use the `dashboard_id` to pull dashboard details from a draft dashboard. The following sample request and response includes details for the current version of the draft dashboard in the workspace. \nThe `etag` field tracks the latest version of the dashboard. You can use this to verify the version before making additional updates. 
\n```\nGET \/api\/2.0\/lakeview\/dashboards\/04aab30f99ea444490c10c85852f216c\n\nResponse:\n\n{\n\"dashboard_id\": \"04aab30f99ea444490c10c85852f216c\",\n\"display_name\": \"Monthly Traffic Report\",\n\"path\": \"\/path\/to\/dir\/Monthly Traffic Report.lvdash.json\",\n\"create_time\": \"2019-08-24T14:15:22Z\",\n\"update_time\": \"2019-08-24T14:15:22Z\",\n\"warehouse_id\": \"47bb1c472649e711\",\n\"etag\": \"80611980\",\n\"serialized_dashboard\": \"{\\\"pages\\\":[{\\\"name\\\":\\\"b532570b\\\",\\\"displayName\\\":\\\"New Page\\\"}]}\",\n\"lifecycle_state\": \"ACTIVE\",\n\"parent_path\": \"\/path\/to\/dir\"\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/lakeview-crud-api.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Use the Lakeview APIs to create and manage dashboards\n###### Update a dashboard\n\nYou can use the `dashboard_id` in the previous response to update the new Lakeview dashboard created with that operation. The following example shows a sample request and response. The `dashboard_id` from the previous example is included as a path parameter. \nThe `display_name` and `warehouse_id` have been changed. The updated dashboard has a new name and assigned default warehouse, as shown in the response. The `etag` field is optional. If the version specified in the `etag` does not match the current version, the update is rejected. 
\n```\nPATCH \/api\/2.0\/lakeview\/dashboards\/04aab30f99ea444490c10c85852f216c\n\nRequest body:\n\n{\n\"display_name\": \"Monthly Traffic Report 2\",\n\"warehouse_id\": \"c03a4f8a7162bc9f\",\n\"etag\": \"80611980\"\n}\n\nResponse:\n\n{\n\"dashboard_id\": \"04aab30f99ea444490c10c85852f216c\",\n\"display_name\": \"Monthly Traffic Report 2\",\n\"path\": \"\/path\/to\/dir\/Monthly Traffic Report 2.lvdash.json\",\n\"create_time\": \"2019-08-24T14:15:22Z\",\n\"update_time\": \"2019-08-24T14:15:22Z\",\n\"warehouse_id\": \"c03a4f8a7162bc9f\",\n\"etag\": \"80611981\",\n\"serialized_dashboard\": \"{\\\"pages\\\":[{\\\"name\\\":\\\"b532570b\\\",\\\"displayName\\\":\\\"New Page\\\"}]}\",\n\"lifecycle_state\": \"ACTIVE\",\n\"parent_path\": \"\/path\/to\/dir\"\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/lakeview-crud-api.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Use the Lakeview APIs to create and manage dashboards\n###### Create a dashboard\n\nYou can use the **Create dashboard** endpoint in the Lakeview API to move your dashboard between workspaces. The following example includes a sample request body and response that creates a new dashboard. The `serialized_dashboard` key from the previous example contains all the necessary details to create a duplicate draft dashboard. \nThe sample includes a new `warehouse_id` value corresponding to a warehouse in the new workspace. See [POST \/api\/2.0\/lakeview\/dashboards](https:\/\/docs.databricks.com\/api\/2.0\/lakeview\/dashboards). 
\n```\nPOST \/api\/2.0\/lakeview\/dashboards\n\nRequest body:\n\n{\n\"display_name\": \"Monthly Traffic Report 2\",\n\"warehouse_id\": \"5e2f98ab3476cfd0\",\n\"serialized_dashboard\": \"{\\\"pages\\\":[{\\\"name\\\":\\\"b532570b\\\",\\\"displayName\\\":\\\"New Page\\\"}]}\",\n\"parent_path\": \"\/path\/to\/dir\"\n}\n\nResponse:\n\n{\n\"dashboard_id\": \"1e23fd84b6ac7894e2b053907dca9b2f\",\n\"display_name\": \"Monthly Traffic Report 2\",\n\"path\": \"\/path\/to\/dir\/Monthly Traffic Report 2.lvdash.json\",\n\"create_time\": \"2019-08-24T14:15:22Z\",\n\"update_time\": \"2019-08-24T14:15:22Z\",\n\"warehouse_id\": \"5e2f98ab3476cfd0\",\n\"etag\": \"14350695\",\n\"serialized_dashboard\": \"{\\\"pages\\\":[{\\\"name\\\":\\\"b532570b\\\",\\\"displayName\\\":\\\"New Page\\\"}]}\",\n\"lifecycle_state\": \"ACTIVE\",\n\"parent_path\": \"\/path\/to\/dir\"\n}\n\n``` \nThe only required property in the request body is a `display_name`. This tool can copy dashboard content or create new, blank dashboards.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/lakeview-crud-api.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Use the Lakeview APIs to create and manage dashboards\n###### Publish a dashboard\n\nYou can use the **Publish dashboard** endpoint to publish a dashboard, set credentials for viewers, and override the `warehouse_id` set in the draft dashboard. You must include the dashboard\u2019s UUID as a path parameter. \nThe request body sets the `embed_credentials` property to `false`. By default, `embed_credentials` is set to `true`. Embedding credentials allows account-level users to view dashboard data. See [Publish a dashboard](https:\/\/docs.databricks.com\/dashboards\/index.html#publish-dashboard). A new `warehouse_id` value is omitted, so the published dashboard uses the same warehouse assigned to the draft dashboard. 
\n```\nPOST \/api\/2.0\/lakeview\/dashboards\/1e23fd84b6ac7894e2b053907dca9b2f\/published\n\nRequest body:\n\n{\n\"embed_credentials\": false\n}\n\nResponse:\n\n{\n\"display_name\": \"Monthly Traffic Report 2\",\n\"warehouse_id\": \"5e2f98ab3476cfd0\",\n\"embed_credentials\": false,\n\"revision_create_time\": \"2019-08-24T14:15:22Z\"\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/lakeview-crud-api.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Use the Lakeview APIs to create and manage dashboards\n###### Get published dashboard\n\nThe response from [GET \/api\/2.0\/lakeview\/dashboards\/{dashboard\\_id}\/published](https:\/\/docs.databricks.com\/api\/workspace\/lakeview\/getpublished) is similar to the response provided in the previous example. The `dashboard_id` is included as a path parameter. \n```\nGET \/api\/2.0\/lakeview\/dashboards\/1e23fd84b6ac7894e2b053907dca9b2f\/published\n\nResponse:\n\n{\n\"display_name\": \"Monthly Traffic Report 2\",\n\"warehouse_id\": \"5e2f98ab3476cfd0\",\n\"embed_credentials\": false,\n\"revision_create_time\": \"2019-08-24T14:15:22Z\"\n}\n\n```\n\n##### Use the Lakeview APIs to create and manage dashboards\n###### Unpublish a dashboard\n\nThe draft dashboard is retained when you use the Lakeview API to unpublish a dashboard. This request deletes the published version of the dashboard. \nThe following example uses the `dashboard_id` from the previous example. A successful request yields a `200` status code. There is no response body. 
\n```\nDELETE \/api\/2.0\/lakeview\/dashboards\/1e23fd84b6ac7894e2b053907dca9b2f\/published\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/lakeview-crud-api.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Use the Lakeview APIs to create and manage dashboards\n###### Trash dashboard\n\nUse [DELETE \/api\/2.0\/lakeview\/dashboards\/{dashboard\\_id}](https:\/\/docs.databricks.com\/api\/workspace\/lakeview\/trash) to send a draft dashboard to the trash. The dashboard can still be recovered. \nThe following example uses the `dashboard_id` from the previous example. A successful request yields a `200` status code. There is no response body. \n```\nDELETE \/api\/2.0\/lakeview\/dashboards\/1e23fd84b6ac7894e2b053907dca9b2f\n\n``` \nNote \nTo perform a permanent delete, use [POST \/api\/2.0\/workspace\/delete](https:\/\/docs.databricks.com\/api\/workspace\/workspace\/delete).\n\n##### Use the Lakeview APIs to create and manage dashboards\n###### Next steps\n\n* To learn more about dashboards, see [Dashboards](https:\/\/docs.databricks.com\/dashboards\/index.html).\n* See [Databricks REST API reference](https:\/\/docs.databricks.com\/api\/workspace\/introduction) to learn more about the REST API.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/lakeview-crud-api.html"} +{"content":"# Technology partners\n## What is Databricks Partner Connect?\n#### Manage connections in Partner Connect\n\nYou can perform administrative tasks with Databricks workspace connections to partner solutions, such as: \n* Managing users of partner accounts.\n* Managing the Databricks service principal and related Databricks personal access token that a connection uses.\n* Disconnecting a workspace from a partner. \nTo administer Partner Connect, you must sign in to your workspace as a workspace admin. 
For more information, see [Manage users](https:\/\/docs.databricks.com\/admin\/users-groups\/users.html).\n\n#### Manage connections in Partner Connect\n##### Manage partner account users\n\nFor partners that allow users to use Partner Connect to sign in to that partner\u2019s account or website (such as Fivetran and Rivery), when someone in your organization connects from one of your Databricks workspaces to a partner for the first time, that person becomes the *partner account administrator* for that partner across all of your organization\u2019s workspaces. To enable other users within your organization to sign in to that partner, your partner account administrator must first add those users to your organization\u2019s partner account. Some partners allow the partner account administrator to delegate this permission as well. For details, see the documentation on the partner\u2019s website. \nIf no one can add users to your organization\u2019s partner account (for example, your partner account administrator is no longer available), contact the partner for assistance. For support links, see the list of [Databricks Partner Connect partners](https:\/\/docs.databricks.com\/integrations\/index.html#partner-connect).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/admin.html"} +{"content":"# Technology partners\n## What is Databricks Partner Connect?\n#### Manage connections in Partner Connect\n##### Connect data managed by Unity Catalog to partner solutions\n\nIf your workspace is Unity Catalog-enabled, you can connect select partner solutions to data managed by Unity Catalog. When you create the connection using Partner Connect, you can choose whether the partner uses the legacy Hive metastore (`hive_metastore`) or another catalog that you own. Metastore admins can select any catalog in the metastore that\u2019s assigned to your workspace. 
\nNote \nIf a partner solution doesn\u2019t support Unity Catalog with Partner Connect, you can only use the workspace default catalog. If the default catalog isn\u2019t `hive_metastore` and you don\u2019t own the default catalog, you\u2019ll receive an error. \nFor a list of partners that support Unity Catalog with Partner Connect, see the [Databricks Partner Connect partners](https:\/\/docs.databricks.com\/integrations\/index.html#partner-connect) list. \nFor information about troubleshooting connections, see [Troubleshoot Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/troubleshoot.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/admin.html"} +{"content":"# Technology partners\n## What is Databricks Partner Connect?\n#### Manage connections in Partner Connect\n##### Manage service principals and personal access tokens\n\nFor partners that require Databricks service principals, when someone in your Databricks workspace connects to a specific partner for the first time, Partner Connect creates a Databricks [service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) in your workspace for use with that partner. Partner Connect generates service principal display names by using the format `<PARTNER-NAME>_USER`. For example, for the partner Fivetran, the service principal\u2019s display name is `FIVETRAN_USER`. \nPartner Connect also creates a Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html) and associates it with that Databricks service principal. Partner Connect provides this token\u2019s value to the partner behind the scenes to complete the connection to that partner. You cannot view or get this token\u2019s value. This token does not expire until you or someone else deletes it. See also [Disconnect from a partner](https:\/\/docs.databricks.com\/partner-connect\/admin.html#disconnect). 
\nPartner Connect grants the following access permissions to Databricks service principals in your workspace: \n| Partners | Permissions |\n| --- | --- |\n| Fivetran, Matillion, Power BI, Tableau, erwin Data Modeler | These solutions do not require service principals. |\n| Hevo Data, Hunters, Rivery, RudderStack, Snowplow | * The CAN USE token permission to create a personal access token. * SQL warehouse creation permission. * Access to your workspace. * Access to Databricks SQL. * (Unity Catalog) The `USE CATALOG` privilege on the selected catalog. * (Unity Catalog) The `CREATE SCHEMA` privilege on the selected catalog. * (Legacy Hive metastore) The `USAGE` privilege on the selected catalog. * (Legacy Hive metastore) The `CREATE` privilege on the `hive_metastore` catalog so Partner Connect can create objects in the legacy Hive metastore on your behalf. * Ownership of the tables that it creates. The service principal cannot query any tables that it does not create. |\n| Prophecy | * The CAN USE token permission to create a personal access token. * Access to your workspace. * Cluster creation permission. The service principal cannot access any clusters that it does not create. * Job creation permission. The service principal cannot access any jobs that it does not create. |\n| John Snow Labs, Labelbox | * The CAN USE token permission to create a personal access token. * Access to your workspace. |\n| Alation, Anomalo, AtScale, Census, dbt Cloud, Hex, Hightouch, Lightup, Monte Carlo, Preset, Privacera, Qlik Sense, Sigma, Stardog, ThoughtSpot | * The CAN USE token permission to create a personal access token. * The CAN USE privilege on the selected Databricks SQL warehouse. * The `SELECT` privilege on the selected schemas. * (Unity Catalog) The `USE CATALOG` privilege on the selected catalog. * (Unity Catalog) The `USE SCHEMA` privilege on the selected schema. * (Legacy Hive metastore) The `USAGE` privilege on the selected schema. 
* (Legacy Hive metastore) The `READ METADATA` privilege for the selected schemas. |\n| Dataiku | * The CAN USE token permission to create a personal access token. * SQL warehouse creation permission. * (Unity Catalog) The `USE CATALOG` privilege on the selected catalog. * (Unity Catalog) The `USE SCHEMA` privilege on the selected schemas. * (Unity Catalog) The `CREATE SCHEMA` privilege on the selected catalog. * (Legacy Hive metastore) The `USAGE` privilege on the `hive_metastore` catalog and on the selected schemas. * (Legacy Hive metastore) The `CREATE` privilege on the `hive_metastore` catalog so Partner Connect can create objects in the legacy Hive metastore on your behalf. * (Legacy Hive metastore) The `SELECT` privilege on the selected schemas. | \nYou might need to generate a new Databricks personal access token if the existing token has been compromised, lost, or deleted, or if your organization has a periodic token rotation policy. To generate a new token, use the Databricks REST API: \n1. Get the Databricks service principal\u2019s application ID by calling the `GET \/preview\/scim\/v2\/ServicePrincipals` operation in the [Workspace Service Principals API](https:\/\/docs.databricks.com\/api\/workspace\/serviceprincipals) for your workspace. Make a note of the service principal\u2019s `applicationId` in the response.\n2. Use the service principal\u2019s `applicationId` to call the `POST \/token-management\/on-behalf-of\/tokens` operation in the [Databricks Token Management REST API](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement) for your workspace.\n3. Make a note of the `token_value` in the response and store it in a safe location, because you cannot retrieve it again later. 
\nFor example, to get the list of available Databricks service principal display names and application IDs for a workspace, you can call `curl` as follows: \n```\ncurl --netrc --request GET \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/preview\/scim\/v2\/ServicePrincipals \\\n| jq '[ .Resources[] | { displayName: .displayName, applicationId: .applicationId } ]'\n\n``` \nReplace `<databricks-instance>` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com` for the workspace where the service principal exists. \nThe service principal\u2019s application ID value is in the response\u2019s `applicationId` field, for example `123456a7-8901-2b3c-45de-f678a901b2c`. \nTo create the new token value for the service principal, you can call `curl` as follows: \n```\ncurl --netrc --request POST \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/token-management\/on-behalf-of\/tokens \\\n--data @create-token.json \\\n| jq '[ . | { token_value: .token_value } ]'\n\n``` \n`create-token.json`: \n```\n{\n\"application_id\": \"<application-id>\",\n\"comment\": \"Partner Connect\",\n\"lifetime_seconds\": 1209600\n}\n\n``` \nReplace: \n* `<databricks-instance>` with the [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com` for the workspace where the service principal exists.\n* `<application-id>` with the service principal\u2019s application ID value.\n* `1209600` with the number of seconds until this token expires. For example, `1209600` is the number of seconds in 14 days. \nThe new token\u2019s value is in the response\u2019s `token_value` field, for example `dapi12345678901234567890123456789012`. 
Make a note of the new token\u2019s value in the response and store it in a safe location, as there is no other way to access it again if you ever need to retrieve it. \nThe preceding examples use a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file and [jq](https:\/\/stedolan.github.io\/jq\/). Note that in this case, the `.netrc` file uses *your* [personal access token](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement) value\u2013*not* the one for the service principal. \nAfter you create the new token, you must update your partner account with the new token\u2019s value. To do this, see the documentation on the partner\u2019s website. For documentation links, see the appropriate partner connection guide. \nTo delete an existing token: \nWarning \nDeleting an existing Databricks personal access token is permanent and cannot be undone. \n1. Get the list of existing tokens by calling the `GET \/token-management\/tokens` operation in the [Databricks Token Management REST API](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement) for your workspace.\n2. In the response, make a note of the `token_id` value for the token that you want to delete.\n3. Use this `token_id` value to delete the token by calling the `DELETE \/token-management\/tokens\/{token_id}` operation in the Databricks Token Management REST API for your workspace.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/admin.html"} +{"content":"# Technology partners\n## What is Databricks Partner Connect?\n#### Manage connections in Partner Connect\n##### Disconnect from a partner\n\nIf the tile for a partner has a check mark icon, this means that someone in your Databricks workspace has already created a connection to that partner. To disconnect from that partner, you reset that partner\u2019s tile in Partner Connect. 
Resetting a partner\u2019s tile does the following: \n* Clears the check mark icon from the partner\u2019s tile.\n* Deletes the associated SQL warehouse or cluster if the partner requires one.\n* Deletes the associated Databricks service principal, if the partner requires one. Deleting a service principal also deletes that service principal\u2019s related Databricks personal access token. This token\u2019s value is what completes the connection between your workspace and the partner. For more information, see [Manage service principals and personal access tokens](https:\/\/docs.databricks.com\/partner-connect\/admin.html#service-principal-pat). \nWarning \nDeleting a SQL warehouse, a cluster, a Databricks service principal, or a Databricks service principal\u2019s personal access token is permanent and cannot be undone. \nResetting a partner\u2019s tile does not delete your organization\u2019s related partner account or change related connection settings with the partner. However, resetting a partner\u2019s tile does break the connection between the workspace and the related partner account. To reconnect, you must create a new connection from the workspace to the partner, and then you must manually edit the original connection settings in the related partner account to match the new connection settings. \nTo reset a partner\u2019s tile, click the tile, click **Delete Connection**, and then follow the on-screen directions. \nAlternatively, you can manually disconnect a Databricks workspace from a partner by deleting the related Databricks service principal in your workspace that is associated with that partner. You might want to do this if you want to disconnect your workspace from a partner but still keep other associated resources and still keep the check mark icon displayed on the tile. Deleting a service principal also deletes that service principal\u2019s related personal access token. 
This token\u2019s value is what completes the connection between your workspace and the partner. For more information, see [Manage service principals and personal access tokens](https:\/\/docs.databricks.com\/partner-connect\/admin.html#service-principal-pat). \nTo delete a Databricks service principal, you use the Databricks REST API as follows: \n1. Get the Databricks service principal\u2019s application ID by calling the `GET \/preview\/scim\/v2\/ServicePrincipals` operation in the [Workspace Service Principals API](https:\/\/docs.databricks.com\/api\/workspace\/serviceprincipals) for your workspace. Make a note of the service principal\u2019s `applicationId` in the response.\n2. Use the service principal\u2019s `applicationId` to call the `DELETE \/preview\/scim\/v2\/ServicePrincipals` operation in the [Workspace Service Principals API](https:\/\/docs.databricks.com\/api\/workspace\/serviceprincipals) for your workspace. \nFor example, to get the list of available service principal display names and application IDs for a workspace, you can call `curl` as follows: \n```\ncurl --netrc --request GET \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/preview\/scim\/v2\/ServicePrincipals \\\n| jq '[ .Resources[] | { displayName: .displayName, applicationId: .applicationId } ]'\n\n``` \nReplace `<databricks-instance>` with the Databricks [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com` for your workspace. \nThe service principal\u2019s display name is in the output\u2019s `displayName` field. Partner Connect generates service principal display names using the format `<PARTNER-NAME>_USER`. For example, for the partner Fivetran, the service principal\u2019s display name is `FIVETRAN_USER`. \nThe service principal\u2019s application ID value is in the output\u2019s `applicationId` field, for example `123456a7-8901-2b3c-45de-f678a901b2c`. 
\nTo delete the service principal, you can call `curl` as follows: \n```\ncurl --netrc --request DELETE \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/preview\/scim\/v2\/ServicePrincipals\/<application-id>\n\n``` \nReplace: \n* `<databricks-instance>` with the [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `dbc-a1b2345c-d6e7.cloud.databricks.com` for your workspace.\n* `<application-id>` with the service principal\u2019s application ID value. \nThe preceding examples use a [.netrc](https:\/\/everything.curl.dev\/usingcurl\/netrc) file and [jq](https:\/\/stedolan.github.io\/jq\/). Note that in this case, the `.netrc` file uses *your* [personal access token](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement) value\u2013*not* the one for the service principal. \nAfter you disconnect your workspace from a partner, you might want to clean up any related resources that the partner creates in the workspace. This could include a SQL warehouse or cluster and any related data storage locations. For more information, see [What is a SQL warehouse?](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html) or [Delete a compute](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-delete). \nIf you\u2019re sure that there are no other workspaces across your organization that are connected to the partner, you might also want to delete your organization\u2019s account with that partner. To do this, contact the partner for assistance. For support links, see the appropriate partner connection guide.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/admin.html"} +{"content":"# Technology partners\n### Connect to security partners using Partner Connect\n\nTo connect your Databricks workspace to a security partner solution using Partner Connect, you typically follow the steps in this article. 
\nImportant \nBefore you follow the steps in this article, see the appropriate partner article for important partner-specific information. There might be differences in the connection steps between partner solutions. For example, some partner solutions allow you to connect Databricks SQL warehouses (formerly Databricks SQL endpoints) or Databricks clusters, but not both.\n\n### Connect to security partners using Partner Connect\n#### Requirements\n\nSee the [requirements](https:\/\/docs.databricks.com\/partner-connect\/index.html#requirements) for using Partner Connect. \nImportant \nFor partner-specific requirements, see the appropriate partner article.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/data-security.html"} +{"content":"# Technology partners\n### Connect to security partners using Partner Connect\n#### Steps to connect to a security partner\n\nTo connect your Databricks workspace to a security partner solution, do the following: \n1. In the sidebar, click **Partner Connect**.\n2. Click the partner tile. \nIf the partner tile has a check mark icon, a workspace admin has already used Partner Connect to connect your workspace to the partner. Click **Sign in** to sign in to your existing partner account and skip the rest of the steps in this section.\n3. Select a catalog from the drop-down list. \nNote \nIf a partner doesn\u2019t support Unity Catalog with Partner Connect, the default catalog for your Unity Catalog enabled workspace is used. If your workspace isn\u2019t Unity Catalog enabled, the legacy Hive metastore (`hive_metastore`) is used.\n4. Click **Next**. \nPartner Connect creates the following resources in your workspace: \n* A Databricks service principal named **`<PARTNER>_USER`**.\n* A Databricks personal access token that is associated with the **`<PARTNER>_USER`** service principal.\n* A SQL warehouse named **`<PARTNER>_WAREHOUSE`** by default. 
You can click **Edit** to change the SQL warehouse name before you click **Next**. \nPartner Connect also grants the following privileges to the **`<PARTNER>_USER`** service principal: \n* (Unity Catalog) `USE CATALOG`: Required to interact with objects within the selected catalog.\n* (Unity Catalog) `CREATE SCHEMA`: Required to create schemas in the selected catalog.\n* (Legacy Hive metastore) `USAGE`: Required to grant the `CREATE` privilege for the catalog you selected.\n* (Legacy Hive metastore) `CREATE`: Grants the ability to create schemas in the Hive metastore.\n* **CAN USE**: Grants permission to use the SQL warehouse that Databricks created on your behalf.\n5. Click **Next**.\n6. Click **Connect to `<Partner>`**. \nA new tab that displays the partner website opens in your web browser.\n7. Complete the on-screen instructions on the partner website to create your trial partner account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/data-security.html"} +{"content":"# What is data warehousing on Databricks?\n### Load data using streaming tables in Databricks SQL\n\nPreview \nThis feature is in Public Preview. \nDatabricks recommends using streaming tables to ingest data using Databricks SQL. A *streaming table* is a Unity Catalog managed table with extra support for streaming or incremental data processing. A Delta Live Tables (DLT) pipeline is automatically created for each streaming table. You can use streaming tables for incremental data loading from Kafka and cloud object storage. \nThis article demonstrates using streaming tables to load data from cloud object storage configured as a Unity Catalog volume (recommended) or external location. 
\nNote \nTo learn how to use Delta Lake tables as streaming sources and sinks, see [Delta table streaming reads and writes](https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html"} +{"content":"# What is data warehousing on Databricks?\n### Load data using streaming tables in Databricks SQL\n#### Before you begin\n\nBefore you begin, make sure you have the following: \n* A Databricks account with serverless enabled. For more information, see [Enable serverless SQL warehouses](https:\/\/docs.databricks.com\/admin\/sql\/serverless.html).\n* A workspace with Unity Catalog enabled. For more information, see [Set up and manage Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html).\n* A SQL warehouse that uses the `Current` channel.\n* To query streaming tables created by a Delta Live Tables pipeline, you must use a shared compute using Databricks Runtime 13.3 LTS and above or a SQL warehouse. Streaming tables created in a Unity Catalog enabled pipeline cannot be queried from assigned or no isolation clusters.\n* The `READ FILES` privilege on a Unity Catalog external location. For information, see [Create an external location to connect cloud storage to Databricks](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html).\n* The `USE CATALOG` privilege on the catalog in which you create the streaming table.\n* The `USE SCHEMA` privilege on the schema in which you create the streaming table.\n* The `CREATE TABLE` privilege on the schema in which you create the streaming table.\n* The path to your source data. 
\nVolume path example: `\/Volumes\/<catalog>\/<schema>\/<volume>\/<path>\/<file-name>` \nExternal location path example: `s3:\/\/myBucket\/analysis` \nNote \nThis article assumes the data you want to load is in a cloud storage location that corresponds to a Unity Catalog volume or external location you have access to.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html"} +{"content":"# What is data warehousing on Databricks?\n### Load data using streaming tables in Databricks SQL\n#### Discover and preview source data\n\n1. In the sidebar of your workspace, click **Queries**, and then click **Create query**.\n2. In the query editor, select a SQL warehouse that uses the `Current` channel from the drop-down list.\n3. Paste the following into the editor, substituting values in angle brackets (`<>`) for the information identifying your source data, and then click **Run**. \nNote \nYou might encounter schema inference errors when running the `read_files` table valued function if the defaults for the function can\u2019t parse your data. For example, you might need to configure multi-line mode for multi-line CSV or JSON files. For a list of parser options, see [read\\_files table-valued function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/read_files.html). 
\n```\n\/* Discover your data in a volume *\/\nLIST \"\/Volumes\/<catalog>\/<schema>\/<volume>\/<path>\/<folder>\"\n\n\/* Preview your data in a volume *\/\nSELECT * FROM read_files(\"\/Volumes\/<catalog>\/<schema>\/<volume>\/<path>\/<folder>\") LIMIT 10\n\n\/* Discover your data in an external location *\/\nLIST \"s3:\/\/<bucket>\/<path>\/<folder>\"\n\n\/* Preview your data *\/\nSELECT * FROM read_files(\"s3:\/\/<bucket>\/<path>\/<folder>\") LIMIT 10\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html"} +{"content":"# What is data warehousing on Databricks?\n### Load data using streaming tables in Databricks SQL\n#### Load data into a streaming table\n\nTo create a streaming table from data in cloud object storage, paste the following into the query editor, and then click **Run**: \n```\n\/* Load data from a volume *\/\nCREATE OR REFRESH STREAMING TABLE <table-name> AS\nSELECT * FROM STREAM read_files('\/Volumes\/<catalog>\/<schema>\/<volume>\/<path>\/<folder>')\n\n\/* Load data from an external location *\/\nCREATE OR REFRESH STREAMING TABLE <table-name> AS\nSELECT * FROM STREAM read_files('s3:\/\/<bucket>\/<path>\/<folder>')\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html"} +{"content":"# What is data warehousing on Databricks?\n### Load data using streaming tables in Databricks SQL\n#### Refresh a streaming table using a DLT pipeline\n\nThis section describes patterns for refreshing a streaming table with the latest data available from the sources defined in the query. \n`CREATE` operations for streaming tables use a Databricks SQL warehouse for the initial creation and loading of data into the streaming table. `REFRESH` operations for streaming tables use Delta Live Tables (DLT). A DLT pipeline is automatically created for each streaming table. When a streaming table is refreshed, an update to the DLT pipeline is initiated to process the refresh. 
\nAfter you run the `REFRESH` command, the DLT pipeline link is returned. You can use the DLT pipeline link to check the status of the refresh. \nNote \nOnly the table owner can refresh a streaming table to get the latest data. The user that creates the table is the owner, and the owner can\u2019t be changed. \nSee [What is Delta Live Tables?](https:\/\/docs.databricks.com\/delta-live-tables\/index.html). \n### Ingest new data only \nBy default, the `read_files` function reads all existing data in the source directory during table creation, and then processes newly arriving records with each refresh. \nTo avoid ingesting data that already exists in the source directory at the time of table creation, set the `includeExistingFiles` option to `false`. This means that only data that arrives in the directory after table creation is processed. For example: \n```\nCREATE OR REFRESH STREAMING TABLE my_bronze_table\nAS SELECT *\nFROM STREAM read_files('s3:\/\/mybucket\/analysis\/*\/*\/*.json', includeExistingFiles => false)\n\n``` \n### Fully refresh a streaming table \nFull refreshes re-process all data available in the source with the latest definition. Avoid running full refreshes on sources that don\u2019t keep the entire history of the data or that have short retention periods, such as Kafka, because a full refresh truncates the existing data. You might not be able to recover old data if the data is no longer available in the source. 
\nFor example: \n```\nREFRESH STREAMING TABLE my_bronze_table FULL\n\n``` \n### Schedule a streaming table for automatic refresh \nTo configure a streaming table to automatically refresh based on a defined schedule, paste the following into the query editor, and then click **Run**: \n```\nALTER STREAMING TABLE\n[[<catalog>.]<database>.]<name>\nADD [SCHEDULE [REFRESH]\nCRON '<cron-string>'\n[ AT TIME ZONE '<timezone-id>' ]];\n\n``` \nFor example refresh schedule queries, see [ALTER STREAMING TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-streaming-table.html). \n### Track the status of a refresh \nYou can view the status of a streaming table refresh by viewing the pipeline that manages the streaming table in the Delta Live Tables UI or by viewing the **Refresh Information** returned by the `DESCRIBE EXTENDED` command for the streaming table. \n```\nDESCRIBE EXTENDED <table-name>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html"} +{"content":"# What is data warehousing on Databricks?\n### Load data using streaming tables in Databricks SQL\n#### Streaming ingestion from Kafka\n\nFor an example of streaming ingestion from Kafka, see [read\\_kafka](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/read_kafka.html#examples).\n\n### Load data using streaming tables in Databricks SQL\n#### Grant users access to a streaming table\n\nTo grant users the `SELECT` privilege on the streaming table so they can query it, paste the following into the query editor, and then click **Run**: \n```\nGRANT SELECT ON TABLE <catalog>.<schema>.<table> TO <user-or-group>\n\n``` \nFor more information about granting privileges on Unity Catalog securable objects, see [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html).\n\n### Load data using streaming tables in Databricks SQL\n#### Additional resources\n\n* 
[Streaming table](https:\/\/docs.databricks.com\/delta-live-tables\/index.html#streaming-table)\n* [read\\_files table-valued function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/read_files.html)\n* [CREATE STREAMING TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-streaming-table.html)\n* [ALTER STREAMING TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-streaming-table.html)\n* [read\\_kafka table-valued function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/read_kafka.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n### Compare Auto Loader file detection modes\n##### What is Auto Loader file notification mode?\n\nIn file notification mode, Auto Loader automatically sets up a notification service and queue service that subscribes to file events from the input directory. You can use file notifications to scale Auto Loader to ingest millions of files an hour. When compared to directory listing mode, file notification mode is more performant and scalable for large input directories or a high volume of files but requires additional cloud permissions. \nYou can switch between file notifications and directory listing at any time and still maintain exactly-once data processing guarantees. \nWarning \n[Changing the source path for Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/directory-listing-mode.html#change-location) is not supported for file notification mode. 
If file notification mode is used and the path is changed, you might fail to ingest files that are already present in the new directory at the time of the directory update.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n### Compare Auto Loader file detection modes\n##### What is Auto Loader file notification mode?\n###### Cloud resources used in Auto Loader file notification mode\n\nImportant \nYou need elevated permissions to automatically configure cloud infrastructure for file notification mode. Contact your cloud administrator or workspace admin. See: \n* [Required permissions for configuring file notification for ADLS Gen2 and Azure Blob Storage](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html#permissions-azure)\n* [Required permissions for configuring file notification for AWS S3](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html#permissions-s3)\n* [Required permissions for configuring file notification for GCS](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html#permissions-gcs) \nAuto Loader can set up file notifications for you automatically when you set the option `cloudFiles.useNotifications` to `true` and provide the necessary permissions to create cloud resources. In addition, you might need to provide [additional options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#file-notification-options) to grant Auto Loader authorization to create these resources. \nThe following table summarizes which resources are created by Auto Loader. 
\n| Cloud Storage | Subscription Service | Queue Service | Prefix \\* | Limit \\*\\* |\n| --- | --- | --- | --- | --- |\n| AWS S3 | AWS SNS | AWS SQS | databricks-auto-ingest | 100 per S3 bucket |\n| ADLS Gen2 | Azure Event Grid | Azure Queue Storage | databricks | 500 per storage account |\n| GCS | Google Pub\/Sub | Google Pub\/Sub | databricks-auto-ingest | 100 per GCS bucket |\n| Azure Blob Storage | Azure Event Grid | Azure Queue Storage | databricks | 500 per storage account | \n\\* Auto Loader names the resources with this prefix. \n\\*\\* How many concurrent file notification pipelines can be launched \nIf you require running more than the limited number of file notification pipelines for a given storage account, you can: \n* Leverage a service such as AWS Lambda, Azure Functions, or Google Cloud Functions to fan out notifications from a single queue that listens to an entire container or bucket into directory specific queues. \n### File notification events \nAWS S3 provides an `ObjectCreated` event when a file is uploaded to an S3 bucket regardless of whether it was uploaded by a put or multi-part upload. \nADLS Gen2 provides different event notifications for files appearing in your Gen2 container. \n* Auto Loader listens for the `FlushWithClose` event for processing a file.\n* Auto Loader streams support the `RenameFile` action for discovering files. `RenameFile` actions require an API request to the storage system to get the size of the renamed file.\n* Auto Loader streams created with Databricks Runtime 9.0 and after support the `RenameDirectory` action for discovering files. `RenameDirectory` actions require API requests to the storage system to list the contents of the renamed directory. \nGoogle Cloud Storage provides an `OBJECT_FINALIZE` event when a file is uploaded, which includes overwrites and file copies. Failed uploads do not generate this event. 
\nNote \nCloud providers do not guarantee 100% delivery of all file events under very rare conditions and do not provide strict SLAs on the latency of the file events. Databricks recommends that you trigger regular backfills with Auto Loader by using the `cloudFiles.backfillInterval` option to guarantee that all files are discovered within a given SLA if data completeness is a requirement. Triggering regular backfills does not cause duplicates.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n### Compare Auto Loader file detection modes\n##### What is Auto Loader file notification mode?\n###### Required permissions for configuring file notification for ADLS Gen2 and Azure Blob Storage\n\nYou must have read permissions for the input directory. See [Azure Blob Storage](https:\/\/docs.databricks.com\/connect\/storage\/azure-storage.html). \nTo use file notification mode, you must provide authentication credentials for setting up and accessing the event notification services.\nYou only need a service principal for authentication. \n* Service principal - using Azure built-in roles \nCreate [a Microsoft Entra ID (formerly Azure Active Directory) app and service principal](https:\/\/learn.microsoft.com\/azure\/active-directory\/develop\/howto-create-service-principal-portal) in the form of client ID and client secret. 
\nAssign this app the following roles to the storage account in which the input path resides: \n+ **[Contributor](https:\/\/learn.microsoft.com\/azure\/role-based-access-control\/built-in-roles#storage-account-contributor)**: This role is for setting up resources in your storage account, such as queues and event subscriptions.\n+ **[Storage Queue Data Contributor](https:\/\/learn.microsoft.com\/azure\/role-based-access-control\/built-in-roles#storage-queue-data-contributor)**: This role is for performing queue operations such as retrieving and deleting messages from the queues. This role is required only when you provide a service principal without a connection string. \nAssign this app the following role to the related resource group: \n+ **[EventGrid EventSubscription Contributor](https:\/\/learn.microsoft.com\/azure\/role-based-access-control\/built-in-roles#eventgrid-eventsubscription-contributor)**: This role is for performing event grid subscription operations such as creating or listing event subscriptions. \nFor more information, see [Assign Azure roles using the Azure portal](https:\/\/learn.microsoft.com\/azure\/role-based-access-control\/role-assignments-portal).\n* Service principal - using custom role \nIf you are concerned with the excessive permissions required for the preceding roles, you can create a **[Custom Role](https:\/\/learn.microsoft.com\/azure\/role-based-access-control\/custom-roles-portal)** with at least the following permissions, listed below in Azure role JSON format: \n```\n\"permissions\": [\n{\n\"actions\": 
[\n\"Microsoft.EventGrid\/eventSubscriptions\/write\",\n\"Microsoft.EventGrid\/eventSubscriptions\/read\",\n\"Microsoft.EventGrid\/eventSubscriptions\/delete\",\n\"Microsoft.EventGrid\/locations\/eventSubscriptions\/read\",\n\"Microsoft.Storage\/storageAccounts\/read\",\n\"Microsoft.Storage\/storageAccounts\/write\",\n\"Microsoft.Storage\/storageAccounts\/queueServices\/read\",\n\"Microsoft.Storage\/storageAccounts\/queueServices\/write\",\n\"Microsoft.Storage\/storageAccounts\/queueServices\/queues\/write\",\n\"Microsoft.Storage\/storageAccounts\/queueServices\/queues\/read\",\n\"Microsoft.Storage\/storageAccounts\/queueServices\/queues\/delete\"\n],\n\"notActions\": [],\n\"dataActions\": [\n\"Microsoft.Storage\/storageAccounts\/queueServices\/queues\/messages\/delete\",\n\"Microsoft.Storage\/storageAccounts\/queueServices\/queues\/messages\/read\",\n\"Microsoft.Storage\/storageAccounts\/queueServices\/queues\/messages\/write\",\n\"Microsoft.Storage\/storageAccounts\/queueServices\/queues\/messages\/process\/action\"\n],\n\"notDataActions\": []\n}\n]\n\n``` \nThen, you can assign this custom role to your app. \nFor more information, see [Assign Azure roles using the Azure portal](https:\/\/learn.microsoft.com\/azure\/role-based-access-control\/role-assignments-portal). \n![Auto loader permissions](https:\/\/docs.databricks.com\/_images\/auto-loader-permissions.png) \n### Troubleshooting common errors \n**Error:** \n```\njava.lang.RuntimeException: Failed to create event grid subscription.\n\n``` \nIf you see this error message when you run Auto Loader for the first time, the Event Grid is not registered as a Resource Provider in your Azure subscription. To register this on Azure portal: \n1. Go to your subscription.\n2. Click **Resource Providers** under the Settings section.\n3. Register the provider `Microsoft.EventGrid`. \n**Error:** \n```\n403 Forbidden ... 
does not have authorization to perform action 'Microsoft.EventGrid\/eventSubscriptions\/[read|write]' over scope ...\n\n``` \nIf you see this error message when you run Auto Loader for the first time, ensure you have given the **Contributor** role to your service principal for Event Grid as well as your storage account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n### Compare Auto Loader file detection modes\n##### What is Auto Loader file notification mode?\n###### Required permissions for configuring file notification for AWS S3\n\nYou must have read permissions for the input directory. See [S3 connection details](https:\/\/docs.databricks.com\/connect\/storage\/amazon-s3.html) for more details. \nTo use file notification mode, attach the following JSON policy document to your [IAM user or role](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html). 
\n```\n{\n\"Version\": \"2012-10-17\",\n\"Statement\": [\n{\n\"Sid\": \"DatabricksAutoLoaderSetup\",\n\"Effect\": \"Allow\",\n\"Action\": [\n\"s3:GetBucketNotification\",\n\"s3:PutBucketNotification\",\n\"sns:ListSubscriptionsByTopic\",\n\"sns:GetTopicAttributes\",\n\"sns:SetTopicAttributes\",\n\"sns:CreateTopic\",\n\"sns:TagResource\",\n\"sns:Publish\",\n\"sns:Subscribe\",\n\"sqs:CreateQueue\",\n\"sqs:DeleteMessage\",\n\"sqs:ReceiveMessage\",\n\"sqs:SendMessage\",\n\"sqs:GetQueueUrl\",\n\"sqs:GetQueueAttributes\",\n\"sqs:SetQueueAttributes\",\n\"sqs:TagQueue\",\n\"sqs:ChangeMessageVisibility\"\n],\n\"Resource\": [\n\"arn:aws:s3:::<bucket-name>\",\n\"arn:aws:sqs:<region>:<account-number>:databricks-auto-ingest-*\",\n\"arn:aws:sns:<region>:<account-number>:databricks-auto-ingest-*\"\n]\n},\n{\n\"Sid\": \"DatabricksAutoLoaderList\",\n\"Effect\": \"Allow\",\n\"Action\": [\n\"sqs:ListQueues\",\n\"sqs:ListQueueTags\",\n\"sns:ListTopics\"\n],\n\"Resource\": \"*\"\n},\n{\n\"Sid\": \"DatabricksAutoLoaderTeardown\",\n\"Effect\": \"Allow\",\n\"Action\": [\n\"sns:Unsubscribe\",\n\"sns:DeleteTopic\",\n\"sqs:DeleteQueue\"\n],\n\"Resource\": [\n\"arn:aws:sqs:<region>:<account-number>:databricks-auto-ingest-*\",\n\"arn:aws:sns:<region>:<account-number>:databricks-auto-ingest-*\"\n]\n}\n]\n}\n\n``` \nwhere: \n* `<bucket-name>`: The S3 bucket name where your stream will read files, for example, `auto-logs`. You can use `*` as a wildcard, for example, `databricks-*-logs`. To find out the underlying S3 bucket for your DBFS path, you can list all the DBFS mount points in a notebook by running `%fs mounts`.\n* `<region>`: The AWS region where the S3 bucket resides, for example, `us-west-2`. If you don\u2019t want to specify the region, use `*`.\n* `<account-number>`: The AWS account number that owns the S3 bucket, for example, `123456789012`. If you don\u2019t want to specify the account number, use `*`. 
\nThe string `databricks-auto-ingest-*` in the SQS and SNS ARN specification is the name prefix that the `cloudFiles` source uses when creating SQS and SNS services. Since Databricks sets up the notification services in the initial run of the stream, you can use a policy with reduced permissions after the initial run (for example, stop the stream and then restart it). \nNote \nThe preceding policy is concerned only with the permissions needed for setting up file notification services, namely S3 bucket notification, SNS, and SQS services and assumes you already have read access to the S3 bucket. If you need to add S3 read-only permissions, add the following to the `Action` list in the `DatabricksAutoLoaderSetup` statement in the JSON document: \n* `s3:ListBucket`\n* `s3:GetObject` \n### Reduced permissions after initial setup \nThe resource setup permissions described above are required only during the initial run of the stream. After the first run, you can switch to the following IAM policy with reduced permissions. \nImportant \nWith the reduced permissions, you can\u2019t start new streaming queries or recreate resources in case of failures (for example, the SQS queue has been accidentally deleted); you also can\u2019t use the cloud resource management API to list or tear down resources. 
\n```\n{\n\"Version\": \"2012-10-17\",\n\"Statement\": [\n{\n\"Sid\": \"DatabricksAutoLoaderUse\",\n\"Effect\": \"Allow\",\n\"Action\": [\n\"s3:GetBucketNotification\",\n\"sns:ListSubscriptionsByTopic\",\n\"sns:GetTopicAttributes\",\n\"sns:TagResource\",\n\"sns:Publish\",\n\"sqs:DeleteMessage\",\n\"sqs:ReceiveMessage\",\n\"sqs:SendMessage\",\n\"sqs:GetQueueUrl\",\n\"sqs:GetQueueAttributes\",\n\"sqs:TagQueue\",\n\"sqs:ChangeMessageVisibility\"\n],\n\"Resource\": [\n\"arn:aws:sqs:<region>:<account-number>:<queue-name>\",\n\"arn:aws:sns:<region>:<account-number>:<topic-name>\",\n\"arn:aws:s3:::<bucket-name>\"\n]\n},\n{\n\"Effect\": \"Allow\",\n\"Action\": [\n\"s3:GetBucketLocation\",\n\"s3:ListBucket\"\n],\n\"Resource\": [\n\"arn:aws:s3:::<bucket-name>\"\n]\n},\n{\n\"Effect\": \"Allow\",\n\"Action\": [\n\"s3:PutObject\",\n\"s3:PutObjectAcl\",\n\"s3:GetObject\",\n\"s3:DeleteObject\"\n],\n\"Resource\": [\n\"arn:aws:s3:::<bucket-name>\/*\"\n]\n},\n{\n\"Sid\": \"DatabricksAutoLoaderListTopics\",\n\"Effect\": \"Allow\",\n\"Action\": [\n\"sqs:ListQueues\",\n\"sqs:ListQueueTags\",\n\"sns:ListTopics\"\n],\n\"Resource\": \"arn:aws:sns:<region>:<account-number>:*\"\n}\n]\n}\n\n``` \n### Securely ingest data in a different AWS account \nAuto Loader can load data across AWS accounts by assuming an IAM role. After setting the temporary security credentials created by `AssumeRole`, you can have Auto Loader load cloud files cross-accounts. To set up the Auto Loader for cross-AWS accounts, follow the doc: [Access cross-account S3 buckets with an AssumeRole policy](https:\/\/docs.databricks.com\/archive\/admin-guide\/assume-role.html). 
Make sure you: \n* Verify that you have the AssumeRole meta role assigned to the cluster.\n* Configure the cluster\u2019s Spark configuration to include the following properties: \n```\nfs.s3a.credentialsType AssumeRole\nfs.s3a.stsAssumeRole.arn arn:aws:iam::<bucket-owner-acct-id>:role\/MyRoleB\nfs.s3a.acl.default BucketOwnerFullControl\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n### Compare Auto Loader file detection modes\n##### What is Auto Loader file notification mode?\n###### Required permissions for configuring file notification for GCS\n\nYou must have `list` and `get` permissions on your GCS bucket and on all the objects. For details, see the Google documentation on [IAM permissions](https:\/\/cloud.google.com\/storage\/docs\/access-control\/iam-permissions). \nTo use file notification mode, you need to add permissions for the [GCS service account](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html#finding-the-gcs-service-account) and the account used to access the Google Cloud Pub\/Sub resources. \nAdd the `Pub\/Sub Publisher` role to the GCS service account. This allows the account to publish event notification messages from your GCS buckets to Google Cloud Pub\/Sub. 
\nAs for the service account used for the Google Cloud Pub\/Sub resources, you need to add the following permissions: \n```\npubsub.subscriptions.consume\npubsub.subscriptions.create\npubsub.subscriptions.delete\npubsub.subscriptions.get\npubsub.subscriptions.list\npubsub.subscriptions.update\npubsub.topics.attachSubscription\npubsub.topics.create\npubsub.topics.delete\npubsub.topics.get\npubsub.topics.list\npubsub.topics.update\n\n``` \nTo do this, you can either [create an IAM custom role](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html#custom-gcp-role) with these permissions or assign [pre-existing GCP roles](https:\/\/cloud.google.com\/pubsub\/docs\/access-control#roles) to cover these permissions. \n### Finding the GCS Service Account \nIn the Google Cloud Console for the corresponding project, navigate to `Cloud Storage > Settings`.\nThe section \u201cCloud Storage Service Account\u201d contains the email of the GCS service account. \n![GCS Service Account](https:\/\/docs.databricks.com\/_images\/google-gcs-service-account.png) \n### Creating a Custom Google Cloud IAM Role for File Notification Mode \nIn the Google Cloud console for the corresponding project, navigate to `IAM & Admin > Roles`. Then, either create a role at the top or update an existing role. In the screen for role creation or edit, click `Add Permissions`. A menu appears in which you can add the desired permissions to the role. \n![GCP IAM Custom Roles](https:\/\/docs.databricks.com\/_images\/google-gcp-custom-role.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n### Compare Auto Loader file detection modes\n##### What is Auto Loader file notification mode?\n###### Manually configure or manage file notification resources\n\nPrivileged users can manually configure or manage file notification resources. 
\n* Set up the file notification services manually through the cloud provider and manually specify the queue identifier. See [File notification options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#file-notification-options) for more details.\n* Use Scala APIs to create or manage the notifications and queuing services, as shown in the following example: \nNote \nYou must have appropriate permissions to configure or modify cloud infrastructure. See permissions documentation for [Azure](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html#permissions-azure), [S3](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html#permissions-s3), or [GCS](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html#permissions-gcs). \n```\n# Databricks notebook source\n# MAGIC %md ## Python bindings for CloudFiles Resource Managers for all 3 clouds\n\n# COMMAND ----------\n\n#####################################\n## Creating a ResourceManager in AWS\n#####################################\n\nmanager = spark._jvm.com.databricks.sql.CloudFilesAWSResourceManager \\\n.newManager() \\\n.option(\"cloudFiles.region\", <region>) \\\n.option(\"path\", <path-to-specific-bucket-and-folder>) \\\n.create()\n\n#######################################\n## Creating a ResourceManager in Azure\n#######################################\n\nmanager = spark._jvm.com.databricks.sql.CloudFilesAzureResourceManager \\\n.newManager() \\\n.option(\"cloudFiles.connectionString\", <connection-string>) \\\n.option(\"cloudFiles.resourceGroup\", <resource-group>) \\\n.option(\"cloudFiles.subscriptionId\", <subscription-id>) \\\n.option(\"cloudFiles.tenantId\", <tenant-id>) \\\n.option(\"cloudFiles.clientId\", <service-principal-client-id>) \\\n.option(\"cloudFiles.clientSecret\", <service-principal-client-secret>) \\\n.option(\"path\", <path-to-specific-container-and-folder>) 
\\\n.create()\n\n#######################################\n## Creating a ResourceManager in GCP\n#######################################\nmanager = spark._jvm.com.databricks.sql.CloudFilesGCPResourceManager \\\n.newManager() \\\n.option(\"path\", <path-to-specific-bucket-and-folder>) \\\n.create()\n\n# Set up a queue and a topic subscribed to the path provided in the manager.\nmanager.setUpNotificationServices(<resource-suffix>)\n\n# List notification services created by <AL>\nfrom pyspark.sql import DataFrame\ndf = DataFrame(manager.listNotificationServices())\n\n# Tear down the notification services created for a specific stream ID.\n# Stream ID is a GUID string that you can find in the list result above.\nmanager.tearDownNotificationServices(<stream-id>)\n\n``` \n```\n\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\n\/\/ Creating a ResourceManager in AWS\n\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\n\nimport com.databricks.sql.CloudFilesAWSResourceManager\nval manager = CloudFilesAWSResourceManager\n.newManager\n.option(\"cloudFiles.region\", <region>) \/\/ optional, will use the region of the EC2 instances by default\n.option(\"path\", <path-to-specific-bucket-and-folder>) \/\/ required only for setUpNotificationServices\n.create()\n\n\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\n\/\/ Creating a ResourceManager in Azure\n\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\n\nimport com.databricks.sql.CloudFilesAzureResourceManager\nval manager = CloudFilesAzureResourceManager\n.newManager\n.option(\"cloudFiles.connectionString\", <connection-string>)\n.option(\"cloudFiles.resourceGroup\", <resource-group>)\n.option(\"cloudFiles.subscriptionId\", <subscription-id>)\n.option(\"cloudFiles.tenantId\", <tenant-id>)\n.option(\"cloudFiles.clientId\", <service-principal-client-id>)\n.option(\"cloudFiles.clientSecret\", 
<service-principal-client-secret>)\n.option(\"path\", <path-to-specific-container-and-folder>) \/\/ required only for setUpNotificationServices\n.create()\n\n\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\n\/\/ Creating a ResourceManager in GCP\n\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\n\nimport com.databricks.sql.CloudFilesGCPResourceManager\nval manager = CloudFilesGCPResourceManager\n.newManager\n.option(\"path\", <path-to-specific-bucket-and-folder>) \/\/ Required only for setUpNotificationServices.\n.create()\n\n\/\/ Set up a queue and a topic subscribed to the path provided in the manager.\nmanager.setUpNotificationServices(<resource-suffix>)\n\n\/\/ List notification services created by <AL>\nval df = manager.listNotificationServices()\n\n\/\/ Tear down the notification services created for a specific stream ID.\n\/\/ Stream ID is a GUID string that you can find in the list result above.\nmanager.tearDownNotificationServices(<stream-id>)\n\n``` \nUse `setUpNotificationServices(<resource-suffix>)` to create a queue and a subscription with the name `<prefix>-<resource-suffix>` (the prefix depends on the storage system, as summarized in [Cloud resources used in Auto Loader file notification mode](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html#file-notification)). If there is an existing resource with the same name, Databricks reuses the existing resource instead of creating a new one. This function returns a queue identifier that you can pass to the `cloudFiles` source using the identifier in [File notification options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#file-notification-options). This enables the `cloudFiles` source user to have fewer permissions than the user who creates the resources. 
\nProvide the `\"path\"` option to `newManager` only if calling `setUpNotificationServices`; it is not needed for `listNotificationServices` or `tearDownNotificationServices`. This is the same `path` that you use when running a streaming query. \nThe following matrix indicates which API methods are supported in which Databricks Runtime for each type of storage: \n| Cloud Storage | Setup API | List API | Tear down API |\n| --- | --- | --- | --- |\n| AWS S3 | All versions | All versions | All versions |\n| ADLS Gen2 | All versions | All versions | All versions |\n| GCS | Databricks Runtime 9.1 and above | Databricks Runtime 9.1 and above | Databricks Runtime 9.1 and above |\n| Azure Blob Storage | All versions | All versions | All versions |\n| ADLS Gen1 | Unsupported | Unsupported | Unsupported |\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Amazon S3 Select\n\n[Amazon S3 Select](https:\/\/aws.amazon.com\/s3\/features\/?nc=nsb&pg=ln#Query_in_Place) enables retrieving only required data from an object. The Databricks S3 Select connector provides an Apache Spark data source that leverages S3 Select. 
When you use an S3 Select data source, filters and column selections on a DataFrame are pushed down, saving S3 data bandwidth.\n\n#### Amazon S3 Select\n##### Limitations\n\nAmazon S3 Select supports the following file formats: \n* CSV and JSON files\n* UTF-8 encoding\n* GZIP or no compression \nThe Databricks S3 Select connector has the following limitations: \n* Complex types (arrays and objects) cannot be used in JSON\n* Schema inference is not supported\n* File splitting is not supported; however, multiline records are supported\n* DBFS mount points are not supported \nImportant \nDatabricks strongly encourages you to use `S3AFileSystem` provided by Databricks, which is the default for `s3a:\/\/`, `s3:\/\/`, and `s3n:\/\/` file system schemes in Databricks Runtime. If you need assistance with migration to `S3AFileSystem`, contact Databricks support or your Databricks account team.\n\n#### Amazon S3 Select\n##### Usage\n\n```\nspark.read.format(\"s3select\").schema(...).options(...).load(\"s3:\/\/bucket\/filename\")\n\n``` \n```\nCREATE TABLE name (...) USING S3SELECT LOCATION 's3:\/\/bucket\/filename' [ OPTIONS (...) ]\n\n``` \nIf the filename extension is `.csv` or `.json`, the format is automatically detected; otherwise you must provide the `FileFormat` option.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-s3-select.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Amazon S3 Select\n##### Options\n\nThis section describes options for all file types and options specific to CSV and JSON. 
\n### Generic options \n| Option name | Default value | Description |\n| --- | --- | --- |\n| FileFormat | \u2018auto\u2019 | Input file type (\u2018auto\u2019, \u2018csv\u2019, or \u2018json\u2019) |\n| CompressionType | \u2018none\u2019 | Compression codec used by the input file (\u2018none\u2019 or \u2018gzip\u2019) | \n### CSV specific options \n| Option name | Default value | Description |\n| --- | --- | --- |\n| NullValue | \u2018\u2019 | Character string representing null values in the input |\n| Header | false | Whether to skip the first line of the input (potential header contents are ignored) |\n| Comment | \u2018#\u2019 | Lines starting with the value of this parameter are ignored |\n| RecordDelimiter | \u2018\\n\u2019 | Character separating records in a file |\n| Delimiter | \u2018,\u2019 | Character separating fields within a record |\n| Quote | \u2018\u201d\u2019 | Character used to quote values containing reserved characters |\n| Escape | \u2018\u201d\u2019 | Character used to escape quoted quote character |\n| AllowQuotedRecordDelimiter | false | Whether values can contain quoted record delimiters | \n### JSON specific options \n| Option name | Default value | Description |\n| --- | --- | --- |\n| Type | document | Type of input (\u2018document\u2019 or \u2018lines\u2019) |\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-s3-select.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Amazon S3 Select\n##### S3 authentication\n\nYou can use the S3 authentication methods (keys and instance profiles) available in Databricks; we recommend that you use [instance profiles](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html). There are three ways of providing the credentials: \n1. 
**Default Credential Provider Chain (recommended option):**\nAWS credentials are automatically retrieved through the [DefaultAWSCredentialsProviderChain](https:\/\/docs.aws.amazon.com\/sdk-for-java\/v1\/developer-guide\/credentials.html). If you use instance profiles to authenticate to S3 then you should use this method. Other methods of providing credentials (methods 2 and 3) take precedence over this default.\n2. **Set keys in Hadoop conf:** Specify AWS keys in [Hadoop configuration properties](https:\/\/github.com\/apache\/hadoop\/blob\/trunk\/hadoop-tools\/hadoop-aws\/src\/site\/markdown\/tools\/hadoop-aws\/index.md). \nImportant \n* When using AWS keys to access S3, always set the configuration properties `fs.s3n.awsAccessKeyId` and `fs.s3n.awsSecretAccessKey` as shown in the following example; the properties `fs.s3a.access.key` and `fs.s3a.secret.key` are *not supported*.\n* To reference the `s3a:\/\/` filesystem, set the `fs.s3n.awsAccessKeyId` and `fs.s3n.awsSecretAccessKey` properties in a Hadoop XML configuration file or call `sc.hadoopConfiguration.set()` to set Spark\u2019s global Hadoop configuration. \n```\nsc.hadoopConfiguration.set(\"fs.s3n.awsAccessKeyId\", \"$AccessKey\")\nsc.hadoopConfiguration.set(\"fs.s3n.awsSecretAccessKey\", \"$SecretKey\")\n\n``` \n```\nsc._jsc.hadoopConfiguration().set(\"fs.s3n.awsAccessKeyId\", ACCESS_KEY)\nsc._jsc.hadoopConfiguration().set(\"fs.s3n.awsSecretAccessKey\", SECRET_KEY)\n\n```\n3. 
**Encode keys in URI**: For example, the URI `s3a:\/\/$AccessKey:$SecretKey@bucket\/path\/to\/dir` encodes the key pair (`AccessKey`, `SecretKey`).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-s3-select.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### Tutorial: Create and deploy a training run using Foundation Model Training\n\nImportant \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). Reach out to your Databricks account team to enroll in the Public Preview. \nThis article describes how to create and configure a run using the Foundation Model Training API, and then review the results and deploy the model using the Databricks UI and Databricks Model Serving.\n\n#### Tutorial: Create and deploy a training run using Foundation Model Training\n##### Requirements\n\n* A workspace in the `us-east-1` or `us-west-2` AWS region.\n* Databricks Runtime 12.2 LTS ML or above.\n* This tutorial must be run in a [Databricks notebook](https:\/\/docs.databricks.com\/notebooks\/index.html).\n* Training data in the accepted format. See [Prepare data for Foundation Model Training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/data-preparation.html).\n\n#### Tutorial: Create and deploy a training run using Foundation Model Training\n##### Step 1: Prepare your data for training\n\nSee [Prepare data for Foundation Model Training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/data-preparation.html).\n\n#### Tutorial: Create and deploy a training run using Foundation Model Training\n##### Step 2: Install the `databricks_genai` SDK\n\nUse the following to install the `databricks_genai` SDK. 
\n```\n%pip install databricks_genai\n\n``` \nNext, import the `foundation_model` library: \n```\ndbutils.library.restartPython()\nfrom databricks.model_training import foundation_model as fm\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/fine-tune-run-tutorial.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### Tutorial: Create and deploy a training run using Foundation Model Training\n##### Step 3: Create a training run\n\nCreate a training run using the Foundation Model Training `create()` function. The following parameters are required: \n* `model`: the model you want to train.\n* `train_data_path`: the location of the training dataset.\n* `register_to`: the Unity Catalog catalog and schema where you want checkpoints saved. \nFor example: \n```\nrun = fm.create(model='meta-llama\/Llama-2-7b-chat-hf',\ntrain_data_path='dbfs:\/Volumes\/main\/my-directory\/ift\/train.jsonl', # UC Volume with JSONL formatted data\nregister_to='main.my-directory',\ntraining_duration='1ep')\n\nrun\n\n```\n\n#### Tutorial: Create and deploy a training run using Foundation Model Training\n##### Step 4: View the status of a run\n\nThe time it takes to complete a training run depends on the number of tokens, the model, and GPU availability. For faster training, Databricks recommends that you use reserved compute. Reach out to your Databricks account team for details. \nAfter you launch your run, you can monitor its status using `get_events()`. \n```\nrun.get_events()\n\n```\n\n#### Tutorial: Create and deploy a training run using Foundation Model Training\n##### Step 5: View metrics and outputs\n\nFollow these steps to view the results in the Databricks UI: \n1. In the Databricks workspace, click **Experiments** in the left nav bar.\n2. Select your experiment from the list.\n3. Review the metrics charts in the **Charts** tab. \n1. 
The primary training metric showing progress is loss. Evaluation loss can be used to see if your model is overfitting to your training data. However, do not rely on loss alone: in supervised training tasks, the evaluation loss can suggest overfitting while the model continues to improve.\n2. In this tab, you can also view the output of your evaluation prompts if you specified them.\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/fine-tune-run-tutorial.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### Tutorial: Create and deploy a training run using Foundation Model Training\n##### Step 6: Evaluate multiple customized models with MLflow LLM Evaluate before deploying\n\nSee [Evaluate large language models with MLflow](https:\/\/docs.databricks.com\/mlflow\/llm-evaluate.html).\n\n#### Tutorial: Create and deploy a training run using Foundation Model Training\n##### Step 7: Deploy your model\n\nThe training run automatically registers your model in Unity Catalog after it completes. The model is registered based on what you specified in the `register_to` field in the run `create()` method. \nTo deploy the model for serving, follow these steps: \n1. Navigate to the model in Unity Catalog.\n2. Click **Serve this model**.\n3. Click **Create serving endpoint**.\n4. In the **Name** field, provide a name for your endpoint.\n5. 
Click **Create**.\n\n#### Tutorial: Create and deploy a training run using Foundation Model Training\n##### Additional resources\n\n* [Create a training run using the Foundation Model Training API](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/create-fine-tune-run.html)\n* [Foundation Model Training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html)\n* [Model serving with Databricks](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/fine-tune-run-tutorial.html"} +{"content":"# AI and Machine Learning on Databricks\n### ML lifecycle management using MLflow\n\nThis article describes how MLflow is used in Databricks for machine learning lifecycle management. It also includes examples that introduce each MLflow component and links to content that describes how these components are hosted within Databricks. \nML lifecycle management in Databricks is provided by managed [MLflow](https:\/\/www.mlflow.org\/). Databricks provides a fully managed and hosted version of MLflow integrated with enterprise security features, high availability, and other Databricks workspace features such as experiment and run management and notebook revision capture. \nFirst-time users should begin with [Get started with MLflow experiments](https:\/\/docs.databricks.com\/mlflow\/quick-start.html), which demonstrates the basic MLflow tracking APIs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n### ML lifecycle management using MLflow\n#### What is MLflow?\n\nMLflow is an open source platform for managing the end-to-end machine learning lifecycle. 
It has the following primary components: \n* Tracking: Allows you to track experiments to record and compare parameters and results.\n* Models: Allow you to manage and deploy models from a variety of ML libraries to a variety of model serving and inference platforms.\n* Projects: Allow you to package ML code in a reusable, reproducible form to share with other data scientists or transfer to production.\n* Model Registry: Allows you to centralize a model store for managing models\u2019 full lifecycle stage transitions: from staging to production, with capabilities for versioning and annotating. Databricks provides a managed version of the Model Registry in Unity Catalog.\n* Model Serving: Allows you to host MLflow models as REST endpoints. Databricks provides a unified interface to deploy, govern, and query your served AI models. \nMLflow supports [Java](https:\/\/www.mlflow.org\/docs\/latest\/java_api\/index.html), [Python](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/index.html), [R](https:\/\/www.mlflow.org\/docs\/latest\/R-api.html), and [REST](https:\/\/docs.databricks.com\/api\/workspace\/experiments) APIs. \nNote \nIf you\u2019re just getting started with Databricks, consider using MLflow on [Databricks Community Edition](https:\/\/docs.databricks.com\/getting-started\/community-edition.html), which provides a simple managed MLflow experience for lightweight experimentation. Remote execution of MLflow projects is not supported on Databricks Community Edition. We plan to impose moderate limits on the number of experiments and runs. For the initial launch of MLflow on Databricks Community Edition no limits are imposed. \nMLflow data stored in the control plane (experiment runs, metrics, tags and params) is encrypted using a platform-managed key. Encryption using [Customer-managed keys for managed services](https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html#managed-services) is not supported for that data. 
On the other hand, the MLflow models and artifacts stored in your root (DBFS) storage can be encrypted using your own key by configuring customer-managed keys for workspace storage.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n### ML lifecycle management using MLflow\n#### MLflow tracking\n\nMLflow on Databricks offers an integrated experience for tracking and securing training runs for machine learning and deep learning models. \n* [Track model development using MLflow](https:\/\/docs.databricks.com\/machine-learning\/track-model-development\/index.html)\n* [Databricks Autologging](https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n### ML lifecycle management using MLflow\n#### Model lifecycle management\n\n[MLflow Model Registry](https:\/\/www.mlflow.org\/docs\/latest\/model-registry.html) is a centralized model repository and a UI and set of APIs that enable you to manage the full lifecycle of MLflow Models. Databricks provides a hosted version of the MLflow Model Registry in Unity Catalog. Unity Catalog provides centralized model governance, cross-workspace access, lineage, and deployment. For details about managing the model lifecycle in Unity Catalog, see [Manage model lifecycle in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html). \nIf your workspace is not enabled for Unity Catalog, you can use the [Workspace Model Registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html). \n### Model Registry concepts \n* **Model**: An MLflow Model logged from an experiment or run that is logged with one of the model flavor\u2019s `mlflow.<model-flavor>.log_model` methods. 
After a model is logged, you can register it with the Model Registry.\n* **Registered model**: An MLflow Model that has been registered with the Model Registry. The registered model has a unique name, versions, model lineage, and other metadata.\n* **Model version**: A version of a registered model. When a new model is added to the Model Registry, it is added as Version 1. Each model registered to the same model name increments the version number.\n* **Model alias**: An alias is a mutable, named reference to a particular version of a registered model. Typical uses of aliases are to specify which model versions are deployed in a given environment in your model training workflows or to write inference workloads that target a specific alias. For example, you could assign the \u201cChampion\u201d alias of your \u201cFraud Detection\u201d registered model to the model version that should serve the majority of production traffic, and then write inference workloads that target that alias (that is, make predictions using the \u201cChampion\u201d version).\n* **Model stage** (workspace model registry only): A model version can be assigned one or more stages. MLflow provides predefined stages for the common use cases: **None**, **Staging**, **Production**, and **Archived**. With the appropriate permission you can transition a model version between stages or you can request a model stage transition. Model version stages are not used in Unity Catalog.\n* **Description**: You can annotate a model\u2019s intent, including a description and any relevant information useful for the team such as algorithm description, dataset employed, or methodology. 
\n### Example notebooks \nFor an example that illustrates how to use the Model Registry to build a machine learning application that forecasts the daily power output of a wind farm, see the following: \n* [Models in Unity Catalog example](https:\/\/docs.databricks.com\/mlflow\/models-in-uc-example.html)\n* [Workspace Model Registry example](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n### ML lifecycle management using MLflow\n#### Model deployment\n\n[Databricks Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) provides a unified interface to deploy, govern, and query AI models. Each model you serve is available as a REST API that you can integrate into your web or client application. \nModel serving supports serving: \n* [Custom models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/custom-models.html). These are Python models packaged in the MLflow format. They can be registered either in Unity Catalog or in the workspace model registry. Examples include scikit-learn, XGBoost, PyTorch, and Hugging Face transformer models.\n* State-of-the-art open models made available by [Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html). These models are curated foundation model architectures that support optimized inference. Base models, like Llama-2-70B-chat, BGE-Large, and Mistral-7B are available for immediate use with **pay-per-token** pricing, and workloads that require performance guarantees and fine-tuned model variants can be deployed with **provisioned throughput**.\n* [External models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html). These are models that are hosted outside of Databricks. 
Examples include foundation models like OpenAI\u2019s GPT-4, Anthropic\u2019s Claude, and others. Endpoints that serve external models can be centrally governed and customers can establish rate limits and access controls for them. \nYou can also deploy MLflow models for offline inference. See [Deploy models for batch inference](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n##### Access the MLflow tracking server from outside Databricks\n\nYou may wish to log to the MLflow tracking server from your own applications or from the MLflow CLI. \nThis article describes the required configuration steps. Start by installing MLflow and configuring your credentials (Step 1). You can then either configure an application (Step 2) or configure the MLflow CLI (Step 3). \nFor information on how to launch and log to an open-source tracking server, see the [open source documentation](https:\/\/mlflow.org\/docs\/latest\/quickstart.html#logging-to-a-remote-tracking-server).\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/access-hosted-tracking-server.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n##### Access the MLflow tracking server from outside Databricks\n###### Step 1: Configure your environment\n\nIf you don\u2019t have a Databricks account, you can try Databricks for free. See [Get started: Account and workspace setup](https:\/\/docs.databricks.com\/getting-started\/index.html). \nTo configure your environment to access your Databricks-hosted MLflow tracking server: \n1. Install MLflow using `pip install mlflow`.\n2. Configure authentication according to your Databricks subscription. \n* Community Edition. 
Do one of the following: \n+ (Recommended) Use `mlflow.login()` to be prompted for your credentials. \n```\nimport mlflow\n\nmlflow.login()\n\n``` \nThe following is an example response. If authentication succeeds, you see the message \u201cSuccessfully signed in Databricks!\u201d. \n```\n2023\/10\/25 22:59:27 ERROR mlflow.utils.credentials: Failed to sign in Databricks: default auth: cannot configure default credentials\nDatabricks Host (should begin with https:\/\/): https:\/\/community.cloud.databricks.com\/\nUsername: weirdmouse@gmail.com\nPassword: \u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\n2023\/10\/25 22:59:38 INFO mlflow.utils.credentials: Successfully signed in Databricks!\n\n```\n+ Specify credentials using environment variables: \n```\n# Configure MLflow to communicate with a Databricks-hosted tracking server\nexport MLFLOW_TRACKING_URI=databricks\n# Specify your Databricks username & password\nexport DATABRICKS_USERNAME=\"...\"\nexport DATABRICKS_PASSWORD=\"...\"\n\n```\n* Databricks Platform. 
Do one of the following: \n+ [Generate a REST API token](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement) and [create your credentials file](https:\/\/docs.databricks.com\/archive\/dev-tools\/cli\/index.html#cli-auth) using `databricks configure --token`.\n+ Specify credentials using environment variables: \n```\n# Configure MLflow to communicate with a Databricks-hosted tracking server\nexport MLFLOW_TRACKING_URI=databricks\n# Specify the workspace hostname and token\nexport DATABRICKS_HOST=\"...\"\nexport DATABRICKS_TOKEN=\"...\"\n# Or specify your Databricks username & password\nexport DATABRICKS_USERNAME=\"...\"\nexport DATABRICKS_PASSWORD=\"...\"\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/access-hosted-tracking-server.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n##### Access the MLflow tracking server from outside Databricks\n###### Step 2: Configure MLflow applications\n\nConfigure MLflow applications to log to Databricks by [setting the tracking URI](https:\/\/mlflow.org\/docs\/latest\/tracking.html#logging-to-a-tracking-server) to `databricks`, or to `databricks:\/\/<profileName>` if you specified a profile name via `--profile` while creating your credentials file. For example, you can achieve this by setting the `MLFLOW_TRACKING_URI` environment variable to \u201cdatabricks\u201d.\n\n##### Access the MLflow tracking server from outside Databricks\n###### Step 3: Configure the MLflow CLI\n\nConfigure the MLflow CLI to communicate with a Databricks tracking\nserver by setting the `MLFLOW_TRACKING_URI` environment variable. 
For example, to create an experiment\nusing the CLI with the tracking URI `databricks`, run: \n```\n# Replace <your-username> with your Databricks username\nexport MLFLOW_TRACKING_URI=databricks\nmlflow experiments create -n \/Users\/<your-username>\/my-experiment\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/access-hosted-tracking-server.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n#### Credential redaction\n\nDatabricks redacts keys and credentials in audit logs and log4j Apache Spark logs to\nprotect your data from information leaks. Databricks redacts three\ntypes of credentials at logging time: AWS access keys, AWS secret access keys, and\ncredentials in URIs. Upon detection of these secrets, Databricks replaces\nthem with placeholders. For some credential types, Databricks also appends a `hash_prefix`,\nwhich is the first 8 hexadecimal digits of the MD5 checksum of the\ncredential, for verification purposes.\n\n#### Credential redaction\n##### AWS access key redaction\n\nFor AWS access keys, Databricks searches for strings starting with `AKIA` and replaces them with `REDACTED_AWS_ACCESS_KEY(hash_prefix)`. For example, Databricks logs `2017\/01\/08: Accessing AWS using AKIADEADBEEFDEADBEEF` as `2017\/01\/08: Accessing AWS using REDACTED_AWS_ACCESS_KEY(655f9d2f)`.\n\n#### Credential redaction\n##### AWS secret access key redaction\n\nDatabricks replaces an AWS secret access key with `REDACTED_POSSIBLE_AWS_SECRET_ACCESS_KEY` without appending\nits hash. For example, Databricks logs `2017\/01\/08: Accessing AWS using 99Abcdeuw+zXXAxllliupwqqqzDEUFdAtaBrickX`\nas `2017\/01\/08: Accessing AWS using REDACTED_POSSIBLE_AWS_SECRET_ACCESS_KEY`. 
\nBecause AWS secret access keys have no explicit identifier, Databricks might also\nredact other seemingly random 40-character strings that are not\nAWS secret access keys.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/redaction.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n#### Credential redaction\n##### Credentials in URI redaction\n\nDatabricks detects `\/\/username:password@mycompany.com` in URIs and replaces `username:password` with\n`REDACTED_CREDENTIALS(hash_prefix)`. Databricks computes the hash from `username:password`\n(including the `:`). For example, Databricks logs `2017\/01\/08: Accessing https:\/\/admin:admin@mycompany.com`\nas `2017\/01\/08: Accessing https:\/\/REDACTED_CREDENTIALS(d2abaa37)@mycompany.com`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/redaction.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Snowplow\n\nSnowplow provides a behavioral data platform that collects first-party customer data in real time from web, mobile, IoT, and server-side apps and creates modeled tables. \nYou can integrate your Databricks SQL warehouses and Databricks clusters with Snowplow.\n\n#### Connect to Snowplow\n##### Connect to Snowplow using Partner Connect\n\nTo connect to Snowplow using Partner Connect, see [Connect to ingestion partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/ingestion.html). \nNote \nPartner Connect only supports SQL warehouses for Snowplow. To connect using a cluster, do so manually.\n\n#### Connect to Snowplow\n##### Connect to Snowplow manually\n\nNote \nYou can use Partner Connect to simplify the connection experience with a SQL warehouse. 
\nTo connect to Snowplow manually, see [Databricks destination configuration](https:\/\/docs.snowplow.io\/docs\/getting-started-on-snowplow-bdp-cloud\/configuring-destinations\/databricks\/) and [Databricks loader](https:\/\/docs.snowplow.io\/docs\/pipeline-components-and-applications\/loaders-storage-targets\/snowplow-rdb-loader\/loading-transformed-data\/databricks-loader\/) in the Snowplow documentation.\n\n#### Connect to Snowplow\n##### Additional resources\n\nExplore the following Snowplow resources: \n* [Website](https:\/\/snowplow.io\/)\n* [Documentation](https:\/\/docs.snowplow.io\/docs\/)\n* [Support channels](https:\/\/docs.snowplow.io\/statement-of-support\/#support-channels)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/snowplow.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n#### Operational excellence for the data lakehouse\n\nThe architectural principles of the **operational excellence** pillar cover all operational processes that keep the lakehouse running. Operational excellence addresses the ability to operate the lakehouse efficiently and discusses how to operate, manage, and monitor the lakehouse to deliver business value. \n![Operational excellence lakehouse architecture diagram for Databricks.](https:\/\/docs.databricks.com\/_images\/operational-excellence.png)\n\n#### Operational excellence for the data lakehouse\n##### Principles of operational excellence\n\n1. **Optimize build and release processes** \nUse software engineering best practices across your entire lakehouse environment. Build and release using continuous integration and continuous delivery pipelines for both DevOps and MLOps.\n2. **Automate deployments and workloads** \nAutomating deployments and workloads for the lakehouse helps standardize these processes, eliminate human error, improve productivity, and provide greater repeatability. 
This includes using \u201cconfiguration as code\u201d to avoid configuration drift, and \u201cinfrastructure as code\u201d to automate the provisioning of all required lakehouse and cloud services. \nFor ML specifically, processes should drive automation: Not every step of a process can or should be automated. People still determine the business questions, and some models will always need human oversight before deployment. Therefore, the development process is primary and each module in the process should be automated as needed. This allows incremental build-out of automation and customization.\n3. **Set up monitoring, alerting, and logging** \nWorkloads in the lakehouse typically integrate Databricks platform services and external cloud services, for example as data sources or targets. Successful execution can only occur if each service in the execution chain is functioning properly. When this is not the case, monitoring, alerting, and logging are important to detect and track problems and understand system behavior.\n4. **Manage capacity and quotas** \nFor any service that is launched in a cloud, take limits into account, for example access rate limits, number of instances, number of users, and memory requirements. 
Before designing a solution, these limits must be understood.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/index.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n#### Operational excellence for the data lakehouse\n##### Next: Best practices for operational excellence\n\nSee [Best practices for operational excellence](https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/best-practices.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/index.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Ingest and insert additional data\n\nThis get started article walks you through using a Databricks notebook to ingest a CSV file containing additional baby name data into your Unity Catalog volume and then import the new baby name data into an existing table by using Python, Scala, and R. \nImportant \nThis get started article builds on [Get started: Import and visualize CSV data from a notebook](https:\/\/docs.databricks.com\/getting-started\/import-visualize-data.html). You must complete the steps in that article in order to complete this article. For the complete notebook for that getting started article, see [Import and visualize data notebooks](https:\/\/docs.databricks.com\/getting-started\/import-visualize-data.html#notebook).\n\n### Get started: Ingest and insert additional data\n#### Requirements\n\nTo complete the tasks in this article, you must meet the following requirements: \n* Your workspace must have [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) enabled. 
For information on getting started with Unity Catalog, see [Set up and manage Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html).\n* You must have permission to use an existing compute resource or create a new compute resource. See [Get started: Account and workspace setup](https:\/\/docs.databricks.com\/getting-started\/index.html) or see your Databricks administrator. \nTip \nFor a completed notebook for this article, see [Ingest additional data notebooks](https:\/\/docs.databricks.com\/getting-started\/ingest-insert-additional-data.html#notebook).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/ingest-insert-additional-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Ingest and insert additional data\n#### Step 1: Create a new notebook\n\nTo create a notebook in your workspace: \n1. Click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar, and then click **Notebook**.\n2. On the Create Notebook page: \n* Specify a unique name for your notebook.\n* Set the default language for your notebook and then click **Confirm** if prompted.\n* Click **Connect** and select a compute resource. To create a new compute resource, see [Use compute](https:\/\/docs.databricks.com\/compute\/use-compute.html). \nTo learn more about creating and managing notebooks, see [Manage notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/ingest-insert-additional-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Ingest and insert additional data\n#### Step 2: Define variables\n\nIn this step, you define variables for use in the example notebook you create in this article. \n1. Copy and paste the following code into the new empty notebook cell. 
Replace `<catalog_name>`, `<schema_name>`, and `<volume_name>` with the catalog, schema, and volume names for a Unity Catalog volume. The table name is set to `baby_names`. You insert the baby name data into this table later in this article.\n2. Press `Shift+Enter` to run the cell and create a new blank cell. \n```\ncatalog = \"<catalog_name>\"\nschema = \"<schema_name>\"\nvolume = \"<volume_name>\"\nfile_name = \"new_baby_names.csv\"\ntable_name = \"baby_names\"\npath_volume = \"\/Volumes\/\" + catalog + \"\/\" + schema + \"\/\" + volume\npath_table = catalog + \".\" + schema\nprint(path_table) # Show the complete path\nprint(path_volume) # Show the complete path\n\n``` \n```\nval catalog = \"<catalog_name>\"\nval schema = \"<schema_name>\"\nval volume = \"<volume_name>\"\nval fileName = \"new_baby_names.csv\"\nval tableName = \"baby_names\"\nval pathVolume = s\"\/Volumes\/${catalog}\/${schema}\/${volume}\"\nval pathTable = s\"${catalog}.${schema}\"\nprintln(pathVolume) \/\/ Show the complete path\nprintln(pathTable) \/\/ Show the complete path\n\n``` \n```\ncatalog <- \"<catalog_name>\"\nschema <- \"<schema_name>\"\nvolume <- \"<volume_name>\"\nfile_name <- \"new_baby_names.csv\"\ntable_name <- \"baby_names\"\npath_volume <- paste0(\"\/Volumes\/\", catalog, \"\/\", schema, \"\/\", volume)\npath_table <- paste0(catalog, \".\", schema)\nprint(path_volume) # Show the complete path\nprint(path_table) # Show the complete path\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/ingest-insert-additional-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Ingest and insert additional data\n#### Step 3: Add new CSV file of data to your Unity Catalog volume\n\nThis step creates a DataFrame named `df` with a new baby name for 2022 and then saves that data into a new CSV file in your Unity Catalog volume. 
\nNote \nThis step simulates adding new yearly data to the existing data loaded for previous years. In your production environment, this incremental data would be stored in cloud storage. \n1. Copy and paste the following code into the new empty notebook cell. This code creates the DataFrame with additional baby name data, and then writes that data to a CSV file in your Unity Catalog volume. \n```\ndata = [[2022, \"CARL\", \"Albany\", \"M\", 42]]\n\ndf = spark.createDataFrame(data, schema=\"Year int, First_Name STRING, County STRING, Sex STRING, Count int\")\n# display(df)\n(df.coalesce(1)\n.write\n.option(\"header\", \"true\")\n.mode(\"overwrite\")\n.csv(f\"{path_volume}\/{file_name}\"))\n\n``` \n```\nval data = Seq((2022, \"CARL\", \"Albany\", \"M\", 42))\nval columns = Seq(\"Year\", \"First_Name\", \"County\", \"Sex\", \"Count\")\n\nval df = data.toDF(columns: _*)\n\n\/\/ display(df)\ndf.coalesce(1)\n.write\n.option(\"header\", \"true\")\n.mode(\"overwrite\")\n.csv(s\"${pathVolume}\/${fileName}\")\n\n``` \n```\n# Load the SparkR package that is already preinstalled on the cluster.\nlibrary(SparkR)\n\ndata <- data.frame(Year = 2022,\nFirst_Name = \"CARL\",\nCounty = \"Albany\",\nSex = \"M\",\nCount = 42)\n\ndf <- createDataFrame(data)\n# display(df)\nwrite.df(df, path = paste0(path_volume, \"\/\", file_name),\nsource = \"csv\",\nmode = \"overwrite\",\nheader = \"true\")\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/ingest-insert-additional-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Ingest and insert additional data\n#### Step 4: Load data into DataFrame from CSV file\n\nNote \nThis step simulates loading data from cloud storage. \n1. Copy and paste the following code into an empty notebook cell. This code loads the new baby names data into a new DataFrame from the CSV file. 
\n```\ndf1 = spark.read.csv(f\"{path_volume}\/{file_name}\",\nheader=True,\ninferSchema=True,\nsep=\",\")\ndisplay(df1)\n\n``` \n```\nval df1 = spark.read\n.option(\"header\", \"true\")\n.option(\"inferSchema\", \"true\")\n.option(\"delimiter\", \",\")\n.csv(s\"$pathVolume\/$fileName\")\ndisplay(df1)\n\n``` \n```\ndf1 <- read.df(paste0(path_volume, \"\/\", file_name),\nsource = \"csv\",\nheader = TRUE,\ninferSchema = TRUE)\ndisplay(df1)\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/ingest-insert-additional-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Ingest and insert additional data\n#### Step 5: Insert into existing table\n\n1. Copy and paste the following code into an empty notebook cell. This code appends the new baby names data from the DataFrame into the existing table. \n```\ndf1.write.mode(\"append\").insertInto(f\"{path_table}.{table_name}\")\ndisplay(spark.sql(f\"SELECT * FROM {path_table}.{table_name} WHERE Year = 2022\"))\n\n``` \n```\ndf1.write.mode(\"append\").insertInto(s\"${pathTable}.${tableName}\")\ndisplay(spark.sql(s\"SELECT * FROM ${pathTable}.${tableName} WHERE Year = 2022\"))\n\n``` \n```\n# The write.df function in R, as provided by the SparkR package, does not directly support writing to Unity Catalog.\n# In this example, you write the DataFrame into a temporary view and then use the SQL command to insert data from the temporary view to the Unity Catalog table\ncreateOrReplaceTempView(df1, \"temp_view\")\nsql(paste0(\"INSERT INTO \", path_table, \".\", table_name, \" SELECT * FROM temp_view\"))\ndisplay(sql(paste0(\"SELECT * FROM \", path_table, \".\", table_name, \" WHERE Year = 2022\")))\n\n```\n2. 
Press `Ctrl+Enter` to run the cell.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/ingest-insert-additional-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Ingest and insert additional data\n#### Ingest additional data notebooks\n\nUse one of the following notebooks to perform the steps in this article. \n### Ingest and insert additional data using Python \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/getting-started\/ingest-insert-additional-data-python.html) \n### Ingest and insert additional data using Scala \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/getting-started\/ingest-insert-additional-data-scala.html) \n### Ingest and insert additional data using R \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/getting-started\/ingest-insert-additional-data-sparkr.html)\n\n### Get started: Ingest and insert additional data\n#### Next steps\n\nTo learn about cleansing and enhancing data, see [Get started: Enhance and cleanse data](https:\/\/docs.databricks.com\/getting-started\/cleanse-enhance-data.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/ingest-insert-additional-data.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Ingest and insert additional data\n#### Additional resources\n\n* [Get started: Query and visualize data from a notebook](https:\/\/docs.databricks.com\/getting-started\/quick-start.html)\n* [Get started: Import and visualize CSV data from a notebook](https:\/\/docs.databricks.com\/getting-started\/import-visualize-data.html)\n* [Tutorial: 
Load and transform data using Apache Spark DataFrames](https:\/\/docs.databricks.com\/getting-started\/dataframes.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/ingest-insert-additional-data.html"} +{"content":"# What is Delta Lake?\n### How does Databricks manage Delta Lake feature compatibility?\n\nDelta Lake is an independent open-source project under the governance of the Linux Foundation. Databricks introduces support for new Delta Lake features and optimizations that build on top of Delta Lake in Databricks Runtime releases. \nDatabricks optimizations that leverage Delta Lake features respect the protocols used in OSS Delta Lake for compatibility. \nMany Databricks optimizations require enabling Delta Lake features on a table. Delta Lake features are always backwards compatible, so tables written by a lower Databricks Runtime version can always be read and written by a higher Databricks Runtime version. Enabling some features breaks forward compatibility with workloads running in a lower Databricks Runtime version. For features that break forward compatibility, you must update all workloads that reference the upgraded tables to use a compliant Databricks Runtime version. \nNote \nYou can drop `deletionVectors`, `v2Checkpoint`, and `typeWidening-preview` on Databricks. See [Drop Delta table features](https:\/\/docs.databricks.com\/delta\/drop-feature.html). \nImportant \nAll protocol change operations conflict with all concurrent writes. \nStreaming reads fail when they encounter a commit that changes table metadata. If you want the stream to continue you must restart it. 
For recommended methods, see [Production considerations for Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/production.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/feature-compatibility.html"} +{"content":"# What is Delta Lake?\n### How does Databricks manage Delta Lake feature compatibility?\n#### What Delta Lake features require Databricks Runtime upgrades?\n\nThe following Delta Lake features break forward compatibility. Features are enabled on a table-by-table basis. This table lists the lowest Databricks Runtime version still supported by Databricks. \n| Feature | Requires Databricks Runtime version or later | Documentation |\n| --- | --- | --- |\n| `CHECK` constraints | Databricks Runtime 9.1 LTS | [Set a CHECK constraint in Databricks](https:\/\/docs.databricks.com\/tables\/constraints.html#check-constraint) |\n| Change data feed | Databricks Runtime 9.1 LTS | [Use Delta Lake change data feed on Databricks](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html) |\n| Generated columns | Databricks Runtime 9.1 LTS | [Delta Lake generated columns](https:\/\/docs.databricks.com\/delta\/generated-columns.html) |\n| Column mapping | Databricks Runtime 10.4 LTS | [Rename and drop columns with Delta Lake column mapping](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html) |\n| Identity columns | Databricks Runtime 10.4 LTS | [Use identity columns in Delta Lake](https:\/\/docs.databricks.com\/delta\/generated-columns.html#identity) |\n| Table features | Databricks Runtime 12.2 LTS | [What are table features?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html#table-features) |\n| Deletion vectors | Databricks Runtime 12.2 LTS | [What are deletion vectors?](https:\/\/docs.databricks.com\/delta\/deletion-vectors.html) |\n| TimestampNTZ | Databricks Runtime 13.3 LTS | [TIMESTAMP\\_NTZ type](https:\/\/docs.databricks.com\/sql\/language-manual\/data-types\/timestamp-ntz-type.html) |\n| 
UniForm | Databricks Runtime 13.3 LTS | [Use UniForm to read Delta Tables with Iceberg clients](https:\/\/docs.databricks.com\/delta\/uniform.html) |\n| Liquid clustering | Databricks Runtime 13.3 LTS | [Use liquid clustering for Delta tables](https:\/\/docs.databricks.com\/delta\/clustering.html) |\n| Type widening | Databricks Runtime 15.2 | [Type widening](https:\/\/docs.databricks.com\/delta\/type-widening.html) | \nSee [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). \nNote \nDelta Live Tables and Databricks SQL automatically upgrade runtime environments with regular releases to support new features. See [Delta Live Tables release notes and the release upgrade process](https:\/\/docs.databricks.com\/release-notes\/delta-live-tables\/index.html) and [Databricks SQL release notes](https:\/\/docs.databricks.com\/sql\/release-notes\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/feature-compatibility.html"} +{"content":"# What is Delta Lake?\n### How does Databricks manage Delta Lake feature compatibility?\n#### What is a table protocol specification?\n\nEvery Delta table has a protocol specification which indicates the set of features that the table supports. The protocol specification is used by applications that read or write the table to determine if they can handle all the features that the table supports. If an application does not know how to handle a feature that is listed as supported in the protocol of a table, then that application is not able to read or write that table. \nThe protocol specification is separated into two components: the *read protocol* and the *write protocol*. \nWarning \nMost protocol version upgrades are irreversible, and upgrading the protocol version might break the existing Delta Lake table readers, writers, or both. 
Databricks recommends you upgrade specific tables only when needed, such as to opt-in to new features in Delta Lake. You should also check to make sure that all of your current and future production tools support Delta Lake tables with the new protocol version. \nProtocol downgrades are available for some features. See [Drop Delta table features](https:\/\/docs.databricks.com\/delta\/drop-feature.html). \n### Read protocol \nThe read protocol lists all features that a table supports and that an application must understand in order to read the table correctly. Upgrading the read protocol of a table requires that all reader applications support the added features. \nImportant \nAll applications that write to a Delta table must be able to construct a snapshot of the table. As such, workloads that write to Delta tables must respect both reader and writer protocol requirements. \nIf you encounter a protocol that is unsupported by a workload on Databricks, you must upgrade to a higher Databricks Runtime that supports that protocol. \n### Write protocol \nThe write protocol lists all features that a table supports and that an application must understand in order to write to the table correctly. Upgrading the write protocol of a table requires that all writer applications support the added features. It does not affect read-only applications, unless the read protocol is also upgraded.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/feature-compatibility.html"} +{"content":"# What is Delta Lake?\n### How does Databricks manage Delta Lake feature compatibility?\n#### Which protocols must be upgraded?\n\nSome features require upgrading both the read protocol and the write protocol. Other features only require upgrading the write protocol. \nAs an example, support for `CHECK` constraints is a write protocol feature: only writing applications need to know about `CHECK` constraints and enforce them. 
\nIn contrast, column mapping requires upgrading both the read and write protocols. Because the data is stored differently in the table, reader applications must understand column mapping so they can read the data correctly.\n\n### How does Databricks manage Delta Lake feature compatibility?\n#### Minimum reader and writer versions\n\nNote \nYou must explicitly upgrade the table protocol version when enabling column mapping. \nWhen you enable Delta features on a table, the table protocol is automatically upgraded. Databricks recommends against changing the `minReaderVersion` and `minWriterVersion` table properties. Changing these table properties does not prevent protocol upgrade. Setting these values to a lower value does not downgrade the table. See [Drop Delta table features](https:\/\/docs.databricks.com\/delta\/drop-feature.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/feature-compatibility.html"} +{"content":"# What is Delta Lake?\n### How does Databricks manage Delta Lake feature compatibility?\n#### What are table features?\n\nIn Databricks Runtime 12.2 LTS and above, Delta Lake table features introduce granular flags specifying which features are supported by a given table. In Databricks Runtime 11.3 LTS and below, Delta Lake features were enabled in bundles called *protocol versions*. Table features are the successor to protocol versions and are designed with the goal of improved flexibility for clients that read and write Delta Lake. See [What is a protocol version?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html#protocol-versions). \nNote \nTable features have protocol version requirements. See [Features by protocol version](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html#protocol-table). \nA Delta table feature is a marker that indicates that the table supports a particular feature. 
Every feature is either a write protocol feature (meaning it only upgrades the write protocol) or a read\/write protocol feature (meaning both read and write protocols are upgraded to enable the feature). \nTo learn more about supported table features in Delta Lake, see the [Delta Lake protocol](https:\/\/github.com\/delta-io\/delta\/blob\/master\/PROTOCOL.md#valid-feature-names-in-table-features).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/feature-compatibility.html"} +{"content":"# What is Delta Lake?\n### How does Databricks manage Delta Lake feature compatibility?\n#### Do table features change how Delta Lake features are enabled?\n\nIf you only interact with Delta tables through Databricks, you can continue to track support for Delta Lake features using minimum Databricks Runtime requirements. Databricks supports reading Delta tables that have been upgraded to table features in all Databricks Runtime LTS releases, as long as all features used by the table are supported by that release. \nIf you read and write from Delta tables using other systems, you might need to consider how table features impact compatibility, because there is a risk that the system could not understand the upgraded protocol versions. \nImportant \nTable features are introduced to the Delta Lake format for writer version 7 and reader version 3. Databricks has backported code to all supported Databricks Runtime LTS versions to add support for table features, but **only for those features already supported in that Databricks Runtime**. 
This means that while you can opt in to using table features to enable generated columns and still work with these tables in Databricks Runtime 9.1 LTS, tables with identity columns enabled (which requires Databricks Runtime 10.4 LTS) are still not supported in that Databricks Runtime.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/feature-compatibility.html"} +{"content":"# What is Delta Lake?\n### How does Databricks manage Delta Lake feature compatibility?\n#### What is a protocol version?\n\nA protocol version is a protocol number that indicates a particular grouping of table features. In Databricks Runtime 11.3 LTS and below, you cannot enable table features individually. Protocol versions bundle a group of features. \nDelta tables specify a separate protocol version for read protocol and write protocol. The transaction log for a Delta table contains protocol versioning information that supports Delta Lake evolution. See [Review Delta Lake table details with describe detail](https:\/\/docs.databricks.com\/delta\/table-details.html). \nThe protocol versions bundle all features from previous protocols. See [Features by protocol version](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html#protocol-table). \nNote \nStarting with writer version 7 and reader version 3, Delta Lake has introduced the concept of table features. Using table features, you can now choose to only enable those features that are supported by other clients in your data ecosystem. See [What are table features?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html#table-features).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/feature-compatibility.html"} +{"content":"# What is Delta Lake?\n### How does Databricks manage Delta Lake feature compatibility?\n#### Features by protocol version\n\nThe following table shows minimum protocol versions required for Delta Lake features. 
\nNote \nIf you are only concerned with Databricks Runtime compatibility, see [What Delta Lake features require Databricks Runtime upgrades?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html#dbr). Delta Sharing only supports reading tables with features that require `minReaderVersion` = `1`. \n| Feature | `minWriterVersion` | `minReaderVersion` | Documentation |\n| --- | --- | --- | --- |\n| Basic functionality | 2 | 1 | [What is Delta Lake?](https:\/\/docs.databricks.com\/delta\/index.html) |\n| `CHECK` constraints | 3 | 1 | [Set a CHECK constraint in Databricks](https:\/\/docs.databricks.com\/tables\/constraints.html#check-constraint) |\n| Change data feed | 4 | 1 | [Use Delta Lake change data feed on Databricks](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html) |\n| Generated columns | 4 | 1 | [Delta Lake generated columns](https:\/\/docs.databricks.com\/delta\/generated-columns.html) |\n| Column mapping | 5 | 2 | [Rename and drop columns with Delta Lake column mapping](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html) |\n| Identity columns | 6 | 2 | [Use identity columns in Delta Lake](https:\/\/docs.databricks.com\/delta\/generated-columns.html#identity) |\n| Table features read | 7 | 1 | [What are table features?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html#table-features) |\n| Table features write | 7 | 3 | [What are table features?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html#table-features) |\n| Deletion vectors | 7 | 3 | [What are deletion vectors?](https:\/\/docs.databricks.com\/delta\/deletion-vectors.html) |\n| TimestampNTZ | 7 | 3 | [TIMESTAMP\\_NTZ type](https:\/\/docs.databricks.com\/sql\/language-manual\/data-types\/timestamp-ntz-type.html) |\n| Liquid clustering | 7 | 3 | [Use liquid clustering for Delta tables](https:\/\/docs.databricks.com\/delta\/clustering.html) |\n| UniForm | 7 | 2 | [Use UniForm to read Delta Tables with Iceberg 
clients](https:\/\/docs.databricks.com\/delta\/uniform.html) |\n| Type widening | 7 | 4 | [Type widening](https:\/\/docs.databricks.com\/delta\/type-widening.html) |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/feature-compatibility.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Syncsort\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nSyncsort helps you break down data silos by integrating legacy, mainframe, and IBM data with Databricks. You can easily pull data from these sources into Delta Lake. \nHere are the steps for using Syncsort with Databricks.\n\n#### Connect to Syncsort\n##### Step 1: Generate a Databricks personal access token\n\nSyncsort authenticates with Databricks using a Databricks personal access token. \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/syncsort.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Syncsort\n##### Step 2: Set up a cluster to support integration needs\n\nSyncsort will write data to an S3 bucket and the Databricks integration cluster will read data from that location. Therefore the integration cluster requires secure access to the S3 bucket. 
\n### Secure access to an S3 bucket \nTo access AWS resources, you can launch the Databricks integration cluster with an instance profile. The instance profile should have access to the staging S3 bucket and the target S3 bucket where you want to write the Delta tables. To create an instance profile and configure the integration cluster to use the role, follow the instructions in [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html). \nAs an alternative, you can use [IAM credential passthrough](https:\/\/docs.databricks.com\/archive\/credential-passthrough\/iam-passthrough.html), which enables user-specific access to S3 data from a shared cluster. \n### Specify the cluster configuration \n1. Set **Cluster Mode** to **Standard**.\n2. Set **Databricks Runtime Version** to a Databricks runtime version.\n3. Enable [optimized writes and auto compaction](https:\/\/docs.databricks.com\/delta\/tune-file-size.html) by adding the following properties to your [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration): \n```\nspark.databricks.delta.optimizeWrite.enabled true\nspark.databricks.delta.autoCompact.enabled true\n\n```\n4. Configure your cluster depending on your integration and scaling needs. \nFor cluster configuration details, see [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html). 
\nSee [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html) for the steps to obtain the JDBC URL and HTTP path.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/syncsort.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Syncsort\n##### Step 3: Obtain JDBC and ODBC connection details to connect to a cluster\n\nTo connect a Databricks cluster to Syncsort you need the following JDBC\/ODBC connection properties: \n* JDBC URL\n* HTTP Path\n\n#### Connect to Syncsort\n##### Step 4: Configure Syncsort with Databricks\n\nGo to the [Databricks and Connect for Big Data](https:\/\/www.syncsort.com\/en\/solutions\/databricks-and-connect-for-big-data) login page and follow the instructions.\n\n#### Connect to Syncsort\n##### Additional resources\n\n[Support](https:\/\/support.precisely.com\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/syncsort.html"} +{"content":"# What is Databricks Marketplace?\n### Databricks Marketplace provider policies\n\nThank you for your interest in becoming a Provider on the Databricks Marketplace (**\u201cMarketplace\u201d**). This Marketplace Policy (**\u201cPolicy\u201d**) applies to all Providers and their Listing(s) of Product(s) in the Marketplace. This Policy is part of your Databricks Data Partner Program Agreement (**\u201cAgreement\u201d**), as well as any other terms that reference this Policy. Capitalized terms not defined in this Policy have the meanings given elsewhere in the Agreement (or the terms to which these apply). For convenient reference: Your Agreement consists of this Policy, the Marketplace Documentation, the Databricks Data Partner Program Overview (**\u201cProgram Overview\u201d**), and the Terms, all as more fully defined in the Program Overview. 
\nProviders must comply at all times with the policies described and\/or referenced in this document in order to participate in the Marketplace. If a Provider fails to meet any of the enumerated policies at any given time, Databricks may, among other things, remove the Provider and\/or Listing(s) from the Marketplace. We may update this Policy from time to time as set forth in the Agreement.\n\n### Databricks Marketplace provider policies\n#### General Provider requirements\n\n* To publish a Listing for Products in the Marketplace, you must provide all information required by Databricks. You must: \n+ Clearly and accurately describe who you are. In other words, you may not impersonate or imply a relationship or affiliation that does not exist; and\n+ Only use trademarks that you own or are authorized to use.\n* You must have all necessary rights to sell or share the Product(s) you intend to offer.\n* Only one profile is allowed per Provider, unless otherwise approved in writing by Databricks.\n* Your profile must be accurate and kept up to date.\n* Manipulation of your position in the Marketplace is expressly forbidden.\n* All Providers are expected to promptly respond to all Consumers as follows: \n+ Within three business days to inquiries from prospective Consumers requesting access to your Product; and\n+ All other inquiries must be responded to within five business days.\n* Providers must deliver the Product as advertised in the Listing and provide Consumers notice when a change is made to the Listing that may affect them.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/provider-policies.html"} +{"content":"# What is Databricks Marketplace?\n### Databricks Marketplace provider policies\n#### Listing and Product requirements\n\n### Listings \n* Any Listing published in the Marketplace must provide all information required by Databricks including but not limited to: \n+ An accurate description of what is being offered including a brief and 
clear explanation of the Product(s);\n+ Your Provider\/Consumer Terms of use, such as those describing the license grant, any other pertinent terms and conditions, and restrictions (if any) applicable to the Product being offered;\n+ Accurate documentation;\n+ Update frequency;\n+ Whether a Product consists of or includes Data Assets (for example, notebooks or dashboards);\n+ A description of Personal Data contained in the Product, if applicable; and\n+ Keeping all links up-to-date.\n* Listings may not: \n+ Include Personal Data;\n+ Include, advertise, or promote illegal content;\n+ Encourage or imply illegal, threatening, or violent uses; or\n+ Advertise or promote anything other than the Product(s) in the Listing.\n* Products and Listings may not: \n+ Infringe any third party intellectual property rights;\n+ Contain or provide access to inappropriate, harmful, or offensive content; or\n+ Encourage Consumers to access or use the Product outside of the service or functionality set forth in the Listing unless otherwise agreed upon by Databricks.\n* Any material changes to your Listing may be reviewed by Databricks, and Databricks reserves the right to remove the Listing. \n### Products \n* For clarity - as stated in the Program Overview, if your Product includes Personal Data, you must be legally authorized to share such Personal Data. However, a Product must never include Sensitive Personal Data.\n* If you are offering a Dataset, you must update the Dataset at the most frequent logical refresh rate. 
Ideally, this is as close to real time as possible.\n* If your Product shares anonymized or aggregated data, you must structure the data in such a way as to ensure that it remains anonymous even when combined with other data.\n* In addition to the requirements in this Policy, your Product(s) must also comply with the descriptions and commitments made in your Provider\/Consumer Terms.\n* Any notebooks accompanying queries and\/or code content in your materials must work for all Consumers and produce the results advertised in your materials. \n### Data shared by Databricks with Providers \n* Any usage data we share with you about your Listing is confidential and only for your internal use. You may not use this data for benchmarking.\n* Any Personal Data we share with you may only be used in connection with marketing and selling your Product(s) in your Listing(s) in a manner consistent with your agreement with Databricks, your obligations to Consumers, and in accordance with your privacy policy as well as any other laws that may be applicable. \n### Delisting and discontinuing a Product \n* You may delist a Listing (cease to offer) at any time, such that new Consumers cannot discover the Listing; and\n* You may cease supplying a Product so long as you: \n+ Comply with the notice requirements set forth in the Provider\/Consumer Terms but in no event shall said notice be less than 30 days; and\n+ Allow any current Consumer(s) continued access to the Product for the remainder of the current term you granted under your Provider\/Consumer Terms with the Consumer(s).\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/provider-policies.html"} +{"content":"# Security and compliance guide\n### Auditing, privacy, and compliance\n\nDatabricks has put in place controls to meet the unique compliance needs of highly regulated industries. 
\nTo learn how you can use Delta Lake on Databricks to manage General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA) compliance for your data lake, see [GDPR and CCPA compliance with Delta Lake](https:\/\/docs.databricks.com\/security\/privacy\/gdpr-delta.html). \nFor more information about privacy and compliance and Databricks, see the [Databricks Security and Trust Center](https:\/\/databricks.com\/trust).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/index.html"} +{"content":"# Security and compliance guide\n### Auditing, privacy, and compliance\n#### Enhanced Security and Compliance add-on\n\nEnhanced Security and Compliance is a platform add-on that provides enhanced security and controls for your compliance needs. See the [pricing page](https:\/\/www.databricks.com\/product\/pricing\/platform-addons). The Enhanced Security and Compliance add-on includes: \n* **Compliance security profile**: The compliance security profile provides additional security monitoring, a hardened host OS image, enforced instance types for inter-node encryption, and enforced automatic cluster update to ensure your clusters run with the latest security updates. \nWhen you enable the compliance security profile, you make a selection for the compliance standards. The options available for compliance standards are: \n+ [HIPAA](https:\/\/docs.databricks.com\/security\/privacy\/hipaa.html)\n+ [Infosec Registered Assessors Program (IRAP)](https:\/\/docs.databricks.com\/security\/privacy\/irap.html)\n+ [PCI-DSS](https:\/\/docs.databricks.com\/security\/privacy\/pci.html)\n+ [FedRAMP High](https:\/\/docs.databricks.com\/security\/privacy\/gov-cloud.html)\n+ [FedRAMP Moderate](https:\/\/docs.databricks.com\/security\/privacy\/fedramp.html) \nYou can also choose to enable the compliance security profile for its enhanced security features without the need to conform to a compliance standard. 
See [Compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html). \n* **Enhanced security monitoring**: Enhanced security monitoring provides the workspace with an enhanced hardened host OS image and additional security monitoring agents to improve your threat detection capabilities. For more information, see [Enhanced security monitoring](https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-monitoring.html).\n* **Automatic cluster update**: Automatic cluster update ensures that all the clusters in a workspace are periodically updated to the latest host OS image and security updates. For more information, see [Automatic cluster update](https:\/\/docs.databricks.com\/admin\/clusters\/automatic-cluster-update.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/index.html"} +{"content":"# Security and compliance guide\n### Auditing, privacy, and compliance\n#### Databricks on AWS GovCloud\n\nDatabricks on AWS GovCloud provides the Databricks platform deployed in AWS GovCloud with compliance and security controls. See [Databricks on AWS GovCloud](https:\/\/docs.databricks.com\/security\/privacy\/gov-cloud.html).\n\n### Auditing, privacy, and compliance\n#### Audit logs\n\nDatabricks provides access to audit logs of activities performed by Databricks users, allowing you to monitor detailed usage patterns. 
See [Audit log reference](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/index.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Operational excellence for the data lakehouse\n##### Best practices for operational excellence\n\nThis article covers best practices of **operational excellence**, organized by architectural principles listed in the following sections.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Operational excellence for the data lakehouse\n##### Best practices for operational excellence\n###### 1. Optimize build and release processes\n\n### Create a dedicated operations team for the lakehouse \nIt is a general best practice to have a platform operations team to enable data teams to work on one or more data platforms. This team is responsible for coming up with blueprints and best practices internally. They provide tooling - for example, for [infrastructure automation](https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/best-practices.html#use-infrastructure-as-code-for-deployments-and-maintenance) and self-service access - and ensure that security and compliance needs are met. This way, the burden of securing platform data is on a central team, so that distributed teams can focus on working with data and producing new insights. \n### Use Databricks Git folders to store code in Git \nDatabricks Git folders allow users to store notebooks or other files in a Git repository, providing features like cloning a repository, committing and pushing, pulling, branch management and viewing file diffs. 
Use Git folders for better code visibility and tracking. See [Git integration with Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html). \n### Standardize DevOps processes (CI\/CD) \nContinuous integration and continuous delivery (CI\/CD) refer to developing and delivering software in short, frequent cycles using automation pipelines. While this is by no means a new process, having been ubiquitous in traditional software engineering for decades, it is becoming an increasingly necessary process for data engineering and data science teams. For data products to be valuable, they must be delivered in a timely way. Additionally, consumers must have confidence in the validity of outcomes within these products. By automating the building, testing, and deployment of code, development teams can deliver releases more frequently and reliably than manual processes still prevalent across many data engineering and data science teams. See [What is CI\/CD on Databricks?](https:\/\/docs.databricks.com\/dev-tools\/index-ci-cd.html). \nFor more information about best practices for code development using Databricks Git folders, see [CI\/CD techniques with Git and Databricks Git folders (Repos)](https:\/\/docs.databricks.com\/repos\/ci-cd-techniques-with-repos.html). This, together with the Databricks REST API, allows you to build automated deployment processes with GitHub Actions, Azure DevOps pipelines, or Jenkins jobs. \n### Standardize in MLOps processes \nBuilding and deploying ML models is complex. There are many options available to achieve this, but little in the way of well-defined standards. As a result, over the past few years, we have seen the emergence of machine learning operations (MLOps). MLOps is a set of processes and automation for managing models, data, and code to improve performance stability and long-term efficiency in ML systems. 
It covers data preparation, exploratory data analysis (EDA), feature engineering, model training, model validation, deployment, and monitoring. See [MLOps workflows on Databricks](https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html). \n* **Always keep your business goals in mind:** Just as the core purpose of ML in a business is to enable data-driven decisions and products, the core purpose of MLOps is to ensure that those data-driven applications remain stable, are kept up to date and continue to have positive impacts on the business. When prioritizing technical work on MLOps, consider the business impact: Does it enable new business use cases? Does it improve data teams\u2019 productivity? Does it reduce operational costs or risks?\n* **Manage ML models with a specialized but open tool:** It is recommended to track and manage ML models with MLflow, which has been designed with the ML model lifecycle in mind. See [ML lifecycle management using MLflow](https:\/\/docs.databricks.com\/mlflow\/index.html).\n* **Implement MLOps in a modular fashion:** As with any software application, code quality is paramount for an ML application. Modularized code enables testing of individual components and mitigates difficulties with future code refactoring. Define clear steps (like training, evaluation, or deployment), super steps (like training-to-deployment pipeline), and responsibilities to clarify the modular structure of your ML application. \nThis is described in detail in the Databricks [MLOps whitepaper](https:\/\/www.databricks.com\/p\/ebook\/the-big-book-of-mlops).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Operational excellence for the data lakehouse\n##### Best practices for operational excellence\n###### 2. 
Automate deployments and workloads\n\n### Use Infrastructure as Code for deployments and maintenance \nHashiCorp Terraform is a popular open source tool for creating safe and predictable cloud infrastructure across several cloud providers. The [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) manages Databricks workspaces and the associated cloud infrastructure using a flexible, powerful tool. The goal of the Databricks Terraform provider is to support all Databricks REST APIs, supporting automation of the most complicated aspects of deploying and managing your data platforms. The Databricks Terraform provider is the recommended tool to deploy and manage clusters and jobs reliably, provision Databricks workspaces, and configure data access. \n### Use cluster policies \nDatabricks workspace admins can control many aspects of the clusters that are spun up, including available instance types, Databricks versions, and the size of instances by using cluster policies. Workspace admins can enforce some Spark configuration settings, and they can configure multiple cluster policies, allowing certain groups of users to create small clusters or single-user clusters, some groups of users to create large clusters and other groups only to use existing clusters. See [Create and manage compute policies](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html). \n### Use automated workflows for jobs \n* **Workflows with jobs (internal orchestration):** \nWe recommend using workflows with jobs to schedule data processing and data analysis tasks on Databricks clusters with scalable resources. Jobs can consist of a single task or a large, multitask workflow with complex dependencies. Databricks manages task orchestration, cluster management, monitoring, and error reporting for all your jobs. You can run your jobs immediately or periodically through an easy-to-use scheduling system. 
You can implement job tasks using notebooks, JARs, Delta Live Tables pipelines, or Python, Scala, Spark submit, and Java applications. See [Introduction to Databricks Workflows](https:\/\/docs.databricks.com\/workflows\/index.html). \n* **External orchestrators:** \nThe comprehensive Databricks REST API is used by external orchestrators to orchestrate Databricks assets, notebooks, and jobs. See [Apache Airflow](https:\/\/airflow.apache.org\/docs\/apache-airflow-providers-databricks\/stable\/index.html). \n### Use Auto Loader \n[Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html) incrementally and efficiently processes new data files as they arrive in cloud storage. It can ingest many file formats like JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE. With an input folder on the cloud storage, Auto Loader automatically processes new files as they arrive. \nFor one-off ingestions, consider using the command COPY INTO instead. See [Get started using COPY INTO to load data](https:\/\/docs.databricks.com\/ingestion\/copy-into\/index.html). \n### Use Delta Live Tables \nDelta Live Tables is a declarative framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations to perform on your data and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling. \nWith Delta Live Tables, you can easily define end-to-end data pipelines in SQL or Python: Specify the data source, the transformation logic, and the destination state of the data. Delta Live Tables maintains dependencies and automatically determines the infrastructure to run the job in. \nTo manage data quality, Delta Live Tables monitors data quality trends over time, preventing bad data from flowing into tables through validation and integrity checks with predefined error policies. See [What is Delta Live Tables?](https:\/\/docs.databricks.com\/delta-live-tables\/index.html). 
\n### Follow the deploy-code approach for ML workloads \nThe deploy-code approach follows these steps: \n* Training environment: Develop training code and ancillary code. Then promote the code to staging.\n* Staging environment: Train model on data subset and test ancillary code. Then promote the code to production.\n* Production environment: Train model on prod data and test model. Then deploy the model and ancillary pipelines. \nSee [Model deployment patterns](https:\/\/docs.databricks.com\/machine-learning\/mlops\/deployment-patterns.html). \nThe main advantages of this model are: \n* This fits traditional software engineering workflows, using familiar tools like Git and CI\/CD systems.\n* Supports automated retraining in a locked-down environment.\n* Only the production environment needs read access to prod training data.\n* Full control over the training environment, which helps to simplify reproducibility.\n* It enables the data science team to use modular code and iterative testing, which helps with coordination and development in larger projects. \nThis is described in detail in the [MLOps whitepaper](https:\/\/www.databricks.com\/p\/ebook\/the-big-book-of-mlops). \n### Use a model registry to decouple code and model lifecycle \nSince model lifecycles do not correspond one-to-one with code lifecycles, it makes sense for model management to have its own service. MLflow and its Model Registry support managing model artifacts directly via UI and APIs. The loose coupling of model artifacts and code provides flexibility to update production models without code changes, streamlining the deployment process in many cases. Model artifacts are secured using MLflow access controls or cloud storage permissions. See [Manage model lifecycle in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html). 
\n### Use MLflow Autologging \n[Databricks Autologging](https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html) is a no-code solution that extends MLflow automatic logging to deliver automatic experiment tracking for machine learning training sessions on Databricks. Databricks Autologging automatically captures model parameters, metrics, files and lineage information when you train models with training runs recorded as MLflow tracking runs. \n### Reuse the same infrastructure to manage ML pipelines \nML pipelines should be automated using many of the same techniques as other data pipelines. Use [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) to automate deployment. ML requires deploying infrastructure such as inference jobs, serving endpoints, and featurization jobs. All ML pipelines can be automated as [Workflows with Jobs](https:\/\/docs.databricks.com\/workflows\/index.html), and many data-centric ML pipelines can use the more specialized [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html) to ingest images and other data and [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html) to compute features or to monitor metrics.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Operational excellence for the data lakehouse\n##### Best practices for operational excellence\n###### 3. Set up monitoring, alerting, and logging\n\n### Platform monitoring using CloudWatch \nIntegrating Databricks with [CloudWatch](https:\/\/docs.aws.amazon.com\/AmazonCloudWatch\/latest\/monitoring\/WhatIsCloudWatch.html) enables metrics derived from logs and alerts. 
[CloudWatch Application Insights](https:\/\/aws.amazon.com\/about-aws\/whats-new\/2020\/11\/amazon-cloudwatch-application-insights-adds-automatic-applications\/) can help you automatically discover the fields contained in the logs, and [CloudWatch Logs Insights](https:\/\/docs.aws.amazon.com\/AmazonCloudWatch\/latest\/logs\/AnalyzingLogData.html) provides a purpose-built query language for faster debugging and analysis. \nSee [How to Monitor Databricks with Amazon CloudWatch](https:\/\/aws.amazon.com\/blogs\/mt\/how-to-monitor-databricks-with-amazon-cloudwatch\/). \n### Cluster monitoring via Ganglia \nTo help monitor clusters, Databricks provides access to Ganglia metrics from the cluster details page, and these include GPU metrics. See [Ganglia metrics](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#metrics-ganglia). \n### SQL warehouse monitoring \nMonitoring the SQL warehouse is essential to understand the load profile over time and to manage the SQL warehouse efficiently. With [SQL warehouse monitoring](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/monitor.html), you can view information, such as the number of queries handled by the warehouse or the number of clusters allocated to the warehouse. \n### Auto Loader monitoring \nAuto Loader provides a SQL API for inspecting the state of a stream. With SQL functions, you can find metadata about files that have been discovered by an Auto Loader stream. See [Monitoring Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/production.html#monitoring). \nWith Apache Spark [Streaming Query Listener](https:\/\/docs.databricks.com\/structured-streaming\/stream-monitoring.html) interface, Auto Loader streams can be further monitored. \n### Delta Live Tables monitoring \nAn event log is created and maintained for every Delta Live Tables pipeline. 
The event log contains all information related to the pipeline, including audit logs, data quality checks, pipeline progress, and data lineage. You can use the event log to track, understand, and monitor the state of your data pipelines. See [Monitor Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html). \n### Streaming monitoring \nStreaming is one of the most important data processing techniques for ingestion and analysis. It provides users and developers with low latency and real-time data processing capabilities for analytics and triggering actions. The Databricks Data Intelligence Platform allows you to easily monitor Structured Streaming queries. See [Monitoring Structured Streaming queries on Databricks](https:\/\/docs.databricks.com\/structured-streaming\/stream-monitoring.html). \nAdditional information can be found in the dedicated UI with real-time metrics and statistics. For more information, see [A look at the new Structured Streaming UI in Apache Spark 3.0](https:\/\/databricks.com\/blog\/2020\/07\/29\/a-look-at-the-new-structured-streaming-ui-in-apache-spark-3-0.html). \n### Security monitoring \nSee [Security, compliance & privacy - Security Monitoring](https:\/\/docs.databricks.com\/lakehouse-architecture\/security-compliance-and-privacy\/best-practices.html#6-monitor-system-security). \n### Cost monitoring \nSee [Cost Optimization - Monitor and control cost](https:\/\/docs.databricks.com\/lakehouse-architecture\/cost-optimization\/best-practices.html#3-monitor-and-control-cost).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Operational excellence for the data lakehouse\n##### Best practices for operational excellence\n###### 4. 
Manage capacity and quotas\n\n### Manage service limits and quotas \nEvery service launched on a cloud will have to take limits into account, such as access rate limits, number of instances, number of users, and memory requirements. For your cloud provider, check [the cloud limits](https:\/\/docs.aws.amazon.com\/general\/latest\/gr\/aws_service_limits.html). Before designing a solution, these limits must be understood. \nSpecifically, for the Databricks platform, there are different types of limits: \n**Databricks platform limits:** These are specific limits for Databricks resources. The limits for the overall platform are documented in [Resource limits](https:\/\/docs.databricks.com\/resources\/limits.html). \n**Unity Catalog limits:** [Unity Catalog Resource Quotas](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#quotas) \n**Subscription\/account quotas:** Databricks leverages cloud resources for its service. For example, workloads on Databricks run on clusters, for which the Databricks platform starts the cloud provider\u2019s virtual machines (VMs). Cloud providers set default quotas on how many VMs can be started at the same time. Depending on the need, these quotas might need to be adjusted. \nFor further details, see [Amazon EC2 service quotas](https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/ec2-resource-limits.html). \nIn a similar way, storage, network, and other cloud services have limitations that must be understood and factored in. \n### Invest in capacity planning \nPlan for fluctuation in the expected load that can occur for several reasons like sudden business changes or even world events. Test variations of load, including unexpected ones, to ensure that your workloads can scale. Ensure all regions can adequately scale to support the total load if a region fails. Take the following into consideration: \n* Technology and service limits and limitations of the cloud. 
See [Manage capacity and quota](https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/best-practices.html#4-manage-capacity-and-quotas).\n* SLAs when determining the services to use in the design.\n* Cost analysis to determine how much improvement will be realized in the application if costs are increased. Evaluate if the price is worth the investment.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/best-practices.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Diagnosing a long stage in Spark\n\nStart by identifying the longest stage of the job. Scroll to the bottom of the job\u2019s page to the list of stages and order them by duration: \n![Long Stage](https:\/\/docs.databricks.com\/_images\/long-stage.png)\n\n##### Diagnosing a long stage in Spark\n###### Stage I\/O details\n\nTo see high-level data about what this stage was doing, look at the **Input**, **Output**, **Shuffle Read**, and **Shuffle Write** columns: \n![Long Stage I\/O](https:\/\/docs.databricks.com\/_images\/long-stage-io.jpeg) \nThe columns mean the following: \n* **Input:** How much data this stage read from storage. This could be reading from Delta, Parquet, CSV, etc.\n* **Output:** How much data this stage wrote to storage. This could be writing to Delta, Parquet, CSV, etc.\n* **Shuffle Read:** How much shuffle data this stage read.\n* **Shuffle Write:** How much shuffle data this stage wrote. \nIf you\u2019re not familiar with what shuffle is, now is a good time to [learn](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#data-shuffling) what that means. \nMake note of these numbers as you\u2019ll likely need them later.\n\n##### Diagnosing a long stage in Spark\n###### Number of tasks\n\nThe number of tasks in the long stage can point you in the direction of your issue. 
You can determine the number of tasks by looking here: \n![Determining number of tasks](https:\/\/docs.databricks.com\/_images\/long-stage-tasks.jpeg) \nIf you see one task, that could be a sign of a problem. For more information, see [One Spark task](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/one-spark-task.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/long-spark-stage.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Diagnosing a long stage in Spark\n###### View more stage details\n\nIf the stage has more than one task, you should investigate further. Click on the link in the stage\u2019s description to get more info about the longest stage: \n![Open Stage Info](https:\/\/docs.databricks.com\/_images\/long-stage-description.png) \nNow that you\u2019re in the stage\u2019s page, see [Skew and spill](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/long-spark-stage-page.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/long-spark-stage.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Run jobs on a schedule or continuously\n\nYou can run your Databricks job periodically with the *Scheduled* trigger type or ensure there\u2019s always an active run of the job with the *Continuous* trigger type. \nYou can use a schedule to automatically run your Databricks job at specified times and periods. You can define a schedule to run your job on minute, hourly, daily, weekly, or monthly periods and at specified times. You can also specify a time zone for your schedule and pause a scheduled job at any time. \nWhen you run your job with the continuous trigger, Databricks Jobs ensures there is always one active run of the job. 
A new job run starts after the previous run completes successfully or with a failed status, or if there is no instance of the job currently running.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/schedule-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Run jobs on a schedule or continuously\n##### Add a job schedule\n\nTo define a schedule for the job: \n1. In the sidebar, click **Workflows**.\n2. In the **Name** column on the **Jobs** tab, click the job name.\n3. Click **Add trigger** in the **Job details** panel and select **Scheduled** in **Trigger type**.\n4. Specify the period, starting time, and time zone. Optionally select the **Show Cron Syntax** checkbox to display and edit the schedule in [Quartz Cron Syntax](http:\/\/www.quartz-scheduler.org\/documentation\/quartz-2.3.0\/tutorials\/crontrigger.html).\n5. Click **Save**. \nYou can also schedule a notebook job directly in the [notebook UI](https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html). \nNote \n* Databricks enforces a minimum interval of 10 seconds between subsequent runs triggered by the schedule of a job regardless of the seconds configuration in the cron expression.\n* You can choose a time zone that observes daylight saving time or UTC. If you select a time zone that observes daylight saving time, an hourly job will be skipped or might appear to be delayed by an hour or two [when daylight saving time begins or ends](https:\/\/www.quartz-scheduler.net\/documentation\/faq.html#daylight-saving-time-and-triggers). To run at every hour (absolute time), choose UTC.\n* The job scheduler is not intended for low-latency jobs. Due to network or cloud issues, job runs might occasionally be delayed up to several minutes. 
In these situations, scheduled jobs run immediately upon service availability.\n\n#### Run jobs on a schedule or continuously\n##### Pause and resume a job schedule\n\nTo pause a job, click **Pause** in the **Job details** panel. \nTo resume a paused job schedule, click **Resume**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/schedule-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Run jobs on a schedule or continuously\n##### Run a continuous job\n\n1. In the sidebar, click **Workflows**.\n2. In the **Name** column on the **Jobs** tab, click the job name.\n3. Click **Add trigger** in the **Job details** panel, select **Continuous** in **Trigger type**, and click **Save**. \nTo stop a continuous job, click ![Blue Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret-blue.png) next to **Run Now** and click **Stop**. \nNote \n* There can be only one running instance of a continuous job.\n* There is a small delay between a run finishing and a new run starting. This delay should be less than 60 seconds.\n* You cannot use [task dependencies](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#task-edit-dependencies) with a continuous job.\n* You cannot use [retry policies](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#retry-policies) with a continuous job. Instead, continuous jobs use [exponential backoff](https:\/\/docs.databricks.com\/workflows\/jobs\/schedule-jobs.html#exponential-backoff) to manage job run failures.\n* Selecting **Run now** on a continuous job that is paused triggers a new job run. If the job is unpaused, an exception is thrown.\n* To have your continuous job pick up a new job configuration, cancel the existing run and then a new run automatically starts. 
You can also click **Restart run** to restart the job run with the updated configuration.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/schedule-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Run jobs on a schedule or continuously\n##### How are failures handled for continuous jobs?\n\nDatabricks Jobs uses an *exponential backoff* scheme to manage continuous jobs with multiple consecutive failures. Exponential backoff allows continuous jobs to run without pausing and return to a healthy state when recoverable failures occur. \nWhen a continuous job exceeds the allowable threshold for consecutive failures, the following describes how subsequent job runs are managed: \n1. The job is restarted after a retry period set by the system.\n2. If the next job run fails, the retry period is increased, and the job is restarted after this new retry period. \n1. For each subsequent job run failure, the retry period is increased again, up to a maximum retry period set by the system. After reaching the maximum retry period, the job continues to be retried using the maximum retry period. There is no limit on the number of retries for a continuous job.\n2. If the job run completes successfully and starts a new run, or if the run exceeds a threshold without failure, the job is considered healthy, and the backoff sequence resets. 
\nYou can restart a continuous job in the exponential backoff state in the [Jobs UI](https:\/\/docs.databricks.com\/workflows\/jobs\/repair-job-failures.html#continuous-job-failures) or by passing the job ID to the [POST \/api\/2.1\/jobs\/run-now](https:\/\/docs.databricks.com\/api\/workspace\/jobs\/runnow) request in the Jobs 2.1 API or the [POST \/api\/2.0\/jobs\/run-now](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-2.0-api.html#jobsjobsservicerunnow) request in the Jobs 2.0 API.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/schedule-jobs.html"} +{"content":"# Transform data\n### Work with joins on Databricks\n\nDatabricks supports ANSI standard join syntax. This article describes differences between joins with batch and stream processing and provides some recommendations for optimizing join performance. \nNote \nDatabricks also supports standard syntax for the set operators `UNION`, `INTERSECT`, and `EXCEPT`. See [Set operators](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-qry-select-setops.html).\n\n### Work with joins on Databricks\n#### Differences between streaming and batch joins\n\nJoins on Databricks are either stateful or stateless. \nAll batch joins are stateless joins. Results process immediately and reflect data at the time the query runs. Each time the query executes, new results are calculated based on the specified source data. See [Batch joins](https:\/\/docs.databricks.com\/transform\/join.html#batch). \nJoins between two streaming data sources are stateful. In stateful joins, Databricks tracks information about the data sources and the results and iteratively updates the results. Stateful joins can provide powerful solutions for online data processing, but can be difficult to implement effectively. They have complex operational semantics depending on the output mode, trigger interval, and watermark. See [Stream-stream joins](https:\/\/docs.databricks.com\/transform\/join.html#stream-stream). 
\nStream-static joins are stateless, but provide a good option for joining an incremental data source (such as a facts table) with a static data source (such as a slowly-changing dimensional table). Rather than joining all records from both sides each time a query executes, only newly received records from the streaming source are joined with the current version of the static table. See [Stream-static joins](https:\/\/docs.databricks.com\/transform\/join.html#stream-static).\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/join.html"} +{"content":"# Transform data\n### Work with joins on Databricks\n#### Batch joins\n\nDatabricks supports standard SQL join syntax, including inner, outer, semi, anti, and cross joins. See [JOIN](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-qry-select-join.html). \nNote \nDatabricks recommends using a materialized view to optimize incremental computation of the results of an inner join. See [Use materialized views in Databricks SQL](https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html).\n\n### Work with joins on Databricks\n#### Stream-stream joins\n\nJoining two streaming data sources can present significant challenges in managing state information and reasoning about results computation and output. Before implementing a stream-stream join, Databricks recommends developing a strong understanding of the operational semantics for stateful streaming, including how watermarks impact state management. See the following articles: \n* [Optimize stateful Structured Streaming queries](https:\/\/docs.databricks.com\/structured-streaming\/stateful-streaming.html)\n* [Apply watermarks to control data processing thresholds](https:\/\/docs.databricks.com\/structured-streaming\/watermarks.html)\n* [Stream-Stream joins](https:\/\/docs.databricks.com\/structured-streaming\/examples.html#stream-stream-joins) \nDatabricks recommends specifying watermarks for both sides of all stream-stream joins. 
The following join types are supported: \n* Inner joins\n* Left outer joins\n* Right outer joins\n* Full outer joins\n* Left semi joins \nSee the Apache Spark Structured Streaming documentation on [stream-stream joins](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html#stream-stream-joins).\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/join.html"} +{"content":"# Transform data\n### Work with joins on Databricks\n#### Stream-static joins\n\nNote \nThe described behavior for stream-static joins assumes that the static data is stored using Delta Lake. \nA stream-static join joins the latest valid version of a Delta table (the static data) to a data stream using a stateless join. \nWhen Databricks processes a micro-batch of data in a stream-static join, the latest valid version of data from the static Delta table joins with the records present in the current micro-batch. Because the join is stateless, you do not need to configure watermarking and can process results with low latency. The data in the static Delta table used in the join should be slowly-changing. \nThe following example demonstrates this pattern: \n```\nstreamingDF = spark.readStream.table(\"orders\")\nstaticDF = spark.read.table(\"customers\")\n\nquery = (streamingDF\n.join(staticDF, streamingDF.customer_id==staticDF.id, \"inner\")\n.writeStream\n.option(\"checkpointLocation\", checkpoint_path)\n.table(\"orders_with_customer_info\")\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/join.html"} +{"content":"# Transform data\n### Work with joins on Databricks\n#### Optimize join performance\n\nCompute with Photon enabled always selects the best join type. See [What is Photon?](https:\/\/docs.databricks.com\/compute\/photon.html). \nUsing a recent Databricks Runtime version with Photon enabled generally provides good join performance, but you should also consider the following recommendations: \n* Cross joins are very expensive. 
Remove cross joins from workloads and queries that require low latency or frequent recomputation.\n* Join order matters. When performing multiple joins, always join your smallest tables first and then join the result with larger tables.\n* The optimizer can struggle on queries with many joins and aggregations. Saving out intermediate results can accelerate query planning and computing results.\n* Keep fresh statistics to improve performance. Run the query `ANALYZE TABLE table_name COMPUTE STATISTICS` to update statistics in the query planner. \nNote \nIn Databricks Runtime 14.3 LTS and above, you can modify the columns that Delta Lake collects stats on for data skipping and then recompute existing statistics in the Delta log. See [Specify Delta statistics columns](https:\/\/docs.databricks.com\/delta\/data-skipping.html#stats-cols).\n\n### Work with joins on Databricks\n#### Join hints on Databricks\n\nApache Spark supports specifying join hints for range joins and skew joins. Hints for skew joins are not necessary as Databricks automatically optimizes these joins. See [Hints](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-qry-select-hints.html) \nHints for range joins can be useful if join performance is poor and you are performing inequality joins. Examples include joining on timestamp ranges or a range of clustering IDs. 
See [Range join optimization](https:\/\/docs.databricks.com\/optimizations\/range-join.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/join.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Cassandra\n\nThe following notebook shows how to connect Cassandra with Databricks.\n\n#### Cassandra\n##### Connect to Cassandra and manage ambiguous column in DataFrame notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/cassandra.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/cassandra.html"} +{"content":"# \n### Create versions of your RAG application to iterate on the app\u2019s quality\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html) \nThis tutorial walks you through the process of creating a new `Version`. You create `Version` in order to experiment with various settings that can improve your app\u2019s quality. \nYou can modify one or both of the RAG Application\u2019s configuration and code when creating a new version. \nYou can modify one, any, or all of the following components when creating a new `Version`. 
\n* Create a [version](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-data-processor.html) of the `\ud83d\uddc3\ufe0f Data Processor`\n* Create a [version](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-retriever.html) of the `\ud83d\udd0d Retriever`\n* Create a [version](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-chain.html) of the `\ud83d\udd17 Chain` \nOnce you have modified the `Version`, you [deploy](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1c-deploy-version.html) the new version.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-app-versions.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create and work with volumes\n\nThis article introduces volumes, which are Unity Catalog objects that enable governance over non-tabular datasets. It also describes how to create, manage, and work with volumes. \nFor details on uploading and managing files in volumes, see [Upload files to a Unity Catalog volume](https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-to-volume.html) and [File management operations for Unity Catalog volumes](https:\/\/docs.databricks.com\/catalog-explorer\/manage-volumes.html). \nNote \nWhen you work with volumes, you must use a SQL warehouse or a cluster running Databricks Runtime 13.3 LTS or above, unless you are using Databricks UIs such as Catalog Explorer.\n\n#### Create and work with volumes\n##### What are Unity Catalog volumes?\n\nVolumes are Unity Catalog objects that represent a logical volume of storage in a cloud object storage location. Volumes provide capabilities for accessing, storing, governing, and organizing files. While tables provide governance over tabular datasets, volumes add governance over non-tabular datasets. You can use volumes to store and access files in any format, including structured, semi-structured, and unstructured data. 
\nImportant \nYou cannot use volumes as a location for tables. Volumes are intended for path-based data access only. Use tables for storing tabular data with Unity Catalog.\n\n#### Create and work with volumes\n##### What is a managed volume?\n\nA **managed volume** is a Unity Catalog-governed storage volume created within the managed storage location of the containing schema. Managed volumes allow the creation of governed storage for working with files without the overhead of external locations and storage credentials. You do not need to specify a location when creating a managed volume, and all file access for data in managed volumes is through paths managed by Unity Catalog. See [What path is used for accessing files in a volume?](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html#path). \nWhen you delete a managed volume, the files stored in this volume are also deleted from your cloud tenant within 30 days.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create and work with volumes\n##### What is an external volume?\n\nAn **external volume** is a Unity Catalog-governed storage volume registered against a directory within an external location using Unity Catalog-governed storage credentials. External volumes allow you to add Unity Catalog data governance to existing cloud object storage directories. Some use cases for external volumes include the following: \n* Adding governance to data files without migration.\n* Governing files produced by other systems that must be ingested or accessed by Databricks.\n* Governing data produced by Databricks that must be accessed directly from cloud object storage by other systems. \nExternal volumes **must** be directories within external locations governed by Unity Catalog storage credentials. 
Unity Catalog does not manage the lifecycle or layout of the files in external volumes. When you drop an external volume, Unity Catalog does not delete the underlying data. \nNote \nWhen you define a volume, cloud URI access to data under the volume path is governed by the permissions of the volume.\n\n#### Create and work with volumes\n##### What path is used for accessing files in a volume?\n\nThe path to access volumes is the same whether you use Apache Spark, SQL, Python, or other languages and libraries. This differs from legacy access patterns for files in object storage bound to a Databricks workspace. \nThe path to access files in volumes uses the following format: \n```\n\/Volumes\/<catalog>\/<schema>\/<volume>\/<path>\/<file-name>\n\n``` \nDatabricks also supports an optional `dbfs:\/` scheme when working with Apache Spark, so the following path also works: \n```\ndbfs:\/Volumes\/<catalog>\/<schema>\/<volume>\/<path>\/<file-name>\n\n``` \nThe sequence `\/<catalog>\/<schema>\/<volume>` in the path corresponds to the three Unity Catalog object names associated with the file. These path elements are read-only and not directly writeable by users, meaning it is not possible to create or delete these directories using filesystem operations. They are automatically managed and kept in sync with the corresponding UC entities. \nNote \nYou can also access data in external volumes using cloud storage URIs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create and work with volumes\n##### What are the privileges for volumes?\n\nVolumes use the same basic privilege model as tables, but where privileges for tables focus on granting access to querying and manipulating rows in a table, privileges for volumes focus on working with files. 
As such, volumes introduce the following privileges: \n* [CREATE VOLUME](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html#create-volume)\n* [CREATE EXTERNAL VOLUME](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html#create-external-volume)\n* [READ VOLUME](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html#read-volume)\n* [WRITE VOLUME](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html#write-volume) \nSee [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html).\n\n#### Create and work with volumes\n##### Who can manage volume privileges?\n\nYou must have owner privileges on a volume to manage volume privileges or drop volumes. Each object in Unity Catalog can only have one principal assigned as an owner, and while ownership does not cascade (that is, the owner of a catalog does not automatically become the owner of all objects in that catalog), the privileges associated with ownership apply to all objects contained within an object. \nThis means that for Unity Catalog volumes, the following principals can manage volume privileges: \n* The owner of the parent catalog.\n* The owner of the parent schema.\n* The owner of the volume. \nWhile each object can only have a single owner, Databricks recommends assigning ownership for most objects to a group rather than an individual user. Initial ownership for any object is assigned to the user that creates that object. 
See [Manage Unity Catalog object ownership](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/ownership.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create and work with volumes\n##### Create a managed volume\n\nYou must have the following permissions to create a managed volume: \n| Resource | Permissions required |\n| --- | --- |\n| Schema | `USE SCHEMA`, `CREATE VOLUME` |\n| Catalog | `USE CATALOG` | \nTo create a managed volume, use the following syntax: \n```\nCREATE VOLUME <catalog>.<schema>.<volume-name>;\n\n``` \nTo create a managed volume in Catalog Explorer: \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Search or browse for the schema that you want to add the volume to and select it.\n3. Click the **Create Volume** button. (You must have sufficient privileges.)\n4. Enter a name for the volume.\n5. Provide a comment (optional).\n6. Click **Create**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create and work with volumes\n##### Create an external volume\n\nYou must have the following permissions to create an external volume: \n| Resource | Permissions required |\n| --- | --- |\n| External location | `CREATE EXTERNAL VOLUME` |\n| Schema | `USE SCHEMA`, `CREATE VOLUME` |\n| Catalog | `USE CATALOG` | \nTo create an external volume, specify a path within an external location using the following syntax: \n```\nCREATE EXTERNAL VOLUME <catalog>.<schema>.<external-volume-name>\nLOCATION 's3:\/\/<external-location-bucket-path>\/<directory>';\n\n``` \nTo create an external volume in Catalog Explorer: \n1. 
In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Search or browse for the schema that you want to add the volume to and select it.\n3. Click the **Create Volume** button. (You must have sufficient privileges.)\n4. Enter a name for the volume.\n5. Choose an external location in which to create the volume.\n6. Edit the path to reflect the sub-directory where you want to create the volume.\n7. Provide a comment (optional).\n8. Click **Create**.\n\n#### Create and work with volumes\n##### Drop a volume\n\nOnly users with owner privileges can drop a volume. See [Who can manage volume privileges?](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html#owner). \nUse the following syntax to drop a volume: \n```\nDROP VOLUME IF EXISTS <volume-name>;\n\n``` \nWhen you drop a managed volume, Databricks deletes the underlying data within 30 days. When you drop an external volume, you remove the volume from Unity Catalog but the underlying data remains unchanged in the external location.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create and work with volumes\n##### Read files in a volume\n\nYou must have the following permissions to view the contents of a volume or access files that are stored on volumes: \n| Resource | Permissions required |\n| --- | --- |\n| Volume | `READ` |\n| Schema | `USE SCHEMA` |\n| Catalog | `USE CATALOG` | \nYou interact with the contents of volumes using paths. 
See [What path is used for accessing files in a volume?](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html#path).\n\n#### Create and work with volumes\n##### Create, delete, and perform other file management operations on a volume\n\nYou must have the following permissions to perform file management operations on files that are stored on volumes: \n| Resource | Permissions required |\n| --- | --- |\n| Volume | `READ`, `WRITE` |\n| Schema | `USE SCHEMA` |\n| Catalog | `USE CATALOG` | \nYou can perform file management operations on volumes with the following tools: \n* Catalog Explorer provides many UI options for file management tasks. See [What is Catalog Explorer?](https:\/\/docs.databricks.com\/catalog-explorer\/index.html).\n* Databricks utilities `fs` commands. See [File system utility (dbutils.fs)](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#dbutils-fs).\n* The `%fs` magic command provides the same functionality as `dbutils.fs`.\n* Upload files to volume UI. See [Upload files to a Unity Catalog volume](https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-to-volume.html).\n* Open source commands such as `os.listdir()`.\n* Some bash commands are supported. `%sh cp` is supported, but `%sh mv` is not. \nFor full details on programmatically interacting with files on volumes, see [Work with files in Unity Catalog volumes](https:\/\/docs.databricks.com\/files\/index.html#volumes).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create and work with volumes\n##### Example notebook: Create and work with volumes\n\nThe following notebook demonstrates the basic SQL syntax to create and interact with Unity Catalog volumes. 
\n### Tutorial: Unity Catalog volumes notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/unity-catalog-volumes.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n#### Create and work with volumes\n##### Reserved paths for volumes\n\nVolumes introduce the following reserved paths used for accessing volumes: \n* `dbfs:\/Volumes`\n* `\/Volumes` \nNote \nPaths are also reserved for potential typos for these paths from Apache Spark APIs and `dbutils`, including `\/volumes`, `\/Volume`, `\/volume`, whether or not they are preceded by `dbfs:\/`. The path `\/dbfs\/Volumes` is also reserved, but cannot be used to access volumes. \nVolumes are only supported on Databricks Runtime 13.3 LTS and above. In Databricks Runtime 12.2 LTS and below, operations against `\/Volumes` paths might succeed, but can write data to ephemeral storage disks attached to compute clusters rather than persisting data to Unity Catalog volumes as expected. \nImportant \nIf you have pre-existing data stored in a reserved path on the DBFS root, you can file a support ticket to gain temporary access to this data to move it to another location.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create and work with volumes\n##### Limitations\n\nYou must use Unity Catalog-enabled compute to interact with Unity Catalog volumes. Volumes do not support all workloads. \nNote \nVolumes do not support `dbutils.fs` commands distributed to executors. \nThe following limitations apply: \nIn Databricks Runtime 14.3 LTS and above: \n* On single-user clusters, you cannot access volumes from threads and subprocesses in Scala. \nIn Databricks Runtime 14.2 and below: \n* On compute configured with shared access mode, you can\u2019t use UDFs to access volumes. 
\n+ Both Python and Scala have access to FUSE from the driver but not from executors.\n+ Scala code that performs I\/O operations can run on the driver but not the executors.\n* On compute configured with single user access mode, there is no support for FUSE in Scala, Scala IO code accessing data using volume paths, or Scala UDFs. Python UDFs are supported in single user access mode. \nOn all supported Databricks Runtime versions: \n* Unity Catalog UDFs do not support accessing volume file paths.\n* You cannot access volumes from RDDs.\n* You cannot use spark-submit with JARs stored in a volume.\n* You cannot define dependencies on other libraries accessed via volume paths inside a Wheel or JAR file.\n* You cannot list Unity Catalog objects using the `\/Volumes\/<catalog-name>` or `\/Volumes\/<catalog-name>\/<schema-name>` patterns. You must use a fully-qualified path that includes a volume name.\n* The DBFS endpoint for the REST API does not support volume paths.\n* Volumes are excluded from global search results in the Databricks workspace.\n* You cannot specify volumes as the destination for cluster log delivery.\n* `%sh mv` is not supported for moving files between volumes. 
Use `dbutils.fs.mv` or `%sh cp` instead.\n* You cannot create a custom Hadoop file system with volumes, meaning the following is not supported: \n```\nimport org.apache.hadoop.fs.Path\nval path = new Path(\"dbfs:\/Volumes\/main\/default\/test-volume\/file.txt\")\nval fs = path.getFileSystem(sc.hadoopConfiguration)\nfs.listStatus(path)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Log\n#### load\n##### register\n###### and deploy MLflow models\n######## scikit-learn model deployment on SageMaker\n\nThis notebook uses ElasticNet models trained on the diabetes dataset described in [Track scikit-learn model training with MLflow](https:\/\/docs.databricks.com\/mlflow\/tracking-ex-scikit.html). The notebook shows how to: \n* Select a model to deploy using the MLflow experiment UI\n* Deploy the model to SageMaker using the MLflow API\n* Query the deployed model using the `sagemaker-runtime` API\n* Repeat the deployment and query process for another model\n* Delete the deployment using the MLflow API \nFor information on how to configure AWS authentication so that you can deploy MLflow models in AWS SageMaker from Databricks, see [Set up AWS authentication for SageMaker deployment](https:\/\/docs.databricks.com\/admin\/cloud-configurations\/aws\/sagemaker.html).\n\n######## scikit-learn model deployment on SageMaker\n######### MLflow scikit-learn model training notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-quick-start-deployment-aws.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n######## scikit-learn model deployment on SageMaker\n######### Deploy on Model Serving\n\nIf you prefer to serve your registered model using Databricks, see [Model serving with 
Databricks](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/scikit-learn-model-deployment-on-sagemaker.html"} +{"content":"# \n### Get started: Account and workspace setup\n\nIf you\u2019re new to Databricks, you\u2019ve found the place to start. This article walks you through the minimum steps required to create your account and get your first workspace up and running. \nFor information about online training resources, see [Get free Databricks training](https:\/\/docs.databricks.com\/getting-started\/free-training.html).\n\n### Get started: Account and workspace setup\n#### Requirements\n\nTo use your Databricks account on AWS, you need an existing AWS account. If you don\u2019t have an AWS account, you can sign up for an AWS Free Tier account at <https:\/\/aws.amazon.com\/free\/>.\n\n### Get started: Account and workspace setup\n#### Step 1: Sign up for a free trial\n\nYou can sign up for your free Databricks trial either on the [AWS Marketplace](https:\/\/aws.amazon.com\/marketplace\/pp\/prodview-wtyi5lgtce6n6) or through the [Databricks website](https:\/\/databricks.com\/try-databricks). The only difference between the two is where you\u2019ll handle the account billing after the free trial ends. \nFor detailed instructions on the free trial and billing, see [Databricks free trial](https:\/\/docs.databricks.com\/getting-started\/free-trial.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/index.html"} +{"content":"# \n### Get started: Account and workspace setup\n#### Step 2: Create and set up your first Databricks workspace\n\nAfter you sign up for the free trial, you\u2019re prompted to set up your first workspace using the AWS Quick Start. This deployment method creates Databricks-enabled AWS resources for you so you can get your workspace up and running quickly. 
\nFor instructions on deploying your workspace and basic set up, see [Get started: Databricks workspace onboarding](https:\/\/docs.databricks.com\/getting-started\/onboarding-account.html). \nNote \nIf you\u2019re more familiar with AWS and want to manually create AWS resources for your workspace deployment, see [Manually create a workspace (existing Databricks accounts)](https:\/\/docs.databricks.com\/admin\/workspace\/create-workspace.html).\n\n### Get started: Account and workspace setup\n#### Step 3: Explore and use the Databricks platform\n\nAt this point, you have a functional Databricks workspace. To learn how to navigate the platform, see [Navigate the workspace](https:\/\/docs.databricks.com\/workspace\/index.html). To jump in and start querying data, run the [Get started: Query and visualize data from a notebook](https:\/\/docs.databricks.com\/getting-started\/quick-start.html) tutorial.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/index.html"} +{"content":"# \n### Get started: Account and workspace setup\n#### Next steps\n\nYour next steps depend on whether you want to continue setting up your account organization and security or want to start building out data pipelines: \n* Connect your Databricks workspace to external data sources. See [Connect to data sources](https:\/\/docs.databricks.com\/connect\/index.html).\n* Ingest your data into the workspace. See [Ingest data into a Databricks lakehouse](https:\/\/docs.databricks.com\/ingestion\/index.html). \n* Onboard data to your workspace in Databricks SQL. See [Load data using streaming tables in Databricks SQL](https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html). \n* Build out your account organization and security. See [Get started with Databricks administration](https:\/\/docs.databricks.com\/getting-started\/admin-get-started.html).\n* Learn about managing access to data in your workspace. 
See [What is Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html).\n* Learn about managing access to workspace objects like notebooks, compute, dashboards, queries. See [Access control lists](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/index.html"} +{"content":"# \n### Get started: Account and workspace setup\n#### Get help\n\nIf you have any questions about setting up Databricks and need live help, please e-mail [onboarding-help@databricks.com](mailto:onboarding-help%40databricks.com). \nIf you have a Databricks support package, you can open and manage support cases with Databricks. See [Learn how to use Databricks support](https:\/\/docs.databricks.com\/resources\/support.html). \nIf your organization does not have a Databricks support subscription, or if you are not an authorized contact for your company\u2019s support subscription, you can get answers to many questions in [Databricks Office Hours](https:\/\/www.databricks.com\/p\/webinar\/officehours?utm_source=databricks&utm_medium=site&utm_content=docs) or from the [Databricks Community](https:\/\/community.databricks.com). \nIf you need additional help, [sign up for a live weekly demo](https:\/\/databricks.com\/p\/databricks-weekly-demo) to ask questions and practice alongside Databricks experts. 
Or, follow this [blog series on best practices for managing and maintaining your environments](https:\/\/databricks.com\/blog\/2022\/03\/10\/functional-workspace-organization-on-databricks.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### HorovodRunner: distributed deep learning with Horovod\n\nLearn how to perform distributed training of machine learning models using HorovodRunner to launch Horovod training jobs as Spark jobs on Databricks.\n\n##### HorovodRunner: distributed deep learning with Horovod\n###### What is HorovodRunner?\n\nHorovodRunner is a general API to run distributed deep learning workloads on Databricks using the [Horovod](https:\/\/github.com\/horovod\/horovod) framework. By integrating Horovod with Spark\u2019s [barrier mode](https:\/\/jira.apache.org\/jira\/browse\/SPARK-24374), Databricks is able to provide higher stability for long-running deep learning training jobs on Spark. HorovodRunner takes a Python method that contains deep learning training code with Horovod hooks. HorovodRunner pickles the method on the driver and distributes it to Spark workers. A Horovod MPI job is embedded as a Spark job using the barrier execution mode. The first executor collects the IP addresses of all task executors using `BarrierTaskContext` and triggers a Horovod job using `mpirun`. Each Python MPI process loads the pickled user program, deserializes it, and runs it. 
\n![HorovodRunner](https:\/\/docs.databricks.com\/_images\/horovod-runner.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-runner.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### HorovodRunner: distributed deep learning with Horovod\n###### Distributed training with HorovodRunner\n\nHorovodRunner lets you launch Horovod training jobs as Spark jobs. The HorovodRunner API supports the methods shown in the table. For details, see the [HorovodRunner API documentation](https:\/\/databricks.github.io\/spark-deep-learning\/#sparkdl.HorovodRunner). \n| Method and signature | Description |\n| --- | --- |\n| **`init(self, np)`** | Create an instance of HorovodRunner. |\n| **`run(self, main, **kwargs)`** | Run a Horovod training job invoking `main(**kwargs)`. The main function and the keyword arguments are serialized using cloudpickle and distributed to cluster workers. | \nThe general approach to developing a distributed training program using HorovodRunner is: \n1. Create a `HorovodRunner` instance initialized with the number of nodes.\n2. Define a Horovod training method according to the methods described in [Horovod usage](https:\/\/github.com\/horovod\/horovod#usage), making sure to add any import statements inside the method.\n3. Pass the training method to the `HorovodRunner` instance. \nFor example: \n```\nfrom sparkdl import HorovodRunner\n\nhr = HorovodRunner(np=2)\n\ndef train():\n    # Imports go inside the method so they are available on the workers\n    import tensorflow as tf\n    import horovod.tensorflow as hvd\n    hvd.init()\n\nhr.run(train)\n\n``` \nTo run HorovodRunner on the driver only with `n` subprocesses, use `hr = HorovodRunner(np=-n)`. For example, if there are 4 GPUs on the driver node, you can choose `n` up to `4`. For details about the parameter `np`, see the [HorovodRunner API documentation](https:\/\/databricks.github.io\/spark-deep-learning\/#api-documentation). 
For details about how to pin one GPU per subprocess, see the [Horovod usage guide](https:\/\/github.com\/horovod\/horovod#usage). \nA common error is that TensorFlow objects cannot be found or pickled. This happens when the library import statements are not distributed to other executors. To avoid this issue, include all import statements (for example, `import tensorflow as tf`) *both* at the top of the Horovod training method and inside any other user-defined functions called in the Horovod training method.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-runner.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### HorovodRunner: distributed deep learning with Horovod\n###### Record Horovod training with Horovod Timeline\n\nHorovod can record a timeline of its activity, called [Horovod Timeline](https:\/\/horovod.readthedocs.io\/en\/latest\/timeline.html). \nImportant \nHorovod Timeline has a significant impact on performance. Inception3 throughput can decrease by ~40% when Horovod Timeline is enabled. To speed up HorovodRunner jobs, do not use Horovod Timeline. \nYou cannot view the Horovod Timeline while training is in progress. \nTo record a Horovod Timeline, set the `HOROVOD_TIMELINE` environment variable to the location where you want to save the timeline file. Databricks recommends using a location on shared storage so that the timeline file can be easily retrieved. For example, you can use [DBFS local file APIs](https:\/\/docs.databricks.com\/dbfs\/index.html) as shown: \n```\nimport os\nimport uuid\n\ntimeline_dir = \"\/dbfs\/ml\/horovod-timeline\/%s\" % uuid.uuid4()\nos.makedirs(timeline_dir)\nos.environ['HOROVOD_TIMELINE'] = timeline_dir + \"\/horovod_timeline.json\"\nhr = HorovodRunner(np=4)\nhr.run(run_training_horovod, params=params)\n\n``` \nThen, add timeline-specific code to the beginning and end of the training function. 
The following example notebook includes example code that you can use as a workaround to view training progress. \n### Horovod timeline example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/horovod-runner-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nTo download the timeline file, use the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html), and then use the Chrome browser\u2019s `chrome:\/\/tracing` facility to view it. For example: \n![Horovod timeline](https:\/\/docs.databricks.com\/_images\/mnist-timeline.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-runner.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### HorovodRunner: distributed deep learning with Horovod\n###### Development workflow\n\nThese are the general steps in migrating single node deep learning code to distributed training. The [Examples: Migrate to distributed deep learning with HorovodRunner](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-runner.html#examples) in this section illustrate these steps. \n1. **Prepare single node code:** Prepare and test the single node code with TensorFlow, Keras, or PyTorch.\n2. **Migrate to Horovod:** Follow the instructions from [Horovod usage](https:\/\/github.com\/horovod\/horovod#usage) to migrate the code with Horovod and test it on the driver: \n1. Add `hvd.init()` to initialize Horovod.\n2. Pin a server GPU to be used by this process using `config.gpu_options.visible_device_list`. With the typical setup of one GPU per process, this can be set to local rank. In that case, the first process on the server will be allocated the first GPU, second process will be allocated the second GPU and so forth.\n3. Include a shard of the dataset. 
This dataset operator is very useful when running distributed training, as it allows each worker to read a unique subset.\n4. Scale the learning rate by number of workers. The effective batch size in synchronous distributed training is scaled by the number of workers. Increasing the learning rate compensates for the increased batch size.\n5. Wrap the optimizer in `hvd.DistributedOptimizer`. The distributed optimizer delegates gradient computation to the original optimizer, averages gradients using allreduce or allgather, and then applies the averaged gradients.\n6. Add `hvd.BroadcastGlobalVariablesHook(0)` to broadcast initial variable states from rank 0 to all other processes. This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint. Alternatively, if you\u2019re not using `MonitoredTrainingSession`, you can execute the `hvd.broadcast_global_variables` operation after global variables have been initialized.\n7. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them.\n3. **Migrate to HorovodRunner:** HorovodRunner runs the Horovod training job by invoking a Python function. You must wrap the main training procedure into a single Python function. Then you can test HorovodRunner in local mode and distributed mode.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-runner.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### HorovodRunner: distributed deep learning with Horovod\n###### Update the deep learning libraries\n\nNote \nThis article contains references to the term *slave*, a term that Databricks does not use. When the term is removed from the software, we\u2019ll remove it from this article. 
\nIf you upgrade or downgrade TensorFlow, Keras, or PyTorch, you must reinstall Horovod so that it is compiled against the newly installed library. For example, if you want to upgrade TensorFlow, Databricks recommends using the init script from the [TensorFlow installation instructions](https:\/\/docs.databricks.com\/machine-learning\/train-model\/tensorflow.html) and appending the following TensorFlow specific Horovod installation code to the end of it. See [Horovod installation instructions](https:\/\/github.com\/horovod\/horovod#install) to work with different combinations, such as upgrading or downgrading PyTorch and other libraries. \n```\nadd-apt-repository -y ppa:ubuntu-toolchain-r\/test\napt update\n# Using the same compiler that TensorFlow was built to compile Horovod\napt install g++-7 -y\nupdate-alternatives --install \/usr\/bin\/gcc gcc \/usr\/bin\/gcc-7 60 --slave \/usr\/bin\/g++ g++ \/usr\/bin\/g++-7\n\nHOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_CUDA_HOME=\/usr\/local\/cuda pip install horovod==0.18.1 --force-reinstall --no-deps --no-cache-dir\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-runner.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### HorovodRunner: distributed deep learning with Horovod\n###### Examples: Migrate to distributed deep learning with HorovodRunner\n\nThe following examples, based on the [MNIST](https:\/\/en.wikipedia.org\/wiki\/MNIST_database) dataset, demonstrate how to migrate a single-node deep learning program to distributed deep learning with HorovodRunner. 
\n* [Deep learning using TensorFlow with HorovodRunner for MNIST](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/mnist-tensorflow-keras.html)\n* [Adapt single node PyTorch to distributed deep learning](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/mnist-pytorch.html)\n\n##### HorovodRunner: distributed deep learning with Horovod\n###### Limitations\n\n* When working with workspace files, HorovodRunner will not work if `np` is set to greater than 1 and the notebook imports from other relative files. Consider using [horovod.spark](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-spark.html) instead of `HorovodRunner`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-runner.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n#### HorovodRunner: distributed deep learning with Horovod\n###### Deep learning using TensorFlow with HorovodRunner for MNIST\n\nThe following notebook demonstrates the recommended [development workflow](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-runner.html#development-workflow). 
Before running the notebook, prepare data for distributed training.\n\n###### Deep learning using TensorFlow with HorovodRunner for MNIST\n####### HorovodRunner TensorFlow and Keras MNIST example notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/mnist-tensorflow-keras.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/mnist-tensorflow-keras.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Hyperparameter tuning\n##### Compare model types with Hyperopt and MLflow\n\nThis notebook demonstrates how to tune the hyperparameters for multiple models and arrive at a best model overall. It uses Hyperopt with `SparkTrials` to compare three model types, evaluating model performance with a different set of hyperparameters appropriate for each model type.\n\n##### Compare model types with Hyperopt and MLflow\n###### Compare models using scikit-learn, Hyperopt, and MLflow notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/hyperopt-sklearn-model-selection.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-model-selection.html"} +{"content":"# Compute\n## Use compute\n#### View compute metrics\n\nThis article explains how to use the native compute metrics tool in the Databricks UI to gather key hardware and Spark metrics. Any compute that uses Databricks Runtime 13.3 LTS and above has access to these metrics by default. \nMetrics are available in almost real-time with a normal delay of less than one minute. 
Metrics are stored in Databricks-managed storage, not in the customer\u2019s storage.\n\n#### View compute metrics\n##### How are these new metrics different from Ganglia?\n\nThe new compute metrics UI has a more comprehensive view of your cluster\u2019s resource usage, including Spark consumption and internal Databricks processes. In contrast, the Ganglia UI only measures Spark container consumption. This difference might result in discrepancies in the metric values between the two interfaces.\n\n#### View compute metrics\n##### Access compute metrics UI\n\nTo view the compute metrics UI: \n1. Click **Compute** in the sidebar.\n2. Click on the compute resource you want to view metrics for.\n3. Click the **Metrics** tab. \n![Cluster metrics for the last 24 hours](https:\/\/docs.databricks.com\/_images\/cluster-metrics.png) \nHardware metrics are shown by default. To view Spark metrics, click the drop-down menu labeled **Hardware** and select **Spark**. You can also select **GPU** if the instance is GPU-enabled.\n\n#### View compute metrics\n##### Filter metrics by time period\n\nYou can view historical metrics by selecting a time range using the date picker filter. Metrics are collected every minute, so you can filter by any range of day, hour, or minute from the last 30 days. Click the calendar icon to select from predefined data ranges, or click inside the text box to define custom values. \nNote \nThe time intervals displayed in the charts adjust based on the length of time you are viewing. Most metrics are averages based on the time interval you are currently viewing. \nYou can also get the latest metrics by clicking the **Refresh** button.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/cluster-metrics.html"} +{"content":"# Compute\n## Use compute\n#### View compute metrics\n##### View metrics at the node level\n\nYou can view metrics for individual nodes by clicking the **Compute** drop-down menu and selecting the node you want to view metrics for. 
GPU metrics are only available at the individual-node level. Spark metrics are not available for individual nodes. \nNote \nIf you don\u2019t select a specific node, the result will be averaged over all nodes within a cluster (including the driver).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/cluster-metrics.html"} +{"content":"# Compute\n## Use compute\n#### View compute metrics\n##### Hardware metric charts\n\nThe following hardware metric charts are available to view in the compute metrics UI: \n* **Server load distribution**: This chart shows the CPU utilization over the past minute for each node.\n* **CPU utilization**: The percentage of time the CPU spent in each mode, based on total CPU seconds cost. The metric is averaged out based on whichever time interval is displayed in the chart. The following are the tracked modes: \n+ guest: If you are running VMs, the CPU those VMs use\n+ iowait: Time spent waiting for I\/O\n+ idle: Time the CPU had nothing to do\n+ irq: Time spent on interrupt requests\n+ nice: Time used by processes that have a positive niceness, meaning a lower priority than other tasks\n+ softirq: Time spent on software interrupt requests\n+ steal: If you are a VM, time other VMs \u201cstole\u201d from your CPUs\n+ system: The time spent in the kernel\n+ user: The time spent in userland\n* **Memory utilization**: The total memory usage by each mode, measured in bytes and averaged out based on whichever time interval is displayed in the chart. 
The following usage types are tracked: \n+ used: Used memory (including memory used by background processes running on a compute)\n+ free: Unused memory\n+ buffer: Memory used by kernel buffers\n+ cached: Memory used by the file system cache on the OS level\n* **Memory swap utilization**: The total memory swap usage by each mode, measured in bytes and averaged out based on whichever time interval is displayed in the chart.\n* **Free filesystem space**: The total filesystem usage by each mount point, measured in bytes and averaged out based on whichever time interval is displayed in the chart.\n* **Received through network**: The number of bytes received through the network by each device, averaged out based on whichever time interval is displayed in the chart.\n* **Transmitted through network**: The number of bytes transmitted through the network by each device, averaged out based on whichever time interval is displayed in the chart.\n* **Number of active nodes**: This shows the number of active nodes at every timestamp for the given compute.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/cluster-metrics.html"} +{"content":"# Compute\n## Use compute\n#### View compute metrics\n##### Spark metrics charts\n\nThe following Spark metric charts are available to view in the compute metrics UI: \n* **Server load distribution**: This chart shows the CPU utilization over the past minute for each node.\n* **Active tasks**: The total number of tasks executing at any given time, averaged out based on whichever time interval is displayed in the chart.\n* **Total failed tasks**: The total number of tasks that have failed in executors, averaged out based on whichever time interval is displayed in the chart.\n* **Total completed tasks**: The total number of tasks that have completed in executors, averaged out based on whichever time interval is displayed in the chart.\n* **Total number of tasks**: The total number of all tasks (running, failed, and completed) in executors, 
averaged out based on whichever time interval is displayed in the chart.\n* **Total shuffle read**: The total size of shuffle read data, measured in bytes and averaged out based on whichever time interval is displayed in the chart. `Shuffle read` means the sum of serialized read data on all executors at the beginning of a stage.\n* **Total shuffle write:** The total size of shuffle write data, measured in bytes and averaged out based on whichever time interval is displayed in the chart. `Shuffle Write` is the sum of all written serialized data on all executors before transmitting (normally at the end of a stage).\n* **Total task duration**: The total elapsed time the JVM spent executing tasks on executors, measured in seconds and averaged out based on whichever time interval is displayed in the chart.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/cluster-metrics.html"} +{"content":"# Compute\n## Use compute\n#### View compute metrics\n##### GPU metric charts\n\nThe following GPU metric charts are available to view in the compute metrics UI: \n* **Server load distribution**: This chart shows the CPU utilization over the past minute for each node.\n* **Per-GPU decoder utilization**: The percentage of GPU decoder utilization, averaged out based on whichever time interval is displayed in the chart.\n* **Per-GPU encoder utilization**: The percentage of GPU encoder utilization, averaged out based on whichever time interval is displayed in the chart.\n* **Per-GPU frame buffer memory utilization bytes**: The frame buffer memory utilization, measured in bytes and averaged out based on whichever time interval is displayed in the chart.\n* **Per-GPU memory utilization**: The percentage of GPU memory utilization, averaged out based on whichever time interval is displayed in the chart.\n* **Per-GPU utilization**: The percentage of GPU utilization, averaged out based on whichever time interval is displayed in the chart.\n\n#### View compute metrics\n##### 
Troubleshooting\n\nIf you see incomplete or missing metrics for a period, it could be one of the following issues: \n* An outage in the Databricks service responsible for querying and storing metrics.\n* Network issues on the customer\u2019s side.\n* The compute is or was in an unhealthy state.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/cluster-metrics.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Log\n#### load\n##### register\n###### and deploy MLflow models\n######## Log model dependencies\n\nIn this article, you learn how to log a model and its dependencies as model artifacts, so they are available in your environment for production tasks like model serving.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/log-model-dependencies.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Log\n#### load\n##### register\n###### and deploy MLflow models\n######## Log model dependencies\n######### Log Python package model dependencies\n\nMLflow has native support for some Python ML libraries, where MLflow can reliably log dependencies for models that use these libraries. See [built-in model flavors](https:\/\/mlflow.org\/docs\/latest\/models.html#built-in-model-flavors). \nFor example, MLflow supports scikit-learn in the [mlflow.sklearn module](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.sklearn.html), and the command [mlflow.sklearn.log\\_model](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.sklearn.html#mlflow.sklearn.log_model) logs the sklearn version. The same applies for [autologging](https:\/\/mlflow.org\/docs\/latest\/tracking.html#automatic-logging) with those ML libraries. See the [MLflow github repository](https:\/\/github.com\/mlflow\/mlflow\/tree\/master\/examples) for additional examples. 
\nFor ML libraries that can be installed with `pip install PACKAGE_NAME==VERSION`, but do not have built-in MLflow model flavors, you can log those packages using the [mlflow.pyfunc.log\\_model](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.pyfunc.html#mlflow.pyfunc.log_model) method. Be sure to log the requirements with the exact library version, for example, `f\"nltk=={nltk.__version__}\"` instead of just `nltk`. \n`mlflow.pyfunc.log_model` supports logging for: \n* Public and custom libraries packaged as Python egg or Python wheel files.\n* Public packages on PyPI and privately hosted packages on your own PyPI server. \nWith [mlflow.pyfunc.log\\_model](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.pyfunc.html#mlflow.pyfunc.log_model), MLflow tries to infer the dependencies automatically. MLflow infers the dependencies using [mlflow.models.infer\\_pip\\_requirements](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.models.html#mlflow.models.infer_pip_requirements), and logs them to a `requirements.txt` file as a model artifact. \nIn older versions, MLflow sometimes doesn\u2019t identify all Python requirements automatically, especially if the library isn\u2019t a built-in model flavor. In these cases, you can specify additional dependencies with the `extra_pip_requirements` parameter in the `log_model` command. See an example of using the [extra\\_pip\\_requirements parameter](https:\/\/www.mlflow.org\/docs\/latest\/model\/dependencies.html#adding-extra-dependencies-to-an-mlflow-model). \nImportant \nYou can also overwrite the entire set of requirements with the `conda_env` and `pip_requirements` parameters, but doing so is generally discouraged because this overrides the dependencies which MLflow picks up automatically. See an example of how to use the [`pip\\_requirements` parameter to overwrite requirements](https:\/\/www.mlflow.org\/docs\/latest\/model\/dependencies.html). 
\n### Customized model logging \nFor scenarios where more customized model logging is necessary, you can either: \n* Write a [custom Python model](https:\/\/mlflow.org\/docs\/latest\/models.html#custom-python-models). Doing so allows you to subclass `mlflow.pyfunc.PythonModel` to customize initialization and prediction. This approach works well for customization of Python-only models. \n+ For a simple example, see the [add N model example](https:\/\/mlflow.org\/docs\/latest\/models.html#example-creating-a-custom-add-n-model).\n+ For a more complex example, see the custom [XGBoost model example](https:\/\/mlflow.org\/docs\/latest\/models.html#example-saving-an-xgboost-model-in-mlflow-format).\n* Write a [custom flavor](https:\/\/mlflow.org\/docs\/latest\/models.html#custom-flavors). In this scenario, you can customize logging more than the generic `pyfunc` flavor, but doing so requires more work to implement. \n### Custom Python code \nYou may have Python code dependencies that can\u2019t be installed using the `%pip install` command, such as one or more `.py` files. \nWhen logging a model, you can tell MLflow that the model can find those dependencies at a specified path by using the `code_path` parameter in [mlflow.pyfunc.log\\_model](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.pyfunc.html#mlflow.pyfunc.log_model). MLflow stores any files or directories passed using `code_path` as artifacts along with the model in a code directory. When loading the model, MLflow adds these files or directories to the Python path. This route also works with custom Python wheel files, which can be included in the model using `code_path`, just like `.py` files. 
\n```\nmlflow.pyfunc.log_model(\n    artifact_path=artifact_path,\n    code_path=[\"filename.py\"],\n    data_path=data_path,\n    conda_env=conda_env,\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/log-model-dependencies.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Log\n#### load\n##### register\n###### and deploy MLflow models\n######## Log model dependencies\n######### Log non-Python package model dependencies\n\nMLflow does not automatically pick up non-Python dependencies, such as Java packages, R packages, and native packages (such as Linux packages). For these packages, you need to log additional data. \n* Dependency list: Databricks recommends logging an artifact with the model specifying these non-Python dependencies. This could be a simple `.txt` or `.json` file. [mlflow.pyfunc.log\\_model](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.pyfunc.html#mlflow.pyfunc.log_model) allows you to specify this additional artifact using the `artifacts` argument.\n* Custom packages: Just as for custom Python dependencies above, you need to ensure that the packages are available in your deployment environment. For packages in a central location such as Maven Central or your own repository, make sure that the location is available at scoring or serving time. For private packages not hosted elsewhere, you can log packages along with the model as artifacts.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/log-model-dependencies.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Log\n#### load\n##### register\n###### and deploy MLflow models\n######## Log model dependencies\n######### Deploy models with dependencies\n\nWhen deploying a model from the MLflow Tracking Server or Model Registry, you need to ensure that the deployment environment has the right dependencies installed. 
The simplest path may depend on your deployment mode: batch\/streaming or online serving, and on the types of dependencies. \nFor all deployment modes, Databricks recommends running inference on the same runtime version that you used during training, since the Databricks Runtime in which you created your model has various libraries already installed. MLflow in Databricks automatically saves that runtime version in the `MLmodel` metadata file in a `databricks_runtime` field, such as `databricks_runtime: 10.2.x-cpu-ml-scala2.12`. \n### Online serving: Databricks model serving \nDatabricks offers [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html), where your MLflow machine learning models are exposed as scalable REST API endpoints. \nFor Python dependencies in the `requirements.txt` file, Databricks and MLflow handle everything for public PyPI dependencies. Similarly, if you specified `.py` files or Python wheel files when logging the model by using the `code_path` argument, MLflow loads those dependencies for you automatically. \nFor these model serving scenarios, see the following: \n* [Use custom Python libraries with Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/private-libraries-model-serving.html)\n* [Package custom artifacts and files for Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-custom-artifacts.html) \n### Online serving: third-party systems or Docker containers \nIf your scenario requires serving to third-party serving solutions or your own Docker-based solution, you can export your model as a Docker container. 
\nDatabricks recommends the following for third-party serving that automatically handles Python dependencies. However, for non-Python dependencies, the container needs to be modified to include them. \n* MLflow\u2019s Docker integration for Docker-based serving solutions: [MLflow models build-docker](https:\/\/mlflow.org\/docs\/latest\/cli.html#mlflow-models-build-docker) \n* MLflow\u2019s SageMaker integration: [mlflow.sagemaker API](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.sagemaker.html) \n### Batch and streaming jobs \nBatch and streaming scoring should be run as [Databricks Jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html). A notebook job often suffices, and the simplest way to prepare code is to use the [Databricks Model Registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#use-model-for-inference) to generate a scoring notebook. \nThe following steps ensure that dependencies are installed and applied: \n1. Start your scoring cluster with the same Databricks Runtime version used during training. Read the `databricks_runtime` field from the `MLmodel` metadata file, and start a cluster with that runtime version. \n* This can be done manually in the cluster configuration or automated with custom logic. For automation, you can use the runtime version that you read from the metadata file with the [Jobs API](https:\/\/docs.databricks.com\/api\/workspace\/jobs) and [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters).\n2. Next, install any non-Python dependencies. 
To ensure your non-Python dependencies are accessible to your deployment environment, you can either: \n* Manually install the non-Python dependencies of your model on the Databricks cluster as part of the cluster configuration before running inference.\n* Alternatively, you can write custom logic in your scoring job deployment to automate the installation of the dependencies onto your cluster. Assuming you saved your non-Python dependencies as artifacts as described in [Log non-Python package model dependencies](https:\/\/docs.databricks.com\/mlflow\/log-model-dependencies.html#log-non-python), this automation can install libraries using the [Libraries API](https:\/\/docs.databricks.com\/api\/workspace\/libraries). Or, you can write specific code to generate a [cluster-scoped initialization script](https:\/\/docs.databricks.com\/init-scripts\/cluster-scoped.html) to install the dependencies.\n3. Your scoring job installs the Python dependencies in the job execution environment. In Databricks, the Model Registry allows you to generate a notebook for inference which does this for you. \n* When you use the Databricks Model Registry to [generate a scoring notebook](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#use-model-for-inference), the notebook contains code to install the Python dependencies in the model\u2019s `requirements.txt` file. For your notebook job for batch or streaming scoring, this code initializes your notebook environment, so that the model dependencies are installed and ready for your model.\n4. MLflow handles any custom Python code included in the `code_path` parameter in `log_model`. This code is added to the Python path when the model\u2019s `predict()` method is called. 
You can also do this manually by either: \n* Calling [mlflow.pyfunc.spark\\_udf](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.pyfunc.html#mlflow.pyfunc.spark_udf) with the `env_manager` argument set to `\"virtualenv\"` or `\"conda\"`.\n* Extracting the requirements using [mlflow.pyfunc.get\\_model\\_dependencies](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.pyfunc.html#mlflow.pyfunc.get_model_dependencies) and installing them using [%pip install](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html).\nNote \nIf you specified `.py` files or Python wheel files when logging the model using the `code_path` argument, MLflow loads those dependencies for you automatically.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/log-model-dependencies.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL database tool\n##### Connect to SQL Workbench\/J\n\nThis article describes how to use SQL Workbench\/J with Databricks. \nNote \nThis article covers SQL Workbench\/J, which is neither provided nor supported by Databricks. To contact the provider, use the [SQL Workbench\/J support forum](https:\/\/groups.google.com\/g\/sql-workbench) in Google Groups.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/workbenchj.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL database tool\n##### Connect to SQL Workbench\/J\n###### Requirements\n\n* [SQL Workbench\/J](https:\/\/www.sql-workbench.eu\/downloads.html).\n* The [Databricks JDBC Driver](https:\/\/databricks.com\/spark\/jdbc-drivers-archive). Download the Databricks JDBC Driver onto your local development machine, extracting the `DatabricksJDBC42.jar` file from the downloaded `DatabricksJDBC42-<version>.zip` file. \nNote \nThis article was tested with macOS, SQL Workbench\/J Build 130, Zulu OpenJDK 21.0.1, and Databricks JDBC Driver 2.6.36. 
\nFor Databricks authentication, if you are not using Databricks personal access token authentication, you can skip generating a personal access token later in these requirements. For more information about available Databricks authentication types, see [Authentication settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html). \n* A cluster or SQL warehouse in your Databricks workspace. \n+ [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n+ [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* The connection details for your cluster or SQL warehouse, specifically the **Server Hostname**, **Port**, and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). To create a personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**.\nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. 
If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. See your workspace administrator or the following: \n+ [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n+ [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/workbenchj.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL database tool\n##### Connect to SQL Workbench\/J\n###### Steps to connect to Workbench\/J\n\nTo connect to Workbench\/J, do the following: \n1. Launch SQL Workbench\/J.\n2. Select **File > Connect window**.\n3. In the **Select Connection Profile** dialog, click **Manage Drivers**. \n1. In the **Name** field, type `Databricks`.\n2. 
In the **Library** field, click the **Select the JAR file(s)** icon. Browse to the directory where you extracted the `DatabricksJDBC42.jar` file from the downloaded `DatabricksJDBC42-<version>.zip` file, and select the JAR file. Then click **Choose**.\n3. Verify that the **Classname** field is populated with `com.databricks.client.jdbc.Driver`.\n4. Click **OK**.\n4. Click the **Create a new connection profile** icon. \n1. Type a name for the profile.\n2. In the Driver field, select **Databricks (com.databricks.client.jdbc.Driver)**.\n3. In the **URL** field, enter the JDBC URL for your Databricks resource. For the **URL** field syntax for JDBC URLs, see [Authentication settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html).\n4. Click **Test**.\n5. Click **OK** twice.\n\n##### Connect to SQL Workbench\/J\n###### Additional resources\n\n* [SQL Workbench\/J](https:\/\/www.sql-workbench.eu\/)\n* [Support](https:\/\/groups.google.com\/g\/sql-workbench)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/workbenchj.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/run-notebook.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Run Databricks notebooks\n\nBefore you can run any cell in a notebook, you must [attach the notebook to a cluster](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach). \nTo run all the cells in a notebook, select **Run All** in the notebook toolbar. \nImportant \nDo not use **Run All** if steps for [mount and unmount](https:\/\/docs.databricks.com\/dbfs\/mounts.html) are in the same notebook. It could lead to a race condition and possibly corrupt the mount points. \nTo run a single cell, click in the cell and press **shift+enter**. You can also run a subset of lines in a cell or a subset of cells. 
See [Run selected text](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#run-selected-text) and [Run selected cells](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#run-selected-cells). \nTo run all cells before or after a cell, use the cell actions menu ![Cell actions](https:\/\/docs.databricks.com\/_images\/cell-actions.png) at the far right. Click ![Run Menu](https:\/\/docs.databricks.com\/_images\/run-menu.png) and select **Run All Above** or **Run All Below**. **Run All Below** includes the cell you are in; **Run All Above** does not. \nThe behavior of **Run All Above** and **Run All Below** depends on the cluster that the notebook is attached to. \n* On a cluster running Databricks Runtime 13.3 LTS or below, cells are executed individually. If an error occurs in a cell, the execution continues with subsequent cells.\n* On a cluster running Databricks Runtime 14.0 or above, or on a SQL warehouse, cells are executed as a batch. Any error halts execution, and you cannot cancel the execution of individual cells. You can use the **Interrupt** button to stop execution of all cells. \nWhen a notebook is running, the icon in the notebook tab changes from ![notebook tab icon](https:\/\/docs.databricks.com\/_images\/nb-not-running-icon.png) to ![running notebook tab icon](https:\/\/docs.databricks.com\/_images\/nb-running-icon.png). If notifications are enabled in your browser and you navigate to a different tab while a notebook is running, a notification appears when the notebook finishes. \nTo stop or interrupt a running notebook, select ![the interrupt button](https:\/\/docs.databricks.com\/_images\/nb-interrupt-button.png) in the notebook toolbar. 
You can also select **Run > Interrupt execution**, or use the keyboard shortcut `I I`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/run-notebook.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Run Databricks notebooks\n##### Schedule a notebook run\n\nTo automatically run a notebook on a regular schedule, [create a notebook job](https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html).\n\n#### Run Databricks notebooks\n##### Run a Delta Live Tables pipeline from a notebook\n\nFor information about starting a Delta Live Tables run from a notebook, see [Open or run a Delta Live Tables pipeline from a notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-dlt-pipeline.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/run-notebook.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Run Databricks notebooks\n##### Notifications\n\nNotifications alert you to certain events, such as which command is currently running and which commands are in error state. When your notebook is showing multiple error notifications, the first one will have a link that allows you to clear all notifications. \n![Notebook notifications](https:\/\/docs.databricks.com\/_images\/notification.png) \nNotebook notifications are enabled by default. You can disable them in [user settings](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#configure-notebook-settings). \n### Background notifications \nIf you start a notebook run and then navigate away from the tab or window that the notebook is running in, a notification appears when the notebook is completed. You can disable this notification in your browser settings. \n### Databricks Advisor \nDatabricks Advisor automatically analyzes commands every time they are run and displays appropriate advice in the notebooks. 
The advice notices provide information that can assist you in improving the performance of workloads, reducing costs, and avoiding common mistakes. \n#### View advice \nA blue box with a lightbulb icon signals that advice is available for a command. The box displays the number of distinct pieces of advice. \n![Databricks advice](https:\/\/docs.databricks.com\/_images\/advice-collapsed.png) \nClick the lightbulb to expand the box and view the advice. One or more pieces of advice will become visible. \n![View advice](https:\/\/docs.databricks.com\/_images\/advice-expanded.png) \nClick the **Learn more** link to view documentation providing more information related to the advice. \nClick the **Don\u2019t show me this again** link to hide the piece of advice. The advice of this type will no longer be displayed. This action can be [reversed in Editor settings](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html#advice-settings). \nClick the lightbulb again to collapse the advice box. \n### Advice settings \nTo enable or disable Databricks Advisor, go to [user settings](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#configure-notebook-settings) or click the gear icon in the expanded advice box. \nToggle the **Turn on Databricks Advisor** option to enable or disable advice. \nThe **Reset hidden advice** link is displayed if one or more types of advice is currently hidden. Click the link to make that advice type visible again.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/run-notebook.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n###### GDPR and CCPA compliance with Delta Lake\n\nThis article describes how you can use Delta Lake on Databricks to manage General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA) compliance for your data lake. Compliance often requires *point deletes*, or deleting individual records within a large collection of data. 
Delta Lake speeds up point deletes in large data lakes with ACID transactions, allowing you to locate and remove personally identifiable information (PII) in response to consumer GDPR or CCPA requests.\n\n###### GDPR and CCPA compliance with Delta Lake\n####### Plan your data model for compliance\n\nModeling your data for compliance is an important step in dealing with PII. There are numerous viable approaches depending on the needs of your data consumers. \nOne frequently applied approach is *[pseudonymization](https:\/\/en.wikipedia.org\/wiki\/Pseudonymization)*, or reversible tokenization of personal information elements (*identifiers*) to keys (*pseudonyms*) that cannot be externally identified. Compliance through pseudonymization requires careful planning, including the following: \n* Storage of information in a manner linked to pseudonyms rather than identifiers.\n* Maintenance of strict policies for the access and usage of data that combine the identifiers and pseudonyms.\n* Pipelines or storage policies to remove raw data.\n* Logic to locate and delete the linkage between the pseudonyms and identifiers.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/gdpr-delta.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n###### GDPR and CCPA compliance with Delta Lake\n####### How Delta Lake simplifies point deletes\n\nDelta Lake has many [data skipping](https:\/\/docs.databricks.com\/delta\/data-skipping.html) optimizations built in. To accelerate point deletes, Databricks recommends using Z-order on fields that you use during `DELETE` operations. \nDelta Lake retains table history and makes it available for point-in-time queries and rollbacks. The [VACUUM](https:\/\/docs.databricks.com\/delta\/vacuum.html) function removes data files that are no longer referenced by a Delta table and are older than a specified retention threshold, permanently deleting the data. 
To learn more about defaults and recommendations, see [Work with Delta Lake table history](https:\/\/docs.databricks.com\/delta\/history.html). \nNote \nFor tables with deletion vectors enabled, you must also run `REORG TABLE ... APPLY (PURGE)` to permanently delete underlying records. See [Apply changes to Parquet data files](https:\/\/docs.databricks.com\/delta\/deletion-vectors.html#purge).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/gdpr-delta.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n##### Configure Structured Streaming trigger intervals\n\nApache Spark Structured Streaming processes data incrementally; controlling the trigger interval for batch processing allows you to use Structured Streaming for workloads including near-real time processing, refreshing databases every 5 minutes or once per hour, or batch processing all new data for a day or week. \nBecause Databricks Auto Loader uses Structured Streaming to load data, understanding how triggers work provides you with the greatest flexibility to control costs while ingesting data with the desired frequency.\n\n##### Configure Structured Streaming trigger intervals\n###### Specifying time-based trigger intervals\n\nStructured Streaming refers to time-based trigger intervals as \u201cfixed interval micro-batches\u201d. Using the `processingTime` keyword, specify a time duration as a string, such as `.trigger(processingTime='10 seconds')`. \nWhen you specify a `trigger` interval that is too small (less than tens of seconds), the system may perform unnecessary checks to see if new data arrives. 
Configure your processing time to balance latency requirements and the rate that data arrives in the source.\n\n##### Configure Structured Streaming trigger intervals\n###### Configuring incremental batch processing\n\nImportant \nIn Databricks Runtime 11.3 LTS and above, the `Trigger.Once` setting is deprecated. Databricks recommends you use `Trigger.AvailableNow` for all incremental batch processing workloads. \nThe available now trigger option consumes all available records as an incremental batch with the ability to configure batch size with options such as `maxBytesPerTrigger` (sizing options vary by data source). \nDatabricks supports using `Trigger.AvailableNow` for incremental batch processing from many Structured Streaming sources. The following table includes the minimum supported Databricks Runtime version required for each data source: \n| Source | Minimum Databricks Runtime version |\n| --- | --- |\n| File sources (JSON, Parquet, etc.) | 9.1 LTS |\n| Delta Lake | 10.4 LTS |\n| Auto Loader | 10.4 LTS |\n| Apache Kafka | 10.4 LTS |\n| Kinesis | 13.1 |\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/triggers.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n##### Configure Structured Streaming trigger intervals\n###### What is the default trigger interval?\n\nStructured Streaming defaults to fixed interval micro-batches of 500ms. Databricks recommends you always specify a tailored `trigger` to minimize costs associated with checking if new data has arrived and processing undersized batches.\n\n##### Configure Structured Streaming trigger intervals\n###### Changing trigger intervals between runs\n\nYou can change the trigger interval between runs while using the same checkpoint. \nIf a Structured Streaming job stops while a micro-batch is being processed, that micro-batch must complete before the new trigger interval applies. 
As such, you might observe a micro-batch processing with the previously specified settings after changing the trigger interval. \nWhen moving from time-based interval to using `AvailableNow`, this might result in a micro-batch processing ahead of processing all available records as an incremental batch. \nWhen moving from `AvailableNow` to a time-based interval, this might result in continuing to process all records that were available when the last `AvailableNow` job triggered. This is the expected behavior. \nNote \nIf you are trying to recover from query failure associated with an incremental batch, changing the trigger interval does not solve this problem because the batch must still be completed. Databricks recommends scaling up the compute capacity used to process the batch to try to resolve the issue. In rare cases, you might need to restart the stream with a new checkpoint.\n\n##### Configure Structured Streaming trigger intervals\n###### What is continuous processing mode?\n\nApache Spark supports an additional trigger interval known as [Continuous Processing](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html#continuous-processing). This mode has been classified as experimental since Spark 2.3; consult with your Databricks account team to make sure you understand the trade-offs of this processing model. \nNote that this continuous processing mode does not relate at all to continuous processing as applied in Delta Live Tables.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/triggers.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n### Query databases using JDBC\n##### Use the Databricks connector to connect to another Databricks workspace\n\nThis article provides syntax examples of using the Databricks connector to connect to another Databricks workspace. This connector leverages the Databricks JDBC driver, which is included in Databricks Runtime 13.3 LTS and above. 
\nImportant \nFor most data sharing operations, Databricks recommends Delta Sharing. See [Share data and AI assets securely using Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html). You may also prefer Lakehouse Federation for managing queries on data in other Databricks workspaces. See [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html).\n\n##### Use the Databricks connector to connect to another Databricks workspace\n###### Connecting to another Databricks workspace\n\nThe Databricks Spark connector allows you to connect to compute resources configured in another Databricks workspace and return results to your current Databricks workspace. You must have access to active compute on both workspaces for queries to succeed. \nThe JDBC driver is registered for `jdbc:databricks:\/\/` URLs. You must configure and use a personal access token that grants you permissions on the workspace resources being accessed remotely. See the [Token management API](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement). 
\nNote \nIf you have a Databricks JDBC library attached to your cluster, the library version attached to your cluster is used instead of the version included in Databricks Runtime.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/databricks.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n### Query databases using JDBC\n##### Use the Databricks connector to connect to another Databricks workspace\n###### Read data from another Databricks workspace\n\nYou can specify the format `databricks` to use the Databricks Spark connector when you\u2019re reading data, as in the following example: \n```\ndf = (spark.read\n.format(\"databricks\")\n.option(\"host\", \"<host-name>.cloud.databricks.com\")\n.option(\"httpPath\", \"\/sql\/1.0\/warehouses\/<warehouse-id>\")\n.option(\"personalAccessToken\", \"<auth-token>\")\n.option(\"dbtable\", \"<table-name>\")\n.load()\n)\n\n```\n\n##### Use the Databricks connector to connect to another Databricks workspace\n###### Create an external table against another Databricks workspace\n\nYou can register an external table in a Databricks workspace linked to a separate Databricks workspace. \nThe following example demonstrates this syntax, using the `secret` function to get credentials stored with Databricks secrets: \nNote \nFor more on Databricks secrets, see [secret function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/secret.html). 
\n```\nCREATE TABLE databricks_external_table\nUSING databricks\nOPTIONS (\nhost '<host-name>.cloud.databricks.com',\nhttpPath '\/sql\/1.0\/warehouses\/<warehouse-id>',\npersonalAccessToken secret('<scope>', '<token>'),\ndbtable '<table-name>'\n);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/databricks.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n##### Track ML and deep learning training runs\n\nThe MLflow tracking component lets you log source properties, parameters, metrics, tags, and artifacts related to training a machine learning or deep learning model. To get started with MLflow, try one of the [MLflow quickstart tutorials](https:\/\/docs.databricks.com\/mlflow\/quick-start.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/tracking.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n##### Track ML and deep learning training runs\n###### MLflow tracking with experiments and runs\n\nMLflow tracking is based on two concepts, *experiments* and *runs*: \nNote \nStarting March 27, 2024, MLflow imposes a quota limit on the number of total parameters, tags, and metric steps for all existing and new runs, and the number of total runs for all existing and new experiments, see [Resource limits](https:\/\/docs.databricks.com\/resources\/limits.html). If you hit the runs per experiment quota, Databricks recommends you delete runs that you no longer need [using the delete runs API in Python](https:\/\/docs.databricks.com\/mlflow\/runs.html#bulk-delete). If you hit other quota limits, Databricks recommends adjusting your logging strategy to keep under the limit. 
If you require an increase to this limit, reach out to your Databricks account team with a brief explanation of your use case, why the suggested mitigation approaches do not work, and the new limit you request. \n* An MLflow *experiment* is the primary unit of organization and access control for MLflow runs; all MLflow runs belong to an experiment. Experiments let you visualize, search for, and compare runs, as well as download run artifacts and metadata for analysis in other tools.\n* An MLflow *run* corresponds to a single execution of model code. \n* [Organize training runs with MLflow experiments](https:\/\/docs.databricks.com\/mlflow\/experiments.html)\n* [Manage training code with MLflow runs](https:\/\/docs.databricks.com\/mlflow\/runs.html) \nThe [MLflow Tracking API](https:\/\/www.mlflow.org\/docs\/latest\/tracking.html) logs parameters, metrics, tags, and artifacts from a model run. The Tracking API communicates with an MLflow [tracking server](https:\/\/www.mlflow.org\/docs\/latest\/tracking.html#tracking-server). When you use Databricks, a Databricks-hosted tracking server logs the data. The hosted MLflow tracking server has Python, Java, and R APIs. \nNote \nMLflow is installed on Databricks Runtime ML clusters. To use MLflow on a Databricks Runtime cluster, you must install the `mlflow` library. For instructions on installing a library onto a cluster, see [Install a library on a cluster](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html#install-libraries). 
The specific packages to install for MLflow are: \n* For Python, select **Library Source** PyPI and enter `mlflow` in the **Package** field.\n* For R, select **Library Source** CRAN and enter `mlflow` in the **Package** field.\n* For Scala, install these two packages: \n+ Select **Library Source** Maven and enter `org.mlflow:mlflow-client:1.11.0` in the **Coordinates** field.\n+ Select **Library Source** PyPI and enter `mlflow` in the **Package** field.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/tracking.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n##### Track ML and deep learning training runs\n###### Where MLflow runs are logged\n\nAll MLflow runs are logged to the active experiment, which can be set using any of the following ways: \n* Use the [mlflow.set\\_experiment() command](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.set_experiment).\n* Use the `experiment_id` parameter in the [mlflow.start\\_run() command](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.start_run).\n* Set one of the MLflow environment variables [MLFLOW\\_EXPERIMENT\\_NAME or MLFLOW\\_EXPERIMENT\\_ID](https:\/\/mlflow.org\/docs\/latest\/cli.html#cmdoption-mlflow-run-arg-uri). \nIf no active experiment is set, runs are logged to the [notebook experiment](https:\/\/docs.databricks.com\/mlflow\/experiments.html#mlflow-notebook-experiments). \nTo log your experiment results to a remotely hosted MLflow Tracking server in a workspace other than the one in which you are running your experiment, set the tracking URI to reference the remote workspace with `mlflow.set_tracking_uri()`, and set the path to your experiment in the remote workspace by using `mlflow.set_experiment()`. 
\n```\nmlflow.set_tracking_uri(<uri-of-remote-workspace>)\nmlflow.set_experiment(\"path to experiment in remote workspace\")\n\n``` \nIf you are running experiments locally and want to log experiment results to the Databricks MLflow Tracking server, provide your Databricks workspace instance (`DATABRICKS_HOST`) and Databricks personal access token (`DATABRICKS_TOKEN`). Next, you can set the tracking URI to reference the workspace with `mlflow.set_tracking_uri()`, and set the path to your experiment by using `mlflow.set_experiment()`. See [Perform Databricks personal access token authentication](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html#token-auth) for details on where to find values for the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables. \nThe following code example demonstrates setting these values: \n```\nimport os\nimport mlflow\n\nos.environ[\"DATABRICKS_HOST\"] = \"https:\/\/dbc-1234567890123456.cloud.databricks.com\" # set to your server URI\nos.environ[\"DATABRICKS_TOKEN\"] = \"dapixxxxxxxxxxxxx\"\n\nmlflow.set_tracking_uri(\"databricks\")\nmlflow.set_experiment(\"\/your-experiment\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/tracking.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n##### Track ML and deep learning training runs\n###### Logging example notebook\n\nThis notebook shows how to log runs to a notebook experiment and to a workspace experiment. Only MLflow runs initiated within a notebook can be logged to the notebook experiment. MLflow runs launched from any notebook or from the APIs can be logged to a workspace experiment. For information about viewing logged runs, see [View notebook experiment](https:\/\/docs.databricks.com\/mlflow\/experiments.html#view-notebook-experiment) and [View workspace experiment](https:\/\/docs.databricks.com\/mlflow\/experiments.html#view-workspace-experiment). 
\n### Log MLflow runs notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-log-runs.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nYou can use MLflow Python, Java or Scala, and R APIs to start runs and record run data. For details, see the [MLflow example notebooks](https:\/\/docs.databricks.com\/mlflow\/quick-start.html).\n\n##### Track ML and deep learning training runs\n###### Access the MLflow tracking server from outside Databricks\n\nYou can also write to and read from the tracking server from outside Databricks, for example\nusing the MLflow CLI. See [Access the MLflow tracking server from outside Databricks](https:\/\/docs.databricks.com\/mlflow\/access-hosted-tracking-server.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/tracking.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n##### Track ML and deep learning training runs\n###### Analyze MLflow runs programmatically\n\nYou can access MLflow run data programmatically using the following two DataFrame APIs: \n* The MLflow Python client [search\\_runs API](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html#mlflow.search_runs) returns a pandas DataFrame.\n* The [MLflow experiment](https:\/\/docs.databricks.com\/query\/formats\/mlflow-experiment.html) data source returns an Apache Spark DataFrame. 
\nThis example demonstrates how to use the MLflow Python client to build a dashboard that visualizes changes in evaluation metrics over time, tracks the number of runs started by a specific user, and measures the total number of runs across all users: \n* [Build dashboards with the MLflow Search API](https:\/\/docs.databricks.com\/mlflow\/build-dashboards.html)\n\n##### Track ML and deep learning training runs\n###### Why model training metrics and outputs may vary\n\nMany of the algorithms used in ML have a random element, such as sampling or random initial conditions within the algorithm itself. When you train a model using one of these algorithms, the results might not be the same with each run, even if you start the run with the same conditions. Many libraries offer a seeding mechanism to fix the initial conditions for these stochastic elements. However, there may be other sources of variation that are not controlled by seeds. Some algorithms are sensitive to the order of the data, and distributed ML algorithms may also be affected by how the data is partitioned. Generally this variation is not significant and not important in the model development process. 
\nTo control variation caused by differences in ordering and partitioning, use the PySpark functions [repartition](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/reference\/pyspark.sql\/api\/pyspark.sql.DataFrame.repartition.html) and [sortWithinPartitions](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/reference\/pyspark.sql\/api\/pyspark.sql.DataFrame.sortWithinPartitions.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/tracking.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n##### Track ML and deep learning training runs\n###### MLflow tracking examples\n\nThe following notebooks demonstrate how to train several types of models and track the training data in MLflow and how to store tracking data in Delta Lake. \n* [Track scikit-learn model training with MLflow](https:\/\/docs.databricks.com\/mlflow\/tracking-ex-scikit.html)\n* [Track ML Model training data with Delta Lake](https:\/\/docs.databricks.com\/mlflow\/tracking-ex-delta.html)\n* [Track Keras model training with MLflow](https:\/\/mlflow.org\/docs\/latest\/deep-learning\/keras\/quickstart\/quickstart_keras_core.html_)\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/tracking.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Manage file assets in Databricks Git folders\n\nDatabricks Git folders serve as Git clients for Databricks-managed clones of Git-based source repositories, enabling you to perform a subset of Git operations on their contents from your workspace. As part of this Git integration, files stored in the remote repo are viewed as \u201cassets\u201d based on their type, with some limitations in place specific to their type. Notebook files, in particular, have different properties based on their type. 
Read this article to understand how to work with assets, particularly IPYNB notebooks, in Git folders.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/manage-assets.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Manage file assets in Databricks Git folders\n##### Supported asset types\n\nOnly certain Databricks asset types are supported by Git folders. In this case, \u201csupported\u201d means \u201ccan be serialized, version-controlled, and pushed to the backing Git repo.\u201d \nCurrently, the supported asset types are: \n| Asset Type | Details |\n| --- | --- |\n| **File** | Files are serialized data, and can include anything from libraries to binaries to code to images. For more information, read [What are workspace files?](https:\/\/docs.databricks.com\/files\/workspace.html) |\n| **Notebook** | Notebooks are specifically the notebook file formats supported by Databricks. Notebooks are considered a separate Databricks asset type from Files because they are not serialized. Git folders determine a Notebook by the file extension (such as `.ipynb`) or by file extensions combined with a special marker in file content (for example, a `# Databricks notebook source` comment at the beginning of `.py` source files). |\n| **Folder** | A folder is a Databricks-specific structure that represents serialized information about a logical grouping of files in Git. As expected, the user experiences this as a \u201cfolder\u201d when viewing a Databricks Git folder or accessing it with the Databricks CLI. | \nDatabricks asset types that are currently not supported in Git folders include the following: \n* DBSQL queries\n* Alerts\n* Dashboards (including legacy dashboards) \nWhen working with your assets in Git, observe the following limitations in file naming: \n* A folder cannot contain a notebook with the same name as another notebook, file, or folder in the same Git repository, even if the file extension differs. 
(For source-format notebooks, the extension is `.py` for Python, `.scala` for Scala, `.sql` for SQL, and `.r` for R. For IPYNB-format notebooks, the extension is `.ipynb`.) For example, you can\u2019t use a source-format notebook named `test1.py` and an IPYNB notebook named `test1` in the same Git folder because the source-format Python notebook file (`test1.py`) will be serialized as `test1` and a conflict will occur.\n* The character `\/` is not supported in file names. For example, you can\u2019t have a file named `i\/o.py` in your Git folder. \nIf you attempt to perform Git operations on files whose names match these patterns, you will get an \u201cError fetching Git status\u201d message. If you receive this error unexpectedly, review the filenames of the assets in your Git repository. If you find files whose names match these conflicting patterns, rename them and try the operation again. \nNote \nYou can move existing unsupported assets into a Git folder, but cannot commit changes to these assets back to the repo. You cannot create new unsupported assets in a Git folder.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/manage-assets.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Manage file assets in Databricks Git folders\n##### Notebook formats\n\nDatabricks considers two kinds of high-level, Databricks-specific notebook formats: \u201csource\u201d and \u201cipynb\u201d. When a user commits a notebook in the \u201csource\u201d format, the Databricks platform commits a flat file with a language suffix, such as `.py`, `.sql`, `.scala`, or `.r`. A \u201csource\u201d-format notebook contains only source code and does not contain outputs such as table displays and visualizations that are the results of running the notebook. 
\nThe \u201cipynb\u201d format, however, does have outputs associated with it, and those artifacts are automatically pushed to the Git repo backing the Git folder when pushing the `.ipynb` notebook that generated them. If you want to commit outputs along with the code, use the \u201cipynb\u201d notebook format and set up the configuration to allow users to commit any generated outputs. As a result, \u201cipynb\u201d also supports a better viewing experience in Databricks for notebooks pushed to remote Git repos through Git folders. \n| Notebook source format | Details |\n| --- | --- |\n| source | Can be any code file with a standard file suffix that signals the code language, such as `.py`, `.scala`, `.r` and `.sql`. \u201csource\u201d notebooks are treated as text files and will not include any associated outputs when committed back to a Git repo. |\n| ipynb | \u201cipynb\u201d files end with `.ipynb` and can, if configured, push outputs (such as visualizations) from the Databricks Git folder to the backing Git repo. An `.ipynb` notebook can contain code in any language supported by Databricks notebooks (despite the `py` part of `.ipynb`). | \nIf you want outputs pushed back to your repo after running a notebook, use a `.ipynb` (Jupyter) notebook. If you just want to run the notebook and manage it in Git, use a \u201csource\u201d format like `.py`. \nFor more details on supported notebook formats, read [Export and import Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html). \nNote \n**What are \u201coutputs\u201d?** \nOutputs are the results of running a notebook on the Databricks platform, including table displays and visualizations. \n**How do I tell what format a notebook is using, other than the file extension?** \nAt the top of a notebook managed by Databricks, there is usually a single-line comment that indicates the format. 
For example, for a `.py` \u201csource\u201d notebook, you will see a line that looks like this: \n`# Databricks notebook source` \nFor `.ipynb` files, the file suffix is used to indicate that it is the \u201cipynb\u201d notebook format.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/manage-assets.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Manage file assets in Databricks Git folders\n##### IPYNB notebooks in Databricks Git folders\n\nSupport for Jupyter notebooks (`.ipynb` files) is available in Git folders. You can clone repositories with `.ipynb` notebooks, work with them in the Databricks product, and then commit and push them as `.ipynb` notebooks. Metadata such as the notebook dashboard is preserved. Admins can control whether outputs can be committed or not.\n\n#### Manage file assets in Databricks Git folders\n##### Allow committing `.ipynb` notebook output\n\nBy default, the admin setting for Git folders doesn\u2019t allow `.ipynb` notebook output to be committed. Workspace admins can change this setting: \n1. Go to **Admin settings > Workspace settings**.\n2. Under **Git folders > Allow Git folders to Export IPYNB outputs**, select **Allow: IPYNB outputs can be toggled on**. \n![Admin console: Allow Git folders to Export IPYNB outputs.](https:\/\/docs.databricks.com\/_images\/allow-commit-ipynb.png) \nImportant \nWhen outputs are included, the visualization and dashboard configs are preserved with the .ipynb file format.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/manage-assets.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Manage file assets in Databricks Git folders\n##### Control IPYNB notebook output artifact commits\n\nWhen you commit an `.ipynb` file, Databricks creates a config file that lets you control how you commit outputs: `.databricks\/commit_outputs`. \n1. 
If you have a `.ipynb` notebook file but no config file in your repo, open the Git Status modal.\n2. In the notification dialog, click **Create commit\\_outputs file**. \n![Notebook commit UI: Create commit_outputs file button.](https:\/\/docs.databricks.com\/_images\/commit-message-output-option.png) \nYou can also generate config files from the **File** menu. The **File** menu has a control that lets you automatically update the config file to specify the inclusion or exclusion of outputs for a specific notebook. \n1. In the **File** menu, select **Commit notebooks outputs.** \n![Notebook editor: Commit notebooks outputs status and control.](https:\/\/docs.databricks.com\/_images\/commit-nb-outputs.png)\n2. In the dialog box, confirm your choice to commit notebook outputs. \n![Commit notebooks outputs dialog box.](https:\/\/docs.databricks.com\/_images\/commit-nb-outputs-db.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/manage-assets.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Manage file assets in Databricks Git folders\n##### Convert a source notebook to IPYNB\n\nYou can convert an existing source notebook in a Git folder to an IPYNB notebook through the Databricks UI. \n1. Open a source notebook in your workspace.\n2. Select **File** from the workspace menu, and then select **Change notebook format [source]**. If the notebook is already in IPYNB format, **[source]** will be **[ipynb]** in the menu element. \n![The workspace file menu, expanded, showing the Change notebook format option.](https:\/\/docs.databricks.com\/_images\/repos-change-notebook-format1.png)\n3. In the modal dialog, select \u201cJupyter notebook format (.ipynb)\u201d and click **Change**. 
\n![The modal dialog box where you can select the IPYNB notebook format.](https:\/\/docs.databricks.com\/_images\/repos-change-notebook-format2.png) \nYou can also: \n* Create new `.ipynb` notebooks.\n* View diffs as **Code diff** (code changes in cells) or **Raw diff** (code changes are presented as JSON syntax, which includes notebook outputs as metadata). \nFor more information on the kinds of notebooks supported in Databricks, read [Export and import Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/manage-assets.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n### Databricks notebook interface and controls\n##### Databricks notebooks: orientation to the new cell UI\n\nThe new cell UI is an updated look and feel for Databricks notebooks. This guide is designed to orient users who are familiar with the existing notebook UI.\n\n##### Databricks notebooks: orientation to the new cell UI\n###### Enable the new UI\n\nThe quickest way to preview the new UI is the preview tag available in the notebook header. \n![New cell UI preview tag](https:\/\/docs.databricks.com\/_images\/new-cell-ui-preview-tag.png) \nThis tag displays the current status. Click the tag and toggle the switch to **ON**. Then click **Reload page** next to the toggle. The page reloads with the new cell UI enabled. 
\nIf you click **Don\u2019t show this again** to remove the preview tag, you can click **View > Developer settings** at any time and toggle **New cell UI** under **Experimental features**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/new-cell-ui-orientation.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n### Databricks notebook interface and controls\n##### Databricks notebooks: orientation to the new cell UI\n###### Orientation\n\nThis section describes some commonly used features and where to find them in the new UI. \n### Run button \nThe run button is at the upper-left of the cell. Click the sideways-pointing arrow to run the cell with a single click. Click the downward-pointing arrow to display a menu. \n![Cell run menu - new UI](https:\/\/docs.databricks.com\/_images\/cell-run-new.png) \nWhen a cell is running, the run button displays a spinner and shows the current time spent running the command. You can click this button to cancel the execution. After a cell has finished running, the last run time and duration appear to the right of the button. Hover your cursor over this to see more details. \n![last run image](https:\/\/docs.databricks.com\/_images\/last-cell-run.png) \n### Cell numbers and titles \nCell numbers and titles appear in the center of the cell toolbar. To add or edit a title, click on the cell number or title. Cells with titles now appear in the [table of contents](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#notebook-toc) to assist you in navigating around a notebook. \n![add cell title](https:\/\/docs.databricks.com\/_images\/add-cell-title.gif) \n### Add cells shortcut \nTo add a new cell, hover in the space between cells. You have the option to add a new, empty code cell or a Markdown text cell. 
\n![buttons to create a new cell](https:\/\/docs.databricks.com\/_images\/create-cell.png) \n### Hidden code or results \nTo view hidden code or results, click the show icon ![show icon](https:\/\/docs.databricks.com\/_images\/show-icon.png) at the upper-right of the cell. \n### Floating toolbar \nThe toolbar remains visible when you scroll down a large code cell to provide more convenient access to cell status and actions. \n![floating toolbar](https:\/\/docs.databricks.com\/_images\/new-cell-ui-floating-toolbar.png) \n### Focus mode \nTo edit a single cell in full screen mode, use focus mode. Click the focus mode icon in the toolbar. \n![focus mode toolbar](https:\/\/docs.databricks.com\/_images\/focus-mode-toolbar.png) \nThis opens a full screen editor for the cell. Any results appear in the bottom panel. \n![focus mode view](https:\/\/docs.databricks.com\/_images\/focus-mode.png) \nYou can navigate to adjacent cells using the arrows on either side of the cell title or using the notebook [table of contents](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#notebook-toc). \n### Drag and drop to re-order cells \nTo move a cell up or down, click and hold the drag handle icon ![move cell icon](https:\/\/docs.databricks.com\/_images\/move-cell-icon.png) at the left of the cell. Move the cell to the desired location and release the mouse. \n![drag cell up or down](https:\/\/docs.databricks.com\/_images\/drag-and-drop.gif)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/new-cell-ui-orientation.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n### Databricks notebook interface and controls\n##### Databricks notebooks: orientation to the new cell UI\n###### Frequently asked questions\n\n### Can I remove the margin on the sides of the cell? \nYou can toggle this preference using the **View > Centered layout** setting in the notebook menu. 
\n![notebook menu](https:\/\/docs.databricks.com\/_images\/notebook-menu.png) \n### Where can I see detailed run information of a cell? \nMouse over the run information next to the run button to see a tooltip with detailed run information. \n![last run image](https:\/\/docs.databricks.com\/_images\/last-cell-run.png) \nIf you have a tabular result output this information is also accessible by hovering over the \u201cLast refreshed\u201d section of the UI. \n![tabular results last run image](https:\/\/docs.databricks.com\/_images\/last-refreshed-run-info.png) \n### How can I get line numbers back? \nUse **View > Line numbers** in the notebook menu to toggle line numbers on or off. \n### Where did the minimize cell icon go? \nThe minimize icon has been removed. To minimize a cell, double-click the drag handle or select **Collapse cell** in the cell menu. \n### Where did the dashboard icon go? \nSelect **Add to dashboard** in the cell menu. \n### How can I give additional feedback on the new cell UI? \nUse the **Provide feedback** link in the expanded preview tag or, if you have hidden the tag, in the notebook header.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/new-cell-ui-orientation.html"} +{"content":"# Get started: Account and workspace setup\n### Navigate the workspace\n\nThis article walks you through the Databricks workspace UI, an environment for accessing all of your Databricks objects. \nYou can manage the workspace using the workspace UI, the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html), and the [Workspace API](https:\/\/docs.databricks.com\/api\/workspace\/introduction). 
Most of the articles in the Databricks documentation focus on performing tasks using the workspace UI.\n\n### Navigate the workspace\n#### Homepage\n\nThe following sections of the workspace homepage provide shortcuts to common tasks and workspace objects to help you onboard to and navigate the Databricks Data Intelligence Platform: \n**Get started** \nThis section provides shortcuts to the following common tasks across product areas: \n* Import data using the **Create or modify table from file upload** page\n* Create a notebook\n* Create a query\n* Configure an AutoML experiment \nNote \nThe tiles that display on your homepage depend on your assigned [entitlements](https:\/\/docs.databricks.com\/security\/auth-authz\/entitlements.html). \n**Recents** \nThis section displays your recently viewed workspace objects across product areas, including files, notebooks, experiments, queries, dashboards, and alerts. \nYou can also access recents from the sidebar and from the search bar. \n**Popular** \nThis section displays objects with the most user interactions in the last 30 days across product areas, including files, notebooks, experiments, queries, dashboards, and alerts.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/index.html"} +{"content":"# Get started: Account and workspace setup\n### Navigate the workspace\n#### Sidebar\n\nThe following common Databricks Data Intelligence Platform categories are visible at the top of the sidebar: \n* Workspace\n* Recents\n* Data\n* Workflows\n* Compute \n![New navigation sidebar](https:\/\/docs.databricks.com\/_images\/common.png) \nNote \nThere is a lock icon next to items that require an entitlement you aren\u2019t assigned. 
\nThe features in the following sections are also always visible in the sidebar, grouped by product area: \n### SQL \n* SQL Editor\n* Queries\n* Dashboards\n* Alerts\n* Query History\n* SQL Warehouses \n![New navigation sidebar SQL task group](https:\/\/docs.databricks.com\/_images\/sql.png) \n### Data Engineering \n* Delta Live Tables \n![New navigation sidebar Data Engineering task group](https:\/\/docs.databricks.com\/_images\/data-engineering.png) \n### Machine Learning \n* Experiments\n* Feature Store\n* Models\n* Serving \n![New navigation sidebar ML task group](https:\/\/docs.databricks.com\/_images\/ml.png)\n\n### Navigate the workspace\n#### + New menu\n\nClick **+ New** to complete the following tasks: \n* Create workspace objects such as notebooks, queries, repos, dashboards, alerts, jobs, experiments, models, and serving endpoints\n* Create compute resources such as clusters, SQL warehouses, and ML endpoints\n* Upload CSV or TSV files to Delta Lake using the **Create or modify table from file upload** page or load data from various data sources using the add data UI \n![New navigation create menu](https:\/\/docs.databricks.com\/_images\/create-menu.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/index.html"} +{"content":"# Get started: Account and workspace setup\n### Navigate the workspace\n#### Full-page workspace browser\n\nThe full-page workspace browser experience unifies **Workspace** and **Git folders**. \n![Unified filebrowser in sidebar](https:\/\/docs.databricks.com\/_images\/sidebar-workspace.png) \nYou can browse content in Databricks Git folders alongside workspace objects by clicking **Workspace** in the sidebar. \n![Unified filebrowser with workspace objects Git folders content](https:\/\/docs.databricks.com\/_images\/unified-filebrowser.png) \nYou can also browse workspace content and Git folders content from within a notebook using a contextual browser. 
\n![Notebook contextual browser](https:\/\/docs.databricks.com\/_images\/contextual-browser.png)\n\n### Navigate the workspace\n#### Search\n\nUse the top bar to search for workspace objects such as notebooks, queries, dashboards, alerts, files, folders, libraries, tables registered in Unity Catalog, jobs, and repos in a single place. You can also access recently viewed objects in the search bar.\n\n### Navigate the workspace\n#### Workspace admin and user settings\n\nWorkspace admin and workspace user settings are unified across product areas. SQL settings are combined with general settings to create a unified experience for admin and non-admin users. \nAll workspace admin settings are now accessed from **Settings**. \nAll workspace user settings are now accessed from **Settings**. \n* The **Password** setting is on the **Profile** tab.\n* **SQL query snippets** (**Settings** > **Developer**) is visible to users with the Databricks SQL access entitlement.\n\n### Navigate the workspace\n#### Switch to a different workspace\n\nIf you have access to more than one workspace in the same account, you can quickly switch among them. \n1. Click the workspace name in the top bar of the Databricks workspace.\n2. Select a workspace from the drop-down to switch to it.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/index.html"} +{"content":"# Get started: Account and workspace setup\n### Navigate the workspace\n#### Change the workspace language settings\n\nThe workspace is available in multiple languages. 
To change the workspace language, click your username in the top navigation bar, select **Settings** and go to the **Preferences** tab.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/index.html"} +{"content":"# Get started: Account and workspace setup\n### Navigate the workspace\n#### Get help\n\nClick the help icon ![Help icon](https:\/\/docs.databricks.com\/_images\/in-product-help-icon.png) in the top bar of the workspace to access the in-product help experience ([Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html)). In-product help provides the following resources: \n* **Send Feedback**: Use the feedback form to submit product feedback from your workspace. See [Submit product feedback](https:\/\/docs.databricks.com\/resources\/ideas.html). \n* **Help Center**: Visit the help center to search across Databricks documentation, Databricks Knowledge Base articles, Apache Spark documentation, training courses, and Databricks forums, or submit a help ticket. \nImportant \nDatabricks plans to deprecate the support ticket submission experience in the Help Center and recommends using the in-product support options described in this section.\n* **Create Support Ticket**: If your organization has a [Databricks Support contract](https:\/\/docs.databricks.com\/resources\/support.html), you can create and submit a support ticket from your Databricks workspace. \nThe in-product experience will replace support ticket submission in the Help Center. For more information, see [What\u2019s coming?](https:\/\/docs.databricks.com\/whats-coming.html).\n* If your organization has a [Databricks Support contract](https:\/\/docs.databricks.com\/resources\/support.html), you can create a support ticket by typing \u201cI need support\u201d into the assistant.\n* Enter a question in the text box for the assistant. 
\nThe assistant is intended to help quickly answer questions that can be answered with Databricks documentation and knowledge base articles. Its answers are based on documentation that the AI can find related to the question. If it cannot find documentation related to the user question, it declines to answer. Results include a link to any documentation used to answer the question. The assistant is new, so mistakes are possible. Check the facts with the linked documentation and share feedback.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/index.html"} +{"content":"# Compute\n## Use compute\n#### Manage compute\n\nThis article describes how to manage Databricks compute, including displaying, editing, starting, terminating, deleting, controlling access, and monitoring performance and logs. You can also use the [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters) to manage compute programmatically.\n\n#### Manage compute\n##### View compute\n\nTo view your compute, click ![compute icon](https:\/\/docs.databricks.com\/_images\/clusters-icon.png) **Compute** in the workspace sidebar. \nOn the left side are two columns indicating if the compute has been pinned and the status of the compute. Hover over the status to get more information. \n### View compute configuration as a JSON file \nSometimes it can be helpful to view your compute configuration as JSON. This is especially useful when you want to create similar compute using the [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters). When you view an existing compute, go to the **Configuration** tab, click **JSON** in the top right of the tab, copy the JSON, and paste it into your API call. JSON view is read-only.\n\n#### Manage compute\n##### Pin a compute\n\n30 days after a compute is terminated, it is permanently deleted. 
To keep an all-purpose compute configuration after a compute has been [terminated](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-terminate) for more than 30 days, an administrator can pin the compute. Up to 100 compute resources can be pinned. \nAdmins can pin a compute from the compute list or the compute detail page by clicking the pin icon.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/clusters-manage.html"} +{"content":"# Compute\n## Use compute\n#### Manage compute\n##### Edit a compute\n\nYou can edit a compute\u2019s configuration from the compute details UI. \nNote \n* Notebooks and jobs that were attached to the compute remain attached after editing.\n* Libraries installed on the compute remain installed after editing.\n* If you edit any attribute of a running compute (except for the compute size and permissions), you must restart it. This can disrupt users who are currently using the compute.\n* You can only edit a running or terminated compute. You can, however, update *permissions* for compute not in those states on the compute details page.\n\n#### Manage compute\n##### Clone a compute\n\nTo clone an existing compute, select **Clone** from the compute\u2019s ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu. \nAfter you select **Clone**, the compute creation UI opens pre-populated with the compute configuration. The following attributes are NOT included in the clone: \n* Compute permissions\n* Attached notebooks \nIf you don\u2019t want to include the previously installed libraries in the cloned compute, click the drop-down menu next to the **Create compute** button and select **Create without libraries**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/clusters-manage.html"} +{"content":"# Compute\n## Use compute\n#### Manage compute\n##### Compute permissions\n\nThere are four permission levels for a compute: NO PERMISSIONS, CAN ATTACH TO, CAN RESTART, and CAN MANAGE. 
The table lists the abilities for each permission. \nImportant \nUsers with CAN ATTACH TO permissions can view the service account keys in the log4j file. Use caution when granting this permission level. \n| Ability | NO PERMISSIONS | CAN ATTACH TO | CAN RESTART | CAN MANAGE |\n| --- | --- | --- | --- | --- |\n| Attach notebook to compute | | x | x | x |\n| View Spark UI | | x | x | x |\n| View compute metrics | | x | x | x |\n| View driver logs | | | | x [(see note)](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#log-access-note) |\n| Terminate compute | | | x | x |\n| Start and restart compute | | | x | x |\n| Edit compute | | | | x |\n| Attach library to compute | | | | x |\n| Resize compute | | | | x |\n| Modify permissions | | | | x | \nWorkspace admins have the CAN MANAGE permission on all compute in their workspace. Users automatically have the CAN MANAGE permission on the compute they create. \nNote \n[Secrets](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html) are not redacted from a cluster\u2019s Spark driver log `stdout` and `stderr` streams. To protect sensitive data, by default, Spark driver logs are viewable only by users with CAN MANAGE permission on job, single user access mode, and shared access mode clusters. To allow users with CAN ATTACH TO or CAN RESTART permission to view the logs on these clusters, set the following Spark configuration property in the cluster configuration: `spark.databricks.acl.needAdminPermissionToViewLogs false`. \nOn No Isolation Shared access mode clusters, the Spark driver logs can be viewed by users with CAN ATTACH TO or CAN MANAGE permission. To limit who can read the logs to only users with the CAN MANAGE permission, set `spark.databricks.acl.needAdminPermissionToViewLogs` to `true`. \nSee [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration) to learn how to add Spark properties to a cluster configuration. 
\n### Configure compute permissions \nThis section describes how to manage permissions using the workspace UI. You can also use the [Permissions API](https:\/\/docs.databricks.com\/api\/workspace\/permissions) or [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html). \nYou must have the CAN MANAGE permission on a compute to configure compute permissions. \n1. In the sidebar, click **Compute**.\n2. On the row for the compute, click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) on the right, and select **Edit permissions**.\n3. In **Permission Settings**, click the **Select user, group or service principal\u2026** drop-down menu and select a user, group, or service principal.\n4. Select a permission from the permission drop-down menu.\n5. Click **Add** and click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/clusters-manage.html"} +{"content":"# Compute\n## Use compute\n#### Manage compute\n##### Terminate a compute\n\nTo save compute resources, you can terminate a compute. The terminated compute\u2019s configuration is stored so that it can be [reused](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-start) (or, in the case of jobs, [autostarted](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#autostart-clusters)) at a later time. You can manually terminate a compute or configure the compute to terminate automatically after a specified period of inactivity. When the number of terminated compute exceeds 150, the oldest compute is deleted. \nUnless a compute is [pinned](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-pin) or restarted, it is automatically and permanently deleted 30 days after termination. \nTerminated compute appear in the compute list with a gray circle at the left of the compute name. 
\nNote \nWhen you run a [job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) on a new Job compute (which is usually recommended), the compute terminates and is unavailable for restarting when the job is complete. On the other hand, if you schedule a job to run on an existing All-Purpose compute that has been terminated, that compute will [autostart](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#autostart-clusters). \n### Manual termination \nYou can manually terminate a compute from the compute list (by clicking the square on the compute\u2019s row) or the compute detail page (by clicking **Terminate**). \n### Automatic termination \nYou can also set auto termination for a compute. During compute creation, you can specify an inactivity period in minutes after which you want the compute to terminate. \nIf the difference between the current time and the last command run on the compute is more than the inactivity period specified, Databricks automatically terminates that compute. \nA compute is considered inactive when all commands on the compute, including Spark jobs, Structured Streaming, and JDBC calls, have finished executing. This does not include commands run by SSH-ing into the compute and running bash commands. \nWarning \n* Compute do not report activity resulting from the use of DStreams. This means that an auto-terminating compute may be terminated while it is running DStreams. Turn off auto termination for compute running DStreams or consider using Structured Streaming.\n* Idle compute continue to accumulate DBU and cloud instance charges during the inactivity period before termination. \n#### Configure automatic termination \nYou can configure automatic termination in the new compute UI. Ensure that the box is checked, and enter the number of minutes in the **Terminate after \\_\\_\\_ of minutes of inactivity** setting. 
\nYou can opt out of auto termination by clearing the Auto Termination checkbox or by specifying an inactivity period of `0`. \nNote \nAuto termination is best supported in the latest Spark versions. Older Spark versions have known limitations which can result in inaccurate reporting of compute activity. For example, compute running JDBC, R, or streaming commands can report a stale activity time that leads to premature compute termination. Please upgrade to the most recent Spark version to benefit from bug fixes and improvements to auto termination. \n### Unexpected termination \nSometimes a compute is terminated unexpectedly, not as a result of a manual termination or a configured automatic termination. \nFor a list of termination reasons and remediation steps, see the [Knowledge Base](https:\/\/kb.databricks.com\/clusters\/termination-reasons.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/clusters-manage.html"} +{"content":"# Compute\n## Use compute\n#### Manage compute\n##### Delete a compute\n\nDeleting a compute terminates the compute and removes its configuration. To delete a compute, select **Delete** from the compute\u2019s ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) menu. \nWarning \nYou cannot undo this action. \nTo delete a pinned compute, it must first be unpinned by an administrator. \nYou can also invoke the [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters) endpoint to delete a compute programmatically.\n\n#### Manage compute\n##### Restart a compute\n\nYou can restart a previously terminated compute from the compute list, the compute detail page, or a notebook. You can also invoke the [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters) endpoint to start a compute programmatically. \nDatabricks identifies a compute using its unique [cluster ID](https:\/\/docs.databricks.com\/api\/workspace\/clusters). 
When you start a terminated compute, Databricks re-creates the compute with the same ID, automatically installs all the libraries, and reattaches the notebooks. \n### Restart a compute to update it with the latest images \nWhen you restart a compute, it gets the latest images for the compute resource containers and the VM hosts. It is important to schedule regular restarts for long-running compute such as those used for processing streaming data. \nIt is your responsibility to restart all compute resources regularly to keep the image up-to-date with the latest image version. \nImportant \nIf you enable the [compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html) for your account or your workspace, long-running compute is automatically restarted as needed during a scheduled maintenance window. This reduces the risk of an auto-restart disrupting a scheduled job. You can also force restart during the maintenance window. See [Automatic cluster update](https:\/\/docs.databricks.com\/admin\/clusters\/automatic-cluster-update.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/clusters-manage.html"} +{"content":"# Compute\n## Use compute\n#### Manage compute\n##### Notebook example: Find long-running compute\n\nIf you are a workspace admin, you can run a script that determines how long each of your compute has been running, and optionally, restart them if they are older than a specified number of days. Databricks provides this script as a notebook. \nNote \nIf your workspace is part of the [public preview of automatic compute update](https:\/\/docs.databricks.com\/admin\/clusters\/automatic-cluster-update.html), you might not need this script. Compute restarts automatically if needed during the scheduled maintenance windows. \nThe first lines of the script define configuration parameters: \n* `min_age_output`: The maximum number of days that a compute can run. 
Default is 1.\n* `perform_restart`: If `True`, the script restarts any compute with age greater than the number of days specified by `min_age_output`. The default is `False`, which identifies long-running compute but does not restart them.\n* `secret_configuration`: Replace `REPLACE_WITH_SCOPE` and `REPLACE_WITH_KEY` with a [secret scope and key name](https:\/\/docs.databricks.com\/security\/secrets\/index.html). For more details of setting up the secrets, see the notebook. \nWarning \nIf you set `perform_restart` to `True`, the script automatically restarts eligible compute, which can cause active jobs to fail and reset open notebooks. To reduce the risk of disrupting your workspace\u2019s business-critical jobs, plan a scheduled maintenance window and be sure to notify the workspace users. \n### Identify and optionally restart long-running compute \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/clusters-long-running-optional-restart.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/clusters-manage.html"} +{"content":"# Compute\n## Use compute\n#### Manage compute\n##### Compute autostart for jobs and JDBC\/ODBC queries\n\nWhen a job assigned to a terminated compute is scheduled to run, or you connect to a terminated compute from a JDBC\/ODBC interface, the compute is automatically restarted. See [Create a job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#job-create) and [JDBC connect](https:\/\/docs.databricks.com\/integrations\/jdbc\/index.html). \nCompute autostart allows you to configure compute to auto-terminate without requiring manual intervention to restart the compute for scheduled jobs. Furthermore, you can schedule compute initialization by scheduling a job to run on a terminated compute. 
\nBefore a compute is restarted automatically, [compute](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions) and [job](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#jobs) access control permissions are checked. \nNote \nIf your compute was created in Databricks platform version 2.70 or earlier, there is no autostart: jobs scheduled to run on terminated compute will fail.\n\n#### Manage compute\n##### View compute information in the Apache Spark UI\n\nYou can view detailed information about Spark jobs by selecting the **Spark UI** tab on the compute details page. \nIf you restart a terminated compute, the Spark UI displays information for the restarted compute, not the historical information for the terminated compute. \nSee [Diagnose cost and performance issues using the Spark UI](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/index.html) to walk through diagnosing cost and performance issues using the Spark UI.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/clusters-manage.html"} +{"content":"# Compute\n## Use compute\n#### Manage compute\n##### View compute logs\n\nDatabricks provides three kinds of logging of compute-related activity: \n* Compute event logs, which capture compute lifecycle events like creation, termination, and configuration edits.\n* Apache Spark driver and worker logs, which you can use for debugging.\n* Compute init-script logs, which are valuable for debugging init scripts. \nThis section discusses compute event logs and driver and worker logs. For details about init-script logs, see [Init script logging](https:\/\/docs.databricks.com\/init-scripts\/logs.html). \n### Compute event logs \nThe compute event log displays important compute lifecycle events that are triggered manually by user actions or automatically by Databricks. Such events affect the operation of a compute as a whole and the jobs running in the compute. 
\nFor supported event types, see the [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters) data structure. \nEvents are stored for 60 days, which is comparable to other data retention times in Databricks. \n#### View a compute\u2019s event log \nTo view the compute\u2019s event log, select the **Event log** tab on the compute details page. \nFor more information about an event, click its row in the log, then click the **JSON** tab for details. \n### Compute driver and worker logs \nThe direct print and log statements from your notebooks, jobs, and libraries go to the Spark driver logs. You can access these log files from the **Driver logs** tab on the compute details page. Click the name of a log file to download it. \nThese logs have three outputs: \n* Standard output\n* Standard error\n* Log4j logs \nTo view Spark worker logs, use the **Spark UI** tab. You can also [configure a log delivery location](https:\/\/docs.databricks.com\/compute\/configure.html#cluster-log-delivery) for the compute. Both worker and compute logs are delivered to the location you specify.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/clusters-manage.html"} +{"content":"# Compute\n## Use compute\n#### Manage compute\n##### Monitor performance\n\nTo help you monitor the performance of Databricks compute, Databricks provides access to metrics from the compute details page. For Databricks Runtime 12.2 and below, Databricks provides access to [Ganglia](http:\/\/ganglia.sourceforge.net\/) metrics. For Databricks Runtime 13.3 LTS and above, compute metrics are provided by Databricks. \nYou can also install [Datadog](https:\/\/www.datadoghq.com\/) agents on compute nodes to send Datadog metrics to your Datadog account. \n### Compute metrics \nCompute metrics is the default monitoring tool for Databricks Runtime 13.3 LTS and above. To access the compute metrics UI, navigate to the **Metrics** tab on the compute details page. 
\nYou can view historical metrics by selecting a time range using the date picker filter. Metrics are collected every minute. You can also get the latest metrics by clicking the **Refresh** button. For more information, see [View compute metrics](https:\/\/docs.databricks.com\/compute\/cluster-metrics.html). \n### Ganglia metrics \nNote \nGanglia metrics are only available for Databricks Runtime 12.2 and below. \nTo access the Ganglia UI, navigate to the **Metrics** tab on the compute details page. CPU metrics are available in the Ganglia UI for all Databricks runtimes. GPU metrics are available for GPU-enabled compute. \nTo view live metrics, click the **Ganglia UI** link. \nTo view historical metrics, click a snapshot file. The snapshot contains aggregated metrics for the hour preceding the selected time. \nNote \nGanglia isn\u2019t supported with Docker containers. If you use a [Docker container](https:\/\/docs.databricks.com\/compute\/custom-containers.html) with your compute, Ganglia metrics will not be available. \n#### Configure Ganglia metrics collection \nBy default, Databricks collects Ganglia metrics every 15 minutes. To configure the collection period, set the `DATABRICKS_GANGLIA_SNAPSHOT_PERIOD_MINUTES` environment variable using an [init script](https:\/\/docs.databricks.com\/init-scripts\/index.html) or in the `spark_env_vars` field in the [Create cluster API](https:\/\/docs.databricks.com\/api\/workspace\/clusters).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/clusters-manage.html"} +{"content":"# Compute\n## Use compute\n#### Manage compute\n##### Notebook example: Datadog metrics\n\n![Datadog metrics](https:\/\/docs.databricks.com\/_images\/datadog-metrics.png) \nYou can install [Datadog](https:\/\/www.datadoghq.com\/) agents on compute nodes to send Datadog metrics to your Datadog account. 
The following notebook demonstrates how to install a Datadog agent on a compute using a [compute-scoped init script](https:\/\/docs.databricks.com\/init-scripts\/cluster-scoped.html). \nTo install the Datadog agent on all compute, manage the compute-scoped init script using a compute policy. \n### Install Datadog agent init script notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/datadog-init-script.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/clusters-manage.html"} +{"content":"# Compute\n## Use compute\n#### Manage compute\n##### Decommission spot instances\n\nBecause [spot instances](https:\/\/docs.databricks.com\/compute\/configure.html#spot-instances) can reduce costs, creating compute using spot instances rather than on-demand instances is a common way to run jobs. However, spot instances can be preempted by cloud provider scheduling mechanisms. Preemption of spot instances can cause issues with jobs that are running, including: \n* Shuffle fetch failures\n* Shuffle data loss\n* RDD data loss\n* Job failures \nYou can enable decommissioning to help address these issues. Decommissioning takes advantage of the notification that the cloud provider usually sends before a spot instance is decommissioned. When a spot instance containing an executor receives a preemption notification, the decommissioning process will attempt to migrate shuffle and RDD data to healthy executors. The duration before the final preemption is typically 30 seconds to 2 minutes, depending on the cloud provider. \nDatabricks recommends enabling data migration when decommissioning is also enabled. Generally, the possibility of errors decreases as more data is migrated, including shuffle fetching failures, shuffle data loss, and RDD data loss. Data migration can also lead to less re-computation and saved costs. 
\nNote \nDecommissioning is a best effort and does not guarantee that all data can be migrated before final preemption. Decommissioning cannot guarantee against shuffle fetch failures when running tasks are fetching shuffle data from the executor. \nWith decommissioning enabled, task failures caused by spot instance preemption are not added to the total number of failed attempts. Task failures caused by preemption are not counted as failed attempts because the cause of the failure is external to the task and will not result in job failure. \n### Enable decommissioning \nTo enable decommissioning on a compute, enter the following properties in the **Spark** tab under **Advanced Options** in the compute configuration UI. For information on these properties, see [Spark configuration](https:\/\/spark.apache.org\/docs\/latest\/configuration.html). \n* To enable decommissioning for applications, enter this property in the **Spark config** field: \n```\nspark.decommission.enabled true\n\n```\n* To enable shuffle data migration during decommissioning, enter this property in the **Spark config** field: \n```\nspark.storage.decommission.enabled true\nspark.storage.decommission.shuffleBlocks.enabled true\n\n```\n* To enable RDD cache data migration during decommissioning, enter this property in the **Spark config** field: \n```\nspark.storage.decommission.enabled true\nspark.storage.decommission.rddBlocks.enabled true\n\n``` \nNote \nWhen RDD StorageLevel replication is set to more than 1, Databricks does not recommend enabling RDD data migration since the replicas ensure RDDs will not lose data.\n* To enable decommissioning for workers, enter this property in the **Environment Variables** field: \n```\nSPARK_WORKER_OPTS=\"-Dspark.decommission.enabled=true\"\n\n``` \n### View the decommission status and loss reason in the UI \nTo access a worker\u2019s decommission status from the UI, navigate to the **Spark compute UI - Master** tab. 
\nWhen the decommissioning finishes, you can view the executor\u2019s loss reason in the **Spark UI > Executors** tab on the compute details page.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/clusters-manage.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Read and write data in Delta Live Tables pipelines\n##### Use Unity Catalog with your Delta Live Tables pipelines\n\nPreview \nDelta Live Tables support for Unity Catalog is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nIn addition to the existing support for persisting tables to the [Hive metastore](https:\/\/docs.databricks.com\/delta-live-tables\/publish.html), you can use [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) with your Delta Live Tables pipelines to: \n* Define a catalog in Unity Catalog where your pipeline will persist tables.\n* Read data from Unity Catalog tables. \nYour workspace can contain pipelines that use Unity Catalog or the Hive metastore. However, a single pipeline cannot write to both the Hive metastore and Unity Catalog and existing pipelines cannot be upgraded to use Unity Catalog. Your existing pipelines that do not use Unity Catalog are not affected by this preview, and will continue to persist data to the Hive metastore using the configured storage location. \nUnless specified otherwise in this document, all existing data sources and Delta Live Tables functionality are supported with pipelines that use Unity Catalog. Both the [Python](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html) and [SQL](https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html) interfaces are supported with pipelines that use Unity Catalog. \nThe tables created in your pipeline can also be queried from shared Unity Catalog clusters using Databricks Runtime 13.3 LTS and above or a SQL warehouse. 
Tables cannot be queried from assigned or no isolation clusters. \nTo manage permissions on the tables created by a Unity Catalog pipeline, use [GRANT and REVOKE](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Read and write data in Delta Live Tables pipelines\n##### Use Unity Catalog with your Delta Live Tables pipelines\n###### Requirements\n\nThe following are required to create tables in Unity Catalog from a Delta Live Tables pipeline: \n* You must have `USE CATALOG` privileges on the target catalog.\n* You must have `CREATE MATERIALIZED VIEW` and `USE SCHEMA` privileges in the target schema if your pipeline creates [materialized views](https:\/\/docs.databricks.com\/delta-live-tables\/index.html#dlt-datasets).\n* You must have `CREATE TABLE` and `USE SCHEMA` privileges in the target schema if your pipeline creates [streaming tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html#dlt-datasets).\n* If a target schema is not specified in the pipeline settings, you must have `CREATE MATERIALIZED VIEW` or `CREATE TABLE` privileges on at least one schema in the target catalog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Read and write data in Delta Live Tables pipelines\n##### Use Unity Catalog with your Delta Live Tables pipelines\n###### Limitations\n\nThe following are limitations when using Unity Catalog with Delta Live Tables: \n* By default, only the pipeline owner and workspace admins have permission to view the driver logs from the cluster that runs a Unity Catalog-enabled pipeline. 
To enable access for other users to view the driver logs, see [Allow non-admin users to view the driver logs from a Unity Catalog-enabled pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#driver-log-permissions).\n* Existing pipelines that use the Hive metastore cannot be upgraded to use Unity Catalog. To migrate an existing pipeline that writes to Hive metastore, you must create a new pipeline and re-ingest data from the data source(s). \n* You cannot create a Unity Catalog-enabled pipeline in a workspace attached to a metastore created during the Unity Catalog public preview. See [Upgrade to privilege inheritance](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/upgrade-privilege-model.html). \n* Third-party libraries and JARs are not supported.\n* Data manipulation language (DML) queries that modify the schema of a streaming table are not supported.\n* A materialized view created in a Delta Live Tables pipeline cannot be used as a streaming source outside of that pipeline, for example, in another pipeline or in a downstream notebook.\n* Publishing to schemas that specify a managed storage location is supported only in the [preview channel](https:\/\/docs.databricks.com\/release-notes\/delta-live-tables\/index.html#runtime-channels).\n* If a pipeline publishes to a schema with a managed storage location, the schema can be changed in a subsequent update, but only if the updated schema uses the same storage location as the previously specified schema.\n* If the target schema specifies a storage location, all tables are stored there. If a schema storage location is not specified, tables are stored in the catalog storage location if the target catalog specifies one. 
If schema and catalog storage locations are not specified, tables are stored in the root storage location of the metastore where the tables are published.\n* The **History** tab in Catalog Explorer does not show history for streaming tables and materialized views.\n* The `LOCATION` property is not supported when defining a table.\n* Unity Catalog-enabled pipelines cannot publish to the Hive metastore. \n* Python UDF support is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To use Python UDFs, your pipeline must use the [preview channel](https:\/\/docs.databricks.com\/delta-live-tables\/properties.html#config-settings). \n* You cannot use [Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html) with a Delta Live Tables materialized view or streaming table published to Unity Catalog.\n* You cannot use the `event_log` [table valued function](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html#event-log-with-uc) in a pipeline or query to access the event logs of multiple pipelines.\n* You cannot share a view created over the `event_log` [table valued function](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html#event-log-with-uc) with other users.\n* Single-node clusters are not supported with Unity Catalog-enabled pipelines. Because Delta Live Tables might create a single-node cluster to run smaller pipelines, your pipeline might fail with an error message referencing `single-node mode`. If this occurs, make sure you specify at least one worker when you [Configure your compute settings](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#cluster-config).\n* Tables created in a Unity Catalog-enabled pipeline cannot be queried from assigned or no isolation clusters. 
To query tables created by a Delta Live Tables pipeline, you must use a shared access mode cluster using Databricks Runtime 13.3 LTS and above or a SQL warehouse.\n* Delta Live Tables uses a shared access mode cluster to run a Unity Catalog-enabled pipeline. A Unity Catalog-enabled pipeline cannot run on an assigned cluster. To learn about limitations of shared access mode with Unity Catalog, see [Shared access mode limitations on Unity Catalog](https:\/\/docs.databricks.com\/compute\/access-mode-limitations.html#shared-limitations).\n* You cannot use [row filters or column masks](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html) with materialized views or streaming tables published to Unity Catalog. \nNote \nThe underlying files supporting materialized views might include data from upstream tables (including possible personally identifiable information) that do not appear in the materialized view definition. This data is automatically added to the underlying storage to support incremental refreshing of materialized views. \nBecause the underlying files of a materialized view might risk exposing data from upstream tables not part of the materialized view schema, Databricks recommends not sharing the underlying storage with untrusted downstream consumers. \nFor example, suppose the definition of a materialized view includes a `COUNT(DISTINCT field_a)` clause. 
Even though the materialized view definition only includes the aggregate `COUNT DISTINCT` clause, the underlying files will contain a list of the actual values of `field_a`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Read and write data in Delta Live Tables pipelines\n##### Use Unity Catalog with your Delta Live Tables pipelines\n###### Changes to existing functionality\n\nWhen Delta Live Tables is configured to persist data to Unity Catalog, the lifecycle of the table is managed by the Delta Live Tables pipeline. Because the pipeline manages the table lifecycle and permissions: \n* When a table is removed from the Delta Live Tables pipeline definition, the corresponding materialized view or streaming table entry is removed from Unity Catalog on the next pipeline update. The actual data is retained for a period of time so that it can be recovered if it was deleted by mistake. The data can be recovered by adding the materialized view or streaming table back into the pipeline definition.\n* Deleting the Delta Live Tables pipeline results in deletion of all tables defined in that pipeline. Because of this change, the Delta Live Tables UI is updated to prompt you to confirm deletion of a pipeline.\n* Internal backing tables, including backing tables used to support `APPLY CHANGES INTO`, are not directly accessible by users.\n\n##### Use Unity Catalog with your Delta Live Tables pipelines\n###### Write tables to Unity Catalog from a Delta Live Tables pipeline\n\nNote \nIf you do not select a catalog and target schema for a pipeline, tables are not published to Unity Catalog and can only be accessed by queries in the same pipeline. 
\nTo write your tables to Unity Catalog, when you [create a pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#create-pipeline), select **Unity Catalog** under **Storage options**, select a catalog in the **Catalog** drop-down menu, and select an existing schema or enter the name for a new schema in the **Target schema** drop-down menu. To learn about Unity Catalog catalogs, see [Catalogs](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#catalog). To learn about schemas in Unity Catalog, see [Schemas](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#schema).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Read and write data in Delta Live Tables pipelines\n##### Use Unity Catalog with your Delta Live Tables pipelines\n###### Ingest data into a Unity Catalog pipeline\n\nYour pipeline configured to use Unity Catalog can read data from: \n* Unity Catalog managed and external tables, views, materialized views and streaming tables.\n* Hive metastore tables and views.\n* Auto Loader using the `cloud_files()` function to read from Unity Catalog external locations.\n* Apache Kafka and Amazon Kinesis. \nThe following are examples of reading from Unity Catalog and Hive metastore tables. 
\n### Batch ingestion from a Unity Catalog table \n```\nCREATE OR REFRESH LIVE TABLE\ntable_name\nAS SELECT\n*\nFROM\nmy_catalog.my_schema.table1;\n\n``` \n```\nimport dlt\n\n@dlt.table\ndef table_name():\n    return spark.table(\"my_catalog.my_schema.table\")\n\n``` \n### Stream changes from a Unity Catalog table \n```\nCREATE OR REFRESH STREAMING TABLE\ntable_name\nAS SELECT\n*\nFROM\nSTREAM(my_catalog.my_schema.table1);\n\n``` \n```\n@dlt.table\ndef table_name():\n    return spark.readStream.table(\"my_catalog.my_schema.table\")\n\n``` \n### Ingest data from Hive metastore \nA pipeline that uses Unity Catalog can read data from Hive metastore tables using the `hive_metastore` catalog: \n```\nCREATE OR REFRESH LIVE TABLE\ntable_name\nAS SELECT\n*\nFROM\nhive_metastore.some_schema.table;\n\n``` \n```\n@dlt.table\ndef table3():\n    return spark.table(\"hive_metastore.some_schema.table\")\n\n``` \n### Ingest data from Auto Loader \n```\nCREATE OR REFRESH STREAMING TABLE\ntable_name\nAS SELECT\n*\nFROM\ncloud_files(\n<path-to-uc-external-location>,\n\"json\"\n)\n\n``` \n```\n@dlt.table(table_properties={\"quality\": \"bronze\"})\ndef table_name():\n    return (\n        spark.readStream.format(\"cloudFiles\")\n        .option(\"cloudFiles.format\", \"json\")\n        .load(f\"{path_to_uc_external_location}\")\n    )\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Read and write data in Delta Live Tables pipelines\n##### Use Unity Catalog with your Delta Live Tables pipelines\n###### Share materialized views (live tables)\n\nBy default, the tables created by a pipeline can be queried only by the pipeline owner. You can give other users the ability to query a table by using [GRANT](https:\/\/docs.databricks.com\/sql\/language-manual\/security-grant.html) statements and you can revoke query access using [REVOKE](https:\/\/docs.databricks.com\/sql\/language-manual\/security-revoke.html) statements. 
For more information about privileges in Unity Catalog, see [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html). \n### Grant select on a table \n```\nGRANT SELECT ON TABLE\nmy_catalog.my_schema.live_table\nTO\n`user@databricks.com`\n\n``` \n### Revoke select on a table \n```\nREVOKE SELECT ON TABLE\nmy_catalog.my_schema.live_table\nFROM\n`user@databricks.com`\n\n```\n\n##### Use Unity Catalog with your Delta Live Tables pipelines\n###### Grant create table or create materialized view privileges\n\n```\nGRANT CREATE { MATERIALIZED VIEW | TABLE } ON SCHEMA\nmy_catalog.my_schema\nTO\n{ principal | user }\n\n```\n\n##### Use Unity Catalog with your Delta Live Tables pipelines\n###### View lineage for a pipeline\n\nLineage for tables in a Delta Live Tables pipeline is visible in Catalog Explorer. For materialized views or streaming tables in a Unity Catalog-enabled pipeline, the Catalog Explorer lineage UI shows the upstream and downstream tables. To learn more about Unity Catalog lineage, see [Capture and view data lineage using Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html). 
\nFor a materialized view or streaming table in a Unity Catalog-enabled Delta Live Tables pipeline, the Catalog Explorer lineage UI will also link to the pipeline that produced the materialized view or streaming table if the pipeline is accessible from the current workspace.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Read and write data in Delta Live Tables pipelines\n##### Use Unity Catalog with your Delta Live Tables pipelines\n###### Add, change, or delete data in a streaming table\n\nYou can use [data manipulation language](https:\/\/docs.databricks.com\/sql\/language-manual\/index.html#dml-statements) (DML) statements, including insert, update, delete, and merge statements, to modify streaming tables published to Unity Catalog. Support for DML queries against streaming tables enables use cases such as updating tables for General Data Protection Regulation (GDPR) compliance. \nNote \n* DML statements that modify the table schema of a streaming table are not supported. Ensure that your DML statements do not attempt to evolve the table schema.\n* DML statements that update a streaming table can be run only in a shared Unity Catalog cluster or a SQL warehouse using Databricks Runtime 13.3 LTS and above.\n* Because streaming requires append-only data sources, if your processing requires streaming from a source streaming table with changes (for example, by DML statements), set the [skipChangeCommits flag](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html#ignore-changes) when reading the source streaming table. When `skipChangeCommits` is set, transactions that delete or modify records on the source table are ignored. If your processing does not require a streaming table, you can use a materialized view (which does not have the append-only restriction) as the target table. 
\nThe following are examples of DML statements to modify records in a streaming table. \n### Delete records with a specific ID: \n```\nDELETE FROM my_streaming_table WHERE id = 123;\n\n``` \n### Update records with a specific ID: \n```\nUPDATE my_streaming_table SET name = 'Jane Doe' WHERE id = 123;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html"} +{"content":"# What is data warehousing on Databricks?\n## Get started with data warehousing using Databricks SQL\n#### Tutorial: Use sample dashboards\n\nThis tutorial shows you how to import and use sample dashboards from the samples gallery. These dashboards illustrate some of the rich visualizations you can use to gain insights from your data. No setup is required. These dashboards use data already available in your workspace and rely on a compute resource (called a SQL warehouse) already configured. You don\u2019t need to be an administrator to get started. \n![A published dashboard from the samples gallery](https:\/\/docs.databricks.com\/_images\/lakeview-published-sample-retail.png) \nSee [Dashboards](https:\/\/docs.databricks.com\/dashboards\/index.html) to learn about all of the visualization types and features available for dashboards.\n\n#### Tutorial: Use sample dashboards\n##### Import a dashboard\n\n1. In the sidebar, click ![Dashboards Icon](https:\/\/docs.databricks.com\/_images\/dashboards-icon.png) **Dashboards** \nIf your workspace has any saved dashboards, they are listed.\n2. Click **View samples gallery**.\n3. In the **Retail Revenue & Supply Chain** tile, click **Import**. The dashboard is imported into your workspace, and you are the owner. \nThe imported draft dashboard appears, and its visualizations are refreshed. \n![Draft Retail Revenue & Supply Chain dashboard](https:\/\/docs.databricks.com\/_images\/lakeview-samples-gallery-draft.png) \nYou can import a sample dashboard multiple times, and multiple users can each import it. 
You can also import the **NYC Taxi Trip Analysis** dashboard.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/get-started\/sample-dashboards.html"} +{"content":"# What is data warehousing on Databricks?\n## Get started with data warehousing using Databricks SQL\n#### Tutorial: Use sample dashboards\n##### Explore a visualization\u2019s query\n\n1. Each visualization in a dashboard is the result of a query. You can access all queries in the **Data** tab on the draft dashboard. Click **Data** in the upper-left corner of the screen. Then, click the dataset you want to view to see the associated query. \n![Dashboard data tab with queries](https:\/\/docs.databricks.com\/_images\/lakeview-samples-data-tab.png) \nThe SQL editor includes the query and results, which are shown in a table below the query. \nThe sample dashboards use data in the `samples` catalog, separate from data in your workspace. The `samples` catalog is available to every workspace but is read-only. \n1. Click the **Canvas** tab to go back to the canvas that shows the dashboard\u2019s visualization widgets.\n\n#### Tutorial: Use sample dashboards\n##### Interact with a visualization\n\n1. Hover over the **Revenue by Order Priority** visualization.\n2. Click each **Priority** in the legend to focus on that group of data and hide the other lines.\n3. Right-click on the visualization to see its context menu. You can delete or clone a visualization. You can also download the associated dataset as a CSV, TSV, or Excel file. Click **Go to Revenue by Order Priority** to view the associated query. 
\n![Revenue by Order Priority visualization and context menu](https:\/\/docs.databricks.com\/_images\/lakeview-widget-context-menu.png) \nThe query opens on the **Data** tab of your dashboard.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/get-started\/sample-dashboards.html"} +{"content":"# What is data warehousing on Databricks?\n## Get started with data warehousing using Databricks SQL\n#### Tutorial: Use sample dashboards\n##### Publish the dashboard\n\n* Click **Publish** at the top of the page. A **Publish** dialog appears.\n* Click **Publish** in the dialog to create a sharable, non-editable version of your dashboard. This dashboard is published with your credentials embedded by default. This means that other viewers use your credentials to access the data and compute to generate visualizations on your dashboard. See [Publish a dashboard](https:\/\/docs.databricks.com\/dashboards\/index.html#publish-a-dashboard).\n* Use the switcher at the top of the page to view your published dashboard. \n![Drop-down menu showing available draft and published dashboard versions.](https:\/\/docs.databricks.com\/_images\/draft-published-switcher.png)\n\n#### Tutorial: Use sample dashboards\n##### Share the dashboard\n\nTo share a dashboard with colleagues in your workspace: \n1. Click **Share** at the top of the page.\n2. Select a user or group in your workspace. \nTo share the dashboard with all users in the workspace, select **All workspace users**. Then, click **Add**.\n3. Select the permission to grant. \n![Share dashboard permissions](https:\/\/docs.databricks.com\/_images\/share-dashboard-permission.png) \nTo share a dashboard with account users: \n1. Under **Sharing settings** at the bottom of the sharing dialog, click **Anyone in my organization can view.** \nThis means that anyone who is registered to your Databricks account can use a link to access your dashboard. 
If you have embedded your credentials, account-level users don\u2019t need workspace access to view your dashboard.\n2. Close the form.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/get-started\/sample-dashboards.html"} +{"content":"# What is data warehousing on Databricks?\n## Get started with data warehousing using Databricks SQL\n#### Tutorial: Use sample dashboards\n##### Schedule automatic dashboard refreshes\n\nYou can schedule the dashboard to refresh at an interval automatically. \n1. At the top of the page, click **Schedule**.\n2. Click **Add schedule**.\n3. Select an interval, such as **Every 1 hour** at **5 minutes past the hour**. The SQL warehouse that you selected to run your queries is used to run the dashboard\u2019s queries and generate visualizations when the dashboard is refreshed. \nWorkspace admin users can create, configure, and delete SQL warehouses.\n4. Click **Create**. \nThe dialog shows all schedules associated with the dashboard.\n5. Optionally, click **Subscribe** to add yourself as a subscriber and receive an email with a PDF snapshot of the dashboard after a scheduled run completes. \nYou can use the kebab menu ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) to edit the schedule and add more subscribers. See [Schedule dashboards for periodic updates](https:\/\/docs.databricks.com\/dashboards\/index.html#schedule-dashboards-for-periodic-updates).\n6. To delete an existing schedule for a dashboard: \n1. Click **Subscribe**.\n2. Click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) on the right.\n3. Click **Delete**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/get-started\/sample-dashboards.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Run a Databricks notebook from another notebook\n\nImportant \nFor notebook orchestration, use Databricks Jobs. For code modularization scenarios, use workspace files. 
You should only use the techniques described in this article when your use case cannot be implemented using a Databricks job, such as for looping notebooks over a dynamic set of parameters, or if you do not have access to [workspace files](https:\/\/docs.databricks.com\/files\/workspace.html). For more information, see [Databricks Jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) and [share code](https:\/\/docs.databricks.com\/notebooks\/share-code.html).\n\n#### Run a Databricks notebook from another notebook\n##### Comparison of `%run` and `dbutils.notebook.run()`\n\nThe `%run` command allows you to include another notebook within a notebook. You can use `%run` to modularize your code, for example by putting supporting functions in a separate notebook. You can also use it to concatenate notebooks that implement the steps in an analysis. When you use `%run`, the called notebook is immediately executed and the functions and variables defined in it become available in the calling notebook. \nThe `dbutils.notebook` API is a complement to `%run` because it lets you pass parameters to and return values from a notebook. This allows you to build complex workflows and pipelines with dependencies. For example, you can get a list of files in a directory and pass the names to another notebook, which is not possible with `%run`. You can also create if-then-else workflows based on return values or call other notebooks using relative paths. \nUnlike `%run`, the `dbutils.notebook.run()` method starts a new job to run the notebook. \nThese methods, like all of the `dbutils` APIs, are available only in Python and Scala. 
However, you can use `dbutils.notebook.run()` to invoke an R notebook.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Run a Databricks notebook from another notebook\n##### Use `%run` to import a notebook\n\nIn this example, the first notebook defines a function, `reverse`, which is available in the second notebook after you use the `%run` magic to execute `shared-code-notebook`. \n![Shared code notebook](https:\/\/docs.databricks.com\/_images\/shared-code-notebook.png) \n![Notebook import example](https:\/\/docs.databricks.com\/_images\/notebook-import-example.png) \nBecause both of these notebooks are in the same directory in the workspace, use the prefix `.\/` in `.\/shared-code-notebook` to indicate that the path should be resolved relative to the currently running notebook. You can organize notebooks into directories, such as `%run .\/dir\/notebook`, or use an absolute path like `%run \/Users\/username@organization.com\/directory\/notebook`. \nNote \n* `%run` must be in a cell *by itself*, because it runs the entire notebook inline.\n* You *cannot* use `%run` to run a Python file and `import` the entities defined in that file into a notebook. To import from a Python file, see [Modularize your code using files](https:\/\/docs.databricks.com\/notebooks\/share-code.html#reference-source-code-files-using-git). Or, package the file into a Python library, create a Databricks [library](https:\/\/docs.databricks.com\/libraries\/index.html) from that Python library, and [install the library into the cluster](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html#install-libraries) you use to run your notebook.\n* When you use `%run` to run a notebook that contains widgets, by default the specified notebook runs with the widget\u2019s default values. 
You can also pass in values to widgets; see [Use Databricks widgets with %run](https:\/\/docs.databricks.com\/notebooks\/widgets.html#widgets-and-percent-run).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Run a Databricks notebook from another notebook\n##### `dbutils.notebook` API\n\nThe methods available in the `dbutils.notebook` API are `run` and `exit`. Both parameters and return values must be strings. \n**`run(path: String, timeout_seconds: int, arguments: Map): String`** \nRun a notebook and return its exit value. The method starts an ephemeral job that runs immediately. \nThe `timeout_seconds` parameter controls the timeout of the run (0 means no timeout): the call to\n`run` throws an exception if it doesn\u2019t finish within the specified time. If Databricks is down for more than 10 minutes,\nthe notebook run fails regardless of `timeout_seconds`. \nThe `arguments` parameter sets widget values of the target notebook. Specifically, if the notebook you are running has a widget\nnamed `A`, and you pass a key-value pair `(\"A\": \"B\")` as part of the arguments parameter to the `run()` call,\nthen retrieving the value of widget `A` will return `\"B\"`. You can find the instructions for creating and\nworking with widgets in the [Databricks widgets](https:\/\/docs.databricks.com\/notebooks\/widgets.html) article. \nNote \n* The `arguments` parameter accepts only Latin characters (ASCII character set). Using non-ASCII characters returns an error.\n* Jobs created using the `dbutils.notebook` API must complete in 30 days or less. 
\n### `run` Usage \n```\ndbutils.notebook.run(\"notebook-name\", 60, {\"argument\": \"data\", \"argument2\": \"data2\", ...})\n\n``` \n```\ndbutils.notebook.run(\"notebook-name\", 60, Map(\"argument\" -> \"data\", \"argument2\" -> \"data2\", ...))\n\n``` \n### `run` Example \nSuppose you have a notebook named `workflows` with a widget named `foo` that prints the widget\u2019s value: \n```\ndbutils.widgets.text(\"foo\", \"fooDefault\", \"fooEmptyLabel\")\nprint(dbutils.widgets.get(\"foo\"))\n\n``` \nRunning `dbutils.notebook.run(\"workflows\", 60, {\"foo\": \"bar\"})` produces the following result: \n![Notebook with widget](https:\/\/docs.databricks.com\/_images\/notebook-workflow-widget-example.png) \nThe widget had the value you passed in using `dbutils.notebook.run()`, `\"bar\"`, rather than the default. \n`exit(value: String): void`\nExit a notebook with a value. If you call a notebook using the `run` method, this is the value returned. \n```\ndbutils.notebook.exit(\"returnValue\")\n\n``` \nCalling `dbutils.notebook.exit` in a job causes the notebook to complete successfully. If you want to cause the job to fail, throw an exception.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Run a Databricks notebook from another notebook\n##### Example\n\nIn the following example, you pass arguments to `DataImportNotebook` and run different notebooks (`DataCleaningNotebook` or `ErrorHandlingNotebook`) based on the result from `DataImportNotebook`. \n![if-else example](https:\/\/docs.databricks.com\/_images\/notebook-workflow-example.png) \nWhen the code runs, a table appears containing a link to the running notebook: \n![Link to running notebook](https:\/\/docs.databricks.com\/_images\/dbutils.run.png) \nTo view the run details, click the **Start time** link in the table. 
If the run is complete, you can also view the run details by clicking the **End time** link. \n![Result of ephemeral notebook run](https:\/\/docs.databricks.com\/_images\/notebook-run-results.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Run a Databricks notebook from another notebook\n##### Pass structured data\n\nThis section illustrates how to pass structured data between notebooks. \n```\n# Example 1 - returning data through temporary views.\n# You can only return one string using dbutils.notebook.exit(), but since called notebooks reside in the same JVM, you can\n# return a name referencing data stored in a temporary view.\n\n## In callee notebook\nspark.range(5).toDF(\"value\").createOrReplaceGlobalTempView(\"my_data\")\ndbutils.notebook.exit(\"my_data\")\n\n## In caller notebook\nreturned_table = dbutils.notebook.run(\"LOCATION_OF_CALLEE_NOTEBOOK\", 60)\nglobal_temp_db = spark.conf.get(\"spark.sql.globalTempDatabase\")\ndisplay(table(global_temp_db + \".\" + returned_table))\n\n# Example 2 - returning data through DBFS.\n# For larger datasets, you can write the results to DBFS and then return the DBFS path of the stored data.\n\n## In callee notebook\ndbutils.fs.rm(\"\/tmp\/results\/my_data\", recurse=True)\nspark.range(5).toDF(\"value\").write.format(\"parquet\").save(\"dbfs:\/tmp\/results\/my_data\")\ndbutils.notebook.exit(\"dbfs:\/tmp\/results\/my_data\")\n\n## In caller notebook\nreturned_table = dbutils.notebook.run(\"LOCATION_OF_CALLEE_NOTEBOOK\", 60)\ndisplay(spark.read.format(\"parquet\").load(returned_table))\n\n# Example 3 - returning JSON data.\n# To return multiple values, you can use standard JSON libraries to serialize and deserialize results.\n\n## In callee notebook\nimport json\ndbutils.notebook.exit(json.dumps({\n\"status\": \"OK\",\n\"table\": \"my_data\"\n}))\n\n## In caller notebook\nimport json\n\nresult = 
dbutils.notebook.run(\"LOCATION_OF_CALLEE_NOTEBOOK\", 60)\nprint(json.loads(result))\n\n``` \n```\n\/\/ Example 1 - returning data through temporary views.\n\/\/ You can only return one string using dbutils.notebook.exit(), but since called notebooks reside in the same JVM, you can\n\/\/ return a name referencing data stored in a temporary view.\n\n\/** In callee notebook *\/\nsc.parallelize(1 to 5).toDF().createOrReplaceGlobalTempView(\"my_data\")\ndbutils.notebook.exit(\"my_data\")\n\n\/** In caller notebook *\/\nval returned_table = dbutils.notebook.run(\"LOCATION_OF_CALLEE_NOTEBOOK\", 60)\nval global_temp_db = spark.conf.get(\"spark.sql.globalTempDatabase\")\ndisplay(table(global_temp_db + \".\" + returned_table))\n\n\/\/ Example 2 - returning data through DBFS.\n\/\/ For larger datasets, you can write the results to DBFS and then return the DBFS path of the stored data.\n\n\/** In callee notebook *\/\ndbutils.fs.rm(\"\/tmp\/results\/my_data\", recurse=true)\nsc.parallelize(1 to 5).toDF().write.format(\"parquet\").save(\"dbfs:\/tmp\/results\/my_data\")\ndbutils.notebook.exit(\"dbfs:\/tmp\/results\/my_data\")\n\n\/** In caller notebook *\/\nval returned_table = dbutils.notebook.run(\"LOCATION_OF_CALLEE_NOTEBOOK\", 60)\ndisplay(sqlContext.read.format(\"parquet\").load(returned_table))\n\n\/\/ Example 3 - returning JSON data.\n\/\/ To return multiple values, you can use standard JSON libraries to serialize and deserialize results.\n\n\/** In callee notebook *\/\n\n\/\/ Import jackson json libraries\nimport com.fasterxml.jackson.module.scala.DefaultScalaModule\nimport com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper\nimport com.fasterxml.jackson.databind.ObjectMapper\n\n\/\/ Create a json serializer\nval jsonMapper = new ObjectMapper with ScalaObjectMapper\njsonMapper.registerModule(DefaultScalaModule)\n\n\/\/ Exit with json\ndbutils.notebook.exit(jsonMapper.writeValueAsString(Map(\"status\" -> \"OK\", \"table\" -> \"my_data\")))\n\n\/** In caller 
notebook *\/\n\n\/\/ Import jackson json libraries\nimport com.fasterxml.jackson.module.scala.DefaultScalaModule\nimport com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper\nimport com.fasterxml.jackson.databind.ObjectMapper\n\n\/\/ Create a json serializer\nval jsonMapper = new ObjectMapper with ScalaObjectMapper\njsonMapper.registerModule(DefaultScalaModule)\n\nval result = dbutils.notebook.run(\"LOCATION_OF_CALLEE_NOTEBOOK\", 60)\nprintln(jsonMapper.readValue[Map[String, String]](result))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Run a Databricks notebook from another notebook\n##### Handle errors\n\nThis section illustrates how to handle errors. \n```\n# Errors throw a WorkflowException.\n\ndef run_with_retry(notebook, timeout, args = {}, max_retries = 3):\n    num_retries = 0\n    while True:\n        try:\n            return dbutils.notebook.run(notebook, timeout, args)\n        except Exception as e:\n            # Give up once max_retries retries have been attempted.\n            if num_retries >= max_retries:\n                raise e\n            else:\n                print(\"Retrying error\", e)\n                num_retries += 1\n\nrun_with_retry(\"LOCATION_OF_CALLEE_NOTEBOOK\", 60, max_retries = 5)\n\n``` \n```\n\/\/ Errors throw a WorkflowException.\n\nimport com.databricks.WorkflowException\n\n\/\/ Since dbutils.notebook.run() is just a function call, you can retry failures using standard Scala try-catch\n\/\/ control flow. 
Here we show an example of retrying a notebook a number of times.\ndef runRetry(notebook: String, timeout: Int, args: Map[String, String] = Map.empty, maxTries: Int = 3): String = {\n  var numTries = 0\n  while (true) {\n    try {\n      return dbutils.notebook.run(notebook, timeout, args)\n    } catch {\n      case e: WorkflowException if numTries < maxTries =>\n        println(\"Error, retrying: \" + e)\n    }\n    numTries += 1\n  }\n  \"\" \/\/ not reached\n}\n\nrunRetry(\"LOCATION_OF_CALLEE_NOTEBOOK\", timeout = 60, maxTries = 5)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Run a Databricks notebook from another notebook\n##### Run multiple notebooks concurrently\n\nYou can run multiple notebooks at the same time by using standard Scala and Python constructs such as Threads ([Scala](https:\/\/docs.oracle.com\/javase\/7\/docs\/api\/java\/lang\/Thread.html), [Python](https:\/\/docs.python.org\/3\/library\/threading.html)) and Futures ([Scala](https:\/\/docs.scala-lang.org\/overviews\/core\/futures.html), [Python](https:\/\/docs.python.org\/3\/library\/multiprocessing.html)). The example notebooks demonstrate how to use these constructs. \n1. Download the following 4 notebooks. The notebooks are written in Scala.\n2. Import the notebooks into a single folder in the workspace.\n3. Run the **Run concurrently** notebook. 
\n### Run concurrently notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/concurrent-notebooks.html) \n### Run in parallel notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/parallel-notebooks.html) \n### Testing notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/testing.html) \n### Testing-2 notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/testing-2.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-workflows.html"} +{"content":"# Databricks data engineering\n### What are init scripts?\n\nAn init script (initialization script) is a shell script that runs during startup of each cluster node before the Apache Spark driver or executor JVM starts. This article provides recommendations for init scripts and configuration information if you must use them.\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/index.html"} +{"content":"# Databricks data engineering\n### What are init scripts?\n#### Recommendations for init scripts\n\nDatabricks recommends using built-in platform features instead of init scripts whenever possible. Widespread use of init scripts can slow migration to new Databricks Runtime versions and prevent adoption of some Databricks optimizations. \nImportant \nIf you need to migrate from init scripts on DBFS, see [Migrate init scripts from DBFS](https:\/\/docs.databricks.com\/init-scripts\/index.html#migrate). 
\nThe following Databricks features address some of the common use cases for init scripts: \n* Use compute policies to set system properties, environment variables, and Spark configuration parameters. See [Compute policy reference](https:\/\/docs.databricks.com\/admin\/clusters\/policy-definition.html).\n* Add libraries to cluster policies. See [Add libraries to a policy](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html#libraries). \nIf you must use init scripts: \n* Manage init scripts using compute policies or cluster-scoped init scripts rather than global init scripts. See [init script types](https:\/\/docs.databricks.com\/init-scripts\/index.html#init-script-types).\n* Manage library installation for production and interactive environments using compute policies. Don\u2019t install libraries using init scripts.\n* Use shared access mode for all workloads. Use single user access mode only if required functionality is not supported by shared access mode.\n* Use new Databricks Runtime versions and Unity Catalog for all workloads. \nThe following table provides recommendations organized by Databricks Runtime version and Unity Catalog enablement. \n| Environment | Recommendation |\n| --- | --- |\n| Databricks Runtime 13.3 LTS and above with Unity Catalog | Store init scripts in Unity Catalog [volumes](https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-to-volume.html). |\n| Databricks Runtime 11.3 LTS and above without Unity Catalog | Store init scripts as [workspace files](https:\/\/docs.databricks.com\/files\/workspace-init-scripts.html). The file size limit is 500 MB. |\n| Databricks Runtime 10.4 LTS and below | Store init scripts using [cloud object storage](https:\/\/docs.databricks.com\/connect\/storage\/index.html). 
|\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/index.html"} +{"content":"# Databricks data engineering\n### What are init scripts?\n#### What types of init scripts does Databricks support?\n\nDatabricks supports two kinds of init scripts: cluster-scoped and global. Databricks recommends using cluster-scoped init scripts. \n* **Cluster-scoped**: run on every cluster configured with the script. This is the recommended way to run an init script. See [Use cluster-scoped init scripts](https:\/\/docs.databricks.com\/init-scripts\/cluster-scoped.html).\n* **Global**: run on all clusters in the workspace configured with single user access mode or no-isolation shared access mode. These init scripts can cause unexpected issues, such as library conflicts. Only workspace admin users can create global init scripts. See [Use global init scripts](https:\/\/docs.databricks.com\/init-scripts\/global.html). \nWhenever you change any type of init script, you must restart all clusters affected by the script. \nGlobal init scripts run before cluster-scoped init scripts. \nImportant \nLegacy global and legacy cluster-named init scripts run before other init scripts. These init scripts are end-of-life, but might be present in workspaces created before February 21, 2023. See [Cluster-named init scripts (legacy)](https:\/\/docs.databricks.com\/archive\/init-scripts\/legacy-cluster-named.html) and [Global init scripts (legacy)](https:\/\/docs.databricks.com\/archive\/init-scripts\/legacy-global.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/index.html"} +{"content":"# Databricks data engineering\n### What are init scripts?\n#### Where can init scripts be installed?\n\nYou can store and configure init scripts from workspace files, Unity Catalog volumes, and cloud object storage, but init scripts are not supported on all cluster configurations and not all files can be referenced from init scripts. 
\nThe following table indicates the support for init scripts based on the source location and the cluster access mode. The Databricks Runtime version listed is the minimum version required to use the combination. For information about cluster access modes, see [Access modes](https:\/\/docs.databricks.com\/compute\/configure.html#access-modes). \nNote \nShared access mode requires an admin to add init scripts to an `allowlist`. See [Allowlist libraries and init scripts on shared compute](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/allowlist.html). \n| | Shared access mode | Single user access mode | No-isolation shared access mode |\n| --- | --- | --- | --- |\n| **Workspace files** | Not supported | All supported Databricks Runtime versions | All supported Databricks Runtime versions |\n| **Volumes** | 13.3 LTS | 13.3 LTS | Not supported |\n| **Cloud storage** | 13.3 LTS | All supported Databricks Runtime versions | All supported Databricks Runtime versions |\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/index.html"} +{"content":"# Databricks data engineering\n### What are init scripts?\n#### Migrate init scripts from DBFS\n\nWarning \nInit scripts on DBFS have reached end-of-life and can no longer be used. You must migrate your init scripts to a supported location before you can start compute. Store init scripts on Unity Catalog volumes, as workspace files, or in cloud object storage. \nUsers who need to migrate init scripts from DBFS can use the following guides. Make sure you\u2019ve identified the correct target for your configuration. See [Recommendations for init scripts](https:\/\/docs.databricks.com\/init-scripts\/index.html#recommendations). 
\n* [Migrate init scripts from DBFS to volumes](https:\/\/docs.databricks.com\/_extras\/documents\/aws-init-volumes.pdf)\n* [Migrate init scripts from DBFS to workspace files](https:\/\/docs.databricks.com\/_extras\/documents\/aws-init-workspace-files.pdf)\n* [Migrate init scripts from DBFS to S3](https:\/\/docs.databricks.com\/_extras\/documents\/aws-init-s3.pdf)\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/index.html"} +{"content":"# Security and compliance guide\n### Data security and encryption\n\nThis article introduces data security configurations to help protect your data. \nFor information about securing access to your data, see [Data governance with Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/index.html).\n\n### Data security and encryption\n#### Overview of data security and encryption\n\nDatabricks provides encryption features to help protect your data. Not all security features are available on all pricing tiers. The following table contains an overview of the features and how they align to pricing plans. \n| Feature | Pricing tier |\n| --- | --- |\n| Customer-managed keys for encryption | Enterprise |\n| Encrypt traffic between cluster worker nodes | Enterprise |\n| Encrypt queries, query history, and query results | Enterprise |\n\n### Data security and encryption\n#### Enable customer-managed keys for encryption\n\nDatabricks supports adding a customer-managed key to help protect and control access to data. There are two customer-managed key features for different types of data: \n* **Customer-managed keys for managed services**: Managed services data in the Databricks control plane is encrypted at rest. 
You can add a customer-managed key for managed services to help protect and control access to the following types of encrypted data: \n+ Notebook source files that are stored in the control plane.\n+ Notebook results for notebooks that are stored in the control plane.\n+ Secrets stored by the secret manager APIs.\n+ Databricks SQL queries and query history.\n+ Personal access tokens or other credentials used to set up Git integration with Databricks Git folders.\n* **Customer-managed keys for workspace storage**: You can configure your own key to encrypt the data on the Amazon S3 bucket in your AWS account that you specified when you created your workspace. You can optionally use the same key to encrypt your cluster\u2019s EBS volumes. \nFor more details about which customer-managed key features in Databricks protect different types of data, see [Customer-managed keys for encryption](https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/index.html"} +{"content":"# Security and compliance guide\n### Data security and encryption\n#### Encrypt queries, query history, and query results\n\nYou can use your own key from AWS KMS to encrypt the Databricks SQL queries and your query history stored in the Databricks [control plane](https:\/\/docs.databricks.com\/getting-started\/overview.html). For more details, see [Encrypt queries, query history, and query results](https:\/\/docs.databricks.com\/security\/keys\/sql-encryption.html).\n\n### Data security and encryption\n#### Encrypt S3 buckets at rest\n\nDatabricks supports encrypting data in S3 using server-side encryption. You can encrypt writes to S3 with a key from KMS. This helps keep your data protected if it is lost or stolen. See [Configure encryption for S3 with KMS](https:\/\/docs.databricks.com\/security\/keys\/kms-s3.html). 
To encrypt your workspace storage bucket, see [Customer-managed keys for encryption](https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html).\n\n### Data security and encryption\n#### Encrypt traffic between cluster worker nodes\n\nUser queries and transformations are typically sent to your clusters over an encrypted channel. By default, however, the data exchanged between worker nodes in a cluster is not encrypted. If your environment requires that data be encrypted at all times, whether at rest or in transit, you can create an init script that configures your clusters to encrypt traffic between worker nodes, using AES 128-bit encryption over a TLS 1.2 connection. For more information, see [Encrypt traffic between cluster worker nodes](https:\/\/docs.databricks.com\/security\/keys\/encrypt-otw.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/index.html"} +{"content":"# Security and compliance guide\n### Data security and encryption\n#### Manage workspace settings\n\nDatabricks workspace administrators can manage their workspace\u2019s security settings, such as the ability to download notebooks and enforcement of the user isolation cluster access mode. For more information, see [Manage your workspace](https:\/\/docs.databricks.com\/admin\/workspace-settings\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/index.html"} +{"content":"# \n### Transform data\n\nThis article provides an introduction and overview of transforming data with Databricks. Transforming data, or preparing data, is a key step in all data engineering, analytics, and ML workloads. \nThe example patterns and recommendations in this article focus on working with lakehouse tables, which are backed by Delta Lake. Because Delta Lake provides the ACID guarantees of a Databricks lakehouse, you might observe different behavior when working with data in other formats or data systems. 
\nDatabricks recommends ingesting data into a lakehouse in a raw or nearly raw state, and then applying transformations and enrichment as a separate processing step. This pattern is known as the medallion architecture. See [What is the medallion lakehouse architecture?](https:\/\/docs.databricks.com\/lakehouse\/medallion.html). \nIf you know that the data you need to transform has not yet been loaded into a lakehouse, see [Ingest data into a Databricks lakehouse](https:\/\/docs.databricks.com\/ingestion\/index.html). If you\u2019re trying to find lakehouse data to write transformations against, see [Discover data](https:\/\/docs.databricks.com\/discover\/index.html). \nAll transformations begin by writing either a batch or streaming query against a data source. If you\u2019re not familiar with querying data, see [Query data](https:\/\/docs.databricks.com\/query\/index.html). \nOnce you\u2019ve saved transformed data to a Delta table, you can use that table as a feature table for ML. See [What is a feature store?](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html). \nNote \nArticles here discuss transformations on Databricks. Databricks also supports connections to many common data preparation platforms. See [Connect to data prep partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/prep.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/index.html"} +{"content":"# \n### Transform data\n#### Spark transformations vs. lakehouse transformations\n\nThis article focuses on defining *transformations* as they relate to the **T** in ETL or ELT. The Apache Spark processing model also uses the word *transformation* in a related way. Briefly: in Apache Spark, all operations are defined as either transformations or actions. \n* **Transformations**: add some processing logic to the plan. 
Examples include reading data, joins, aggregations, and type casting.\n* **Actions**: trigger processing logic to evaluate and output a result. Examples include writes, displaying or previewing results, manual caching, or getting the count of rows. \nApache Spark uses a *lazy execution* model, meaning that none of the logic defined by a collection of operations is evaluated until an action is triggered. This model has an important ramification when defining data processing pipelines: only use actions to save results back to a target table. \nBecause actions represent a processing bottleneck for optimizing logic, Databricks has added numerous optimizations on top of those already present in Apache Spark to ensure optimal execution of logic. These optimizations consider all transformations triggered by a given action at once and find the optimal plan based on the physical layout of the data. Manually caching data or returning preview results in production pipelines can interrupt these optimizations and lead to significant increases in cost and latency. \nTherefore we can define a *lakehouse transformation* to be any collection of operations applied to one or more lakehouse tables that results in a new lakehouse table. Note that while transformations such as joins and aggregations are discussed separately, you can combine many of these patterns in a single processing step and trust the optimizers on Databricks to find the most efficient plan.\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/index.html"} +{"content":"# \n### Transform data\n#### What are the differences between streaming and batch processing?\n\nWhile streaming and batch processing use much of the same syntax on Databricks, each has its own specific semantics. \nBatch processing allows you to define explicit instructions to process a fixed amount of static, non-changing data as a single operation. 
\nStream processing allows you to define a query against an unbounded, continuously growing dataset and then process data in small, incremental batches. \nBatch operations on Databricks use Spark SQL or DataFrames, while stream processing leverages Structured Streaming. \nYou can differentiate batch Apache Spark commands from Structured Streaming by looking at read and write operations, as shown in the following table: \n| | Apache Spark | Structured Streaming |\n| --- | --- | --- |\n| **Read** | `spark.read.load()` | `spark.readStream.load()` |\n| **Write** | `spark.write.save()` | `spark.writeStream.start()` | \nMaterialized views generally conform to batch processing guarantees, although Delta Live Tables is used to calculate results incrementally when possible. The results returned by a materialized view are always equivalent to batch evaluation of logic, but Databricks seeks to process these results incrementally when possible. \nStreaming tables always calculate results incrementally. Because many streaming data sources only retain records for a period of hours or days, the processing model used by streaming tables assumes that each batch of records from a data source is only processed once. \nDatabricks supports using SQL to write streaming queries in the following use cases: \n* Defining streaming tables in Unity Catalog using Databricks SQL.\n* Defining source code for Delta Live Tables pipelines. \nNote \nYou can also declare streaming tables in Delta Live Tables using Python Structured Streaming code.\n\n### Transform data\n#### Batch transformations\n\nBatch transformations operate on a well-defined set of data assets at a specific point in time. 
Batch transformations might be one-time operations, but often are part of scheduled workflows or pipelines that run regularly to keep production systems up to date.\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/index.html"} +{"content":"# \n### Transform data\n#### Incremental transformations\n\nIncremental patterns generally assume that the data source is append-only and has a stable schema. The following articles provide details on nuances for incremental transformations on tables that experience updates, deletes, or schema changes: \n* [APPLY CHANGES API: Simplify change data capture in Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html)\n* [Use Delta Lake change data feed on Databricks](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html)\n* [Update Delta Lake table schema](https:\/\/docs.databricks.com\/delta\/update-schema.html)\n* [Delta table streaming reads and writes](https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html)\n\n### Transform data\n#### Real-time transformations\n\nDelta Lake excels at providing near real-time access to large amounts of data for all users and applications querying your lakehouse, but because of the overhead with writing files and metadata to cloud object storage, true real-time latency cannot be reached for many workloads that write to Delta Lake sinks. \nFor extremely low-latency streaming applications, Databricks recommends choosing source and sink systems designed for real-time workloads such as Kafka. You can use Databricks to enrich data, including aggregations, joins across streams, and joining streaming data with slowly changing dimension data stored in the lakehouse.\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/index.html"} +{"content":"# \n### Data governance with Unity Catalog\n\nThis guide shows how to manage data and AI object access in Databricks. 
For information on Databricks security, see the [Security and compliance guide](https:\/\/docs.databricks.com\/security\/index.html). Databricks provides centralized governance for data and AI with Unity Catalog and Delta Sharing.\n\n### Data governance with Unity Catalog\n#### Centralize access control using Unity Catalog\n\n[Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) is a fine-grained governance solution for data and AI on the Databricks platform. It helps simplify security and governance of your data and AI assets by providing a central place to administer and audit access to data and AI assets. \nIn most accounts, Unity Catalog is enabled by default when you create a workspace. For details, see [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement). \nFor a discussion of how to use Unity Catalog effectively, see [Unity Catalog best practices](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html). \n### Track data lineage using Unity Catalog \nYou can use Unity Catalog to capture runtime data lineage across queries in any language executed on a Databricks cluster or SQL warehouse. Lineage is captured down to the column level, and includes notebooks, workflows, and dashboards related to the query. 
To learn more, see [Capture and view data lineage using Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/index.html"} +{"content":"# \n### Data governance with Unity Catalog\n#### Discover data using Catalog Explorer\n\n[Databricks Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html) provides a UI to explore and manage data and AI assets, including schemas (databases), tables, volumes (non-tabular data), and registered ML models, along with asset permissions, data owners, external locations, and credentials. You can use the [Insights](https:\/\/docs.databricks.com\/discover\/table-insights.html) tab in Catalog Explorer to view the most frequent recent queries and users of any table registered in Unity Catalog.\n\n### Data governance with Unity Catalog\n#### Share data using Delta Sharing\n\n[Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html) is an open protocol developed by Databricks for secure data and AI asset sharing with other organizations, or with other teams within your organization, regardless of which computing platforms they use.\n\n### Data governance with Unity Catalog\n#### Configure audit logging\n\nDatabricks provides access to [audit logs](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html) of activities performed by Databricks users, allowing your enterprise to monitor detailed Databricks usage patterns. \nUnity Catalog lets you easily access and query your account\u2019s operational data, including audit logs, billable usage, and lineage using [system tables (Public Preview)](https:\/\/docs.databricks.com\/admin\/system-tables\/index.html).\n\n### Data governance with Unity Catalog\n#### Configure identity\n\nEvery good data governance story starts with a strong identity foundation. 
To learn how to best configure identity in Databricks, see [Identity best practices](https:\/\/docs.databricks.com\/admin\/users-groups\/best-practices.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/index.html"} +{"content":"# \n### Data governance with Unity Catalog\n#### Legacy data governance solutions\n\nDatabricks also provides these legacy governance models: \n* [Table access control](https:\/\/docs.databricks.com\/data-governance\/table-acls\/index.html) is a legacy data governance model that lets you programmatically grant and revoke access to objects managed by your workspace\u2019s built-in Hive metastore. Databricks recommends that you use Unity Catalog instead of table access control. Unity Catalog simplifies security and governance of your data by providing a central place to administer and audit data access across multiple workspaces in your account. \n* [IAM role credential passthrough](https:\/\/docs.databricks.com\/archive\/credential-passthrough\/iam-passthrough.html) is also a legacy data governance feature that allows users to authenticate automatically to S3 buckets from Databricks clusters using the identity that they use to log in to Databricks. Databricks recommends that you use Unity Catalog instead.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/index.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Stream processing with Apache Kafka and Databricks\n\nThis article describes how you can use Apache Kafka as either a source or a sink when running Structured Streaming workloads on Databricks. 
\nFor more on Kafka, see the [Kafka documentation](https:\/\/kafka.apache.org\/documentation\/).\n\n#### Stream processing with Apache Kafka and Databricks\n##### Read data from Kafka\n\nThe following is an example for a streaming read from Kafka: \n```\ndf = (spark.readStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", \"<server:ip>\")\n.option(\"subscribe\", \"<topic>\")\n.option(\"startingOffsets\", \"latest\")\n.load()\n)\n\n``` \nDatabricks also supports batch read semantics for Kafka data sources, as shown in the following example: \n```\ndf = (spark\n.read\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", \"<server:ip>\")\n.option(\"subscribe\", \"<topic>\")\n.option(\"startingOffsets\", \"earliest\")\n.option(\"endingOffsets\", \"latest\")\n.load()\n)\n\n``` \nFor incremental batch loading, Databricks recommends using Kafka with `Trigger.AvailableNow`. See [Configuring incremental batch processing](https:\/\/docs.databricks.com\/structured-streaming\/triggers.html#available-now). \nIn Databricks Runtime 13.3 LTS and above, Databricks provides a SQL function for reading Kafka data. Streaming with SQL is supported only in Delta Live Tables or with streaming tables in Databricks SQL. See [read\\_kafka table-valued function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/read_kafka.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/kafka.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Stream processing with Apache Kafka and Databricks\n##### Configure Kafka Structured Streaming reader\n\nDatabricks provides the `kafka` keyword as a data format to configure connections to Kafka 0.10+. \nThe following are the most common configurations for Kafka: \nThere are multiple ways of specifying which topics to subscribe to. 
You should provide only one of these parameters: \n| Option | Value | Description |\n| --- | --- | --- |\n| subscribe | A comma-separated list of topics. | The topic list to subscribe to. |\n| subscribePattern | Java regex string. | The pattern used to subscribe to topic(s). |\n| assign | JSON string `{\"topicA\":[0,1],\"topic\":[2,4]}`. | Specific topicPartitions to consume. | \nOther notable configurations: \n| Option | Value | Default Value | Description |\n| --- | --- | --- | --- |\n| kafka.bootstrap.servers | Comma-separated list of host:port. | empty | [Required] The Kafka `bootstrap.servers` configuration. If you find there is no data from Kafka, check the broker address list first. If the broker address list is incorrect, there might not be any errors. This is because Kafka client assumes the brokers will become available eventually and in the event of network errors retry forever. |\n| failOnDataLoss | `true` or `false`. | `true` | [Optional] Whether to fail the query when it\u2019s possible that data was lost. Queries can permanently fail to read data from Kafka due to many scenarios such as deleted topics, topic truncation before processing, and so on. We try to estimate conservatively whether data was possibly lost or not. Sometimes this can cause false alarms. Set this option to `false` if it does not work as expected, or you want the query to continue processing despite data loss. |\n| minPartitions | Integer >= 0, 0 = disabled. | 0 (disabled) | [Optional] Minimum number of partitions to read from Kafka. You can configure Spark to use an arbitrary minimum of partitions to read from Kafka using the `minPartitions` option. Normally Spark has a 1-1 mapping of Kafka topicPartitions to Spark partitions consuming from Kafka. If you set the `minPartitions` option to a value greater than your Kafka topicPartitions, Spark will divvy up large Kafka partitions to smaller pieces. 
This option can be set at times of peak loads, data skew, and as your stream is falling behind to increase processing rate. It comes at a cost of initializing Kafka consumers at each trigger, which may impact performance if you use SSL when connecting to Kafka. |\n| kafka.group.id | A Kafka consumer group ID. | not set | [Optional] Group ID to use while reading from Kafka. Use this with caution. By default, each query generates a unique group ID for reading data. This ensures that each query has its own consumer group that does not face interference from any other consumer, and therefore can read all of the partitions of its subscribed topics. In some scenarios (for example, Kafka group-based authorization), you may want to use specific authorized group IDs to read data. You can optionally set the group ID. However, do this with extreme caution as it can cause unexpected behavior. Concurrently running queries (both batch and streaming) with the same group ID are likely to interfere with each other, causing each query to read only part of the data. This may also occur when queries are started\/restarted in quick succession. To minimize such issues, set the Kafka consumer configuration `session.timeout.ms` to be very small. |\n| startingOffsets | earliest , latest | latest | [Optional] The start point when a query is started, either \u201cearliest\u201d which is from the earliest offsets, \u201clatest\u201d which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started; resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest. 
| \nSee [Structured Streaming Kafka Integration Guide](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-kafka-integration.html) for other optional configurations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/kafka.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Stream processing with Apache Kafka and Databricks\n##### Schema for Kafka records\n\nThe schema of Kafka records is: \n| Column | Type |\n| --- | --- |\n| key | binary |\n| value | binary |\n| topic | string |\n| partition | int |\n| offset | long |\n| timestamp | long |\n| timestampType | int | \nThe `key` and the `value` are always deserialized as byte arrays with the `ByteArrayDeserializer`. Use DataFrame operations (such as `cast(\"string\")`) to explicitly deserialize the keys and values.\n\n#### Stream processing with Apache Kafka and Databricks\n##### Write data to Kafka\n\nThe following is an example for a streaming write to Kafka: \n```\n(df\n.writeStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", \"<server:ip>\")\n.option(\"topic\", \"<topic>\")\n.start()\n)\n\n``` \nDatabricks also supports batch write semantics to Kafka data sinks, as shown in the following example: \n```\n(df\n.write\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", \"<server:ip>\")\n.option(\"topic\", \"<topic>\")\n.save()\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/kafka.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Stream processing with Apache Kafka and Databricks\n##### Configure Kafka Structured Streaming writer\n\nImportant \nDatabricks Runtime 13.3 LTS and above includes a newer version of the `kafka-clients` library that enables idempotent writes by default. 
If a Kafka sink uses version 2.8.0 or below with ACLs configured, but without `IDEMPOTENT_WRITE` enabled, the write fails with the error message `org.apache.kafka.common.KafkaException: Cannot execute transactional method because we are in an error state`. \nResolve this error by upgrading to Kafka version 2.8.0 or above, or by setting `.option(\"kafka.enable.idempotence\", \"false\")` while configuring your Structured Streaming writer. \nThe schema provided to the DataStreamWriter interacts with the Kafka sink. You can use the following fields: \n| Column name | Required or optional | Type |\n| --- | --- | --- |\n| `key` | optional | `STRING` or `BINARY` |\n| `value` | required | `STRING` or `BINARY` |\n| `headers` | optional | `ARRAY` |\n| `topic` | optional (ignored if `topic` is set as writer option) | `STRING` |\n| `partition` | optional | `INT` | \nThe following are common options set while writing to Kafka: \n| Option | Value | Default value | Description |\n| --- | --- | --- | --- |\n| `kafka.bootstrap.servers` | A comma-separated list of `<host:port>` | none | [Required] The Kafka `bootstrap.servers` configuration. |\n| `topic` | `STRING` | not set | [Optional] Sets the topic for all rows to be written. This option overrides any topic column that exists in the data. |\n| `includeHeaders` | `BOOLEAN` | `false` | [Optional] Whether to include the Kafka headers in the row. 
| \nSee [Structured Streaming Kafka Integration Guide](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-kafka-integration.html) for other optional configurations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/kafka.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Stream processing with Apache Kafka and Databricks\n##### Retrieve Kafka metrics\n\nYou can get the average, min, and max of the number of offsets that the streaming query is behind the latest available offset among all the subscribed topics with the `avgOffsetsBehindLatest`, `maxOffsetsBehindLatest`, and `minOffsetsBehindLatest` metrics. See [Reading Metrics Interactively](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html#reading-metrics-interactively). \nNote \nAvailable in Databricks Runtime 9.1 and above. \nGet the estimated total number of bytes that the query process has not consumed from the subscribed topics by examining the value of `estimatedTotalBytesBehindLatest`. This estimate is based on the batches that were processed in the last 300 seconds. The timeframe that the estimate is based on can be changed by setting the option `bytesEstimateWindowLength` to a different value. 
For example, to set it to 10 minutes: \n```\ndf = (spark.readStream\n.format(\"kafka\")\n.option(\"bytesEstimateWindowLength\", \"10m\") # m for minutes, you can also use \"600s\" for 600 seconds\n)\n\n``` \nIf you are running the stream in a notebook, you can see these metrics under the **Raw Data** tab in the streaming query progress dashboard: \n```\n{\n\"sources\" : [ {\n\"description\" : \"KafkaV2[Subscribe[topic]]\",\n\"metrics\" : {\n\"avgOffsetsBehindLatest\" : \"4.0\",\n\"maxOffsetsBehindLatest\" : \"4\",\n\"minOffsetsBehindLatest\" : \"4\",\n\"estimatedTotalBytesBehindLatest\" : \"80.0\"\n},\n} ]\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/kafka.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Stream processing with Apache Kafka and Databricks\n##### Use SSL to connect Databricks to Kafka\n\nTo enable SSL connections to Kafka, follow the instructions in the Confluent documentation [Encryption and Authentication with SSL](https:\/\/docs.confluent.io\/current\/kafka\/authentication_ssl.html#clients). You can provide the configurations described there, prefixed with `kafka.`, as options. For example, you specify the trust store location in the property `kafka.ssl.truststore.location`. \nDatabricks recommends that you: \n* Store your certificates in cloud object storage. You can restrict access to the certificates only to clusters that can access Kafka. See [Data governance with Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/index.html).\n* Store your certificate passwords as [secrets](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html) in a [secret scope](https:\/\/docs.databricks.com\/security\/secrets\/secret-scopes.html). 
\nThe following example uses object storage locations and Databricks secrets to enable an SSL connection: \n```\ndf = (spark.readStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", ...)\n.option(\"kafka.security.protocol\", \"SASL_SSL\")\n.option(\"kafka.ssl.truststore.location\", <truststore-location>)\n.option(\"kafka.ssl.keystore.location\", <keystore-location>)\n.option(\"kafka.ssl.keystore.password\", dbutils.secrets.get(scope=<certificate-scope-name>,key=<keystore-password-key-name>))\n.option(\"kafka.ssl.truststore.password\", dbutils.secrets.get(scope=<certificate-scope-name>,key=<truststore-password-key-name>))\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/kafka.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Stream processing with Apache Kafka and Databricks\n##### Use Amazon Managed Streaming for Kafka with IAM\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html) in Databricks Runtime 13.3 LTS and above. \nYou can use Databricks to connect to Amazon Managed Streaming for Kafka (MSK) using IAM. For configuration instructions for MSK, see [Amazon MSK configuration](https:\/\/docs.aws.amazon.com\/msk\/latest\/developerguide\/msk-configuration.html). \nNote \nThe following configurations are only required if you are using IAM to connect to MSK. You can also configure connections to MSK using options provided by the Apache Spark Kafka connector. \nDatabricks recommends managing your connection to MSK using an instance profile. See [Instance profiles](https:\/\/docs.databricks.com\/compute\/configure.html#instance-profiles). 
\nYou must configure the following options to connect to MSK with an instance profile: \n```\n\"kafka.sasl.mechanism\" -> \"AWS_MSK_IAM\",\n\"kafka.sasl.jaas.config\" ->\n\"shadedmskiam.software.amazon.msk.auth.iam.IAMLoginModule required;\",\n\"kafka.security.protocol\" -> \"SASL_SSL\",\n\"kafka.sasl.client.callback.handler.class\" ->\n\"shadedmskiam.software.amazon.msk.auth.iam.IAMClientCallbackHandler\"\n\n``` \n```\n\"kafka.sasl.mechanism\": \"AWS_MSK_IAM\",\n\"kafka.sasl.jaas.config\":\n\"shadedmskiam.software.amazon.msk.auth.iam.IAMLoginModule required;\",\n\"kafka.security.protocol\": \"SASL_SSL\",\n\"kafka.sasl.client.callback.handler.class\":\n\"shadedmskiam.software.amazon.msk.auth.iam.IAMClientCallbackHandler\"\n\n``` \nYou can optionally configure your connection to MSK with an IAM user or IAM role instead of an instance profile. You must provide values for your AWS access key and secret key using the environmental variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`. See [Reference a secret in an environment variable](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html#env-var). \nIn addition, if you choose to configure your connection using an IAM role, you must modify the value provided to `kafka.sasl.jaas.config` to include the role ARN, as in the following example: `shadedmskiam.software.amazon.msk.auth.iam.IAMLoginModule required awsRoleArn=\"arn:aws:iam::123456789012:role\/msk_client_role\"`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/kafka.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Stream processing with Apache Kafka and Databricks\n##### Service Principal authentication with Microsoft Entra ID (formerly Azure Active Directory) and Azure Event Hubs\n\nDatabricks supports the authentication of Spark jobs with Event Hubs services. This authentication is done via OAuth with Microsoft Entra ID (formerly Azure Active Directory). 
\n![AAD Authentication diagram](https:\/\/docs.databricks.com\/_images\/aad-auth.png) \nDatabricks supports Microsoft Entra ID authentication with a client ID and secret in the following compute environments: \n* Databricks Runtime 12.2 LTS and above on compute configured with single user access mode.\n* Databricks Runtime 14.3 LTS and above on compute configured with shared access mode.\n* Delta Live Tables pipelines configured without Unity Catalog. \nDatabricks does not support Microsoft Entra ID authentication with a certificate in any compute environment, or in Delta Live Tables pipelines configured with Unity Catalog. \nThis authentication does not work on shared clusters or on Unity Catalog Delta Live Tables. \n### Configuring the Structured Streaming Kafka Connector \nTo perform authentication with Microsoft Entra ID, you\u2019ll need the following values: \n* A tenant ID. You can find this in the **Microsoft Entra ID** services tab.\n* A clientID (also known as Application ID).\n* A client secret. Once you have this, you should add it as a secret to your Databricks Workspace. To add this secret, see [Secret management](https:\/\/docs.databricks.com\/security\/secrets\/index.html).\n* An EventHubs topic. You can find a list of topics in the **Event Hubs** section under the **Entities** section on a specific **Event Hubs Namespace** page. To work with multiple topics, you can set the IAM role at the Event Hubs level.\n* An EventHubs server. 
You can find this on the overview page of your specific **Event Hubs namespace**: \n![Event Hubs namespace](https:\/\/docs.databricks.com\/_images\/event-hub-namespace.png) \nAdditionally, to use Entra ID, we need to tell Kafka to use the OAuth SASL mechanism (SASL is a generic protocol, and OAuth is a type of SASL \u201cmechanism\u201d): \n* `kafka.security.protocol` should be `SASL_SSL`\n* `kafka.sasl.mechanism` should be `OAUTHBEARER`\n* `kafka.sasl.login.callback.handler.class` should be the fully qualified name of the login callback handler class in the shaded Kafka library (prefixed with `kafkashaded`). See the following example for the exact class. \n### Example \nNext, let\u2019s look at a running example: \n```\n# This is the only section you need to modify for auth purposes!\n# ------------------------------\ntenant_id = \"...\"\nclient_id = \"...\"\nclient_secret = dbutils.secrets.get(\"your-scope\", \"your-secret-name\")\n\nevent_hubs_server = \"...\"\nevent_hubs_topic = \"...\"\n# -------------------------------\n\nsasl_config = f'kafkashaded.org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required clientId=\"{client_id}\" clientSecret=\"{client_secret}\" scope=\"https:\/\/{event_hubs_server}\/.default\" ssl.protocol=\"SSL\";'\n\nkafka_options = {\n# Port 9093 is the EventHubs Kafka port\n\"kafka.bootstrap.servers\": f\"{event_hubs_server}:9093\",\n\"kafka.sasl.jaas.config\": sasl_config,\n\"kafka.sasl.oauthbearer.token.endpoint.url\": f\"https:\/\/login.microsoft.com\/{tenant_id}\/oauth2\/v2.0\/token\",\n\"subscribe\": event_hubs_topic,\n\n# You should not need to modify these\n\"kafka.security.protocol\": \"SASL_SSL\",\n\"kafka.sasl.mechanism\": \"OAUTHBEARER\",\n\"kafka.sasl.login.callback.handler.class\": \"kafkashaded.org.apache.kafka.common.security.oauthbearer.secured.OAuthBearerLoginCallbackHandler\"\n}\n\n# load() is required to turn the configured reader into a streaming DataFrame\ndf = spark.readStream.format(\"kafka\").options(**kafka_options).load()\n\ndisplay(df)\n\n``` \n```\n\/\/ This is the 
only section you need to modify for auth purposes!\n\/\/ -------------------------------\nval tenantId = \"...\"\nval clientId = \"...\"\nval clientSecret = dbutils.secrets.get(\"your-scope\", \"your-secret-name\")\n\nval eventHubsServer = \"...\"\nval eventHubsTopic = \"...\"\n\/\/ -------------------------------\n\nval saslConfig = s\"\"\"kafkashaded.org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required clientId=\"$clientId\" clientSecret=\"$clientSecret\" scope=\"https:\/\/$eventHubsServer\/.default\" ssl.protocol=\"SSL\";\"\"\"\n\nval kafkaOptions = Map(\n\/\/ Port 9093 is the EventHubs Kafka port\n\"kafka.bootstrap.servers\" -> s\"$eventHubsServer:9093\",\n\"kafka.sasl.jaas.config\" -> saslConfig,\n\"kafka.sasl.oauthbearer.token.endpoint.url\" -> s\"https:\/\/login.microsoft.com\/$tenantId\/oauth2\/v2.0\/token\",\n\"subscribe\" -> eventHubsTopic,\n\n\/\/ You should not need to modify these\n\"kafka.security.protocol\" -> \"SASL_SSL\",\n\"kafka.sasl.mechanism\" -> \"OAUTHBEARER\",\n\"kafka.sasl.login.callback.handler.class\" -> \"kafkashaded.org.apache.kafka.common.security.oauthbearer.secured.OAuthBearerLoginCallbackHandler\"\n)\n\nval scalaDF = spark.readStream\n.format(\"kafka\")\n.options(kafkaOptions)\n.load()\n\ndisplay(scalaDF)\n\n``` \n### Handling potential errors \n* Streaming options are not supported. \nIf you try to use this authentication mechanism in a Delta Live Tables pipeline configured with Unity Catalog you might receive the following error: \n![Unsupported streaming error](https:\/\/docs.databricks.com\/_images\/unsupported-streaming-option.png) \nTo resolve this error, use a supported compute configuration. See [Service Principal authentication with Microsoft Entra ID (formerly Azure Active Directory) and Azure Event Hubs](https:\/\/docs.databricks.com\/connect\/streaming\/kafka.html#msk-aad).\n* Failed to create a new `KafkaAdminClient`. 
\nThis is an internal error that Kafka throws if any of the following authentication options are incorrect: \n+ Client ID (also known as Application ID)\n+ Tenant ID\n+ EventHubs server \nTo resolve the error, verify that the values are correct for these options. \nAdditionally, you might see this error if you modify the configuration options provided by default in the example (that you were asked not to modify), such as `kafka.security.protocol`.\n* There are no records being returned \nIf you are trying to display or process your DataFrame but aren\u2019t getting results, you will see the following in the UI. \n![No results message](https:\/\/docs.databricks.com\/_images\/no-results-error.png) \nThis message means that authentication was successful, but EventHubs didn\u2019t return any data. Some possible (though by no means exhaustive) reasons are: \n+ You specified the wrong **EventHubs** topic.\n+ The default Kafka configuration option for `startingOffsets` is `latest`, and you\u2019re not currently receiving any data through the topic yet. You can set `startingOffsets` to `earliest` to start reading data starting from Kafka\u2019s earliest offsets.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/kafka.html"} +{"content":"# Query data\n### Query streaming data\n\nYou can use Databricks to query streaming data sources using Structured Streaming. Databricks provides extensive support for streaming workloads in Python and Scala, and supports most Structured Streaming functionality with SQL. \nThe following examples demonstrate using a memory sink for manual inspection of streaming data during interactive development in notebooks. Because of row output limits in the notebook UI, you might not observe all data read by streaming queries. In production workloads, you should only trigger streaming queries by writing them to a target table or external system. 
\nNote \nSQL support for interactive queries on streaming data is limited to notebooks running on all-purpose compute. You can also use SQL when you declare streaming tables in Databricks SQL or Delta Live Tables. See [Load data using streaming tables in Databricks SQL](https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html) and [What is Delta Live Tables?](https:\/\/docs.databricks.com\/delta-live-tables\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/streaming.html"} +{"content":"# Query data\n### Query streaming data\n#### Query data from streaming systems\n\nDatabricks provides streaming data readers for the following streaming systems: \n* Kafka\n* Kinesis\n* PubSub\n* Pulsar \nYou must provide configuration details when you initialize queries against these systems, which vary depending on your configured environment and the system you choose to read from. See [Configure streaming data sources](https:\/\/docs.databricks.com\/connect\/streaming\/index.html). \nCommon workloads that involve streaming systems include data ingestion to the lakehouse and stream processing to sink data to external systems. For more on streaming workloads, see [Streaming on Databricks](https:\/\/docs.databricks.com\/structured-streaming\/index.html). \nThe following examples demonstrate an interactive streaming read from Kafka: \n```\ndisplay(spark.readStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", \"<server:ip>\")\n.option(\"subscribe\", \"<topic>\")\n.option(\"startingOffsets\", \"latest\")\n.load()\n)\n\n``` \n```\nSELECT * FROM STREAM read_kafka(\nbootstrapServers => '<server:ip>',\nsubscribe => '<topic>',\nstartingOffsets => 'latest'\n);\n\n```\n\n### Query streaming data\n#### Query a table as a streaming read\n\nDatabricks creates all tables using Delta Lake by default. When you perform a streaming query against a Delta table, the query automatically picks up new records when a version of the table is committed. 
By default, streaming queries expect source tables to contain only appended records. If you need to work with streaming data that contains updates and deletes, Databricks recommends using Delta Live Tables and `APPLY CHANGES INTO`. See [APPLY CHANGES API: Simplify change data capture in Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html). \nThe following examples demonstrate performing an interactive streaming read from a table: \n```\ndisplay(spark.readStream.table(\"table_name\"))\n\n``` \n```\nSELECT * FROM STREAM table_name\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/streaming.html"} +{"content":"# Query data\n### Query streaming data\n#### Query data in cloud object storage with Auto Loader\n\nYou can stream data from cloud object storage using Auto Loader, the Databricks cloud data connector. You can use the connector with files stored in Unity Catalog volumes or other cloud object storage locations. Databricks recommends using volumes to manage access to data in cloud object storage. See [Connect to data sources](https:\/\/docs.databricks.com\/connect\/index.html). \nDatabricks optimizes this connector for streaming ingestion of data in cloud object storage that is stored in popular structured, semi-structured, and unstructured formats. Databricks recommends storing ingested data in a nearly-raw format to maximize throughput and minimize potential data loss due to corrupt records or schema changes. \nFor more recommendations on ingesting data from cloud object storage, see [Ingest data into a Databricks lakehouse](https:\/\/docs.databricks.com\/ingestion\/index.html). 
\nThe following examples demonstrate an interactive streaming read from a directory of JSON files in a volume: \n```\ndisplay(spark.readStream.format(\"cloudFiles\").option(\"cloudFiles.format\", \"json\").load(\"\/Volumes\/catalog\/schema\/volumes\/path\/to\/files\"))\n\n``` \n```\nSELECT * FROM STREAM read_files('\/Volumes\/catalog\/schema\/volumes\/path\/to\/files', format => 'json')\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/streaming.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Use the generated SQL dashboard\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis page describes the dashboard that is automatically created when a monitor is run. \nWhen a monitor runs, it creates a legacy dashboard that displays key metrics computed by the monitor. The visualizations included in the default dashboard configuration depend on the profile type, and the different metrics are organized into sections. The left side of the dashboard shows lists of the metrics and statistics included in the tables and charts. \nThe dashboard has user-editable parameters for both the entire dashboard and for each chart, allowing you to customize the date range, data slices, models, and so on. You can also modify the charts shown or add new ones. \nThe dashboard is created in the user\u2019s account and is customizable and shareable like any legacy dashboard. 
For general information about using and customizing legacy dashboards, including adding new charts, editing charts, viewing queries, and so on, see [legacy dashboards](https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-dashboard.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Use the generated SQL dashboard\n#### Refresh the dashboard\n\nThe dashboard displays metrics that have been calculated by the monitor. To refresh the values shown on the dashboard, you must trigger a monitor refresh using [the UI](https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-ui.html#refresh) or [the API](https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-api.html#refresh), or set up a scheduled run ([UI](https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-ui.html#schedule), [API](https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-api.html#schedule)). You can\u2019t refresh the metrics from the dashboard. When you modify the dashboard, statistics aren\u2019t recalculated. \nThe metric tables and the dashboard generated by a monitor are updated separately. When you trigger a monitor refresh, the metric tables are updated, but the dashboard is not automatically updated. To update the data shown on the dashboard, click the **Refresh** button on the dashboard. \nSimilarly, when you click **Refresh** on the dashboard, it doesn\u2019t trigger monitor calculations. Instead, it runs the queries over the metric tables that the dashboard uses to generate visualizations. 
To update the data in the tables used to create the visualizations that appear on the dashboard, you must refresh the monitor and then refresh the dashboard.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-dashboard.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Use the generated SQL dashboard\n#### Select data to display\n\nUse the widgets at the top of the dashboard to control what data is included. \nNote \nWhen you open a dashboard from the link provided in the notebook, you must make a selection **for every filter**. If you do not, the charts combine data from all time windows, and the results may be misleading. If you open the dashboard from the Catalog Explorer UI, the selectors are pre-set. \nThe screenshot shows the filters for `Snapshot` analysis. For `TimeSeries` and `InferenceLog` analysis, additional selectors appear. \n![Selectors on monitor dashboard](https:\/\/docs.databricks.com\/_images\/monitor-dashboard-selectors.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-dashboard.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Manage dashboards with Workspace APIs\n\nThis tutorial demonstrates how to manage dashboards using the Lakeview API and Workspace API. Each step includes a sample request and response, and explanations about how to use the API tools and properties together. Each step can be referenced on its own. Following all steps in order guides you through a complete workflow. \nNote \nThis workflow calls the Workspace API to retrieve a Lakeview dashboard as a generic workspace object.\n\n##### Manage dashboards with Workspace APIs\n###### Prerequisites\n\n* You need a personal access token to connect with your workspace. 
See [Databricks personal access token authentication](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html).\n* You need the workspace ID of the workspace you want to access. See [Workspace instance names, URLs, and IDs](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-instance-names-urls-and-ids)\n* Familiarity with the [Databricks REST API reference](https:\/\/docs.databricks.com\/api\/workspace\/introduction).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/workspace-lakeview-api.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Manage dashboards with Workspace APIs\n###### Step 1: Explore a workspace directory\n\nThe Workspace List API [GET \/api\/2.0\/workspace\/list](https:\/\/docs.databricks.com\/api\/workspace\/workspace\/list) allows you to explore the directory structure of your workspace. For example, you can retrieve a list of all files and directories in your current workspace. \nIn the following example, the `path` property in the request points to a folder named `examples_folder` stored in a user\u2019s home folder. The username is provided in the path, `first.last@example.com`. \nThe response shows that the folder contains a text file, a directory, and a Lakeview dashboard. 
\n```\nGET \/api\/2.0\/workspace\/list\n\nQuery Parameters:\n{\n\"path\": \"\/Users\/first.last@example.com\/examples_folder\"\n}\n\nResponse:\n{\n\"objects\": [\n{\n\"object_type\": \"FILE\",\n\"path\": \"\/Users\/first.last@example.com\/examples_folder\/myfile.txt\",\n\"created_at\": 1706822278103,\n\"modified_at\": 1706822278103,\n\"object_id\": 3976707922053539,\n\"resource_id\": \"3976707922053539\"\n},\n{\n\"object_type\": \"DIRECTORY\",\n\"path\": \"\/Users\/first.last@example.com\/examples_folder\/another_folder\",\n\"object_id\": 2514959868792596,\n\"resource_id\": \"2514959868792596\"\n},\n{\n\"object_type\": \"DASHBOARD\",\n\"path\": \"\/Users\/first.last@example.com\/examples_folder\/mydashboard.lvdash.json\",\n\"object_id\": 7944020886653361,\n\"resource_id\": \"01eec14769f616949d7a44244a53ed10\"\n}\n]\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/workspace-lakeview-api.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Manage dashboards with Workspace APIs\n###### Step 2: Export a dashboard\n\nThe Workspace Export API [GET \/api\/2.0\/workspace\/export](https:\/\/docs.databricks.com\/api\/workspace\/workspace\/export) allows you to export the contents of a dashboard as a file. Lakeview dashboard files reflect the draft version of a dashboard. The response in the following examples shows the contents of a minimal dashboard definition. To explore and understand more serialization details, try exporting some of your own dashboards. \n### Download the exported file \nThe following example shows how to download a dashboard file using the API. \nThe `\"path\"` property in this example ends with the file type extension `lvdash.json`, a Lakeview dashboard. The filename, as it appears in the workspace, precedes that extension. In this case, it\u2019s `mydashboard`. 
\nAdditionally, the `\"direct_download\"` property for this request is set to `true` so the response is the exported file itself. \nNote \nThe `\"displayName\"` property, shown in the pages property of the response, does not reflect the visible name of the dashboard in the workspace. \n```\nGET \/api\/2.0\/workspace\/export\n\nQuery parameters:\n{\n\"path\": \"\/Users\/first.last@example.com\/examples_folder\/mydashboard.lvdash.json\",\n\"direct_download\": true\n}\n\nResponse:\n{\n\"pages\": [\n{\n\"name\": \"880de22a\",\n\"displayName\": \"New Page\"\n}\n]\n}\n\n``` \n### Encode the exported file \nThe following code shows an example response where the `\"direct_download\"` property is set to `false`. The response contains the content as a base64-encoded string. \n```\nGET \/api\/2.0\/workspace\/export\n\nQuery parameters:\n{\n\"path\": \"\/Users\/first.last@example.com\/examples_folder\/mydashboard.lvdash.json\",\n\"direct_download\": false\n}\n\nResponse:\n{\n\"content\": \"IORd\/DYYsCNElspwM9XBZS\/i5Z9dYgW5SkLpKJs48dR5p5KkIW8OmEHU8lx6CZotiCDS9hkppQG=\",\n\"file_type\": \"lvdash.json\"\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/workspace-lakeview-api.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Manage dashboards with Workspace APIs\n###### Step 3: Import a dashboard\n\nYou can use the Workspace Import API [POST \/api\/2.0\/workspace\/import](https:\/\/docs.databricks.com\/api\/workspace\/workspace\/import) to import draft dashboards into a workspace. For example, after you\u2019ve exported an encoded file, as in the previous example, you can import that dashboard to a new workspace. \nFor an import to be recognized as a Lakeview dashboard, two parameters must be set: \n* `\"format\"`: \u201cAUTO\u201d - this setting will allow the system to detect the asset type automatically.\n* `\"path\"`: must include a file path that ends with \u201c.lvdash.json\u201d. 
\nImportant \nIf these settings are not configured properly, the import might succeed, but the dashboard would be treated like a regular file. \nThe following example shows a properly configured import request. \n```\n\nPOST \/api\/2.0\/workspace\/import\n\nRequest body parameters:\n{\n\"path\": \"\/Users\/first.last@example.com\/examples_folder\/myseconddashboard.lvdash.json\",\n\"content\": \"IORd\/DYYsCNElspwM9XBZS\/i5Z9dYgW5SkLpKJs48dR5p5KkIW8OmEHU8lx6CZotiCDS9hkppQG=\",\n\"format\": \"AUTO\"\n}\n\nResponse:\n{}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/workspace-lakeview-api.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Manage dashboards with Workspace APIs\n###### Step 4: Overwrite on import (Optional)\n\nAttempting to reissue the same API request results in the following error: \n```\n{\n\"error_code\": \"RESOURCE_ALREADY_EXISTS\",\n\"message\": \"Path (\/Users\/first.last@example.com\/examples_folder\/myseconddashboard.lvdash.json) already exists.\"\n}\n\n``` \nIf you want to overwrite the existing dashboard instead, set the `\"overwrite\"` property to `true` as in the following example. \n```\n\nPOST \/api\/2.0\/workspace\/import\n\nRequest body parameters:\n{\n\"path\": \"\/Users\/first.last@example.com\/examples_folder\/myseconddashboard.lvdash.json\",\n\"content\": \"IORd\/DYYsCNElspwM9XBZS\/i5Z9dYgW5SkLpKJs48dR5p5KkIW8OmEHU8lx6CZotiCDS9hkppQG=\",\n\"format\": \"AUTO\",\n\"overwrite\": true\n}\n\nResponse:\n{}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/workspace-lakeview-api.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Manage dashboards with Workspace APIs\n###### Step 5: Retrieve metadata\n\nYou can retrieve metadata for any workspace object, including a Lakeview dashboard. 
See [GET \/api\/2.0\/workspace\/get-status](https:\/\/docs.databricks.com\/api\/workspace\/workspace\/getstatus). \nThe following example shows a `get-status` request for the imported dashboard from the previous example. The response confirms that the file has been successfully imported as a `\"DASHBOARD\"`. It also includes a `\"resource_id\"` property that you can use as an identifier with the Lakeview API. \n```\nGET \/api\/2.0\/workspace\/get-status\n\nQuery parameters:\n{\n\"path\": \"\/Users\/first.last@example.com\/examples_folder\/myseconddashboard.lvdash.json\"\n}\n\nResponse:\n{\n\"object_type\": \"DASHBOARD\",\n\"path\": \"\/Users\/first.last@example.com\/examples_folder\/myseconddashboard.lvdash.json\",\n\"object_id\": 7616304051637820,\n\"resource_id\": \"9c1fbf4ad3449be67d6cb64c8acc730b\"\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/workspace-lakeview-api.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Manage dashboards with Workspace APIs\n###### Step 6: Publish a dashboard\n\nThe previous examples used the Workspace API, enabling work with Lakeview dashboards as generic workspace objects. The following example uses the Lakeview API to perform a publish operation specific to Lakeview dashboards. See [POST \/api\/2.0\/lakeview\/dashboards\/{dashboard\\_id}\/published](https:\/\/docs.databricks.com\/api\/workspace\/lakeview\/publish). \nThe path to the API endpoint includes the `\"resource_id\"` property returned in the previous example. In the request parameters, `\"embed_credentials\"` is set to `true` so that the publisher\u2019s credentials are embedded in the dashboard. The publisher, in this case, is the user who is making the authorized API request. The publisher cannot embed a different user\u2019s credentials. 
See [Publish a dashboard](https:\/\/docs.databricks.com\/dashboards\/index.html#publish-a-dashboard) to learn how the **Embed credentials** setting works. \nThe `\"warehouse_id\"` property sets the warehouse to be used for the published dashboard. If specified, this property overrides the warehouse specified for the draft dashboard, if any. \n```\nPOST \/api\/2.0\/lakeview\/dashboards\/9c1fbf4ad3449be67d6cb64c8acc730b\/published\n\nRequest parameters\n{\n\"embed_credentials\": true,\n\"warehouse_id\": \"1234567890ABCD12\"\n}\n\nResponse:\n{}\n\n``` \nThe published dashboard can be accessed from your browser when the command is complete. The following example shows how to construct the link to your published dashboard. \n```\nhttps:\/\/<deployment-url>\/dashboardsv3\/<resource_id>\/published\n\n``` \nTo construct your unique link: \n* Replace `<deployment-url>` with your deployment URL. This link is the address in your browser address bar when you are on your Databricks workspace homepage.\n* Replace `<resource_id>` with the value of the `\"resource_id\"` property that you identified in [retrieve metadata](https:\/\/docs.databricks.com\/dashboards\/tutorials\/workspace-lakeview-api.html#retrieve-metadata).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/workspace-lakeview-api.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n### Dashboard tutorials\n##### Manage dashboards with Workspace APIs\n###### Step 7: Delete a dashboard\n\nTo delete a dashboard, use the Workspace API. See [POST \/api\/2.0\/workspace\/delete](https:\/\/docs.databricks.com\/api\/workspace\/workspace\/delete). \nImportant \nThis is a hard delete. When the command completes, the dashboard is permanently deleted. \nIn the following example, the request includes the path to the file created in the previous steps. 
\n```\nPOST \/api\/2.0\/workspace\/delete\n\nQuery parameters:\n{\n\"path\": \"\/Users\/first.last@example.com\/examples_folder\/myseconddashboard.lvdash.json\"\n}\n\nResponse:\n{}\n\n```\n\n##### Manage dashboards with Workspace APIs\n###### Next steps\n\n* To learn more about dashboards, see [Dashboards](https:\/\/docs.databricks.com\/dashboards\/index.html).\n* To learn more about the REST API, see [Databricks REST API reference](https:\/\/docs.databricks.com\/api\/workspace\/introduction).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/workspace-lakeview-api.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Stream from Apache Pulsar\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nIn Databricks Runtime 14.1 and above, you can use Structured Streaming to stream data from Apache Pulsar on Databricks. \nStructured Streaming provides exactly-once processing semantics for data read from Pulsar sources.\n\n#### Stream from Apache Pulsar\n##### Syntax example\n\nThe following is a basic example of using Structured Streaming to read from Pulsar: \n```\nquery = spark.readStream\n.format(\"pulsar\")\n.option(\"service.url\", \"pulsar:\/\/broker.example.com:6650\")\n.option(\"topics\", \"topic1,topic2\")\n.load()\n\n``` \nYou must always provide a `service.url` and one of the following options to specify topics: \n* `topic`\n* `topics`\n* `topicsPattern` \nFor a complete list of options, see [Configure options for Pulsar streaming read](https:\/\/docs.databricks.com\/connect\/streaming\/pulsar.html#options).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/pulsar.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Stream from Apache Pulsar\n##### Authenticate to Pulsar\n\nDatabricks supports truststore and keystore authentication to Pulsar. 
Databricks recommends using secrets when storing configuration details. \nYou can set the following options during stream configuration: \n* `pulsar.client.authPluginClassName`\n* `pulsar.client.authParams`\n* `pulsar.client.useKeyStoreTls`\n* `pulsar.client.tlsTrustStoreType`\n* `pulsar.client.tlsTrustStorePath`\n* `pulsar.client.tlsTrustStorePassword` \nIf the stream uses a `PulsarAdmin`, also set the following: \n* `pulsar.admin.authPluginClassName`\n* `pulsar.admin.authParams` \nThe following example demonstrates configuring authentication options: \n```\nval clientAuthParams = dbutils.secrets.get(scope = \"pulsar\", key = \"clientAuthParams\")\nval clientPw = dbutils.secrets.get(scope = \"pulsar\", key = \"clientPw\")\n\n\/\/ clientAuthParams is a comma-separated list of key-value pairs, such as:\n\/\/\"keyStoreType:JKS,keyStorePath:\/var\/private\/tls\/client.keystore.jks,keyStorePassword:clientpw\"\n\nquery = spark.readStream\n.format(\"pulsar\")\n.option(\"service.url\", \"pulsar:\/\/broker.example.com:6650\")\n.option(\"topics\", \"topic1,topic2\")\n.option(\"startingOffsets\", startingOffsets)\n.option(\"pulsar.client.authPluginClassName\", \"org.apache.pulsar.client.impl.auth.AuthenticationKeyStoreTls\")\n.option(\"pulsar.client.authParams\", clientAuthParams)\n.option(\"pulsar.client.useKeyStoreTls\", \"true\")\n.option(\"pulsar.client.tlsTrustStoreType\", \"JKS\")\n.option(\"pulsar.client.tlsTrustStorePath\", trustStorePath)\n.option(\"pulsar.client.tlsTrustStorePassword\", clientPw)\n.load()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/pulsar.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Stream from Apache Pulsar\n##### Pulsar schema\n\nThe schema of records read from Pulsar depends on how topics have their schemas encoded. 
\n* For topics with Avro or JSON schema, field names and field types are preserved in the resulting Spark DataFrame.\n* For topics without schema or with a simple data type in Pulsar, the payload is loaded to a `value` column.\n* If the reader is configured to read multiple topics with different schemas, set `allowDifferentTopicSchemas` to load the raw content to a `value` column. \nPulsar records have the following metadata fields: \n| Column | Type |\n| --- | --- |\n| `__key` | `binary` |\n| `__topic` | `string` |\n| `__messageId` | `binary` |\n| `__publishTime` | `timestamp` |\n| `__eventTime` | `timestamp` |\n| `__messageProperties` | `map<String, String>` |\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/pulsar.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Stream from Apache Pulsar\n##### Configure options for Pulsar streaming read\n\nAll options are configured as part of a Structured Streaming read using `.option(\"<optionName>\", \"<optionValue>\")` syntax. You can also configure authentication using options. See [Authenticate to Pulsar](https:\/\/docs.databricks.com\/connect\/streaming\/pulsar.html#auth). \nThe following table describes required configurations for Pulsar. You must specify only one of the options `topic`, `topics` or `topicsPattern`. \n| Option | Default value | Description |\n| --- | --- | --- |\n| `service.url` | none | The Pulsar `serviceUrl` configuration for the Pulsar service. |\n| `topic` | none | A topic name string for the topic to consume. |\n| `topics` | none | A comma-separated list of the topics to consume. |\n| `topicsPattern` | none | A Java regex string to match on topics to consume. | \nThe following table describes other options supported for Pulsar: \n| Option | Default value | Description |\n| --- | --- | --- |\n| `predefinedSubscription` | none | The predefined subscription name used by the connector to track spark application progress. 
|\n| `subscriptionPrefix` | none | A prefix used by the connector to generate a random subscription to track spark application progress. |\n| `pollTimeoutMs` | 120000 | The timeout for reading messages from Pulsar in milliseconds. |\n| `waitingForNonExistedTopic` | `false` | Whether the connector should wait until the desired topics are created. |\n| `failOnDataLoss` | `true` | Controls whether to fail a query when data is lost (for example, topics are deleted, or messages are deleted because of retention policy). |\n| `allowDifferentTopicSchemas` | `false` | If multiple topics with different schemas are read, use this parameter to turn off automatic schema-based topic value deserialization. Only the raw values are returned when this is `true`. |\n| `startingOffsets` | `latest` | If `latest`, the reader reads the newest records after it starts running. If `earliest`, the reader reads from the earliest offset. The user can also specify a JSON string that specifies a specific offset. |\n| `maxBytesPerTrigger` | none | A soft limit of the maximum number of bytes we want to process per microbatch. If this is specified, `admin.url` also needs to be specified. |\n| `admin.url` | none | The Pulsar `serviceHttpUrl` configuration. Only needed when `maxBytesPerTrigger` is specified. 
| \nYou can also specify any Pulsar client, admin, and reader configurations using the following patterns: \n| Pattern | Link to configuration options |\n| --- | --- |\n| `pulsar.client.*` | [Pulsar client configuration](https:\/\/pulsar.apache.org\/docs\/2.11.x\/client-libraries-java\/#client) |\n| `pulsar.admin.*` | [Pulsar admin configuration](https:\/\/pulsar.apache.org\/docs\/2.10.x\/admin-api-overview\/) |\n| `pulsar.reader.*` | [Pulsar reader configuration](https:\/\/pulsar.apache.org\/docs\/2.11.x\/client-libraries-java\/#configure-reader) |\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/pulsar.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Stream from Apache Pulsar\n##### Construct starting offsets JSON\n\nYou can manually construct a message ID to specify a specific offset and pass this as a JSON to the `startingOffsets` option. The following code example demonstrates this syntax: \n```\nimport org.apache.spark.sql.pulsar.JsonUtils\nimport org.apache.pulsar.client.api.MessageId\nimport org.apache.pulsar.client.impl.MessageIdImpl\n\nval topic = \"my-topic\"\nval msgId: MessageId = new MessageIdImpl(ledgerId, entryId, partitionIndex)\nval startOffsets = JsonUtils.topicOffsets(Map(topic -> msgId))\n\nval query = spark.readStream\n.format(\"pulsar\")\n.option(\"service.url\", \"pulsar:\/\/broker.example.com:6650\")\n.option(\"topic\", topic)\n.option(\"startingOffsets\", startOffsets)\n.load()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/pulsar.html"} +{"content":"# \n### What is Databricks Marketplace?\n\nThis article introduces Databricks Marketplace, an open forum for exchanging data products. 
Databricks Marketplace takes advantage of [Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html) to give data providers the tools to share data products securely and data consumers the power to explore and expand their access to the data and data services they need. \n![Marketplace home page](https:\/\/docs.databricks.com\/_images\/marketplace-home.png)\n\n### What is Databricks Marketplace?\n#### What kinds of data assets are shared on Databricks Marketplace?\n\nMarketplace assets include datasets, Databricks notebooks, Databricks [Solution Accelerators](https:\/\/www.databricks.com\/solutions\/accelerators), and machine learning (AI) models. Datasets are typically made available as catalogs of tabular data, although non-tabular data, in the form of Databricks [volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html), is also supported. Solution Accelerators are available as clonable Git repos.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/index.html"} +{"content":"# \n### What is Databricks Marketplace?\n#### How do consumers get access to data in Databricks Marketplace?\n\nTo find a data product you want on the Databricks Marketplace, simply browse or search provider listings. \nYou can browse: \n* The [Open Marketplace](https:\/\/marketplace.databricks.com), which does not require access to a Databricks workspace.\n* The Databricks Marketplace on your Databricks workspace. Just click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**. \nTo request access to data products in the Marketplace, you must use the Marketplace on a Databricks workspace. You do not need a Databricks workspace to access and work with data once it is shared, although using a Databricks workspace with Unity Catalog enabled lets you take advantage of the deep integration of Unity Catalog with Delta Sharing. 
\nSome data products are available to everyone in the *public marketplace*, and others are available as part of a *private exchange*, in which a provider shares their listings only with member consumers. Whether public or private, some data products are available instantly, as soon as you request them and agree to the terms. Others might require provider approval and transaction completion using provider interfaces. In either case, the Delta Sharing protocol that powers the Marketplace ensures that you can access shared data securely. \n### Get started accessing data products \nTo learn how to get started as a data consumer: \n* Using a Databricks workspace that is enabled for Unity Catalog, see [Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html).\n* Using third-party platforms like Power BI, pandas, or Apache Spark, along with Databricks workspaces that are not enabled for Unity Catalog, see [Access data products in Databricks Marketplace using external platforms](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer-open.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/index.html"} +{"content":"# \n### What is Databricks Marketplace?\n#### How do providers list data products in Databricks Marketplace?\n\nDatabricks Marketplace gives data providers a secure platform for sharing data products that data scientists and analysts can use to help their organizations succeed. Databricks Marketplace uses Delta Sharing to provide security and control over your shared data. You can share public data, free sample data, and commercialized data offerings. You can share data products in public listings or as part of private exchanges that you create, making listings discoverable only by member consumers. 
In addition to datasets, you can also share Databricks notebooks and other content to demonstrate use cases and show customers how to take full advantage of your data products. \n### Get started listing data products \nTo list your data products on Databricks Marketplace, you must: \n* Have a Databricks account and premium workspace that is enabled for Unity Catalog. You do not need to enable all of your workspaces for Unity Catalog. You can create one specifically for managing Marketplace listings.\n* Apply to be a provider through the [Databricks Data Partner Program](https:\/\/www.databricks.com\/company\/partners\/data-partner-program). Alternatively, if you only want to share data through [private exchanges](https:\/\/docs.databricks.com\/marketplace\/private-exchange.html), you can use the self-service signup flow in the provider console. See [Sign up to be a Databricks Marketplace provider](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#apply).\n* Review the [Marketplace provider policies](https:\/\/docs.databricks.com\/marketplace\/provider-policies.html). \nTo learn how to get started, see [List your data product in Databricks Marketplace](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html).\n\n### What is Databricks Marketplace?\n#### View a demo\n\nThis video introduces Databricks Marketplace, shows how consumers access listings, and demonstrates how providers create them.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/index.html"} +{"content":"# \n### What\u2019s coming?\n\nLearn about upcoming Databricks releases.\n\n### What\u2019s coming?\n#### DigiCert is updating its root CA certificate on May 15\n\nAfter May 15th, 2024, if you do not use a browser that is supported by Databricks or another client that trusts DigiCert\u2019s new root and intermediate CA certificates, you must establish trust with the new DigiCert root and intermediate CA certificates. 
See [DigiCert root, and intermediate CA certificate updates 2023](https:\/\/knowledge.digicert.com\/general-information\/digicert-root-and-intermediate-ca-certificate-updates-2023). \nFor more information on how to test if your client trusts the root CA, see [DigiCert is updating its root CA certificate](https:\/\/community.databricks.com\/t5\/product-platform-updates\/digicert-is-updating-its-root-ca-certificate\/ba-p\/63869).\n\n### What\u2019s coming?\n#### Legacy Git integration is EOL on January 31\n\nAfter January 31st, 2024, Databricks will remove [legacy notebook Git integrations](https:\/\/docs.databricks.com\/archive\/repos\/git-version-control-legacy.html). This feature has been in legacy status for more than two years, and a deprecation notice has been displayed in the product UI since November 2023. \nFor details on migrating to Databricks Git folders (formerly Repos) from legacy Git integration, see [Switching to Databricks Repos from Legacy Git integration](https:\/\/docs.databricks.com\/_extras\/documents\/migrate-to-repos-from-legacy-git.pdf). If this removal impacts you and you need an extension, contact your Databricks account team.\n\n","doc_uri":"https:\/\/docs.databricks.com\/whats-coming.html"} +{"content":"# \n### What\u2019s coming?\n#### Changes to query, dashboard, and alert listing pages\n\nDatabricks plans to remove the **Admin view** tab from listing pages for queries, dashboards, and alerts. Workspace admins automatically have CAN MANAGE permissions on workspace objects, so all queries, dashboards, and alerts can be accessed from the **All** tab on each listing page.\n\n### What\u2019s coming?\n#### External support ticket submission will be deprecated\n\nDatabricks plans to transition the support ticket submission experience from `help.databricks.com` to the help menu in the Databricks workspace. Support ticket submission via `help.databricks.com` will be deprecated. 
You\u2019ll still view and triage your tickets at `help.databricks.com`. \nThe in-product experience, which is available if your organization has a [Databricks Support contract](https:\/\/docs.databricks.com\/resources\/support.html), integrates with Databricks Assistant to help address your issues quickly without having to submit a ticket. \nTo access the in-product experience, click the help icon ![Help icon](https:\/\/docs.databricks.com\/_images\/in-product-help-icon.png), and then click **Create Support Ticket** or type \u201cI need help\u201d into the assistant. \nThe **Contact support** modal opens. \n![Contact support modal](https:\/\/docs.databricks.com\/_images\/contact-support.png) \nIf the in-product experience is down, send requests for support with detailed information about your issue to [help@databricks.com](mailto:help%40databricks.com). For more information, see [Get help](https:\/\/docs.databricks.com\/workspace\/index.html#get-help).\n\n","doc_uri":"https:\/\/docs.databricks.com\/whats-coming.html"} +{"content":"# \n### What\u2019s coming?\n#### JDK8 and JDK11 will be unsupported\n\nDatabricks plans to remove JDK 8 support with the next major Databricks Runtime version, when Spark 4.0 releases. Databricks plans to remove JDK 11 support with the next LTS version of Databricks Runtime 14.x.\n\n### What\u2019s coming?\n#### Automatic enablement of Unity Catalog for new workspaces\n\nDatabricks has begun to enable Unity Catalog automatically for new workspaces. This removes the need for account admins to configure Unity Catalog after a workspace is created. Rollout is proceeding gradually across accounts.\n\n### What\u2019s coming?\n#### New charts and chart improvements\n\nDatabricks plans to add new charts to the SQL editor, SQL dashboards, and notebooks. This change will bring faster chart rendering performance, improved colors, and faster interactivity. 
See [New chart visualizations in Databricks](https:\/\/docs.databricks.com\/visualizations\/preview-chart-visualizations.html)\n\n### What\u2019s coming?\n#### Favorites functionality\n\nDatabricks plans to enable favorites functionality in the workspace. You\u2019ll be able to save content such as notebooks, dashboards, experiments, and queries to a list of favorites, and then access your favorites from the homepage.\n\n### What\u2019s coming?\n#### sqlite-jdbc upgrade\n\nDatabricks Runtime plans to upgrade the sqlite-jdbc version from 3.8.11.2 to 3.42.0.0 in all Databricks Runtime maintenance releases. The APIs of version 3.42.0.0 are not fully compatible with 3.8.11.2. Confirm your methods and return type use version 3.42.0.0. \nIf you are using sqlite-jdbc in your code, check the [sqlite-jdbc compatibility report](https:\/\/docs.databricks.com\/_extras\/documents\/sqlite-jdbc-report.pdf).\n\n","doc_uri":"https:\/\/docs.databricks.com\/whats-coming.html"} +{"content":"# \n### htmlwidgets\n\nWith [htmlwidgets for R](https:\/\/www.htmlwidgets.org\/) you can generate interactive plots using R\u2019s flexible syntax and environment. Databricks notebooks support htmlwidgets. \nThe setup has two steps: \n1. Install [pandoc](https:\/\/pandoc.org\/), a Linux package used by htmlwidgets to generate HTML.\n2. Change one function in the htmlwidgets package to make it work in Databricks. \nYou can automate the first step using an [init script](https:\/\/docs.databricks.com\/init-scripts\/index.html) so that the cluster installs pandoc when it launches. You should do the second step, changing an htmlwidgets function, in every notebook that uses the htmlwidgets package. \nThe notebook shows how to use htmlwidgets with [dygraphs](https:\/\/rstudio.github.io\/dygraphs\/), [leaflet](https:\/\/rstudio.github.io\/leaflet\/), and [plotly](https:\/\/plot.ly\/r\/). \nImportant \nWith each library invocation, an HTML file containing the rendered plot is downloaded. 
The plot *does not* display inline.\n\n### htmlwidgets\n#### Notebook example: htmlwidgets\n\nThe following notebook shows how to use htmlwidgets. \n### htmlwidgets notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/htmlwidgets.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/htmlwidgets.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Third-party online stores\n##### Authentication for working with online stores\n\nThis article describes how to configure authentication for publishing feature tables to online stores and looking up features from online stores. \nThe table shows the authentication methods supported for each action: \n| Online store provider | Publish | Feature lookup in Legacy MLflow Model Serving | Feature lookup in Model Serving |\n| --- | --- | --- | --- |\n| Amazon DynamoDB (any version of Feature Engineering client, or Feature Store client v0.3.8 and above) | [Instance profile attached to a Databricks cluster](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/fs-authentication.html#auth-instance-profile) or Databricks secrets using `write_secret_prefix` in `AmazonDynamoDBSpec`. | Databricks secrets using `read_secret_prefix` in `AmazonDynamoDBSpec`. Instance profiles are not supported for legacy feature lookup. | [Instance profile attached to a Databricks Serving Endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/add-model-serving-instance-profile.html) or Databricks secrets using `read_secret_prefix` in `AmazonDynamoDBSpec`. |\n| Amazon Aurora (MySQL-compatible) | Databricks secrets using `write_secret_prefix` in `AmazonRdsMySqlSpec`. | Databricks secrets using `read_secret_prefix` in `AmazonRdsMySqlSpec`. | Not supported. 
|\n| Amazon RDS MySQL | Databricks secrets using `write_secret_prefix` in `AmazonRdsMySqlSpec`. | Databricks secrets using `read_secret_prefix` in `AmazonRdsMySqlSpec`. | Not supported. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/fs-authentication.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Third-party online stores\n##### Authentication for working with online stores\n###### Authentication for publishing feature tables to online stores\n\nTo publish feature tables to an online store, you must provide write authentication. \nDatabricks recommends that you provide write authentication through [an instance profile attached to a Databricks cluster](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/fs-authentication.html#auth-instance-profile). Alternatively, you can [store credentials in Databricks secrets](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/fs-authentication.html#provide-online-store-credentials-using-databricks-secrets), and then refer to them in a `write_secret_prefix` when publishing. 
\nThe instance profile or IAM user should have all of the following permissions: \n* `dynamodb:DeleteItem`\n* `dynamodb:DeleteTable`\n* `dynamodb:PartiQLSelect`\n* `dynamodb:DescribeTable`\n* `dynamodb:PartiQLInsert`\n* `dynamodb:GetItem`\n* `dynamodb:CreateGlobalTable`\n* `dynamodb:BatchGetItem`\n* `dynamodb:UpdateTimeToLive`\n* `dynamodb:BatchWriteItem`\n* `dynamodb:ConditionCheckItem`\n* `dynamodb:PutItem`\n* `dynamodb:PartiQLUpdate`\n* `dynamodb:Scan`\n* `dynamodb:Query`\n* `dynamodb:UpdateItem`\n* `dynamodb:DescribeTimeToLive`\n* `dynamodb:CreateTable`\n* `dynamodb:UpdateGlobalTableSettings`\n* `dynamodb:UpdateTable`\n* `dynamodb:PartiQLDelete`\n* `dynamodb:DescribeTableReplicaAutoScaling` \n### Provide write authentication through an instance profile attached to a Databricks cluster \nOn clusters running Databricks Runtime 10.5 ML and above, you can use the instance profile attached to the cluster for write authentication when publishing to DynamoDB online stores. \nNote \nUse these steps only for write authentication when publishing to DynamoDB online stores. \n1. Create an [instance profile](https:\/\/docs.aws.amazon.com\/IAM\/latest\/UserGuide\/id_roles_use_switch-role-ec2_instance-profiles.html) that has write permission to the online store.\n2. Attach the instance profile to a Databricks cluster by following these two steps: \n1. [Add the instance profile to Databricks](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html#add-instance-profile).\n2. [Launch a cluster with the instance profile](https:\/\/docs.databricks.com\/compute\/configure.html#instance-profiles).\n3. Select the cluster with the attached instance profile to run the code to publish to the online store. You do not need to provide explicit secret credentials or `write_secret_prefix` to the [online store spec](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html). 
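\nAs a rough illustration, the DynamoDB permissions listed earlier in this section could be collected into an IAM policy document for the instance profile or IAM user. This is a minimal sketch, not an official Databricks or AWS template: the `build_policy` helper and the example resource ARN are hypothetical, and how you scope `Resource` (for example, for the global-table actions) is an assumption you should check against your own AWS setup.

```python
import json

# Actions copied from the permission list in this section.
DYNAMODB_ACTIONS = [
    "dynamodb:DeleteItem",
    "dynamodb:DeleteTable",
    "dynamodb:PartiQLSelect",
    "dynamodb:DescribeTable",
    "dynamodb:PartiQLInsert",
    "dynamodb:GetItem",
    "dynamodb:CreateGlobalTable",
    "dynamodb:BatchGetItem",
    "dynamodb:UpdateTimeToLive",
    "dynamodb:BatchWriteItem",
    "dynamodb:ConditionCheckItem",
    "dynamodb:PutItem",
    "dynamodb:PartiQLUpdate",
    "dynamodb:Scan",
    "dynamodb:Query",
    "dynamodb:UpdateItem",
    "dynamodb:DescribeTimeToLive",
    "dynamodb:CreateTable",
    "dynamodb:UpdateGlobalTableSettings",
    "dynamodb:UpdateTable",
    "dynamodb:PartiQLDelete",
    "dynamodb:DescribeTableReplicaAutoScaling",
]


def build_policy(resource_arn: str) -> dict:
    """Return an IAM policy document (hypothetical helper) allowing the listed actions."""
    return {
        "Version": "2012-10-17",  # current IAM policy language version
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(DYNAMODB_ACTIONS),
                "Resource": resource_arn,  # assumption: caller decides the scope
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder ARN for illustration only; replace with your own resource scope.
    example_arn = "arn:aws:dynamodb:us-west-2:123456789012:table/*"
    print(json.dumps(build_policy(example_arn), indent=2))
```

Generating the document from a single list keeps the policy in sync with the documented permissions and makes it easy to review which actions are granted.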
\n### Provide write credentials using Databricks secrets \nFollow the instructions in [Use Databricks secrets](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/fs-authentication.html#provide-online-store-credentials-using-databricks-secrets).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/fs-authentication.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Third-party online stores\n##### Authentication for working with online stores\n###### Authentication for looking up features from online stores with served MLflow models\n\nTo enable Databricks-hosted MLflow models to connect to online stores and look up feature values, you must provide read authentication. \nDatabricks recommends that you provide lookup authentication through [an instance profile attached to a Databricks served model](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/fs-authentication.html#auth-instance-profile-lookup). Alternatively, you can [store credentials in Databricks secrets](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/fs-authentication.html#provide-online-store-credentials-using-databricks-secrets), and then refer to them in a `read_secret_prefix` when publishing. \n### Provide lookup authentication through an instance profile configured to a served model \n1. Create an [instance profile](https:\/\/docs.aws.amazon.com\/IAM\/latest\/UserGuide\/id_roles_use_switch-role-ec2_instance-profiles.html) that has write permission to the online store.\n2. Configure your [Databricks serving endpoint to use instance profile](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/add-model-serving-instance-profile.html). \nNote \nWhen publishing your table, you do not have to specify a `read_prefix`, and any `read_prefix` specified is overridden with the instance profile. 
\n### Provide read credentials using Databricks secrets \nFollow the instructions in [Use Databricks secrets](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/fs-authentication.html#provide-online-store-credentials-using-databricks-secrets).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/fs-authentication.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Third-party online stores\n##### Authentication for working with online stores\n###### Use Databricks secrets for read and write authentication.\n\nThis section shows the steps to follow to set up authentication with Databricks secrets. For code examples illustrating how to use these secrets, see [Publish features to an online store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/publish-features.html). \n1. [Create two secret scopes](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html#create-a-secret-in-a-databricks-backed-scope) that contain credentials for the online store: one for read-only access (shown here as `<read-scope>`) and one for read-write access (shown here as `<write-scope>`). Alternatively, you can reuse existing secret scopes. \nIf you intend to use an instance profile for write authentication (configured at Databricks cluster level), you do not need to include the `<write-scope>`.\nIf you intend to use an instance profile for read authentication (configured at Databricks Serving endpoint level), you do not need to include the `<read-scope>`.\n2. Pick a unique name for the target online store, shown here as `<prefix>`. 
\nFor DynamoDB (works with any version of Feature Engineering client, and Feature Store client v0.3.8 and above), create the following secrets: \n* Access key ID for the IAM user with read-only access to the target online store: `databricks secrets put-secret <read-scope> <prefix>-access-key-id`\n* Secret access key for the IAM user with read-only access to the target online store: `databricks secrets put-secret <read-scope> <prefix>-secret-access-key`\n* Access key ID for the IAM user with read-write access to the target online store: `databricks secrets put-secret <write-scope> <prefix>-access-key-id`\n* Secret access key for the IAM user with read-write access to the target online store: `databricks secrets put-secret <write-scope> <prefix>-secret-access-key` \nFor SQL stores, create the following secrets: \n* User with read-only access to the target online store: `databricks secrets put-secret <read-scope> <prefix>-user`\n* Password for user with read-only access to the target online store: `databricks secrets put-secret <read-scope> <prefix>-password`\n* User with read-write access to the target online store: `databricks secrets put-secret <write-scope> <prefix>-user`\n* Password for user with read-write access to the target online store: `databricks secrets put-secret <write-scope> <prefix>-password` \nNote \nThere is a [limit on the number of secret scopes per workspace](https:\/\/docs.databricks.com\/security\/secrets\/secret-scopes.html). To avoid hitting this limit, you can [define and share](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html#permissions) a single secret scope for accessing all online stores.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/fs-authentication.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Clone a legacy dashboard to a Lakeview dashboard\n\nNote \nDashboards (formerly Lakeview dashboards) are now generally available. 
\n* Databricks recommends authoring new dashboards using the latest tooling. See [Dashboards](https:\/\/docs.databricks.com\/dashboards\/index.html).\n* Original Databricks SQL dashboards are now called **legacy dashboards**. They will continue to be supported and updated with critical bug fixes, but new functionality will be limited. You can continue to use legacy dashboards for both authoring and consumption.\n* Convert legacy dashboards using the migration tool or REST API. See [Clone a legacy dashboard to a Lakeview dashboard](https:\/\/docs.databricks.com\/dashboards\/clone-legacy-to-lakeview.html) for instructions on using the built-in migration tool. See [Use Databricks APIs to manage dashboards](https:\/\/docs.databricks.com\/dashboards\/tutorials\/index.html#apis) for tutorials on creating and managing dashboards using the REST API. \nThis article describes how to create a new draft dashboard by cloning an existing legacy dashboard. **Clone to Lakeview dashboard** is a menu option in the UI that simplifies the conversion process. **Clone to Lakeview dashboard** is supported for legacy dashboards with a maximum of 100 widgets. \nUsing this button to create a new dashboard does not affect the original legacy dashboard or queries. Instead, this process uses the underlying queries and widget settings to create an equivalent dashboard. \nNote \nDashboards (formerly Lakeview dashboards) do not support all legacy dashboard functionality. See [Dashboards](https:\/\/docs.databricks.com\/dashboards\/index.html) to learn about the available features.\n\n#### Clone a legacy dashboard to a Lakeview dashboard\n##### Required permissions\n\nYou must have at least **Can View** permission on the legacy dashboard and all upstream queries, including those backing query-based dropdown list parameters. Legacy dashboards handle permissions for Databricks SQL queries and dashboards separately. 
Insufficient permissions on an upstream query causes the clone operation to fail.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/clone-legacy-to-lakeview.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Clone a legacy dashboard to a Lakeview dashboard\n##### Clone to dashboard\n\nThe following animation shows a successful conversion. \n![Gif showing conversion process](https:\/\/docs.databricks.com\/_images\/success-clone-to-lakeview.gif) \nComplete the following steps to clone a dashboard: \n1. Click **Clone to Lakeview dashboard**. \nYou can access the **Clone to Lakeview dashboard** option from the following areas in the UI: \n* The Workspace file browser. \nRight-click on the dashboard title, then click **Clone to Lakeview dashboard**. Alternately, use the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu to access the same option.\n* A legacy dashboard. \n+ **When viewing a saved dashboard**: Click **Clone to Lakeview dashboard** from the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu on an existing legacy dashboard.\n+ **When editing a saved dashboard**: Click **Clone to Lakeview** near the top-right corner of the screen. \n![Clone to Lakeview button from legacy dashboard edit mode.](https:\/\/docs.databricks.com\/_images\/clone-lakeview-from-legacy.png)\n2. (Optional) Specify a title and folder location for the new dashboard. \nBy default, you save the new dashboard in the same folder as the original legacy dashboard, with **(Lakeview)** appended to the original title. At this stage, you can retitle the new dashboard and choose a different destination folder.\n3. Click **Clone**. \n![Success message with link to new dashboard.](https:\/\/docs.databricks.com\/_images\/clone-success-notification.png) \nAfter the operation completes, look for a notification in the screen\u2019s upper-right corner. 
Use the link to navigate to your new dashboard.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/clone-legacy-to-lakeview.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Clone a legacy dashboard to a Lakeview dashboard\n##### Review cloned dashboard results\n\nA successful cloning operation creates a new draft dashboard. The existing legacy dashboard and related queries remain unchanged. The two dashboards are unrelated. Updates to one dashboard do not affect the other. \nThe new dashboard is created as a draft. It inherits any sharing permissions applied at the folder level. Permissions set on the source dashboard are not propagated. \nAll dashboard drafts are automatically granted **Run as Viewer** credentials, regardless of existing credential settings in the original legacy dashboard. When you publish a dashboard, you can choose to embed credentials or not. This affects how other users view and interact with your dashboard. See [Dashboard ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#lakeview) to learn how to share and manage permissions for published dashboards.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/clone-legacy-to-lakeview.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Clone a legacy dashboard to a Lakeview dashboard\n##### Adjust legacy parameters\n\nDashboards (formerly Lakeview dashboards) offer limited support for parameters. Per-widget parameter filters are not supported. The following parameter types are not supported: \n* Dropdown lists\n* Query-based dropdown lists\n* Date range \nAll supported parameter types persist in the new dashboard. Unsupported parameter types default to `string` values. Previously set default values are retained. 
\n### Work with parameter widgets \nDuring the clone operation, all legacy dashboard-level parameter widgets are converted to filters and appear at the top of the new dashboard **Canvas**. Date and time parameter widgets are added as **Date picker** filters, and all other types of parameter widgets are converted to **Single value** filters. \nWhen you clone a dashboard that includes queries with parameters, the clone succeeds in recreating the target query, but the query needs to be adjusted to use named parameter syntax for it to succeed. See [Named parameter markers](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-parameter-marker.html#named-parameter-markers). You must update the syntax in the related dataset query to clear the error message shown in the parameter filter widget. Right-click on the widget to open and edit the connected dataset query. \n### Work with parameters in datasets \nOn the **Data** tab, open the dataset that you want to edit. On conversion, all mustache parameters (`{{ }}`) from queries used in legacy dashboards are automatically added to the collection of parameters shown in the UI. \n![A newly converted dataset query with parameters written with mustache syntax and warning message.](https:\/\/docs.databricks.com\/_images\/auto-convert-param.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/clone-legacy-to-lakeview.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Clone a legacy dashboard to a Lakeview dashboard\n##### Troubleshooting and manual adjustments\n\nDashboards (formerly Lakeview dashboards) do not support all features or chart types available in legacy dashboards. When you clone a legacy dashboard with unsupported elements, widgets on the new dashboard show an error message instead of a copy of the original widget. Commonly, errors occur from converting unsupported visualizations or filters. 
See [Add or remove visualizations, text, and filter widgets on the canvas](https:\/\/docs.databricks.com\/dashboards\/index.html#add-or-remove-visualizations-text-and-filter-widgets-on-the-canvas) to learn which visualizations and filters are supported. \n![Dashboard including widgets with errors.](https:\/\/docs.databricks.com\/_images\/widget-conversion-errors.png) \nSee [Dashboard visualization types](https:\/\/docs.databricks.com\/dashboards\/visualization-types.html) to learn how to use chart types supported in dashboards.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/clone-legacy-to-lakeview.html"} +{"content":"# What is Delta Lake?\n### Drop Delta table features\n\nPreview \nSupport for dropping Delta table features and downgrading protocol versions is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html) in Databricks Runtime 14.1 and above. \nDatabricks provides limited support for dropping table features. To drop a table feature, the following must occur: \n* Disable table properties that use the table feature.\n* Remove all traces of the table feature from the data files backing the table.\n* Remove transaction entries that use the table feature from the transaction log.\n* Downgrade the table protocol. \nWhere supported, you should only use this functionality to support compatibility with earlier Databricks Runtime versions, Delta Sharing, or other Delta Lake reader or writer clients. \nImportant \nAll `DROP FEATURE` operations conflict with all concurrent writes. \nStreaming reads fail when they encounter a commit that changes table metadata. If you want the stream to continue you must restart it. 
For recommended methods, see [Production considerations for Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/production.html).\n\n### Drop Delta table features\n#### How can I drop a Delta table feature?\n\nTo remove a Delta table feature, you run an `ALTER TABLE <table-name> DROP FEATURE <feature-name> [TRUNCATE HISTORY]` command. See [ALTER TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table.html). \nYou must use Databricks Runtime 14.1 or above and have `MODIFY` privileges on the target Delta table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/drop-feature.html"} +{"content":"# What is Delta Lake?\n### Drop Delta table features\n#### What Delta table features can be dropped?\n\nYou can drop the following Delta table features: \n* `deletionVectors`. See [What are deletion vectors?](https:\/\/docs.databricks.com\/delta\/deletion-vectors.html).\n* `v2Checkpoint`. See [Compatibility for tables with liquid clustering](https:\/\/docs.databricks.com\/delta\/clustering.html#compatibility).\n* `typeWidening-preview`. See [Type widening](https:\/\/docs.databricks.com\/delta\/type-widening.html). \nYou cannot drop other [Delta table features](https:\/\/github.com\/delta-io\/delta\/blob\/master\/PROTOCOL.md#valid-feature-names-in-table-features).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/drop-feature.html"} +{"content":"# What is Delta Lake?\n### Drop Delta table features\n#### How are Delta table features dropped?\n\nBecause Delta table features represent reader and writer protocols, they must be completely absent from the transaction log for full removal. Dropping a feature occurs in two stages and requires time to elapse before completion. The specifics of feature removal vary by feature, but the following section provides a general overview. \n### Prepare to drop a table feature \nDuring the first stage, the user prepares to drop the table feature. 
The following describes what happens during this stage: \n1. The user runs the `DROP FEATURE` command.\n2. Table properties that specifically enable a table feature have values set to disable the feature.\n3. Table properties that control behaviors associated with the dropped feature have options set to default values before the feature was introduced.\n4. As necessary, data and metadata files are rewritten respecting the updated table properties.\n5. The command finishes running and returns an error message informing the user they must wait 24 hours to proceed with feature removal. \nAfter first disabling a feature, you can continue writing to the target table before completing the protocol downgrade, but cannot use the table feature you are removing. \nNote \nIf you leave the table in this state, operations against the table do not use the table feature, but the protocol still supports the table feature. Until you complete the final downgrade step, the table is not readable by Delta clients that do not understand the table feature. \n### Downgrade the protocol and drop a table feature \nTo drop the table feature, you must remove all transaction history associated with the feature and downgrade the protocol. \n1. After at least 24 hours have passed, the user executes the `DROP FEATURE` command again with the `TRUNCATE HISTORY` clause.\n2. The client confirms that no transactions in the specified retention threshold use the table feature, then truncates the table history to that threshold.\n3. The protocol is downgraded, dropping the table feature.\n4. If the table features that are present in the table can be represented by a legacy protocol version, the `minReaderVersion` and `minWriterVersion` for the table are downgraded to the lowest version that supports exactly all remaining features in use by the Delta table. \nImportant \nRunning `ALTER TABLE <table-name> DROP FEATURE <feature-name> TRUNCATE HISTORY` removes all transaction log data older than 24 hours. 
After dropping a Delta table feature, you do not have access to table history or time travel. \nSee [How does Databricks manage Delta Lake feature compatibility?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/drop-feature.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Bloom filter indexes\n\nNote \nWhen using Photon-enabled compute and Databricks Runtime 12.2 or above, predictive I\/O outperforms bloom filters for read performance. See [What is predictive I\/O?](https:\/\/docs.databricks.com\/optimizations\/predictive-io.html). \nIn Databricks Runtime 13.3 and above, Databricks recommends using clustering for Delta table layout. See [Use liquid clustering for Delta tables](https:\/\/docs.databricks.com\/delta\/clustering.html). \nDatabricks only recommends using Bloom filters when using compute that does not support these features. \nA Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text.\n\n#### Bloom filter indexes\n##### How Bloom filter indexes work\n\nDatabricks Bloom filter indexes consist of a data skipping index for each data file. The Bloom filter index can be used to determine that a column value is definitively *not in* the file, or that it is *probably in* the file. Before reading a file Databricks checks the index file, and the file is read only if the index indicates that the file might match a data filter. \nBloom filters support columns with the following input data types: `byte`, `short`, `int`, `long`, `float`, `double`, `date`, `timestamp`, and `string`. Nulls are not added to the Bloom filter, so any null related filter requires reading the data file. Databricks supports the following data source filters: `and`, `or`, `in`, `equals`, and `equalsnullsafe`. 
Bloom filters are not supported on nested columns.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/bloom-filters.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Bloom filter indexes\n##### Configuration and reference\n\nUse the following syntax to enable a Bloom filter: \n```\nCREATE BLOOMFILTER INDEX\nON TABLE table_name\nFOR COLUMNS(column_name OPTIONS (fpp=0.1, numItems=5000))\n\n``` \nFor syntax details, see [CREATE BLOOM FILTER INDEX](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-create-bloomfilter-index.html) and [DROP BLOOM FILTER INDEX](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-drop-bloomfilter-index.html). \nTo disable Bloom filter operations, set the session level `spark.databricks.io.skipping.bloomFilter.enabled` configuration to `false`.\n\n#### Bloom filter indexes\n##### Display the list of Bloom filter indexes\n\nTo display the list of indexes, run: \n```\nspark.table(\"<table-with-indexes>\").schema.foreach(field => println(s\"${field.name}: metadata=${field.metadata}\"))\n\n``` \nFor example: \n![Show indexes](https:\/\/docs.databricks.com\/_images\/show-bloomfilter-indexes.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/bloom-filters.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Notebook-scoped R libraries\n\nNotebook-scoped R libraries enable you to create and modify custom R environments that are specific to a notebook session. When you install an R notebook-scoped library, only the current notebook and any jobs associated with that notebook have access to that library. Other notebooks attached to the same cluster are not affected. \nNotebook-scoped libraries do not persist across sessions. You must reinstall notebook-scoped libraries at the beginning of each session, or whenever the notebook is detached from a cluster. 
\nNotebook-scoped libraries are automatically available on workers for [SparkR UDFs](https:\/\/spark.apache.org\/docs\/latest\/sparkr.html#applying-user-defined-function). \nTo install libraries for all notebooks attached to a cluster, use cluster-installed libraries. See [Cluster libraries](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html).\n\n#### Notebook-scoped R libraries\n##### Install notebook-scoped libraries in R\n\nYou can use any familiar method of installing packages in R, such as [install.packages()](https:\/\/www.rdocumentation.org\/packages\/utils\/versions\/3.6.2\/topics\/install.packages), the [devtools APIs](https:\/\/cran.r-project.org\/web\/packages\/devtools\/devtools.pdf), or [Bioconductor](https:\/\/www.bioconductor.org\/install\/). \nR packages are accessible to worker nodes as well as the driver node.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Notebook-scoped R libraries\n##### Manage notebook-scoped libraries in R\n\nIn this section: \n* [Install a package](https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html#install-a-package)\n* [Remove an R package from a notebook environment](https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html#remove-an-r-package-from-a-notebook-environment) \n### [Install a package](https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html#id2) \n```\nrequire(devtools)\n\ninstall_version(\npackage = \"caesar\",\nrepos = \"http:\/\/cran.us.r-project.org\"\n)\n\n``` \nDatabricks recommends using a CRAN snapshot as the repository to guarantee [reproducible results](https:\/\/kb.databricks.com\/r\/pin-r-packages.html). 
\n```\ndevtools::install_github(\"klutometis\/roxygen\")\n\n``` \n### [Remove an R package from a notebook environment](https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html#id3) \nTo remove a notebook-scoped library from a notebook, use the `remove.packages()` command. \n```\nremove.packages(\"caesar\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Notebook-scoped R libraries\n##### Notebook-scoped R libraries with Spark UDFs\n\nIn this section: \n* [Notebook-scoped R libraries and SparkR](https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html#notebook-scoped-r-libraries-and-sparkr)\n* [Notebook-scoped R libraries and sparklyr](https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html#notebook-scoped-r-libraries-and-sparklyr)\n* [Library isolation and hosted RStudio](https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html#library-isolation-and-hosted-rstudio) \n### [Notebook-scoped R libraries and SparkR](https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html#id4) \nNotebook-scoped libraries are available on SparkR workers; just import a library to use it. For example, you can run the following to generate a caesar-encrypted message with a SparkR UDF: \n```\nrequire(devtools)\n\ninstall_version(\npackage = \"caesar\",\nrepos = \"http:\/\/cran.us.r-project.org\"\n)\n\nlibrary(SparkR)\nsparkR.session()\n\nhello <- function(x) {\nlibrary(caesar)\ncaesar(\"hello world\")\n}\n\nspark.lapply(c(1, 2), hello)\n\n``` \n### [Notebook-scoped R libraries and sparklyr](https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html#id5) \nBy default, in `sparklyr::spark_apply()`, the `packages` argument is set to `TRUE`. This copies libraries in the current `libPaths` to the workers, allowing you to import and use them on workers. 
For example, you can run the following to generate a caesar-encrypted message with `sparklyr::spark_apply()`: \n```\nrequire(devtools)\n\ninstall_version(\npackage = \"caesar\",\nrepos = \"http:\/\/cran.us.r-project.org\"\n)\n\nlibrary(sparklyr)\nsc <- spark_connect(method = 'databricks')\n\napply_caes <- function(x) {\nlibrary(caesar)\ncaesar(\"hello world\")\n}\n\nsdf_len(sc, 5) %>%\nspark_apply(apply_caes)\n\n``` \nIf you do not want libraries to be available on workers, set `packages` to `FALSE`. \n### [Library isolation and hosted RStudio](https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html#id6) \nRStudio creates a separate library path for each user; therefore users are isolated from each other. However, the library path is not available on workers. If you want to use a package inside SparkR workers in a job launched from RStudio, you need to install it using cluster libraries. \nAlternatively, if you use sparklyr UDFs, packages installed inside RStudio are available to workers when using `spark_apply(..., packages = TRUE)`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Notebook-scoped R libraries\n##### Frequently asked questions (FAQ)\n\n### How do I install a package on just the driver for all R notebooks? \nExplicitly set the installation directory to `\/databricks\/spark\/R\/lib`. For example, with `install.packages()`, run `install.packages(\"pckg\", lib=\"\/databricks\/spark\/R\/lib\")`.\nPackages installed in `\/databricks\/spark\/R\/lib` are shared across all notebooks on the cluster, but they are not accessible to SparkR workers. To share libraries across notebooks and also workers, use [cluster libraries](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html). \n### Are notebook-scoped libraries cached? \nThere is no caching implemented for notebook-scoped libraries on a cluster. 
If you install a package in a notebook, and another user installs the same package in another notebook on the same cluster, the package is downloaded, compiled, and installed again.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html"} +{"content":"# What is data warehousing on Databricks?\n### Refresh operations for materialized views\n\nMaterialized views are database objects that contain the results of a SQL query on one or more base tables. Some materialized views can be incrementally refreshed, automatically and incrementally propagating changes from the base tables. This article explains the refresh operations that can be applied to materialized views.\n\n### Refresh operations for materialized views\n#### Refresh types\n\nRefresh operations are one of these types: \n* **Incremental refresh**: An incremental refresh processes changes in the underlying data after the last refresh and then appends that data to the table. Depending on the base tables and included operations, only certain types of materialized views can be incrementally refreshed.\n* **Full refresh**: A full refresh truncates the table and reprocesses all data available in the source with the latest definition. It is not recommended to perform full refreshes on sources that don\u2019t keep the entire data history or have short retention periods, such as Kafka, because the full refresh truncates the existing data. You may be unable to recover old data if the data is no longer available in the source.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/incremental-refresh.html"} +{"content":"# What is data warehousing on Databricks?\n### Refresh operations for materialized views\n#### How materialized views are refreshed\n\nMaterialized views automatically create and use Delta Live Tables pipelines to process refresh operations. Delta Live Tables pipelines use either a continuous or triggered execution mode. 
Materialized views can be updated in either execution mode. To avoid unnecessary processing when operating in continuous execution mode, pipelines automatically monitor dependent Delta tables and perform an update only when the contents of those dependent tables have changed. See [What is a Delta Live Tables pipeline?](https:\/\/docs.databricks.com\/delta-live-tables\/index.html#pipeline). \nNote \nThe Delta Live Tables runtime cannot detect changes in non-Delta data sources. The table is still updated regularly but with a higher default trigger interval to prevent excessive recomputation from slowing down any incremental processing happening on compute. \nBy default, refresh operations are performed synchronously. You can also set a refresh operation to occur asynchronously. The behavior associated with each approach is as follows: \n* **Synchronous**: A synchronous refresh blocks other operations until the refresh operation is complete. This allows you to sequence refresh operations in an orchestration tool, like workflows. To orchestrate materialized views with workflows, use the **SQL** task type. See [Introduction to Databricks Workflows](https:\/\/docs.databricks.com\/workflows\/index.html).\n* **Asynchronous**: An asynchronous refresh starts a background job on Delta Live Tables compute when a materialized view refresh begins, and the command returns before the data load is complete. Because a Delta Live Tables pipeline manages the refresh, the Databricks SQL warehouse used to create the materialized view is not used. 
It does not need to be running during the refresh operation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/incremental-refresh.html"} +{"content":"# What is data warehousing on Databricks?\n### Refresh operations for materialized views\n#### Support for materialized view incremental refresh\n\nThe following table lists support for incremental refresh by SQL keyword or clause: \n| SQL keyword or clause | Support for incremental refresh |\n| --- | --- |\n| `SELECT` expressions | Expressions including deterministic built-in functions and immutable user-defined functions (UDFs) are supported. |\n| `WITH` | Yes, common table expressions are supported. |\n| `FROM` | Supported base tables include Delta tables, materialized views, and streaming tables |\n| `EXPECTATIONS` | No. Materialized views that use expectations are always fully refreshed. |\n| `UNION ALL` | No |\n| `INNER JOIN` | No |\n| `LEFT JOIN` | No |\n| `GROUP BY` | Yes |\n| `WHERE`, `HAVING` | Filter clauses such as `WHERE` and `HAVING` are supported. |\n| `OVER` | No |\n| `QUALIFY` | No | \nNote \nNon-deterministic functions, for example, `CURRENT_TIMESTAMP`, are not supported.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/incremental-refresh.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Get identifiers for workspace objects\n\nThis article explains how to get workspace, cluster, directory, model, notebook, and job identifiers and URLs in Databricks.\n\n#### Get identifiers for workspace objects\n##### Workspace instance names, URLs, and IDs\n\nAn *instance name* is assigned to each Databricks deployment. To segregate the workload and grant access to relevant users only, usually Databricks customers create separate instances for development, staging, and production. 
The instance name is the first part of the URL when you log in to your Databricks deployment: \n![Workspace](https:\/\/docs.databricks.com\/_images\/workspace-aws.png) \nIf you log in to `https:\/\/cust-success.cloud.databricks.com\/`, the instance name is `cust-success.cloud.databricks.com`. \nA Databricks *[workspace](https:\/\/docs.databricks.com\/workspace\/index.html)* is where the Databricks platform runs and where you can create Spark clusters and schedule workloads. Some types of workspaces have a unique workspace ID. If there is `o=` in the deployment URL, for example, `https:\/\/<databricks-instance>\/?o=6280049833385130`, the random number after `o=` is the Databricks workspace ID. Here the workspace ID is `6280049833385130`. If there is no `o=` in the deployment URL, the workspace ID is `0`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/workspace-details.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Get identifiers for workspace objects\n##### Cluster URL and ID\n\nA Databricks *[cluster](https:\/\/docs.databricks.com\/compute\/index.html)* provides a unified platform for various use cases such as running production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning. Each cluster has a unique ID called the cluster ID. This applies to both all-purpose and job clusters. To get the details of a cluster using the REST API, the cluster ID is essential. \nTo get the cluster ID, click the **Clusters** tab in the sidebar and then select a cluster name. 
The cluster ID is the string after the `\/clusters\/` component in the URL of this page: \n```\nhttps:\/\/<databricks-instance>\/#\/setting\/clusters\/<cluster-id>\n\n``` \nIn the following screenshot, the cluster ID is `1115-164516-often242`: \n![Cluster URL](https:\/\/docs.databricks.com\/_images\/aws-cluster.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/workspace-details.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Get identifiers for workspace objects\n##### Notebook URL and ID\n\nA *[notebook](https:\/\/docs.databricks.com\/notebooks\/index.html)* is a web-based interface to a document that contains runnable code, visualizations, and narrative text. Notebooks are one interface for interacting with Databricks. Each notebook has a unique ID. The notebook URL has the notebook ID, hence the notebook URL is unique to a notebook. It can be shared with anyone on the Databricks platform with permission to view and edit the notebook. In addition, each notebook command (cell) has a different URL. \nTo find a notebook URL or ID, open a notebook. To find a cell URL, click the contents of the command. \n* Example notebook URL: \n```\nhttps:\/\/cust-success.cloud.databricks.com\/#notebook\/333096\n\n```\n* Example notebook ID: `333096`.\n* Example command (cell) URL: \n```\nhttps:\/\/cust-success.cloud.databricks.com\/#notebook\/333096\/command\/333099\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/workspace-details.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Get identifiers for workspace objects\n##### Folder ID\n\nA *[folder](https:\/\/docs.databricks.com\/workspace\/workspace-objects.html#folders)* is a directory used to store files that can be used in the Databricks workspace. These files can be notebooks, libraries or subfolders. There is a specific ID associated with each folder and each individual sub-folder. 
The Permissions API refers to this ID as a directory\\_id and uses it when setting and updating permissions for a folder. \nTo retrieve the directory\\_id, use the Workspace API: \n```\ncurl -n -X GET -H 'Content-Type: application\/json' -d '{\"path\": \"\/Users\/me@example.com\/MyFolder\"}' \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/workspace\/get-status\n\n``` \nThis is an example of the API call response: \n```\n{\n\"object_type\": \"DIRECTORY\",\n\"path\": \"\/Users\/me@example.com\/MyFolder\",\n\"object_id\": 123456789012345\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/workspace-details.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Get identifiers for workspace objects\n##### Model ID\n\nA model refers to an [MLflow registered model](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html), which lets you manage MLflow Models in production through stage transitions and versioning. The registered model ID is required for changing the permissions on the model programmatically through the [Permissions API](https:\/\/docs.databricks.com\/api\/workspace\/permissions). \nTo get the ID of a registered model, you can use the [Workspace API](https:\/\/docs.databricks.com\/api\/workspace\/introduction) endpoint `mlflow\/databricks\/registered-models\/get`. 
For example, the following code returns the registered model object with its properties, including its ID: \n```\ncurl -n -X GET -H 'Content-Type: application\/json' -d '{\"name\": \"model_name\"}' \\\nhttps:\/\/<databricks-instance>\/api\/2.0\/mlflow\/databricks\/registered-models\/get\n\n``` \nThe returned value has the format: \n```\n{\n\"registered_model_databricks\": {\n\"name\":\"model_name\",\n\"id\":\"ceb0477eba94418e973f170e626f4471\"\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/workspace-details.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Get identifiers for workspace objects\n##### Job URL and ID\n\nA *[job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html)* is a way of running a notebook or JAR either immediately or on a scheduled basis. \nTo get a job URL, click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar and click a job name. The job ID is after the text `#job\/` in the URL. The job URL is required to troubleshoot the root cause of failed job runs. \nIn the following screenshot, the job URL is: \n```\nhttps:\/\/cust-success.cloud.databricks.com\/#job\/25612\n\n``` \nIn this example, the job ID is `25612`. \n![Job URL](https:\/\/docs.databricks.com\/_images\/aws-jobs.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/workspace-details.html"} +{"content":"# What is Delta Lake?\n### Remove unused data files with vacuum\n\nYou can remove data files no longer referenced by a Delta table that are older than the retention threshold by running the `VACUUM` command on the table. Running `VACUUM` regularly is important for cost and compliance because of the following considerations: \n* Deleting unused data files reduces cloud storage costs.\n* Data files removed by `VACUUM` might contain records that have been modified or deleted. 
Permanently removing these files from cloud storage ensures these records are no longer accessible.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/vacuum.html"} +{"content":"# What is Delta Lake?\n### Remove unused data files with vacuum\n#### Caveats for vacuum\n\nThe default retention threshold for data files after running `VACUUM` is 7 days. To change this behavior, see [Configure data retention for time travel queries](https:\/\/docs.databricks.com\/delta\/history.html#data-retention). \n`VACUUM` might leave behind empty directories after removing all files from within them. Subsequent `VACUUM` operations delete these empty directories. \nDatabricks recommends using predictive optimization to automatically run `VACUUM` for Delta tables. See [Predictive optimization for Delta Lake](https:\/\/docs.databricks.com\/optimizations\/predictive-optimization.html). \nSome Delta Lake features use metadata files to mark data as deleted rather than rewriting data files. You can use `REORG TABLE ... APPLY (PURGE)` to commit these deletions and rewrite data files. See [Purge metadata-only deletes to force data rewrite](https:\/\/docs.databricks.com\/delta\/vacuum.html#purge). \nImportant \n* In Databricks Runtime 13.3 LTS and above, `VACUUM` semantics for shallow clones with Unity Catalog managed tables differ from other Delta tables. See [Vacuum and Unity Catalog shallow clones](https:\/\/docs.databricks.com\/delta\/clone-unity-catalog.html#vacuum).\n* `VACUUM` removes all files from directories not managed by Delta Lake, ignoring directories beginning with `_` or `.`. If you are storing additional metadata like Structured Streaming checkpoints within a Delta table directory, use a directory name such as `_checkpoints`. \n+ Data for change data feed is managed by Delta Lake in the `_change_data` directory and removed with `VACUUM`. 
See [Use Delta Lake change data feed on Databricks](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html).\n+ Bloom filter indexes use the `_delta_index` directory managed by Delta Lake. `VACUUM` cleans up files in this directory. See [Bloom filter indexes](https:\/\/docs.databricks.com\/optimizations\/bloom-filters.html).\n* The ability to query table versions older than the retention period is lost after running `VACUUM`.\n* Log files are deleted automatically and asynchronously after checkpoint operations and are not governed by `VACUUM`. While the default retention period of log files is 30 days, running `VACUUM` on a table removes the data files necessary for time travel. \nNote \nWhen disk caching is enabled, a cluster might contain data from Parquet files that have been deleted with `VACUUM`. Therefore, it may be possible to query the data of previous table versions whose files have been deleted. Restarting the cluster will remove the cached data. See [Configure the disk cache](https:\/\/docs.databricks.com\/optimizations\/disk-cache.html#configure-cache).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/vacuum.html"} +{"content":"# What is Delta Lake?\n### Remove unused data files with vacuum\n#### Example syntax for vacuum\n\n```\nVACUUM eventsTable -- vacuum files not required by versions older than the default retention period\n\nVACUUM '\/data\/events' -- vacuum files in path-based table\n\nVACUUM delta.`\/data\/events\/`\n\nVACUUM delta.`\/data\/events\/` RETAIN 100 HOURS -- vacuum files not required by versions more than 100 hours old\n\nVACUUM eventsTable DRY RUN -- do dry run to get the list of files to be deleted\n\n``` \nFor Spark SQL syntax details, see [VACUUM](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-vacuum.html). \nSee the [Delta Lake API documentation](https:\/\/docs.databricks.com\/delta\/index.html#delta-api) for Scala, Java, and Python syntax details. 
\nNote \nUse the `RETAIN` keyword to specify the threshold used to determine if a data file should be removed. The `VACUUM` command uses this threshold to look back in time the specified amount of time and identify the most recent table version at that moment. Delta retains all data files required to query that table version and all newer table versions. This setting interacts with other table properties. See [Configure data retention for time travel queries](https:\/\/docs.databricks.com\/delta\/history.html#data-retention).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/vacuum.html"} +{"content":"# What is Delta Lake?\n### Remove unused data files with vacuum\n#### Purge metadata-only deletes to force data rewrite\n\nThe `REORG TABLE` command provides the `APPLY (PURGE)` syntax to rewrite data to apply soft-deletes. Soft-deletes do not rewrite data or delete data files, but rather use metadata files to indicate that some data values have changed. See [REORG TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-reorg-table.html). \nOperations that create soft-deletes in Delta Lake include the following: \n* Dropping columns with [column mapping](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html) enabled.\n* Deleting rows with [deletion vectors](https:\/\/docs.databricks.com\/delta\/deletion-vectors.html) enabled.\n* Any data modifications on Photon-enabled clusters when deletion vectors are enabled. \nWith soft-deletes enabled, old data may remain physically present in the table\u2019s current files even after the data has been deleted or updated. To remove this data physically from the table, complete the following steps: \n1. Run `REORG TABLE ... APPLY (PURGE)`. After doing this, the old data is no longer present in the table\u2019s *current* files, but it is still present in the older files that are used for time travel.\n2. Run `VACUUM` to delete these older files. 
\n`REORG TABLE` creates a new version of the table as the operation completes. All table versions in the history prior to this transaction refer to older data files. Conceptually, this is similar to the `OPTIMIZE` command, where data files are rewritten even though data in the current table version stays consistent. \nImportant \nData files are only deleted when the files have *expired* according to the `VACUUM` retention period. This means that the `VACUUM` must be done with a delay after the `REORG` so that the older files have expired. The retention period of `VACUUM` can be reduced to shorten the required waiting time, at the cost of reducing the maximum history that is retained.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/vacuum.html"} +{"content":"# What is Delta Lake?\n### Remove unused data files with vacuum\n#### What size cluster does vacuum need?\n\nTo select the correct cluster size for `VACUUM`, it helps to understand that the operation occurs in two phases: \n1. The job begins by using all available executor nodes to list files in the source directory in parallel. This list is compared to all files currently referenced in the Delta transaction log to identify files to be deleted. The driver sits idle during this time.\n2. The driver then issues deletion commands for each file to be deleted. File deletion is a driver-only operation, meaning that all operations occur in a single node while the worker nodes sit idle. \nTo optimize cost and performance, Databricks recommends the following, especially for long-running vacuum jobs: \n* Run vacuum on a cluster with auto-scaling set for 1-4 workers, where each worker has 8 cores.\n* Select a driver with between 8 and 32 cores. Increase the size of the driver to avoid out-of-memory (OOM) errors. \nIf `VACUUM` operations are regularly deleting more than 10 thousand files or taking over 30 minutes of processing time, you might want to increase either the size of the driver or the number of workers. 
\nIf you find that the slowdown occurs while identifying files to be removed, add more worker nodes. If the slowdown occurs while delete commands are running, try increasing the size of the driver.\n\n### Remove unused data files with vacuum\n#### How frequently should you run vacuum?\n\nDatabricks recommends regularly running `VACUUM` on all tables to reduce excess cloud data storage costs. The default retention threshold for vacuum is 7 days. Setting a higher threshold gives you access to a greater history for your table, but increases the number of data files stored and, as a result, incurs greater storage costs from your cloud provider.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/vacuum.html"} +{"content":"# What is Delta Lake?\n### Remove unused data files with vacuum\n#### Why can\u2019t you vacuum a Delta table with a low retention threshold?\n\nWarning \nIt is recommended that you set a retention interval to be at least 7 days,\nbecause old snapshots and uncommitted files can still be in use by concurrent\nreaders or writers to the table. If `VACUUM` cleans up active files,\nconcurrent readers can fail or, worse, tables can be corrupted when `VACUUM`\ndeletes files that have not yet been committed. You must choose an interval\nthat is longer than the longest running concurrent transaction and the longest\nperiod that any stream can lag behind the most recent update to the table. \nDelta Lake has a safety check to prevent you from running a dangerous `VACUUM` command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property `spark.databricks.delta.retentionDurationCheck.enabled` to `false`.\n\n### Remove unused data files with vacuum\n#### Audit information\n\n`VACUUM` commits to the Delta transaction log contain audit information. 
You can query the audit events using `DESCRIBE HISTORY`. \nTo capture audit information, enable `spark.databricks.delta.vacuum.logging.enabled`. Audit logging is not enabled by default for AWS S3 tables due to the limited consistency guarantees provided by S3 with regard to multi-workspace writes. If you enable it on S3, make sure there are no workflows that involve multi-workspace writes. Failing to do so may result in data loss.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/vacuum.html"} +{"content":"# Compute\n## Use compute\n### Troubleshoot compute issues\n##### Handling large queries in interactive workflows\n\nA challenge with interactive data workflows is handling large queries. This includes queries that generate too many output rows, fetch many external partitions, or compute on extremely large data sets. These queries can be extremely slow, saturate compute resources, and make it difficult for others to share the same compute. \nQuery Watchdog is a process that prevents queries from monopolizing compute resources by examining the most common causes of large queries and terminating queries that pass a threshold. This article describes how to enable and configure Query Watchdog. \nImportant \nQuery Watchdog is enabled for all all-purpose computes created using the UI.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/troubleshooting\/query-watchdog.html"} +{"content":"# Compute\n## Use compute\n### Troubleshoot compute issues\n##### Handling large queries in interactive workflows\n###### Example of a disruptive query\n\nAn analyst is performing some ad hoc queries in a just-in-time data warehouse. The analyst uses a shared autoscaling compute that makes it easy for multiple users to use a single compute at the same time. Suppose there are two tables that each have a million rows. 
\n```\nimport org.apache.spark.sql.functions._\nspark.conf.set(\"spark.sql.shuffle.partitions\", 10)\n\nspark.range(1000000)\n.withColumn(\"join_key\", lit(\" \"))\n.createOrReplaceTempView(\"table_x\")\nspark.range(1000000)\n.withColumn(\"join_key\", lit(\" \"))\n.createOrReplaceTempView(\"table_y\")\n\n``` \nThese table sizes are manageable in Apache Spark. However, they each include a `join_key` column with an empty string in every row. This can happen if the data is not perfectly clean or if there is significant data skew where some keys are more prevalent than others. These empty join keys are far more prevalent than any other value. \nIn the following code, the analyst is joining these two tables on their keys, which produces output of *one trillion results*, and all of these are produced on a single executor (the executor that gets the `\" \"` key): \n```\nSELECT\nid, count(id)\nFROM\n(SELECT\nx.id\nFROM\ntable_x x\nJOIN\ntable_y y\non x.join_key = y.join_key)\nGROUP BY id\n\n``` \nThis query appears to be running. But without knowing about the data, the analyst sees that there\u2019s \u201conly\u201d a single task left over the course of executing the job. The query never finishes, leaving the analyst frustrated and confused about why it did not work. \nIn this case there is only one problematic join key. Other times there may be many more.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/troubleshooting\/query-watchdog.html"} +{"content":"# Compute\n## Use compute\n### Troubleshoot compute issues\n##### Handling large queries in interactive workflows\n###### Enable and configure Query Watchdog\n\nTo enable and configure Query Watchdog, the following steps are required. 
\n* Enable Watchdog with `spark.databricks.queryWatchdog.enabled`.\n* Configure the task runtime with `spark.databricks.queryWatchdog.minTimeSecs`.\n* Set the minimum number of output rows with `spark.databricks.queryWatchdog.minOutputRows`.\n* Configure the output ratio with `spark.databricks.queryWatchdog.outputRatioThreshold`. \nTo prevent a query from creating too many output rows for the number of input rows, you can enable Query Watchdog and configure the maximum number of output rows as a multiple of the number of input rows. In this example we use a ratio of 1000 (the default). \n```\nspark.conf.set(\"spark.databricks.queryWatchdog.enabled\", true)\nspark.conf.set(\"spark.databricks.queryWatchdog.outputRatioThreshold\", 1000L)\n\n``` \nThe latter configuration declares that any given task should never produce more than 1000 times the number of input rows. \nTip \nThe output ratio is completely customizable. We recommend starting lower and seeing what threshold works well for you and your team. A range of 1,000 to 10,000 is a good starting point. \nNot only does Query Watchdog prevent users from monopolizing compute resources for jobs that will never complete, it also saves time by fast-failing a query that would have never completed. For example, the following query will fail after several minutes because it exceeds the ratio. \n```\nSELECT\nz.id\njoin_key,\nsum(z.id),\ncount(z.id)\nFROM\n(SELECT\nx.id,\ny.join_key\nFROM\ntable_x x\nJOIN\ntable_y y\non x.join_key = y.join_key) z\nGROUP BY join_key, z.id\n\n``` \nHere\u2019s what you would see: \n![Query watchdog](https:\/\/docs.databricks.com\/_images\/query-watchdog-example.png) \nIt\u2019s usually enough to enable Query Watchdog and set the output\/input threshold ratio, but you also have the option to set two additional properties: `spark.databricks.queryWatchdog.minTimeSecs` and `spark.databricks.queryWatchdog.minOutputRows`. 
These properties specify the minimum time a given task in a query must run before cancelling it and the minimum number of output rows for a task in that query. \nFor example, you can set `minTimeSecs` to a higher value if you want to give it a chance to produce a large number of rows per task. Likewise, you can set `spark.databricks.queryWatchdog.minOutputRows` to ten million if you want to stop a query only after a task in that query has produced ten million rows. Anything less and the query succeeds, even if the output\/input ratio was exceeded. \n```\nspark.conf.set(\"spark.databricks.queryWatchdog.minTimeSecs\", 10L)\nspark.conf.set(\"spark.databricks.queryWatchdog.minOutputRows\", 100000L)\n\n``` \nTip \nIf you configure Query Watchdog in a notebook, the configuration does not persist across compute restarts. If you want to configure Query Watchdog for all users of a compute, we recommend that you use a [compute configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/troubleshooting\/query-watchdog.html"} +{"content":"# Compute\n## Use compute\n### Troubleshoot compute issues\n##### Handling large queries in interactive workflows\n###### Detect query on extremely large dataset\n\nAnother typical large query may scan a large amount of data from big tables\/datasets. The scan operation may last for a long time and saturate compute resources (even reading metadata of a big Hive table can take a significant amount of time). You can set `maxHivePartitions` to prevent fetching too many partitions from a big Hive table. Similarly, you can also set `maxQueryTasks` to limit queries on an extremely large dataset. 
\n```\nspark.conf.set(\"spark.databricks.queryWatchdog.maxHivePartitions\", 20000)\nspark.conf.set(\"spark.databricks.queryWatchdog.maxQueryTasks\", 20000)\n\n```\n\n##### Handling large queries in interactive workflows\n###### When should you enable Query Watchdog?\n\nQuery Watchdog should be enabled for ad hoc analytics compute where SQL analysts and data scientists are sharing a given compute and an administrator needs to make sure that queries \u201cplay nicely\u201d with one another.\n\n##### Handling large queries in interactive workflows\n###### When should you disable Query Watchdog?\n\nIn general we do not advise eagerly cancelling queries used in an ETL scenario because there typically isn\u2019t a human in the loop to correct the error. We recommend that you disable Query Watchdog for all but ad hoc analytics compute.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/troubleshooting\/query-watchdog.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n### Pandas API on Spark\n##### pandas function APIs\n\npandas function APIs enable you to directly apply a Python native function that takes and outputs pandas instances to a PySpark DataFrame. Similar to [pandas user-defined functions](https:\/\/docs.databricks.com\/udf\/pandas.html), function APIs also use [Apache Arrow](https:\/\/arrow.apache.org\/) to transfer data and pandas to work with the data; however, Python type hints are optional in pandas function APIs. \nThere are three types of pandas function APIs: \n* Grouped map\n* Map\n* Cogrouped map \npandas function APIs leverage the same internal logic that pandas UDF execution uses. They share characteristics such as PyArrow, supported SQL types, and the configurations. 
\nFor more information, see the blog post [New Pandas UDFs and Python Type Hints in the Upcoming Release of Apache Spark 3.0](https:\/\/databricks.com\/blog\/2020\/05\/20\/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/pandas\/pandas-function-apis.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n### Pandas API on Spark\n##### pandas function APIs\n###### Grouped map\n\nYou transform your grouped data using `groupBy().applyInPandas()` to implement the \u201csplit-apply-combine\u201d pattern. Split-apply-combine consists of three steps: \n* Split the data into groups by using `DataFrame.groupBy`.\n* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The input data contains all the rows and columns for each group.\n* Combine the results into a new `DataFrame`. \nTo use `groupBy().applyInPandas()`, you must define the following: \n* A Python function that defines the computation for each group\n* A `StructType` object or a string that defines the schema of the output `DataFrame` \nThe column labels of the returned `pandas.DataFrame` must either match the field names in the defined output schema if specified as strings, or match the field data types by position if not strings, for example, integer indices. See [pandas.DataFrame](https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.DataFrame.html) for how to label columns when constructing a `pandas.DataFrame`. \nAll data for a group is loaded into memory before the function is applied. This can lead to out of memory exceptions, especially if the group sizes are skewed. The configuration for [maxRecordsPerBatch](https:\/\/spark.apache.org\/docs\/latest\/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size) is not applied on groups and it is up to you to ensure that the grouped data fits into the available memory. 
\nThe following example shows how to use `groupby().applyInPandas()` to subtract the mean from each value in the group. \n```\ndf = spark.createDataFrame(\n    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],\n    (\"id\", \"v\"))\n\ndef subtract_mean(pdf):\n    # pdf is a pandas.DataFrame\n    v = pdf.v\n    return pdf.assign(v=v - v.mean())\n\ndf.groupby(\"id\").applyInPandas(subtract_mean, schema=\"id long, v double\").show()\n# +---+----+\n# | id|   v|\n# +---+----+\n# |  1|-0.5|\n# |  1| 0.5|\n# |  2|-3.0|\n# |  2|-1.0|\n# |  2| 4.0|\n# +---+----+\n\n``` \nFor detailed usage, see [pyspark.sql.GroupedData.applyInPandas](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.GroupedData.applyInPandas.html#pyspark-sql-groupeddata-applyinpandas).\n\n","doc_uri":"https:\/\/docs.databricks.com\/pandas\/pandas-function-apis.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n### Pandas API on Spark\n##### pandas function APIs\n###### Map\n\nUse `DataFrame.mapInPandas()` to transform an iterator of `pandas.DataFrame` batches, which together represent the current PySpark DataFrame, into another iterator of `pandas.DataFrame`. The result is returned as a PySpark DataFrame. \nThe underlying function takes and outputs an iterator of `pandas.DataFrame`. It can return output of arbitrary length in contrast to some pandas UDFs such as Series to Series. 
\nThe following example shows how to use `mapInPandas()`: \n```\ndf = spark.createDataFrame([(1, 21), (2, 30)], (\"id\", \"age\"))\n\ndef filter_func(iterator):\n    for pdf in iterator:\n        yield pdf[pdf.id == 1]\n\ndf.mapInPandas(filter_func, schema=df.schema).show()\n# +---+---+\n# | id|age|\n# +---+---+\n# |  1| 21|\n# +---+---+\n\n``` \nFor detailed usage, see [pyspark.sql.DataFrame.mapInPandas](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.mapInPandas.html#pyspark-sql-dataframe-mapinpandas).\n\n","doc_uri":"https:\/\/docs.databricks.com\/pandas\/pandas-function-apis.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n### Pandas API on Spark\n##### pandas function APIs\n###### Cogrouped map\n\nFor cogrouped map operations with pandas instances, use `DataFrame.groupby().cogroup().applyInPandas()` to cogroup two PySpark `DataFrame`s by a common key and then apply a Python function to each cogroup as shown: \n* Shuffle the data such that the groups of each DataFrame which share a key are cogrouped together.\n* Apply a function to each cogroup. The input of the function is two `pandas.DataFrame` (with an optional tuple representing the key). The output of the function is a `pandas.DataFrame`.\n* Combine the `pandas.DataFrame`s from all groups into a new PySpark `DataFrame`. \nTo use `groupBy().cogroup().applyInPandas()`, you must define the following: \n* A Python function that defines the computation for each cogroup.\n* A `StructType` object or a string that defines the schema of the output PySpark `DataFrame`. \nThe column labels of the returned `pandas.DataFrame` must either match the field names in the defined output schema if specified as strings, or match the field data types by position if not strings, for example, integer indices. 
See [pandas.DataFrame](https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.DataFrame.html) for how to label columns when constructing a `pandas.DataFrame`. \nAll data for a cogroup is loaded into memory before the function is applied. This can lead to out of memory exceptions, especially if the group sizes are skewed. The configuration for [maxRecordsPerBatch](https:\/\/spark.apache.org\/docs\/latest\/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size) is not applied and it is up to you to ensure that the cogrouped data fits into the available memory. \nThe following example shows how to use `groupby().cogroup().applyInPandas()` to perform an `asof join` between two datasets. \n```\nimport pandas as pd\n\ndf1 = spark.createDataFrame(\n    [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)],\n    (\"time\", \"id\", \"v1\"))\n\ndf2 = spark.createDataFrame(\n    [(20000101, 1, \"x\"), (20000101, 2, \"y\")],\n    (\"time\", \"id\", \"v2\"))\n\ndef asof_join(l, r):\n    return pd.merge_asof(l, r, on=\"time\", by=\"id\")\n\ndf1.groupby(\"id\").cogroup(df2.groupby(\"id\")).applyInPandas(\n    asof_join, schema=\"time int, id int, v1 double, v2 string\").show()\n# +--------+---+---+---+\n# |    time| id| v1| v2|\n# +--------+---+---+---+\n# |20000101|  1|1.0|  x|\n# |20000102|  1|3.0|  x|\n# |20000101|  2|2.0|  y|\n# |20000102|  2|4.0|  y|\n# +--------+---+---+---+\n\n``` \nFor detailed usage, see [pyspark.sql.PandasCogroupedOps.applyInPandas](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.PandasCogroupedOps.applyInPandas.html#pyspark-sql-pandascogroupedops-applyinpandas).\n\n","doc_uri":"https:\/\/docs.databricks.com\/pandas\/pandas-function-apis.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Range join optimization\n\nA *range join* occurs when two relations are joined using a point in interval or interval overlap condition.\nThe range join 
optimization support in Databricks Runtime can bring orders of magnitude improvement in query performance, but requires careful manual tuning. \nDatabricks recommends using join hints for range joins when performance is poor.\n\n#### Range join optimization\n##### Point in interval range join\n\nA *point in interval range join* is a join in which the condition contains predicates specifying that a value from one relation is between two values from the other relation. For example: \n```\n-- using BETWEEN expressions\nSELECT *\nFROM points JOIN ranges ON points.p BETWEEN ranges.start and ranges.end;\n\n-- using inequality expressions\nSELECT *\nFROM points JOIN ranges ON points.p >= ranges.start AND points.p < ranges.end;\n\n-- with fixed length interval\nSELECT *\nFROM points JOIN ranges ON points.p >= ranges.start AND points.p < ranges.start + 100;\n\n-- join two sets of point values within a fixed distance from each other\nSELECT *\nFROM points1 p1 JOIN points2 p2 ON p1.p >= p2.p - 10 AND p1.p <= p2.p + 10;\n\n-- a range condition together with other join conditions\nSELECT *\nFROM points, ranges\nWHERE points.symbol = ranges.symbol\nAND points.p >= ranges.start\nAND points.p < ranges.end;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/range-join.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Range join optimization\n##### Interval overlap range join\n\nAn *interval overlap range join* is a join in which the condition contains predicates specifying an overlap of intervals between two values from each relation. 
For example: \n```\n-- overlap of [r1.start, r1.end] with [r2.start, r2.end]\nSELECT *\nFROM r1 JOIN r2 ON r1.start < r2.end AND r2.start < r1.end;\n\n-- overlap of fixed length intervals\nSELECT *\nFROM r1 JOIN r2 ON r1.start < r2.start + 100 AND r2.start < r1.start + 100;\n\n-- a range condition together with other join conditions\nSELECT *\nFROM r1 JOIN r2 ON r1.symbol = r2.symbol\nAND r1.start <= r2.end\nAND r1.end >= r2.start;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/range-join.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Range join optimization\n##### Range join optimization\n\nThe range join optimization is performed for joins that: \n* Have a condition that can be interpreted as a point in interval or interval overlap range join.\n* All values involved in the range join condition are of a numeric type (integral, floating point, decimal), `DATE`, or `TIMESTAMP`.\n* All values involved in the range join condition are of the same type. In the case of the decimal type, the values also need to be of the same scale and precision.\n* It is an `INNER JOIN`, or in case of point in interval range join, a `LEFT OUTER JOIN` with point value on the left side, or `RIGHT OUTER JOIN` with point value on the right side.\n* Have a bin size tuning parameter. \n### Bin size \nThe *bin size* is a numeric tuning parameter that splits the values domain of the range condition into multiple *bins* of equal size. For example, with a bin size of 10, the optimization splits the domain into bins that are intervals of length 10.\nIf you have a point in range condition of `p BETWEEN start AND end`, and `start` is 8 and `end` is 22, this value interval overlaps with three bins of length 10 \u2013 the first bin from 0 to 10, the second bin from 10 to 20, and the third bin from 20 to 30. Only the points that fall within the same three bins need to be considered as possible join matches for that interval. 
For example, if `p` is 32, it can be ruled out as falling between `start` of 8 and `end` of 22, because it falls in the bin from 30 to 40. \nNote \n* For `DATE` values, the value of the bin size is interpreted as days. For example, a bin size value of 7 represents a week.\n* For `TIMESTAMP` values, the value of the bin size is interpreted as seconds. If a sub-second value is required, fractional values can be used. For example, a bin size value of 60 represents a minute, and a bin size value of 0.1 represents 100 milliseconds. \nYou can specify the bin size either by using a range join hint in the query or by setting a session configuration parameter.\nThe range join optimization is applied *only if* you manually specify the bin size. Section [Choose the bin size](https:\/\/docs.databricks.com\/optimizations\/range-join.html#id2) describes how to choose an optimal bin size.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/range-join.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Range join optimization\n##### Enable range join using a range join hint\n\nTo enable the range join optimization in a SQL query, you can use a *range join hint* to specify the bin size.\nThe hint must contain the relation name of one of the joined relations and the numeric bin size parameter.\nThe relation name can be a table, a view, or a subquery. 
\n```\nSELECT \/*+ RANGE_JOIN(points, 10) *\/ *\nFROM points JOIN ranges ON points.p >= ranges.start AND points.p < ranges.end;\n\nSELECT \/*+ RANGE_JOIN(r1, 0.1) *\/ *\nFROM (SELECT * FROM ranges WHERE ranges.amount < 100) r1, ranges r2\nWHERE r1.start < r2.start + 100 AND r2.start < r1.start + 100;\n\nSELECT \/*+ RANGE_JOIN(c, 500) *\/ *\nFROM a\nJOIN b ON (a.b_key = b.id)\nJOIN c ON (a.ts BETWEEN c.start_time AND c.end_time)\n\n``` \nNote \nIn the third example, you *must* place the hint on `c`.\nThis is because joins are left associative, so the query is interpreted as `(a JOIN b) JOIN c`,\nand the hint on `a` applies to the join of `a` with `b` and not the join with `c`. \n```\n#create minute table\nminutes = spark.createDataFrame(\n[(0, 60), (60, 120)],\n\"minute_start: int, minute_end: int\"\n)\n\n#create events table\nevents = spark.createDataFrame(\n[(12, 33), (0, 120), (33, 72), (65, 178)],\n\"event_start: int, event_end: int\"\n)\n\n#Range_Join with \"hint\" on the from table\n(events.hint(\"range_join\", 60)\n.join(minutes,\non=[events.event_start < minutes.minute_end,\nminutes.minute_start < events.event_end])\n.orderBy(events.event_start,\nevents.event_end,\nminutes.minute_start)\n.show()\n)\n\n#Range_Join with \"hint\" on the join table\n(events.join(minutes.hint(\"range_join\", 60),\non=[events.event_start < minutes.minute_end,\nminutes.minute_start < events.event_end])\n.orderBy(events.event_start,\nevents.event_end,\nminutes.minute_start)\n.show()\n)\n\n``` \nYou can also place a range join hint on one of the joined DataFrames. In that case, the hint contains just the numeric bin size parameter. 
\n```\nval df1 = spark.table(\"ranges\").as(\"left\")\nval df2 = spark.table(\"ranges\").as(\"right\")\n\nval joined = df1.hint(\"range_join\", 10)\n.join(df2, $\"left.type\" === $\"right.type\" &&\n$\"left.end\" > $\"right.start\" &&\n$\"left.start\" < $\"right.end\")\n\nval joined2 = df1\n.join(df2.hint(\"range_join\", 0.5), $\"left.type\" === $\"right.type\" &&\n$\"left.end\" > $\"right.start\" &&\n$\"left.start\" < $\"right.end\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/range-join.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Range join optimization\n##### Enable range join using session configuration\n\nIf you don\u2019t want to modify the query, you can specify the bin size as a configuration parameter. \n```\nSET spark.databricks.optimizer.rangeJoin.binSize=5\n\n``` \nThis configuration parameter applies to any join with a range condition. However, a different bin size set through a range join hint always overrides the one set through the parameter.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/range-join.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Range join optimization\n##### Choose the bin size\n\nThe effectiveness of the range join optimization depends on choosing the appropriate bin size. \nA small bin size results in a larger number of bins, which helps in filtering the potential matches.\nHowever, it becomes inefficient if the bin size is significantly smaller than the encountered value intervals, and the value intervals overlap multiple *bin* intervals. For example, with a condition `p BETWEEN start AND end`, where `start` is 1,000,000 and `end` is 1,999,999, and a bin size of 10, the value interval overlaps with 100,000 bins. \nIf the length of the interval is fairly uniform and known, we recommend that you set the bin size to the typical expected length of the value interval. 
However, if the length of the interval is varying and skewed, a balance must be found to set a bin size that filters the short intervals efficiently, while preventing the long intervals from overlapping too many bins. Assuming a table `ranges`, with intervals that are between columns `start` and `end`, you can determine different percentiles of the skewed interval length value with the following query: \n```\nSELECT APPROX_PERCENTILE(CAST(end - start AS DOUBLE), ARRAY(0.5, 0.9, 0.99, 0.999, 0.9999)) FROM ranges\n\n``` \nA recommended setting of bin size would be the maximum of the value at the 90th percentile, or the value at the 99th percentile divided by 10, or the value at the 99.9th percentile divided by 100 and so on. The rationale is: \n* If the value at the 90th percentile is the bin size, only 10% of the value interval lengths are longer than the bin interval, so span more than 2 adjacent bin intervals.\n* If the value at the 99th percentile is the bin size, only 1% of the value interval lengths span more than 11 adjacent bin intervals.\n* If the value at the 99.9th percentile is the bin size, only 0.1% of the value interval lengths span more than 101 adjacent bin intervals.\n* The same can be repeated for the values at the 99.99th, the 99.999th percentile, and so on if needed. \nThe described method limits the amount of skewed long value intervals that overlap multiple bin intervals.\nThe bin size value obtained this way is only a starting point for fine tuning; actual results may depend on the specific workload.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/range-join.html"} +{"content":"# Technology partners\n## Databricks sign-on from partner solutions\n#### Enable or disable partner OAuth applications\n\nThis article describes how to enable and disable partner OAuth applications for your Databricks account. \ndbt Core, Power BI, Tableau Desktop, and Tableau Cloud OAuth applications are enabled by default for your account. 
\nNote \nUpdates to OAuth applications can take 30 minutes to process.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/enable-disable-oauth.html"} +{"content":"# Technology partners\n## Databricks sign-on from partner solutions\n#### Enable or disable partner OAuth applications\n##### Enable or disable apps using the Databricks CLI\n\nThis section describes how to use the Databricks CLI to disable the partner OAuth applications that are enabled by default for your account, and how to re-enable them after they\u2019ve been disabled. It also describes how to enable and disable Tableau Server, which is not enabled by default. \n### Before you begin \nBefore you enable or disable partner OAuth application integrations using the Databricks CLI, do the following: \n* [Install the Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/install.html) and [set up authentication between the Databricks CLI and your Databricks account](https:\/\/docs.databricks.com\/dev-tools\/cli\/authentication.html).\n* To disable or modify an existing OAuth application, locate the integration ID. \n+ For dbt Core, Power BI, Tableau Desktop, or Tableau Cloud, run the following command: \n```\ndatabricks account published-app-integration list\n\n```\n+ For custom applications, like Tableau Server, run the following command: \n```\ndatabricks account custom-app-integration list\n\n``` \nThe unique integration ID for each OAuth application is returned. \n### Disable dbt Core, Power BI, Tableau Desktop, or Tableau Cloud OAuth application using the CLI \ndbt Core, Power BI, Tableau Desktop, and Tableau Cloud OAuth applications are enabled by default for your account. 
To disable a dbt Core, Power BI, Tableau Desktop, or Tableau Cloud OAuth application, run the following command, replacing `<integration-id>` with the integration ID of the OAuth application you want to delete: \n```\ndatabricks account published-app-integration delete <integration-id>\n\n``` \n### Re-enable dbt Core, Power BI, Tableau Desktop, or Tableau Cloud OAuth application using the CLI \ndbt Core, Power BI, Tableau Desktop, and Tableau Cloud are enabled as OAuth applications in your account by default. To re-enable one of these OAuth applications after it\u2019s been disabled, run the following command, replacing `<application-id>` with `databricks-dbt-adapter`, `power-bi`, `tableau-desktop` or `7de584d0-b7ad-4850-b915-be7de7d58711` (Tableau Cloud): \n```\ndatabricks account published-app-integration create <application-id>\n\n``` \nThe unique integration ID for the OAuth application is returned. \n### Enable custom OAuth applications using the CLI \ndbt Core, Power BI, Tableau Desktop, and Tableau Cloud OAuth applications are enabled by default for your account. You can use the Databricks CLI to enable additional custom OAuth applications. \nFor steps to enable a custom Tableau Server OAuth application, see [Configure Databricks sign-on from Tableau Server](https:\/\/docs.databricks.com\/integrations\/configure-oauth-tableau.html). For generic steps to enable any custom OAuth application using the CLI, see the following: \n1. Run the `custom-app-integration create` command. The following example creates a non-public (confidential) application: \n```\ndatabricks account custom-app-integration create --confidential --json '{\"name\":\"<name>\", \"redirect_urls\":[\"<redirect-url>\"], \"scopes\":[<scopes>]}'\n\n``` \n* Replace `<name>` with a name for your custom OAuth application.\n* Replace `<redirect-url>` with the redirect URLs for your application.\n* Replace `<scopes>` with the list of scopes you want to allow for the application. 
\n+ For BI applications, the `SQL` scope is required to allow the connected app to access Databricks SQL APIs.\n+ For applications that need to access Databricks APIs for purposes other than SQL, the `ALL APIs` scope is required.\n+ The `openid`, `email`, and `profile` scopes are required to generate the ID token.\n+ The `offline_access` scope is required to generate refresh tokens. \nFor more information about supported values, see [POST \/api\/2.0\/accounts\/{account\\_id}\/oauth2\/custom-app-integrations](https:\/\/docs.databricks.com\/api\/account\/customappintegration\/create) in the REST API reference. \nA client ID is generated. For non-public (confidential) applications, a client secret is also generated. The following output is returned: \n```\n{\"integration_id\":\"<integration-id>\",\"client_id\":\"<client-id>\",\"client_secret\":\"<client-secret>\"}\n\n``` \nNote \nEnabling an OAuth application can take 30 minutes to process.\n2. Securely store the client secret, if applicable. \nImportant \nYou can\u2019t retrieve the client secret later. \n### Disable custom OAuth applications using the CLI \nTo disable an existing custom OAuth application, like Tableau Server, run the following command, replacing `<integration-id>` with the integration ID of the OAuth application you want to disable: \n```\ndatabricks account custom-app-integration delete <integration-id>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/enable-disable-oauth.html"} +{"content":"# Technology partners\n## Databricks sign-on from partner solutions\n#### Enable or disable partner OAuth applications\n##### Enable custom OAuth applications using the Databricks UI\n\ndbt Core, Power BI, Tableau Desktop, and Tableau Cloud OAuth applications are enabled by default for your account. You can use the Databricks UI to enable additional custom OAuth applications. \nTo enable a custom OAuth application in the UI, do the following: \n1. 
Log in to the [account console](https:\/\/accounts.cloud.databricks.com) and click the **Settings** icon in the sidebar.\n2. On the **App connections** tab, click **Add connection**.\n3. Enter the following details: \n1. A name for your connection.\n2. The redirect URLs for your OAuth connection.\n3. For **Access scopes**, the APIs the application should have access to. \n* For BI applications, the `SQL` scope is required to allow the connected app to access Databricks SQL APIs.\n* For applications that need to access Databricks APIs for purposes other than querying, the `ALL APIs` scope is required. \nThe following scopes are automatically allowed: \n* `openid`, `email`, `profile`: Required to generate the ID token.\n* `offline_access`: Required to generate refresh tokens. \nIf you don\u2019t want to allow these scopes for the application, you can manage fine-grained scopes by using the [POST \/api\/2.0\/accounts\/{account\\_id}\/oauth2\/custom-app-integrations](https:\/\/docs.databricks.com\/api\/account\/customappintegration\/create) API to create your custom application.\n4. The access token time-to-live (TTL) in minutes. Default: `60`.\n5. The refresh token time-to-live (TTL) in minutes. Default: `10080`.\n6. Whether to generate a client secret. This is required for non-public (confidential) clients. \nThe **Connection created** dialog box displays the client ID and the client secret, if applicable, for your connection.\n4. If you selected **Generate a client secret**, copy and securely store the client secret. You can\u2019t retrieve the client secret later. \nYou can edit the redirect URL, token TTL, and refresh token TTL for existing custom OAuth applications in the UI by clicking the application name on the **Settings** > **App connections** page in the account console. You can also view your existing published OAuth applications (dbt Core, Power BI, Tableau) in the UI. You can edit the token TTL and refresh token TTL for existing published applications. 
\nYou can disable both published and custom OAuth applications in the UI by either clicking the application name or the vertical ellipses next to the application name, and then clicking **Remove**. \nNote \nDisabling an application breaks the application connection, so use caution when disabling OAuth applications. If you disable a published OAuth application in the UI, it can\u2019t be re-enabled in the UI. To re-enable a published application, see [Re-enable dbt Core, Power BI, Tableau Desktop, or Tableau Cloud OAuth application using the CLI](https:\/\/docs.databricks.com\/integrations\/enable-disable-oauth.html#re-enable).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/enable-disable-oauth.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Hyperparameter tuning\n##### Hyperopt concepts\n\nThis article describes some of the concepts you need to know to use distributed Hyperopt. \nIn this section: \n* [`fmin()`](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-concepts.html#fmin)\n* [The `SparkTrials` class](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-concepts.html#the-sparktrials-class)\n* [`SparkTrials` and MLflow](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-concepts.html#sparktrials-and-mlflow) \nFor examples illustrating how to use Hyperopt in Databricks, see [Hyperparameter tuning with Hyperopt](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/index.html#hyperopt-overview).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-concepts.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Hyperparameter tuning\n##### Hyperopt concepts\n###### [`fmin()`](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-concepts.html#id1)\n\nYou use `fmin()` to 
execute a Hyperopt run. The arguments for `fmin()` are shown in the table; see [the Hyperopt documentation](https:\/\/github.com\/hyperopt\/hyperopt\/wiki\/FMin) for more information. For examples of how to use each argument, see [the example notebooks](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/index.html#hyperopt-overview). \n| Argument name | Description |\n| --- | --- |\n| `fn` | Objective function. Hyperopt calls this function with values generated from the hyperparameter space provided in the space argument. This function can return the loss as a scalar value or in a dictionary (see [Hyperopt docs](https:\/\/github.com\/hyperopt\/hyperopt\/wiki\/FMin) for details). This function typically contains code for model training and loss calculation. |\n| `space` | Defines the hyperparameter space to search. Hyperopt provides great flexibility in how this space is defined. You can choose a categorical option such as algorithm, or probabilistic distribution for numeric values such as uniform and log. |\n| `algo` | Hyperopt search algorithm to use to search hyperparameter space. Most commonly used are `hyperopt.rand.suggest` for Random Search and `hyperopt.tpe.suggest` for TPE. |\n| `max_evals` | Number of hyperparameter settings to try (the number of models to fit). |\n| `max_queue_len` | Number of hyperparameter settings Hyperopt should generate ahead of time. Because the Hyperopt TPE generation algorithm can take some time, it can be helpful to increase this beyond the default value of 1, but generally no larger than the `SparkTrials` setting `parallelism`. |\n| `trials` | A `Trials` or `SparkTrials` object. Use `SparkTrials` when you call single-machine algorithms such as scikit-learn methods in the objective function. Use `Trials` when you call distributed training algorithms such as MLlib methods or Horovod in the objective function. 
|\n| `early_stop_fn` | An optional early stopping function to determine if `fmin` should stop before `max_evals` is reached. Default is `None`. The input signature of the function is `Trials, *args` and the output signature is `bool, *args`. The output boolean indicates whether or not to stop. `*args` is any state, where the output of a call to `early_stop_fn` serves as input to the next call. `Trials` can be a `SparkTrials` object. When using `SparkTrials`, the early stopping function is not guaranteed to run after every trial, and is instead polled. [Example of an early stopping function](https:\/\/github.com\/hyperopt\/hyperopt\/blob\/master\/hyperopt\/early_stop.py) |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-concepts.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Hyperparameter tuning\n##### Hyperopt concepts\n###### [The `SparkTrials` class](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-concepts.html#id2)\n\n`SparkTrials` is an API developed by Databricks that allows you to distribute a Hyperopt run without making other changes to your Hyperopt code. `SparkTrials` accelerates single-machine tuning by distributing trials to Spark workers. \nNote \n`SparkTrials` is designed to parallelize computations for single-machine ML models such as scikit-learn. For models created with distributed ML algorithms such as MLlib or Horovod, do not use `SparkTrials`. In this case the model building process is automatically parallelized on the cluster and you should use the default Hyperopt class `Trials`. \nThis section describes how to configure the arguments you pass to `SparkTrials` and implementation aspects of `SparkTrials`. \n### Arguments \n`SparkTrials` takes two optional arguments: \n* `parallelism`: Maximum number of trials to evaluate concurrently. A higher number lets you scale-out testing of more hyperparameter settings. 
Because Hyperopt proposes new trials based on past results, there is a trade-off between parallelism and adaptivity. For a fixed `max_evals`, greater parallelism speeds up calculations, but lower parallelism may lead to better results since each iteration has access to more past results. \nDefault: Number of Spark executors available. Maximum: 128. If the value is greater than the number of concurrent tasks allowed by the cluster configuration, `SparkTrials` reduces parallelism to this value.\n* `timeout`: Maximum number of seconds an `fmin()` call can take. When this number is exceeded, all runs are terminated and `fmin()` exits. Information about completed runs is saved. \n### Implementation \nWhen defining the objective function `fn` passed to `fmin()`, and when selecting a cluster setup, it is helpful to understand how `SparkTrials` distributes tuning tasks. \nIn Hyperopt, a trial generally corresponds to fitting one model on one setting of hyperparameters. Hyperopt iteratively generates trials, evaluates them, and repeats. \nWith `SparkTrials`, the driver node of your cluster generates new trials, and worker nodes evaluate those trials. Each trial is generated with a Spark job which has one task, and is evaluated in the task on a worker machine. If your cluster is set up to run multiple tasks per worker, then multiple trials may be evaluated at once on that worker.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-concepts.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Hyperparameter tuning\n##### Hyperopt concepts\n###### [`SparkTrials` and MLflow](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-concepts.html#id3)\n\nDatabricks Runtime ML supports logging to MLflow from workers. You can add custom logging code in the objective function you pass to Hyperopt. 
\n`SparkTrials` logs tuning results as nested MLflow runs as follows: \n* Main or parent run: The call to `fmin()` is logged as the main run. If there is an active run, `SparkTrials` logs to this active run and does not end the run when `fmin()` returns. If there is no active run, `SparkTrials` creates a new run, logs to it, and ends the run before `fmin()` returns.\n* Child runs: Each hyperparameter setting tested (a \u201ctrial\u201d) is logged as a child run under the main run. MLflow log records from workers are also stored under the corresponding child runs. \nWhen calling `fmin()`, Databricks recommends active MLflow run management; that is, wrap the call to `fmin()` inside a `with mlflow.start_run():` statement. This ensures that each `fmin()` call is logged to a separate MLflow main run, and makes it easier to log extra tags, parameters, or metrics to that run. \nNote \nWhen you call `fmin()` multiple times within the same active MLflow run, MLflow logs those calls to the same main run. To resolve name conflicts for logged parameters and tags, MLflow appends a UUID to names with conflicts. \nWhen logging from workers, you do not need to manage runs explicitly in the objective function. Call `mlflow.log_param(\"param_from_worker\", x)` in the objective function to log a parameter to the child run. You can log parameters, metrics, tags, and artifacts in the objective function.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-concepts.html"} +{"content":"# Databricks data engineering\n## What is DBFS?\n#### Best practices for DBFS and Unity Catalog\n\nUnity Catalog introduces a number of new configurations and concepts that approach data governance entirely differently than DBFS. This article outlines several best practices around working with Unity Catalog external locations and DBFS. 
\nDatabricks recommends against using DBFS and mounted cloud object storage for most use cases in Unity Catalog-enabled Databricks workspaces. This article describes a few scenarios in which you should use mounted cloud object storage. Note that Databricks does not recommend using the DBFS root in conjunction with Unity Catalog, unless you must migrate files or data stored there into Unity Catalog.\n\n#### Best practices for DBFS and Unity Catalog\n##### How is DBFS used in Unity Catalog-enabled workspaces?\n\nActions performed against tables in the `hive_metastore` use legacy data access patterns, which might include data and storage credentials managed by DBFS. Managed tables in the workspace-scoped `hive_metastore` are stored on the DBFS root.\n\n#### Best practices for DBFS and Unity Catalog\n##### How does DBFS work in single user access mode?\n\nClusters configured with single user access mode have full access to DBFS, including all files in the DBFS root and mounted data.\n\n#### Best practices for DBFS and Unity Catalog\n##### How does DBFS work in shared access mode?\n\nShared access mode combines Unity Catalog data governance with Databricks legacy table ACLs. Access to data in the `hive_metastore` is only available to users that have permissions explicitly granted. \nTo interact with files directly using DBFS, you must have `ANY FILE` permissions granted. Because `ANY FILE` allows users to bypass legacy table ACLs in the `hive_metastore` and access all data managed by DBFS, Databricks recommends caution when granting this privilege.\n\n#### Best practices for DBFS and Unity Catalog\n##### Do not use DBFS with Unity Catalog external locations\n\nUnity Catalog secures access to data in external locations by using full cloud URI paths to identify grants on managed object storage directories. DBFS mounts use a different data access model that bypasses Unity Catalog entirely. 
Databricks recommends that you do not reuse cloud object storage volumes between DBFS mounts and UC external volumes, including when sharing data across workspaces or accounts.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dbfs\/unity-catalog.html"} +{"content":"# Databricks data engineering\n## What is DBFS?\n#### Best practices for DBFS and Unity Catalog\n##### Secure your Unity Catalog-managed storage\n\nUnity Catalog uses managed storage locations to store data files for managed tables and volumes. \nDatabricks recommends the following for managed storage locations: \n* Use new storage accounts or buckets.\n* Define a custom identity policy for Unity Catalog.\n* Restrict all access to Databricks managed by Unity Catalog.\n* Restrict all access to identity access policies created for Unity Catalog.\n\n#### Best practices for DBFS and Unity Catalog\n##### Add existing data to external locations\n\nIt is possible to load existing storage accounts into Unity Catalog using external locations. For greatest security, Databricks recommends only loading storage accounts to external locations after revoking all other storage credentials and access patterns. \nYou should never load a storage account used as a DBFS root as an external location in Unity Catalog.\n\n#### Best practices for DBFS and Unity Catalog\n##### Cluster configurations are ignored by Unity Catalog filesystem access\n\nUnity Catalog does not respect cluster configurations for filesystem settings. This means that Hadoop filesystem settings for configuring custom behavior with cloud object storage do not work when accessing data using Unity Catalog.\n\n#### Best practices for DBFS and Unity Catalog\n##### Limitation around multiple path access\n\nWhile you can generally use Unity Catalog and DBFS together, paths that are equal or share a parent\/child relationship cannot be referenced in the same command or notebook cell using different access methods. 
\nFor example, if an external table `foo` is defined in the `hive_metastore` at location `a\/b\/c` and an external location is defined in Unity Catalog on `a\/b\/`, the following code would throw an error: \n```\nspark.read.table(\"foo\").filter(\"id IS NOT NULL\").write.mode(\"overwrite\").save(\"a\/b\/c\")\n\n``` \nThis error would not arise if this logic is broken into two cells: \n```\ndf = spark.read.table(\"foo\").filter(\"id IS NOT NULL\")\n\n``` \n```\ndf.write.mode(\"overwrite\").save(\"a\/b\/c\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dbfs\/unity-catalog.html"} +{"content":"# Databricks data engineering\n### Git integration with Databricks Git folders\n\nDatabricks Git folders is a visual Git client and API in Databricks. It supports common Git operations such as cloning a repository, committing and pushing, pulling, branch management, and visual comparison of diffs when committing. \nWithin Git folders you can develop code in notebooks or other files and follow data science and engineering code development best practices using Git for version control, collaboration, and CI\/CD. \nNote \nGit folders (Repos) are primarily designed for authoring and collaborative workflows. \nFor information on migrating from a legacy Git integration, see [Migrate to Git folders (formerly Repos) from legacy Git](https:\/\/docs.databricks.com\/_extras\/documents\/migrate-to-repos-from-legacy-git.pdf).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/index.html"} +{"content":"# Databricks data engineering\n### Git integration with Databricks Git folders\n#### What can you do with Databricks Git folders?\n\nDatabricks Git folders provides source control for data and AI projects by integrating with Git providers. 
\nIn Databricks Git folders, you can use Git functionality to: \n* Clone, push to, and pull from a remote Git repository.\n* Create and manage branches for development work, including merging, rebasing, and resolving conflicts.\n* Create notebooks (including IPYNB notebooks) and edit them and other files.\n* Visually compare differences upon commit and resolve merge conflicts. \nFor step-by-step instructions, see [Run Git operations on Databricks Git folders (Repos)](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html). \nNote \nDatabricks Git folders also has an [API](https:\/\/docs.databricks.com\/api\/workspace\/repos) that you can integrate with your CI\/CD pipeline. For example, you can programmatically update a Databricks repo so that it always has the most recent version of the code. For information about best practices for code development using Databricks Git folders, see [CI\/CD techniques with Git and Databricks Git folders (Repos)](https:\/\/docs.databricks.com\/repos\/ci-cd-techniques-with-repos.html). \nFor information on the kinds of notebooks supported in Databricks, see [Export and import Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/index.html"} +{"content":"# Databricks data engineering\n### Git integration with Databricks Git folders\n#### Supported Git providers\n\nDatabricks Git folders are backed by an integrated Git repository. The repository can be hosted by any of the cloud and enterprise Git providers listed in the following section. \nNote \n**What is a \u201cGit provider\u201d?** \nA \u201cGit provider\u201d is the specific (named) service that hosts a source control model based on Git. Git-based source control platforms are hosted in two ways: as a cloud service hosted by the developing company, or as an on-premises service installed and managed by your own company on its own hardware. 
Many Git providers such as GitHub, Microsoft, GitLab, and Atlassian provide both cloud-based SaaS and on-premises (sometimes called \u201cself-managed\u201d) Git services. \nWhen choosing your Git provider during configuration, you must be aware of the differences between cloud (SaaS) and on-premises Git providers. On-premises solutions are typically hosted behind a company VPN and might not be accessible from the internet. Usually, the on-premises Git providers have a name ending in \u201cServer\u201d or \u201cSelf-Managed\u201d, but if you are uncertain, contact your company admins or review the Git provider\u2019s documentation. \nIf your Git provider is cloud-based and not listed as a supported provider, selecting \u201cGitHub\u201d as your provider may work but is not guaranteed. \nNote \nIf you are using \u201cGitHub\u201d as a provider and are still uncertain if you are using the cloud or on-premises version, see [About GitHub Enterprise Server](https:\/\/docs.github.com\/en\/github-ae@latest\/get-started\/using-github-docs\/about-versions-of-github-docs#github-enterprise-server) in the GitHub docs. \n### Cloud Git providers supported by Databricks \n* GitHub, GitHub AE, and GitHub Enterprise Cloud\n* Atlassian BitBucket Cloud\n* GitLab and GitLab EE\n* Microsoft Azure DevOps (Azure Repos) \n* AWS CodeCommit \n### On-premises Git providers supported by Databricks \n* GitHub Enterprise Server\n* Atlassian BitBucket Server and Data Center\n* GitLab Self-Managed\n* Microsoft Azure DevOps Server: A workspace admin must explicitly allowlist the URL domain prefixes for your Microsoft Azure DevOps Server if the URL does not match `dev.azure.com\/*` or `visualstudio.com\/*`. 
For more details, see [Restrict usage to URLs in an allow list](https:\/\/docs.databricks.com\/repos\/repos-setup.html#allow-lists) \nIf you are integrating an on-premises Git repo that is not accessible from the internet, a proxy for Git authentication requests must also be installed within your company\u2019s VPN. For more details, see [Set up private Git connectivity for Databricks Git folders (Repos)](https:\/\/docs.databricks.com\/repos\/git-proxy.html). \nTo learn how to use access tokens with your Git provider, see [Configure Git credentials & connect a remote repo to Databricks](https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/index.html"} +{"content":"# Databricks data engineering\n### Git integration with Databricks Git folders\n#### Resources for Git integration\n\nUse the Databricks CLI 2.0 for Git integration with Databricks: \n* [Download the latest CLI version](https:\/\/github.com\/databricks\/cli\/releases)\n* [Set up the CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/install.html) \nRead the following reference docs: \n* [Databricks CLI global flags](https:\/\/docs.databricks.com\/dev-tools\/cli\/commands.html#global-flags) and [commands](https:\/\/docs.databricks.com\/dev-tools\/cli\/commands.html)\n\n### Git integration with Databricks Git folders\n#### Next steps\n\n* [Set up Databricks Git folders (Repos)](https:\/\/docs.databricks.com\/repos\/repos-setup.html)\n* [Configure Git credentials & connect a remote repo to Databricks](https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/index.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to StreamSets\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
\nStreamSets helps you to manage and monitor your data flow throughout its lifecycle. StreamSets native integration with Databricks and Delta Lake allows you to pull data from various sources and manage your pipelines easily. \nFor a general demonstration of StreamSets, watch the following YouTube video (10 minutes). \nHere are the steps for using StreamSets with Databricks.\n\n#### Connect to StreamSets\n##### Step 1: Generate a Databricks personal access token\n\nStreamSets authenticates with Databricks using a Databricks personal access token. \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/streamsets.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to StreamSets\n##### Step 2: Set up a cluster to support integration needs\n\nStreamSets will write data to an S3 bucket and the Databricks integration cluster will read data from that location. Therefore the integration cluster requires secure access to the S3 bucket. \n### Secure access to an S3 bucket \nTo access AWS resources, you can launch the Databricks integration cluster with an instance profile. The instance profile should have access to the staging S3 bucket and the target S3 bucket where you want to write the Delta tables. 
To create an instance profile and configure the integration cluster to use the role, follow the instructions in [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html). \nAs an alternative, you can use [IAM credential passthrough](https:\/\/docs.databricks.com\/archive\/credential-passthrough\/iam-passthrough.html), which enables user-specific access to S3 data from a shared cluster. \n### Specify the cluster configuration \n1. Set **Cluster Mode** to **Standard**.\n2. Set **Databricks Runtime Version** to Runtime: 6.3 or above.\n3. Enable [optimized writes and auto compaction](https:\/\/docs.databricks.com\/delta\/tune-file-size.html) by adding the following properties to your [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration): \n```\nspark.databricks.delta.optimizeWrite.enabled true\nspark.databricks.delta.autoCompact.enabled true\n\n```\n4. Configure your cluster depending on your integration and scaling needs. \nFor cluster configuration details, see [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html). 
\nSee [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html) for the steps to obtain the JDBC URL and HTTP path.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/streamsets.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to StreamSets\n##### Step 3: Obtain JDBC and ODBC connection details to connect to a cluster\n\nTo connect a Databricks cluster to StreamSets you need the following JDBC\/ODBC connection properties: \n* JDBC URL\n* HTTP Path\n\n#### Connect to StreamSets\n##### Step 4: Get StreamSets for Databricks\n\n[Sign up](https:\/\/cloud.login.streamsets.com\/signup) for [StreamSets for Databricks](https:\/\/streamsets.com\/solutions\/streamsets-for-databricks\/), if you do not already have a StreamSets account. You can get started for free and upgrade when you\u2019re ready; see [StreamSets DataOps Platform Pricing](https:\/\/streamsets.com\/pricing\/).\n\n#### Connect to StreamSets\n##### Step 5: Learn how to use StreamSets to load data into Delta Lake\n\nStart with a sample pipeline or check out [StreamSets solutions](https:\/\/streamsets.com\/documentation\/datacollector\/latest\/help\/index.html?contextID=concept_a5b_wvk_ckb) to learn how to build a pipeline that ingests data into Delta Lake.\n\n#### Connect to StreamSets\n##### Additional resources\n\n[Support](https:\/\/streamsets.com\/support\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/streamsets.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Set up and manage Unity Catalog\n\nThis article explains how to configure and use Unity Catalog to manage data in your Databricks workspace. It is intended primarily for workspace admins who are using Unity Catalog for the first time. 
\nBy the end of this article you will have: \n* A workspace that is enabled for Unity Catalog.\n* Compute that has access to Unity Catalog.\n* Users with permission to access and create objects in Unity Catalog. \nYou may also want to review other introductory articles: \n* For a quick walkthrough of how to create a table and grant permissions in Unity Catalog, see [Tutorial: Create your first table and grant privileges](https:\/\/docs.databricks.com\/getting-started\/create-table.html).\n* For key Unity Catalog concepts and an introduction to how Unity Catalog works, see [What is Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html).\n* To learn how best to use Unity Catalog to meet your data governance needs, see [Unity Catalog best practices](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html). \nNote \nIf you want to upgrade an existing non-Unity-Catalog workspace to Unity Catalog, you might benefit from using [UCX](https:\/\/github.com\/databrickslabs\/ucx), a Databricks Labs project that provides a set of workflows and utilities for upgrading identities, permissions, and tables to Unity Catalog. See [Use the UCX utilities to upgrade your workspace to Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/ucx.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Set up and manage Unity Catalog\n##### Overview of Unity Catalog enablement\n\nTo use Unity Catalog, your Databricks workspaces must be enabled for Unity Catalog, which means that the workspaces are attached to a Unity Catalog metastore, the top-level container for Unity Catalog metadata. \nThe way admins set up Unity Catalog depends on whether the workspace was enabled automatically for Unity Catalog or requires manual enablement. 
\n### Automatic enablement of Unity Catalog \nDatabricks began to enable new workspaces for Unity Catalog automatically on November 8, 2023, with a rollout proceeding gradually across accounts. Workspaces that were enabled automatically have the following properties: \n* An automatically-provisioned Unity Catalog metastore (unless a Unity Catalog metastore already existed for the workspace region).\n* Default privileges for workspace admins, such as the ability to create a catalog or an external database connection.\n* No metastore admin (unless an existing Unity Catalog metastore was used and a metastore admin was already assigned).\n* No metastore-level storage for managed tables and managed volumes (unless an existing Unity Catalog metastore with metastore-level storage was used).\n* A *workspace catalog*, which, when originally provisioned, is named after your workspace. \nAll users in your workspace can create assets in the `default` schema in this catalog. By default, this catalog is *bound* to your workspace, which means that it can only be accessed through your workspace. Automatic provisioning of the workspace catalog at workspace creation is rolling out gradually across accounts. \nYour workspace gets the workspace catalog only if the workspace creator provided an appropriate IAM role and storage location during workspace provisioning. If you don\u2019t have such a catalog, you can create a catalog like it by following the instructions in [Create and manage catalogs](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html). \nThese default configurations will work well for most workspaces, but they can all be modified by a workspace admin or account admin. For example, an account admin can assign a metastore admin and create metastore-level storage, and a workspace admin can modify the workspace catalog name and access. \n### What if my workspace wasn\u2019t enabled for Unity Catalog automatically? 
\nIf your workspace was not enabled for Unity Catalog automatically, an account admin or metastore admin must manually attach the workspace to a Unity Catalog metastore in the same region. If no Unity Catalog metastore exists in the region, an account admin must create one. For instructions, see [Create a Unity Catalog metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html). \n### How do I know if my workspace was enabled for Unity Catalog? \nTo confirm if your workspace is enabled for Unity Catalog, ask a Databricks workspace admin or account admin to check for you. See also [Step 1: Confirm that your workspace is enabled for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#workspace). \n### How do I know if my workspace includes a *workspace catalog* ? \nSome new workspaces have a *workspace catalog*, which, when originally provisioned, is named after your workspace. To determine if your workspace has one, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** in the sidebar to open Catalog Explorer, and search for a catalog that uses your workspace name as the catalog name. \nNote \nThe workspace catalog is like any other catalog in Unity Catalog: a workspace admin can change its name, change its ownership, or even delete it. However, immediately after the workspace is created, it bears the workspace name.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Set up and manage Unity Catalog\n##### Before you begin\n\nBefore you begin the tasks described in this article, you should familiarize yourself with the basic Unity Catalog concepts, including metastores, admin roles, and managed storage. See [What is Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). 
\nYou should also confirm that you meet the following requirements: \n* A Databricks workspace on the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n* The following roles and privileges, which depend on the status of your workspace: \n+ Workspace admin: If your workspace was enabled for Unity Catalog automatically when it was created, you must be a workspace admin to complete the required tasks.\n+ Account admin: If your workspace is not already enabled for Unity Catalog, an account admin must attach the workspace to the metastore. \nIf there is no Unity Catalog metastore in the same region as the workspace, an account admin must also create the Unity Catalog metastore. \nInstructions for determining whether a metastore exists for your workspace region, along with instructions for creating a metastore, follow in this article. See [Admin privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html) and [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Set up and manage Unity Catalog\n##### Step 1: Confirm that your workspace is enabled for Unity Catalog\n\nIn this step, you determine whether your workspace is already enabled for Unity Catalog, where enablement is defined as having a Unity Catalog metastore attached to the workspace. If your workspace is not enabled for Unity Catalog, you must enable your workspace for Unity Catalog manually. See [Next steps if your workspace is not enabled for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enable-manually). \nTo confirm, do one of the following. 
\n### Use the account console to confirm Unity Catalog enablement \n1. As a Databricks account admin, log into the account console.\n2. Click ![Workspaces Icon](https:\/\/docs.databricks.com\/_images\/workspaces-icon-account.png) **Workspaces**.\n3. Find your workspace and check the **Metastore** column. If a metastore name is present, your workspace is attached to a Unity Catalog metastore and therefore enabled for Unity Catalog. \n### Run a SQL query to confirm Unity Catalog enablement \nRun the following SQL query in the SQL query editor or a notebook that is attached to a cluster that uses *shared* or *single-user* access mode. See [Access modes](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode). No admin role is required. \n```\nSELECT CURRENT_METASTORE();\n\n``` \nIf the query returns a metastore ID like the following, then your workspace is attached to a Unity Catalog metastore and therefore enabled for Unity Catalog. \n![Current metastore output](https:\/\/docs.databricks.com\/_images\/current-metastore-output-aws.png) \n### Next steps if your workspace is not enabled for Unity Catalog \nIf your workspace is not enabled for Unity Catalog (attached to a metastore), the next step depends on whether or not you already have a Unity Catalog metastore defined for your workspace region: \n* If your account already has a Unity Catalog metastore defined for your workspace region, you can simply attach your workspace to the existing metastore. Go to [Enable your workspace for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html#enable-workspace).\n* If there is no Unity Catalog metastore defined for your workspace\u2019s region, you must create a metastore and then attach the workspace. Go to [Create a Unity Catalog metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html). 
\nWhen your workspace is enabled for Unity Catalog, go to the next step.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Set up and manage Unity Catalog\n##### Step 2: Add users and assign the workspace admin role\n\nThe user who creates the workspace is automatically added as a workspace user with the workspace admin role (that is, a user in the `admins` workspace-local group). As a workspace admin, you can add and invite users to the workspace, can assign the workspace admin role to other users, and can create service principals and groups. \nAccount admins also have the ability to add users, service principals, and groups to your workspace. They can grant the account admin and metastore admin roles. \nFor details, see [Manage users](https:\/\/docs.databricks.com\/admin\/users-groups\/users.html). \n### (Recommended) Sync account-level identities from your IdP \nIt can be convenient to manage user access to Databricks by setting up provisioning from a third-party identity provider (IdP), like Okta. For complete instructions, see [Sync users and groups from your identity provider](https:\/\/docs.databricks.com\/admin\/users-groups\/scim\/index.html).\n\n#### Set up and manage Unity Catalog\n##### Step 3: Create clusters or SQL warehouses that users can use to run queries and create objects\n\nTo run Unity Catalog workloads, compute resources must comply with certain security requirements. Non-compliant compute resources cannot access data or other objects in Unity Catalog. SQL warehouses always comply with Unity Catalog requirements, but some cluster access modes do not. See [Access modes](https:\/\/docs.databricks.com\/compute\/configure.html#access-modes). \nAs a workspace admin, you can opt to make compute creation restricted to admins or let users create their own SQL warehouses and clusters. 
You can also create cluster policies that enable users to create their own clusters, using Unity Catalog-compliant specifications that you enforce. See [Compute permissions](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions) and [Create and manage compute policies](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Set up and manage Unity Catalog\n##### Step 4: Grant privileges to users\n\nTo create objects and access them in Unity Catalog catalogs and schemas, a user must have permission to do so. This section describes the user and admin privileges granted on some workspaces by default and describes how to grant additional privileges. \n### Default user privileges \nSome workspaces have default user (non-admin) privileges upon launch: \n* If your workspace launched with an automatically-provisioned [workspace catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#workspace-catalog), all workspace users can create objects in the workspace catalog\u2019s `default` schema. \nTo learn how to determine whether your workspace has a workspace catalog, see [How do I know if my workspace includes a workspace catalog ?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#workspace-catalog).\n* If your workspace was enabled for Unity Catalog manually, it has a `main` catalog provisioned automatically. \nWorkspace users have the `USE CATALOG` privilege on the `main` catalog, which doesn\u2019t grant the ability to create or select from any objects in the catalog, but is a prerequisite for working with any objects in the catalog. The user who created the metastore owns the `main` catalog by default and can both transfer ownership and grant access to other users. 
\nIf metastore storage is added after the metastore is created, no `main` catalog is provisioned. \nOther workspaces have no catalogs created by default and no non-admin user privileges enabled by default. A workspace admin must create the first catalog and grant users access to it and the objects in it. Skip ahead to [Step 5: Create new catalogs and schemas](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#create-catalogs) before you complete the steps in this section. \n### Default admin privileges \nSome workspaces have default workspace admin privileges upon launch: \n* If your workspace was enabled for Unity Catalog automatically: \n+ Workspace admins can create new catalogs and objects in new catalogs, and grant access to them.\n+ There is no metastore admin by default.\n+ Workspace admins own the workspace catalog (if there is one) and can grant access to that catalog and any objects in that catalog.\n* If your workspace was enabled for Unity Catalog manually: \n+ Workspace admins have no special Unity Catalog privileges by default.\n+ Metastore admins must exist and can create any Unity Catalog object and can take ownership of any Unity Catalog object. \nFor a list of additional object privileges granted to workspace admins in automatically-enabled Unity Catalog workspaces, see [Workspace admin privileges when workspaces are enabled for Unity Catalog automatically](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html#workspace-admins-auto). \n### Grant privileges \nFor access to objects other than those listed in the previous sections, a privileged user must grant that access. 
\nFor example, to grant a group the ability to create new schemas in `my-catalog`, the catalog owner can run the following in the SQL Editor or a notebook: \n```\nGRANT CREATE SCHEMA ON CATALOG `my-catalog` TO `data-consumers`;\n\n``` \nIf your workspace was enabled for Unity Catalog automatically, the workspace admin owns the workspace catalog and can grant the ability to create new schemas: \n```\nGRANT CREATE SCHEMA ON CATALOG <workspace-catalog> TO `data-consumers`;\n\n``` \nYou can also grant and revoke privileges using Catalog Explorer. \nImportant \nYou cannot grant privileges to the workspace-local `users` or `admins` groups. To grant privileges to groups, they must be account-level groups. \nFor details about managing privileges in Unity Catalog, see [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Set up and manage Unity Catalog\n##### Step 5: Create new catalogs and schemas\n\nTo start using Unity Catalog, you must have at least one catalog defined. Catalogs are the primary unit of data isolation and organization in Unity Catalog. All schemas and tables live in catalogs, as do volumes, views, and models. \nSome workspaces have no automatically-provisioned catalog. To use Unity Catalog, a workspace admin must create the first catalog for such workspaces. \nOther workspaces have access to a pre-provisioned catalog that your users can access to get started (either the workspace catalog or the `main` catalog, depending on how your workspace was enabled for Unity Catalog). As you add more data and AI assets into Databricks, you can create additional catalogs to group those assets in a way that makes it easy to govern data logically. 
\nFor recommendations about how best to use catalogs and schemas to organize your data and AI assets, see [Unity Catalog best practices](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html). \nAs a metastore admin, workspace admin (auto-enabled workspaces only), or other user with the `CREATE CATALOG` privilege, you can create new catalogs in the metastore. When you do, you should: \n1. Create *managed storage* for the new catalog. \nManaged storage is a dedicated storage location in your AWS account for managed tables and managed volumes. You can assign managed storage to the metastore, to catalogs, and to schemas. When a user creates a table, the data is stored in the storage location that is lowest in the hierarchy. For example, if a storage location is defined for the metastore and catalog but not the schema, the data is stored in the location defined for the catalog. \nDatabricks recommends that you assign managed storage at the catalog level, because catalogs typically represent logical units of data isolation. If you are comfortable with data in multiple catalogs sharing the same storage location, you can default to the metastore-level storage location. If your workspace was enabled for Unity Catalog automatically, there is no metastore-level storage by default. An account admin has the option to configure metastore-level storage. See [Managed storage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#managed-storage) and [Add managed storage to an existing metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html#add-storage). 
\nAssigning managed storage to a catalog requires that you create: \n* A *storage credential*.\n* An *external location* that references that storage credential. For an introduction to these objects and instructions for creating them, see [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html).\n2. Bind the new catalog to your workspace if you want to limit access from other workspaces that share the same metastore. \nSee [Bind a catalog to one or more workspaces](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#bind).\n3. Grant privileges on the catalog. \nFor detailed instructions, see [Create and manage catalogs](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html). \n### Catalog creation example \nThe following example shows the creation of a catalog with managed storage, followed by granting the `SELECT` privilege on the catalog: \n```\nCREATE CATALOG IF NOT EXISTS mycatalog\nMANAGED LOCATION 's3:\/\/depts\/finance';\n\nGRANT SELECT ON CATALOG mycatalog TO `finance-team`;\n\n``` \nFor more examples, including instructions for creating catalogs using Catalog Explorer, see [Create and manage catalogs](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html). \n### Create a schema \nSchemas represent more granular groupings (like departments or projects, for example) than catalogs. All tables and other Unity Catalog objects in the catalog are contained in schemas. As the owner of a new catalog, you may want to create the schemas in the catalog. But you might want instead to delegate the ability to create schemas to other users, by giving them the `CREATE SCHEMA` privilege on the catalog. 
\nFor detailed instructions, see [Create and manage schemas (databases)](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-schemas.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Set up and manage Unity Catalog\n##### (Optional) Assign the metastore admin role\n\nIf your workspace was enabled for Unity Catalog automatically, no metastore admin role is assigned by default. Metastore admins have some privileges that workspace admins don\u2019t. \nYou might want to assign a metastore admin if you need to: \n* [Change ownership](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/ownership.html) of catalogs after someone leaves the company.\n* Manage and delegate permissions on the [init script and jar allowlist](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/allowlist.html).\n* [Delegate the ability to create catalogs and other top-level permissions](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html) to non-workspace admins.\n* Receive shared data through [Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html#permissions).\n* Remove [default workspace admin permissions](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html#workspace-admins-auto).\n* Add managed storage to the metastore, if it has none. See [Add managed storage to an existing metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html#add-storage). 
\nFor detailed information about the metastore admin role and instructions for assigning it, see [Assign a metastore admin](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html#assign-metastore-admin).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Set up and manage Unity Catalog\n##### Upgrade tables in your Hive metastore to Unity Catalog tables\n\nIf your workspace was in service before it was enabled for Unity Catalog, it likely has a Hive metastore that contains data that you want to continue to use. Databricks recommends that you migrate the tables managed by the Hive metastore to the Unity Catalog metastore. \nSee [Upgrade Hive tables and views to Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html) and [Use the UCX utilities to upgrade your workspace to Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/ucx.html).\n\n#### Set up and manage Unity Catalog\n##### (Optional) Keep working with your Hive metastore\n\nIf your workspace has a Hive metastore that contains data that you want to continue to use, and you choose not to follow the recommendation to [upgrade the tables managed by the Hive metastore to the Unity Catalog metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html), you can continue to work with data in the Hive metastore alongside data in the Unity Catalog metastore. \nThe Hive metastore is represented in Unity Catalog interfaces as a catalog named `hive_metastore`. In order to continue working with data in your Hive metastore without having to update queries to specify the `hive_metastore` catalog, you can set the workspace\u2019s default catalog to `hive_metastore`. 
See [Manage the default catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#default). \nDepending on when your workspace was enabled for Unity Catalog, the default catalog may already be `hive_metastore`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Set up and manage Unity Catalog\n##### (Optional) Create metastore-level storage\n\nAlthough Databricks recommends that you create a separate managed storage location for each catalog in your metastore (and you can do the same for schemas), you can opt instead to create a managed location at the metastore level and use it as the storage for multiple catalogs and schemas. \nIf you want metastore-level storage, you must also assign a metastore admin. See [(Optional) Assign the metastore admin role](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#metastore-admin). \nMetastore-level storage is **required** only if the following are true: \n* You want to share notebooks using [Databricks-to-Databricks Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html).\n* You use a Databricks partner product integration that relies on personal staging locations (deprecated). \nFor more information about the hierarchy of managed storage locations, see [Data is physically separated in storage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html#physically-separate). \nTo learn how to add metastore-level storage to metastores that have none, see [Add managed storage to an existing metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html#add-storage). 
\nNote \nMost workspaces that were enabled for Unity Catalog before November 8, 2023 have a metastore-level storage root.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Set up and manage Unity Catalog\n##### Next steps\n\n* Run a quick tutorial to create your first table in Unity Catalog: [Tutorial: Create your first table and grant privileges](https:\/\/docs.databricks.com\/getting-started\/create-table.html)\n* Learn more about Unity Catalog: [What is Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html)\n* Learn best practices for using Unity Catalog: [Unity Catalog best practices](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html)\n* Learn how to grant and revoke privileges: [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html)\n* [Learn how to create tables](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html)\n* [Learn how to upgrade Hive tables to Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html)\n* Install the Unity Catalog CLI: [Databricks CLI (legacy)](https:\/\/docs.databricks.com\/archive\/dev-tools\/cli\/index.html) and [Unity Catalog CLI (legacy)](https:\/\/docs.databricks.com\/archive\/dev-tools\/cli\/unity-catalog-cli.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html"} +{"content":"# Technology partners\n## Connect to data governance partners using Partner Connect\n#### Connect to erwin Data Modeler by Quest\n\nerwin Data Modeler by Quest allows you to find, visualize, design, deploy, and standardize high-quality enterprise data assets. 
\nYou can connect erwin Data Modeler to a Databricks cluster or SQL warehouse (formerly SQL endpoint).\n\n#### Connect to erwin Data Modeler by Quest\n##### Requirements\n\nBefore you connect to erwin Data Modeler, you need the following: \n* erwin Data Modeler version 12.1 SP1 or above. See the [erwin Data Modeler installation guide](https:\/\/bookshelf.erwin.com\/bookshelf\/public_html\/12.1\/Content\/Installation\/Installation\/erwin_Install\/Installation%20Guide.html) on the erwin website.\n* A Databricks personal access token. \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/data-governance\/erwin.html"} +{"content":"# Technology partners\n## Connect to data governance partners using Partner Connect\n#### Connect to erwin Data Modeler by Quest\n##### Connect to erwin Data Modeler using Partner Connect\n\nTo connect to erwin Data Modeler using Partner Connect, do the following: \n1. In the sidebar, click ![Partner Connect button](https:\/\/docs.databricks.com\/_images\/partner-connect.png) **Partner Connect**.\n2. Click the **erwin** tile.\n3. In the **Connect to partner** dialog, for **Compute**, select the Databricks compute resource that you want to use.\n4. Click **Download connection file**.\n5. Open the downloaded connection file, which starts erwin Data Modeler.\n6. 
On the **Connection** page of the **Reverse Engineering Wizard**, enter your authentication credentials: \nFor **User Name**, enter `token`. For **Password**, enter the personal access token from the requirements.\n7. Click **Connect**.\n8. Starting with step 7, follow [Select the Reverse Engineering Options](https:\/\/bookshelf.erwin.com\/bookshelf\/public_html\/12.5\/Content\/User%20Guides\/erwin%20Help\/set_options_for_reverse_engineer.html) in the erwin Data Modeler documentation to create a model from your Databricks data.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/data-governance\/erwin.html"} +{"content":"# Technology partners\n## Connect to data governance partners using Partner Connect\n#### Connect to erwin Data Modeler by Quest\n##### Connect to erwin Data Modeler manually\n\nTo connect to erwin Data Modeler manually and create a model from your Databricks data, follow [Select the Reverse Engineering Options](https:\/\/bookshelf.erwin.com\/bookshelf\/public_html\/12.5\/Content\/User%20Guides\/erwin%20Help\/set_options_for_reverse_engineer.html) in the erwin Data Modeler documentation. 
\nIn the **Connection** page of the **Reverse Engineering Wizard**, specify the following: \n* The values described in the *Connection* section of [Reverse Engineering Options for Databricks](https:\/\/bookshelf.erwin.com\/bookshelf\/public_html\/12.5\/Content\/User%20Guides\/erwin%20Help\/Reverse_Engineering_Options_Databricks.html).\n* The personal access token from the requirements.\n\n#### Connect to erwin Data Modeler by Quest\n##### Additional resources\n\n* [erwin Data Modeler](https:\/\/www.erwin.com\/products\/erwin-data-modeler\/)\n* [erwin Data Modeler documentation](https:\/\/bookshelf.erwin.com\/bookshelf\/public_html\/12.5\/Content\/User%20Guides\/erwin%20Help\/Online%20Help.html)\n* [erwin Data Modeler Support](https:\/\/support.quest.com\/erwin-data-modeler\/12.1)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/data-governance\/erwin.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Many small Spark jobs\n\nIf you see many small jobs, it\u2019s likely you\u2019re doing many operations on relatively small data (<10GB). Small operations only take a few seconds each, but they add up, and the time spent in overhead per operation also adds up. \nThe best approach to speeding up small jobs is to run multiple operations in parallel. [Delta Live Tables](https:\/\/www.databricks.com\/product\/delta-live-tables) do this for you automatically. 
\nOther options include: \n* Separate your operations into multiple notebooks and run them in parallel on the same cluster by using [multi-task jobs](https:\/\/www.databricks.com\/blog\/2021\/07\/13\/announcement-orchestrating-multiple-tasks-with-databricks-jobs-public-preview.html).\n* Use Python\u2019s [ThreadPoolExecutor](https:\/\/docs.python.org\/3\/library\/concurrent.futures.html) or another multi-threading approach to run queries in parallel.\n* Use [SQL warehouses](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html) if all your queries are written in SQL. SQL warehouses scale very well for many queries run in parallel as they were designed for this type of workload.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/small-spark-jobs.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Auto Loader options\n\nConfiguration options specific to the `cloudFiles` source are prefixed with `cloudFiles` so that they are in a separate namespace from other Structured Streaming source options. 
\n* [Common Auto Loader options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#common-auto-loader-options)\n* [Directory listing options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#directory-listing-options)\n* [File notification options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#file-notification-options)\n* [File format options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#file-format-options) \n+ [Generic options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#generic-options)\n+ [`JSON` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#json-options)\n+ [`CSV` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#csv-options)\n+ [`XML` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#xml-options)\n+ [`PARQUET` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#parquet-options)\n+ [`AVRO` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#avro-options)\n+ [`BINARYFILE` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#binaryfile-options)\n+ [`TEXT` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#text-options)\n+ [`ORC` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#orc-options)\n* [Cloud specific options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#cloud-specific-options) \n+ [AWS specific options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#aws-specific-options)\n+ [Azure specific options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#azure-specific-options)\n+ [Google specific options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#google-specific-options)\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html"} 
+{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Auto Loader options\n##### [Common Auto Loader options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id3)\n\nYou can configure the following options for directory listing or file notification mode. \n| Option |\n| --- |\n| `cloudFiles.allowOverwrites` Type: `Boolean` Whether to allow input directory file changes to overwrite existing data. There are a few caveats regarding enabling this config. Please refer to [Auto Loader FAQ](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/faq.html#process-file-behavior) for details. Default value: `false` |\n| `cloudFiles.backfillInterval` Type: `Interval String` Auto Loader can trigger asynchronous backfills at a given interval, e.g. `1 day` to backfill once a day, or `1 week` to backfill once a week. File event notification systems do not guarantee 100% delivery of all files that have been uploaded, so you can use backfills to guarantee that all files eventually get processed. Available in [Databricks Runtime 8.4 (unsupported)](https:\/\/docs.databricks.com\/archive\/runtime-release-notes\/8.4.html) and above. Default value: None |\n| `cloudFiles.format` Type: `String` The [data file format](https:\/\/docs.databricks.com\/structured-streaming\/index.html) in the source path. 
Allowed values include:* `avro`: [Avro file](https:\/\/docs.databricks.com\/query\/formats\/avro.html) * `binaryFile`: [Binary file](https:\/\/docs.databricks.com\/query\/formats\/binary.html) * `csv`: [Read and write to CSV files](https:\/\/docs.databricks.com\/query\/formats\/csv.html) * `json`: [JSON file](https:\/\/docs.databricks.com\/query\/formats\/json.html) * `orc`: ORC file * `parquet`: [Read Parquet files using Databricks](https:\/\/docs.databricks.com\/query\/formats\/parquet.html) * `text`: Text file Default value: None (required option) |\n| `cloudFiles.includeExistingFiles` Type: `Boolean` Whether to include existing files in the stream processing input path or to only process new files arriving after initial setup. This option is evaluated only when you start a stream for the first time. Changing this option after restarting the stream has no effect. Default value: `true` |\n| `cloudFiles.inferColumnTypes` Type: `Boolean` Whether to infer exact column types when leveraging schema inference. By default, columns are inferred as strings when inferring JSON and CSV datasets. See [schema inference](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html) for more details. Default value: `false` |\n| `cloudFiles.maxBytesPerTrigger` Type: `Byte String` The maximum number of new bytes to be processed in every trigger. You can specify a byte string such as `10g` to limit each microbatch to 10 GB of data. This is a soft maximum. If you have files that are 3 GB each, Databricks processes 12 GB in a microbatch. When used together with `cloudFiles.maxFilesPerTrigger`, Databricks consumes up to the lower limit of `cloudFiles.maxFilesPerTrigger` or `cloudFiles.maxBytesPerTrigger`, whichever is reached first. This option has no effect when used with `Trigger.Once()` (deprecated). Default value: None |\n| `cloudFiles.maxFileAge` Type: `Interval String` How long a file event is tracked for deduplication purposes. 
Databricks does not recommend tuning this parameter unless you are ingesting data on the order of millions of files an hour. See the section on [Event retention](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/production.html#max-file-age) for more details. Tuning `cloudFiles.maxFileAge` too aggressively can cause data quality issues such as duplicate ingestion or missing files. Therefore, Databricks recommends a conservative setting for `cloudFiles.maxFileAge`, such as 90 days, which is similar to what comparable data ingestion solutions recommend. Default value: None |\n| `cloudFiles.maxFilesPerTrigger` Type: `Integer` The maximum number of new files to be processed in every trigger. When used together with `cloudFiles.maxBytesPerTrigger`, Databricks consumes up to the lower limit of `cloudFiles.maxFilesPerTrigger` or `cloudFiles.maxBytesPerTrigger`, whichever is reached first. This option has no effect when used with `Trigger.Once()` (deprecated). Default value: 1000 |\n| `cloudFiles.partitionColumns` Type: `String` A comma-separated list of Hive-style partition columns that you would like inferred from the directory structure of the files. Hive-style partition columns are key-value pairs combined by an equality sign such as `<base-path>\/a=x\/b=1\/c=y\/file.format`. In this example, the partition columns are `a`, `b`, and `c`. By default these columns will be automatically added to your schema if you are using schema inference and provide the `<base-path>` to load data from. If you provide a schema, Auto Loader expects these columns to be included in the schema. If you do not want these columns as part of your schema, you can specify `\"\"` to ignore these columns. 
In addition, you can use this option when you want columns to be inferred from the file path in complex directory structures, like the example below: `<base-path>\/year=2022\/week=1\/file1.csv` `<base-path>\/year=2022\/month=2\/day=3\/file2.csv` `<base-path>\/year=2022\/month=2\/day=4\/file3.csv` Specifying `cloudFiles.partitionColumns` as `year,month,day` will return `year=2022` for `file1.csv`, but the `month` and `day` columns will be `null`. `month` and `day` will be parsed correctly for `file2.csv` and `file3.csv`. Default value: None |\n| `cloudFiles.schemaEvolutionMode` Type: `String` The mode for evolving the schema as new columns are discovered in the data. By default, columns are inferred as strings when inferring JSON datasets. See [schema evolution](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html) for more details. Default value: `\"addNewColumns\"` when a schema is not provided. `\"none\"` otherwise. |\n| `cloudFiles.schemaHints` Type: `String` Schema information that you provide to Auto Loader during schema inference. See [schema hints](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html) for more details. Default value: None |\n| `cloudFiles.schemaLocation` Type: `String` The location to store the inferred schema and subsequent changes. See [schema inference](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html) for more details. Default value: None (required when inferring the schema) |\n| `cloudFiles.useStrictGlobber` Type: `Boolean` Whether to use a strict globber that matches the default globbing behavior of other file sources in Apache Spark. See [Common data loading patterns](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/patterns.html) for more details. Available in Databricks Runtime 12.2 LTS and above. Default value: `false` |\n| `cloudFiles.validateOptions` Type: `Boolean` Whether to validate Auto Loader options and return an error for unknown or inconsistent options. 
Default value: `true` |\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Auto Loader options\n##### [Directory listing options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id4)\n\nThe following options are relevant to directory listing mode. \n| Option |\n| --- |\n| `cloudFiles.useIncrementalListing` (deprecated) Type: `String` This feature has been deprecated. Databricks recommends using [file notification mode](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html) instead of `cloudFiles.useIncrementalListing`. Whether to use the incremental listing rather than the full listing in directory listing mode. By default, Auto Loader makes the best effort to automatically detect if a given directory is applicable for the incremental listing. You can explicitly use the incremental listing or use the full directory listing by setting it as `true` or `false` respectively. Incorrectly enabling incremental listing on a non-lexically ordered directory prevents Auto Loader from discovering new files. Works with Azure Data Lake Storage Gen2 (`abfss:\/\/`), S3 (`s3:\/\/`), and GCS (`gs:\/\/`). Available in [Databricks Runtime 9.1 LTS](https:\/\/docs.databricks.com\/release-notes\/runtime\/9.1lts.html) and above. Default value: `auto` Available values: `auto`, `true`, `false` |\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Auto Loader options\n##### [File notification options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id5)\n\nThe following options are relevant to file notification mode. \n| Option |\n| --- |\n| `cloudFiles.fetchParallelism` Type: `Integer` Number of threads to use when fetching messages from the queueing service. 
Default value: 1 |\n| `cloudFiles.pathRewrites` Type: A JSON string Required only if you specify a `queueUrl` that receives file notifications from multiple S3 buckets and you want to leverage mount points configured for accessing data in these containers. Use this option to rewrite the prefix of the `bucket\/key` path with the mount point. Only prefixes can be rewritten. For example, for the configuration `{\"<databricks-mounted-bucket>\/path\": \"dbfs:\/mnt\/data-warehouse\"}`, the path `s3:\/\/<databricks-mounted-bucket>\/path\/2017\/08\/fileA.json` is rewritten to `dbfs:\/mnt\/data-warehouse\/2017\/08\/fileA.json`. Default value: None |\n| `cloudFiles.resourceTag` Type: `Map(String, String)` A series of key-value tag pairs to help associate and identify related resources, for example: `cloudFiles.option(\"cloudFiles.resourceTag.myFirstKey\", \"myFirstValue\")` `.option(\"cloudFiles.resourceTag.mySecondKey\", \"mySecondValue\")` For more information on AWS, see [Amazon SQS cost allocation tags](https:\/\/docs.aws.amazon.com\/AWSSimpleQueueService\/latest\/SQSDeveloperGuide\/sqs-queue-tags.html) and [Configuring tags for an Amazon SNS topic](https:\/\/docs.aws.amazon.com\/sns\/latest\/dg\/sns-tags.html). [(1)](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#f1) For more information on Azure, see [Naming Queues and Metadata](https:\/\/learn.microsoft.com\/rest\/api\/storageservices\/naming-queues-and-metadata) and the coverage of `properties.labels` in [Event Subscriptions](https:\/\/learn.microsoft.com\/rest\/api\/eventgrid\/controlplane\/event-subscriptions\/create-or-update). Auto Loader stores these key-value tag pairs in JSON as labels. [(1)](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#f1) For more information on GCP, see [Reporting usage with labels](https:\/\/cloud.google.com\/pubsub\/docs\/labels). 
[(1)](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#f1) Default value: None |\n| `cloudFiles.useNotifications` Type: `Boolean` Whether to use file notification mode to determine when there are new files. If `false`, use directory listing mode. See [How Auto Loader works](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-detection-modes.html). Default value: `false` | \n**(1)** Auto Loader adds the following key-value tag pairs by default on a best-effort basis: \n* `vendor`: `Databricks`\n* `path`: The location from where the data is loaded. Unavailable in GCP due to labeling limitations.\n* `checkpointLocation`: The location of the stream\u2019s checkpoint. Unavailable in GCP due to labeling limitations.\n* `streamId`: A globally unique identifier for the stream. \nThese key names are reserved and you cannot overwrite their values.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Auto Loader options\n##### [File format options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id6)\n\nWith Auto Loader you can ingest `JSON`, `CSV`, `PARQUET`, `AVRO`, `TEXT`, `BINARYFILE`, and `ORC` files. 
\n* [Generic options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#generic-options)\n* [`JSON` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#json-options)\n* [`CSV` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#csv-options)\n* [`XML` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#xml-options)\n* [`PARQUET` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#parquet-options)\n* [`AVRO` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#avro-options)\n* [`BINARYFILE` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#binaryfile-options)\n* [`TEXT` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#text-options)\n* [`ORC` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#orc-options) \n### [Generic options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id20) \nThe following options apply to all file formats. \n| Option |\n| --- |\n| **`ignoreCorruptFiles`** Type: `Boolean` Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned. Observable as `numSkippedCorruptFiles` in the `operationMetrics` column of the Delta Lake history. Available in Databricks Runtime 11.3 LTS and above. Default value: `false` |\n| **`ignoreMissingFiles`** Type: `Boolean` Whether to ignore missing files. If true, the Spark jobs will continue to run when encountering missing files and the contents that have been read will still be returned. Available in Databricks Runtime 11.3 LTS and above. 
Default value: `false` (`true` for `COPY INTO`) |\n| **`modifiedAfter`** Type: `Timestamp String`, for example, `2021-01-01 00:00:00.000000 UTC+0` An optional timestamp to ingest files that have a modification timestamp after the provided timestamp. Default value: None |\n| **`modifiedBefore`** Type: `Timestamp String`, for example, `2021-01-01 00:00:00.000000 UTC+0` An optional timestamp to ingest files that have a modification timestamp before the provided timestamp. Default value: None |\n| **`pathGlobFilter`** or **`fileNamePattern`** Type: `String` A potential glob pattern to provide for choosing files. Equivalent to `PATTERN` in `COPY INTO`. `fileNamePattern` can be used in `read_files`. Default value: None |\n| **`recursiveFileLookup`** Type: `Boolean` Whether to skip partition inference during schema inference. This does not affect which files are loaded. Default value: `false` | \n### [`JSON` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id21) \n| Option |\n| --- |\n| **`allowBackslashEscapingAnyCharacter`** Type: `Boolean` Whether to allow backslashes to escape any character that succeeds it. If not enabled, only characters that are explicitly listed by the JSON specification can be escaped. Default value: `false` |\n| **`allowComments`** Type: `Boolean` Whether to allow the use of Java, C, and C++ style comments (`'\/'`, `'*'`, and `'\/\/'` varieties) within parsed content or not. Default value: `false` |\n| **`allowNonNumericNumbers`** Type: `Boolean` Whether to allow the set of not-a-number (`NaN`) tokens as legal floating number values. Default value: `true` |\n| **`allowNumericLeadingZeros`** Type: `Boolean` Whether to allow integral numbers to start with additional (ignorable) zeroes (for example, `000001`). Default value: `false` |\n| **`allowSingleQuotes`** Type: `Boolean` Whether to allow use of single quotes (apostrophe, character `'\\'`) for quoting strings (names and String values). 
Default value: `true` |\n| **`allowUnquotedControlChars`** Type: `Boolean` Whether to allow JSON strings to contain unescaped control characters (ASCII characters with value less than 32, including tab and line feed characters) or not. Default value: `false` |\n| **`allowUnquotedFieldNames`** Type: `Boolean` Whether to allow use of unquoted field names (which are allowed by JavaScript, but not by the JSON specification). Default value: `false` |\n| **`badRecordsPath`** Type: `String` The path to store files for recording the information about bad JSON records. Default value: None |\n| **`columnNameOfCorruptRecord`** Type: `String` The column for storing records that are malformed and cannot be parsed. If the `mode` for parsing is set as `DROPMALFORMED`, this column will be empty. Default value: `_corrupt_record` |\n| **`dateFormat`** Type: `String` The format for parsing date strings. Default value: `yyyy-MM-dd` |\n| **`dropFieldIfAllNull`** Type: `Boolean` Whether to ignore columns of all null values or empty arrays and structs during schema inference. Default value: `false` |\n| **`encoding`** or **`charset`** Type: `String` The name of the encoding of the JSON files. See `java.nio.charset.Charset` for list of options. You cannot use `UTF-16` and `UTF-32` when `multiline` is `true`. Default value: `UTF-8` |\n| **`inferTimestamp`** Type: `Boolean` Whether to try and infer timestamp strings as a `TimestampType`. When set to `true`, schema inference might take noticeably longer. You must enable `cloudFiles.inferColumnTypes` to use with Auto Loader. Default value: `false` |\n| **`lineSep`** Type: `String` A string between two consecutive JSON records. Default value: None, which covers `\\r`, `\\r\\n`, and `\\n` |\n| **`locale`** Type: `String` A `java.util.Locale` identifier. Influences default date, timestamp, and decimal parsing within the JSON. Default value: `US` |\n| **`mode`** Type: `String` Parser mode around handling malformed records. 
One of `'PERMISSIVE'`, `'DROPMALFORMED'`, or `'FAILFAST'`. Default value: `PERMISSIVE` |\n| **`multiLine`** Type: `Boolean` Whether the JSON records span multiple lines. Default value: `false` |\n| **`prefersDecimal`** Type: `Boolean` Attempts to infer strings as `DecimalType` instead of float or double type when possible. You must also use schema inference, either by enabling `inferSchema` or using `cloudFiles.inferColumnTypes` with Auto Loader. Default value: `false` |\n| **`primitivesAsString`** Type: `Boolean` Whether to infer primitive types like numbers and booleans as `StringType`. Default value: `false` |\n| **`readerCaseSensitive`** Type: `Boolean` Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. If true, rescue the data columns whose names differ by case from the schema; otherwise, read the data in a case-insensitive manner. Available in Databricks Runtime 13.3 and above. Default value: `true` |\n| **`rescuedDataColumn`** Type: `String` Whether to collect all data that can\u2019t be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to [What is the rescued data column?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html#rescue). Default value: None |\n| **`timestampFormat`** Type: `String` The format for parsing timestamp strings. Default value: `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` |\n| **`timeZone`** Type: `String` The `java.time.ZoneId` to use when parsing timestamps and dates. Default value: None | \n### [`CSV` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id22) \n| Option |\n| --- |\n| **`badRecordsPath`** Type: `String` The path to store files for recording the information about bad CSV records. Default value: None |\n| **`charToEscapeQuoteEscaping`** Type: `Char` The character used to escape the character used for escaping quotes. 
For example, for the following record: `[ \" a\\\\\", b ]`:* If the character to escape the `'\\'` is undefined, the record won\u2019t be parsed. The parser will read characters: `[a],[\\],[\"],[,],[ ],[b]` and throw an error because it cannot find a closing quote. * If the character to escape the `'\\'` is defined as `'\\'`, the record will be read with 2 values: `[a\\]` and `[b]`. Default value: `'\\0'` |\n| **`columnNameOfCorruptRecord`** Note Supported for Auto Loader. Not supported for `COPY INTO`. Type: `String` The column for storing records that are malformed and cannot be parsed. If the `mode` for parsing is set as `DROPMALFORMED`, this column will be empty. Default value: `_corrupt_record` |\n| **`comment`** Type: `Char` Defines the character that represents a line comment when found in the beginning of a line of text. Use `'\\0'` to disable comment skipping. Default value: `'\\u0000'` |\n| **`dateFormat`** Type: `String` The format for parsing date strings. Default value: `yyyy-MM-dd` |\n| **`emptyValue`** Type: `String` String representation of an empty value. Default value: `\"\"` |\n| **`encoding`** or **`charset`** Type: `String` The name of the encoding of the CSV files. See `java.nio.charset.Charset` for the list of options. `UTF-16` and `UTF-32` cannot be used when `multiline` is `true`. Default value: `UTF-8` |\n| **`enforceSchema`** Type: `Boolean` Whether to forcibly apply the specified or inferred schema to the CSV files. If the option is enabled, headers of CSV files are ignored. This option is ignored by default when using Auto Loader to rescue data and allow schema evolution. Default value: `true` |\n| **`escape`** Type: `Char` The escape character to use when parsing the data. Default value: `'\\'` |\n| **`header`** Type: `Boolean` Whether the CSV files contain a header. Auto Loader assumes that files have headers when inferring the schema. 
Default value: `false` |\n| **`ignoreLeadingWhiteSpace`** Type: `Boolean` Whether to ignore leading whitespace for each parsed value. Default value: `false` |\n| **`ignoreTrailingWhiteSpace`** Type: `Boolean` Whether to ignore trailing whitespace for each parsed value. Default value: `false` |\n| **`inferSchema`** Type: `Boolean` Whether to infer the data types of the parsed CSV records or to assume all columns are of `StringType`. Requires an additional pass over the data if set to `true`. For Auto Loader, use `cloudFiles.inferColumnTypes` instead. Default value: `false` |\n| **`lineSep`** Type: `String` A string between two consecutive CSV records. Default value: None, which covers `\\r`, `\\r\\n`, and `\\n` |\n| **`locale`** Type: `String` A `java.util.Locale` identifier. Influences default date, timestamp, and decimal parsing within the CSV. Default value: `US` |\n| **`maxCharsPerColumn`** Type: `Int` Maximum number of characters expected from a value to parse. Can be used to avoid memory errors. Defaults to `-1`, which means unlimited. Default value: `-1` |\n| **`maxColumns`** Type: `Int` The hard limit of how many columns a record can have. Default value: `20480` |\n| **`mergeSchema`** Type: `Boolean` Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default for Auto Loader when inferring the schema. Default value: `false` |\n| **`mode`** Type: `String` Parser mode around handling malformed records. One of `'PERMISSIVE'`, `'DROPMALFORMED'`, or `'FAILFAST'`. Default value: `PERMISSIVE` |\n| **`multiLine`** Type: `Boolean` Whether the CSV records span multiple lines. Default value: `false` |\n| **`nanValue`** Type: `String` The string representation of a not-a-number value when parsing `FloatType` or `DoubleType` columns. Default value: `\"NaN\"` |\n| **`negativeInf`** Type: `String` The string representation of negative infinity when parsing `FloatType` or `DoubleType` columns. 
Default value: `\"-Inf\"` |\n| **`nullValue`** Type: `String` String representation of a null value. Default value: `\"\"` |\n| **`parserCaseSensitive`** (deprecated) Type: `Boolean` While reading files, whether to align columns declared in the header with the schema case sensitively. This is `true` by default for Auto Loader. Columns that differ by case will be rescued in the `rescuedDataColumn` if enabled. This option has been deprecated in favor of `readerCaseSensitive`. Default value: `false` |\n| **`positiveInf`** Type: `String` The string representation of positive infinity when parsing `FloatType` or `DoubleType` columns. Default value: `\"Inf\"` |\n| **`preferDate`** Type: `Boolean` Attempts to infer strings as dates instead of timestamp when possible. You must also use schema inference, either by enabling `inferSchema` or using `cloudFiles.inferColumnTypes` with Auto Loader. Default value: `true` |\n| **`quote`** Type: `Char` The character used for escaping values where the field delimiter is part of the value. Default value: `\"` |\n| **`readerCaseSensitive`** Type: `Boolean` Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. If true, rescue the data columns whose names differ by case from the schema; otherwise, read the data in a case-insensitive manner. Default value: `true` |\n| **`rescuedDataColumn`** Type: `String` Whether to collect all data that can\u2019t be parsed due to: a data type mismatch, and schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details refer to [What is the rescued data column?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html#rescue). Default value: None |\n| **`sep`** or **`delimiter`** Type: `String` The separator string between columns. 
Default value: `\",\"` |\n| **`skipRows`** Type: `Int` The number of rows from the beginning of the CSV file that should be ignored (including commented and empty rows). If `header` is true, the header will be the first unskipped and uncommented row. Default value: `0` |\n| **`timestampFormat`** Type: `String` The format for parsing timestamp strings. Default value: `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` |\n| **`timeZone`** Type: `String` The `java.time.ZoneId` to use when parsing timestamps and dates. Default value: None |\n| **`unescapedQuoteHandling`** Type: `String` The strategy for handling unescaped quotes. Allowed options:* `STOP_AT_CLOSING_QUOTE`: If unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found. * `BACK_TO_DELIMITER`: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters of the current parsed value until the delimiter defined by `sep` is found. If no delimiter is found in the value, the parser will continue accumulating characters from the input until a delimiter or line ending is found. * `STOP_AT_DELIMITER`: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters until the delimiter defined by `sep`, or a line ending is found in the input. * `SKIP_VALUE`: If unescaped quotes are found in the input, the content parsed for the given value will be skipped (until the next delimiter is found) and the value set in `nullValue` will be produced instead. * `RAISE_ERROR`: If unescaped quotes are found in the input, a `TextParsingException` will be thrown. Default value: `STOP_AT_DELIMITER` | \n### [`XML` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id23) \n| Option | Description | Scope |\n| --- | --- | --- |\n| `rowTag` | The row tag of the XML files to treat as a row. 
In the example XML `<books> <book><book>...<\/books>`, the appropriate value is `book`. This is a required option. | read |\n| `samplingRatio` | Defines a fraction of rows used for schema inference. XML built-in functions ignore this option. Default: `1.0`. | read |\n| `excludeAttribute` | Whether to exclude attributes in elements. Default: `false`. | read |\n| `mode` | Mode for dealing with corrupt records during parsing. `PERMISSIVE`: For corrupted records, puts the malformed string into a field configured by `columnNameOfCorruptRecord`, and sets malformed fields to `null`. To keep corrupt records, you can set a `string` type field named `columnNameOfCorruptRecord` in a user-defined schema. If a schema does not have the field, corrupt records are dropped during parsing. When inferring a schema, the parser implicitly adds a `columnNameOfCorruptRecord` field in an output schema. `DROPMALFORMED`: Ignores corrupted records. This mode is unsupported for XML built-in functions. `FAILFAST`: Throws an exception when the parser meets corrupted records. | read |\n| `inferSchema` | If `true`, attempts to infer an appropriate type for each resulting DataFrame column. If `false`, all resulting columns are of `string` type. Default: `true`. XML built-in functions ignore this option. | read |\n| `columnNameOfCorruptRecord` | Allows renaming the new field that contains a malformed string created by `PERMISSIVE` mode. Default: `spark.sql.columnNameOfCorruptRecord`. | read |\n| `attributePrefix` | The prefix for attributes to differentiate attributes from elements. This will be the prefix for field names. Default is `_`. Can be empty for reading XML, but not for writing. | read, write |\n| `valueTag` | The tag used for the character data within elements that also have attribute(s) or child element(s). 
User can specify the `valueTag` field in the schema or it will be added automatically during schema inference when character data is present in elements with other elements or attributes. Default: `_VALUE` | read,write |\n| `encoding` | For reading, decodes the XML files by the given encoding type. For writing, specifies encoding (charset) of saved XML files. XML built-in functions ignore this option. Default: `UTF-8`. | read, write |\n| `ignoreSurroundingSpaces` | Defines whether surrounding white spaces from values being read should be skipped. Default: `true`. Whitespace-only character data are ignored. | read |\n| `rowValidationXSDPath` | Path to an optional XSD file that is used to validate the XML for each row individually. Rows that fail to validate are treated like parse errors as above. The XSD does not otherwise affect the schema provided, or inferred. | read |\n| `ignoreNamespace` | If `true`, namespaces\u2019 prefixes on XML elements and attributes are ignored. Tags `<abc:author>` and `<def:author>`, for example, are treated as if both are just `<author>`. Namespaces cannot be ignored on the `rowTag` element, only its read children. XML parsing is not namespace-aware even if `false`. Default: `false`. | read |\n| `timestampFormat` | Custom timestamp format string that follows the [datetime pattern](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-datetime-pattern.html) format. This applies to `timestamp` type. Default: `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`. | read, write |\n| `timestampNTZFormat` | Custom format string for timestamp without timezone that follows the datetime pattern format. This applies to TimestampNTZType type. Default: `yyyy-MM-dd'T'HH:mm:ss[.SSS]` | read, write |\n| `dateFormat` | Custom date format string that follows the [datetime pattern](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-datetime-pattern.html) format. This applies to date type. Default: `yyyy-MM-dd`. 
| read, write |\n| `locale` | Sets a locale as a language tag in IETF BCP 47 format. For instance, `locale` is used while parsing dates and timestamps. Default: `en-US`. | read |\n| `rootTag` | Root tag of the XML files. For example, in `<books> <book><book>...<\/books>`, the appropriate value is `books`. You can include basic attributes by specifying a value like `books foo=\"bar\"`. Default: `ROWS`. | write |\n| `declaration` | Content of XML declaration to write at the start of every output XML file, before the `rootTag`. For example, a value of `foo` causes `<?xml foo?>` to be written. Set to an empty string to suppress. Default: `version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"`. | write |\n| `arrayElementName` | Name of XML element that encloses each element of an array-valued column when writing. Default: `item`. | write |\n| `nullValue` | Sets the string representation of a null value. Default: string `null`. When this is `null`, the parser does not write attributes and elements for fields. | read, write |\n| `compression` | Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (`none`, `bzip2`, `gzip`, `lz4`, `snappy`, and `deflate`). XML built-in functions ignore this option. Default: `none`. | write |\n| `validateName` | If true, throws an error on XML element name validation failure. For example, SQL field names can have spaces, but XML element names cannot. Default: `true`. | write |\n| `readerCaseSensitive` | Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. If true, rescue the data columns whose names differ by case from the schema; otherwise, read the data in a case-insensitive manner. Default: `true`. | read |\n| `rescuedDataColumn` | Whether to collect all data that can\u2019t be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. 
For more details, see [What is the rescued data column?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html#rescue). Default: None. | read | \n### [`PARQUET` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id24) \n| Option |\n| --- |\n| **`datetimeRebaseMode`** Type: `String` Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values: `EXCEPTION`, `LEGACY`, and `CORRECTED`. Default value: `LEGACY` |\n| **`int96RebaseMode`** Type: `String` Controls the rebasing of the INT96 timestamp values between Julian and Proleptic Gregorian calendars. Allowed values: `EXCEPTION`, `LEGACY`, and `CORRECTED`. Default value: `LEGACY` |\n| **`mergeSchema`** Type: `Boolean` Whether to infer the schema across multiple files and to merge the schema of each file. Default value: `false` |\n| **`readerCaseSensitive`** Type: `Boolean` Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. If true, rescue the data columns whose names differ by case from the schema; otherwise, read the data in a case-insensitive manner. Default value: `true` |\n| **`rescuedDataColumn`** Type: `String` Whether to collect all data that can\u2019t be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to [What is the rescued data column?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html#rescue). Default value: None | \n### [`AVRO` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id25) \n| Option |\n| --- |\n| **`avroSchema`** Type: `String` Optional schema provided by a user in Avro format. When reading Avro, this option can be set to an evolved schema, which is compatible with but different from the actual Avro schema. The deserialization schema will be consistent with the evolved schema. 
For example, if you set an evolved schema containing one additional column with a default value, the read result will contain the new column too. Default value: None |\n| **`datetimeRebaseMode`** Type: `String` Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values: `EXCEPTION`, `LEGACY`, and `CORRECTED`. Default value: `LEGACY` |\n| **`mergeSchema`** Type: `Boolean` Whether to infer the schema across multiple files and to merge the schema of each file. `mergeSchema` for Avro does not relax data types. Default value: `false` |\n| **`readerCaseSensitive`** Type: `Boolean` Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. If true, rescue the data columns whose names differ by case from the schema; otherwise, read the data in a case-insensitive manner. Default value: `true` |\n| **`rescuedDataColumn`** Type: `String` Whether to collect all data that can\u2019t be parsed due to: a data type mismatch, and schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details refer to [What is the rescued data column?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html#rescue). Default value: None | \n### [`BINARYFILE` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id26) \nBinary files do not have any additional configuration options. \n### [`TEXT` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id27) \n| Option |\n| --- |\n| **`encoding`** Type: `String` The name of the encoding of the TEXT files. See `java.nio.charset.Charset` for list of options. Default value: `UTF-8` |\n| **`lineSep`** Type: `String` A string between two consecutive TEXT records. Default value: None, which covers `\\r`, `\\r\\n` and `\\n` |\n| **`wholeText`** Type: `Boolean` Whether to read a file as a single record. 
Default value: `false` | \n### [`ORC` options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id28) \n| Option |\n| --- |\n| **`mergeSchema`** Type: `Boolean` Whether to infer the schema across multiple files and to merge the schema of each file. Default value: `false` |\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Auto Loader options\n##### [Cloud specific options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id16)\n\nAuto Loader provides a number of options for configuring cloud infrastructure. \n* [AWS specific options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#aws-specific-options)\n* [Azure specific options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#azure-specific-options)\n* [Google specific options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#google-specific-options) \n### [AWS specific options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id29) \nProvide the following option only if you choose `cloudFiles.useNotifications` = `true` and you want Auto Loader to set up the notification services for you: \n| Option |\n| --- |\n| `cloudFiles.region` Type: `String` The region where the source S3 bucket resides and where the AWS SNS and SQS services will be created. Default value: The region of the EC2 instance. | \nProvide the following option only if you choose `cloudFiles.useNotifications` = `true` and you want Auto Loader to use a queue that you have already set up: \n| Option |\n| --- |\n| `cloudFiles.queueUrl` Type: `String` The URL of the SQS queue. If provided, Auto Loader directly consumes events from this queue instead of setting up its own AWS SNS and SQS services. 
Default value: None | \nYou can use the following options to provide credentials to access AWS SNS and SQS when IAM roles are not available or when you\u2019re ingesting data from different clouds. \n| Option |\n| --- |\n| `cloudFiles.awsAccessKey` Type: `String` The AWS access key ID for the user. Must be provided with `cloudFiles.awsSecretKey`. Default value: None |\n| `cloudFiles.awsSecretKey` Type: `String` The AWS secret access key for the user. Must be provided with `cloudFiles.awsAccessKey`. Default value: None |\n| `cloudFiles.roleArn` Type: `String` The ARN of an IAM role to assume. The role can be assumed from your cluster\u2019s instance profile or by providing credentials with `cloudFiles.awsAccessKey` and `cloudFiles.awsSecretKey`. Default value: None |\n| `cloudFiles.roleExternalId` Type: `String` An identifier to provide while assuming a role using `cloudFiles.roleArn`. Default value: None |\n| `cloudFiles.roleSessionName` Type: `String` An optional session name to use while assuming a role using `cloudFiles.roleArn`. Default value: None |\n| `cloudFiles.stsEndpoint` Type: `String` An optional endpoint to provide for accessing AWS STS when assuming a role using `cloudFiles.roleArn`. Default value: None | \n### [Azure specific options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id30) \nYou must provide values for all of the following options if you specify `cloudFiles.useNotifications` = `true` and you want Auto Loader to set up the notification services for you: \n| Option |\n| --- |\n| `cloudFiles.clientId` Type: `String` The client ID or application ID of the service principal. Default value: None |\n| `cloudFiles.clientSecret` Type: `String` The client secret of the service principal. Default value: None |\n| `cloudFiles.connectionString` Type: `String` The connection string for the storage account, based on either account access key or shared access signature (SAS). 
Default value: None |\n| `cloudFiles.resourceGroup` Type: `String` The Azure Resource Group under which the storage account is created. Default value: None |\n| `cloudFiles.subscriptionId` Type: `String` The Azure Subscription ID under which the resource group is created. Default value: None |\n| `cloudFiles.tenantId` Type: `String` The Azure Tenant ID under which the service principal is created. Default value: None | \nImportant \nAutomated notification setup is available in Azure China and Government regions with Databricks Runtime 9.1 and later. You must provide a `queueName` to use Auto Loader with file notifications in these regions for older DBR versions. \nProvide the following option only if you choose `cloudFiles.useNotifications` = `true` and you want Auto Loader to use a queue that you have already set up: \n| Option |\n| --- |\n| `cloudFiles.queueName` Type: `String` The name of the Azure queue. If provided, the cloud files source directly consumes events from this queue instead of setting up its own Azure Event Grid and Queue Storage services. In that case, your `cloudFiles.connectionString` requires only read permissions on the queue. Default value: None | \n### [Google specific options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#id31) \nAuto Loader can automatically set up notification services for you by leveraging Google Service Accounts. You can configure your cluster to assume a service account by following [Google service setup](https:\/\/docs.databricks.com\/compute\/configure.html). The permissions that your service account needs are specified in [What is Auto Loader file notification mode?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html). Otherwise, you can provide the following options for authentication if you want Auto Loader to set up the notification services for you. \n| Option |\n| --- |\n| `cloudFiles.client` Type: `String` The client ID of the Google Service Account. 
Default value: None |\n| `cloudFiles.clientEmail` Type: `String` The email of the Google Service Account. Default value: None |\n| `cloudFiles.privateKey` Type: `String` The private key that\u2019s generated for the Google Service Account. Default value: None |\n| `cloudFiles.privateKeyId` Type: `String` The id of the private key that\u2019s generated for the Google Service Account. Default value: None |\n| `cloudFiles.projectId` Type: `String` The id of the project that the GCS bucket is in. The Google Cloud Pub\/Sub subscription will also be created within this project. Default value: None | \nProvide the following option only if you choose `cloudFiles.useNotifications` = `true` and you want Auto Loader to use a queue that you have already set up: \n| Option |\n| --- |\n| `cloudFiles.subscription` Type: `String` The name of the Google Cloud Pub\/Sub subscription. If provided, the cloud files source consumes events from this queue instead of setting up its own GCS Notification and Google Cloud Pub\/Sub services. Default value: None |\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Use Databricks compute with your jobs\n\nWhen you run a Databricks job, the tasks configured as part of the job run on Databricks compute, either serverless compute, a cluster, or a SQL warehouse, depending on the task type. Selecting the compute type and configuration options is important when operationalizing a job. This article provides recommendations for using Databricks compute resources to run your jobs. \nTo learn more about using serverless compute with your Databricks jobs, see [Run your Databricks job with serverless compute for workflows](https:\/\/docs.databricks.com\/workflows\/jobs\/run-serverless-jobs.html). 
\nNote \n[Secrets](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html) are not redacted from a cluster\u2019s Spark driver log `stdout` and `stderr` streams. To protect sensitive data, by default, Spark driver logs are viewable only by users with CAN MANAGE permission on job, single user access mode, and shared access mode clusters. To allow users with CAN ATTACH TO or CAN RESTART permission to view the logs on these clusters, set the following Spark configuration property in the cluster configuration: `spark.databricks.acl.needAdminPermissionToViewLogs false`. \nOn No Isolation Shared access mode clusters, the Spark driver logs can be viewed by users with CAN ATTACH TO or CAN MANAGE permission. To limit who can read the logs to only users with the CAN MANAGE permission, set `spark.databricks.acl.needAdminPermissionToViewLogs` to `true`. \nSee [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration) to learn how to add Spark properties to a cluster configuration.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/use-compute.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Use Databricks compute with your jobs\n##### Use shared job clusters\n\nTo optimize resource usage with jobs that orchestrate multiple tasks, use shared job clusters. A shared job cluster allows multiple tasks in the same job run to reuse the cluster. You can use a single job cluster to run all tasks that are part of the job, or multiple job clusters optimized for specific workloads. To use a shared job cluster: \n1. Select **New Job Clusters** when you create a task and complete the [cluster configuration](https:\/\/docs.databricks.com\/compute\/configure.html).\n2. Select the new cluster when adding a task to the job, or create a new job cluster. Any cluster you configure when you select **New Job Clusters** is available to any task in the job. 
\nA shared job cluster is scoped to a single job run and cannot be used by other jobs or runs of the same job. \nLibraries cannot be declared in a shared job cluster configuration. You must add dependent libraries in task settings.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/use-compute.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Use Databricks compute with your jobs\n##### Choose the correct cluster type for your job\n\n* **New Job Clusters** are dedicated clusters for a job or task run. A shared job cluster is created and started when the first task using the cluster starts and terminates after the last task using the cluster completes. The cluster is not terminated when idle but only after all tasks are completed. If a shared job cluster fails or is terminated before all tasks have finished, a new cluster is created. A cluster scoped to a single task is created and started when the task starts and terminates when the task completes. In production, Databricks recommends using new shared or task-scoped clusters so that each job or task runs in a fully isolated environment.\n* When you run a task on a new cluster, the task is treated as a data engineering (task) workload, subject to the task workload pricing. 
When you run a task on an existing all-purpose cluster, the task is treated as a data analytics (all-purpose) workload, subject to all-purpose workload pricing.\n* If you select a terminated existing cluster and the job owner has CAN RESTART [permission](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions), Databricks starts the cluster when the job is scheduled to run.\n* Existing all-purpose clusters work best for tasks such as updating [dashboards](https:\/\/docs.databricks.com\/notebooks\/dashboards.html) at regular intervals.\n\n#### Use Databricks compute with your jobs\n##### Use a pool to reduce cluster start times\n\nTo decrease new job cluster start time, create a [pool](https:\/\/docs.databricks.com\/compute\/pools.html) and configure the job\u2019s cluster to use the pool. \n### Automatic availability zones \nTo take advantage of automatic availability zones (Auto-AZ), you must enable it with the [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters), setting `aws_attributes.zone_id = \"auto\"`. See [Availability zones](https:\/\/docs.databricks.com\/compute\/configure.html#availability-zones).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/use-compute.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Dataset optimization and caching\n\nDashboards are valuable data analysis and decision-making tools, and efficient load times can significantly improve the user experience. This article explains how caching and dataset optimizations make dashboards more performant and efficient.\n\n#### Dataset optimization and caching\n##### Query performance\n\nYou can inspect queries and their performance in the workspace query history. The query history shows SQL queries performed using SQL warehouses. Click ![History Icon](https:\/\/docs.databricks.com\/_images\/history-icon.png) **Query History** in the sidebar to view the query history. 
See [Query history](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html). \nFor dashboard datasets, Databricks applies performance optimizations depending on the result size of the dataset.\n\n#### Dataset optimization and caching\n##### Dataset optimizations\n\nDashboard datasets include the following performance optimizations: \n* If the dataset result size is small (less than 64K rows), the dataset result is pulled to the client, and visualization-specific filtering and aggregation are performed on the client. Filtering and aggregating data for small datasets is very fast, and ensuring that your dataset is small can help you optimize dashboard performance. With small datasets, only the dataset query appears in the query history.\n* If the dataset result size is large (>= 64K rows), the dataset query text is wrapped in a SQL `WITH` clause, and the visualization-specific filtering and aggregation is performed in a query on the backend rather than in the client. With large datasets, the visualization query appears in the query history.\n* For visualization queries sent to the backend, separate visualization queries against the same dataset that share the same `GROUP BY` clauses and filter predicates are combined into a single query for processing. In this case, users might see one combined query in the query history that is fetching results for multiple visualizations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/caching.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Dataset optimization and caching\n##### Caching and data freshness\n\nDashboards maintain a 24-hour result cache to optimize initial loading times, operating on a best-effort basis. This means that while the system always attempts to use historical query results linked to dashboard credentials to enhance performance, there are some cases where cached results cannot be created or maintained. 
\nThe following table explains how caching varies by dashboard status and credentials: \n| Dashboard type | Caching type |\n| --- | --- |\n| Published dashboard with embedded credentials | Shared cache. All viewers see the same results. |\n| Draft dashboard or published dashboard without embedded credentials | Per user cache. Viewers see results based on their data permissions. | \nDashboards automatically use cached query results if the underlying data remains unchanged after the last query or if the results were retrieved less than 24 hours ago. If stale results exist and parameters are applied to the dashboard, queries will rerun unless the same parameters were used in the past 24 hours. Similarly, applying filters to datasets exceeding 64,000 rows prompts queries to rerun unless the same filters were previously applied in the last 24 hours.\n\n#### Dataset optimization and caching\n##### Scheduled queries\n\nAdding a schedule to a published dashboard with embedded credentials can significantly speed up the initial loading process for all dashboard viewers. \nFor each scheduled dashboard update, the following occurs: \n* All SQL logic that defines datasets runs on the designated time interval.\n* Results populate the query result cache and help to improve initial dashboard load time.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/caching.html"} +{"content":"# AI and Machine Learning on Databricks\n## Prepare data and environment for ML and DL\n### Load data for machine learning and deep learning\n#### Prepare data for distributed training\n###### Load data using Petastorm\n\nThis article describes how to use [Petastorm](https:\/\/github.com\/uber\/petastorm) to convert data from Apache Spark to TensorFlow or PyTorch. It also provides an example showing how to use Petastorm to prepare data for ML. \nPetastorm is an open source data access library. 
It enables single-node or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format and datasets that are already loaded as Apache Spark DataFrames.\nPetastorm supports popular Python-based machine learning (ML) frameworks such as TensorFlow, PyTorch, and PySpark.\nFor more information about Petastorm, see the [Petastorm API documentation](https:\/\/petastorm.readthedocs.io\/en\/latest).\n\n###### Load data using Petastorm\n####### Load data from Spark DataFrames using Petastorm\n\nThe Petastorm Spark converter API simplifies data conversion from Spark to TensorFlow or PyTorch. The input Spark DataFrame is first materialized in Parquet format and then loaded as a `tf.data.Dataset` or `torch.utils.data.DataLoader`.\nSee the [Spark Dataset Converter API section](https:\/\/petastorm.readthedocs.io\/en\/latest\/api.html#module-petastorm.spark.spark_dataset_converter) in the Petastorm API documentation. \nThe recommended workflow is: \n1. Use Apache Spark to load and optionally preprocess data.\n2. Use the Petastorm `spark_dataset_converter` method to convert data from a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader.\n3. Feed data into a DL framework for training or inference.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/load-data\/petastorm.html"} +{"content":"# AI and Machine Learning on Databricks\n## Prepare data and environment for ML and DL\n### Load data for machine learning and deep learning\n#### Prepare data for distributed training\n###### Load data using Petastorm\n####### Configure cache directory\n\nThe Petastorm Spark converter caches the input Spark DataFrame in Parquet format in a user-specified cache directory location. The cache directory must be a DBFS path starting with `file:\/\/\/dbfs\/`, for example, `file:\/\/\/dbfs\/tmp\/foo\/` which refers to the same location as `dbfs:\/tmp\/foo\/`. 
You can configure the cache directory in two ways: \n* In the cluster [Spark config](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration) add the line: `petastorm.spark.converter.parentCacheDirUrl file:\/\/\/dbfs\/...`\n* In your notebook, call `spark.conf.set()`: \n```\nfrom petastorm.spark import SparkDatasetConverter, make_spark_converter\n\nspark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'file:\/\/\/dbfs\/...')\n\n``` \nYou can either explicitly delete the cache after using it by calling `converter.delete()` or manage the cache implicitly by configuring the lifecycle rules in your object storage. \nDatabricks supports DL training in three scenarios: \n* Single-node training\n* Distributed hyperparameter tuning\n* Distributed training \nFor end-to-end examples, see the following notebooks: \n* [Simplify data conversion from Spark to TensorFlow](https:\/\/docs.databricks.com\/machine-learning\/load-data\/petastorm.html#petastorm-tensorflow)\n* [Simplify data conversion from Spark to PyTorch](https:\/\/docs.databricks.com\/machine-learning\/load-data\/petastorm.html#petastorm-pytorch)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/load-data\/petastorm.html"} +{"content":"# AI and Machine Learning on Databricks\n## Prepare data and environment for ML and DL\n### Load data for machine learning and deep learning\n#### Prepare data for distributed training\n###### Load data using Petastorm\n####### Load Parquet files directly using Petastorm\n\nThis method is less preferred than the Petastorm Spark converter API. \nThe recommended workflow is: \n1. Use Apache Spark to load and optionally preprocess data.\n2. Save data in Parquet format into a DBFS path that has a companion DBFS mount.\n3. Load data in Petastorm format via the DBFS mount point.\n4. Use data in a DL framework for training or inference. 
\nSee [example notebook](https:\/\/docs.databricks.com\/machine-learning\/load-data\/petastorm.html#petastorm-example) for an end-to-end example.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/load-data\/petastorm.html"} +{"content":"# AI and Machine Learning on Databricks\n## Prepare data and environment for ML and DL\n### Load data for machine learning and deep learning\n#### Prepare data for distributed training\n###### Load data using Petastorm\n####### Examples: Preprocess data and train models with TensorFlow or PyTorch\n\nThis example notebook demonstrates the following workflow on Databricks: \n1. Load data using Spark.\n2. Convert the Spark DataFrame to a TensorFlow Dataset using Petastorm.\n3. Feed the data into a single-node TensorFlow model for training.\n4. Feed the data into a distributed hyperparameter tuning function.\n5. Feed the data into a distributed TensorFlow model for training. \n### Simplify data conversion from Spark to TensorFlow notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/petastorm-spark-converter-tensorflow.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nThis example notebook demonstrates the following workflow on Databricks: \n1. Load data using Spark.\n2. Convert the Spark DataFrame to a PyTorch DataLoader using Petastorm.\n3. Feed the data into a single-node PyTorch model for training.\n4. Feed the data into a distributed hyperparameter tuning function.\n5. Feed the data into a distributed PyTorch model for training. 
\n### Simplify data conversion from Spark to PyTorch notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/petastorm-spark-converter-pytorch.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/load-data\/petastorm.html"} +{"content":"# AI and Machine Learning on Databricks\n## Prepare data and environment for ML and DL\n### Load data for machine learning and deep learning\n#### Prepare data for distributed training\n###### Load data using Petastorm\n####### Example: Preprocess data and load Parquet files with Petastorm\n\nThis example notebook shows you the following workflow on Databricks: \n1. Use Spark to load and preprocess data.\n2. Save data using Parquet under `dbfs:\/ml`.\n3. Load data using Petastorm via the optimized FUSE mount `file:\/dbfs\/ml`.\n4. Feed data into a deep learning framework for training or inference. \n### Use Spark and Petastorm to prepare data for deep learning notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/petastorm.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/load-data\/petastorm.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to TIBCO Spotfire Analyst\n\nThis article describes how to use TIBCO Spotfire Analyst with a Databricks cluster or a Databricks SQL warehouse.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/spotfire.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to TIBCO Spotfire Analyst\n##### Requirements\n\n* A cluster or SQL warehouse in your Databricks workspace. 
\n+ [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n+ [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* The connection details for your cluster or SQL warehouse, specifically the **Server Hostname**, **Port**, and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). To create a personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**.\nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. 
See your workspace administrator or the following: \n+ [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n+ [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/spotfire.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to TIBCO Spotfire Analyst\n##### Steps to connect\n\n1. In TIBCO Spotfire Analyst, on the navigation bar, click the plus (**Files and data**) icon and click **Connect to**.\n2. Select **Databricks** and click **New connection**.\n3. In the **Apache Spark SQL** dialog, on the **General** tab, for **Server**, enter the **Server Hostname** and **Port** field values from Step 1, separated by a colon.\n4. For **Authentication method**, select **Username and password**.\n5. For **Username**, enter the word `token`.\n6. For **Password**, enter your personal access token from Step 1.\n7. On the **Advanced** tab, for **Thrift transport mode**, select **HTTP**.\n8. For **HTTP Path**, enter the **HTTP Path** field value from Step 1.\n9. On the **General** tab, click **Connect**.\n10. 
After a successful connection, in the **Database** list, select the database you want to use, and then click **OK**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/spotfire.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to TIBCO Spotfire Analyst\n##### Select the Databricks data to analyze\n\nYou select data in the **Views in Connection** dialog. \n![Available Tables](https:\/\/docs.databricks.com\/_images\/tibco-image3.png) \n1. Browse the available tables in Databricks.\n2. Add the tables you want as views, which will be the data tables you analyze in TIBCO Spotfire.\n3. For each view, you can decide which columns you want to include. If you want to create a very specific and flexible data selection, you have access to a range of powerful tools in this dialog, such as: \n* Custom queries. With custom queries, you can select the data you want to analyze by typing a custom SQL query.\n* Prompting. Leave the data selection to the user of your analysis file. You configure prompts based on columns of your choice. Then, the end user who opens the analysis can choose to limit and view data for relevant values only. For example, the user can select data within a certain span of time or for a specific geographic region.\n4. Click **OK**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/spotfire.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to TIBCO Spotfire Analyst\n##### Push-down queries to Databricks or import data\n\nWhen you have selected the data that you want to analyze, the final step is to choose how you want to retrieve the data from Databricks. A summary of the data tables you are adding to your analysis is displayed, and you can click each table to change the data loading method. \n![orders table example](https:\/\/docs.databricks.com\/_images\/tibco-image4.png) \nThe default option for Databricks is **External**. 
This means the data table is kept in-database in Databricks, and TIBCO Spotfire pushes different queries to the database for the relevant slices of data, based on your actions in the analysis. \nYou can also select **Imported** and TIBCO Spotfire will extract the entire data table up-front, which enables local in-memory analysis. When you import data tables, you also use analytical functions in the embedded in-memory data engine of TIBCO Spotfire. \nThe third option is **On-demand** (corresponding to a dynamic `WHERE` clause), which means that slices of data will be extracted based on user actions in the analysis. You can define the criteria, which could be actions like marking or filtering data, or changing document properties. On-demand data loading can also be combined with **External** data tables.\n\n#### Connect to TIBCO Spotfire Analyst\n##### Additional resources\n\n[Support](https:\/\/www.tibco.com\/services\/support)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/spotfire.html"} +{"content":"# Technology partners\n## Connect to ML partners using Partner Connect\n#### Connect to John Snow Labs\n\nJohn Snow Labs provides production-grade, scalable, and trainable versions of the latest research in natural language processing (NLP) through the following products: \n* Spark NLP: state-of-the-art NLP for Python, Java, or Scala.\n* Spark NLP for Healthcare: state-of-the-art clinical and biomedical NLP.\n* Spark OCR: a scalable, private, and highly accurate OCR and de-identification library. \nYou can integrate your Databricks clusters with John Snow Labs. 
\nNote \nJohn Snow Labs does not integrate with Databricks SQL warehouses (formerly Databricks SQL endpoints).\n\n#### Connect to John Snow Labs\n##### Connect to John Snow Labs using Partner Connect\n\nThe Partner Connect steps cover the most popular NLP and OCR tasks: \n* Create a new cluster in your Databricks workspace.\n* Automatically install John Snow Labs NLP and OCR libraries on the new cluster.\n* Create and deploy a 30-day trial license for John Snow Labs NLP and OCR libraries.\n* Copy 20+ ready-to-use Python notebooks to the new cluster. \n### Differences between standard connections and John Snow Labs \nTo connect to John Snow Labs using Partner Connect, you follow the steps in [Connect to ML partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/ml.html). The John Snow Labs connection is different from standard machine learning connections in the following ways: \n* To complete the Partner Connect steps, you need a valid credit card. Your credit card is subject to pay-as-you-go charges that begin after the trial ends.\n* After you follow the on-screen instructions to start your John Snow Labs NLP trial, check your email inbox for a message from John Snow Labs that contains instructions about how to get started, then follow the instructions in the message. It could take up to a half hour for this message to arrive. 
\n### Steps to connect \nTo connect your Databricks workspace to John Snow Labs using Partner Connect, see [Connect to ML partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/ml.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ml\/john-snow-labs.html"} +{"content":"# Technology partners\n## Connect to ML partners using Partner Connect\n#### Connect to John Snow Labs\n##### Connect to John Snow Labs manually\n\nFollow these instructions to automatically install the John Snow Labs NLP and OCR libraries and notebooks on your cluster, and to activate your trial of John Snow Labs if you do not already have a John Snow Labs account. \n### Requirements \nBefore you integrate with John Snow Labs, you must have the following: \n* A Databricks [cluster](https:\/\/docs.databricks.com\/compute\/configure.html) in your Databricks workspace.\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement). \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \n### Procedure \nTo integrate with John Snow Labs, complete these steps: \n1. Make sure you meet the [requirements](https:\/\/docs.databricks.com\/partners\/ml\/john-snow-labs.html#requirements) for John Snow Labs.\n2. Go to the [John Snow Labs NLP on Databricks](https:\/\/www.johnsnowlabs.com\/databricks\/) webpage.\n3. 
Click **Install in my Databricks account**.\n4. In the **Please tell us about yourself** dialog, enter your first name, last name, and company email address.\n5. For **Databricks instance url**, enter your Databricks [workspace URL](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `https:\/\/dbc-a1b2345c-cloud.databricks.com\/?o=1234567890123456`.\n6. For **Databricks access token**, enter your Databricks personal access token value from the requirements in this article.\n7. Click **Test connection**.\n8. After the connection succeeds, for **Choose a cluster to install on**, select the cluster from the requirements in this article.\n9. Click **Get Trial License**.\n10. Check your email inbox for a message from John Snow Labs that contains a request to validate your email address.\n11. In the message, click **Validate my email**.\n12. After several minutes, check your email inbox again for another message from John Snow Labs that contains instructions about how to get started. Note that in some cases it could take up to a half hour for this message to arrive.\n13. Follow the instructions in the message. \nNote \nTo manually install the John Snow Labs libraries and notebooks on your cluster, see the following on the John Snow Labs website: \n* [Install Spark NLP on Databricks](https:\/\/nlp.johnsnowlabs.com\/docs\/en\/install#databricks-support)\n* [Install Spark NLP for Healthcare on Databricks](https:\/\/nlp.johnsnowlabs.com\/docs\/en\/licensed_install#install-on-databricks)\n* [Install Spark OCR on Databricks](https:\/\/nlp.johnsnowlabs.com\/docs\/en\/ocr_install#databricks)\n14. To upgrade your trial of John Snow Labs, sign in to your John Snow Labs account, at <https:\/\/my.johnsnowlabs.com\/login>.\n15. 
Continue to next steps.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ml\/john-snow-labs.html"} +{"content":"# Technology partners\n## Connect to ML partners using Partner Connect\n#### Connect to John Snow Labs\n##### Next steps\n\nExplore one or more of the following resources on the John Snow Labs website: \n* [John Snow Labs website](https:\/\/www.johnsnowlabs.com)\n* [Spark NLP](https:\/\/www.johnsnowlabs.com\/spark-nlp\/)\n* [Spark NLP for Healthcare](https:\/\/www.johnsnowlabs.com\/spark-nlp-health\/)\n* [Spark OCR](https:\/\/www.johnsnowlabs.com\/spark-ocr\/)\n* [John Snow Labs NLP Documentation](https:\/\/nlp.johnsnowlabs.com\/docs)\n* [John Snow Labs NLP on Databricks](https:\/\/www.johnsnowlabs.com\/databricks\/)\n* [Additional learning resources](https:\/\/nlp.johnsnowlabs.com\/learn)\n* [Spark NLP documentation](https:\/\/nlp.johnsnowlabs.com\/docs\/en\/quickstart)\n* [Spark NLP for Healthcare documentation](https:\/\/nlp.johnsnowlabs.com\/docs\/en\/license_getting_started)\n* [Spark OCR documentation](https:\/\/nlp.johnsnowlabs.com\/docs\/en\/ocr)\n* [Support](mailto:support%40johsnowlabs.com)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ml\/john-snow-labs.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deploy models for batch inference and prediction\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-inference\/dl-model-inference.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deploy models for batch inference and prediction\n#### Deep learning model inference workflow\n\nFor model inference for deep learning applications, Databricks recommends the following workflow. For example notebooks that use TensorFlow and PyTorch, see [Deep learning model inference examples](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/dl-model-inference.html#dl-inference-examples). \n1. 
**Load the data into Spark DataFrames.** Depending on the data type, Databricks recommends the following ways to load data: \n* Image files (JPG, PNG): Load the image paths into a Spark DataFrame. Image loading and preprocessing of the input data occur in a pandas UDF.\n```\nfiles_df = spark.createDataFrame(map(lambda path: (path,), file_paths), [\"path\"])\n\n``` \n* TFRecords: Load the data using the [spark-tensorflow-connector](https:\/\/github.com\/tensorflow\/ecosystem\/tree\/master\/spark\/spark-tensorflow-connector).\n```\ndf = spark.read.format(\"tfrecords\").load(image_path)\n\n``` \n* Data sources such as Parquet, CSV, JSON, JDBC, and other metadata: Load the data using [Spark data sources](https:\/\/docs.databricks.com\/query\/formats\/index.html).\n2. **Perform model inference using pandas UDFs.** [pandas UDFs](https:\/\/spark.apache.org\/docs\/latest\/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs) use Apache Arrow to transfer data and pandas to work with the data. Model inference with pandas UDFs follows these broad steps: \n1. Load the trained model: For efficiency, Databricks recommends broadcasting the weights of the model from the driver, then loading the model graph and getting the weights from the broadcast variables in a pandas UDF.\n2. Load and preprocess input data: To load data in batches, Databricks recommends using the [tf.data API](https:\/\/www.tensorflow.org\/guide\/data) for TensorFlow and the [DataLoader class](https:\/\/pytorch.org\/tutorials\/beginner\/data_loading_tutorial.html) for PyTorch. Both also support prefetching and multi-threaded loading to hide I\/O-bound latency.\n3. Run model prediction: Run model inference on the data batch.\n4. 
Send predictions back to Spark DataFrames: Collect the prediction results and return them as a `pd.Series`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-inference\/dl-model-inference.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deploy models for batch inference and prediction\n#### Deep learning model inference workflow\n##### Deep learning model inference examples\n\nThe examples in this section follow the recommended deep learning inference workflow. These examples illustrate how to perform model inference using a pre-trained deep residual network (ResNet) model. \n* [Model inference using TensorFlow Keras API](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/resnet-model-inference-keras.html)\n* [Model inference using TensorFlow and TensorRT](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/resnet-model-inference-tensorrt.html)\n* [Model inference using PyTorch](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/resnet-model-inference-pytorch.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-inference\/dl-model-inference.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Manage data quality with Delta Live Tables\n\nYou use expectations to define data quality constraints on the contents of a dataset. Expectations allow you to guarantee data arriving in tables meets data quality requirements and provide insights into data quality for each pipeline update. 
You apply expectations to queries using Python decorators or SQL constraint clauses.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Manage data quality with Delta Live Tables\n##### What are Delta Live Tables expectations?\n\nExpectations are optional clauses you add to Delta Live Tables dataset declarations that apply data quality checks on each record passing through a query. \nAn expectation consists of three things: \n* A description, which acts as a unique identifier and allows you to track metrics for the constraint.\n* A boolean statement that always returns true or false based on some stated condition.\n* An action to take when a record fails the expectation, meaning the boolean returns false. \nThe following matrix shows the three actions you can apply to invalid records: \n| Action | Result |\n| --- | --- |\n| [warn](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html#retain) (default) | Invalid records are written to the target; failure is reported as a metric for the dataset. |\n| [drop](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html#drop) | Invalid records are dropped before data is written to the target; failure is reported as a metric for the dataset. |\n| [fail](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html#fail) | Invalid records prevent the update from succeeding. Manual intervention is required before re-processing. | \nYou can view data quality metrics such as the number of records that violate an expectation by querying the Delta Live Tables event log. See [Monitor Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html). 
\nFor a complete reference of Delta Live Tables dataset declaration syntax, see [Delta Live Tables Python language reference](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html) or [Delta Live Tables SQL language reference](https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html). \nNote \nWhile you can include multiple clauses in any expectation, only Python supports defining actions based on multiple expectations. See [Multiple expectations](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html#expect-all).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Manage data quality with Delta Live Tables\n##### Retain invalid records\n\nUse the `expect` operator when you want to keep records that violate the expectation. Records that violate the expectation are added to the target dataset along with valid records: \n```\n@dlt.expect(\"valid timestamp\", \"timestamp > '2012-01-01'\")\n\n``` \n```\nCONSTRAINT valid_timestamp EXPECT (timestamp > '2012-01-01')\n\n```\n\n#### Manage data quality with Delta Live Tables\n##### Drop invalid records\n\nUse the `expect or drop` operator to prevent further processing of invalid records. 
Records that violate the expectation are dropped from the target dataset: \n```\n@dlt.expect_or_drop(\"valid_current_page\", \"current_page_id IS NOT NULL AND current_page_title IS NOT NULL\")\n\n``` \n```\nCONSTRAINT valid_current_page EXPECT (current_page_id IS NOT NULL and current_page_title IS NOT NULL) ON VIOLATION DROP ROW\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Manage data quality with Delta Live Tables\n##### Fail on invalid records\n\nWhen invalid records are unacceptable, use the `expect or fail` operator to stop execution immediately when a record fails validation. If the operation is a table update, the system atomically rolls back the transaction: \n```\n@dlt.expect_or_fail(\"valid_count\", \"count > 0\")\n\n``` \n```\nCONSTRAINT valid_count EXPECT (count > 0) ON VIOLATION FAIL UPDATE\n\n``` \nWhen a pipeline fails because of an expectation violation, you must fix the pipeline code to handle the invalid data correctly before re-running the pipeline. \nFail expectations modify the Spark query plan of your transformations to track information required to detect and report on violations. For many queries, you can use this information to identify which input record resulted in the violation. 
The following is an example exception: \n```\nExpectation Violated:\n{\n\"flowName\": \"a-b\",\n\"verboseInfo\": {\n\"expectationsViolated\": [\n\"x1 is negative\"\n],\n\"inputData\": {\n\"a\": {\"x1\": 1, \"y1\": \"a\"},\n\"b\": {\n\"x2\": 1,\n\"y2\": \"aa\"\n}\n},\n\"outputRecord\": {\n\"x1\": 1,\n\"y1\": \"a\",\n\"x2\": 1,\n\"y2\": \"aa\"\n},\n\"missingInputData\": false\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Manage data quality with Delta Live Tables\n##### Multiple expectations\n\nYou can define expectations with one or more data quality constraints in Python pipelines. The following decorators accept a Python dictionary as an argument, where the key is the expectation name and the value is the expectation constraint. \nUse `expect_all` to specify multiple data quality constraints when records that fail validation should be included in the target dataset: \n```\n@dlt.expect_all({\"valid_count\": \"count > 0\", \"valid_current_page\": \"current_page_id IS NOT NULL AND current_page_title IS NOT NULL\"})\n\n``` \nUse `expect_all_or_drop` to specify multiple data quality constraints when records that fail validation should be dropped from the target dataset: \n```\n@dlt.expect_all_or_drop({\"valid_count\": \"count > 0\", \"valid_current_page\": \"current_page_id IS NOT NULL AND current_page_title IS NOT NULL\"})\n\n``` \nUse `expect_all_or_fail` to specify multiple data quality constraints when records that fail validation should halt pipeline execution: \n```\n@dlt.expect_all_or_fail({\"valid_count\": \"count > 0\", \"valid_current_page\": \"current_page_id IS NOT NULL AND current_page_title IS NOT NULL\"})\n\n``` \nYou can also define a collection of expectations as a variable and pass it to one or more queries in your pipeline: \n```\nvalid_pages = {\"valid_count\": \"count > 0\", \"valid_current_page\": \"current_page_id IS NOT NULL 
AND current_page_title IS NOT NULL\"}\n\n@dlt.table\n@dlt.expect_all(valid_pages)\ndef raw_data():\n    # Create raw dataset\n\n@dlt.table\n@dlt.expect_all_or_drop(valid_pages)\ndef prepared_data():\n    # Create cleaned and prepared dataset\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Manage data quality with Delta Live Tables\n##### Quarantine invalid data\n\nThe following example uses expectations in combination with temporary tables and views. This pattern provides you with metrics for records that pass expectation checks during pipeline updates, and provides a way to process valid and invalid records through different downstream paths. \nNote \nThis example reads sample data included in the [Databricks datasets](https:\/\/docs.databricks.com\/discover\/databricks-datasets.html#dbfs-datasets). Because the Databricks datasets are not supported with a pipeline that publishes to Unity Catalog, this example works only with a pipeline configured to publish to the Hive metastore. The pattern itself also works with Unity Catalog enabled pipelines, provided that you read data from [external locations](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html). To learn more about using Unity Catalog with Delta Live Tables, see [Use Unity Catalog with your Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html). 
\n```\nimport dlt\nfrom pyspark.sql.functions import expr\n\nrules = {}\nrules[\"valid_website\"] = \"(Website IS NOT NULL)\"\nrules[\"valid_location\"] = \"(Location IS NOT NULL)\"\n# A record is quarantined when it violates at least one rule.\nquarantine_rules = \"NOT({0})\".format(\" AND \".join(rules.values()))\n\n@dlt.table(\n    name=\"raw_farmers_market\"\n)\ndef get_farmers_market_data():\n    return (\n        spark.read.format('csv').option(\"header\", \"true\")\n        .load('\/databricks-datasets\/data.gov\/farmers_markets_geographic_data\/data-001\/')\n    )\n\n@dlt.table(\n    name=\"farmers_market_quarantine\",\n    temporary=True,\n    partition_cols=[\"is_quarantined\"]\n)\n@dlt.expect_all(rules)\ndef farmers_market_quarantine():\n    return (\n        dlt.read(\"raw_farmers_market\")\n        .select(\"MarketName\", \"Website\", \"Location\", \"State\",\n                \"Facebook\", \"Twitter\", \"Youtube\", \"Organic\", \"updateTime\")\n        .withColumn(\"is_quarantined\", expr(quarantine_rules))\n    )\n\n@dlt.view(\n    name=\"valid_farmers_market\"\n)\ndef get_valid_farmers_market():\n    return (\n        dlt.read(\"farmers_market_quarantine\")\n        .filter(\"is_quarantined=false\")\n    )\n\n@dlt.view(\n    name=\"invalid_farmers_market\"\n)\ndef get_invalid_farmers_market():\n    return (\n        dlt.read(\"farmers_market_quarantine\")\n        .filter(\"is_quarantined=true\")\n    )\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Manage data quality with Delta Live Tables\n##### Validate row counts across tables\n\nYou can add an additional table to your pipeline that defines an expectation to compare row counts between two live tables. The results of this expectation appear in the event log and the Delta Live Tables UI. 
The following example validates equal row counts between the `tbla` and `tblb` tables: \n```\nCREATE OR REFRESH LIVE TABLE count_verification(\nCONSTRAINT no_rows_dropped EXPECT (a_count == b_count)\n) AS SELECT * FROM\n(SELECT COUNT(*) AS a_count FROM LIVE.tbla),\n(SELECT COUNT(*) AS b_count FROM LIVE.tblb)\n\n```\n\n#### Manage data quality with Delta Live Tables\n##### Perform advanced validation with Delta Live Tables expectations\n\nYou can define live tables using aggregate and join queries and use the results of those queries as part of your expectation checking. This is useful if you wish to perform complex data quality checks, for example, ensuring a derived table contains all records from the source table or guaranteeing the equality of a numeric column across tables. You can use the `TEMPORARY` keyword to prevent these tables from being published to the target schema. \nThe following example validates that all expected records are present in the `report` table: \n```\nCREATE TEMPORARY LIVE TABLE report_compare_tests(\nCONSTRAINT no_missing_records EXPECT (r.key IS NOT NULL)\n)\nAS SELECT * FROM LIVE.validation_copy v\nLEFT OUTER JOIN LIVE.report r ON v.key = r.key\n\n``` \nThe following example uses an aggregate to ensure the uniqueness of a primary key: \n```\nCREATE TEMPORARY LIVE TABLE report_pk_tests(\nCONSTRAINT unique_pk EXPECT (num_entries = 1)\n)\nAS SELECT pk, count(*) as num_entries\nFROM LIVE.report\nGROUP BY pk\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Manage data quality with Delta Live Tables\n##### Make expectations portable and reusable\n\nYou can maintain data quality rules separately from your pipeline implementations. \nDatabricks recommends storing the rules in a Delta table with each rule categorized by a tag. You use this tag in dataset definitions to determine which rules to apply. 
\nThe following example creates a table named `rules` to maintain rules: \n```\nCREATE OR REPLACE TABLE\nrules\nAS SELECT\ncol1 AS name,\ncol2 AS constraint,\ncol3 AS tag\nFROM (\nVALUES\n(\"website_not_null\",\"Website IS NOT NULL\",\"validity\"),\n(\"location_not_null\",\"Location IS NOT NULL\",\"validity\"),\n(\"state_not_null\",\"State IS NOT NULL\",\"validity\"),\n(\"fresh_data\",\"to_date(updateTime,'M\/d\/yyyy h:m:s a') > '2010-01-01'\",\"maintained\"),\n(\"social_media_access\",\"NOT(Facebook IS NULL AND Twitter IS NULL AND Youtube IS NULL)\",\"maintained\")\n)\n\n``` \nThe following Python example defines data quality expectations based on the rules stored in the `rules` table. The `get_rules()` function reads the rules from the `rules` table and returns a Python dictionary containing rules matching the `tag` argument passed to the function. The dictionary is applied in the `@dlt.expect_all_*()` decorators to enforce data quality constraints. For example, any records failing the rules tagged with `validity` will be dropped from the `raw_farmers_market` table: \nNote \nThis example reads sample data included in the [Databricks datasets](https:\/\/docs.databricks.com\/discover\/databricks-datasets.html#dbfs-datasets). Because the Databricks datasets are not supported with a pipeline that publishes to Unity Catalog, this example works only with a pipeline configured to publish to the Hive metastore. The pattern itself also works with Unity Catalog enabled pipelines, provided that you read data from [external locations](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html). To learn more about using Unity Catalog with Delta Live Tables, see [Use Unity Catalog with your Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html). 
\n```\nimport dlt\nfrom pyspark.sql.functions import expr, col\n\ndef get_rules(tag):\n    \"\"\"\n    loads data quality rules from a table\n    :param tag: tag to match\n    :return: dictionary of rules that matched the tag\n    \"\"\"\n    rules = {}\n    df = spark.read.table(\"rules\")\n    for row in df.filter(col(\"tag\") == tag).collect():\n        rules[row['name']] = row['constraint']\n    return rules\n\n@dlt.table(\n    name=\"raw_farmers_market\"\n)\n@dlt.expect_all_or_drop(get_rules('validity'))\ndef get_farmers_market_data():\n    return (\n        spark.read.format('csv').option(\"header\", \"true\")\n        .load('\/databricks-datasets\/data.gov\/farmers_markets_geographic_data\/data-001\/')\n    )\n\n@dlt.table(\n    name=\"organic_farmers_market\"\n)\n@dlt.expect_all_or_drop(get_rules('maintained'))\ndef get_organic_farmers_market():\n    return (\n        dlt.read(\"raw_farmers_market\")\n        .filter(expr(\"Organic = 'Y'\"))\n        .select(\"MarketName\", \"Website\", \"State\",\n                \"Facebook\", \"Twitter\", \"Youtube\", \"Organic\",\n                \"updateTime\"\n        )\n    )\n\n``` \nInstead of creating a table named `rules` to maintain rules, you could create a Python module to maintain rules, for example, in a file named `rules_module.py` in the same folder as the notebook: \n```\ndef get_rules_as_list_of_dict():\n    return [\n        {\n            \"name\": \"website_not_null\",\n            \"constraint\": \"Website IS NOT NULL\",\n            \"tag\": \"validity\"\n        },\n        {\n            \"name\": \"location_not_null\",\n            \"constraint\": \"Location IS NOT NULL\",\n            \"tag\": \"validity\"\n        },\n        {\n            \"name\": \"state_not_null\",\n            \"constraint\": \"State IS NOT NULL\",\n            \"tag\": \"validity\"\n        },\n        {\n            \"name\": \"fresh_data\",\n            \"constraint\": \"to_date(updateTime,'M\/d\/yyyy h:m:s a') > '2010-01-01'\",\n            \"tag\": \"maintained\"\n        },\n        {\n            \"name\": \"social_media_access\",\n            \"constraint\": \"NOT(Facebook IS NULL AND Twitter IS NULL AND Youtube IS NULL)\",\n            \"tag\": \"maintained\"\n        }\n    ]\n\n``` \nThen modify the preceding notebook by importing the module and changing the `get_rules()` function to read from the module instead of from the 
`rules` table: \n```\nimport dlt\nfrom rules_module import *\nfrom pyspark.sql.functions import expr, col\n\ndf = spark.createDataFrame(get_rules_as_list_of_dict())\n\ndef get_rules(tag):\n    \"\"\"\n    loads data quality rules from the DataFrame built from rules_module\n    :param tag: tag to match\n    :return: dictionary of rules that matched the tag\n    \"\"\"\n    rules = {}\n    for row in df.filter(col(\"tag\") == tag).collect():\n        rules[row['name']] = row['constraint']\n    return rules\n\n@dlt.table(\n    name=\"raw_farmers_market\"\n)\n@dlt.expect_all_or_drop(get_rules('validity'))\ndef get_farmers_market_data():\n    return (\n        spark.read.format('csv').option(\"header\", \"true\")\n        .load('\/databricks-datasets\/data.gov\/farmers_markets_geographic_data\/data-001\/')\n    )\n\n@dlt.table(\n    name=\"organic_farmers_market\"\n)\n@dlt.expect_all_or_drop(get_rules('maintained'))\ndef get_organic_farmers_market():\n    return (\n        dlt.read(\"raw_farmers_market\")\n        .filter(expr(\"Organic = 'Y'\"))\n        .select(\"MarketName\", \"Website\", \"State\",\n                \"Facebook\", \"Twitter\", \"Youtube\", \"Organic\",\n                \"updateTime\"\n        )\n    )\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### What are dashboard parameters?\n\nDashboard parameters are one way to make dashboards interactive, enabling viewers to use single-value selectors and date pickers to input specific values into dataset queries at runtime. For example, parameters can filter data based on criteria like dates and product categories before it\u2019s aggregated in a SQL query, allowing for more efficient querying and precise analysis. \nParameters are added to datasets and connected to one or more widgets on the canvas of a dashboard by a dashboard author or editor. Dashboard viewers interact with the dashboard data by selecting values in filter widgets at runtime. This reruns the associated queries and presents visualizations built on the filtered data. 
\nParameters directly modify the query, which can be powerful. Compared with parameters, dataset filters can also provide dashboard interactivity, along with more features and better performance on large datasets. See [Filters](https:\/\/docs.databricks.com\/dashboards\/index.html#filters).\n\n#### What are dashboard parameters?\n##### Add a parameter to a query\n\nYou must have at least **Can Edit** permissions on the draft dashboard to add a parameter to a dashboard dataset. You can add parameters directly to the dataset queries in the **Data** tab. \n![Gif shows an example of the following steps.](https:\/\/docs.databricks.com\/_images\/add-param.gif) \nTo add a parameter to a query: \n1. Place your cursor where you want to place the parameter in your query.\n2. Click **Add parameter** to insert a new parameter. \nThis creates a new parameter with the default name `parameter`. To change the default name, replace it in the query editor. You can also add a parameter by typing a colon followed by the parameter name, for example `:parameter`, directly in the query editor.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/parameters.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### What are dashboard parameters?\n##### Edit a query parameter\n\nTo edit a parameter: \n1. Click ![Gear icon](https:\/\/docs.databricks.com\/_images\/gear-icon.png) next to the parameter name. A **Parameter details** dialog appears and includes the following configuration options: \n* **Keyword**: The keyword that represents the parameter in the query. This can only be changed by directly updating the text in the query.\n* **Display name**: The name in the filter editor. By default, the display name is the same as the keyword.\n* **Type**: Supported types include **String**, **Date**, **Date and Time**, and **Numeric**. \n+ The default type is **String**.\n+ The **Numeric** datatype allows you to choose between **Decimal** and **Integer**. The default numeric type is **Decimal**.\n2. 
Click another part of the UI to close the dialog.\n\n#### What are dashboard parameters?\n##### Set a default parameter value\n\nYou can set a default value for your parameter by typing it into the text field under the parameter name. Run the query to preview the query results with the parameter value applied. Running the query also saves the default value. When you set this parameter using a filter widget on the canvas, the default value is used.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/parameters.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### What are dashboard parameters?\n##### Query-based parameters\n\nQuery-based parameters allow authors to define a dynamic or static list of values that viewers can choose from when setting parameters as they explore data in a dashboard. They are defined by combining a field filter and a parameter filter in a single filter widget. \nTo create a query-based parameter, the dashboard author performs the following steps: \n1. Create a dataset whose result set is limited to a list of possible parameter values.\n2. Create a dataset query that uses a parameter.\n3. Configure a filter widget on the canvas that filters on a field and uses a parameter. \n* The **Fields** configurations should be set to use the field with the desired list of eligible parameter values.\n* The **Parameters** configuration should be set to select a parameter value. \nNote \nIf a dataset used in query-based parameters is also used in other visualizations on a dashboard, a viewer\u2019s filter selection modifies all connected queries. To avoid this, authors should create a dedicated dataset for query-based parameters that is not used in any other visualizations on the dashboard. \nSee [Use query-based parameters](https:\/\/docs.databricks.com\/dashboards\/tutorials\/query-based-params.html) for a step-by-step tutorial that demonstrates how to add a query-based parameter and visualization. 
\n### Create a dynamic parameter list \nTo create a dynamic dataset that populates the drop-down that viewers use to select parameter values, write a SQL query that returns a single field and includes all the values in that field. Any new value in that field is automatically added as a parameter selection when the dataset is updated. An example SQL query is as follows: \n```\nSELECT\nDISTINCT c_mktsegment\nFROM\nsamples.tpch.customer\n\n``` \n### Create a static parameter list \nYou can create a static dataset that includes only values that you hardcode into your dataset. An example query is as follows: \n```\nSELECT\n*\nFROM\n(\nVALUES\n('MACHINERY'),\n('BUILDING'),\n('FURNITURE'),\n('HOUSEHOLD'),\n('AUTOMOBILE')\n) AS data(available_choices)\n\n``` \n### Filter types \n**Single Value** and **Date Picker** filters support setting parameters. When setting query-based parameters with a **Date Picker** filter, dates that appear in the underlying query\u2019s results are shown in black. Dates that do not appear in the query results are gray. Users can choose gray dates even though they are not included in the underlying query.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/parameters.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### What are dashboard parameters?\n##### Remove a query parameter\n\nTo remove a parameter, delete it from your query.\n\n#### What are dashboard parameters?\n##### Show parameters on the dashboard\n\nAdding a filter to your dashboard canvas allows viewers to select and modify parameter values, so they can interactively explore and analyze the data. If you do not expose the parameter on the dashboard, viewers see only query results that use the default parameter value that you set in the query. \nTo add a parameter to your dashboard: \n1. Click ![Filter Icon](https:\/\/docs.databricks.com\/_images\/lakeview-filter.png) **Add a filter (field\/parameter)**.\n2. 
Click ![add field icon](https:\/\/docs.databricks.com\/_images\/lakeview-add-vis-field.png) next to **Parameters** in the configuration panel.\n3. Click the parameter name you want the viewer to use with this widget.\n\n#### What are dashboard parameters?\n##### Include parameters in the URL\n\nParameter settings are stored in the URL, allowing users to bookmark it to maintain their dashboard\u2019s state, including pre-set filters and parameters, or to share it with others for consistent application of the same filters and parameters.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/parameters.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### What are dashboard parameters?\n##### Parameter syntax examples\n\nThe following examples demonstrate some common use cases for parameters. \n### Insert a date \nThe following example includes a **Date** parameter that limits query results to records after a specific date. \n```\nSELECT\no_orderdate AS Date,\no_orderpriority AS Priority,\nsum(o_totalprice) AS `Total Price`\nFROM\nsamples.tpch.orders\nWHERE\no_orderdate > :date_param\nGROUP BY\n1,\n2\n\n``` \n### Insert a number \nThe following example includes a **Numeric** parameter that limits results to records where the `o_totalprice` field is greater than the provided parameter value. \n```\nSELECT\no_orderdate AS Date,\no_orderpriority AS Priority,\no_totalprice AS Price\nFROM\nsamples.tpch.orders\nWHERE\no_totalprice > :num_param\n\n``` \n### Insert a field name \nIn the following example, the `field_param` parameter is used with the `IDENTIFIER` function to choose, at runtime, the column that the query compares against a fixed threshold value. The parameter value should be a column name from the table used in the query. \n```\nSELECT\n*\nFROM\nsamples.tpch.orders\nWHERE\nIDENTIFIER(:field_param) < 10000\n\n``` \n### Insert database objects \nThe following example creates three parameters: `catalog`, `schema`, and `table`. 
Dashboard viewers can use filter widgets on the canvas to select parameter values. \n```\nSELECT\n*\nFROM\nIDENTIFIER(:catalog || '.' || :schema || '.' || :table)\n\n``` \nSee [IDENTIFIER clause](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-names-identifier-clause.html). \nImportant \nEnabling dashboard viewers to access data through parameter selections, like table or catalog names, could lead to accidental exposure of sensitive information. If you\u2019re publishing a dashboard with these options, Databricks recommends not embedding credentials in the published dashboard. \n### Concatenate multiple parameters \nYou can include parameters in other SQL functions. This example allows the viewer to select an employee title and a number ID. The query uses the `format_string` function to concatenate the two strings and filter on the rows that match. See [format\\_string function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/format_string.html). \n```\nSELECT\no_orderkey,\no_clerk\nFROM\nsamples.tpch.orders\nWHERE\no_clerk LIKE format_string('%s%s', :title, :emp_number)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/parameters.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### What are dashboard parameters?\n##### Dashboard parameters vs. Databricks SQL query parameters\n\nDashboard parameters use the same syntax as named parameter markers. See [Named parameter markers](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-parameter-marker.html#named-parameter-markers). Dashboards do not support Databricks SQL style parameters.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/parameters.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Manage access to Delta Sharing data shares (for providers)\n\nThis article explains how to grant a data recipient access to a Delta Sharing share. 
It also explains how to view, update, and revoke access.\n\n### Manage access to Delta Sharing data shares (for providers)\n#### Requirements\n\nTo share data with recipients: \n* You must use a Databricks workspace that has a Unity Catalog metastore attached.\n* You must use a SQL warehouse or cluster that uses a Unity-Catalog-capable cluster access mode.\n* Shares and recipients must already be defined.\n* You must be one of the following: \n+ Metastore admin.\n+ User with delegated permissions or ownership on both the share and the recipient objects ((`USE SHARE` + `SET SHARE PERMISSION`) or share owner) AND (`USE RECIPIENT` or recipient owner).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/grant-access.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Manage access to Delta Sharing data shares (for providers)\n#### Grant recipient access to share\n\nTo grant share access to recipients, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or SQL commands in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: One of the following: \n* Metastore admin.\n* Delegated permissions or ownership on both the share and the recipient objects ((`USE SHARE` + `SET SHARE PERMISSION`) or share owner) AND (`USE RECIPIENT` or recipient owner). \nTo add recipients to a share (starting at the share): \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Shares** tab, find and select the share.\n4. Click **Add recipient**.\n5. On the **Add recipient** dialog, start typing the recipient name or click the drop-down menu to select the recipients you want to add to the share.\n6. Click **Add**. \nTo grant share access to a recipient (starting at the recipient): \n1. 
In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Recipients** tab, find and select the recipient.\n4. Click **Grant share**.\n5. On the **Grant share** dialog, start typing the share name or click the drop-down menu to select the shares you want to grant.\n6. Click **Grant**. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nGRANT SELECT ON SHARE <share-name> TO RECIPIENT <recipient-name>;\n\n``` \n`SELECT` is the only privilege that you can grant a recipient on a share. \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). Replace `<share-name>` with the name of the share you want to grant to the recipient, and replace `<recipient-name>` with the recipient\u2019s name. `SELECT` is the only privilege that you can grant on a share. \n```\ndatabricks shares update <share-name> \\\n--json='{\n\"changes\": [\n{\n\"principal\": \"<recipient-name>\",\n\"add\": [\n\"SELECT\"\n]\n}\n]\n}'\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/grant-access.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Manage access to Delta Sharing data shares (for providers)\n#### Revoke recipient access to a share\n\nTo revoke a recipient\u2019s access to a share, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `REVOKE ON SHARE` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: Metastore admin, user with the `USE SHARE` privilege, or share object owner. \nTo revoke a recipient\u2019s access to a share, starting at the share: \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. 
In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Shares** tab, find and select the share.\n4. On the **Recipients** tab, find the recipient.\n5. Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu (also known as the three-dot menu) and select **Revoke**.\n6. On the confirmation dialog, click **Revoke**. \nTo revoke a recipient\u2019s access to a share, starting at the recipient: \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Recipients** tab, find and select the recipient.\n4. On the **Shares** tab, find the share.\n5. Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu (also known as the three-dot menu) on the share row and select **Revoke**.\n6. On the confirmation dialog, click **Revoke**. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nREVOKE SELECT ON SHARE <share-name> FROM RECIPIENT <recipient-name>;\n\n``` \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). Replace `<share-name>` with the name of the share you want to remove for the recipient, and replace `<recipient-name>` with the recipient\u2019s name. `SELECT` is the only privilege that you can remove for a recipient. 
 \n```\ndatabricks shares update <share-name> \\\n--json='{\n\"changes\": [\n{\n\"principal\": \"<recipient-name>\",\n\"remove\": [\n\"SELECT\"\n]\n}\n]\n}'\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/grant-access.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Manage access to Delta Sharing data shares (for providers)\n#### View grants on a share or grants possessed by a recipient\n\nTo view the current grants on a share, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `SHOW GRANTS ON SHARE` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: If you are viewing recipients granted access to a share, you must be a metastore admin, a user with the `USE SHARE` privilege, or the share object owner. If you are viewing shares granted to a recipient, you must be a metastore admin, a user with the `USE RECIPIENT` privilege, or the recipient object owner. \nTo view recipients with access to a share: \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Shares** tab, find and select the share.\n4. Go to the **Recipients** tab to view all recipients who have access to the share. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nSHOW GRANTS ON SHARE <share-name>;\n\n``` \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). \n```\ndatabricks shares share-permissions <share-name>\n\n``` \nTo view the current share grants possessed by a recipient, you can use Catalog Explorer, the Databricks CLI, or the `SHOW GRANTS TO RECIPIENT` SQL command in a Databricks notebook or the Databricks SQL query editor. \nTo view shares granted to a recipient: \n1. 
In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Recipients** tab, find and select the recipient.\n4. Go to the **Shares** tab to view all shares that the recipient has access to. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nSHOW GRANTS TO RECIPIENT <recipient-name>;\n\n``` \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/archive\/dev-tools\/cli\/index.html). \n```\ndatabricks recipients share-permissions <recipient-name>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/grant-access.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Audit Unity Catalog events\n\nThis article contains audit log information for Unity Catalog events. Unity Catalog captures an audit log of actions performed against the metastore. This enables admins to access fine-grained details about who accessed a given dataset and the actions they performed.\n\n#### Audit Unity Catalog events\n##### Configure audit logs\n\nTo access audit logs for Unity Catalog events, you must [enable and configure audit logs](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-log-delivery.html) for your account. \nImportant \nUnity Catalog activity is logged at the level of the account. Do not enter a value into `workspace_ids_filter`. \nAudit logs for each workspace and account-level activities are delivered to your account. Logs are delivered to the S3 bucket that you configure.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/audit.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Audit Unity Catalog events\n##### Audit log format\n\nIn Databricks, audit logs output events in a JSON format. 
The following example is for a `createMetastoreAssignment` event. \n```\n{\n\"version\":\"2.0\",\n\"auditLevel\":\"ACCOUNT_LEVEL\",\n\"timestamp\":1629775584891,\n\"orgId\":\"3049056262456431186970\",\n\"shardName\":\"test-shard\",\n\"accountId\":\"77636e6d-ac57-484f-9302-f7922285b9a5\",\n\"sourceIPAddress\":\"10.2.91.100\",\n\"userAgent\":\"curl\/7.64.1\",\n\"sessionId\":\"f836a03a-d360-4792-b081-baba525324312\",\n\"userIdentity\":{\n\"email\":\"crampton.rods@email.com\",\n\"subjectName\":null\n},\n\"serviceName\":\"unityCatalog\",\n\"actionName\":\"createMetastoreAssignment\",\n\"requestId\":\"ServiceMain-da7fa5878f40002\",\n\"requestParams\":{\n\"workspace_id\":\"30490590956351435170\",\n\"metastore_id\":\"abc123456-8398-4c25-91bb-b000b08739c7\",\n\"default_catalog_name\":\"main\"\n},\n\"response\":{\n\"statusCode\":200,\n\"errorMessage\":null,\n\"result\":null\n},\n\"MAX_LOG_MESSAGE_LENGTH\":16384\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/audit.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Audit Unity Catalog events\n##### Audit log analysis example\n\nThe following steps and notebook create a dashboard you can use to analyze your account\u2019s audit log data. \n1. Create a cluster with the **Single User** [access mode](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode). See [Access modes](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode).\n2. Import the following example notebook into your workspace and attach it to the cluster you just created. See [Import a notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html#import-notebook). \n### Audit log analysis notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/aws-audit-logs-etl-unity-catalog.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n3. 
A series of widgets appears at the top of the page. Enter a value for **checkpoint** and optionally enter values for the remaining fields. \n* **checkpoint**: The path where streaming checkpoints are stored, either in DBFS or S3.\n* **catalog**: Name of the catalog where you want to store the audit tables (catalog must already exist). Make sure that you have `USE CATALOG` and `CREATE` privileges on it.\n* **database**: Name of the database (schema) where you want to store the audit tables (will be created if it doesn\u2019t already exist). If it does already exist, make sure that you have `USE SCHEMA` and `CREATE` privileges on it.\n* **log\\_bucket**: The path to the storage location where your audit logs reside. This should be in the following format: \n```\n<bucket-name>\/<delivery-path-prefix>\/workspaceId=0\/\n\n``` \nFor information about configuring audit logs, see [Configure audit log delivery](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-log-delivery.html). Append `workspaceId=0` to the path to get the account-level audit logs, including Unity Catalog events.\n* **start\\_date**: Filter events by start date. \nValues for `<bucket-name>` and `<delivery-path-prefix>` are automatically filled from the notebook widgets.\n4. Run the notebook to create the audit report.\n5. 
To modify the report or to return activities for a given user, see commands 23 and 24 in the notebook.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/audit.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Audit Unity Catalog events\n##### Unity Catalog audit log events\n\nFor a list of auditable events in Unity Catalog, see [Unity Catalog events](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html#uc).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/audit.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Create and manage scheduled notebook jobs\n\nYou can create and manage notebook jobs directly in the notebook UI. If a notebook is already assigned to one or more jobs, you can create and manage schedules for those jobs. If a notebook is not assigned to a job, you can create a job and a schedule to run the notebook.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Create and manage scheduled notebook jobs\n##### Schedule a notebook job\n\nTo schedule a notebook job to run periodically: \n1. In the notebook, click ![Notebook schedule button](https:\/\/docs.databricks.com\/_images\/schedule-button.png) at the top right. If no jobs exist for this notebook, the Schedule dialog appears. \n![Schedule notebook dialog](https:\/\/docs.databricks.com\/_images\/schedule-dialog.png) \nIf jobs already exist for the notebook, the Jobs List dialog appears. To display the Schedule dialog, click **Add a schedule**. \n![Job list dialog](https:\/\/docs.databricks.com\/_images\/job-list-dialog.png)\n2. In the Schedule dialog, optionally enter a name for the job. The default name is the name of the notebook.\n3. 
Select **Manual** to run your job only when manually triggered, or **Scheduled** to define a schedule for running the job. If you select **Scheduled**, use the drop-downs to specify the frequency, time, and time zone.\n4. In the **Compute** drop-down, select the compute resource to run the task. \nIf the notebook is attached to a SQL warehouse, the default compute is the same SQL warehouse. \nPreview \nServerless compute for workflows is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). For information on eligibility and enablement, see [Enable serverless compute public preview](https:\/\/docs.databricks.com\/admin\/workspace-settings\/serverless.html). \nIf your workspace is Unity Catalog-enabled and [Serverless Workflows](https:\/\/docs.databricks.com\/workflows\/jobs\/run-serverless-jobs.html) is enabled, the job runs on serverless compute by default. \nOtherwise, if you have **Allow Cluster Creation** permissions, the job runs on a [new job cluster](https:\/\/docs.databricks.com\/workflows\/jobs\/use-compute.html) by default. To edit the configuration of the default job cluster, click **Edit** at the right of the field to display the [cluster configuration dialog](https:\/\/docs.databricks.com\/compute\/configure.html). If you do not have **Allow Cluster Creation** permissions, the job runs on the cluster that the notebook is attached to by default. If the notebook is not attached to a cluster, you must select a cluster from the **Cluster** dropdown.\n5. Optionally, enter any **Parameters** to pass to the job. Click **Add** and specify the key and value of each parameter. Parameters set the value of the [notebook widget](https:\/\/docs.databricks.com\/notebooks\/widgets.html) specified by the key of the parameter. Use [dynamic value references](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html) to pass a limited set of dynamic values as part of a parameter value.\n6. 
Optionally, specify email addresses to receive **Alerts** on job events. See [Add email and system notifications for job events](https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html).\n7. Click **Submit**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Create and manage scheduled notebook jobs\n##### Run a notebook job\n\nTo manually run a notebook job: \n1. In the notebook, click ![Notebook schedule button](https:\/\/docs.databricks.com\/_images\/schedule-button.png) at the top right.\n2. Click **Run now**.\n3. To view the [job run details](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#job-run-details), click ![New Tab Icon](https:\/\/docs.databricks.com\/_images\/open-in-new-tab.png).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Create and manage scheduled notebook jobs\n##### Manage scheduled notebook jobs\n\nTo display jobs associated with this notebook, click the **Schedule** button. The jobs list dialog appears, showing all jobs currently defined for this notebook. To manage jobs, click ![Jobs Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/jobs-vertical-ellipsis.png) at the right of a job in the list. \n![Job list menu](https:\/\/docs.databricks.com\/_images\/job-list-menu.png) \nFrom this menu, you can edit the schedule, [clone](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#clone-job) the job, view [job run details](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#view-job-run-list), pause the job, resume the job, or delete a scheduled job. \nWhen you clone a scheduled job, a new job is created with the same parameters as the original. 
The new job appears in the list with the name `Clone of <initial job name>`. \nHow you edit a job depends on the complexity of the job\u2019s schedule. Either the Schedule dialog or the [Job details panel](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#job-edit) displays, allowing you to edit the schedule, cluster, parameters, and so on.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html"} +{"content":"# Compute\n## Use compute\n#### Troubleshoot compute issues\n\nThis article provides you with resources you can use in the event you need to troubleshoot compute behavior in your workspace. The topics in this article relate to compute start-up issues. \nFor other troubleshooting articles, see: \n* [Debugging with the Apache Spark UI](https:\/\/docs.databricks.com\/compute\/troubleshooting\/debugging-spark-ui.html)\n* [Diagnose cost and performance issues using the Spark UI](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/index.html)\n* [Handling large queries in interactive workflows](https:\/\/docs.databricks.com\/compute\/troubleshooting\/query-watchdog.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/troubleshooting\/index.html"} +{"content":"# Compute\n## Use compute\n#### Troubleshoot compute issues\n##### A new compute does not respond\n\nAfter what seems like a successful workspace deployment, you might notice that your first test compute does not respond. After approximately 20-30 minutes, check your [compute event log](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#event-log). You might see an error message similar to: \n```\nThe compute plane network is misconfigured. Please verify that the network for your compute plane is configured correctly. Error message: Node daemon ping timeout in 600000 ms ...\n\n``` \nThis message indicates that the routing or the firewall is incorrect. 
Databricks requested EC2 instances for a new compute, but encountered a long time delay waiting for the EC2 instance to bootstrap and connect to the control plane. The compute manager terminates the instances and reports this error. \nYour network configuration must allow compute node instances to successfully connect to the Databricks [control plane](https:\/\/docs.databricks.com\/getting-started\/overview.html). For a faster troubleshooting technique than using a compute, you can deploy an EC2 instance into one of the workspace subnets and do typical network troubleshooting steps like `nc`, `ping`, `telnet`, `traceroute`, etc. The Relay CNAME for each region is mentioned [in the customer-managed VPC article](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#firewall). For the Artifact Storage, ensure that there\u2019s a successful networking path to S3. \nFor access domains and IPs by region, see [IP addresses and domains](https:\/\/docs.databricks.com\/resources\/supported-regions.html#ip-domain-aws). For regional endpoints, see [(Recommended) Configure regional endpoints](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#regional-endpoints). The following example uses the AWS region `eu-west-1`: \n```\n# Verify access to the web application\nnc -zv ireland.cloud.databricks.com 443\n\n# Verify access to the secure compute connectivity relay\nnc -zv tunnel.eu-west-1.cloud.databricks.com 443\n\n# Verify S3 global and regional access\nnc -zv s3.amazonaws.com 443\nnc -zv s3.eu-west-1.amazonaws.com 443\n\n# Verify STS global and regional access\nnc -zv sts.amazonaws.com 443\nnc -zv sts.eu-west-1.amazonaws.com 443\n\n# Verify regional Kinesis access\nnc -zv kinesis.eu-west-1.amazonaws.com 443\n\n``` \nIf these all return correctly, the networking could be configured correctly but there could be another issue if you are using a firewall. 
The firewall may have deep packet inspection, SSL inspection, or something else that causes Databricks commands to fail. Using an EC2 instance in the Databricks subnet, try the following: \n```\ncurl -X GET -H 'Authorization: Bearer <token>' \\\nhttps:\/\/<workspace-name>.cloud.databricks.com\/api\/2.0\/clusters\/spark-versions\n\n``` \nReplace `<token>` with your own personal access token and use the correct URL for your workspace. See the [Token management API](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement). \nIf this request fails, try the `-k` option with your request to remove SSL verification. If this works with the `-k` option, then the firewall is causing an issue with SSL certificates. \nLook at the SSL certificates using the following and replace the domain name with the [control plane web application domain](https:\/\/docs.databricks.com\/resources\/supported-regions.html#ip-domain-aws) for your region: \n```\nopenssl s_client -showcerts -connect oregon.cloud.databricks.com:443\n\n``` \nThis command shows the return code and the Databricks certificates. If it returns an error, it\u2019s a sign that your firewall is misconfigured and must be fixed. \nNote that [SSL issues are not a networking layer issue](https:\/\/security.stackexchange.com\/questions\/19681\/where-does-ssl-encryption-take-place). Viewing traffic at the firewall will not show these SSL issues. 
Looking at source and destination requests will work as expected.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/troubleshooting\/index.html"} +{"content":"# Compute\n## Use compute\n#### Troubleshoot compute issues\n##### Problems using your metastore or compute event log includes `METASTORE_DOWN` events\n\nIf your workspace seems to be up and you can set up compute, but you have `METASTORE_DOWN` events in your [compute event logs](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#view-log), or if your [metastore](https:\/\/docs.databricks.com\/archive\/external-metastores\/index.html) does not seem to work, confirm if you use a Web Application Firewall (WAF) like Squid proxy. Compute members must connect to several services that don\u2019t work over a WAF.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/troubleshooting\/index.html"} +{"content":"# Compute\n## Use compute\n#### Troubleshoot compute issues\n##### Compute start error: failed to launch Spark container on instance\n\nYou might see a compute log error such as: \n```\nCluster start error: failed to launch spark container on instance ...\nException: Could not add container for ... with address ....\nTimed out with exception after 1 attempts\n\n``` \nThis [compute log error](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#view-log) is likely due to the instance not being able to use STS to get into the root S3 bucket. This usually happens when you are implementing exfiltration protection, using VPC endpoints to lock down communication, or adding a firewall. \nTo fix, do one of the following: \n* Change the firewall to allow the global STS endpoint to pass (`sts.amazonaws.com`) as documented in the [VPC requirements docs](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html).\n* Use a VPC endpoint to set up the [regional endpoint](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#regional-endpoints). 
 \nTo get more information about the error, call the `decode-authorization-message` AWS CLI command. For details, see the [AWS article for decode-authorization-message](https:\/\/docs.aws.amazon.com\/cli\/latest\/reference\/sts\/decode-authorization-message.html). The command looks like the following, where `<encoded-message>` is the encoded message returned with the error: \n```\naws sts decode-authorization-message --encoded-message <encoded-message>\n\n``` \nYou may see this error if you set up a VPC endpoint (VPCE) with a different security group for the STS VPCE than the workspaces. You can either update the security groups to allow resources in each security group to talk to each other or put the STS VPCE in the same security group as the workspace subnets. \nCompute nodes need to use STS to access the root S3 bucket using the customer S3 policy. A network path has to be available to the AWS STS service from Databricks compute nodes.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/troubleshooting\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Troubleshooting and limitations\n##### Troubleshooting\n\n**Error message: `Database recommender_system does not exist in the Hive metastore.`** \nA feature table is stored as a Delta table. The database is specified by the table name prefix, so a feature table *recommender\\_system.customer\\_features* will be stored in the *recommender\\_system* database. \nTo create the database, run: \n```\n%sql CREATE DATABASE IF NOT EXISTS recommender_system;\n\n``` \n**Error message: `ModuleNotFoundError: No module named 'databricks.feature_engineering'` or `ModuleNotFoundError: No module named 'databricks.feature_store'`** \nThis error occurs when databricks-feature-engineering is not installed on the Databricks Runtime you are using. 
 \ndatabricks-feature-engineering [is available on PyPI](https:\/\/pypi.org\/project\/databricks-feature-engineering\/), and can be installed with: \n```\n%pip install databricks-feature-engineering\n\n``` \n**Error message: `ModuleNotFoundError: No module named 'databricks.feature_store'`** \nThis error occurs when databricks-feature-store is not installed on the Databricks Runtime you are using. \nNote \nFor Databricks Runtime 14.3 and above, install databricks-feature-engineering instead via `%pip install databricks-feature-engineering`. \ndatabricks-feature-store [is available on PyPI](https:\/\/pypi.org\/project\/databricks-feature-store\/), and can be installed with: \n```\n%pip install databricks-feature-store\n\n``` \n**Error message: `Invalid input. Data is not compatible with model signature. Cannot convert non-finite values...'`** \nThis error can occur when using a Feature Store-packaged model in Databricks Model Serving. When providing custom feature values in an input to the endpoint, you must provide a value for the feature for each row in the input, or for no rows. You cannot provide custom values for a feature for only some rows.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/troubleshooting-and-limitations.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Troubleshooting and limitations\n##### Limitations\n\n* Databricks Runtime ML clusters are not supported when using Delta Live Tables as feature tables. Instead, use a shared cluster and manually install the client using `%pip install databricks-feature-engineering`. You must also install any other required ML libraries. \n```\n%pip install databricks-feature-engineering\n\n```\n* Materialized views and streaming tables are managed by Delta Live Tables pipelines. `fe.write_table()` does not update them. Instead, use the Delta Live Tables pipeline to update the tables. 
\n* Workspace Feature Store does not support deleting individual features from a feature table.\n* A maximum of 100 [on-demand features](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/on-demand-features.html) can be used in a model.\n* Publishing Delta Live Tables to third-party online stores is not supported. Use [Databricks Online Tables](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html) as an online store.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/troubleshooting-and-limitations.html"} +{"content":"# What is Delta Lake?\n### Work with Delta Lake table history\n\nEach operation that modifies a Delta Lake table creates a new table version. You can use history information to audit operations, roll back a table, or query a table at a specific point in time using time travel. \nNote \nDatabricks does not recommend using Delta Lake table history as a long-term backup solution for data archival. Databricks recommends using only the past 7 days for time travel operations unless you have set both data and log retention configurations to a larger value.\n\n### Work with Delta Lake table history\n#### Retrieve Delta table history\n\nYou can retrieve information including the operation, user, and timestamp for each write to a Delta table by running the `history` command. The operations are returned in reverse chronological order. \nTable history retention is determined by the table setting `delta.logRetentionDuration`, which is 30 days by default. \nNote \nTime travel and table history are controlled by different retention thresholds. See [What is Delta Lake time travel?](https:\/\/docs.databricks.com\/delta\/history.html#time-travel). 
\n```\nDESCRIBE HISTORY '\/data\/events\/' -- get the full history of the table\n\nDESCRIBE HISTORY delta.`\/data\/events\/`\n\nDESCRIBE HISTORY '\/data\/events\/' LIMIT 1 -- get the last operation only\n\nDESCRIBE HISTORY eventsTable\n\n``` \nFor Spark SQL syntax details, see [DESCRIBE HISTORY](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-describe-history.html). \nSee the [Delta Lake API documentation](https:\/\/docs.databricks.com\/delta\/index.html#delta-api) for Scala\/Java\/Python syntax details. \n[Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html) provides a visual view of this detailed table information and history for Delta tables. In addition to the table schema and sample data, you can click the **History** tab to see the table history that displays with `DESCRIBE HISTORY`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/history.html"} +{"content":"# What is Delta Lake?\n### Work with Delta Lake table history\n#### History schema\n\nThe output of the `history` operation has the following columns. \n| Column | Type | Description |\n| --- | --- | --- |\n| version | long | Table version generated by the operation. |\n| timestamp | timestamp | When this version was committed. |\n| userId | string | ID of the user that ran the operation. |\n| userName | string | Name of the user that ran the operation. |\n| operation | string | Name of the operation. |\n| operationParameters | map | Parameters of the operation (for example, predicates.) |\n| job | struct | Details of the job that ran the operation. |\n| notebook | struct | Details of notebook from which the operation was run. |\n| clusterId | string | ID of the cluster on which the operation ran. |\n| readVersion | long | Version of the table that was read to perform the write operation. |\n| isolationLevel | string | Isolation level used for this operation. |\n| isBlindAppend | boolean | Whether this operation appended data. 
|\n| operationMetrics | map | Metrics of the operation (for example, number of rows and files modified.) |\n| userMetadata | string | User-defined commit metadata if it was specified | \n```\n+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+-----------------+-------------+--------------------+\n|version| timestamp|userId|userName|operation| operationParameters| job|notebook|clusterId|readVersion| isolationLevel|isBlindAppend| operationMetrics|\n+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+-----------------+-------------+--------------------+\n| 5|2019-07-29 14:07:47| ###| ###| DELETE|[predicate -> [\"(...|null| ###| ###| 4|WriteSerializable| false|[numTotalRows -> ...|\n| 4|2019-07-29 14:07:41| ###| ###| UPDATE|[predicate -> (id...|null| ###| ###| 3|WriteSerializable| false|[numTotalRows -> ...|\n| 3|2019-07-29 14:07:29| ###| ###| DELETE|[predicate -> [\"(...|null| ###| ###| 2|WriteSerializable| false|[numTotalRows -> ...|\n| 2|2019-07-29 14:06:56| ###| ###| UPDATE|[predicate -> (id...|null| ###| ###| 1|WriteSerializable| false|[numTotalRows -> ...|\n| 1|2019-07-29 14:04:31| ###| ###| DELETE|[predicate -> [\"(...|null| ###| ###| 0|WriteSerializable| false|[numTotalRows -> ...|\n| 0|2019-07-29 14:01:40| ###| ###| WRITE|[mode -> ErrorIfE...|null| ###| ###| null|WriteSerializable| true|[numFiles -> 2, n...|\n+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+-----------------+-------------+--------------------+\n\n``` \nNote \n* A few of the other columns are not available if you write into a Delta table using the following methods: \n+ [JDBC or ODBC](https:\/\/docs.databricks.com\/integrations\/jdbc-odbc-bi.html)\n+ [Run a command using the REST API](https:\/\/docs.databricks.com\/api\/workspace\/commandexecution\/execute)\n+ [Some task types for 
jobs](https:\/\/docs.databricks.com\/api\/workspace\/jobs\/create)\n* Columns added in the future will always be added after the last column.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/history.html"} +{"content":"# What is Delta Lake?\n### Work with Delta Lake table history\n#### Operation metrics keys\n\nThe `history` operation returns a collection of operations metrics in the `operationMetrics` column map. \nThe following tables list the map key definitions by operation. \n| Operation | Metric name | Description |\n| --- | --- | --- |\n| WRITE, CREATE TABLE AS SELECT, REPLACE TABLE AS SELECT, COPY INTO | | |\n| | numFiles | Number of files written. |\n| | numOutputBytes | Size in bytes of the written contents. |\n| | numOutputRows | Number of rows written. |\n| STREAMING UPDATE | | |\n| | numAddedFiles | Number of files added. |\n| | numRemovedFiles | Number of files removed. |\n| | numOutputRows | Number of rows written. |\n| | numOutputBytes | Size of write in bytes. |\n| DELETE | | |\n| | numAddedFiles | Number of files added. Not provided when partitions of the table are deleted. |\n| | numRemovedFiles | Number of files removed. |\n| | numDeletedRows | Number of rows removed. Not provided when partitions of the table are deleted. |\n| | numCopiedRows | Number of rows copied in the process of deleting files. |\n| | executionTimeMs | Time taken to execute the entire operation. |\n| | scanTimeMs | Time taken to scan the files for matches. |\n| | rewriteTimeMs | Time taken to rewrite the matched files. |\n| TRUNCATE | | |\n| | numRemovedFiles | Number of files removed. |\n| | executionTimeMs | Time taken to execute the entire operation. |\n| MERGE | | |\n| | numSourceRows | Number of rows in the source DataFrame. |\n| | numTargetRowsInserted | Number of rows inserted into the target table. |\n| | numTargetRowsUpdated | Number of rows updated in the target table. |\n| | numTargetRowsDeleted | Number of rows deleted in the target table. 
|\n| | numTargetRowsCopied | Number of target rows copied. |\n| | numOutputRows | Total number of rows written out. |\n| | numTargetFilesAdded | Number of files added to the sink(target). |\n| | numTargetFilesRemoved | Number of files removed from the sink(target). |\n| | executionTimeMs | Time taken to execute the entire operation. |\n| | scanTimeMs | Time taken to scan the files for matches. |\n| | rewriteTimeMs | Time taken to rewrite the matched files. |\n| UPDATE | | |\n| | numAddedFiles | Number of files added. |\n| | numRemovedFiles | Number of files removed. |\n| | numUpdatedRows | Number of rows updated. |\n| | numCopiedRows | Number of rows just copied over in the process of updating files. |\n| | executionTimeMs | Time taken to execute the entire operation. |\n| | scanTimeMs | Time taken to scan the files for matches. |\n| | rewriteTimeMs | Time taken to rewrite the matched files. |\n| FSCK | numRemovedFiles | Number of files removed. |\n| CONVERT | numConvertedFiles | Number of Parquet files that have been converted. |\n| OPTIMIZE | | |\n| | numAddedFiles | Number of files added. |\n| | numRemovedFiles | Number of files optimized. |\n| | numAddedBytes | Number of bytes added after the table was optimized. |\n| | numRemovedBytes | Number of bytes removed. |\n| | minFileSize | Size of the smallest file after the table was optimized. |\n| | p25FileSize | Size of the 25th percentile file after the table was optimized. |\n| | p50FileSize | Median file size after the table was optimized. |\n| | p75FileSize | Size of the 75th percentile file after the table was optimized. |\n| | maxFileSize | Size of the largest file after the table was optimized. |\n| CLONE | | |\n| | sourceTableSize | Size in bytes of the source table at the version that\u2019s cloned. |\n| | sourceNumOfFiles | Number of files in the source table at the version that\u2019s cloned. |\n| | numRemovedFiles | Number of files removed from the target table if a previous Delta table was replaced. 
|\n| | removedFilesSize | Total size in bytes of the files removed from the target table if a previous Delta table was replaced. |\n| | numCopiedFiles | Number of files that were copied over to the new location. 0 for shallow clones. |\n| | copiedFilesSize | Total size in bytes of the files that were copied over to the new location. 0 for shallow clones. |\n| RESTORE | | |\n| | tableSizeAfterRestore | Table size in bytes after restore. |\n| | numOfFilesAfterRestore | Number of files in the table after restore. |\n| | numRemovedFiles | Number of files removed by the restore operation. |\n| | numRestoredFiles | Number of files that were added as a result of the restore. |\n| | removedFilesSize | Size in bytes of files removed by the restore. |\n| | restoredFilesSize | Size in bytes of files added by the restore. |\n| VACUUM | | |\n| | numDeletedFiles | Number of deleted files. |\n| | numVacuumedDirectories | Number of vacuumed directories. |\n| | numFilesToDelete | Number of files to delete. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/history.html"} +{"content":"# What is Delta Lake?\n### Work with Delta Lake table history\n#### What is Delta Lake time travel?\n\nDelta Lake time travel supports querying previous table versions based on timestamp or table version (as recorded in the transaction log). You can use time travel for applications such as the following: \n* Re-creating analyses, reports, or outputs (for example, the output of a machine learning model). This could be useful for debugging or auditing, especially in regulated industries.\n* Writing complex temporal queries.\n* Fixing mistakes in your data.\n* Providing snapshot isolation for a set of queries for fast changing tables. \nImportant \nTable versions accessible with time travel are determined by a combination of the retention threshold for transaction log files and the frequency and specified retention for `VACUUM` operations. 
If you run `VACUUM` daily with the default values, 7 days of data is available for time travel.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/history.html"} +{"content":"# What is Delta Lake?\n### Work with Delta Lake table history\n#### Delta time travel syntax\n\nYou query a Delta table with time travel by adding a clause after the table name specification. \n* `timestamp_expression` can be any one of: \n+ `'2018-10-18T22:15:12.013Z'`, that is, a string that can be cast to a timestamp\n+ `cast('2018-10-18 13:36:32 CEST' as timestamp)`\n+ `'2018-10-18'`, that is, a date string\n+ `current_timestamp() - interval 12 hours`\n+ `date_sub(current_date(), 1)`\n+ Any other expression that is or can be cast to a timestamp\n* `version` is a long value that can be obtained from the output of `DESCRIBE HISTORY table_spec`. \nNeither `timestamp_expression` nor `version` can be subqueries. \nOnly date or timestamp strings are accepted. For example, `\"2019-01-01\"` and `\"2019-01-01T00:00:00.000Z\"`. See the following code for example syntax: \n```\nSELECT * FROM people10m TIMESTAMP AS OF '2018-10-18T22:15:12.013Z'\nSELECT * FROM delta.`\/tmp\/delta\/people10m` VERSION AS OF 123\n\n``` \n```\ndf1 = spark.read.option(\"timestampAsOf\", \"2019-01-01\").table(\"people10m\")\ndf2 = spark.read.option(\"versionAsOf\", 123).load(\"\/tmp\/delta\/people10m\")\n\n``` \nYou can also use the `@` syntax to specify the timestamp or version as part of the table name. The timestamp must be in `yyyyMMddHHmmssSSS` format. You can specify a version after `@` by prepending a `v` to the version. 
See the following code for example syntax: \n```\nSELECT * FROM people10m@20190101000000000\nSELECT * FROM people10m@v123\n\n``` \n```\nspark.read.table(\"people10m@20190101000000000\")\nspark.read.table(\"people10m@v123\")\n\nspark.read.load(\"\/tmp\/delta\/people10m@20190101000000000\")\nspark.read.load(\"\/tmp\/delta\/people10m@v123\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/history.html"} +{"content":"# What is Delta Lake?\n### Work with Delta Lake table history\n#### What are transaction log checkpoints?\n\nDelta Lake records table versions as JSON files within the `_delta_log` directory, which is stored alongside table data. To optimize checkpoint querying, Delta Lake aggregates table versions to Parquet checkpoint files, preventing the need to read all JSON versions of table history. Databricks optimizes checkpointing frequency for data size and workload. Users should not need to interact with checkpoints directly. The checkpoint frequency is subject to change without notice.\n\n### Work with Delta Lake table history\n#### Configure data retention for time travel queries\n\nTo query a previous table version, you must retain *both* the log and the data files for that version. \nData files are deleted when `VACUUM` runs against a table. Delta Lake manages log file removal automatically after checkpointing table versions. \nBecause most Delta tables have `VACUUM` run against them regularly, point-in-time queries should respect the retention threshold for `VACUUM`, which is 7 days by default. \nIn order to increase the data retention threshold for Delta tables, you must configure the following table properties: \n* `delta.logRetentionDuration = \"interval <interval>\"`: controls how long the history for a table is kept. The default is `interval 30 days`.\n* `delta.deletedFileRetentionDuration = \"interval <interval>\"`: determines the threshold `VACUUM` uses to remove data files no longer referenced in the current table version. 
The default is `interval 7 days`. \nYou can specify Delta properties during table creation or set them with an `ALTER TABLE` statement. See [Delta table properties reference](https:\/\/docs.databricks.com\/delta\/table-properties.html). \nNote \nYou must set both of these properties to ensure table history is retained for a longer duration for tables with frequent `VACUUM` operations. For example, to access 30 days of historical data, set `delta.deletedFileRetentionDuration = \"interval 30 days\"` (which matches the default setting for `delta.logRetentionDuration`). \nIncreasing the data retention threshold can cause your storage costs to go up, as more data files are maintained.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/history.html"} +{"content":"# What is Delta Lake?\n### Work with Delta Lake table history\n#### Restore a Delta table to an earlier state\n\nYou can restore a Delta table to an earlier state by using the `RESTORE` command. A Delta table internally maintains historic versions of the table that enable it to be restored to an earlier state.\nThe `RESTORE` command accepts either the version corresponding to the earlier state or a timestamp of when that state was created. \nImportant \n* You can restore an already restored table.\n* You can restore a [cloned](https:\/\/docs.databricks.com\/delta\/clone.html) table.\n* You must have `MODIFY` permission on the table being restored.\n* You cannot restore a table to an older version where the data files were deleted manually or by `VACUUM`. Restoring to this version partially is still possible if `spark.sql.files.ignoreMissingFiles` is set to `true`.\n* The timestamp format for restoring to an earlier state is `yyyy-MM-dd HH:mm:ss`. Providing only a date (`yyyy-MM-dd`) string is also supported. 
\n```\nRESTORE TABLE db.target_table TO VERSION AS OF <version>\nRESTORE TABLE delta.`\/data\/target\/` TO TIMESTAMP AS OF <timestamp>\n\n``` \nFor syntax details, see [RESTORE](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-restore.html). \nImportant \nRestore is considered a data-changing operation. Delta Lake log entries added by the `RESTORE` command contain [dataChange](https:\/\/github.com\/delta-io\/delta\/blob\/master\/PROTOCOL.md#add-file-and-remove-file) set to true. If there is a downstream application, such as a [Structured Streaming](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html) job that processes the updates to a Delta Lake table, the data change log entries added by the restore operation are treated as new data updates, and processing them may result in duplicate data. \nFor example: \n| Table version | Operation | Delta log updates | Records in data change log updates |\n| --- | --- | --- | --- |\n| 0 | INSERT | AddFile(\/path\/to\/file-1, dataChange = true) | (name = Viktor, age = 29), (name = George, age = 55) |\n| 1 | INSERT | AddFile(\/path\/to\/file-2, dataChange = true) | (name = George, age = 39) |\n| 2 | OPTIMIZE | AddFile(\/path\/to\/file-3, dataChange = false), RemoveFile(\/path\/to\/file-1), RemoveFile(\/path\/to\/file-2) | (No records as Optimize compaction does not change the data in the table) |\n| 3 | RESTORE(version=1) | RemoveFile(\/path\/to\/file-3), AddFile(\/path\/to\/file-1, dataChange = true), AddFile(\/path\/to\/file-2, dataChange = true) | (name = Viktor, age = 29), (name = George, age = 55), (name = George, age = 39) | \nIn the preceding example, the `RESTORE` command results in updates that were already seen when reading the Delta table versions 0 and 1. 
If a streaming query was reading this table, then these files will be considered as newly added data and will be processed again.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/history.html"} +{"content":"# What is Delta Lake?\n### Work with Delta Lake table history\n#### Restore metrics\n\n`RESTORE` reports the following metrics as a single row DataFrame once the operation is complete: \n* `table_size_after_restore`: The size of the table after restoring.\n* `num_of_files_after_restore`: The number of files in the table after restoring.\n* `num_removed_files`: Number of files removed (logically deleted) from the table.\n* `num_restored_files`: Number of files restored due to rolling back.\n* `removed_files_size`: Total size in bytes of the files that are removed from the table.\n* `restored_files_size`: Total size in bytes of the files that are restored. \n![Restore metrics example](https:\/\/docs.databricks.com\/_images\/restore-metrics.png)\n\n### Work with Delta Lake table history\n#### Examples of using Delta Lake time travel\n\n* Fix accidental deletes to a table for the user `111`: \n```\nINSERT INTO my_table\nSELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)\nWHERE userId = 111\n\n```\n* Fix accidental incorrect updates to a table: \n```\nMERGE INTO my_table target\nUSING my_table TIMESTAMP AS OF date_sub(current_date(), 1) source\nON source.userId = target.userId\nWHEN MATCHED THEN UPDATE SET *\n\n```\n* Query the number of new customers added over the last week. 
\n```\nSELECT count(distinct userId) - (\nSELECT count(distinct userId)\nFROM my_table TIMESTAMP AS OF date_sub(current_date(), 7))\nFROM my_table\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/history.html"} +{"content":"# What is Delta Lake?\n### Work with Delta Lake table history\n#### How do I find the last commit\u2019s version in the Spark session?\n\nTo get the version number of the last commit written by the current `SparkSession` across all threads\nand all tables, query the SQL configuration `spark.databricks.delta.lastCommitVersionInSession`. \n```\nSET spark.databricks.delta.lastCommitVersionInSession\n\n``` \n```\nspark.conf.get(\"spark.databricks.delta.lastCommitVersionInSession\")\n\n``` \nIf no commits have been made by the `SparkSession`, querying the key returns an empty value. \nNote \nIf you share the same `SparkSession` across multiple threads, it\u2019s similar to sharing a variable\nacross multiple threads; you may hit race conditions as the configuration value is updated\nconcurrently.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/history.html"} +{"content":"# \n### Share data and AI assets securely using Delta Sharing\n\nThis article introduces Delta Sharing in Databricks, the secure data sharing platform that lets you share data and AI assets in Databricks with users outside your organization, whether those users use Databricks or not. \nImportant \nThe Delta Sharing articles on this site focus on sharing Databricks data, notebooks, and AI models. Delta Sharing is also available as an [open-source project](https:\/\/delta.io\/sharing) that you can use to share Delta tables from other platforms. Delta Sharing also provides the backbone for [Databricks Marketplace](https:\/\/docs.databricks.com\/marketplace\/index.html), an open forum for exchanging data products. 
\nNote \nIf you are a *data recipient* who has been granted access to shared data through Delta Sharing, and you just want to learn how to access that data, see [Access data shared with you using Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/recipient.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/index.html"} +{"content":"# \n### Share data and AI assets securely using Delta Sharing\n#### What is Delta Sharing?\n\n[Delta Sharing](https:\/\/delta.io\/sharing) is an [open protocol](https:\/\/github.com\/delta-io\/delta-sharing\/blob\/main\/PROTOCOL.md) developed by Databricks for secure data sharing with other organizations regardless of the computing platforms they use. \nThere are three ways to share data using Delta Sharing: \n1. **The Databricks-to-Databricks sharing protocol**, which lets you share data and AI assets from your Unity Catalog-enabled workspace with users who also have access to a Unity Catalog-enabled Databricks workspace. \nThis approach uses the Delta Sharing server that is built into Databricks. It supports some Delta Sharing features that are not supported in the other protocols, including notebook sharing, Unity Catalog volume sharing, Unity Catalog AI model sharing, Unity Catalog data governance, auditing, and usage tracking for both providers and recipients. The integration with Unity Catalog simplifies setup and governance for both providers and recipients and improves performance. \nSee [Share data using the Delta Sharing Databricks-to-Databricks protocol (for providers)](https:\/\/docs.databricks.com\/data-sharing\/share-data-databricks.html).\n2. **The Databricks open sharing protocol**, which lets you share tabular data that you manage in a Unity Catalog-enabled Databricks workspace with users on any computing platform. 
\nThis approach uses the Delta Sharing server that is built into Databricks and is useful when you manage data using Unity Catalog and want to share it with users who don\u2019t use Databricks or don\u2019t have access to a Unity Catalog-enabled Databricks workspace. The integration with Unity Catalog on the provider side simplifies setup and governance for providers. \nSee [Share data using the Delta Sharing open sharing protocol (for providers)](https:\/\/docs.databricks.com\/data-sharing\/share-data-open.html).\n3. **A customer-managed implementation of the open-source Delta Sharing server**, which lets you share from any platform to any platform, whether Databricks or not. \nThe Databricks documentation does not cover instructions for setting up your own Delta Sharing server. See [github.com\/delta-io\/delta-sharing](https:\/\/github.com\/delta-io\/delta-sharing).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/index.html"} +{"content":"# \n### Share data and AI assets securely using Delta Sharing\n#### Shares, providers, and recipients\n\nThe primary concepts underlying Delta Sharing in Databricks are *shares*, *providers*, and *recipients*. \n### What is a share? \nIn Delta Sharing, a *share* is a read-only collection of tables and table partitions that a provider wants to share with one or more recipients. If your recipient uses a Unity Catalog-enabled Databricks workspace, you can also include notebook files, views (including dynamic views that restrict access at the row and column level), Unity Catalog volumes, and Unity Catalog models in a share. \nYou can add or remove tables, views, volumes, models, and notebook files from a share at any time, and you can assign or revoke data recipient access to a share at any time. \nIn a Unity Catalog-enabled Databricks workspace, a share is a securable object registered in Unity Catalog. If you remove a share from your Unity Catalog metastore, all recipients of that share lose the ability to access it. 
\nSee [Create and manage shares for Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/create-share.html). \n### What is a provider? \nA *provider* is an entity that shares data with a recipient. If you are a provider and you want to take advantage of the built-in Databricks Delta Sharing server and manage shares and recipients using Unity Catalog, you need at least one Databricks workspace that is enabled for Unity Catalog. You do not need to migrate all of your existing workspaces to Unity Catalog. You can simply create a new Unity Catalog-enabled workspace for your Delta Sharing needs. \nIf a recipient is on a Unity Catalog-enabled Databricks workspace, the provider is also a Unity Catalog securable object that represents the provider organization and associates that organization with a set of shares. \n### What is a recipient? \nA *recipient* is an entity that receives shares from a provider. In Unity Catalog, a recipient is a securable object that represents an organization and associates it with a credential or secure sharing identifier that allows that organization to access one or more shares. \nAs a data provider (sharer), you can define multiple recipients for any given Unity Catalog metastore, but if you want to share data from multiple metastores with a particular user or group of users, you must define the recipient separately for each metastore. A recipient can have access to multiple shares. \nIf a provider deletes a recipient from their Unity Catalog metastore, that recipient loses access to all shares it could previously access. 
\nSee [Create and manage data recipients for Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/index.html"} +{"content":"# \n### Share data and AI assets securely using Delta Sharing\n#### Open sharing versus Databricks-to-Databricks sharing\n\nThis section describes the two protocols for sharing from a Databricks workspace that is enabled for Unity Catalog. \nNote \nThis section assumes that the provider is on a Unity Catalog-enabled Databricks workspace. To learn about setting up an open-source Delta Sharing server to share from a non-Databricks platform or non-Unity Catalog workspace, see [github.com\/delta-io\/delta-sharing](https:\/\/github.com\/delta-io\/delta-sharing). \nThe way a provider uses Delta Sharing in Databricks depends on who they are sharing data with: \n* *Open sharing* lets you share data with any user, whether or not they have access to Databricks.\n* *Databricks-to-Databricks sharing* lets you share data with Databricks users whose workspace is attached to a Unity Catalog metastore that is different from yours. Databricks-to-Databricks also supports notebook, volume, and model sharing, which is not available in open sharing. \n### What is open Delta Sharing? \nIf you want to share data with users outside of your Databricks workspace, regardless of whether they use Databricks, you can use open Delta Sharing to share your data securely. As a data provider, you generate a token and share it securely with the recipient. They use the token to authenticate and get read access to the tables you\u2019ve included in the shares you\u2019ve given them access to. \nRecipients can access the shared data using many computing tools and platforms, including: \n* Databricks\n* Apache Spark\n* Pandas\n* Power BI \nFor a full list of Delta Sharing connectors and information about how to use them, see the [Delta Sharing](https:\/\/delta.io\/sharing) documentation. 
\nSee also [Share data using the Delta Sharing open sharing protocol (for providers)](https:\/\/docs.databricks.com\/data-sharing\/share-data-open.html). \n### What is Databricks-to-Databricks Delta Sharing? \nIf you want to share data with users who have a Databricks workspace that is [enabled for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html), you can use Databricks-to-Databricks Delta Sharing. Databricks-to-Databricks sharing lets you share data with users in other Databricks accounts, whether they\u2019re on AWS, Azure, or GCP. It\u2019s also a great way to securely share data across different Unity Catalog metastores in your own Databricks account. Note that there is no need to use Delta Sharing to share data between workspaces attached to the same Unity Catalog metastore, because in that scenario you can use Unity Catalog itself to manage access to data across workspaces. \nOne advantage of Databricks-to-Databricks sharing is that the share recipient doesn\u2019t need a token to access the share, and the provider doesn\u2019t need to manage recipient tokens. The security of the sharing connection\u2014including all identity verification, authentication, and auditing\u2014is managed entirely through Delta Sharing and the Databricks platform. Another advantage is the ability to share Databricks notebook files, views, Unity Catalog volumes, and Unity Catalog models. \nSee also [Share data using the Delta Sharing Databricks-to-Databricks protocol (for providers)](https:\/\/docs.databricks.com\/data-sharing\/share-data-databricks.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/index.html"} +{"content":"# \n### Share data and AI assets securely using Delta Sharing\n#### How do provider admins set up Delta Sharing?\n\nThis section gives an overview of how providers can enable Delta Sharing and initiate sharing from a Unity Catalog-enabled Databricks workspace. 
For open-source Delta Sharing, see [github.com\/delta-io\/delta-sharing](https:\/\/github.com\/delta-io\/delta-sharing). \nDatabricks-to-Databricks sharing between Unity Catalog metastores in the same account is always enabled. If you are a provider who wants to enable Delta Sharing to share data with Databricks workspaces in other accounts or non-Databricks clients, a Databricks account admin or metastore admin performs the following setup steps (at a high level): \n1. Enable Delta Sharing for the Unity Catalog metastore that manages the data you want to share. \nNote \nYou do not need to enable Delta Sharing on your metastore if you intend to use Delta Sharing to share data only with users on other Unity Catalog metastores in your account. Metastore-to-metastore sharing within a single Databricks account is enabled by default. \nSee [Enable Delta Sharing on a metastore](https:\/\/docs.databricks.com\/data-sharing\/set-up.html#enable).\n2. Create a share that includes data assets registered in the Unity Catalog metastore. \nIf you are sharing with a non-Databricks recipient (known as open sharing) you can include tables in the Delta or Parquet format. If you plan to use [Databricks-to-Databricks sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html#d-to-d), you can also add views, Unity Catalog volumes, Unity Catalog models, and notebook files to a share. \nSee [Create and manage shares for Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/create-share.html).\n3. Create a recipient. \nSee [Create and manage data recipients for Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html). \nIf your recipient is not a Databricks user, or does not have access to a Databricks workspace that is enabled for Unity Catalog, you must use [open sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html#open). A set of token-based credentials is generated for that recipient. 
\nIf your recipient has access to a Databricks workspace that is enabled for Unity Catalog, you can use [Databricks-to-Databricks sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html#d-to-d), and no token-based credentials are required. You request a *sharing identifier* from the recipient and use it to establish the secure connection. \nTip \nUse yourself as a test recipient to try out the setup process.\n4. Grant the recipient access to one or more shares. \nSee [Manage access to Delta Sharing data shares (for providers)](https:\/\/docs.databricks.com\/data-sharing\/grant-access.html). \nNote \nThis step can also be performed by a non-admin user with the `USE SHARE`, `USE RECIPIENT` and `SET SHARE PERMISSION` privileges. See [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html).\n5. Send the recipient the information they need to connect to the share (open sharing only). \nSee [Send the recipient their connection information](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#send). \nFor open sharing, use a secure channel to send the recipient an activation link that allows them to download their token-based credentials. \nFor Databricks-to-Databricks sharing, the data included in the share becomes available in the recipient\u2019s Databricks workspace as soon as you grant them access to the share. \nThe recipient can now access the shared data.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/index.html"} +{"content":"# \n### Share data and AI assets securely using Delta Sharing\n#### How do recipients access the shared data?\n\nRecipients access shared data assets in read-only format. Shared notebook files are read-only, but they can be cloned and then modified and run in the recipient workspace just like any other notebook. 
\nSecure access depends on the sharing model: \n* Open sharing (recipient does not have a Databricks workspace enabled for Unity Catalog): The recipient provides the credential whenever they access the data in their tool of choice, including Apache Spark, pandas, Power BI, Databricks, and many more. See [Read data shared using Delta Sharing open sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html).\n* Databricks-to-Databricks (recipient workspace is enabled for Unity Catalog): The recipient accesses the data using Databricks. They can use Unity Catalog to grant and deny access to other users in their Databricks account. See [Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html). \nWhenever the data provider updates data tables or volumes in their own Databricks account, the updates appear in near real time in the recipient\u2019s system.\n\n### Share data and AI assets securely using Delta Sharing\n#### How do you keep track of who is sharing and accessing shared data?\n\nData providers on Unity Catalog-enabled Databricks workspaces can use Databricks audit logging and system tables to monitor the creation and modification of shares and recipients, and can monitor recipient activity on shares. See [Audit and monitor data sharing](https:\/\/docs.databricks.com\/data-sharing\/audit-logs.html). \nData recipients who use shared data in a Databricks workspace can use Databricks audit logging and system tables to understand who is accessing which data. See [Audit and monitor data sharing](https:\/\/docs.databricks.com\/data-sharing\/audit-logs.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/index.html"} +{"content":"# \n### Share data and AI assets securely using Delta Sharing\n#### Sharing volumes\n\nYou can share volumes using the Databricks-to-Databricks sharing flow. 
See [Add volumes to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#volumes) (for providers) and [Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html) (for recipients).\n\n### Share data and AI assets securely using Delta Sharing\n#### Sharing models\n\nYou can share models using the Databricks-to-Databricks sharing flow. See [Add models to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#models) (for providers) and [Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html) (for recipients).\n\n### Share data and AI assets securely using Delta Sharing\n#### Sharing notebooks\n\nYou can use Delta Sharing to share notebook files using the Databricks-to-Databricks sharing flow. See [Add notebook files to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#add-remove-notebook-files) (for providers) and [Read shared notebooks](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html#preview-notebook-files) (for recipients).\n\n### Share data and AI assets securely using Delta Sharing\n#### Restricting access at the row and column level\n\nYou can share dynamic views that restrict access to certain table data based on recipient properties. Dynamic view sharing requires the Databricks-to-Databricks sharing flow. See [Add dynamic views to a share to filter rows and columns](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#dynamic-views).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/index.html"} +{"content":"# \n### Share data and AI assets securely using Delta Sharing\n#### Delta Sharing and streaming\n\nDelta Sharing supports Spark Structured Streaming. 
A provider can share a table with history so that a recipient can use it as a Structured Streaming source, processing shared data incrementally with low latency. Recipients can also perform [Delta Lake time travel queries](https:\/\/docs.databricks.com\/delta\/history.html) on tables shared with history. \nTo learn how to share tables with history, see [Add tables to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#add-tables). To learn how to use shared tables as streaming sources, see [Query a table using Apache Spark Structured Streaming](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html#streaming-source) (for recipients of Databricks-to-Databricks sharing) or [Access a shared table using Spark Structured Streaming](https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html#streaming-source) (for recipients of open sharing data). \nSee also [Streaming on Databricks](https:\/\/docs.databricks.com\/structured-streaming\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/index.html"} +{"content":"# \n### Share data and AI assets securely using Delta Sharing\n#### Delta Lake feature support matrix\n\nDelta Sharing supports most Delta Lake features when you share a table. This support matrix lists Delta features that require specific versions of Databricks Runtime and the open-source Delta Sharing Spark connector, along with unsupported features. \n| Feature | Provider | Databricks recipient | Open source recipient |\n| --- | --- | --- | --- |\n| Deletion vectors | Sharing tables with this feature is in Public Preview. | * Databricks Runtime 14.1+ for batch queries * Databricks Runtime 14.2+ for CDF and streaming queries | delta-sharing-spark 3.1+ |\n| Column mapping | Sharing tables with this feature is in Public Preview. 
| * Databricks Runtime 14.1+ for batch queries * Databricks Runtime 14.2+ for CDF and streaming queries | delta-sharing-spark 3.1+ |\n| Uniform format | Sharing tables with this feature is in Public Preview. | * Databricks Runtime 14.1+ for batch queries * Databricks Runtime 14.2+ for CDF and streaming queries | delta-sharing-spark 3.1+ |\n| v2Checkpoint | Not supported | Not supported | Not supported |\n| TimestampNTZ | Not supported | Not supported | Not supported |\n| Liquid clustering | Not supported | Not supported | Not supported |\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/index.html"} +{"content":"# \n### Share data and AI assets securely using Delta Sharing\n#### Delta Sharing FAQs\n\nThe following are frequently asked questions about Delta Sharing. \n### Do I need Unity Catalog to use Delta Sharing? \nNo, you do not need Unity Catalog to share (as a provider) or consume shared data (as a recipient). However, Unity Catalog provides benefits such as support for non-tabular and AI asset sharing, out-of-the-box governance, simplicity, and query performance. \nProviders can share data in two ways: \n* Put the assets to share under Unity Catalog management and share them using the built-in Databricks Delta Sharing server. \nYou do not need to migrate all assets to Unity Catalog. You need only one Databricks workspace that is enabled for Unity Catalog to manage assets that you want to share. In some accounts, new workspaces are enabled for Unity Catalog automatically. See [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement).\n* Implement the [open Delta Sharing server](https:\/\/github.com\/delta-io\/delta-sharing) to share data, without necessarily using your Databricks account. \nRecipients can consume data in two ways: \n* Without a Databricks workspace. 
Use open source Delta Sharing connectors that are available for many data platforms, including Power BI, pandas, and open source Apache Spark. See [Read data shared using Delta Sharing open sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html) and the [Delta Sharing open source project](https:\/\/delta.io\/sharing\/).\n* In a Databricks workspace. Recipient workspaces don\u2019t need to be enabled for Unity Catalog, but there are advantages of governance, simplicity, and performance if they are. \nRecipient organizations who want these advantages don\u2019t need to migrate all assets to Unity Catalog. You need only one Databricks workspace that is enabled for Unity Catalog to manage assets that are shared with you. In some accounts, new workspaces are enabled for Unity Catalog automatically. See [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement). \nSee [Read data shared using Delta Sharing open sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html) and [Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html). \n### Do I need to be a Databricks customer to use Delta Sharing? \nNo, Delta Sharing is an open protocol. You can share non-Databricks data with recipients on any data platform. Providers can configure an open Delta Sharing server to share from any computing platform. Recipients can consume shared data using open source Delta Sharing connectors for many data products, including Power BI, pandas, and open source Spark. \nHowever, using Delta Sharing on Databricks, especially sharing from a Unity Catalog-enabled workspace, has many advantages. \nFor details, see the first question in this FAQ. \n### Does Delta Sharing incur egress costs? \nDelta Sharing within a region incurs no egress cost. 
Unlike other data sharing platforms, Delta Sharing does not require data replication. This model has many advantages, but it means that your cloud vendor may charge data egress fees when you share data across clouds or regions. Databricks supports sharing from Cloudflare R2 (Public Preview), which incurs no egress fees, and provides other tools and recommendations to monitor and avoid egress fees. See [Monitor and manage Delta Sharing egress costs (for providers)](https:\/\/docs.databricks.com\/data-sharing\/manage-egress.html). \n### Can providers revoke recipient access? \nYes, recipient access can be revoked on-demand and at specified levels of granularity. You can deny recipient access to specific shares and specific IP addresses, filter tabular data for a recipient, revoke recipient tokens, and delete recipients entirely. See [Revoke recipient access to a share](https:\/\/docs.databricks.com\/data-sharing\/grant-access.html#revoke) and [Create and manage data recipients for Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html). \n### Isn\u2019t it insecure to use pre-signed URLs? \nDelta Sharing uses pre-signed URLs to provide temporary access to a file in object storage. They are only given to recipients that already have access to the shared data. They are secure because they are short-lived and don\u2019t expand the level of access beyond what recipients have already been granted. \n### Are the tokens used in the Delta Sharing open sharing protocol secure? \nBecause Delta Sharing enables cross-platform sharing\u2014unlike other available data sharing platforms\u2014the sharing protocol requires an open token. Providers can ensure token security by configuring the token lifetime, setting networking controls, and revoking access on demand. In addition, the token does not expand the level of access beyond what recipients have already been granted. 
See [Security considerations for tokens](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#security-considerations). \nIf you prefer not to use tokens to manage access to recipient shares, you should use [Databricks-to-Databricks sharing](https:\/\/docs.databricks.com\/data-sharing\/share-data-databricks.html) or contact your Databricks account team for alternatives. \n### Does Delta Sharing support view sharing? \nYes, Delta Sharing supports view sharing. See [Add views to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#views). \nTo learn about planned enhancements to view sharing, contact your Databricks account team.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/index.html"} +{"content":"# \n### Share data and AI assets securely using Delta Sharing\n#### Limitations\n\n* Tabular data must be in the [Delta table format](https:\/\/docs.databricks.com\/delta\/index.html). You can easily convert Parquet tables to Delta\u2014and back again. See [CONVERT TO DELTA](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-convert-to-delta.html).\n* View sharing is supported only in Databricks-to-Databricks sharing. Shareable views must be defined on Delta tables or other shareable views. See [Add views to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#views) (for providers) and [Read shared views](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html#views) (for recipients).\n* Notebook sharing is supported only in Databricks-to-Databricks sharing. See [Add notebook files to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#add-remove-notebook-files) and [Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html).\n* Volume sharing is supported only in Databricks-to-Databricks sharing. 
See [Add volumes to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#volumes) (for providers) and [Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html).\n* Model sharing is supported only in Databricks-to-Databricks sharing. See [Add models to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#models) (for providers) and [Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html).\n* There are limits on the number of files in metadata allowed for a shared table. To learn more, see [Resource limit exceeded errors](https:\/\/docs.databricks.com\/data-sharing\/troubleshooting.html#resource-limits).\n* Schemas named `information_schema` cannot be imported into a Unity Catalog metastore, because that schema name is reserved in Unity Catalog.\n* [Table constraints](https:\/\/docs.databricks.com\/tables\/constraints.html) (primary and foreign key constraints) are not available in shared tables.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/index.html"} +{"content":"# \n### Share data and AI assets securely using Delta Sharing\n#### Resource quotas\n\nThe values below indicate the quotas for Delta Sharing resources. Quota values below are expressed relative to the parent object in Unity Catalog. 
\n| Object | Parent | Value |\n| --- | --- | --- |\n| providers | metastore | 1000 |\n| recipients | metastore | 5000 |\n| shares | metastore | 1000 |\n| tables | share | 1000 |\n| volumes | share | 1000 |\n| models | share | 1000 |\n| schemas | share | 500 |\n| notebooks | share | 100 | \nIf you expect to exceed these resource limits, contact your Databricks account team.\n\n### Share data and AI assets securely using Delta Sharing\n#### Next steps\n\n* [Enable your Databricks account for Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/set-up.html)\n* [Create shares](https:\/\/docs.databricks.com\/data-sharing\/create-share.html)\n* [Create recipients](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html)\n* Learn more about the [open sharing](https:\/\/docs.databricks.com\/data-sharing\/share-data-open.html) and [Databricks-to-Databricks](https:\/\/docs.databricks.com\/data-sharing\/share-data-databricks.html) sharing models\n* [Learn how recipients access shared data](https:\/\/docs.databricks.com\/data-sharing\/recipient.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/index.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n### Query databases using JDBC\n##### Query MariaDB with Databricks\n\nThis example queries MariaDB using its JDBC driver. 
For more details on reading, writing, configuring parallelism, and query pushdown, see [Query databases using JDBC](https:\/\/docs.databricks.com\/connect\/external-systems\/jdbc.html).\n\n##### Query MariaDB with Databricks\n###### Create the JDBC URL\n\n```\ndriver = \"org.mariadb.jdbc.Driver\"\n\ndatabase_host = \"<database-host-url>\"\ndatabase_port = \"3306\" # update if you use a non-default port\ndatabase_name = \"<database-name>\"\ntable = \"<table-name>\"\nuser = \"<username>\"\npassword = \"<password>\"\n\nurl = f\"jdbc:mariadb:\/\/{database_host}:{database_port}\/{database_name}\"\n\n``` \n```\nval driver = \"org.mariadb.jdbc.Driver\"\n\nval database_host = \"<database-host-url>\"\nval database_port = \"3306\" \/\/ update if you use a non-default port\nval database_name = \"<database-name>\"\nval table = \"<table-name>\"\nval user = \"<username>\"\nval password = \"<password>\"\n\nval url = s\"jdbc:mariadb:\/\/${database_host}:${database_port}\/${database_name}\"\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/mariadb.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n### Query databases using JDBC\n##### Query MariaDB with Databricks\n###### Query the remote table\n\n```\nremote_table = (spark.read\n.format(\"jdbc\")\n.option(\"driver\", driver)\n.option(\"url\", url)\n.option(\"dbtable\", table)\n.option(\"user\", user)\n.option(\"password\", password)\n.load()\n)\n\n``` \n```\nval remote_table = spark.read\n.format(\"jdbc\")\n.option(\"driver\", driver)\n.option(\"url\", url)\n.option(\"dbtable\", table)\n.option(\"user\", user)\n.option(\"password\", password)\n.load()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/mariadb.html"} +{"content":"# Technology partners\n## Connect to ML partners using Partner Connect\n#### Connect to Labelbox\n\nLabelbox is a training data platform used to create training data from images, video, audio, text, and tiled imagery. 
Using Labelbox, AI teams can customize a workflow to operate, manage and improve data labeling, data cataloging, and model debugging in a single, unified platform. Labelbox is designed to help AI teams build and operate production-grade machine learning systems. \nYou can connect your Databricks clusters that have the Machine Learning version of the Databricks Runtime to Labelbox.\n\n#### Connect to Labelbox\n##### Connect to Labelbox using Partner Connect\n\nThis section describes how to connect a cluster in your Databricks workspace to Labelbox using Partner Connect. \n### Differences between standard connections and Labelbox \nTo connect to Labelbox using Partner Connect, you follow the steps in [Connect to ML partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/ml.html). The Labelbox connection is different from standard machine learning connections in the following ways: \n* In addition to a cluster, a service principal, and a personal access token, Partner Connect creates a notebook named `labelbox_databricks_example.ipynb` in the **Workspace\/Shared\/labelbox\\_demo** folder in your Databricks workspace, if it doesn\u2019t already exist. \n### Steps to connect \nTo connect to Labelbox using Partner Connect, do the following: \n1. [Connect to ML partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/ml.html).\n2. [Create a Labelbox API key](https:\/\/docs.labelbox.com\/docs\/create-an-api-key) for your Labelbox account, if you do not have one. Copy the API key and save it in a secure location, as the key will eventually be hidden from view, and you will need this key later.\n3. 
[Set up the ML cluster and Labelbox starter notebook](https:\/\/docs.databricks.com\/partners\/ml\/labelbox.html#set-up-notebook).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ml\/labelbox.html"} +{"content":"# Technology partners\n## Connect to ML partners using Partner Connect\n#### Connect to Labelbox\n##### Connect to Labelbox manually\n\nThe steps in this section describe how to connect Labelbox to a Databricks cluster. \nNote \nTo connect faster, use Partner Connect. \n### Requirements \nYou must have an available cluster running Databricks Runtime for Machine Learning. To check this for an existing cluster, look for **ML** in the **Runtime** column when you [display the cluster](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-list) in your workspace. If you do not have an available Databricks Runtime ML cluster, [create a cluster](https:\/\/docs.databricks.com\/compute\/configure.html) and for **Databricks Runtime Version**, choose a version from the **ML** list. \n### Steps to connect \nTo connect to Labelbox manually, do the following: \n1. Go to the [Labelbox](https:\/\/app.labelbox.com\/signin) page to **Sign Up** for a new Labelbox account or to **Log In** to your existing Labelbox account.\n2. [Create a Labelbox API key](https:\/\/docs.labelbox.com\/docs\/create-an-api-key) for your Labelbox account, if you do not have one. Copy the API key and save it in a secure location, as the key will eventually be hidden from view, and you will need this key later.\n3. Check for a Labelbox starter notebook in your workspace: \n1. In the sidebar, click **Workspace > Shared**.\n2. If a folder named **labelbox\\_demo** does not already exist, create it: \n1. Click the down arrow next to **Shared**.\n2. Click **Create > Folder**.\n3. Enter `labelbox_demo`.\n4. Click **Create Folder**.\n3. Click the **labelbox\\_demo** folder. 
If a starter notebook named **labelbox\\_databricks\\_example.ipynb** does not exist in the folder, import it: \n1. Click the down arrow next to **labelbox\\_demo**.\n2. Click **Import**.\n3. Click **URL**.\n4. Enter `https:\/\/github.com\/Labelbox\/labelbox-python\/blob\/develop\/examples\/integrations\/databricks\/labelbox_databricks_example.ipynb` and click **Import**.\n4. Continue to set up the ML cluster and Labelbox starter notebook.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ml\/labelbox.html"} +{"content":"# Technology partners\n## Connect to ML partners using Partner Connect\n#### Connect to Labelbox\n##### Set up the ML cluster and Labelbox starter notebook\n\n1. Check that the required Labelbox libraries are installed in your ML cluster: \n1. In the sidebar, click **Compute**.\n2. Click your ML cluster. Use the **Filter** box to find it, if necessary. \nNote \nIf you used Partner Connect to connect to Labelbox, the ML cluster\u2019s name should be **LABELBOX\\_CLUSTER**.\n3. Click the **Libraries** tab.\n4. If the **labelbox** package is not listed, install it: \n1. Click **Install New**.\n2. Click **PyPI**.\n3. For **Package**, enter **labelbox**.\n4. Click **Install**.\n5. If the **labelspark** package is not listed, install it: \n1. Click **Install New**.\n2. Click **PyPI**.\n3. For **Package**, enter **labelspark**.\n4. Click **Install**.\n2. Attach your ML cluster to the starter notebook: \n1. In the sidebar, click **Workspace > Shared > labelbox\\_demo > labelbox\\_databricks\\_example.ipynb**.\n2. [Attach your ML cluster](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) to the notebook.\n3. 
Browse through the notebook to learn how to automate Labelbox.\n\n#### Connect to Labelbox\n##### Additional resources\n\n* [README](https:\/\/github.com\/Labelbox\/labelbox-python\/blob\/develop\/examples\/integrations\/databricks\/readme.md) in GitHub for the starter notebook\n* [Labelbox Docs](https:\/\/docs.labelbox.com\/)\n* [Support](https:\/\/docs.labelbox.com\/reference\/contacting-customer-support)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ml\/labelbox.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to ThoughtSpot\n\nThis article describes how to use ThoughtSpot with a Databricks cluster or a Databricks SQL warehouse (formerly Databricks SQL endpoint).\n\n#### Connect to ThoughtSpot\n##### Connect to ThoughtSpot using Partner Connect\n\nTo connect your Databricks workspace to ThoughtSpot using Partner Connect, see [Connect to BI partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/bi.html). \nNote \nPartner Connect only supports SQL warehouses for ThoughtSpot. 
To connect a cluster to ThoughtSpot, connect manually.\n\n#### Connect to ThoughtSpot\n##### Connect to ThoughtSpot manually\n\nTo connect to ThoughtSpot manually, see [Add a Databricks connection](https:\/\/docs.thoughtspot.com\/software\/latest\/connections-databricks-add) in the ThoughtSpot documentation.\n\n#### Connect to ThoughtSpot\n##### Next steps\n\nTo continue using ThoughtSpot, see the following resources on the ThoughtSpot website: \n* [How to view a data schema](https:\/\/docs.thoughtspot.com\/software\/latest\/schema-viewer)\n* [Create a join relationship](https:\/\/docs.thoughtspot.com\/software\/latest\/join-add)\n* [Searching in ThoughtSpot](https:\/\/docs.thoughtspot.com\/software\/latest\/search)\n* [Understand formulas in searches](https:\/\/docs.thoughtspot.com\/software\/latest\/formulas)\n* [Understand charts](https:\/\/docs.thoughtspot.com\/software\/latest\/charts)\n* [Basic Pinboard usage](https:\/\/docs.thoughtspot.com\/software\/latest\/pinboards)\n* [What is SpotIQ?](https:\/\/docs.thoughtspot.com\/software\/latest\/spotiq)\n\n#### Connect to ThoughtSpot\n##### Additional resources\n\n[Support](https:\/\/www.thoughtspot.com\/support)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/thoughtspot.html"} +{"content":"# Compute\n## Use compute\n#### Serverless compute for notebooks\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). For information on eligibility and enablement, see [Enable serverless compute public preview](https:\/\/docs.databricks.com\/admin\/workspace-settings\/serverless.html). \nThis article explains how to use serverless compute for notebooks. For information on using serverless compute for workflows, see [Run your Databricks job with serverless compute for workflows](https:\/\/docs.databricks.com\/workflows\/jobs\/run-serverless-jobs.html). 
\nFor pricing information, see [Databricks pricing](https:\/\/www.databricks.com\/product\/pricing).\n\n#### Serverless compute for notebooks\n##### Requirements\n\n* Your workspace must be in a supported region. See [Databricks clouds and regions](https:\/\/docs.databricks.com\/resources\/supported-regions.html).\n* Your workspace must be enabled for Unity Catalog.\n* Your workspace must be enabled for the Public Preview.\n\n#### Serverless compute for notebooks\n##### Using serverless compute for notebooks\n\nIf your workspace is enabled for serverless interactive compute, all users in the workspace have access to serverless compute for notebooks. No additional permissions are required. \nTo attach to the serverless compute, click the **Connect** drop-down menu in the notebook and select **Serverless**. For new notebooks, the attached compute automatically defaults to serverless upon code execution if no other resource has been selected.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/serverless.html"} +{"content":"# Compute\n## Use compute\n#### Serverless compute for notebooks\n##### View query insights\n\nServerless compute for notebooks and workflows uses query insights to assess Spark execution performance. After running a cell in a notebook, you can view insights related to SQL and Python queries by clicking the **See performance** link. \n![Show query performance](https:\/\/docs.databricks.com\/_images\/query-performance.png) \nYou can click on any of the Spark statements to view the query metrics. From there you can click **See query profile** to see a visualization of the query execution. For more information on query profiles, see [Query profile](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-profile.html). \n### Query history \nAll queries that are run on serverless compute will also be recorded on your workspace\u2019s query history page. 
For information on query history, see [Query history](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html). \n### Query insight limitations \n* The query profile is only available after the query execution terminates.\n* Metrics are updated live although the query profile is not shown during execution.\n* Only the following query statuses are covered: RUNNING, CANCELED, FAILED, FINISHED.\n* Running queries cannot be canceled from the query history page. They can be canceled in notebooks or jobs.\n* Verbose metrics are not available.\n* Query Profile download is not available.\n* Access to the Spark UI is not available.\n* The statement text only contains the last line that was run. However, there might be several lines preceding this line that were run as part of the same statement.\n\n#### Serverless compute for notebooks\n##### Limitations\n\nFor a list of limitations, see [Serverless compute limitations](https:\/\/docs.databricks.com\/release-notes\/serverless.html#limitations).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/serverless.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Reliability for the data lakehouse\n##### Best practices for reliability\n\nThis article covers best practices for **reliability** organized by architectural principles listed in the following sections.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reliability\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Reliability for the data lakehouse\n##### Best practices for reliability\n###### 1. Design for failure\n\n### Use Delta Lake \nDelta Lake is an open source storage format that brings reliability to data lakes. 
Delta Lake provides ACID transactions, schema enforcement, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. Delta Lake on Databricks allows you to configure Delta Lake based on your workload patterns. See [What is Delta Lake?](https:\/\/docs.databricks.com\/delta\/index.html). \n### Use Apache Spark or Photon for distributed compute \nApache Spark, as the compute engine of the Databricks lakehouse, is based on resilient distributed data processing. In case of an internal Spark task not returning a result as expected, Apache Spark automatically reschedules the missing tasks and continues with the execution of the entire job. This is helpful for failures outside the code, like a short network issue or a revoked spot VM. Working with both the SQL API and the Spark DataFrame API comes with this resilience built into the engine. \nIn the Databricks lakehouse, [Photon](https:\/\/docs.databricks.com\/compute\/photon.html), a native vectorized engine entirely written in C++, is high performance compute compatible with Apache Spark APIs. \n### Automatically rescue invalid or nonconforming data \nInvalid or nonconforming data can lead to crashes of workloads that rely on an established data format. To increase the end-to-end resilience of the whole process, it is best practice to filter out invalid and nonconforming data at ingestion. Supporting rescued data ensures you never lose or miss out on data during ingest or ETL. The rescued data column contains any data that wasn\u2019t parsed, either because it was missing from the given schema, because there was a type mismatch, or the column casing in the record or file didn\u2019t match that in the schema. \n* **Databricks Auto Loader:** [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html) is the ideal tool for streaming the ingestion of files. 
It supports [rescued data](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html#what-is-the-rescued-data-column) for JSON and CSV. For example, for JSON, the rescued data column contains any data that wasn\u2019t parsed, possibly because it was missing from the given schema, because there was a type mismatch, or because the casing of the column didn\u2019t match. The rescued data column is part of the schema returned by Auto Loader as `_rescued_data` by default when the schema is being inferred.\n* **Delta Live Tables:** Another option to build workflows for resilience is using [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html) with quality constraints. See [Manage data quality with Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html). Out of the box, Delta Live Tables supports three modes: Retain, drop, and fail on invalid records. To quarantine identified invalid records, expectation rules can be defined in a specific way so that invalid records are stored (\u201cquarantined\u201d) in another table. See [Quarantine invalid data](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html#quarantine-invalid-data). \n### Configure jobs for automatic retries and termination \nDistributed systems are complex, and a failure at one point can potentially cascade throughout the system. \n* Databricks jobs support an [automatic retry policy](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#retry-policies) that determines when and how many times failed runs are retried.\n* Delta Live Tables also automates failure recovery by using escalating retries to balance speed with reliability. See [Development and production modes](https:\/\/docs.databricks.com\/delta-live-tables\/updates.html#development-and-production-modes). \nOn the other hand, a task that hangs can prevent the whole job from finishing, thus incurring high costs. 
Databricks jobs support a timeout configuration to terminate jobs that take longer than expected. \n### Use a scalable and production-grade model serving infrastructure \nFor batch and streaming inference, use Databricks jobs and MLflow to deploy models as Apache Spark UDFs to leverage job scheduling, retries, autoscaling, and so on. See [Use MLflow for model inference](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/index.html#offline-batch-predictions). \n[Model serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) provides a scalable, production-grade infrastructure for real-time model serving. It processes your machine learning models using MLflow and exposes them as REST API endpoints. This functionality uses serverless compute, which means that the endpoints and associated compute resources are managed and run in the Databricks cloud account. \n### Use managed services where possible \nLeverage managed services of the Databricks Data Intelligence Platform like [serverless compute](https:\/\/docs.databricks.com\/getting-started\/overview.html#serverless), [model serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html), or [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html) where possible. Databricks operates these services in a reliable and scalable way, without extra effort by the customer, which makes workloads more reliable.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reliability\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Reliability for the data lakehouse\n##### Best practices for reliability\n###### 2. Manage data quality\n\n### Use a layered storage architecture \nCurate data by creating a layered architecture and ensuring data quality increases as data moves through the layers. 
A common layering approach is: \n* **Raw layer (bronze):** Source data gets ingested into the lakehouse into the first layer and should be persisted there. When all downstream data is created from the raw layer, rebuilding the subsequent layers from this layer is possible if needed.\n* **Curated layer (silver):** The purpose of the second layer is to hold cleansed, refined, filtered and aggregated data. The goal of this layer is to provide a sound, reliable foundation for analyses and reports across all roles and functions.\n* **Final layer (gold):** The third layer is created around business or project needs. It provides a different view as data products to other business units or projects, preparing data around security needs (such as anonymized data) or optimizing for performance (such as with preaggregated views). The data products in this layer are seen as the truth for the business. \nThe final layer should only contain high-quality data and can be fully trusted from a business point of view. \n### Improve data integrity by reducing data redundancy \nCopying or duplicating data creates data redundancy and will lead to lost integrity, lost data lineage, and often different access permissions. This will decrease the quality of the data in the lakehouse. A temporary or throwaway copy of data is not harmful on its own - it is sometimes necessary for boosting agility, experimentation and innovation. However, if these copies become operational and regularly used for business decisions, they become data silos. These data silos getting out of sync has a significant negative impact on data integrity and quality, raising questions such as \u201cWhich data set is the master?\u201d or \u201cIs the data set up to date?\u201d. \n### Actively manage schemas \nUncontrolled schema changes can lead to invalid data and failing jobs that use these data sets. 
Databricks has several methods to validate and enforce the schema: \n* Delta Lake supports schema validation and schema enforcement by automatically handling schema variations to prevent the insertion of bad records during ingestion. See [Schema enforcement](https:\/\/docs.databricks.com\/tables\/schema-enforcement.html).\n* [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html) detects the addition of new columns as it processes your data. By default, the addition of a new column causes your streams to stop with an `UnknownFieldException`. Auto Loader supports several modes for [schema evolution](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html). \n### Use constraints and data expectations \nDelta tables support standard SQL constraint management clauses that ensure that the quality and integrity of data added to a table are automatically verified. When a constraint is violated, Delta Lake throws an `InvariantViolationException` error to signal that the new data can\u2019t be added. See [Constraints on Databricks](https:\/\/docs.databricks.com\/tables\/constraints.html). \nTo further improve this handling, Delta Live Tables supports expectations. Expectations define data quality constraints on the contents of a data set. An expectation consists of a description, an invariant, and an action to take when a record fails the invariant. You apply expectations to queries using Python decorators or SQL constraint clauses. See [Manage data quality with Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html). \n### Take a data-centric approach to machine learning \nFeature engineering, training, inference, and monitoring pipelines are data pipelines. They must be as robust as other production data engineering processes. Data quality is crucial in any ML application, so ML data pipelines should employ systematic approaches to monitoring and mitigating data quality issues. 
Avoid tools that make it challenging to join data from ML predictions, model monitoring, and so on, with the rest of your data. The simplest way to achieve this is to develop ML applications on the same platform used to manage production data. For example, instead of downloading training data to a laptop, where it is hard to govern and reproduce results, secure the data in cloud storage and make that storage available to your training process.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reliability\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Reliability for the data lakehouse\n##### Best practices for reliability\n###### 3. Design for autoscaling\n\n### Enable autoscaling for batch workloads \n[Autoscaling](https:\/\/docs.databricks.com\/compute\/configure.html#autoscaling) allows clusters to resize automatically based on workloads. Autoscaling can benefit many use cases and scenarios from both a cost and performance perspective. The documentation provides considerations for determining whether to use Autoscaling and how to get the most benefit. \nFor streaming workloads, Databricks recommends using Delta Live Tables with autoscaling. See [Use autoscaling to increase efficiency and reduce resource usage](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#use-autoscaling-to-increase-efficiency-and-reduce-resource-usage). \n### Enable autoscaling for SQL warehouse \nThe scaling parameter of a SQL warehouse sets the minimum and the maximum number of clusters over which queries sent to the warehouse are distributed. The default is a minimum of one and a maximum of one cluster. \nTo handle more concurrent users for a given warehouse, increase the cluster count. 
To learn how Databricks adds clusters to and removes clusters from a warehouse, see [SQL warehouse sizing, scaling, and queuing behavior](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/warehouse-behavior.html). \n### Use Delta Live Tables enhanced autoscaling \n[Databricks enhanced autoscaling](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#use-autoscaling-to-increase-efficiency-and-reduce-resource-usage) optimizes cluster utilization by automatically allocating cluster resources based on workload volume, with minimal impact on the data processing latency of your pipelines.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reliability\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Reliability for the data lakehouse\n##### Best practices for reliability\n###### 4. Test recovery procedures\n\n### Create regular backups \nTo recover from a failure, regular backups must be available. The Databricks Labs project *migrate* allows workspace admins to create backups by exporting most of the assets of their workspaces (the tool uses the Databricks CLI\/API in the background). See [Databricks Migration Tool](https:\/\/github.com\/databrickslabs\/migrate). Backups can be used either for restoring workspaces or for importing into a new workspace in case of a migration. \n### Recover from Structured Streaming query failures \nStructured Streaming provides fault-tolerance and data consistency for streaming queries. Using Databricks workflows, you can easily configure your Structured Streaming queries to restart on failure automatically. The restarted query continues where the failed one left off. See [Recover from Structured Streaming query failures with workflows](https:\/\/docs.databricks.com\/structured-streaming\/query-recovery.html). 
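\nAs a minimal sketch of why a restarted query can continue where the failed one left off, the following PySpark snippet writes a stream with a checkpoint location, which is where Structured Streaming records its progress. The table names and the checkpoint path are placeholders assumed for illustration, not values from this article: \n```\n# Assumption: <source-table>, <target-table>, and <checkpoint-path> are placeholders.\n(spark.readStream.table(\"<source-table>\")\n    .writeStream\n    # Progress is tracked under the checkpoint path; a restart that reuses it resumes from there.\n    .option(\"checkpointLocation\", \"<checkpoint-path>\")\n    .toTable(\"<target-table>\"))\n\n``` \nReusing the same checkpoint path across restarts is what lets the query resume. A restart that points at a different path is treated as a new query. 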
\n### Recover ETL jobs based on Delta time travel \nDespite thorough testing, a job in production can fail or produce some unexpected, even invalid, data. Sometimes this can be fixed with an additional job after understanding the source of the issue and fixing the pipeline that led to the issue in the first place. However, often this is not straightforward, and the respective job should be rolled back. Delta time travel lets you roll back changes to an older version or timestamp, repair the pipeline, and restart the fixed pipeline. See [What is Delta Lake time travel?](https:\/\/docs.databricks.com\/delta\/history.html#time-travel). \nA convenient way to do so is the [RESTORE](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-restore.html) command. \n### Use Databricks Workflows and built-in recovery \nDatabricks Workflows are built for recovery. When a task in a multi-task job fails (and, as such, all dependent tasks), Databricks Workflows provide a matrix view of the runs, which lets you examine the issue that led to the failure. See [View runs for a job](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#view-job-run-list). Whether it was a short network issue or a real issue in the data, you can fix it and start a repair run in Databricks Workflows. The repair run executes only the failed and dependent tasks and keeps the successful results from the earlier run, saving time and money. \n### Configure a disaster recovery pattern \nA clear disaster recovery pattern is critical for a cloud-native data analytics platform like Databricks. For some companies, it\u2019s critical that data teams can use the Databricks platform even in the rare case of a regional service-wide cloud-service provider outage, whether caused by a regional disaster like a hurricane or earthquake or another source. 
\nDatabricks is often a core part of an overall data ecosystem that includes many services, including upstream data ingestion services (batch\/streaming), cloud-native storage, downstream tools and services such as business intelligence apps, and orchestration tooling. Some of your use cases might be particularly sensitive to a regional service-wide outage. \nDisaster recovery involves a set of policies, tools, and procedures that enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. A large cloud service like Azure, AWS, or GCP serves many customers and has built-in guards against a single failure. For example, a region is a group of buildings connected to different power sources to guarantee that a single power loss will not shut down a region. However, cloud region failures can happen, and the degree of disruption and its impact on your organization can vary. See [Disaster recovery](https:\/\/docs.databricks.com\/admin\/disaster-recovery.html). \nEssential parts of a disaster recovery strategy are selecting a strategy (active\/active or active\/passive), selecting the right toolset, and testing both [failover](https:\/\/docs.databricks.com\/admin\/disaster-recovery.html#test-failover) and [restore](https:\/\/docs.databricks.com\/admin\/disaster-recovery.html#failback).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reliability\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Reliability for the data lakehouse\n##### Best practices for reliability\n###### 5. 
Automate deployments and workloads\n\nIn the Operational excellence article, see [Operational Excellence - Automate deployments and workloads](https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/best-practices.html#2-automate-deployments-and-workloads).\n\n##### Best practices for reliability\n###### 6. Set up monitoring, alerting, and logging\n\nIn the Operational excellence best practices article, see [Operational Excellence - Set up monitoring, alerting, and logging](https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/best-practices.html#system-monitoring).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reliability\/best-practices.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Connect to Google Cloud Storage\n\nThis article describes how to configure a connection from Databricks to read and write tables and data stored on Google Cloud Storage (GCS). \nTo read or write from a GCS bucket, you must create an attached service account and you must associate the bucket with the service account. You connect to the bucket directly with a key that you generate for the service account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/gcs.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Connect to Google Cloud Storage\n##### Access a GCS bucket directly with a Google Cloud service account key\n\nTo read and write directly to a bucket, you configure a key defined in your [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration). \n### Step 1: Set up Google Cloud service account using Google Cloud Console \nYou must create a service account for the Databricks cluster. Databricks recommends giving this service account the least privileges needed to perform its tasks. \n1. 
Click **IAM and Admin** in the left navigation pane.\n2. Click **Service Accounts**.\n3. Click **+ CREATE SERVICE ACCOUNT**.\n4. Enter the service account name and description. \n![Google Create service account for GCS](https:\/\/docs.databricks.com\/_images\/google-create-service-account-gcs.png)\n5. Click **CREATE**.\n6. Click **CONTINUE**.\n7. Click **DONE**. \n### Step 2: Create a key to access GCS bucket directly \nWarning \nThe JSON key you generate for the service account is a private key that should only be shared with authorized users as it controls access to datasets and resources in your Google Cloud account. \n1. In the Google Cloud console, in the service accounts list, click the newly created account.\n2. In the **Keys** section, click **ADD KEY > Create new key**. \n![Google Create Key](https:\/\/docs.databricks.com\/_images\/google-create-key-gcs.png)\n3. Accept the **JSON** key type.\n4. Click **CREATE**. The key file is downloaded to your computer. \n### Step 3: Configure the GCS bucket \n#### Create a bucket \nIf you do not already have a bucket, create one: \n1. Click **Storage** in the left navigation pane.\n2. Click **CREATE BUCKET**. \n![Google Create Bucket](https:\/\/docs.databricks.com\/_images\/google-create-bucket.png)\n3. Click **CREATE**. \n#### Configure the bucket \n1. Configure the bucket details.\n2. Click the **Permissions** tab.\n3. Next to the **Permissions** label, click **ADD**. \n![Google Bucket Details](https:\/\/docs.databricks.com\/_images\/google-bucket-details-gcs.png)\n4. Provide the **Storage Admin** permission to the service account on the bucket from the Cloud Storage roles. \n![Google Bucket Permissions](https:\/\/docs.databricks.com\/_images\/google-bucket-permissions-gcs.png)\n5. Click **SAVE**. \n### Step 4: Put the service account key in Databricks secrets \nDatabricks recommends using secret scopes for storing all credentials. 
You can put the private key and private key ID from your key JSON file into Databricks secret scopes. You can grant users, service principals, and groups in your workspace access to read the secret scopes. This protects the service account key while allowing users to access GCS. To create a secret scope, see [Secrets](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html). \n### Step 5: Configure a Databricks cluster \n1. In the **Spark Config** tab, use the following snippet to set the keys stored in secret scopes: \n```\nspark.hadoop.google.cloud.auth.service.account.enable true\nspark.hadoop.fs.gs.auth.service.account.email <client-email>\nspark.hadoop.fs.gs.project.id <project-id>\nspark.hadoop.fs.gs.auth.service.account.private.key {{secrets\/scope\/gsa_private_key}}\nspark.hadoop.fs.gs.auth.service.account.private.key.id {{secrets\/scope\/gsa_private_key_id}}\n\n``` \nReplace `<client-email>` and `<project-id>` with the values of the corresponding fields in your key JSON file. \nUse both cluster access control and notebook access control together to protect access to the service account and data in the GCS bucket. See [Compute permissions](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions) and [Collaborate using Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-collaborate.html). \n### Step 6: Read from GCS \nTo read from the GCS bucket, use a Spark read command in any supported format, for example: \n```\ndf = spark.read.format(\"parquet\").load(\"gs:\/\/<bucket-name>\/<path>\")\n\n``` \nTo write to the GCS bucket, use a Spark write command in any supported format, for example: \n```\ndf.write.format(\"parquet\").mode(\"<mode>\").save(\"gs:\/\/<bucket-name>\/<path>\")\n\n``` \nReplace `<bucket-name>` with the name of the bucket you created in [Step 3: Configure the GCS bucket](https:\/\/docs.databricks.com\/connect\/storage\/gcs.html#gcs-bucket). 
\n#### Example notebooks \n##### Read from Google Cloud Storage notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/gcs-read-scala.html) \n##### Write to Google Cloud Storage notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/gcs-write-scala.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/gcs.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Model inference using Hugging Face Transformers for NLP\n\nThis article shows you how to use Hugging Face Transformers for natural language processing (NLP) model inference. \nHugging Face transformers provides the [pipelines](https:\/\/huggingface.co\/docs\/transformers\/main\/en\/main_classes\/pipelines) class to use the pre-trained model for inference. \ud83e\udd17 Transformers pipelines support a [wide range of NLP tasks](https:\/\/huggingface.co\/docs\/transformers\/main_classes\/pipelines#natural-language-processing) that you can easily use on Databricks.\n\n##### Model inference using Hugging Face Transformers for NLP\n###### Requirements\n\n* MLflow 2.3\n* Any cluster with the Hugging Face `transformers` library installed can be used for batch inference. The `transformers` library comes preinstalled on Databricks Runtime 10.4 LTS ML and above. 
Many of the popular NLP models work best on GPU hardware, so you may get the best performance using recent GPU hardware unless you use a model specifically optimized for use on CPUs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/model-inference-nlp.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Model inference using Hugging Face Transformers for NLP\n###### Use Pandas UDFs to distribute model computation on a Spark cluster\n\nWhen experimenting with pre-trained models you can use [Pandas UDFs](https:\/\/docs.databricks.com\/udf\/pandas.html) to wrap the model and perform computation on worker CPUs or GPUs. Pandas UDFs distribute the model to each worker. \nYou can also create a Hugging Face Transformers pipeline for machine translation and use a Pandas UDF to run the pipeline on the workers of a Spark cluster: \n```\nimport pandas as pd\nfrom transformers import pipeline\nimport torch\nfrom pyspark.sql.functions import pandas_udf\n\n# Use the first GPU if one is available; otherwise run on CPU (-1).\ndevice = 0 if torch.cuda.is_available() else -1\ntranslation_pipeline = pipeline(task=\"translation_en_to_fr\", model=\"t5-base\", device=device)\n\n@pandas_udf('string')\ndef translation_udf(texts: pd.Series) -> pd.Series:\n    # Each pipeline result is a dict with a single key, translation_text.\n    translations = [result['translation_text'] for result in translation_pipeline(texts.to_list(), batch_size=1)]\n    return pd.Series(translations)\n\n``` \nSetting the `device` in this manner ensures that GPUs are used if they are available on the cluster. \nThe Hugging Face pipelines for translation return a list of Python `dict` objects, each with a single key `translation_text` and a value containing the translated text. This UDF extracts the translation from the results to return a Pandas series with just the translated text. 
If your pipeline was constructed to use GPUs by setting `device=0`, then Spark automatically reassigns GPUs on the worker nodes if your cluster has instances with multiple GPUs. \nTo use the UDF to translate a text column, you can call the UDF in a `select` statement: \n```\ntexts = [\"Hugging Face is a French company based in New York City.\", \"Databricks is based in San Francisco.\"]\ndf = spark.createDataFrame(pd.DataFrame(texts, columns=[\"texts\"]))\ndisplay(df.select(df.texts, translation_udf(df.texts).alias('translation')))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/model-inference-nlp.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Model inference using Hugging Face Transformers for NLP\n###### Return complex result types\n\nUsing Pandas UDFs you can also return more structured output. For example, in named-entity recognition, pipelines return a list of `dict` objects containing the entity, its span, type, and an associated score. While similar to the example for translation, the return type for the `@pandas_udf` annotation is more complex in the case of named-entity recognition. \nYou can get a sense of the return types to use through inspection of pipeline results, for example by running the pipeline on the driver. 
\nFor example, run the following code: \n```\nfrom transformers import pipeline\nimport torch\ndevice = 0 if torch.cuda.is_available() else -1\nner_pipeline = pipeline(task=\"ner\", model=\"Davlan\/bert-base-multilingual-cased-ner-hrl\", aggregation_strategy=\"simple\", device=device)\nner_pipeline(texts)\n\n``` \nThis yields the following annotations: \n```\n[[{'entity_group': 'ORG',\n'score': 0.99933606,\n'word': 'Hugging Face',\n'start': 0,\n'end': 12},\n{'entity_group': 'LOC',\n'score': 0.99967843,\n'word': 'New York City',\n'start': 42,\n'end': 55}],\n[{'entity_group': 'ORG',\n'score': 0.9996372,\n'word': 'Databricks',\n'start': 0,\n'end': 10},\n{'entity_group': 'LOC',\n'score': 0.999588,\n'word': 'San Francisco',\n'start': 23,\n'end': 36}]]\n\n``` \nTo represent this as a return type, you can use an `array` of `struct` fields, listing the `dict` entries as the fields of the `struct`: \n```\nimport pandas as pd\nfrom pyspark.sql.functions import pandas_udf\n\n@pandas_udf('array<struct<word string, entity_group string, score float, start integer, end integer>>')\ndef ner_udf(texts: pd.Series) -> pd.Series:\n    # Each element is the list of entity dicts for one input text.\n    return pd.Series(ner_pipeline(texts.to_list(), batch_size=1))\n\ndisplay(df.select(df.texts, ner_udf(df.texts).alias('entities')))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/model-inference-nlp.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Model inference using Hugging Face Transformers for NLP\n###### Tune performance\n\nThere are several key aspects to tuning performance of the UDF. The first is to use each GPU effectively, which you can adjust by changing the size of batches sent to the GPU by the Transformers pipeline. The second is to make sure the DataFrame is well-partitioned to utilize the entire cluster. 
\nFinally, you may wish to cache the Hugging Face model to save model load time or ingress costs. \n### Choose a batch size \nWhile the UDFs described above should work out of the box with a `batch_size` of 1, this may not use the resources available to the workers efficiently. To improve performance, tune the batch size to the model and hardware in the cluster. Databricks recommends trying various batch sizes for the pipeline on your cluster to find the best performance. Read more about [pipeline batching](https:\/\/huggingface.co\/docs\/transformers\/main_classes\/pipelines#pipeline-batching) and other [performance options](https:\/\/huggingface.co\/docs\/transformers\/performance) in the Hugging Face documentation. \nTry to find a batch size that is large enough to drive full GPU utilization but does not result in `CUDA out of memory` errors. When you receive `CUDA out of memory` errors during tuning, you need to detach and reattach the notebook to release the memory used by the model and data in the GPU. \nMonitor GPU performance by viewing the live [cluster metrics](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#metrics) for a cluster, and choosing a metric, such as `gpu0-util` for GPU processor utilization or `gpu0_mem_util` for GPU memory utilization. \n### Tune parallelism with stage-level scheduling \nBy default, Spark schedules one task per GPU on each machine. To increase parallelism, you can use stage-level scheduling to tell Spark how many tasks to run per GPU. 
For example, if you would like Spark to run two tasks per GPU, you can specify this in the following way: \n```\nfrom pyspark.resource import TaskResourceRequests, ResourceProfileBuilder\nfrom pyspark.sql.functions import col, struct\n\n# Each task requests half a GPU, so two tasks share one GPU.\ntask_requests = TaskResourceRequests().resource(\"gpu\", 0.5)\n\nbuilder = ResourceProfileBuilder()\n# build is a property, so it is accessed without parentheses.\nresource_profile = builder.require(task_requests).build\n\n# Assumption: loaded_model is a model UDF loaded earlier; this snippet does not define it.\nrdd = df.withColumn('predictions', loaded_model(struct(*map(col, df.columns)))).rdd.withResources(resource_profile)\n\n``` \n### Repartition data to use all available hardware \nThe second consideration for performance is making full use of the hardware in your cluster. Generally, a small multiple of the number of GPUs on your workers (for GPU clusters) or number of cores across the workers in your cluster (for CPU clusters) works well. Your input DataFrame may already have enough partitions to take advantage of the cluster\u2019s parallelism. To see how many partitions the DataFrame contains, use `df.rdd.getNumPartitions()`. You can repartition a DataFrame using `repartitioned_df = df.repartition(desired_partition_count)`. \n### Cache the model in DBFS or on mount points \nIf you are frequently loading a model from different or restarted clusters, you may also wish to cache the Hugging Face model in the [DBFS root volume](https:\/\/docs.databricks.com\/dbfs\/index.html) or on [a mount point](https:\/\/docs.databricks.com\/dbfs\/mounts.html). This can decrease ingress costs and reduce the time to load the model on a new or restarted cluster. To do this, set the `TRANSFORMERS_CACHE` environment variable in your code before loading the pipeline. 
\nFor example: \n```\nimport os\nos.environ['TRANSFORMERS_CACHE'] = '\/dbfs\/hugging_face_transformers_cache\/'\n\n``` \nAlternatively, you can achieve similar results by logging the model to MLflow with the [MLflow `transformers` flavor](https:\/\/mlflow.org\/docs\/latest\/models.html#transformers-transformers-experimental).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/model-inference-nlp.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Model inference using Hugging Face Transformers for NLP\n###### Notebook: Hugging Face Transformers inference and MLflow logging\n\nTo get started quickly with example code, this notebook is an end-to-end example for text summarization by using Hugging Face Transformers pipelines inference and MLflow logging. \n### Hugging Face Transformers pipelines inference notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/hugging-face-transformers-batch-nlp.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n##### Model inference using Hugging Face Transformers for NLP\n###### Additional resources\n\nYou can fine-tune your Hugging Face model with the following guides: \n* [Prepare data for fine tuning Hugging Face models](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/load-data.html)\n* [Fine-tune Hugging Face models for a single GPU](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/fine-tune-model.html) \nLearn more about [What are Hugging Face Transformers?](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/model-inference-nlp.html"} +{"content":"# What is Databricks 
Marketplace?\n### Manage requests for your data product in Databricks Marketplace\n\nThis article describes how to manage consumer requests for your data product in Databricks Marketplace. This article is intended for data providers. \nThere are two types of Databricks Marketplace listings: \n* Listings that are instantly available and free of charge.\n* Listings that require your approval because they involve a commercial transaction or require customization (such as a specific geography or demographic sample set).\n\n### Manage requests for your data product in Databricks Marketplace\n#### Manage instantly available listings\n\nYou don\u2019t need to do anything to let a Databricks Marketplace consumer get access to the data you\u2019ve shared in your instantly available listings. \nTo monitor the consumers (recipients, in Delta Sharing terminology) who are using your instantly available listings, you can use the Delta Sharing interfaces. As long as you know the name of the share, you can view all of the recipients who have accessed that share. See [View the recipients who have permissions on a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#list-permissions). \nYou can also use Delta Sharing interfaces to revoke a consumer\u2019s access to shared data. See [Revoke recipient access to a share](https:\/\/docs.databricks.com\/data-sharing\/grant-access.html#revoke).\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/manage-requests-provider.html"} +{"content":"# What is Databricks Marketplace?\n### Manage requests for your data product in Databricks Marketplace\n#### Manage requests that require your approval\n\nTo manage requests that require your approval, use the **Consumer requests** tab in the **Provider console**. \n**Permissions required:** [Marketplace admin role](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#marketplace-admin). 
To fulfill a request, you must have the `CREATE RECIPIENT` and `USE RECIPIENT` privileges. See [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html). \nTo manage a new request: \n1. Log into your Databricks workspace.\n2. In the sidebar, click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**.\n3. On the upper-right corner of the Marketplace page, click **Provider console**.\n4. The most recent new requests are listed on the **Overview** tab. To see all requests, go to the **Consumer requests** tab. \n**New requests** are listed before **All other requests**, which includes those that are fulfilled, pending, or denied. \n**Client type** can be **Databricks** (Databricks-to-Databricks sharing) or **Open client** ([Databricks-to-external-platform using a credential file](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer-open.html)).\n5. Click the **Review** button on the request row.\n6. Review the requester details.\n7. Select an action: \n* **Mark as pending** if your communications with the consumer are still ongoing and any required commercial transactions are incomplete.\n* **Fulfill** if your communication with the consumer and all transactions are complete and you are ready to share the data product. You must **Select a share**, and that share must already be created. See [Create and manage shares for Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/create-share.html). When you add a share and mark the request fulfilled: \n+ If the consumer is on a Unity Catalog-enabled Databricks workspace, they gain access to that share as a catalog in their Databricks workspace in near-real time.\n+ If the consumer is not on a Unity Catalog-enabled Databricks workspace, a credential file is generated and made available for them to download. 
See [Access data products in Databricks Marketplace using external platforms](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer-open.html).\n* **Deny** if you will not share a data product with the consumer. Optionally, you can select a **Reason for denial**. Select **Other** to enter a free-form reason.\n8. Click **Update request**. \nTo manage pending, fulfilled, or denied requests, scroll down to the **All other requests** list. You can view requester details and advance pending requests to another state. You cannot change the status of fulfilled or denied requests.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/manage-requests-provider.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n#### Data governance for the data lakehouse\n\nThe architectural principles of the **data governance** pillar cover how to centrally manage data and access to it. \n![Data governance lakehouse architecture diagram for Databricks.](https:\/\/docs.databricks.com\/_images\/data-governance.png)\n\n#### Data governance for the data lakehouse\n##### Principles of data governance\n\n1. **Unify data management** \nData management is the foundation for executing the data governance strategy. It involves the collection, integration, organization, and persistence of trusted data assets to help organizations maximize their value. A unified catalog centrally and consistently stores all your data and analytical artifacts, as well as the metadata associated with each data object. It enables end users to discover the data sets available to them and provides provenance visibility by tracking the lineage of all data assets.\n2. **Unify data security** \nThere are two tenets of effective data security governance: understanding who has access to what data, and who has recently accessed what data. 
This information is critical for almost all compliance requirements for regulated industries and is fundamental to any security governance program. With a unified data security system, the permissions model can be centrally and consistently managed across all data assets. Data access is centrally audited with alerting and monitoring capabilities to promote accountability.\n3. **Manage data quality** \nData quality is fundamental to deriving accurate and meaningful insights from data. Data quality has many dimensions, including completeness, accuracy, validity, and consistency. It must be actively managed to improve the quality of the final data sets so that the data serves as reliable and trustworthy information for business users.\n4. **Share data securely and in real-time** \nData sharing plays a key role in business processes across the enterprise, from product development and internal operations to customer experience and regulatory compliance. Increasingly, organizations need to share data sets, large and small, with their business units, customers, suppliers, and partners. Security is critical, as is efficiency and instant access to the latest data. 
Using an open and secure exchange technology helps to maximize the pool of potential exchange partners by removing the barriers of vendor technology lock-in.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/data-governance\/index.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n#### Data governance for the data lakehouse\n##### Next: Best practices for data governance\n\nSee [Best practices for data governance](https:\/\/docs.databricks.com\/lakehouse-architecture\/data-governance\/best-practices.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/data-governance\/index.html"} +{"content":"# What is Delta Lake?\n### Best practices: Delta Lake\n\nThis article describes best practices when using Delta Lake. \nDatabricks recommends using predictive optimization. See [Predictive optimization for Delta Lake](https:\/\/docs.databricks.com\/optimizations\/predictive-optimization.html). \nWhen deleting and recreating a table in the same location, you should always use a `CREATE OR REPLACE TABLE` statement. See [Drop or replace a Delta table](https:\/\/docs.databricks.com\/delta\/drop-table.html).\n\n### Best practices: Delta Lake\n#### Use liquid clustering for optimized data skipping\n\nDatabricks recommends using liquid clustering rather than partitioning, Z-order, or other data organization strategies to optimize data layout for data skipping. See [Use liquid clustering for Delta tables](https:\/\/docs.databricks.com\/delta\/clustering.html).\n\n### Best practices: Delta Lake\n#### Compact files\n\nPredictive optimization automatically runs `OPTIMIZE` and `VACUUM` commands on Unity Catalog managed tables. See [Predictive optimization for Delta Lake](https:\/\/docs.databricks.com\/optimizations\/predictive-optimization.html). 
\nDatabricks recommends frequently running the [OPTIMIZE](https:\/\/docs.databricks.com\/delta\/optimize.html) command to compact small files. \nNote \nThis operation does not remove the old files. To remove them, run the [VACUUM](https:\/\/docs.databricks.com\/delta\/vacuum.html) command.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/best-practices.html"} +{"content":"# What is Delta Lake?\n### Best practices: Delta Lake\n#### Replace the content or schema of a table\n\nSometimes you may want to replace a Delta table. For example: \n* You discover the data in the table is incorrect and want to replace the content.\n* You want to rewrite the whole table to do incompatible schema changes (such as changing column types). \nWhile you can delete the entire directory of a Delta table and create a new table on the same path, it\u2019s *not recommended* because: \n* Deleting a directory is not efficient. A directory containing very large files can take hours or even days to delete.\n* You lose all of the content in the deleted files; it\u2019s hard to recover if you delete the wrong table.\n* The directory deletion is not atomic. While you are deleting the table a concurrent query reading the table can fail or see a partial table. \n* You may experience potential consistency issues on S3, which is only eventually consistent. \nIf you don\u2019t need to change the table schema, you can [delete](https:\/\/docs.databricks.com\/delta\/tutorial.html#delete) data from a Delta table and insert your new data, or [update](https:\/\/docs.databricks.com\/delta\/tutorial.html#update) the table to fix the incorrect values. \nIf you want to change the table schema, you can replace the whole table atomically. 
For example: \n```\ndataframe.write \\\n.format(\"delta\") \\\n.mode(\"overwrite\") \\\n.option(\"overwriteSchema\", \"true\") \\\n.saveAsTable(\"<your-table>\") # Managed table\n\ndataframe.write \\\n.format(\"delta\") \\\n.mode(\"overwrite\") \\\n.option(\"overwriteSchema\", \"true\") \\\n.option(\"path\", \"<your-table-path>\") \\\n.saveAsTable(\"<your-table>\") # External table\n\n``` \n```\nREPLACE TABLE <your-table> USING DELTA AS SELECT ... -- Managed table\nREPLACE TABLE <your-table> USING DELTA LOCATION \"<your-table-path>\" AS SELECT ... -- External table\n\n``` \n```\ndataframe.write\n.format(\"delta\")\n.mode(\"overwrite\")\n.option(\"overwriteSchema\", \"true\")\n.saveAsTable(\"<your-table>\") \/\/ Managed table\n\ndataframe.write\n.format(\"delta\")\n.mode(\"overwrite\")\n.option(\"overwriteSchema\", \"true\")\n.option(\"path\", \"<your-table-path>\")\n.saveAsTable(\"<your-table>\") \/\/ External table\n\n``` \nThere are multiple benefits with this approach: \n* Overwriting a table is much faster because it doesn\u2019t need to list the directory recursively or delete any files.\n* The old version of the table still exists. If you overwrite the wrong table, you can easily retrieve the old data using time travel. See [Work with Delta Lake table history](https:\/\/docs.databricks.com\/delta\/history.html).\n* It\u2019s an atomic operation. Concurrent queries can still read the table while you are overwriting it.\n* Because of Delta Lake ACID transaction guarantees, if overwriting the table fails, the table will be in its previous state. \n* You won\u2019t face any consistency issues on S3 because you don\u2019t delete files. \nIn addition, if you want to delete old files to save storage costs after overwriting the table, you can use [VACUUM](https:\/\/docs.databricks.com\/delta\/vacuum.html) to delete them. 
It\u2019s optimized for file deletion and is usually faster than deleting the entire directory.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/best-practices.html"} +{"content":"# What is Delta Lake?\n### Best practices: Delta Lake\n#### Spark caching\n\nDatabricks does not recommend that you use Spark caching for the following reasons: \n* You lose any data skipping that can come from additional filters added on top of the cached `DataFrame`.\n* The data that gets cached might not be updated if the table is accessed using a different identifier.\n\n### Best practices: Delta Lake\n#### Differences between Delta Lake and Parquet on Apache Spark\n\nDelta Lake handles the following operations automatically. You should never perform these operations manually: \n* **`REFRESH TABLE`**: Delta tables always return the most up-to-date information, so there is no need to call `REFRESH TABLE` manually after changes.\n* **Add and remove partitions**: Delta Lake automatically tracks the set of partitions present in a table and updates the list as data is added or removed. As a result, there is no need to run `ALTER TABLE [ADD|DROP] PARTITION` or `MSCK`.\n* **Load a single partition**: Reading partitions directly is not necessary. For example, you don\u2019t need to run `spark.read.format(\"parquet\").load(\"\/data\/date=2017-01-01\")`. Instead, use a `WHERE` clause for data skipping, such as `spark.read.table(\"<table-name>\").where(\"date = '2017-01-01'\")`.\n* **Don\u2019t manually modify data files**: Delta Lake uses the transaction log to commit changes to the table atomically. 
Do not directly modify, add, or delete Parquet data files in a Delta table, because this can lead to lost data or table corruption.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/best-practices.html"} +{"content":"# What is Delta Lake?\n### Best practices: Delta Lake\n#### Improve performance for Delta Lake merge\n\nYou can reduce the time it takes to merge by using the following approaches: \n* **Reduce the search space for matches**: By default, the `merge` operation searches the entire Delta table to find matches in the source table. One way to speed up `merge` is to reduce the search space by adding known constraints in the match condition. For example, suppose you have a table that is partitioned by `country` and `date` and you want to use `merge` to update information for the last day and a specific country. Adding the following condition makes the query faster, as it looks for matches only in the relevant partitions: \n```\nevents.date = current_date() AND events.country = 'USA'\n\n``` \nFurthermore, this query also reduces the chances of conflicts with other concurrent operations. See [Isolation levels and write conflicts on Databricks](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html) for more details.\n* **Compact files**: If the data is stored in many small files, reading the data to search for matches can become slow. You can compact small files into larger files to improve read throughput. See [Compact data files with optimize on Delta Lake](https:\/\/docs.databricks.com\/delta\/optimize.html) for details.\n* **Control the shuffle partitions for writes**: The `merge` operation shuffles data multiple times to compute and write the updated data. The number of tasks used to shuffle is controlled by the Spark session configuration `spark.sql.shuffle.partitions`. Setting this parameter not only controls the parallelism but also determines the number of output files. 
Increasing the value increases parallelism but also generates a larger number of smaller data files.\n* **Enable optimized writes**: For partitioned tables, `merge` can produce a much larger number of small files than the number of shuffle partitions. This is because every shuffle task can write multiple files in multiple partitions, and can become a performance bottleneck. You can reduce the number of files by enabling optimized writes. See [Optimized writes for Delta Lake on Databricks](https:\/\/docs.databricks.com\/delta\/tune-file-size.html#optimized-writes).\n* **Tune file sizes in table**: Databricks can automatically detect if a Delta table has frequent `merge` operations that rewrite files and may choose to reduce the size of rewritten files in anticipation of further file rewrites in the future. See the section on [tuning file sizes](https:\/\/docs.databricks.com\/delta\/tune-file-size.html) for details.\n* **Low Shuffle Merge**: [Low Shuffle Merge](https:\/\/docs.databricks.com\/optimizations\/low-shuffle-merge.html) provides an optimized implementation of `MERGE` that provides better performance for most common workloads. In addition, it preserves existing data layout optimizations such as [Z-ordering](https:\/\/docs.databricks.com\/delta\/data-skipping.html) on unmodified data.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/best-practices.html"} +{"content":"# What is Delta Lake?\n### Best practices: Delta Lake\n#### Manage data recency\n\nAt the beginning of each query, Delta tables auto-update to the latest version of the table. This process can be observed in notebooks when the command status reports: `Updating the Delta table's state`. However, when running historical analysis on a table, you may not necessarily need up-to-the-last-minute data, especially for tables where streaming data is being ingested frequently. In these cases, queries can be run on stale snapshots of your Delta table. 
This approach can lower latency in getting results from queries. \nYou can configure tolerance for stale data by setting the Spark session configuration `spark.databricks.delta.stalenessLimit` with a time string value such as `1h` or `15m` (for 1 hour or 15 minutes, respectively). This configuration is session specific, and doesn\u2019t affect other clients accessing the table. If the table state has been updated within the staleness limit, a query against the table returns results without waiting for the latest table update. This setting never prevents your table from updating, and when stale data is returned, the update processes in the background. If the last table update is older than the staleness limit, the query does not return results until the table state update completes.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/best-practices.html"} +{"content":"# What is Delta Lake?\n### Best practices: Delta Lake\n#### Enhanced checkpoints for low-latency queries\n\nDelta Lake writes [checkpoints](https:\/\/github.com\/delta-io\/delta\/blob\/master\/PROTOCOL.md#checkpoints) as an aggregate state of a Delta table at an optimized frequency. These checkpoints serve as the starting point to compute the latest state of the table. Without checkpoints, Delta Lake would have to read a large collection of JSON files (\u201cdelta\u201d files) representing commits to the transaction log to compute the state of a table. In addition, the column-level statistics Delta Lake uses to perform [data skipping](https:\/\/docs.databricks.com\/delta\/data-skipping.html) are stored in the checkpoint. \nImportant \nDelta Lake checkpoints are different than [Structured Streaming checkpoints](https:\/\/docs.databricks.com\/structured-streaming\/query-recovery.html). \nColumn-level statistics are stored as a struct and a JSON (for backwards compatibility). 
The struct format makes Delta Lake reads much faster, because: \n* Delta Lake doesn\u2019t perform expensive JSON parsing to obtain column-level statistics.\n* Parquet column pruning capabilities significantly reduce the I\/O required to read the statistics for a column. \nThe struct format enables a collection of optimizations that reduce the overhead of Delta Lake read operations from seconds to tens of milliseconds, which significantly reduces the latency for short queries.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/best-practices.html"} +{"content":"# What is Delta Lake?\n### Best practices: Delta Lake\n#### Manage column-level statistics in checkpoints\n\nYou manage how statistics are written in checkpoints using the table properties `delta.checkpoint.writeStatsAsJson` and `delta.checkpoint.writeStatsAsStruct`. If both table properties are `false`, Delta Lake *cannot* perform data skipping. \n* Batch writes write statistics in both JSON and struct format. `delta.checkpoint.writeStatsAsJson` is `true`.\n* `delta.checkpoint.writeStatsAsStruct` is undefined by default.\n* Readers use the struct column when available and otherwise fall back to using the JSON column. \nImportant \nEnhanced checkpoints do not break compatibility with open source Delta Lake readers. However, setting `delta.checkpoint.writeStatsAsJson` to `false` may have implications on proprietary Delta Lake readers. 
Contact your vendors to learn more about performance implications.\n\n### Best practices: Delta Lake\n#### Enable enhanced checkpoints for Structured Streaming queries\n\nIf your Structured Streaming workloads don\u2019t have low latency requirements (subminute latencies), you can enable enhanced checkpoints by running the following SQL command: \n```\nALTER TABLE [<table-name>|delta.`<path-to-table>`] SET TBLPROPERTIES\n('delta.checkpoint.writeStatsAsStruct' = 'true')\n\n``` \nYou can also improve the checkpoint write latency by setting the following table properties: \n```\nALTER TABLE [<table-name>|delta.`<path-to-table>`] SET TBLPROPERTIES\n(\n'delta.checkpoint.writeStatsAsStruct' = 'true',\n'delta.checkpoint.writeStatsAsJson' = 'false'\n)\n\n``` \nIf data skipping is not useful in your application, you can set both properties to false. Then no statistics are collected or written. Databricks does not recommend this configuration.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/best-practices.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n#### Manage model lifecycle using the Workspace Model Registry (legacy)\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/multiple-workspaces.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n#### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Share models across workspaces\n\nImportant \nDatabricks recommends using [Models in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html) to share models across workspaces. The approach in this article is deprecated. \nDatabricks supports sharing models across multiple workspaces. 
For example, you can develop and log a model in a development workspace, and then access and compare it against models in a separate production workspace. This is useful when multiple teams share access to models or when your organization has multiple workspaces to handle the different stages of development. For cross-workspace model development and deployment, Databricks recommends the [deploy code](https:\/\/docs.databricks.com\/machine-learning\/mlops\/deployment-patterns.html#deploy-code) approach, where the model training code is deployed to multiple environments. \nIn multi-workspace situations, you can access models across Databricks workspaces by using a remote model registry. For example, data scientists could access the production model registry with read-only access to compare their in-development models against the current production models. An example multi-workspace set-up is shown below. \n![Multiple workspaces](https:\/\/docs.databricks.com\/_images\/multiworkspace1.png) \nAccess to a remote registry is controlled by tokens. Each user or script that needs access [creates a personal access token](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement) in the remote registry and [copies that token into the secret manager](https:\/\/docs.databricks.com\/api\/workspace\/secrets) of their local workspace. Each API request sent to the remote registry workspace must include the access token; MLflow provides a simple mechanism to specify the secrets to be used when performing model registry operations. \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). 
\nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \nAll [client](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.tracking.html) and [fluent](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.html) API methods for model registry are supported for remote workspaces.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/multiple-workspaces.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n#### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Share models across workspaces\n####### Requirements\n\nUsing a model registry across workspaces requires the MLflow Python client, release 1.11.0 or above. \nNote \nThis workflow is implemented from logic in the MLflow client. Ensure that the environment running the client has access to make network requests against the Databricks workspace containing the remote model registry. A common restriction put on the registry workspace is an IP allow list, which can disallow connections from MLflow clients running in a cluster in another workspace.\n\n###### Share models across workspaces\n####### Set up the API token for a remote registry\n\n1. In the model registry workspace, [create an access token](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement).\n2. In the local workspace, create secrets to store the access token and the remote workspace information: \n1. Create a secret scope: `databricks secrets create-scope <scope>`.\n2. 
Pick a unique name for the target workspace, shown here as `<prefix>`. Then create three secrets: \n* `databricks secrets put-secret <scope> <prefix>-host` :\nEnter the hostname of the model registry workspace. For example, `https:\/\/cust-success.cloud.databricks.com\/`.\n* `databricks secrets put-secret <scope> <prefix>-token` :\nEnter the access token from the model registry workspace.\n* `databricks secrets put-secret <scope> <prefix>-workspace-id` :\nEnter the workspace ID for the model registry workspace which can be [found in the URL](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-instance-names-urls-and-ids) of any page. \nNote \nYou may want to [share the secret scope](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html#permissions) with other users, since there is a [limit on the number of secret scopes per workspace](https:\/\/docs.databricks.com\/security\/secrets\/secret-scopes.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/multiple-workspaces.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n#### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Share models across workspaces\n####### Specify a remote registry\n\nBased on the secret scope and name prefix you created for the remote registry workspace, you can construct a registry URI of the form: \n```\nregistry_uri = f'databricks:\/\/<scope>:<prefix>'\n\n``` \nYou can use the URI to specify a remote registry for [fluent API methods](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.html) by first calling: \n```\nmlflow.set_registry_uri(registry_uri)\n\n``` \nOr, you can specify it explicitly when you instantiate an `MlflowClient`: \n```\nclient = MlflowClient(registry_uri=registry_uri)\n\n``` \nThe following workflows show examples of both approaches.\n\n###### Share models across 
workspaces\n####### Register a model in the remote registry\n\nOne way to register a model is to use the `mlflow.register_model` API: \n```\nmlflow.set_registry_uri(registry_uri)\nmlflow.register_model(model_uri=f'runs:\/<run-id>\/<artifact-path>', name=model_name)\n\n``` \nExamples for other model registration methods can be found in the notebook at the end of this page. \nNote \nRegistering a model in a remote workspace creates a temporary copy of the model artifacts in DBFS in the remote workspace. You may want to delete this copy once the model version is in `READY` status. The temporary files can be found under the `\/dbfs\/databricks\/mlflow\/tmp-external-source\/<run-id>` folder. \nYou can also specify a `tracking_uri` to point to an MLflow Tracking service in another workspace in a similar manner to `registry_uri`. This means you can take a run on a remote workspace and register its model in the current or another remote workspace.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/multiple-workspaces.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n#### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Share models across workspaces\n####### Use a model from the remote registry\n\nYou can load and use a model version in a remote registry with `mlflow.<flavor>.load_model` methods by first setting the registry URI: \n```\nmlflow.set_registry_uri(registry_uri)\nmodel = mlflow.pyfunc.load_model(f'models:\/<model-name>\/Staging')\nmodel.predict(...)\n\n``` \nOr, you can explicitly specify the remote registry in the `models:\/` URI: \n```\nmodel = mlflow.pyfunc.load_model(f'models:\/\/<scope>:<prefix>@databricks\/<model-name>\/Staging')\nmodel.predict(...)\n\n``` \nOther helper methods for accessing the model files are also supported, such as: 
\n```\nclient.get_latest_versions(model_name)\nclient.get_model_version_download_uri(model_name, version)\n\n```\n\n###### Share models across workspaces\n####### Manage a model in the remote registry\n\nYou can perform any action on models in the remote registry as long as you have the required permissions. For example, if you have CAN MANAGE permissions on a model, you can transition a model version stage or delete the model using `MlflowClient` methods: \n```\nclient = MlflowClient(tracking_uri=None, registry_uri=registry_uri)\nclient.transition_model_version_stage(model_name, version, 'Archived')\nclient.delete_registered_model(model_name)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/multiple-workspaces.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n#### Manage model lifecycle using the Workspace Model Registry (legacy)\n###### Share models across workspaces\n####### Notebook example: Remote model registry\n\nThe following notebook is applicable for workspaces that are not enabled for Unity Catalog. It shows how to log models to the MLflow tracking server from the current workspace, and register the models into Model Registry in a different workspace. Databricks recommends using [Models in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html) to share models across workspaces. 
\n### Remote Model Registry example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-model-registry-multi-workspace.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/multiple-workspaces.html"} +{"content":"# \n### Creating a `\ud83d\udd17 Chain` version\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-chain.html"} +{"content":"# \n### Creating a `\ud83d\udd17 Chain` version\n#### Conceptual overview\n\nThe `\ud83d\udd17 Chain` is the \u201cheart\u201d of your application and contains the orchestration code that glues together a `\ud83d\udd0d Retriever`, Generative AI Models, and often other APIs\/services to turn a user query (question) into a bot response (answer). Each `\ud83d\udd17 Chain` is associated with 1+ `\ud83d\udd0d Retriever`s. \nTo use RAG Studio, the bare minimum requirement is to configure a `\ud83d\udd17 Chain`. \nAn example `\ud83d\udd17 Chain` might accept a user query, perform query processing, query a `\ud83d\udd0d Retriever`, and then prompt a Generative AI Model with the query and retriever results to generate a response to the user. However, `\ud83d\udd17 Chain` logic can be arbitrarily complex and often includes additional steps. \nRAG Studio is compatible with any MLflow logged model that has the following request\/response schema. 
The request schema follows the OpenAI [ChatMessages](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#chatmessage) format and the response schema follows the [ChatResponse](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#chat-response) format. \n```\nrequest_signature = {\n# `messages` is an Array of [ChatMessages](\/machine-learning\/foundation-models\/api-reference.md#chatmessage)\n# To support multi-turn conversations, your front end application can pass an array of length >1, where the array alternates between role = \"user\" and role = \"assistant\".\n# The last message in the array must be of role = \"user\"\n\"messages\": [{\"role\": \"user\", \"content\": \"This is a question to ask?\"}]\n}\n\nresponse_signature = {\n# `choices` is an array of ChatCompletionChoice\n# There can be 1+ choices, but each choice must have a single [ChatMessages](\/machine-learning\/foundation-models\/api-reference.md#chatmessage) with role = \"assistant\"\n\"choices\": [{\n\"index\": 0,\n\"message\": {\"role\": \"assistant\", \"content\": \"This is the correct answer.\"},\n\"finish_reason\": \"stop\"\n}],\n\"object\": \"chat.completions\"\n# TODO: add the rest of https:\/\/docs.databricks.com\/en\/machine-learning\/foundation-models\/api-reference.html#chat-task schema here\n}\n\n``` \nNote \nIn v2024-01-19, while you can use any MLflow model, in order to enable `\ud83d\udcdd Trace` logging, you must use a LangChain-defined chain inside your `\ud83d\udd17 Chain`. Future versions will enable the `RAG Trace Logging API` to be called directly by your code. \nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** Support for Llama-Index chains \nA `\ud83d\udd17 Chain` consists of: \n1. Configuration stored in the `chains` section of `rag-config.yml`\n2. 
Code stored in `app-directory\/src\/build_chain.py` that configures the chain\u2019s logic and logs it as a [Unity Catalog Model](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html). \nYou can configure a Generative AI Model in `rag-config.yml`. This model can be any [Foundation Model APIs pay-per-token](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#what-are-databricks-foundation-model-apis), [Foundation Model APIs provisioned throughput](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#what-are-databricks-foundation-model-apis) or [External Model](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html) Endpoint that supports the [`llm\/v1\/chat`](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#chat) task. \nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** Support for multiple `\ud83d\udd17 Chain`s per RAG Application. In v2024-01-19, only one `\ud83d\udd17 Chain` can be created per RAG Application. \nNote \n`\ud83d\udd17 Chain`s must be deployed to Databricks Model Serving in order to enable `\ud83d\udcdd Trace` logging and `\ud83d\udc4d Assessments` collection.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-chain.html"} +{"content":"# \n### Creating a `\ud83d\udd17 Chain` version\n#### Data flows\n\n![legend](https:\/\/docs.databricks.com\/_images\/data-flow-chain.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-chain.html"} +{"content":"# \n### Creating a `\ud83d\udd17 Chain` version\n#### Step-by-step instructions\n\n1. Open the `rag-config.yml` in your IDE\/code editor.\n2. Edit the `chains` configuration. 
\n```\nchains:\n- name: spark-docs-chain # User specified, must be unique, no spaces\ndescription: Spark docs chain # User specified, any text string\n# explicit link to the retriever that this chain uses.\n# currently, only one retriever per chain is supported, but this schema allows support for adding multiple in the future\nretrievers:\n- name: ann-retriever\nfoundational_models:\n- name: llama-2-70b-chat # user specified name to reference this model in the chain & to override per environment. Must be unique.\ntype: v1\/llm\/chat\nendpoint_name: databricks-llama-2-70b-chat\nprompt_template:\nchat_messages:\n- role: \"system\"\ncontent: \"You are a trustful assistant for Databricks users. You are answering python, coding, SQL, data engineering, spark, data science, AI, ML, Datawarehouse, platform, API or infrastructure, Cloud administration question related to Databricks. If you do not know the answer to a question, you truthfully say you do not know. Read the discussion to get the context of the previous conversation. In the chat discussion, you are referred to as 'system'. The user is referred to as 'user'.\"\n- role: \"user\"\ncontent: \"Discussion: {chat_history}. Here's some context which might or might not help you answer: {context} Answer straight, do not repeat the question, do not start with something like: the answer to the question, do not add 'AI' in front of your answer, do not say: here is the answer, do not mention the context or the question. Based on this history and context, answer this question: {question}\"\nconfigurations:\ntemperature: 0.9\nmax_tokens: 200\n\n```\n3. Edit the `src\/my_rag_builder\/chain.py` to modify the default code or add custom code. \nIf you just want to modify the default chain logic, edit the `full_chain` that defines a chain in LangChain LCEL. 
\nNote \nYou can modify this file in any way you see fit, as long as after the code finishes running, `destination_model_name` contains a logged MLflow model with the signature defined above, logged using the provided convenience function `chain_model_utils.log_register_chain_model()`.\n4. To test the chain locally: \n1. Set the `DATABRICKS_TOKEN` environment variable to a [Personal Access Token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). \n```\nexport DATABRICKS_TOKEN=pat_token_key\n\n```\n2. Update `vector_search_index_name` on line 204 to the name of a Vector Search index previously created with `.\/rag create-rag-version`\n3. Uncomment all or part of lines 244-264 to print the chain output to the console.\n4. Run the `chain.py` file. \n```\npython chain.py\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-chain.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n#### Distributed training\n\nWhen possible, Databricks recommends that you train neural networks on a single machine; distributed code for training and inference is more complex than single-machine code and slower due to communication overhead. However, you should consider distributed training and inference if your model or your data are too large to fit in memory on a single machine. For these workloads, Databricks Runtime ML includes the TorchDistributor, Horovod and spark-tensorflow-distributor packages. 
\nDatabricks also offers distributed training for Spark ML models with the `pyspark.ml.connect` module, see [Train Spark ML models on Databricks Connect with pyspark.ml.connect](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/distributed-ml-for-spark-connect.html).\n\n#### Distributed training\n##### DeepSpeed distributor\n\nThe DeepSpeed distributor is built on top of [TorchDistributor](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/spark-pytorch-distributor.html) and is a recommended solution for customers with models that require higher compute power, but are limited by memory constraints. DeepSpeed is an open-source library developed by Microsoft and offers optimized memory usage, reduced communication overhead, and advanced pipeline parallelism. Learn more about [Distributed training with DeepSpeed distributor](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/deepspeed.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n#### Distributed training\n##### TorchDistributor\n\n[TorchDistributor](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/reference\/api\/pyspark.ml.torch.distributor.TorchDistributor.html) is an open-source module in PySpark that helps users do distributed training with PyTorch on their Spark clusters, so it lets you launch PyTorch training jobs as Spark jobs. Under-the-hood, it initializes the environment and the communication channels between the workers and utilizes the CLI command `torch.distributed.run` to run distributed training across the worker nodes. 
Learn more about [Distributed training with TorchDistributor](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/spark-pytorch-distributor.html).\n\n#### Distributed training\n##### spark-tensorflow-distributor\n\n[spark-tensorflow-distributor](https:\/\/github.com\/tensorflow\/ecosystem\/tree\/master\/spark\/spark-tensorflow-distributor) is an open-source native package in TensorFlow for distributed training with TensorFlow on Spark clusters. Learn more about [Distributed training with TensorFlow 2](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/spark-tf-distributor.html).\n\n#### Distributed training\n##### Ray\n\n[Ray](https:\/\/docs.ray.io\/en\/latest\/ray-overview\/index.html) is an open-source framework that specializes in parallel compute processing for scaling ML workflows and AI applications. See [Use Ray on Databricks](https:\/\/docs.databricks.com\/machine-learning\/ray-integration.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n#### Distributed training\n##### Horovod\n\n[Horovod](https:\/\/github.com\/horovod\/horovod) is a distributed training framework for TensorFlow, Keras, and PyTorch. Databricks supports distributed deep learning training using HorovodRunner and the `horovod.spark` package. For Spark ML pipeline applications using Keras or PyTorch, you can use the `horovod.spark` [estimator API](https:\/\/spark.apache.org\/docs\/latest\/ml-pipeline.html#estimators). \n### Requirements \nDatabricks Runtime ML. \n### Use Horovod \nThe following articles provide general information about distributed deep learning with Horovod and example notebooks illustrating how to use HorovodRunner and the `horovod.spark` package. 
\n* [HorovodRunner: distributed deep learning with Horovod](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-runner.html)\n* [HorovodRunner examples](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-runner-examples.html)\n* [`horovod.spark`: distributed deep learning with Horovod](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-spark.html) \n### Install a different version of Horovod \nTo upgrade or downgrade Horovod from the pre-installed version in your ML cluster, you must recompile Horovod by following these steps: \n1. Uninstall the current version of Horovod. \n```\n%pip uninstall -y horovod\n\n``` \n2. If using a GPU-accelerated cluster, install CUDA development libraries required to compile Horovod. To ensure compatibility, leave the package versions unchanged. \n```\n%sh\nwget https:\/\/developer.download.nvidia.com\/compute\/cuda\/repos\/ubuntu1804\/x86_64\/cuda-ubuntu1804.pin\nmv cuda-ubuntu1804.pin \/etc\/apt\/preferences.d\/cuda-repository-pin-600\napt-key adv --fetch-keys https:\/\/developer.download.nvidia.com\/compute\/cuda\/repos\/ubuntu1804\/x86_64\/7fa2af80.pub\nadd-apt-repository \"deb https:\/\/developer.download.nvidia.com\/compute\/cuda\/repos\/ubuntu1804\/x86_64\/ \/\"\n\nwget https:\/\/developer.download.nvidia.com\/compute\/machine-learning\/repos\/ubuntu1804\/x86_64\/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb\ndpkg -i .\/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb\n\napt-get update\napt-get install --allow-downgrades --no-install-recommends -y \\\ncuda-nvml-dev-11-0=11.0.167-1 \\\ncuda-nvcc-11-0=11.0.221-1 \\\ncuda-cudart-dev-11-0=11.0.221-1 \\\ncuda-libraries-dev-11-0=11.0.3-1 \\\nlibnccl-dev=2.11.4-1+cuda11.5 \\\nlibcusparse-dev-11-0=11.1.1.245-1\n\n``` \n3. Download the desired version of Horovod\u2019s source code and compile with the appropriate flags. 
If you don\u2019t need any of the extensions (such as `HOROVOD_WITH_PYTORCH`), you can remove those flags. \n```\n%sh\n# CPU cluster build\nHOROVOD_VERSION=v0.21.3 # Change as necessary\ngit clone --recursive https:\/\/github.com\/horovod\/horovod.git --branch ${HOROVOD_VERSION}\ncd horovod\nrm -rf build\/ dist\/\n# For Databricks Runtime 8.4 ML and below, replace \/databricks\/python3\/bin\/python with \/databricks\/conda\/envs\/databricks-ml\/bin\/python\nHOROVOD_WITH_MPI=1 HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 \\\nsudo \/databricks\/python3\/bin\/python setup.py bdist_wheel\nreadlink -f dist\/horovod-*.whl\n\n``` \n```\n%sh\n# GPU cluster build (NCCL allreduce)\nHOROVOD_VERSION=v0.21.3 # Change as necessary\ngit clone --recursive https:\/\/github.com\/horovod\/horovod.git --branch ${HOROVOD_VERSION}\ncd horovod\nrm -rf build\/ dist\/\n# For Databricks Runtime 8.4 ML and below, replace \/databricks\/python3\/bin\/python with \/databricks\/conda\/envs\/databricks-ml-gpu\/bin\/python\nHOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_CUDA_HOME=\/usr\/local\/cuda HOROVOD_WITH_MPI=1 HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 \\\nsudo \/databricks\/python3\/bin\/python setup.py bdist_wheel\nreadlink -f dist\/horovod-*.whl\n\n``` \n4. Use `%pip` to reinstall Horovod by specifying the Python wheel path from the previous command\u2019s output. `0.21.3` is shown in this example. \n```\n%pip install --no-cache-dir \/databricks\/driver\/horovod\/dist\/horovod-0.21.3-cp38-cp38-linux_x86_64.whl\n\n``` \n### Troubleshoot Horovod installation \n**Problem**: Importing `horovod.{torch|tensorflow}` raises `ImportError: Extension horovod.{torch|tensorflow} has not been built` \n**Solution**: Horovod comes pre-installed on Databricks Runtime ML, so this error typically occurs if updating an environment goes wrong. The error indicates that Horovod was installed before a required library (PyTorch or TensorFlow). 
Since Horovod is compiled during installation, `horovod.{torch|tensorflow}` will not get compiled if those packages aren\u2019t present during the installation of Horovod.\nTo fix the issue, follow these steps: \n1. Verify that you are on a Databricks Runtime ML cluster.\n2. Ensure that the PyTorch or TensorFlow package is already installed.\n3. Uninstall Horovod (`%pip uninstall -y horovod`).\n4. Install `cmake` (`%pip install cmake`).\n5. Reinstall `horovod`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/index.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### ipywidgets\n\n[ipywidgets](https:\/\/ipywidgets.readthedocs.io\/en\/7.7.0\/) are visual elements that allow users to specify parameter values in notebook cells. You can use ipywidgets to make your Databricks Python notebooks interactive. \nThe ipywidgets package includes over [30 different controls](https:\/\/ipywidgets.readthedocs.io\/en\/7.7.0\/examples\/Widget%20List.html), including form controls such as sliders, text boxes, and checkboxes, as well as layout controls such as tabs, accordions, and grids. Using these elements, you can build graphical user interfaces to interface with your notebook code. \nNote \n* To determine the version of ipywidgets that your cluster supports, refer to the [release notes](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html) for the Databricks Runtime version of your cluster.\n* Some ipywidgets do not work in Databricks Runtime 15.0.\n* For information about Databricks widgets, see [Databricks widgets](https:\/\/docs.databricks.com\/notebooks\/widgets.html). 
For guidelines on when to use Databricks widgets or ipywidgets, see [Best practices for using ipywidgets and Databricks widgets](https:\/\/docs.databricks.com\/notebooks\/ipywidgets.html#best-practices).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/ipywidgets.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### ipywidgets\n##### Requirements\n\n* ipywidgets are available in preview in Databricks Runtime 11.0 through Databricks Runtime 12.2 LTS, and are generally available in Databricks Runtime 13.0 and above. Support for Unity Catalog tables is available in Databricks Runtime 12.1 and above on Unity Catalog-enabled clusters.\n* To use ipywidgets on Databricks, your browser must be able to access the `databricks-dev-cloudfront.dev.databricks.com` domain. \nBy default, ipywidgets occupies port 6062. With Databricks Runtime 11.3 LTS and above, if you run into conflicts with third-party integrations such as Datadog, you can change the port using the following [Spark config](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration): \n`spark.databricks.driver.ipykernel.commChannelPort <port-number>` \nFor example: \n`spark.databricks.driver.ipykernel.commChannelPort 1234` \nThe Spark config must be set when the cluster is created.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/ipywidgets.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### ipywidgets\n##### Usage\n\nThe following code creates a histogram with a slider that can take on values between 3 and 10. The value of the widget determines the number of bins in the histogram. As you move the slider, the histogram updates immediately. See [the ipywidgets example notebook](https:\/\/docs.databricks.com\/notebooks\/ipywidgets.html#example-notebooks) to try this out. 
\n```\nimport ipywidgets as widgets\nfrom ipywidgets import interact\n\n# Load a dataset\nsparkDF = spark.read.csv(\"\/databricks-datasets\/bikeSharing\/data-001\/day.csv\", header=\"true\", inferSchema=\"true\")\n\n# In this code, `bins=(3, 10)` defines an integer slider widget that allows values between 3 and 10.\n@interact(bins=(3, 10))\ndef plot_histogram(bins):\n    pdf = sparkDF.toPandas()\n    pdf.hist(column='temp', bins=bins)\n\n``` \nThe following code creates an integer slider that can take on values between 0 and 10. The default value is 5. To access the value of the slider in your code, use `int_slider.value`. \n```\nimport ipywidgets as widgets\n\nint_slider = widgets.IntSlider(max=10, value=5)\nint_slider\n\n``` \nThe following code loads and displays a sample dataframe from a table in Unity Catalog. Support for Unity Catalog tables is available with Databricks Runtime 12.1 and above on Unity Catalog-enabled clusters. \n```\nimport ipywidgets as widgets\n\n# Create button widget. Clicking this button loads a sampled dataframe from UC table.\nbutton = widgets.Button(description=\"Load dataframe sample\")\n\n# Output widget to display the loaded dataframe\noutput = widgets.Output()\n\ndef load_sample_df(table_name):\n    return spark.sql(f\"SELECT * FROM {table_name} LIMIT 1000\")\n\ndef on_button_clicked(_):\n    with output:\n        output.clear_output()\n        df = load_sample_df('<catalog>.<schema>.<table>')\n        print(df.toPandas())\n\n# Register the button's callback function to query UC and display results to the output widget\nbutton.on_click(on_button_clicked)\n\ndisplay(button, output)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/ipywidgets.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### ipywidgets\n##### Notebook example: ipywidgets\n\nThe following notebook shows some examples of using ipywidgets in notebooks. 
\n### ipywidgets example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/ipywidgets.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n#### ipywidgets\n##### Notebook example: ipywidgets advanced example\n\nThe following notebook shows a more complex example using ipywidgets to create an interactive map. \n### Advanced example: maps with ipywidgets \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/ipywidgets-adv.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n#### ipywidgets\n##### Best practices for using ipywidgets and Databricks widgets\n\nTo add interactive controls to Python notebooks, Databricks recommends using ipywidgets. For notebooks in other languages, use [Databricks widgets](https:\/\/docs.databricks.com\/notebooks\/widgets.html). \nYou can use Databricks widgets to [pass parameters between notebooks](https:\/\/docs.databricks.com\/notebooks\/widgets.html#widgets-and-percent-run) and to pass parameters to jobs; ipywidgets do not support these scenarios.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/ipywidgets.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### ipywidgets\n##### Which third-party Jupyter widgets are supported in Databricks?\n\nDatabricks provides best-effort support for third-party widgets, such as [ipyleaflet](https:\/\/github.com\/jupyter-widgets\/ipyleaflet), [bqplot](https:\/\/github.com\/bqplot\/bqplot), and [VegaFusion](https:\/\/github.com\/vegafusion\/vegafusion). However, some third-party widgets are not supported. 
For a list of the widgets that have been tested in Databricks notebooks, contact your Databricks account team.\n\n#### ipywidgets\n##### Limitations\n\n* A notebook using ipywidgets must be attached to a running cluster.\n* Widget states are not preserved across notebook sessions. You must re-run widget cells to render them each time you attach the notebook to a cluster.\n* The Password and Controller ipywidgets are not supported.\n* HTMLMath and Label widgets with LaTeX expressions do not render correctly. (For example, `widgets.Label(value=r'$$\\frac{x+1}{x-1}$$')` does not render correctly.)\n* Widgets might not render properly if the notebook is in dark mode, especially colored widgets.\n* Widget outputs cannot be used in notebook dashboard views.\n* The maximum message payload size for an ipywidget is 5 MB. Widgets that use images or large text data may not be properly rendered.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/ipywidgets.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Open or run a Delta Live Tables pipeline from a notebook\n\nFor notebooks that are assigned to a Delta Live Tables pipeline, you can open the pipeline details, start a pipeline update, or delete a pipeline using the **Delta Live Tables** dropdown menu in the notebook toolbar. \nTo open the pipeline details, click **Delta Live Tables** and click the pipeline name, or click ![Jobs Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/jobs-vertical-ellipsis.png) **> View in Pipelines**. \nTo start an update of the pipeline, click **Delta Live Tables** and click **Start** next to the pipeline name. \nYou can track the progress of the update by viewing the event log. To view the event log, click **Delta Live Tables >** ![Jobs Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/jobs-vertical-ellipsis.png) **> View logs**. The event log opens in the notebook side panel. 
To view details for a log entry, click the entry. The **Pipeline event log details** pop-up appears. To view a JSON document containing the log details, click the **JSON** tab. \nTo delete a pipeline the notebook is assigned to, click **Delta Live Tables >** ![Jobs Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/jobs-vertical-ellipsis.png) **> Delete**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebooks-dlt-pipeline.html"} +{"content":"# \n### Legacy visualizations\n\nThis article describes legacy Databricks visualizations. See [Visualizations in Databricks notebooks](https:\/\/docs.databricks.com\/visualizations\/index.html) for current visualization support. \nDatabricks also natively supports visualization libraries in Python and R and lets you install and use third-party libraries.\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html"} +{"content":"# \n### Legacy visualizations\n#### Create a legacy visualization\n\nTo create a legacy visualization from a results cell, click **+** and select **Legacy Visualization**. \nLegacy visualizations support a rich set of plot types: \n![Chart types](https:\/\/docs.databricks.com\/_images\/display-charts.png) \n### Choose and configure a legacy chart type \nTo choose a bar chart, click the bar chart icon ![Chart Button](https:\/\/docs.databricks.com\/_images\/chart-button.png): \n![Bar chart icon](https:\/\/docs.databricks.com\/_images\/diamonds-bar-chart.png) \nTo choose another plot type, click ![Button Down](https:\/\/docs.databricks.com\/_images\/button-down.png) to the right of the bar chart ![Chart Button](https:\/\/docs.databricks.com\/_images\/chart-button.png) and choose the plot type. \n### Legacy chart toolbar \nBoth line and bar charts have a built-in toolbar that support a rich set of client-side interactions. \n![Chart toolbar](https:\/\/docs.databricks.com\/_images\/chart-toolbar.png) \nTo configure a chart, click **Plot Options\u2026**. 
\n![Plot options](https:\/\/docs.databricks.com\/_images\/plot-options.png) \nThe line chart has a few custom chart options: setting a Y-axis range, showing and hiding points, and displaying the Y-axis with a log scale. \nFor information about legacy chart types, see: \n* [Migrate legacy line charts](https:\/\/docs.databricks.com\/visualizations\/legacy-charts.html) \n### Color consistency across charts \nDatabricks supports two kinds of color consistency across legacy charts: series set and global. \n*Series set* color consistency assigns the same color to the same value if you have series with the\nsame values but in different orders (for example, A = `[\"Apple\", \"Orange\", \"Banana\"]` and B =\n`[\"Orange\", \"Banana\", \"Apple\"]`). The values are sorted before plotting, so both legends are sorted\nthe same way (`[\"Apple\", \"Banana\", \"Orange\"]`), and the same values are given the same colors. However,\nif you have a series C = `[\"Orange\", \"Banana\"]`, it would not be color consistent with set\nA because the set isn\u2019t the same. The sorting algorithm would assign the first color to \u201cBanana\u201d in\nset C but the second color to \u201cBanana\u201d in set A. If you want these series to be color consistent,\nyou can specify that charts should have global color consistency. \nIn *global* color consistency, each value is always mapped to the same color no matter what values\nthe series have. To enable this for each chart, select the **Global color consistency** checkbox. \n![Global color consistency](https:\/\/docs.databricks.com\/_images\/series-colors.gif) \nNote \nTo achieve this consistency, Databricks hashes directly from values to colors. 
To avoid\ncollisions (where two values go to the exact same color), the hash is to a large set of colors,\nwhich has the side effect that nice-looking or easily distinguishable colors cannot be guaranteed;\nwith many colors there are bound to be some that are very similar looking.\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html"} +{"content":"# \n### Legacy visualizations\n#### Machine learning visualizations\n\nIn addition to the standard chart types, legacy visualizations support the following machine learning training parameters and results: \n* [Residuals](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#residuals)\n* [ROC curves](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#roc-curves)\n* [Decision trees](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#decision-trees) \n### [Residuals](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#id2) \nFor linear and logistic regressions, you can render a [fitted versus residuals](https:\/\/en.wikipedia.org\/wiki\/Errors_and_residuals) plot. To obtain this plot, supply the model and DataFrame. \nThe following example runs a linear regression on city population to house sale price data and then displays the residuals versus the fitted data. 
\n```\n# Load data\npop_df = spark.read.csv(\"\/databricks-datasets\/samples\/population-vs-price\/data_geo.csv\", header=\"true\", inferSchema=\"true\")\n\n# Drop rows with missing values and rename the feature and label columns, replacing spaces with _\nfrom pyspark.sql.functions import col\npop_df = pop_df.dropna() # drop rows with missing values\nexprs = [col(column).alias(column.replace(' ', '_')) for column in pop_df.columns]\n\n# Register a UDF to convert the feature (2014_Population_estimate) column vector to a VectorUDT type and apply it to the column.\nfrom pyspark.ml.linalg import Vectors, VectorUDT\n\nspark.udf.register(\"oneElementVec\", lambda d: Vectors.dense([d]), returnType=VectorUDT())\ntdata = pop_df.select(*exprs).selectExpr(\"oneElementVec(2014_Population_estimate) as features\", \"2015_median_sales_price as label\")\n\n# Run a linear regression\nfrom pyspark.ml.regression import LinearRegression\n\nlr = LinearRegression()\nmodelA = lr.fit(tdata, {lr.regParam:0.0})\n\n# Plot residuals versus fitted data\ndisplay(modelA, tdata)\n\n``` \n![Display residuals](https:\/\/docs.databricks.com\/_images\/residuals.png) \n### [ROC curves](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#id3) \nFor logistic regressions, you can render an [ROC](https:\/\/en.wikipedia.org\/wiki\/Receiver_operating_characteristic) curve. To obtain this plot, supply the model, the prepped data that is input to the `fit` method, and the parameter `\"ROC\"`. \nThe following example develops a classifier that predicts if an individual earns <=50K or >50k a year from various attributes of the individual. The Adult dataset derives from census data, and consists of information about 48842 individuals and their annual income. \nThe example code in this section uses one-hot encoding. 
\n```\n\n# This code uses one-hot encoding to convert all categorical variables into binary vectors.\n\nschema = \"\"\"`age` DOUBLE,\n`workclass` STRING,\n`fnlwgt` DOUBLE,\n`education` STRING,\n`education_num` DOUBLE,\n`marital_status` STRING,\n`occupation` STRING,\n`relationship` STRING,\n`race` STRING,\n`sex` STRING,\n`capital_gain` DOUBLE,\n`capital_loss` DOUBLE,\n`hours_per_week` DOUBLE,\n`native_country` STRING,\n`income` STRING\"\"\"\n\ndataset = spark.read.csv(\"\/databricks-datasets\/adult\/adult.data\", schema=schema)\n\nfrom pyspark.ml import Pipeline\nfrom pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler\n\ncategoricalColumns = [\"workclass\", \"education\", \"marital_status\", \"occupation\", \"relationship\", \"race\", \"sex\", \"native_country\"]\n\nstages = [] # stages in the Pipeline\nfor categoricalCol in categoricalColumns:\n    # Category indexing with StringIndexer\n    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + \"Index\")\n    # Use OneHotEncoder to convert categorical variables into binary SparseVectors\n    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + \"classVec\"])\n    # Add stages. These are not run here, but will run all at once later on.\n    stages += [stringIndexer, encoder]\n\n# Convert label into label indices using the StringIndexer\nlabel_stringIdx = StringIndexer(inputCol=\"income\", outputCol=\"label\")\nstages += [label_stringIdx]\n\n# Transform all features into a vector using VectorAssembler\nnumericCols = [\"age\", \"fnlwgt\", \"education_num\", \"capital_gain\", \"capital_loss\", \"hours_per_week\"]\nassemblerInputs = [c + \"classVec\" for c in categoricalColumns] + numericCols\nassembler = VectorAssembler(inputCols=assemblerInputs, outputCol=\"features\")\nstages += [assembler]\n\n# Run the stages as a Pipeline. 
This puts the data through all of the feature transformations in a single call.\n\npartialPipeline = Pipeline().setStages(stages)\npipelineModel = partialPipeline.fit(dataset)\npreppedDataDF = pipelineModel.transform(dataset)\n\n# Fit logistic regression model\n\nfrom pyspark.ml.classification import LogisticRegression\nlrModel = LogisticRegression().fit(preppedDataDF)\n\n# ROC for data\ndisplay(lrModel, preppedDataDF, \"ROC\")\n\n``` \n![Display ROC](https:\/\/docs.databricks.com\/_images\/roc.png) \nTo display the residuals, omit the `\"ROC\"` parameter: \n```\ndisplay(lrModel, preppedDataDF)\n\n``` \n![Display logistic regression residuals](https:\/\/docs.databricks.com\/_images\/log-reg-residuals.png) \n### [Decision trees](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#id4) \nLegacy visualizations support rendering a decision tree. \nTo obtain this visualization, supply the decision tree model. \nThe following examples train a tree to recognize digits (0 - 9) from the MNIST dataset of images of handwritten digits and then display the tree. 
\n```\ntrainingDF = spark.read.format(\"libsvm\").load(\"\/databricks-datasets\/mnist-digits\/data-001\/mnist-digits-train.txt\").cache()\ntestDF = spark.read.format(\"libsvm\").load(\"\/databricks-datasets\/mnist-digits\/data-001\/mnist-digits-test.txt\").cache()\n\nfrom pyspark.ml.classification import DecisionTreeClassifier\nfrom pyspark.ml.feature import StringIndexer\nfrom pyspark.ml import Pipeline\n\nindexer = StringIndexer().setInputCol(\"label\").setOutputCol(\"indexedLabel\")\n\ndtc = DecisionTreeClassifier().setLabelCol(\"indexedLabel\")\n\n# Chain indexer + dtc together into a single ML Pipeline.\npipeline = Pipeline().setStages([indexer, dtc])\n\nmodel = pipeline.fit(trainingDF)\ndisplay(model.stages[-1])\n\n``` \n```\nval trainingDF = spark.read.format(\"libsvm\").load(\"\/databricks-datasets\/mnist-digits\/data-001\/mnist-digits-train.txt\").cache\nval testDF = spark.read.format(\"libsvm\").load(\"\/databricks-datasets\/mnist-digits\/data-001\/mnist-digits-test.txt\").cache\n\nimport org.apache.spark.ml.classification.{DecisionTreeClassifier, DecisionTreeClassificationModel}\nimport org.apache.spark.ml.feature.StringIndexer\nimport org.apache.spark.ml.Pipeline\n\nval indexer = new StringIndexer().setInputCol(\"label\").setOutputCol(\"indexedLabel\")\nval dtc = new DecisionTreeClassifier().setLabelCol(\"indexedLabel\")\nval pipeline = new Pipeline().setStages(Array(indexer, dtc))\n\nval model = pipeline.fit(trainingDF)\nval tree = model.stages.last.asInstanceOf[DecisionTreeClassificationModel]\n\ndisplay(tree)\n\n``` \n![Display decision tree](https:\/\/docs.databricks.com\/_images\/decision-tree.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html"} +{"content":"# \n### Legacy visualizations\n#### Structured Streaming DataFrames\n\nTo visualize the result of a streaming query in real time you can `display` a Structured Streaming DataFrame in Scala and Python. 
\n```\nstreaming_df = spark.readStream.format(\"rate\").load()\ndisplay(streaming_df.groupBy().count())\n\n``` \n```\nval streaming_df = spark.readStream.format(\"rate\").load()\ndisplay(streaming_df.groupBy().count())\n\n``` \n`display` supports the following optional parameters: \n* `streamName`: the streaming query name.\n* `trigger` (Scala) and `processingTime` (Python): defines how often the streaming query is run. If not specified, the system checks for availability of new data as soon as the previous processing has completed. To reduce the cost in production, Databricks recommends that you *always* set a trigger interval. The default trigger interval is 500 ms.\n* `checkpointLocation`: the location where the system writes all the checkpoint information. If it is not specified, the system automatically generates a temporary checkpoint location on DBFS. In order for your stream to continue processing data from where it left off, you must provide a checkpoint location. Databricks recommends that in production you *always* specify the `checkpointLocation` option. 
\n```\nstreaming_df = spark.readStream.format(\"rate\").load()\ndisplay(streaming_df.groupBy().count(), processingTime = \"5 seconds\", checkpointLocation = \"dbfs:\/<checkpoint-path>\")\n\n``` \n```\nimport org.apache.spark.sql.streaming.Trigger\n\nval streaming_df = spark.readStream.format(\"rate\").load()\ndisplay(streaming_df.groupBy().count(), trigger = Trigger.ProcessingTime(\"5 seconds\"), checkpointLocation = \"dbfs:\/<checkpoint-path>\")\n\n``` \nFor more information about these parameters, see [Starting Streaming Queries](http:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html#starting-streaming-queries).\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html"} +{"content":"# \n### Legacy visualizations\n#### `displayHTML` function\n\nDatabricks programming language notebooks (Python, R, and Scala) support HTML graphics using the `displayHTML` function;\nyou can pass the function any HTML, CSS, or JavaScript code. This function supports interactive graphics using JavaScript libraries such as D3. \nFor examples of using `displayHTML`, see: \n* [HTML, D3, and SVG in notebooks](https:\/\/docs.databricks.com\/visualizations\/html-d3-and-svg.html) \n* [Embed static images in notebooks](https:\/\/docs.databricks.com\/archive\/legacy\/filestore.html#static-images) \nNote \nThe `displayHTML` iframe is served from the domain `databricksusercontent.com`, and the iframe sandbox includes the `allow-same-origin` attribute. `databricksusercontent.com` must be accessible from your browser. If it is currently blocked by your corporate network, it must be added to an allow list.\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html"} +{"content":"# \n### Legacy visualizations\n#### Images\n\nColumns containing image data types are rendered as rich HTML. 
Databricks attempts to render image thumbnails for `DataFrame` columns matching the Spark [ImageSchema](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/ml\/image\/ImageSchema$.html).\nThumbnail rendering works for any images successfully read in through the `spark.read.format('image')` function. For image values generated through other means, Databricks supports the rendering of 1, 3, or 4 channel images (where each channel consists of a single byte), with the following constraints: \n* **One-channel images**: `mode` field must be equal to 0. `height`, `width`, and `nChannels`\nfields must accurately describe the binary image data in the `data` field.\n* **Three-channel images**: `mode` field must be equal to 16. `height`, `width`, and `nChannels`\nfields must accurately describe the binary image data in the `data` field. The `data` field\nmust contain pixel data in three-byte chunks, with the channel ordering `(blue, green, red)` for\neach pixel.\n* **Four-channel images**: `mode` field must be equal to 24. `height`, `width`, and `nChannels`\nfields must accurately describe the binary image data in the `data` field. The `data` field\nmust contain pixel data in four-byte chunks, with the channel ordering `(blue, green, red, alpha)`\nfor each pixel. 
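The constraints above can be checked before an image record is placed in a DataFrame. The following is a minimal sketch in plain Python, not part of any Databricks or Spark API: the helper name `thumbnail_problems` and its argument names are hypothetical, and it assumes the `data` field holds exactly one byte per channel per pixel, as the list describes.

```python
# Hypothetical helper (not a Databricks or Spark API): checks whether a
# hand-built image record meets the thumbnail constraints listed above.

# Required `mode` value for each supported channel count, taken from the list above.
EXPECTED_MODE = {1: 0, 3: 16, 4: 24}


def thumbnail_problems(height, width, n_channels, mode, data):
    """Return a list of constraint violations; an empty list means none were found."""
    problems = []
    if n_channels not in EXPECTED_MODE:
        problems.append(f"nChannels must be 1, 3, or 4, got {n_channels}")
        return problems
    if mode != EXPECTED_MODE[n_channels]:
        problems.append(
            f"mode must be {EXPECTED_MODE[n_channels]} for {n_channels} channel(s), got {mode}"
        )
    # Assumption: one byte per channel per pixel, so the length is fully determined.
    expected_len = height * width * n_channels
    if len(data) != expected_len:
        problems.append(f"data must hold {expected_len} bytes, got {len(data)}")
    return problems


# A 2x2 three-channel image; each pixel is stored as (blue, green, red).
blue_pixel = bytes([255, 0, 0])
bgr_data = blue_pixel * 4

print(thumbnail_problems(2, 2, 3, 16, bgr_data))  # consistent record
print(thumbnail_problems(2, 2, 3, 24, bgr_data))  # mode does not match three channels
```

The check only covers the fields named in the list; it cannot tell whether the bytes are actually in blue, green, red order, so channel ordering still has to be handled when the pixel data is produced.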
\n### Example \nSuppose you have a folder containing some images: \n![Folder of image data](https:\/\/docs.databricks.com\/_images\/sample-image-data.png) \nIf you read the images into a DataFrame and then display the DataFrame, Databricks renders thumbnails of the images: \n```\nimage_df = spark.read.format(\"image\").load(sample_img_dir)\ndisplay(image_df)\n\n``` \n![Display image DataFrame](https:\/\/docs.databricks.com\/_images\/image-data.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html"} +{"content":"# \n### Legacy visualizations\n#### Visualizations in Python\n\nIn this section: \n* [Seaborn](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#seaborn)\n* [Other Python libraries](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#other-python-libraries) \n### [Seaborn](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#id5) \nYou can also use other Python libraries to generate plots. The Databricks Runtime includes the [seaborn](https:\/\/seaborn.pydata.org\/) visualization library. To create a seaborn plot, import the library, create a plot, and pass the plot to the `display` function. 
\n```\nimport seaborn as sns\nsns.set(style=\"white\")\n\ndf = sns.load_dataset(\"iris\")\ng = sns.PairGrid(df, diag_sharey=False)\ng.map_lower(sns.kdeplot)\ng.map_diag(sns.kdeplot, lw=3)\n\ng.map_upper(sns.regplot)\n\ndisplay(g.fig)\n\n``` \n![Seaborn plot](https:\/\/docs.databricks.com\/_images\/seaborn-iris.png) \n### [Other Python libraries](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#id6) \n* [Bokeh](https:\/\/docs.databricks.com\/visualizations\/bokeh.html)\n* [Matplotlib](https:\/\/docs.databricks.com\/visualizations\/matplotlib.html)\n* [Plotly](https:\/\/docs.databricks.com\/visualizations\/plotly.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html"} +{"content":"# \n### Legacy visualizations\n#### Visualizations in R\n\nTo plot data in R, use the `display` function as follows: \n```\nlibrary(SparkR)\ndiamonds_df <- read.df(\"\/databricks-datasets\/Rdatasets\/data-001\/csv\/ggplot2\/diamonds.csv\", source = \"csv\", header=\"true\", inferSchema = \"true\")\n\ndisplay(arrange(agg(groupBy(diamonds_df, \"color\"), \"price\" = \"avg\"), \"color\"))\n\n``` \nYou can use the default R [plot](https:\/\/www.rdocumentation.org\/packages\/graphics\/versions\/3.6.2\/topics\/plot) function. \n```\nfit <- lm(Petal.Length ~., data = iris)\nlayout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs\/page\nplot(fit)\n\n``` \n![R default plot](https:\/\/docs.databricks.com\/_images\/r-iris.png) \nYou can also use any R visualization package. The R notebook captures the resulting plot as a `.png` and displays it inline. 
\nIn this section: \n* [Lattice](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#lattice)\n* [DandEFA](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#dandefa)\n* [Plotly](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#plotly)\n* [Other R libraries](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#other-r-libraries) \n### [Lattice](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#id7) \nThe [Lattice](https:\/\/www.statmethods.net\/advgraphs\/trellis.html) package supports trellis graphs\u2014graphs that display a variable or the relationship between variables, conditioned on one or more other variables. \n```\nlibrary(lattice)\nxyplot(price ~ carat | cut, diamonds, scales = list(log = TRUE), type = c(\"p\", \"g\", \"smooth\"), ylab = \"Log price\")\n\n``` \n![R Lattice plot](https:\/\/docs.databricks.com\/_images\/r-lattice.png) \n### [DandEFA](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#id8) \nThe [DandEFA](https:\/\/www.rdocumentation.org\/packages\/DandEFA\/versions\/1.6) package supports dandelion plots. 
\n```\ninstall.packages(\"DandEFA\", repos = \"https:\/\/cran.us.r-project.org\")\nlibrary(DandEFA)\ndata(timss2011)\ntimss2011 <- na.omit(timss2011)\ndandpal <- rev(rainbow(100, start = 0, end = 0.2))\nfacl <- factload(timss2011,nfac=5,method=\"prax\",cormeth=\"spearman\")\ndandelion(facl,bound=0,mcex=c(1,1.2),palet=dandpal)\nfacl <- factload(timss2011,nfac=8,method=\"mle\",cormeth=\"pearson\")\ndandelion(facl,bound=0,mcex=c(1,1.2),palet=dandpal)\n\n``` \n![R DandEFA plot](https:\/\/docs.databricks.com\/_images\/r-daefa.png) \n### [Plotly](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#id9) \nThe [Plotly](https:\/\/plotly.com\/r\/) R package relies on [htmlwidgets for R](https:\/\/www.htmlwidgets.org\/).\nFor installation instructions and a notebook, see [htmlwidgets](https:\/\/docs.databricks.com\/visualizations\/htmlwidgets.html). \n### [Other R libraries](https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html#id10) \n* [ggplot2](https:\/\/docs.databricks.com\/visualizations\/ggplot2.html)\n* [htmlwidgets](https:\/\/docs.databricks.com\/visualizations\/htmlwidgets.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html"} +{"content":"# \n### Legacy visualizations\n#### Visualizations in Scala\n\nTo plot data in Scala, use the `display` function as follows: \n```\nval diamonds_df = spark.read.format(\"csv\").option(\"header\",\"true\").option(\"inferSchema\",\"true\").load(\"\/databricks-datasets\/Rdatasets\/data-001\/csv\/ggplot2\/diamonds.csv\")\n\ndisplay(diamonds_df.groupBy(\"color\").avg(\"price\").orderBy(\"color\"))\n\n```\n\n### Legacy visualizations\n#### Deep dive notebooks for Python and Scala\n\nFor a deep dive into Python visualizations, see the notebook: \n* [Visualization deep dive in Python](https:\/\/docs.databricks.com\/visualizations\/charts-and-graphs-python.html) \nFor a deep dive into Scala visualizations, see the notebook: \n* [Visualization deep dive in 
Scala](https:\/\/docs.databricks.com\/visualizations\/charts-and-graphs-scala.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/legacy-visualizations.html"} +{"content":"# \n### Discover data\n\nDatabricks provides a suite of tools and products that simplify the discovery of data assets that are accessible through the Databricks Data Intelligence Platform. This article provides an opinionated overview of how you can discover and preview data that has already been configured for access in your workspace. \n* To connect to data sources, see [Connect to data sources](https:\/\/docs.databricks.com\/connect\/index.html).\n* For information about gaining access to data in the Databricks Marketplace, see [What is Databricks Marketplace?](https:\/\/docs.databricks.com\/marketplace\/index.html). \nTopics in this section focus on exploring data objects and data files. If you\u2019re looking for information about working with assets such as notebooks, SQL queries, libraries, and models, see [Navigate the workspace](https:\/\/docs.databricks.com\/workspace\/index.html). \nIf you\u2019re seeking guidance around generating summary statistics for datasets or other tasks associated with exploratory data analysis (EDA), see [Exploratory data analysis on Databricks: Tools and techniques](https:\/\/docs.databricks.com\/exploratory-data-analysis\/index.html).\n\n### Discover data\n#### How can you discover data assets?\n\nData discovery tools on Databricks fall into the following general categories: \n* AI-assisted insights, summary, and search.\n* Keyword search.\n* Catalog exploration using the UI.\n* Programmatic listing and metadata exploration. \nData discovery tools are optimized for data governed by Unity Catalog. 
Data assets that have not been registered as Unity Catalog objects might not be discoverable using some of these approaches.\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/index.html"} +{"content":"# \n### Discover data\n#### Find data using the UI\n\nCatalog Explorer provides tools for exploring and governing data assets. You access Catalog Explorer using the ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** in the workspace sidebar. See [What is Catalog Explorer?](https:\/\/docs.databricks.com\/catalog-explorer\/index.html). \nNotebooks and the SQL query editor also provide a catalog navigator for exploring database objects. Click the **Catalog** icon in these interfaces to expand or collapse the catalog navigator without leaving your code editor. \nOnce you\u2019ve discovered a dataset of interest, you can use the **Insights** tab to learn how the data is being used in your workspace. See [View frequent queries and users of a table](https:\/\/docs.databricks.com\/discover\/table-insights.html).\n\n### Discover data\n#### Explore data programmatically\n\nYou can use the `SHOW` command on all database objects to discover assets registered to Unity Catalog. Use the `LIST` command, the `%fs` magic command, or Databricks Utilities to list files. \nSee [Explore storage and find data files](https:\/\/docs.databricks.com\/discover\/files.html) and [Explore database objects](https:\/\/docs.databricks.com\/discover\/database-objects.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/index.html"} +{"content":"# \n### Discover data\n#### Review data comments\n\nYou can review comments to learn about the contents of datasets available in your lakehouse. Comments can be set on data objects including catalogs, schemas, tables, and columns. You can view comments in Catalog Explorer or using the `DESCRIBE` command for an object. 
\nCatalog Explorer can provide AI-generated comments for tables, which makes it easy for data asset owners to provide a rich overview of datasets. See [Add AI-generated comments to a table](https:\/\/docs.databricks.com\/catalog-explorer\/ai-comments.html). \nUsers can also optionally provide comments on tables and other database objects using markdown, which is rendered in Catalog Explorer. See [Document data in Catalog Explorer using markdown comments](https:\/\/docs.databricks.com\/catalog-explorer\/markdown-data-comments.html).\n\n### Discover data\n#### Search for tables in your lakehouse\n\nYou can use the search bar in Databricks to find tables registered to Unity Catalog. You can either perform a keyword search or use semantic search to find datasets or columns that relate to your search query. Search only returns results for tables that you have permission to see. Search reviews table names, column names, table comments, and column comments. See [Search for workspace objects](https:\/\/docs.databricks.com\/search\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/index.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Subscribe to Google Pub\/Sub\n\nDatabricks provides a built-in connector to subscribe to Google Pub\/Sub in Databricks Runtime 13.3 LTS and above. This connector provides exactly-once processing semantics for records from the subscriber. \nNote \nPub\/Sub might publish duplicate records, and records might arrive to the subscriber out of order. 
You should write Databricks code to handle duplicate and out-of-order records.\n\n#### Subscribe to Google Pub\/Sub\n##### Syntax example\n\nThe following code example demonstrates the basic syntax for configuring a Structured Streaming read from Pub\/Sub: \n```\nval authOptions: Map[String, String] =\nMap(\"clientId\" -> clientId,\n\"clientEmail\" -> clientEmail,\n\"privateKey\" -> privateKey,\n\"privateKeyId\" -> privateKeyId)\n\nval query = spark.readStream\n.format(\"pubsub\")\n\/\/ we will create a Pubsub subscription if none exists with this id\n.option(\"subscriptionId\", \"mysub\") \/\/ required\n.option(\"topicId\", \"mytopic\") \/\/ required\n.option(\"projectId\", \"myproject\") \/\/ required\n.options(authOptions)\n.load()\n\n``` \nFor more configuration options, see [Configure options for Pub\/Sub streaming read](https:\/\/docs.databricks.com\/connect\/streaming\/pub-sub.html#options).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/pub-sub.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Subscribe to Google Pub\/Sub\n##### Configure access to Pub\/Sub\n\nDatabricks recommends using secrets when providing authorization options. 
The following options are required to authorize a connection: \n* `clientEmail`\n* `clientId`\n* `privateKey`\n* `privateKeyId` \nThe following table describes the roles required for the configured credentials: \n| Roles | Required or optional | How it is used |\n| --- | --- | --- |\n| `roles\/pubsub.viewer` or `roles\/viewer` | Required | Check if subscription exists and get subscription |\n| `roles\/pubsub.subscriber` | Required | Fetch data from a subscription |\n| `roles\/pubsub.editor` or `roles\/editor` | Optional | Enables creation of a subscription if one doesn\u2019t exist and also enables use of the `deleteSubscriptionOnStreamStop` to delete subscriptions on stream termination |\n\n#### Subscribe to Google Pub\/Sub\n##### Pub\/Sub schema\n\nThe schema for the stream matches the records that are fetched from Pub\/Sub, as described in the following table: \n| Field | Type |\n| --- | --- |\n| `messageId` | `StringType` |\n| `payload` | `ArrayType[ByteType]` |\n| `attributes` | `StringType` |\n| `publishTimestampInMillis` | `LongType` |\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/pub-sub.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Subscribe to Google Pub\/Sub\n##### Configure options for Pub\/Sub streaming read\n\nThe following table describes the options supported for Pub\/Sub. All options are configured as part of a Structured Streaming read using `.option(\"<optionName>\", \"<optionValue>\")` syntax. \nNote \nSome Pub\/Sub configuration options use the concept of **fetches** instead of **micro-batches**. This reflects internal implementation details, and options work similarly to corollaries in other Structured Streaming connectors, except that records are fetched and then processed. \n| Option | Default value | Description |\n| --- | --- | --- |\n| `numFetchPartitions` | Set to one half of the number of executors present at stream initialization. 
| The number of parallel Spark tasks that fetch records from a subscription. |\n| `deleteSubscriptionOnStreamStop` | `false` | If `true`, the subscription passed to the stream is deleted when the streaming job ends. |\n| `maxBytesPerTrigger` | none | A soft limit for the batch size to be processed during each triggered micro-batch. |\n| `maxRecordsPerFetch` | 1000 | The number of records to fetch per task before processing records. |\n| `maxFetchPeriod` | 10 seconds | The time duration for each task to fetch before processing records. Databricks recommends using the default value. |\n\n#### Subscribe to Google Pub\/Sub\n##### Incremental batch processing semantics for Pub\/Sub\n\nYou can use `Trigger.AvailableNow` to consume available records from the Pub\/Sub sources as an incremental batch. \nDatabricks records the timestamp when you begin a read with the `Trigger.AvailableNow` setting. Records processed by the batch include all previously fetched data and any newly published records with a timestamp less than the recorded stream start timestamp. \nSee [Configuring incremental batch processing](https:\/\/docs.databricks.com\/structured-streaming\/triggers.html#available-now).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/pub-sub.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Subscribe to Google Pub\/Sub\n##### Monitoring streaming metrics\n\nStructured Streaming progress metrics report the number of records fetched and ready to process, the size of the records fetched and ready to process, and the number of duplicates seen since stream start. 
The following is an example of these metrics: \n```\n\"metrics\" : {\n\"numDuplicatesSinceStreamStart\" : \"1\",\n\"numRecordsReadyToProcess\" : \"1\",\n\"sizeOfRecordsReadyToProcess\" : \"8\"\n}\n\n```\n\n#### Subscribe to Google Pub\/Sub\n##### Limitations\n\nSpeculative execution (`spark.speculation`) is not supported with Pub\/Sub.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/pub-sub.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Configure data access for ingestion\n\nThis article describes how admin users can configure access to data in a bucket in Amazon S3 (S3) so that Databricks users can load data from S3 into a table in Databricks. \nThis article describes the following ways to configure secure access to source data: \n* (Recommended) Create a Unity Catalog volume.\n* Create a Unity Catalog external location with a storage credential. \n* Launch a compute resource that uses an AWS instance profile.\n* Generate temporary credentials (an AWS access key ID, a secret key, and a session token).\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/configure-data-access.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Configure data access for ingestion\n##### Before you begin\n\nBefore you configure access to data in S3, make sure you have the following: \n* Data in an S3 bucket in your AWS account. To create a bucket, see [Creating a bucket](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/create-bucket-overview.html) in the AWS documentation. \n* To access data using a Unity Catalog volume (recommended), the `READ VOLUME` privilege on the volume. 
For more information, see [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html) and [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html).\n* To access data using a Unity Catalog external location, the `READ FILES` privilege on the external location. For more information, see [Create an external location to connect cloud storage to Databricks](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html). \n* To access data using a compute resource with an AWS instance profile, Databricks workspace admin permissions. \n* A Databricks SQL [warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html). To create a SQL warehouse, see [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* Familiarity with the Databricks SQL user interface.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/configure-data-access.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Configure data access for ingestion\n##### Configure access to cloud storage\n\nUse one of the following methods to configure access to S3: \n* (Recommended) Create a Unity Catalog volume. For more information, see [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html).\n* Configure a Unity Catalog external location with a storage credential. For more information about external locations, see [Create an external location to connect cloud storage to Databricks](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html). \n* Configure a compute resource to use an AWS instance profile. 
For more information, see [Configure a SQL warehouse to use an instance profile](https:\/\/docs.databricks.com\/admin\/sql\/data-access-configuration.html#storage-access).\n* Generate temporary credentials (an AWS access key ID, a secret key, and a session token) to share with other Databricks users. For more information, see [Generate temporary credentials for ingestion](https:\/\/docs.databricks.com\/ingestion\/copy-into\/generate-temporary-credentials.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/configure-data-access.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Configure data access for ingestion\n##### Clean up\n\nYou can clean up the associated resources in your cloud account and Databricks if you no longer want to keep them. \n### Delete the AWS CLI named profile \nIn your `~\/.aws\/credentials` file for Unix, Linux, and macOS, or in your `%USERPROFILE%\\.aws\\credentials` file for Windows, remove the following portion of the file, and then save the file: \n```\n[<named-profile>]\naws_access_key_id = <access-key-id>\naws_secret_access_key = <secret-access-key>\n\n``` \n### Delete the IAM user \n1. Open the IAM console in your AWS account, typically at <https:\/\/console.aws.amazon.com\/iam>.\n2. In the sidebar, click **Users**.\n3. Select the box next to the user, and then click **Delete**.\n4. Enter the name of the user, and then click **Delete**. \n### Delete the IAM policy \n1. Open the IAM console in your AWS account, if it is not already open, typically at <https:\/\/console.aws.amazon.com\/iam>.\n2. In the sidebar, click **Policies**.\n3. Select the option next to the policy, and then click **Actions > Delete**.\n4. Enter the name of the policy, and then click **Delete**. \n### Delete the S3 bucket \n1. Open the Amazon S3 console in your AWS account, typically at <https:\/\/console.aws.amazon.com\/s3>.\n2. 
Select the option next to the bucket, and then click **Empty**.\n3. Enter `permanently delete`, and then click **Empty**.\n4. In the sidebar, click **Buckets**.\n5. Select the option next to the bucket, and then click **Delete**.\n6. Enter the name of the bucket, and then click **Delete bucket**. \n### Stop the SQL warehouse \nIf you are not using the SQL warehouse for any other tasks, you should stop the SQL warehouse to avoid additional costs. \n1. In the **SQL** persona, on the sidebar, click **SQL Warehouses**.\n2. Next to the name of the SQL warehouse, click **Stop**.\n3. When prompted, click **Stop** again.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/configure-data-access.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Configure data access for ingestion\n##### Next steps\n\nAfter you complete the steps in this article, users can run the `COPY INTO` command to load the data from the S3 bucket into your Databricks workspace. \n* To load data using a Unity Catalog volume or external location, see [Load data using COPY INTO with Unity Catalog volumes or external locations](https:\/\/docs.databricks.com\/ingestion\/copy-into\/unity-catalog.html). 
\n* To load data using a SQL warehouse with an AWS instance profile, see [Load data using COPY INTO with an instance profile](https:\/\/docs.databricks.com\/ingestion\/copy-into\/tutorial-dbsql.html).\n* To load data using temporary credentials (an AWS access key ID, a secret key, and a session token), see [Load data using COPY INTO with temporary credentials](https:\/\/docs.databricks.com\/ingestion\/copy-into\/temporary-credentials.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/configure-data-access.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Notebook outputs and results\n\nAfter you [attach a notebook to a cluster](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) and [run one or more cells](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html), your notebook has state and displays outputs. This section describes how to manage notebook state and outputs.\n\n#### Notebook outputs and results\n##### Clear notebooks state and outputs\n\nTo clear the notebook state and outputs, select one of the **Clear** options at the bottom of the **Run** menu. \n| Menu option | Description |\n| --- | --- |\n| Clear all cell outputs | Clears the cell outputs. This is useful if you are sharing the notebook and do not want to include any results. |\n| Clear state | Clears the notebook state, including function and variable definitions, data, and imported libraries. |\n| Clear state and outputs | Clears both cell outputs and the notebook state. |\n| Clear state and run all | Clears the notebook state and starts a new run. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-outputs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Notebook outputs and results\n##### Show results\n\nWhen a cell is run, table results return a maximum of 10,000 rows or 2 MB, whichever is less. 
\nBy default, text results return a maximum of 50,000 characters. With Databricks Runtime 12.2 LTS and above, you can increase this limit by setting the Spark configuration property `spark.databricks.driver.maxReplOutputLength`. \n### Explore SQL cell results in Python notebooks natively using Python \nYou can load data using SQL and explore it using Python. In a Databricks Python notebook, table results from a SQL language cell are automatically made available as a Python DataFrame. For details, see [Explore SQL cell results in Python notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#implicit-sql-df). \n### New cell result table \nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nYou can now select a new cell result table rendering. With the new result table, you can do the following: \n* Copy a column or other subset of tabular results to the clipboard.\n* Do a text search over the results table.\n* [Sort and filter data](https:\/\/docs.databricks.com\/notebooks\/notebook-outputs.html#filter).\n* Navigate between table cells using the keyboard arrow keys.\n* Select part of a column name or cell value by double-clicking and dragging to select the desired text. \nTo enable the new result table, click **New result table** in the upper-right corner of the cell results, and change the toggle selector from **OFF** to **ON**. \n![new result display selector](https:\/\/docs.databricks.com\/_images\/new-results-table.gif) \nWhen the feature is on, you can click column or row headers to select entire columns or rows, and you can click in the upper-left cell of the table to select the entire table. You can drag your cursor across any rectangular set of cells to select them. \nTo copy the selected data to the clipboard, press `Cmd + c` on MacOS or `Ctrl + c` on Windows, or right-click and select **Copy** from the drop-down menu. 
\nTo search for text in the results table, enter the text in the **Search** box. Matching cells are highlighted. \nTo open a side panel that displays information about the selection, click the panel icon ![panel icon](https:\/\/docs.databricks.com\/_images\/panel-icon.png) in the upper-right corner, next to the **Search** box. \n![location of panel icon](https:\/\/docs.databricks.com\/_images\/panel-icon-loc.png) \nColumn headers indicate the data type of the column. For example, ![indicator for integer type column](https:\/\/docs.databricks.com\/_images\/integer-type-column.png) indicates integer data type. Hover over the indicator to see the data type. \n### Sort and filter results \nWhen you use the new cell result table rendering, you can sort and filter results. \nTo sort the table by the values in a column, hover your cursor over the column name. At the right of the cell containing the column name, an icon appears. Click the arrow to sort the column. Successive clicks toggle through sorting in ascending order, descending order, or unsorted. \n![how to sort a column](https:\/\/docs.databricks.com\/_images\/sort.gif) \nTo sort by multiple columns, hold down the **Shift** key as you click the sort arrow for the columns. \nTo create a filter, click ![filter icon](https:\/\/docs.databricks.com\/_images\/filter-icon.png) at the upper-right of the cell results. In the dialog that appears, select the column to filter on and the filter rule and value to apply. For example: \n![filter example](https:\/\/docs.databricks.com\/_images\/filter-example.png) \nTo add another filter, click ![add filter button](https:\/\/docs.databricks.com\/_images\/add-filter.png). \nTo temporarily enable or disable a filter, toggle the **Enabled\/Disabled** button in the dialog. To delete a filter, click the X next to the filter name ![delete filter X](https:\/\/docs.databricks.com\/_images\/remove-filter.png). 
\nTo filter by a specific value, right-click on a cell with that value and select **Filter by this value** from the drop-down menu. \n![specific value](https:\/\/docs.databricks.com\/_images\/filter-value.png) \nYou can also create a filter from the kebab menu in the column name: \n![filter kebab menu](https:\/\/docs.databricks.com\/_images\/filter-kebab-menu.png) \nFilters are applied only to the results shown in the results table. If the data returned is truncated (for example, when a query returns more than 64,000 rows), the filter is applied only to the returned rows.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-outputs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Notebook outputs and results\n##### Download results\n\nBy default, downloading results is enabled. To toggle this setting, see [Manage the ability to download results from notebooks](https:\/\/docs.databricks.com\/admin\/workspace-settings\/notebooks.html#manage-download-results). \nYou can download a cell result that contains tabular output to your local machine. Click the downward-pointing arrow next to the tab title. The menu options depend on the number of rows in the result and on the Databricks Runtime version. Downloaded results are saved on your local machine as a CSV file named `export.csv`. \n![Download cell results](https:\/\/docs.databricks.com\/_images\/download-result.png)\n\n#### Notebook outputs and results\n##### View multiple outputs per cell\n\nPython notebooks and `%python` cells in non-Python notebooks support multiple outputs per cell. 
For example, the output of the following code includes both the plot and the table: \n```\nimport pandas as pd\nfrom sklearn.datasets import load_iris\n\ndata = load_iris()\niris = pd.DataFrame(data=data.data, columns=data.feature_names)\nax = iris.plot()\nprint(\"plot\")\ndisplay(ax)\nprint(\"data\")\ndisplay(iris)\n\n```\n\n#### Notebook outputs and results\n##### Commit notebook outputs in Databricks Git folders\n\nTo learn about committing .ipynb notebook outputs, see [Allow committing .ipynb notebook output](https:\/\/docs.databricks.com\/repos\/manage-assets.html#ipynb-repos). \n* The notebook must be an .ipynb file\n* Workspace admin settings must allow notebook outputs to be committed.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-outputs.html"} +{"content":"# AI and Machine Learning on Databricks\n### Model training examples\n\nThis section includes examples showing how to train machine learning models on Databricks using many popular open-source libraries. 
\nYou can also use [AutoML](https:\/\/docs.databricks.com\/machine-learning\/automl\/index.html), which automatically prepares a dataset for model training, performs a set of trials using open-source libraries such as scikit-learn and XGBoost, and creates a Python notebook with the source code for each trial run so you can review, reproduce, and modify the code.\n\n### Model training examples\n#### Machine learning examples\n\n| Package | Notebook(s) | Features |\n| --- | --- | --- |\n| scikit-learn | [Machine learning tutorial](https:\/\/docs.databricks.com\/machine-learning\/train-model\/scikit-learn.html#basic-example) | Unity Catalog, classification model, MLflow, automated hyperparameter tuning with Hyperopt and MLflow |\n| scikit-learn | [End-to-end example](https:\/\/docs.databricks.com\/machine-learning\/train-model\/scikit-learn.html#e2e-example) | Unity Catalog, classification model, MLflow, automated hyperparameter tuning with Hyperopt and MLflow, XGBoost |\n| MLlib | [MLlib examples](https:\/\/docs.databricks.com\/machine-learning\/train-model\/mllib.html) | Binary classification, decision trees, GBT regression, Structured Streaming, custom transformer |\n| xgboost | [XGBoost examples](https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost.html) | Python, PySpark, and Scala, single node workloads and distributed training |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n### Model training examples\n#### Hyperparameter tuning examples\n\nFor general information about hyperparameter tuning in Databricks, see [Hyperparameter tuning](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/index.html). 
\n| Package | Notebook | Features |\n| --- | --- | --- |\n| Hyperopt | [Distributed hyperopt](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-spark-mlflow-integration.html) | Distributed hyperopt, scikit-learn, MLflow |\n| Hyperopt | [Compare models](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-model-selection.html) | Use distributed hyperopt to search hyperparameter space for different model types simultaneously |\n| Hyperopt | [Distributed training algorithms and hyperopt](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-distributed-ml.html) | Hyperopt, MLlib |\n| Hyperopt | [Hyperopt best practices](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-best-practices.html) | Best practices for datasets of different sizes |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/index.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### What are Python user-defined table functions?\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nA user-defined table function (UDTF) allows you to register functions that return tables instead of scalar values. UDTFs function similarly to common table expressions (CTEs) when referenced in SQL queries. You reference UDTFs in the `FROM` clause of a SQL statement, and you can chain additional Spark SQL operators to the results. \nUDTFs are registered to the local SparkSession and are isolated at the notebook or job level. \nUDTFs are supported on compute configured with assigned or no-isolation shared access modes. You cannot use UDTFs on shared access mode. 
\nYou cannot register UDTFs as objects in Unity Catalog, and UDTFs cannot be used with SQL warehouses.\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/python-udtf.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### What are Python user-defined table functions?\n##### What is the basic syntax for a UDTF?\n\nApache Spark implements Python UDTFs as Python classes with a mandatory `eval` method. \nYou emit results as rows using `yield`. \nFor Apache Spark to use your class as a UDTF, you must import the PySpark `udtf` function. \nDatabricks recommends using this function as a decorator and always explicitly specifying field names and types using the `returnType` option. \nThe following example creates a simple table from scalar inputs using a UDTF: \n```\nfrom pyspark.sql.functions import lit, udtf\n\n@udtf(returnType=\"sum: int, diff: int\")\nclass SimpleUDTF:\n    def eval(self, x: int, y: int):\n        yield x + y, x - y\n\nSimpleUDTF(lit(1), lit(2)).show()\n# +----+-----+\n# | sum| diff|\n# +----+-----+\n# |   3|   -1|\n# +----+-----+\n\n``` \nYou can use Python `*args` syntax and implement logic to handle an unspecified number of input values. 
The following example returns the same result while explicitly checking the input length and types for the arguments: \n```\n@udtf(returnType=\"sum: int, diff: int\")\nclass SimpleUDTF:\n    def eval(self, *args):\n        assert len(args) == 2\n        # all() is required here: a bare generator expression is always truthy\n        assert all(isinstance(arg, int) for arg in args)\n        x = args[0]\n        y = args[1]\n        yield x + y, x - y\n\nSimpleUDTF(lit(1), lit(2)).show()\n# +----+-----+\n# | sum| diff|\n# +----+-----+\n# |   3|   -1|\n# +----+-----+\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/python-udtf.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### What are Python user-defined table functions?\n##### Register a UDTF\n\nYou can register a UDTF to the current SparkSession for use in SQL queries using the following syntax: \n```\nspark.udtf.register(\"<udtf-sql-name>\", <udtf-python-name>)\n\n``` \nThe following example registers a Python UDTF to SQL: \n```\nspark.udtf.register(\"simple_udtf\", SimpleUDTF)\n\n``` \nOnce registered, you can use the UDTF in SQL using either the `%sql` magic command or `spark.sql()` function, as in the following examples: \n```\n%sql\nSELECT * FROM simple_udtf(1,2);\n\n``` \n```\nspark.sql(\"SELECT * FROM simple_udtf(1,2);\")\n\n```\n\n#### What are Python user-defined table functions?\n##### Yielding results\n\nPython UDTFs are implemented with `yield` to return results. Results are always returned as a table containing 0 or more rows with the specified schema. \nWhen passing scalar arguments, logic in the `eval` method runs exactly once with the set of scalar arguments passed. For table arguments, the `eval` method runs once for each row in the input table. \nLogic can be written to return 0, 1, or many rows per input. 
\nThe following UDTF demonstrates returning 0 or more rows for each input by separating items from a comma-separated list into separate entries: \n```\nfrom pyspark.sql.functions import udtf\n\n@udtf(returnType=\"id: int, item: string\")\nclass Itemize:\n    def eval(self, id: int, item_list: str):\n        items = item_list.split(\",\")\n        for item in items:\n            if item != \"\":\n                yield id, item\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/python-udtf.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### What are Python user-defined table functions?\n##### Pass a table argument to a UDTF\n\nYou can use the SQL keyword `TABLE()` to pass a table argument to a UDTF. You can use a table name or a query, as in the following examples: \n```\nTABLE(table_name);\nTABLE(SELECT * FROM table_name);\n\n``` \nTable arguments are processed one row at a time. You can use standard PySpark column field annotations to interact with columns in each row. The following example demonstrates explicitly importing the PySpark `Row` type and then filtering the passed table on the `id` field: \n```\nfrom pyspark.sql.functions import udtf\nfrom pyspark.sql.types import Row\n\n@udtf(returnType=\"id: int\")\nclass FilterUDTF:\n    def eval(self, row: Row):\n        if row[\"id\"] > 5:\n            yield row[\"id\"],\n\nspark.udtf.register(\"filter_udtf\", FilterUDTF)\n\nspark.sql(\"SELECT * FROM filter_udtf(TABLE(SELECT * FROM range(10)))\").show()\n# +---+\n# | id|\n# +---+\n# |  6|\n# |  7|\n# |  8|\n# |  9|\n# +---+\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/python-udtf.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### What are Python user-defined table functions?\n##### Pass scalar arguments to a UDTF\n\nYou can pass scalar arguments to a UDTF using any combination of the following values: \n* Scalar constants\n* Scalar functions\n* Fields in a relation \nTo pass fields in a relation, you must register the UDTF and use the 
SQL `LATERAL` keyword. \nNote \nYou can use in-line table aliases to disambiguate columns. \nThe following example demonstrates using `LATERAL` to pass fields from a table to a UDTF: \n```\nfrom pyspark.sql.functions import udtf\n\n@udtf(returnType=\"id: int, item: string\")\nclass Itemize:\n    def eval(self, id: int, item_list: str):\n        items = item_list.split(\",\")\n        for item in items:\n            if item != \"\":\n                yield id, item\n\nspark.udtf.register(\"itemize\", Itemize)\n\nspark.sql(\"\"\"\nSELECT b.id, b.item FROM VALUES (1, 'pots,pans,forks'),\n    (2, 'spoons,'),\n    (3, ''),\n    (4, 'knives,cups') t(id, item_list),\n    LATERAL itemize(id, item_list) b\n\"\"\").show()\n\n```\n\n#### What are Python user-defined table functions?\n##### Set default values for UDTFs\n\nYou can optionally implement an `__init__` method to set default values for class variables you can reference in your Python logic. \nThe `__init__` method does not accept any arguments and has no access to variables or state information in the SparkSession.\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/python-udtf.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### What are Python user-defined table functions?\n##### Use Apache Arrow with UDTFs\n\nDatabricks recommends using Apache Arrow for UDTFs that receive a small amount of data as input but output a large table. \nYou can enable Arrow by specifying the `useArrow` parameter when declaring the UDTF, as in the following example: \n```\nfrom pyspark.sql.functions import udtf\n\n@udtf(returnType=\"c1: int, c2: int\", useArrow=True)\nclass PlusOne:\n    def eval(self, x: int):\n        yield x, x + 1\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/python-udtf.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Configure settings for Databricks jobs\n\nThis article provides details on configuring Databricks Jobs and individual job tasks in the Jobs UI. 
To learn about using the Databricks CLI to edit job settings, run the CLI command `databricks jobs update -h`. To learn about using the Jobs API, see the [Jobs API](https:\/\/docs.databricks.com\/api\/workspace\/jobs). \nSome configuration options are available on the job, and other options are available on individual tasks. For example, the maximum concurrent runs can be set only on the job, while retry policies are defined for each task.\n\n#### Configure settings for Databricks jobs\n##### Edit a job\n\nTo change the configuration for a job: \n1. Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar.\n2. In the **Name** column, click the job name. \nThe side panel displays the **Job details**. You can change the trigger for the job, compute configuration, [notifications](https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html), the maximum number of concurrent runs, configure duration thresholds, and add or change tags. If [job access control](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#jobs_acl_user_guide) is enabled, you can also edit job permissions.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Configure settings for Databricks jobs\n##### Add parameters for all job tasks\n\nYou can configure parameters on a job that are passed to any of the job\u2019s tasks that accept key-value parameters, including Python wheel files configured to accept keyword arguments. Parameters set at the job level are added to configured task-level parameters. Job parameters passed to tasks are visible in the task configuration, along with any parameters configured on the task. \nYou can also pass job parameters to tasks that are not configured with key-value parameters such as `JAR` or `Spark Submit` tasks. 
To pass job parameters to these tasks, format arguments as `{{job.parameters.[name]}}`, replacing `[name]` with the `key` that identifies the parameter. \nJob parameters take precedence over task parameters. If a job parameter and a task parameter have the same key, the job parameter overrides the task parameter. \nYou can override configured job parameters or add new job parameters when you [run a job with different parameters](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#job-run-with-different-params) or [repair a job run](https:\/\/docs.databricks.com\/workflows\/jobs\/repair-job-failures.html#repair-run). \nYou can also share context about jobs and tasks using a set of [dynamic value references](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html). \nTo add job parameters, click **Edit parameters** in the **Job details** side panel and specify the key and default value of each parameter. To view a list of available dynamic value references, click **Browse dynamic values**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Configure settings for Databricks jobs\n##### Add tags to a job\n\nTo add labels or key:value attributes to your job, you can add *tags* when you edit the job. You can use tags to filter jobs in the [Jobs list](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#view-jobs-list); for example, you can use a `department` tag to filter all jobs that belong to a specific department. \nNote \nBecause job tags are not designed to store sensitive information such as personally identifiable information or passwords, Databricks recommends using tags for non-sensitive values only. 
\nTags also propagate to job clusters created when a job is run, allowing you to use tags with your existing [cluster monitoring](https:\/\/docs.databricks.com\/admin\/account-settings\/usage-detail-tags.html). \nTo add or edit tags, click **+ Tag** in the **Job details** side panel. You can add the tag as a key and value or a label. To add a label, enter the label in the **Key** field and leave the **Value** field empty.\n\n#### Configure settings for Databricks jobs\n##### Configure shared clusters\n\nTo see tasks associated with a cluster, click the **Tasks** tab and hover over the cluster in the side panel. To change the cluster configuration for all associated tasks, click **Configure** under the cluster. To configure a new cluster for all associated tasks, click **Swap** under the cluster.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Configure settings for Databricks jobs\n##### Control access to a job\n\nJob access control enables job owners and administrators to grant fine-grained permissions on their jobs. Job owners can choose which other users or groups can view the job results. Owners can also choose who can manage their job runs (Run now and Cancel run permissions). \nFor information on job permission levels, see [Job ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#jobs). \nYou must have CAN MANAGE or IS OWNER permission on the job in order to manage permissions on it. \n1. In the sidebar, click **Job Runs**.\n2. Click the name of a job.\n3. In the **Job details** panel, click **Edit permissions**.\n4. In **Permission Settings**, click the **Select User, Group or Service Principal\u2026** drop-down menu and select a user, group, or service principal. \n![Permissions Settings dialog](https:\/\/docs.databricks.com\/_images\/select-permission-job.png)\n5. Click **Add**.\n6. Click **Save**. 
\n### Manage the job owner \nBy default, the creator of a job has the IS OWNER permission and is the user in the job\u2019s **Run as** setting. Jobs run as the identity of the user in the **Run as** setting. For more information on the **Run as** setting, see [Run a job as a service principal](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#run-as-sp). \nWorkspace admins can change the job owner to themselves. When ownership is transferred, the previous owner is granted the CAN MANAGE permission. \nNote \nWhen the `RestrictWorkspaceAdmins` setting on a workspace is set to `ALLOW ALL`, workspace admins can change a job owner to any user or service principal in their workspace. To restrict workspace admins to changing a job owner only to themselves, see [Restrict workspace admins](https:\/\/docs.databricks.com\/admin\/workspace-settings\/restrict-workspace-admins.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Configure settings for Databricks jobs\n##### Configure maximum concurrent runs\n\nClick **Edit concurrent runs** under **Advanced settings** to set the maximum number of parallel runs for this job. Databricks skips the run if the job has already reached its maximum number of active runs when attempting to start a new run. Set this value higher than the default of 1 to perform multiple runs of the same job concurrently. This is useful, for example, if you trigger your job on a frequent schedule and want to allow consecutive runs to overlap with each other or you want to trigger multiple runs that differ by their input parameters.\n\n#### Configure settings for Databricks jobs\n##### Enable queueing of job runs\n\nTo enable runs of a job to be placed in a queue to run later when they cannot run immediately because of concurrency limits, click the **Queue** toggle under **Advanced settings**. 
See [What if my job cannot run because of concurrency limits?](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#job-queueing). \nNote \nQueueing is enabled by default for jobs that were created through the UI after April 15, 2024.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Configure settings for Databricks jobs\n##### Configure an expected completion time or a timeout for a job\n\nYou can configure optional duration thresholds for a job, including an expected completion time for the job and a maximum completion time for the job. To configure duration thresholds, click **Set duration thresholds**. \nTo configure an expected completion time for the job, enter the expected duration in the **Warning** field. If the job exceeds this threshold, you can configure notifications for the slow running job. See [Configure notifications for slow running or late jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html#configure-duration-warning). \nTo configure a maximum completion time for a job, enter the maximum duration in the **Timeout** field. If the job does not complete in this time, Databricks sets its status to \u201cTimed Out\u201d and the job is stopped.\n\n#### Configure settings for Databricks jobs\n##### Edit a task\n\nTo set task configuration options: \n1. Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar.\n2. In the **Name** column, click the job name.\n3. 
Click the **Tasks** tab and select the task to edit.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Configure settings for Databricks jobs\n##### Define task dependencies\n\nYou can define the order of execution of tasks in a job using the **Depends on** drop-down menu. You can set this field to one or more tasks in the job. \n![Edit task dependencies](https:\/\/docs.databricks.com\/_images\/task-dependencies.png) \nNote \n**Depends on** is not visible if the job consists of only one task. \nConfiguring task dependencies creates a Directed Acyclic Graph (DAG) of task execution, a common way of representing execution order in job schedulers. For example, consider the following job consisting of four tasks: \n![Task dependencies example diagram](https:\/\/docs.databricks.com\/_images\/task-dependencies-diagram.png) \n* Task 1 is the root task and does not depend on any other task.\n* Task 2 and Task 3 depend on Task 1 completing first.\n* Finally, Task 4 depends on Task 2 and Task 3 completing successfully. \nDatabricks runs upstream tasks before running downstream tasks, running as many of them in parallel as possible. The following diagram illustrates the order of processing for these tasks: \n![Task dependencies example flow](https:\/\/docs.databricks.com\/_images\/task-dependencies-flow.png)\n\n#### Configure settings for Databricks jobs\n##### Configure a cluster for a task\n\nTo configure the cluster where a task runs, click the **Cluster** drop-down menu. You can edit a shared job cluster, but you cannot delete a shared cluster if other tasks still use it. 
\nTo learn more about selecting and configuring clusters to run tasks, see [Use Databricks compute with your jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/use-compute.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Configure settings for Databricks jobs\n##### Configure dependent libraries\n\nDependent libraries will be installed on the cluster before the task runs. You must set all task dependencies to ensure they are installed before the run starts. Follow the recommendations in [Manage library dependencies](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-jars-in-workflows.html#library-dependencies-management) for specifying dependencies.\n\n#### Configure settings for Databricks jobs\n##### Configure an expected completion time or a timeout for a task\n\nYou can configure optional duration thresholds for a task, including an expected completion time for the task and a maximum completion time for the task. To configure duration thresholds, click **Duration threshold**. \nTo configure the task\u2019s expected completion time, enter the duration in the **Warning** field. If the task exceeds this threshold, an event is triggered. You can use this event to notify when a task is running slowly. See [Configure notifications for slow running or late jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html#configure-duration-warning). \nTo configure a maximum completion time for a task, enter the maximum duration in the **Timeout** field. If the task does not complete in this time, Databricks sets its status to \u201cTimed Out\u201d.\n\n#### Configure settings for Databricks jobs\n##### Configure a retry policy for a task\n\nTo configure a policy that determines when and how many times failed task runs are retried, click **+ Add** next to **Retries**. 
The retry interval is calculated in milliseconds between the start of the failed run and the subsequent retry run. \nNote \nIf you configure both **Timeout** and **Retries**, the timeout applies to each retry.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html"} +{"content":"# \n","doc_uri":"https:\/\/docs.databricks.com\/languages\/index.html"} +{"content":"# \n### Develop on Databricks\n\nDatabricks actively supports developers who want to use their favorite language or tool to harness Databricks functionality. The following table provides an overview of developer-focused Databricks features and integrations, which includes Python, R, Scala, and SQL language support and many other tools that enable automating and streamlining your organization\u2019s ETL pipelines and software development lifecycle. \n| If you are a\u2026 | Check out these Databricks features and tools\u2026 |\n| --- | --- |\n| Python developer | [Databricks for Python developers](https:\/\/docs.databricks.com\/languages\/python.html) [Databricks SDK for Python](https:\/\/docs.databricks.com\/dev-tools\/sdk-python.html) [PyCharm with Databricks](https:\/\/docs.databricks.com\/dev-tools\/pycharm.html) [Visual Studio Code with Databricks Connect for Python](https:\/\/docs.databricks.com\/dev-tools\/databricks-connect\/python\/vscode.html) [Eclipse with PyDev and Databricks Connect for Python](https:\/\/docs.databricks.com\/dev-tools\/databricks-connect\/python\/eclipse.html) |\n| R developer | [Databricks for R developers](https:\/\/docs.databricks.com\/sparkr\/index.html) [Databricks SDK for R](https:\/\/docs.databricks.com\/dev-tools\/sdk-r.html) [RStudio Desktop with Databricks](https:\/\/docs.databricks.com\/dev-tools\/rstudio.html) |\n| Scala developer | [Databricks for Scala developers](https:\/\/docs.databricks.com\/languages\/scala.html) [Visual Studio Code with Databricks Connect for 
Scala](https:\/\/docs.databricks.com\/dev-tools\/databricks-connect\/scala\/vscode.html) [IntelliJ IDEA with Databricks Connect for Scala](https:\/\/docs.databricks.com\/dev-tools\/databricks-connect\/scala\/intellij-idea.html) |\n| Java developer | [Databricks SDK for Java](https:\/\/docs.databricks.com\/dev-tools\/sdk-java.html) [IntelliJ IDEA with Databricks Connect for Java](https:\/\/docs.databricks.com\/dev-tools\/intellij-idea.html) [Eclipse with Databricks Connect for Java](https:\/\/docs.databricks.com\/dev-tools\/eclipse.html) |\n| Go developer | [Databricks SQL Driver for Go](https:\/\/docs.databricks.com\/dev-tools\/go-sql-driver.html) [Databricks SDK for Go](https:\/\/docs.databricks.com\/dev-tools\/sdk-go.html) |\n| SQL expert | [SQL language reference](https:\/\/docs.databricks.com\/sql\/language-manual\/index.html) [Databricks SQL database tools](https:\/\/docs.databricks.com\/dev-tools\/index-sql.html) [SQL connectors, drivers, and APIs](https:\/\/docs.databricks.com\/dev-tools\/index-driver.html) |\n| DevOps engineer | [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) [dbutils](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html) [CI\/CD on Databricks](https:\/\/docs.databricks.com\/dev-tools\/index-ci-cd.html) [Databricks Asset Bundles](https:\/\/docs.databricks.com\/dev-tools\/bundles\/index.html) [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) |\n\n","doc_uri":"https:\/\/docs.databricks.com\/languages\/index.html"} +{"content":"# What is Delta Lake?\n### What are deletion vectors?\n\nDeletion vectors are a storage optimization feature that can be enabled on Delta Lake tables. By default, when a single row in a data file is deleted, the entire Parquet file containing the record must be rewritten. 
With deletion vectors enabled for the table, `DELETE`, `UPDATE`, and `MERGE` operations use deletion vectors to mark existing rows as removed or changed without rewriting the Parquet file. Subsequent reads on the table resolve current table state by applying the deletions noted by deletion vectors to the most recent table version. \nDatabricks recommends using Databricks Runtime 14.3 LTS and above to write tables with deletion vectors to leverage all optimizations. You can read tables with deletion vectors enabled in Databricks Runtime 12.2 LTS and above. \nIn Databricks Runtime 14.2 and above, tables with deletion vectors support row-level concurrency. See [Write conflicts with row-level concurrency](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html#rlc-conflicts). \nNote \nPhoton leverages deletion vectors for predictive I\/O updates, accelerating `DELETE`, `MERGE`, and `UPDATE` operations. All clients that support reading deletion vectors can read updates that produced deletion vectors, regardless of whether these updates were produced by predictive I\/O. See [Use predictive I\/O to accelerate updates](https:\/\/docs.databricks.com\/optimizations\/predictive-io.html#updates).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/deletion-vectors.html"} +{"content":"# What is Delta Lake?\n### What are deletion vectors?\n#### Enable deletion vectors\n\nImportant \nA workspace admin setting controls whether deletion vectors are auto-enabled for new Delta tables. See [Auto-enable deletion vectors](https:\/\/docs.databricks.com\/admin\/workspace-settings\/deletion-vectors.html). \nYou enable support for deletion vectors on a Delta Lake table by setting a Delta Lake table property. 
You can enable deletion vectors when you create a table or by altering an existing table, as in the following examples: \n```\nCREATE TABLE <table-name> [options] TBLPROPERTIES ('delta.enableDeletionVectors' = true);\n\nALTER TABLE <table-name> SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);\n\n``` \nWarning \nWhen you enable deletion vectors, the table protocol is upgraded. After upgrading, the table will not be readable by Delta Lake clients that do not support deletion vectors. See [How does Databricks manage Delta Lake feature compatibility?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html). \nIn Databricks Runtime 14.1 and above, you can drop the deletion vectors table feature to enable compatibility with other Delta clients. See [Drop Delta table features](https:\/\/docs.databricks.com\/delta\/drop-feature.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/deletion-vectors.html"} +{"content":"# What is Delta Lake?\n### What are deletion vectors?\n#### Apply changes to Parquet data files\n\nDeletion vectors indicate changes to rows as soft-deletes that logically modify existing Parquet data files in the Delta Lake table. These changes are applied physically when data files are rewritten, as triggered by one of the following events: \n* An `OPTIMIZE` command is run on the table.\n* Auto-compaction triggers a rewrite of a data file with a deletion vector.\n* `REORG TABLE ... APPLY (PURGE)` is run against the table. \nEvents related to file compaction do not have strict guarantees for resolving changes recorded in deletion vectors, and some changes recorded in deletion vectors might not be applied if target data files would not otherwise be candidates for file compaction. `REORG TABLE ... APPLY (PURGE)` rewrites all data files containing records with modifications recorded using deletion vectors. See [REORG TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-reorg-table.html). 
\nNote \nModified data might still exist in the old files. You can run [VACUUM](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-vacuum.html) to physically delete the old files. `REORG TABLE ... APPLY (PURGE)` creates a new version of the table at the time it completes, which is the timestamp you must consider for the retention threshold for your `VACUUM` operation to fully remove deleted files. See [Remove unused data files with vacuum](https:\/\/docs.databricks.com\/delta\/vacuum.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/deletion-vectors.html"} +{"content":"# What is Delta Lake?\n### What are deletion vectors?\n#### Compatibility with Delta clients\n\nDatabricks leverages deletion vectors to power predictive I\/O for updates on Photon-enabled compute. See [Use predictive I\/O to accelerate updates](https:\/\/docs.databricks.com\/optimizations\/predictive-io.html#updates). \nSupport for leveraging deletion vectors for reads and writes varies by client. \nThe following table denotes required client versions for reading and writing Delta tables with deletion vectors enabled and specifies which write operations leverage deletion vectors: \n| Client | Write deletion vectors | Read deletion vectors |\n| --- | --- | --- |\n| Databricks Runtime with Photon | Supports `MERGE`, `UPDATE`, and `DELETE` using Databricks Runtime 12.2 LTS and above. | Requires Databricks Runtime 12.2 LTS or above. |\n| Databricks Runtime without Photon | Supports `DELETE` using Databricks Runtime 12.2 LTS and above. Supports `UPDATE` using Databricks Runtime 14.1 and above. Supports `MERGE` using Databricks Runtime 14.3 LTS and above. | Requires Databricks Runtime 12.2 LTS or above. |\n| OSS Apache Spark with OSS Delta Lake | Supports `DELETE` using OSS Delta 2.4.0 and above. Supports `UPDATE` using OSS Delta 3.0.0 and above. | Requires OSS Delta 2.3.0 or above. 
|\n| Delta Sharing recipients | Writes are not supported on Delta Sharing tables | Databricks: Requires DBR 14.1 or above. Open-source Apache Spark: Requires `delta-sharing-spark` 3.1 or above. | \nNote \nFor support in other Delta clients, see the [OSS Delta Lake integrations documentation](https:\/\/delta.io\/integrations).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/deletion-vectors.html"} +{"content":"# What is Delta Lake?\n### What are deletion vectors?\n#### Limitations\n\n* UniForm does not support deletion vectors.\n* You can enable deletion vectors for Materialized views, but to disable deletion vectors for a Materialized view, you must drop the Materialized view and recreate it.\n* You cannot generate a manifest file for a table with deletion vectors present. To generate a manifest, run `REORG TABLE ... APPLY (PURGE)` and ensure that no concurrent write operations are running.\n* You cannot incrementally generate manifest files for a table with deletion vectors enabled.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/deletion-vectors.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Run MLflow Projects on Databricks\n\nAn [MLflow Project](https:\/\/mlflow.org\/docs\/latest\/projects.html) is a format for packaging data science code in a reusable and reproducible way. The MLflow Projects component includes an API and command-line tools for running projects, which also integrate with the Tracking component to automatically record the parameters and git commit of your source code for reproducibility. \nThis article describes the format of an MLflow Project and how to run an MLflow project remotely on Databricks clusters using the MLflow CLI, which makes it easy to vertically scale your data science code. 
\nMLflow Project execution is not supported on Databricks Community Edition.\n\n#### Run MLflow Projects on Databricks\n##### MLflow project format\n\nAny local directory or Git repository can be treated as an MLflow project. The following conventions define a project: \n* The project\u2019s name is the name of the directory.\n* The software environment is specified in `python_env.yaml`, if present. If no `python_env.yaml` file is present, MLflow uses a virtualenv environment containing only Python (specifically, the latest Python available to virtualenv) when running the project.\n* Any `.py` or `.sh` file in the project can be an entry point, with no parameters explicitly declared. When you run such a command with a set of parameters, MLflow passes each parameter on the command line using `--key <value>` syntax. \nYou specify more options by adding an MLproject file, which is a text file in YAML syntax. An example MLproject file looks like this: \n```\nname: My Project\n\npython_env: python_env.yaml\n\nentry_points:\n  main:\n    parameters:\n      data_file: path\n      regularization: {type: float, default: 0.1}\n    command: \"python train.py -r {regularization} {data_file}\"\n  validate:\n    parameters:\n      data_file: path\n    command: \"python validate.py {data_file}\"\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/projects.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Run MLflow Projects on Databricks\n##### Run an MLflow project\n\nTo run an MLflow project on a Databricks cluster in the default workspace, use the command: \n```\nmlflow run <uri> -b databricks --backend-config <json-new-cluster-spec>\n\n``` \nwhere `<uri>` is a Git repository URI or folder containing an MLflow project and `<json-new-cluster-spec>` is a JSON document containing a [new\\_cluster structure](https:\/\/docs.databricks.com\/api\/workspace\/jobs). The Git URI should be of the form: `https:\/\/github.com\/<repo>#<project-folder>`. 
\nAn example cluster specification is: \n```\n{\n\"spark_version\": \"7.3.x-scala2.12\",\n\"num_workers\": 1,\n\"node_type_id\": \"i3.xlarge\"\n}\n\n``` \nIf you need to install libraries on the worker, use the \u201ccluster specification\u201d format. Note that Python wheel files must be uploaded to DBFS and specified as `pypi` dependencies. For example: \n```\n{\n\"new_cluster\": {\n\"spark_version\": \"7.3.x-scala2.12\",\n\"num_workers\": 1,\n\"node_type_id\": \"i3.xlarge\"\n},\n\"libraries\": [\n{\n\"pypi\": {\n\"package\": \"tensorflow\"\n}\n},\n{\n\"pypi\": {\n\"package\": \"\/dbfs\/path_to_my_lib.whl\"\n}\n}\n]\n}\n\n``` \nImportant \n* `.egg` and `.jar` dependencies are not supported for MLflow projects.\n* Execution for MLflow projects with Docker environments is not supported.\n* You must use a new cluster specification when running an MLflow Project on Databricks. Running Projects against existing clusters is not supported. \n### Using SparkR \nIn order to use SparkR in an MLflow Project run, your project code must first install and import SparkR as follows: \n```\nif (file.exists(\"\/databricks\/spark\/R\/pkg\")) {\ninstall.packages(\"\/databricks\/spark\/R\/pkg\", repos = NULL)\n} else {\ninstall.packages(\"SparkR\")\n}\n\nlibrary(SparkR)\n\n``` \nYour project can then initialize a SparkR session and use SparkR as normal: \n```\nsparkR.session()\n...\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/projects.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Run MLflow Projects on Databricks\n##### Example\n\nThis example shows how to create an experiment, run the MLflow tutorial project on a Databricks cluster, view the job run output, and view the run in the experiment. \n### Requirements \n1. Install MLflow using `pip install mlflow`.\n2. Install and configure the [Databricks CLI](https:\/\/docs.databricks.com\/archive\/dev-tools\/cli\/index.html). 
The Databricks CLI authentication mechanism is required to run jobs on a Databricks cluster. \n### Step 1: Create an experiment \n1. In the workspace, select **Create > MLflow Experiment**.\n2. In the Name field, enter `Tutorial`.\n3. Click **Create**. Note the Experiment ID. In this example, it is `14622565`. \n![Experiment ID](https:\/\/docs.databricks.com\/_images\/mlflow-experiment-id.png) \n### Step 2: Run the MLflow tutorial project \nThe following steps set up the `MLFLOW_TRACKING_URI` environment variable and run the project, recording the training parameters, metrics, and the trained model to the experiment noted in the preceding step: \n1. Set the `MLFLOW_TRACKING_URI` environment variable to the Databricks workspace. \n```\nexport MLFLOW_TRACKING_URI=databricks\n\n```\n2. Run the MLflow tutorial project, training a [wine model](https:\/\/github.com\/mlflow\/mlflow\/tree\/master\/examples\/sklearn_elasticnet_wine). Replace `<experiment-id>` with the Experiment ID you noted in the preceding step. \n```\nmlflow run https:\/\/github.com\/mlflow\/mlflow#examples\/sklearn_elasticnet_wine -b databricks --backend-config cluster-spec.json --experiment-id <experiment-id>\n\n``` \n```\n=== Fetching project from https:\/\/github.com\/mlflow\/mlflow#examples\/sklearn_elasticnet_wine into \/var\/folders\/kc\/l20y4txd5w3_xrdhw6cnz1080000gp\/T\/tmpbct_5g8u ===\n=== Uploading project to DBFS path \/dbfs\/mlflow-experiments\/<experiment-id>\/projects-code\/16e66ccbff0a4e22278e4d73ec733e2c9a33efbd1e6f70e3c7b47b8b5f1e4fa3.tar.gz ===\n=== Finished uploading project to \/dbfs\/mlflow-experiments\/<experiment-id>\/projects-code\/16e66ccbff0a4e22278e4d73ec733e2c9a33efbd1e6f70e3c7b47b8b5f1e4fa3.tar.gz ===\n=== Running entry point main of project https:\/\/github.com\/mlflow\/mlflow#examples\/sklearn_elasticnet_wine on Databricks ===\n=== Launched MLflow run as Databricks job run with ID 8651121. Getting run status page URL... 
===\n=== Check the run's status at https:\/\/<databricks-instance>#job\/<job-id>\/run\/1 ===\n\n```\n3. Copy the URL `https:\/\/<databricks-instance>#job\/<job-id>\/run\/1` in the last line of the MLflow run output. \n### Step 3: View the Databricks job run \n1. Open the URL you copied in the preceding step in a browser to view the Databricks job run output: \n![Job run output](https:\/\/docs.databricks.com\/_images\/mlflow-job-run.png) \n### Step 4: View the experiment and MLflow run details \n1. Navigate to the experiment in your Databricks workspace. \n![Go to experiment](https:\/\/docs.databricks.com\/_images\/mlflow-workspace-experiment.png)\n2. Click the experiment. \n![View experiment](https:\/\/docs.databricks.com\/_images\/mlflow-experiment.png)\n3. To display run details, click a link in the Date column. \n![Run details](https:\/\/docs.databricks.com\/_images\/mlflow-run-remote.png) \nYou can view logs from your run by clicking the **Logs** link in the Job Output field.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/projects.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Run MLflow Projects on Databricks\n##### Resources\n\nFor some example MLflow projects, see the [MLflow App Library](https:\/\/github.com\/mlflow\/mlflow-apps), which contains a repository of ready-to-run projects aimed at making it easy to include ML functionality into your code.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/projects.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Preset\n\nPreset provides modern business intelligence for your entire organization. Preset provides a powerful, easy to use data exploration and visualization platform, powered by open source Apache Superset. 
\nYou can integrate your Databricks SQL warehouses (formerly Databricks SQL endpoints) and Databricks clusters with Preset.\n\n#### Connect to Preset\n##### Connect to Preset using Partner Connect\n\nTo connect your Databricks workspace to Preset using Partner Connect, see [Connect to BI partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/bi.html). \nNote \nPartner Connect only supports Databricks SQL warehouses for Preset. To connect a cluster in your Databricks workspace to Preset, connect to Preset manually.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/preset.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Preset\n##### Connect to Preset manually\n\nIn this section, you connect an existing SQL warehouse or cluster in your Databricks workspace to Preset. \nNote \nFor SQL warehouses, you can use Partner Connect to simplify the connection process. \n### Requirements \nBefore you integrate with Preset manually, you must have the following: \n* A cluster or SQL warehouse in your Databricks workspace. \n+ [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n+ [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* The connection details for your cluster or SQL warehouse, specifically the **Server Hostname**, **Port**, and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). To create a personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. 
(Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**.\nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. See your workspace administrator or the following: \n+ [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n+ [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. 
To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \n### Steps to connect \nTo connect to Preset manually, do the following: \n1. [Create a new Preset account](https:\/\/manage.app.preset.io\/starter-registration\/), or [sign in to your existing Preset account](https:\/\/manage.app.preset.io\/login\/).\n2. Click **+ Workspace**.\n3. In the **Add New Workspace** dialog, enter a name for the workspace, select the workspace region that is nearest to you, and then click **Save**.\n4. Open the workspace by clicking the workspace tile.\n5. On the toolbar, click **Catalog** > **Databases**.\n6. Click **+ Database**.\n7. In the **Connect a database** dialog, in the **Supported Databases** list, select one of the following: \n* For a SQL warehouse, select **Databricks SQL Warehouse**.\n* For a cluster, select **Databricks Interactive Cluster**.\n8. For **SQLAlchemy URI**, enter the following value: \nFor a SQL warehouse: \n```\ndatabricks+pyodbc:\/\/token:{access token}@{server hostname}:{port}\/{database name}\n\n``` \nFor a cluster: \n```\ndatabricks+pyhive:\/\/token:{access token}@{server hostname}:{port}\/{database name}\n\n``` \nReplace: \n* `{access token}` with the Databricks personal access token value from the [requirements](https:\/\/docs.databricks.com\/partners\/bi\/preset.html#requirements).\n* `{server hostname}` with the **Server Hostname** value from the requirements.\n* `{port}` with the **Port** value from the requirements.\n* `{database name}` with the name of the target database in your Databricks workspace. \nFor example, for a SQL warehouse: \n```\ndatabricks+pyodbc:\/\/token:dapi...@dbc-a1b2345c-d6e7.cloud.databricks.com:443\/default\n\n``` \nFor example, for a cluster: \n```\ndatabricks+pyhive:\/\/token:dapi...@dbc-a1b2345c-d6e7.cloud.databricks.com:443\/default\n\n```\n9. 
Click the **Advanced** tab, and expand **Other**.\n10. For **Engine Parameters**, enter the following value: \nFor a SQL warehouse: \n```\n{\"connect_args\": {\"http_path\": \"sql\/1.0\/warehouses\/****\", \"driver_path\": \"\/opt\/simba\/spark\/lib\/64\/libsparkodbc_sb64.so\"}}\n\n``` \nFor a cluster: \n```\n{\"connect_args\": {\"http_path\": \"sql\/protocolv1\/o\/****\"}}\n\n``` \nReplace `sql\/1.0\/warehouses\/****` or `sql\/protocolv1\/o\/****` with the **HTTP Path** value from the [requirements](https:\/\/docs.databricks.com\/partners\/bi\/preset.html#requirements). \nFor example, for a SQL warehouse: \n```\n{\"connect_args\": {\"http_path\": \"sql\/1.0\/warehouses\/ab12345cd678e901\", \"driver_path\": \"\/opt\/simba\/spark\/lib\/64\/libsparkodbc_sb64.so\"}}\n\n``` \nFor example, for a cluster: \n```\n{\"connect_args\": {\"http_path\": \"sql\/protocolv1\/o\/1234567890123456\/1234-567890-buyer123\"}}\n\n```\n11. Click the **Basic** tab, and then click **Test Connection**. \nNote \nFor connection troubleshooting, see [Database Connection Walkthrough for Databricks](https:\/\/docs.preset.io\/v1\/docs\/databricks) on the Preset website.\n12. 
After the connection succeeds, click **Connect**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/preset.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Preset\n##### Next steps\n\nExplore one or more of the following resources on the [Preset website](https:\/\/preset.io\/): \n* [Introducing Preset Cloud](https:\/\/preset.io\/product\/)\n* [Preset documentation](https:\/\/docs.preset.io\/)\n* [Connect Data to Preset](https:\/\/docs.preset.io\/docs\/connect-data-to-preset)\n* [Getting Started Guide](https:\/\/docs.preset.io\/docs\/welcome-to-preset)\n* [Sharing and Collaborating](https:\/\/docs.preset.io\/docs\/how-to-share)\n* [Support](https:\/\/preset.io\/support\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/preset.html"} +{"content":"# Technology partners\n### What is Databricks Partner Connect?\n\nPartner Connect lets you create trial accounts with select Databricks technology partners and connect your Databricks workspace to partner solutions from the Databricks UI. This allows you to try partner solutions using your data in the Databricks lakehouse, then adopt the solutions that best meet your business needs. \nPartner Connect provides a simpler alternative to manual partner connections by provisioning the required Databricks resources on your behalf, then passing resource details to the partner. Required resources might include a Databricks SQL warehouse (formerly Databricks SQL endpoint), a service principal, and a personal access token. \nNot all Databricks partner solutions are featured in Partner Connect. For a list of partners that are featured in Partner Connect, with links to their connection guides, see [Databricks Partner Connect partners](https:\/\/docs.databricks.com\/integrations\/index.html#partner-connect). 
\nTip \nIf you have an existing partner account, Databricks recommends that you follow the steps to connect to the partner solution manually in the appropriate connection guide. This is because the connection experience in Partner Connect is optimized for new partner accounts. \nNote \nSome partner solutions allow you to connect using Databricks SQL warehouses or Databricks clusters, but not both. For details, see the partner\u2019s connection guide. \nImportant \nPartner Connect isn\u2019t available in AWS GovCloud regions.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/index.html"} +{"content":"# Technology partners\n### What is Databricks Partner Connect?\n#### Requirements\n\nTo connect your Databricks workspace to a partner solution using Partner Connect, you must meet the following requirements: \n* Your Databricks account must be on the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons). This is because many of the partner solutions in Partner Connect use Databricks SQL, which is available only on these plans. To view your Databricks account details, use the [account console](https:\/\/docs.databricks.com\/admin\/account-settings\/index.html#account-console).\n* To create new connections to partner solutions, you must first sign in to your workspace as a Databricks workspace admin. For information about Databricks workspace admins, see [Manage users](https:\/\/docs.databricks.com\/admin\/users-groups\/users.html).\n* For all other Partner Connect tasks, you must first sign in to your workspace as a Databricks workspace admin or a Databricks user who has at least the **Workspace access** entitlement. If you are working with SQL warehouses, you also need the **Databricks SQL access** entitlement. 
For more information, see [Manage users](https:\/\/docs.databricks.com\/admin\/users-groups\/users.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/index.html"} +{"content":"# Technology partners\n### What is Databricks Partner Connect?\n#### Quickstart: Connect to a partner solution using Partner Connect\n\n1. Make sure your Databricks account, workspace, and the signed-in user all meet the [requirements](https:\/\/docs.databricks.com\/partner-connect\/index.html#requirements) for Partner Connect.\n2. In the sidebar, click ![Partner Connect button](https:\/\/docs.databricks.com\/_images\/partner-connect.png) **Partner Connect**.\n3. Click the tile for the partner that you want to connect your workspace to. If the tile has a check mark icon, stop here, as your workspace is already connected. Otherwise, follow the on-screen directions to finish creating the connection. \nNote \nPartner solutions that use a locally-installed application instead of a web-based one (such as Power BI Desktop and Tableau Desktop) do not display a check mark icon in their tile in Partner Connect, even after you connect your workspace to them.\n4. To work with your new connection, see the concluding or **Next steps** section of the corresponding partner connection guide.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/index.html"} +{"content":"# Technology partners\n### What is Databricks Partner Connect?\n#### Common tasks required to create and manage partner connections\n\nThis section describes common tasks you might need to complete to create and manage partner connections using Partner Connect. \n### Allow users to access partner-generated databases and tables \nPartner solutions in the **Data Ingestion** category in Partner Connect can create databases and tables in your workspace. These databases and tables are owned by the partner solution\u2019s associated Databricks service principal in your workspace. 
\nBy default, these databases and tables can be accessed only by the service principal and by workspace admins. To allow other users in your workspace to access these databases and tables, use the SQL [GRANT](https:\/\/docs.databricks.com\/sql\/language-manual\/security-grant.html) statement. To get access details for an existing database or table, use the SQL [SHOW GRANTS](https:\/\/docs.databricks.com\/sql\/language-manual\/security-show-grant.html) statement. \n### Create an access token \nDatabricks partner solutions require you to provide the partner with a Databricks personal access token. The partner uses this token to authenticate with your Databricks workspace. \nFor cloud-based partner solutions in Partner Connect (such as Fivetran, Labelbox, Prophecy, and Rivery), Partner Connect automatically creates the token (along with a Databricks service principal that is associated with that token) and then shares the token\u2019s value with the partner. You cannot access the token\u2019s value. If for any reason the token expires or the token\u2019s value is no longer shared with the partner, you must create a replacement token for the service principal; to do this, see [Manage service principals and personal access tokens](https:\/\/docs.databricks.com\/partner-connect\/admin.html#service-principal-pat). To share the replacement token with the partner, see the partner\u2019s documentation. \nOnly Databricks workspace administrators can generate replacement tokens for Databricks service principals. If you cannot generate a replacement token, contact your administrator. See also [Manage service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html). \nFor desktop-based partner solutions in Partner Connect (such as Power BI and Tableau), you must create the token and then share the token\u2019s value with the partner. 
To create the token, see the [Token management API](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement). To set up the partner solution so that it shares the new token with the partner, follow the on-screen instructions in Partner Connect or see the partner\u2019s documentation. \nImportant \nWhen you create the token and share the token\u2019s value with the partner, the partner can take whatever actions that the related entity (such as your Databricks user or a Databricks service principal) can normally take within your Databricks workspace. Do not share token values with partners whom you do not trust. \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \nDatabricks workspace administrators can disable token generation. If you cannot generate a token, contact your administrator. See also [Monitor and manage personal access tokens](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html). 
\n### Allow a SQL warehouse to access external data \nTo allow a Databricks SQL warehouse to access data outside of Databricks, see [Enable data access configuration](https:\/\/docs.databricks.com\/admin\/sql\/data-access-configuration.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/index.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Qlik Replicate\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nQlik Replicate helps you pull data from multiple data sources (Oracle, Microsoft SQL Server, SAP, mainframe, and more) into Delta Lake. Replicate\u2019s automated change data capture (CDC) helps you avoid the heavy lifting of manually extracting data, transferring using an API script, chopping, staging, and importing. Qlik Compose automates the CDC into Delta Lake. \nNote \nFor information about Qlik Sense, a solution that helps you analyze data in Delta Lake, see [Connect to Qlik Sense](https:\/\/docs.databricks.com\/partners\/bi\/qlik-sense.html). \nFor a general demonstration of Qlik Replicate, watch the following YouTube video (14 minutes). \nFor a demonstration of data pipelines with Qlik Replicate, see the following YouTube video (6 minutes). \nHere are the steps for using Qlik Replicate with Databricks.\n\n#### Connect to Qlik Replicate\n##### Step 1: Generate a Databricks personal access token\n\nQlik Replicate authenticates with Databricks using a Databricks personal access token. \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). 
\nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/qlik.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Qlik Replicate\n##### Step 2: Set up a cluster to support integration needs\n\nQlik Replicate will write data to an S3 bucket and the Databricks integration cluster will read data from that location. Therefore the integration cluster requires secure access to the S3 bucket. \n### Secure access to an S3 bucket \nTo access AWS resources, you can launch the Databricks integration cluster with an instance profile. The instance profile should have access to the staging S3 bucket and the target S3 bucket where you want to write the Delta tables. To create an instance profile and configure the integration cluster to use the role, follow the instructions in [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html). \nAs an alternative, you can use [IAM credential passthrough](https:\/\/docs.databricks.com\/archive\/credential-passthrough\/iam-passthrough.html), which enables user-specific access to S3 data from a shared cluster. \n### Specify the cluster configuration \n1. Set **Cluster Mode** to **Standard**.\n2. Set **Databricks Runtime Version** to a Databricks runtime version.\n3. 
Enable [optimized writes and auto compaction](https:\/\/docs.databricks.com\/delta\/tune-file-size.html) by adding the following properties to your [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration): \n```\nspark.databricks.delta.optimizeWrite.enabled true\nspark.databricks.delta.autoCompact.enabled true\n\n```\n4. Configure your cluster depending on your integration and scaling needs. \nFor cluster configuration details, see [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html). \nSee [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html) for the steps to obtain the JDBC URL and HTTP path.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/qlik.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Qlik Replicate\n##### Step 3: Obtain JDBC and ODBC connection details to connect to a cluster\n\nTo connect a Databricks cluster to Qlik Replicate you need the following JDBC\/ODBC connection properties: \n* JDBC URL\n* HTTP Path\n\n#### Connect to Qlik Replicate\n##### Step 4: Configure Qlik Replicate with Databricks\n\nGo to the [Qlik](https:\/\/www.qlik.com\/us\/products\/technology\/databricks) login page and follow the instructions.\n\n#### Connect to Qlik Replicate\n##### Additional resources\n\n[Support](https:\/\/support.qlik.com\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/qlik.html"} +{"content":"# \n### Collect feedback on `\ud83d\uddc2\ufe0f Request Log`s from expert users\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. 
\n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html) \nThis tutorial walks you through the process of collecting feedback on `\ud83d\uddc2\ufe0f Request Log`s from your `\ud83e\udde0 Expert Users`. This step is done when you have negative feedback from your `\ud83d\udc64 End Users` and need to get input in order to understand what went wrong \/ what should have happened.\n\n### Collect feedback on `\ud83d\uddc2\ufe0f Request Log`s from expert users\n#### Data flow\n\n![legend](https:\/\/docs.databricks.com\/_images\/rag-pass-to-reviewer.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/8-eval-review.html"} +{"content":"# \n### Collect feedback on `\ud83d\uddc2\ufe0f Request Log`s from expert users\n#### Step 1: Create the `\ud83d\udccb Review Set` & instructions\n\n1. Run the following SQL to create a Unity Catalog table called `<catalog>.<schema>.<review_table_name>`. This table can be stored in any Unity Catalog schema, but we suggest storing it in the Unity Catalog schema you configured for the RAG Application. \nNote \nYou can modify the SQL code to only select a subset of logs. If you do this, make sure you keep the original schema of the `request` column. \n```\nCREATE TABLE <catalog>.<schema>.<review_table_name> AS (SELECT * FROM <request_log_table> where app_version_id=<model_uri> LIMIT 10)\n\n``` \nNote \nThe schema is intentionally the same between the request logs and the review set. \nWarning \nTo review the assessments, you will need to use the `request_id` from `<catalog>.<schema>.<review_table_name>`. The generated `request_id`s are unique UUIDs, but there is a very low probability 2 UUIDs can be identical.\n2. Open the file `src\/review\/instructions.md` and modify the instructions as needed. \n```\n# Instructions for reviewers\n\nPlease review these chats. 
For each conversation, read the question asked, assess the bot's response for accuracy, and respond to the feedback prompts accordingly.\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/8-eval-review.html"} +{"content":"# \n### Collect feedback on `\ud83d\uddc2\ufe0f Request Log`s from expert users\n#### Step 2: Deploy the `\ud83d\udccb Review Set` to the <review-ui>\n\n1. Run the following command. \n```\n.\/rag start-review -e dev -v 1 --review-request-table <catalog>.<schema>.<review_table_name>\n\n```\n2. The URL for the `\ud83d\udcac Review UI` is printed to the console. \n```\n...truncated for clarity...\n\nYour Review UI is now available. Open the Review UI here: <review_url>\n\n```\n3. Add permissions to the deployed version so your `\ud83e\udde0 Expert Users` can access the above URL. \n* Give the Databricks user you wish to grant access `read` permissions to \n+ the MLflow Experiment\n+ the Model Serving endpoint\n+ the Unity Catalog Model\nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** Support for adding any corporate SSO to access the `\ud83d\udcac Review UI` e.g., no requirements for a Databricks account.\n4. Share the URL with your `\ud83e\udde0 Expert Users` \n![RAG review app](https:\/\/docs.databricks.com\/_images\/review-ui-with-logs.png)\n5. The `\ud83d\udc4d Assessments` from your users will appear in the `\ud83d\udc4d Assessment & Evaluation Results Log` for the `Environment` that you deployed to. You can query for just these assessments with: \n```\nSELECT a.*\nFROM <assessment_log> a LEFT SEMI JOIN <catalog>.<schema>.<review_table_name> r ON (a.request.request_id = r.request.request_id)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/8-eval-review.html"} +{"content":"# Query data\n## Data format options\n#### ORC file\n\n[Apache ORC](https:\/\/orc.apache.org\/) is a columnar file format that provides optimizations to speed up queries. 
It is a far more efficient file format than [CSV](https:\/\/docs.databricks.com\/query\/formats\/csv.html) or [JSON](https:\/\/docs.databricks.com\/query\/formats\/json.html). \nFor more information, see [ORC Files](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-orc.html).\n\n#### ORC file\n##### Options\n\nSee the following Apache Spark reference articles for supported read and write options. \n* Read \n+ [Python](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrameReader.orc.html?highlight=orc#pyspark.sql.DataFrameReader.orc)\n+ [Scala](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/sql\/DataFrameReader.html#orc(paths:String*):org.apache.spark.sql.DataFrame)\n* Write \n+ [Python](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrameWriter.orc.html?highlight=orc#pyspark.sql.DataFrameWriter.orc)\n+ [Scala](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/sql\/DataFrameWriter.html#orc(path:String):Unit)\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/orc.html"} +{"content":"# Discover data\n### Explore database objects\n\nThis article details how you can discover and explore catalogs, schemas, tables, and other database objects in Databricks. The instructions in this article focus on returning details for database objects that you have at least the `BROWSE` or `SELECT` privilege on. \nFor general information on Unity Catalog privileges, see [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html). 
For information about how to set schema ownership and permissions, see [Manage Unity Catalog object ownership](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/ownership.html) and [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html). \nMost access to database objects is governed by Unity Catalog, but your company might use another data governance approach or combine Unity Catalog with other legacy table ACLs. This article focuses on describing behavior for objects governed by Unity Catalog, but most methods described in this article also work for database objects that aren\u2019t governed by Unity Catalog. \nThis article includes instructions for Catalog Explorer and SQL. Select the ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** icon in the workspace side bar to access Catalog Explorer. You can execute SQL commands from a notebook or the query editor attached to compute. To view database objects with Catalog Explorer, you must have at least the `BROWSE` privilege on the objects. To view database objects with SQL, you must have at least the `SELECT` privilege on the object, as well as `USE CATALOG` on the parent catalog and `USE SCHEMA` on the parent schema. \nNote \nYou can navigate Unity Catalog-governed database objects in Catalog Explorer without active compute. To explore data in the `hive_metastore` and other catalogs not governed by Unity Catalog, you must attach to compute with appropriate privileges.\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/database-objects.html"} +{"content":"# Discover data\n### Explore database objects\n#### Filtering database objects\n\nDuring interactive exploration of database objects with Catalog Explorer, you can use the provided text box to filter results. Matched strings in object names are highlighted, but only among currently visible database objects. 
For complete search of all database objects, see [Search for workspace objects](https:\/\/docs.databricks.com\/search\/index.html). \nSQL provides similar functionality by optionally specifying a `regex_pattern` clause in conjunction with a `SHOW` statement, such as the following: \n```\nSHOW TABLES IN schema_name LIKE 'sales_*_fy23'\n\n```\n\n### Explore database objects\n#### Explore catalogs\n\nCatalogs represent the top level of data governance in each Unity Catalog metastore. \nRun the following command to see a list of catalogs available to you. \n```\nSHOW CATALOGS\n\n``` \nSee [SHOW CATALOGS](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-show-catalogs.html). \nWhen you access Catalog Explorer, you see a list of catalogs available to you. \n### Select a catalog \nRun the following command to set your currently active catalog. \n```\nUSE CATALOG catalog_name\n\n``` \nSee [USE CATALOG](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-use-catalog.html). \nClick on a catalog name to select it. \n### See catalog details \nRun the following command to describe a catalog. \n```\nDESCRIBE CATALOG catalog_name\n\n``` \nSee [DESCRIBE CATALOG](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-describe-catalog.html). \nSelect the **Details** tab to review catalog details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/database-objects.html"} +{"content":"# Discover data\n### Explore database objects\n#### Explore schemas\n\nSchemas are collections of tables, views, volumes, functions, and models in Unity Catalog. Schemas are contained in catalogs. \nRun the following command to see a list of schemas available to you. \n```\nSHOW SCHEMAS IN catalog_name\n\n``` \nSee [SHOW SCHEMAS](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-show-schemas.html). \nWhen you select a catalog in Catalog Explorer, you see a list of available schemas. 
\n### Select a schema \nRun the following command to set your currently active schema. \n```\nUSE SCHEMA catalog_name.schema_name\n\n``` \nSee [USE SCHEMA](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-use-schema.html). \nClick on a schema name to select it. \n### See schema details \nRun the following command to describe a schema. \n```\nDESCRIBE SCHEMA schema_name\n\n``` \nSee [DESCRIBE SCHEMA](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-describe-schema.html). \nSelect the **Details** tab to review schema details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/database-objects.html"} +{"content":"# Discover data\n### Explore database objects\n#### Explore tables and views\n\nTables and views are contained in schemas. Most tables in Unity Catalog are backed by Delta Lake, but you might also have access to tables registered against external data. See [What data can you query with Databricks?](https:\/\/docs.databricks.com\/query\/index.html#lakehouse-external). \nViews in Unity Catalog always reference data in another table. \nRun the following command to see a list of tables available to you. \n```\nSHOW TABLES IN catalog_name.schema_name\n\n``` \nRun the following command to see a list of views available to you. \n```\nSHOW VIEWS IN catalog_name.schema_name\n\n``` \nSee [SHOW TABLES](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-show-tables.html) and [SHOW VIEWS](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-show-views.html). \nWhen you select a schema in Catalog Explorer, you see a list of available tables and views. 
\nNote \nIf the schema has other database objects like volumes present, you might need to click **Tables** to expand the list of tables and views.\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/database-objects.html"} +{"content":"# Discover data\n### Explore database objects\n#### View table contents and details\n\nYou can view most table details with either Catalog Explorer or SQL. Some details are only available in the Catalog Explorer UI. \nSelect a table in Catalog Explorer to explore table details. \n### Explore table columns \nRun the following command to view table columns. \n```\nSHOW COLUMNS IN table_name\n\n``` \nSee [SHOW COLUMNS](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-show-columns.html). \nSelect the **Columns** tab to view table columns. \n### View sample data \nRun the following command to view 1000 records from a table. \n```\nSELECT * FROM table_name LIMIT 1000;\n\n``` \nSee [Query data](https:\/\/docs.databricks.com\/query\/index.html). \nSelect the **Sample Data** tab to view sample data. You must have access to active compute to sample data. \n### See table details \nRun the following command to describe a table. \n```\nDESCRIBE TABLE table_name\n\n``` \nRun the following command to display table properties for a table. \n```\nSHOW TBLPROPERTIES table_name\n\n``` \nSee [DESCRIBE TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-describe-table.html) and [SHOW TBLPROPERTIES](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-show-tblproperties.html). \nSelect the **Details** tab to review table details, including table properties. \n### View table history \nTable history is available for Delta tables. All Unity Catalog managed tables are Delta tables. \nRun the following command to review table history. \n```\nDESCRIBE HISTORY table_name\n\n``` \nSee [DESCRIBE HISTORY](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-describe-history.html). 
\nSelect the **History** tab to review table history. \n### View frequent queries and users \nIf the table is registered in Unity Catalog, you can view the most frequent queries made on the table and users who accessed the table in the past 30 days using Catalog Explorer. See [View frequent queries and users of a table](https:\/\/docs.databricks.com\/discover\/table-insights.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/database-objects.html"} +{"content":"# Discover data\n### Explore database objects\n#### View primary key and foreign key relationships\n\nFor tables with foreign keys defined, click **View relationships** ![View relationships button](https:\/\/docs.databricks.com\/_images\/pk-fk-view-relationships.png) at the top-right of the **Columns** tab. The Entity Relationship Diagram (ERD) opens. The ERD displays the primary key and foreign key relationships between tables in a graph, providing a clear and intuitive representation of how data entities connect. \n![Entity relationship diagram](https:\/\/docs.databricks.com\/_images\/ce-erd.png) \nFor more information about primary key and foreign key constraints, see [Constraints on Databricks](https:\/\/docs.databricks.com\/tables\/constraints.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/database-objects.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Workspace Model Registry example\n\nNote \nThis documentation covers the Workspace Model Registry. Databricks recommends using [Models in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html). Models in Unity Catalog provides centralized model governance, cross-workspace access, lineage, and deployment. Workspace Model Registry will be deprecated in the future. 
\nThis example illustrates how to use the Workspace Model Registry to build a machine learning application that forecasts the daily power output of a wind farm. The example shows how to: \n* Track and log models with MLflow\n* Register models with the Model Registry\n* Describe models and make model version stage transitions\n* Integrate registered models with production applications\n* Search and discover models in the Model Registry\n* Archive and delete models \nThe article describes how to perform these steps using the MLflow Tracking and MLflow Model Registry UIs and APIs. \nFor a notebook that performs all these steps using the MLflow Tracking and Registry APIs, see the [Model Registry example notebook](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#notebook).\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Workspace Model Registry example\n###### Load dataset, train model, and track with MLflow Tracking\n\nBefore you can register a model in the Model Registry, you must first train and log the [model](https:\/\/docs.databricks.com\/mlflow\/models.html) during an [experiment run](https:\/\/docs.databricks.com\/mlflow\/experiments.html). This section shows how to load the wind farm dataset, train a model, and log the training run to MLflow. \n### Load dataset \nThe following code loads a dataset containing weather data and power output information for a wind farm in the United States. The dataset contains `wind direction`, `wind speed`, and `air temperature` features sampled every eight hours (once at `00:00`, once at `08:00`, and once at `16:00`), as well as daily aggregate power output (`power`), over several years. 
\n```\nimport pandas as pd\nwind_farm_data = pd.read_csv(\"https:\/\/github.com\/dbczumar\/model-registry-demo-notebook\/raw\/master\/dataset\/windfarm_data.csv\", index_col=0)\n\ndef get_training_data():\n    training_data = pd.DataFrame(wind_farm_data[\"2014-01-01\":\"2018-01-01\"])\n    X = training_data.drop(columns=\"power\")\n    y = training_data[\"power\"]\n    return X, y\n\ndef get_validation_data():\n    validation_data = pd.DataFrame(wind_farm_data[\"2018-01-01\":\"2019-01-01\"])\n    X = validation_data.drop(columns=\"power\")\n    y = validation_data[\"power\"]\n    return X, y\n\ndef get_weather_and_forecast():\n    format_date = lambda pd_date : pd_date.date().strftime(\"%Y-%m-%d\")\n    today = pd.Timestamp('today').normalize()\n    week_ago = today - pd.Timedelta(days=5)\n    week_later = today + pd.Timedelta(days=5)\n\n    past_power_output = pd.DataFrame(wind_farm_data)[format_date(week_ago):format_date(today)]\n    weather_and_forecast = pd.DataFrame(wind_farm_data)[format_date(week_ago):format_date(week_later)]\n    if len(weather_and_forecast) < 10:\n        past_power_output = pd.DataFrame(wind_farm_data).iloc[-10:-5]\n        weather_and_forecast = pd.DataFrame(wind_farm_data).iloc[-10:]\n\n    return weather_and_forecast.drop(columns=\"power\"), past_power_output[\"power\"]\n\n``` \n### Train model \nThe following code trains a neural network using TensorFlow Keras to predict power output based on the weather features in the dataset. MLflow is used to track the model\u2019s hyperparameters, performance metrics, source code, and artifacts. 
\n```\ndef train_keras_model(X, y):\n    import tensorflow.keras\n    from tensorflow.keras.models import Sequential\n    from tensorflow.keras.layers import Dense\n\n    model = Sequential()\n    model.add(Dense(100, input_shape=(X.shape[-1],), activation=\"relu\", name=\"hidden_layer\"))\n    model.add(Dense(1))\n    model.compile(loss=\"mse\", optimizer=\"adam\")\n\n    model.fit(X, y, epochs=100, batch_size=64, validation_split=.2)\n    return model\n\nimport mlflow\n\nX_train, y_train = get_training_data()\n\nwith mlflow.start_run():\n    # Automatically capture the model's parameters, metrics, artifacts,\n    # and source code with the `autolog()` function\n    mlflow.tensorflow.autolog()\n\n    train_keras_model(X_train, y_train)\n    run_id = mlflow.active_run().info.run_id\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Workspace Model Registry example\n###### Register and manage the model using the MLflow UI\n\nIn this section: \n* [Create a new registered model](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#create-a-new-registered-model)\n* [Explore the Model Registry UI](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#explore-the-model-registry-ui)\n* [Add model descriptions](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#add-model-descriptions)\n* [Transition a model version](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#transition-a-model-version) \n### [Create a new registered model](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#id2) \n1. Navigate to the MLflow Experiment Runs sidebar by clicking the **Experiment** icon ![Experiment icon](https:\/\/docs.databricks.com\/_images\/experiment1.png) in the Databricks notebook\u2019s right sidebar. 
\n![Runs sidebar](https:\/\/docs.databricks.com\/_images\/notebook-toolbar.png)\n2. Locate the MLflow Run corresponding to the TensorFlow Keras model training session, and open it in the MLflow Run UI by clicking the **View Run Detail** icon.\n3. In the MLflow UI, scroll down to the **Artifacts** section and click the directory named **model**. Click the **Register Model** button that appears. \n![Register model](https:\/\/docs.databricks.com\/_images\/mlflow_ui_register_model.png)\n4. Select **Create New Model** from the drop-down menu, and input the following model name: `power-forecasting-model`.\n5. Click **Register**. This registers a new model called `power-forecasting-model` and creates a new model version: `Version 1`. \n![New model version](https:\/\/docs.databricks.com\/_images\/register_model_confirm.png) \nAfter a few moments, the MLflow UI displays a link to the new registered model. Follow this link to open the new model version in the MLflow Model Registry UI. \n### [Explore the Model Registry UI](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#id3) \nThe model version page in the MLflow Model Registry UI provides information about `Version 1` of the registered forecasting model, including its author, creation time, and its current stage. \n![Model version page](https:\/\/docs.databricks.com\/_images\/registry_version_page.png) \nThe model version page also provides a **Source Run** link, which opens the MLflow Run that was used to create the model in the MLflow Run UI. From the MLflow Run UI, you can access the **Source** notebook link to view a snapshot of the Databricks notebook that was used to train the model. 
\n![Source run](https:\/\/docs.databricks.com\/_images\/source_run_link.png) \n![Source notebook](https:\/\/docs.databricks.com\/_images\/source_notebook_link.png) \nTo navigate back to the MLflow Model Registry, click ![Models Icon](https:\/\/docs.databricks.com\/_images\/models-icon.png) **Models** in sidebar. \nThe resulting MLflow Model Registry home page displays a list of all the registered models in your Databricks workspace, including their versions and stages. \nClick the **power-forecasting-model** link to open the registered model page, which displays all of the versions of the forecasting model. \n### [Add model descriptions](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#id4) \nYou can add descriptions to registered models and model versions. Registered model descriptions are useful for recording information that applies to multiple model versions (e.g., a general overview of the modeling problem and dataset). Model version descriptions are useful for detailing the unique attributes of a particular model version (e.g., the methodology and algorithm used to develop the model). \n1. Add a high-level description to the registered power forecasting model. Click the ![Edit Icon](https:\/\/docs.databricks.com\/_images\/edit-icon.png) icon and enter the following description: \n```\nThis model forecasts the power output of a wind farm based on weather data. The weather data consists of three features: wind speed, wind direction, and air temperature.\n\n``` \n![Add model description](https:\/\/docs.databricks.com\/_images\/model_description.png)\n2. Click **Save**.\n3. Click the **Version 1** link from the registered model page to navigate back to the model version page.\n4. Click the ![Edit Icon](https:\/\/docs.databricks.com\/_images\/edit-icon.png) icon and enter the following description: \n```\nThis model version was built using TensorFlow Keras. 
It is a feed-forward neural network with one hidden layer.\n\n``` \n![Add model version description](https:\/\/docs.databricks.com\/_images\/model_version_description.png)\n5. Click **Save**. \n### [Transition a model version](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#id5) \nThe MLflow Model Registry defines several model stages: **None**, **Staging**, **Production**, and `Archived`. Each stage has a unique meaning. For example, **Staging** is meant for model testing, while **Production** is for models that have completed the testing or review processes and have been deployed to applications. \n1. Click the **Stage** button to display the list of available model stages and your available stage transition options.\n2. Select **Transition to -> Production** and press **OK** in the stage transition confirmation window to transition the model to **Production**. \n![Transition to production](https:\/\/docs.databricks.com\/_images\/stage_transition_prod.png) \nAfter the model version is transitioned to **Production**, the current stage is displayed in the UI, and an entry is added to the activity log to reflect the transition. \n![Production stage](https:\/\/docs.databricks.com\/_images\/stage_production.png) \n![Model version activity](https:\/\/docs.databricks.com\/_images\/activity_production.png) \nThe MLflow Model Registry allows multiple model versions to share the same stage. When referencing a model by stage, the Model Registry uses the latest model version (the model version with the largest version ID). The registered model page displays all of the versions of a particular model. 
\n![Registered model page](https:\/\/docs.databricks.com\/_images\/model_registry_versions.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Workspace Model Registry example\n###### Register and manage the model using the MLflow API\n\nIn this section: \n* [Define the model\u2019s name programmatically](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#define-the-models-name-programmatically)\n* [Register the model](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#register-the-model)\n* [Add model and model version descriptions using the API](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#add-model-and-model-version-descriptions-using-the-api)\n* [Transition a model version and retrieve details using the API](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#transition-a-model-version-and-retrieve-details-using-the-api) \n### [Define the model\u2019s name programmatically](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#id6) \nNow that the model has been registered and transitioned to **Production**, you can reference it using MLflow programmatic APIs. 
Define the registered model\u2019s name as follows: \n```\nmodel_name = \"power-forecasting-model\"\n\n``` \n### [Register the model](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#id7) \n```\n# `model_name` was defined in the previous step\nimport mlflow\n\n# The default path where the MLflow autologging function stores the TensorFlow Keras model\nartifact_path = \"model\"\nmodel_uri = \"runs:\/{run_id}\/{artifact_path}\".format(run_id=run_id, artifact_path=artifact_path)\n\nmodel_details = mlflow.register_model(model_uri=model_uri, name=model_name)\n\nimport time\nfrom mlflow.tracking.client import MlflowClient\nfrom mlflow.entities.model_registry.model_version_status import ModelVersionStatus\n\n# Wait until the model is ready\ndef wait_until_ready(model_name, model_version):\n    client = MlflowClient()\n    for _ in range(10):\n        model_version_details = client.get_model_version(\n            name=model_name,\n            version=model_version,\n        )\n        status = ModelVersionStatus.from_string(model_version_details.status)\n        print(\"Model status: %s\" % ModelVersionStatus.to_string(status))\n        if status == ModelVersionStatus.READY:\n            break\n        time.sleep(1)\n\nwait_until_ready(model_details.name, model_details.version)\n\n``` \n### [Add model and model version descriptions using the API](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#id8) \n```\nfrom mlflow.tracking.client import MlflowClient\n\nclient = MlflowClient()\nclient.update_registered_model(\n    name=model_details.name,\n    description=\"This model forecasts the power output of a wind farm based on weather data. The weather data consists of three features: wind speed, wind direction, and air temperature.\"\n)\n\nclient.update_model_version(\n    name=model_details.name,\n    version=model_details.version,\n    description=\"This model version was built using TensorFlow Keras. 
It is a feed-forward neural network with one hidden layer.\"\n)\n\n``` \n### [Transition a model version and retrieve details using the API](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#id9) \n```\nclient.transition_model_version_stage(\nname=model_details.name,\nversion=model_details.version,\nstage='production',\n)\nmodel_version_details = client.get_model_version(\nname=model_details.name,\nversion=model_details.version,\n)\nprint(\"The current model stage is: '{stage}'\".format(stage=model_version_details.current_stage))\n\nlatest_version_info = client.get_latest_versions(model_name, stages=[\"production\"])\nlatest_production_version = latest_version_info[0].version\nprint(\"The latest production version of the model '%s' is '%s'.\" % (model_name, latest_production_version))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Workspace Model Registry example\n###### Load versions of the registered model using the API\n\nThe MLflow Models component defines functions for loading models from several machine learning frameworks. For example, `mlflow.tensorflow.load_model()` is used to load TensorFlow models that were saved in MLflow format, and `mlflow.sklearn.load_model()` is used to load scikit-learn models that were saved in MLflow format. \nThese functions can load models from the MLflow Model Registry. 
\n```\nimport mlflow.pyfunc\n\nmodel_version_uri = \"models:\/{model_name}\/1\".format(model_name=model_name)\n\nprint(\"Loading registered model version from URI: '{model_uri}'\".format(model_uri=model_version_uri))\nmodel_version_1 = mlflow.pyfunc.load_model(model_version_uri)\n\nmodel_production_uri = \"models:\/{model_name}\/production\".format(model_name=model_name)\n\nprint(\"Loading registered model version from URI: '{model_uri}'\".format(model_uri=model_production_uri))\nmodel_production = mlflow.pyfunc.load_model(model_production_uri)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Workspace Model Registry example\n###### Forecast power output with the production model\n\nIn this section, the production model is used to evaluate weather forecast data for the wind farm. The `forecast_power()` application loads the latest version of the forecasting model from the specified stage and uses it to forecast power production over the next five days. 
 \n```\ndef plot(model_name, model_stage, model_version, power_predictions, past_power_output):\n    import pandas as pd\n    import matplotlib.dates as mdates\n    from matplotlib import pyplot as plt\n    index = power_predictions.index\n    fig = plt.figure(figsize=(11, 7))\n    ax = fig.add_subplot(111)\n    ax.set_xlabel(\"Date\", size=20, labelpad=20)\n    ax.set_ylabel(\"Power\\noutput\\n(MW)\", size=20, labelpad=60, rotation=0)\n    ax.tick_params(axis='both', which='major', labelsize=17)\n    ax.xaxis.set_major_formatter(mdates.DateFormatter('%m\/%d'))\n    ax.plot(index[:len(past_power_output)], past_power_output, label=\"True\", color=\"red\", alpha=0.5, linewidth=4)\n    ax.plot(index, power_predictions.squeeze(), \"--\", label=\"Predicted by '%s'\\nin stage '%s' (Version %d)\" % (model_name, model_stage, model_version), color=\"blue\", linewidth=3)\n    ax.set_ylim(ymin=0, ymax=max(3500, int(max(power_predictions.values) * 1.3)))\n    ax.legend(fontsize=14)\n    plt.title(\"Wind farm power output and projections\", size=24, pad=20)\n    plt.tight_layout()\n    display(plt.show())\n\ndef forecast_power(model_name, model_stage):\n    import mlflow.pyfunc\n    import pandas as pd\n    from mlflow.tracking.client import MlflowClient\n    client = MlflowClient()\n    model_version = client.get_latest_versions(model_name, stages=[model_stage])[0].version\n    model_uri = \"models:\/{model_name}\/{model_stage}\".format(model_name=model_name, model_stage=model_stage)\n    model = mlflow.pyfunc.load_model(model_uri)\n    weather_data, past_power_output = get_weather_and_forecast()\n    power_predictions = pd.DataFrame(model.predict(weather_data))\n    power_predictions.index = pd.to_datetime(weather_data.index)\n    print(power_predictions)\n    plot(model_name, model_stage, int(model_version), power_predictions, past_power_output)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Workspace Model Registry example\n###### 
Create a new model version\n\nClassical machine learning techniques are also effective for power forecasting. The following code trains a random forest model using scikit-learn and registers it with the MLflow Model Registry via the `mlflow.sklearn.log_model()` function. \n```\nimport mlflow.sklearn\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.metrics import mean_squared_error\n\nwith mlflow.start_run():\n    n_estimators = 300\n    mlflow.log_param(\"n_estimators\", n_estimators)\n\n    rand_forest = RandomForestRegressor(n_estimators=n_estimators)\n    rand_forest.fit(X_train, y_train)\n\n    val_x, val_y = get_validation_data()\n    mse = mean_squared_error(rand_forest.predict(val_x), val_y)\n    print(\"Validation MSE: %d\" % mse)\n    mlflow.log_metric(\"mse\", mse)\n\n    # Specify the `registered_model_name` parameter of the `mlflow.sklearn.log_model()`\n    # function to register the model with the MLflow Model Registry. This automatically\n    # creates a new model version\n    mlflow.sklearn.log_model(\n        sk_model=rand_forest,\n        artifact_path=\"sklearn-model\",\n        registered_model_name=model_name,\n    )\n\n``` \n### Fetch the new model version ID using MLflow Model Registry search \n```\nfrom mlflow.tracking.client import MlflowClient\nclient = MlflowClient()\n\nmodel_version_infos = client.search_model_versions(\"name = '%s'\" % model_name)\n# Version numbers are returned as strings, so compare them as integers\nnew_model_version = max(int(model_version_info.version) for model_version_info in model_version_infos)\n\nwait_until_ready(model_name, new_model_version)\n\n``` \n### Add a description to the new model version \n```\nclient.update_model_version(\nname=model_name,\nversion=new_model_version,\ndescription=\"This model version is a random forest containing 300 decision trees that was trained in scikit-learn.\"\n)\n\n``` \n### Transition the new model version to Staging and test the model \nBefore deploying a model to a production application, it is often best practice to test it in a staging environment. 
The following code transitions the new model version to **Staging** and evaluates its performance. \n```\nclient.transition_model_version_stage(\nname=model_name,\nversion=new_model_version,\nstage=\"Staging\",\n)\n\nforecast_power(model_name, \"Staging\")\n\n``` \n### Deploy the new model version to Production \nAfter verifying that the new model version performs well in staging, the following code transitions the model to **Production** and uses the exact same application code from the [Forecast power output with the production model](https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html#forecast-power-output-with-the-production-model) section to produce a power forecast. \n```\nclient.transition_model_version_stage(\nname=model_name,\nversion=new_model_version,\nstage=\"production\",\n)\n\nforecast_power(model_name, \"production\")\n\n``` \nThere are now two versions of the forecasting model in the **Production** stage: the version trained with Keras and the version trained with scikit-learn. \n![Product model versions](https:\/\/docs.databricks.com\/_images\/multiple_prod_stage.png) \nNote \nWhen referencing a model by stage, the MLflow Model Registry automatically uses the latest version in that stage. This enables you to update your production models without changing any application code.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Workspace Model Registry example\n###### Archive and delete models\n\nWhen a model version is no longer being used, you can archive it or delete it. You can also delete an entire registered model; this removes all of its associated model versions. \n### Archive `Version 1` of the power forecasting model \nArchive `Version 1` of the power forecasting model because it is no longer being used. 
You can archive models in the MLflow Model Registry UI or via the MLflow API. \n### Archive `Version 1` in the MLflow UI \nTo archive `Version 1` of the power forecasting model: \n1. Open its corresponding model version page in the MLflow Model Registry UI: \n![Transition to archived](https:\/\/docs.databricks.com\/_images\/stage_transition_archived.png)\n2. Click the **Stage** button and select **Transition To -> Archived**: \n![Archived stage](https:\/\/docs.databricks.com\/_images\/confirm_archived_transition.png)\n3. Press **OK** in the stage transition confirmation window. \n![Archived model version](https:\/\/docs.databricks.com\/_images\/stage_archived.png) \n#### Archive `Version 1` using the MLflow API \nThe following code uses the `MlflowClient.transition_model_version_stage()` function to archive `Version 1` of the power forecasting model. \n```\nfrom mlflow.tracking.client import MlflowClient\n\nclient = MlflowClient()\nclient.transition_model_version_stage(\nname=model_name,\nversion=1,\nstage=\"Archived\",\n)\n\n``` \n#### Delete `Version 1` of the power forecasting model \nYou can also use the MLflow UI or MLflow API to delete model versions. \nWarning \nModel version deletion is permanent and cannot be undone. \n##### Delete `Version 1` in the MLflow UI \nTo delete `Version 1` of the power forecasting model: \n1. Open its corresponding model version page in the MLflow Model Registry UI. \n![Delete model version](https:\/\/docs.databricks.com\/_images\/delete_version.png)\n2. Select the drop-down arrow next to the version identifier and click **Delete**. \n##### Delete `Version 1` using the MLflow API \n```\nclient.delete_model_version(\nname=model_name,\nversion=1,\n)\n\n``` \n##### Delete the model using the MLflow API \nYou must first transition all remaining model version stages to **None** or **Archived**. 
\n```\nfrom mlflow.tracking.client import MlflowClient\n\nclient = MlflowClient()\nclient.transition_model_version_stage(\nname=model_name,\nversion=2,\nstage=\"Archived\",\n)\n\n``` \n```\nclient.delete_registered_model(name=model_name)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Workspace Model Registry example\n###### Notebook\n\n### MLflow Model Registry example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-model-registry-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/workspace-model-registry-example.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Run your first Structured Streaming workload\n\nThis article provides code examples and explanation of basic concepts necessary to run your first Structured Streaming queries on Databricks. You can use Structured Streaming for near real-time and incremental processing workloads. \nStructured Streaming is one of several technologies that power streaming tables in Delta Live Tables. Databricks recommends using Delta Live Tables for all new ETL, ingestion, and Structured Streaming workloads. See [What is Delta Live Tables?](https:\/\/docs.databricks.com\/delta-live-tables\/index.html). \nNote \nWhile Delta Live Tables provides a slightly modified syntax for declaring streaming tables, the general syntax for configuring streaming reads and transformations applies to all streaming use cases on Databricks. 
Delta Live Tables also simplifies streaming by managing state information, metadata, and numerous configurations.\n\n#### Run your first Structured Streaming workload\n##### Read from a data stream\n\nYou can use Structured Streaming to incrementally ingest data from supported data sources. Some of the most common data sources used in Databricks Structured Streaming workloads include the following: \n* Data files in cloud object storage\n* Message buses and queues\n* Delta Lake \nDatabricks recommends using Auto Loader for streaming ingestion from cloud object storage. Auto Loader supports most file formats supported by Structured Streaming. See [What is Auto Loader?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html). \nEach data source provides a number of options to specify how to load batches of data. During reader configuration, the main options you might need to set fall into the following categories: \n* Options that specify the data source or format (for example, file type, delimiters, and schema).\n* Options that configure access to source systems (for example, port settings and credentials).\n* Options that specify where to start in a stream (for example, Kafka offsets or reading all existing files).\n* Options that control how much data is processed in each batch (for example, max offsets, files, or bytes per batch).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/tutorial.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Run your first Structured Streaming workload\n##### Use Auto Loader to read streaming data from object storage\n\nThe following example demonstrates loading JSON data with Auto Loader, which uses `cloudFiles` to denote format and options. The `schemaLocation` option enables schema inference and evolution. 
Paste the following code in a Databricks notebook cell and run the cell to create a streaming DataFrame named `raw_df`: \n```\nfile_path = \"\/databricks-datasets\/structured-streaming\/events\"\ncheckpoint_path = \"\/tmp\/ss-tutorial\/_checkpoint\"\n\nraw_df = (spark.readStream\n.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"json\")\n.option(\"cloudFiles.schemaLocation\", checkpoint_path)\n.load(file_path)\n)\n\n``` \nLike other read operations on Databricks, configuring a streaming read does not actually load data. You must trigger an action on the data before the stream begins. \nNote \nCalling `display()` on a streaming DataFrame starts a streaming job. For most Structured Streaming use cases, the action that triggers a stream should be writing data to a sink. See [Preparing your Structured Streaming code for production](https:\/\/docs.databricks.com\/structured-streaming\/tutorial.html#production).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/tutorial.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Run your first Structured Streaming workload\n##### Perform a streaming transformation\n\nStructured Streaming supports most transformations that are available in Databricks and Spark SQL. You can even load MLflow models as UDFs and make streaming predictions as a transformation. \nThe following code example completes a simple transformation to enrich the ingested JSON data with additional information using Spark SQL functions: \n```\nfrom pyspark.sql.functions import col, current_timestamp\n\ntransformed_df = (raw_df.select(\n\"*\",\ncol(\"_metadata.file_path\").alias(\"source_file\"),\ncurrent_timestamp().alias(\"processing_time\")\n)\n)\n\n``` \nThe resulting `transformed_df` contains query instructions to load and transform each record as it arrives in the data source. \nNote \nStructured Streaming treats data sources as unbounded or infinite datasets. 
As such, some transformations are not supported in Structured Streaming workloads because they would require sorting an infinite number of items. \nMost aggregations and many joins require managing state information with watermarks, windows, and output mode. See [Apply watermarks to control data processing thresholds](https:\/\/docs.databricks.com\/structured-streaming\/watermarks.html).\n\n#### Run your first Structured Streaming workload\n##### Write to a data sink\n\nA data sink is the target of a streaming write operation. Common sinks used in Databricks streaming workloads include the following: \n* Delta Lake\n* Message buses and queues\n* Key-value databases \nAs with data sources, most data sinks provide a number of options to control how data is written to the target system. During writer configuration, the main options you might need to set fall into the following categories: \n* Output mode (append by default).\n* A checkpoint location (required for each **writer**).\n* Trigger intervals; see [Configure Structured Streaming trigger intervals](https:\/\/docs.databricks.com\/structured-streaming\/triggers.html).\n* Options that specify the data sink or format (for example, file type, delimiters, and schema).\n* Options that configure access to target systems (for example, port settings and credentials).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/tutorial.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Run your first Structured Streaming workload\n##### Perform an incremental batch write to Delta Lake\n\nThe following example writes to Delta Lake using a specified file path and checkpoint. \nImportant \nAlways make sure you specify a unique checkpoint location for each streaming writer you configure. The checkpoint provides the unique identity for your stream, tracking all records processed and state information associated with your streaming query. 
 \nThe `availableNow` setting for the trigger instructs Structured Streaming to process all previously unprocessed records from the source dataset and then shut down, so you can safely execute the following code without worrying about leaving a stream running: \n```\ntarget_path = \"\/tmp\/ss-tutorial\/\"\ncheckpoint_path = \"\/tmp\/ss-tutorial\/_checkpoint\"\n\n(transformed_df.writeStream\n.trigger(availableNow=True)\n.option(\"checkpointLocation\", checkpoint_path)\n.option(\"path\", target_path)\n.start()\n)\n\n``` \nIn this example, no new records arrive in our data source, so repeated execution of this code does not ingest new records. \nWarning \nStructured Streaming execution can prevent auto termination from shutting down compute resources. To avoid unexpected costs, be sure to terminate streaming queries.\n\n#### Run your first Structured Streaming workload\n##### Preparing your Structured Streaming code for production\n\nDatabricks recommends using Delta Live Tables for most Structured Streaming workloads. The following recommendations provide a starting point for preparing Structured Streaming workloads for production: \n* Remove unnecessary code from notebooks that would return results, such as `display` and `count`.\n* Do not run Structured Streaming workloads on interactive clusters; always schedule streams as jobs.\n* To help streaming jobs recover automatically, configure jobs with infinite retries.\n* Do not use auto-scaling for workloads with Structured Streaming. 
\nFor more recommendations, see [Production considerations for Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/production.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/tutorial.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Run your first Structured Streaming workload\n##### Read data from Delta Lake, transform, and write to Delta Lake\n\nDelta Lake has extensive support for working with Structured Streaming as both a source and a sink. See [Delta table streaming reads and writes](https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html). \nThe following example shows example syntax to incrementally load all new records from a Delta table, join them with a snapshot of another Delta table, and write them to a Delta table: \n```\n(spark.readStream\n.table(\"<table-name1>\")\n.join(spark.read.table(\"<table-name2>\"), on=\"<id>\", how=\"left\")\n.writeStream\n.trigger(availableNow=True)\n.option(\"checkpointLocation\", \"<checkpoint-path>\")\n.toTable(\"<table-name3>\")\n)\n\n``` \nYou must have proper permissions configured to read source tables and write to target tables and the specified checkpoint location. Fill in all parameters denoted with angle brackets (`<>`) using the relevant values for your data sources and sinks. \nNote \nDelta Live Tables provides a fully declarative syntax for creating Delta Lake pipelines and manages properties like triggers and checkpoints automatically. See [What is Delta Live Tables?](https:\/\/docs.databricks.com\/delta-live-tables\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/tutorial.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Run your first Structured Streaming workload\n##### Read data from Kafka, transform, and write to Kafka\n\nApache Kafka and other messaging buses provide some of the lowest latency available for large datasets. 
You can use Databricks to apply transformations to data ingested from Kafka and then write data back to Kafka. \nNote \nWriting data to cloud object storage adds additional latency overhead. If you wish to store data from a messaging bus in Delta Lake but require the lowest latency possible for streaming workloads, Databricks recommends configuring separate streaming jobs to ingest data to the lakehouse and apply near real-time transformations for downstream messaging bus sinks. \nThe following code example demonstrates a simple pattern to enrich data from Kafka by joining it with data in a Delta table and then writing back to Kafka: \n```\n(spark.readStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", \"<server:ip>\")\n.option(\"subscribe\", \"<topic>\")\n.option(\"startingOffsets\", \"latest\")\n.load()\n.join(spark.read.table(\"<table-name>\"), on=\"<id>\", how=\"left\")\n.writeStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", \"<server:ip>\")\n.option(\"topic\", \"<topic>\")\n.option(\"checkpointLocation\", \"<checkpoint-path>\")\n.start()\n)\n\n``` \nYou must have proper permissions configured for access to your Kafka service. Fill in all parameters denoted with angle brackets (`<>`) using the relevant values for your data sources and sinks. See [Stream processing with Apache Kafka and Databricks](https:\/\/docs.databricks.com\/connect\/streaming\/kafka.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/tutorial.html"} +{"content":"# Transform data\n### Clean and validate data with batch or stream processing\n\nCleaning and validating data is essential for ensuring the quality of data assets in a lakehouse. 
This article outlines Databricks product offerings designed to facilitate data quality, as well as providing recommendations for defining business logic to implement custom rules.\n\n### Clean and validate data with batch or stream processing\n#### Schema enforcement on Databricks\n\nDelta Lake provides semantics to enforce schema and constraint checks on write, which provides guarantees around data quality for tables in a lakehouse. \nSchema enforcement ensures that data written to a table adheres to a predefined schema. Schema validation rules vary by operation. See [Schema enforcement](https:\/\/docs.databricks.com\/tables\/schema-enforcement.html). \nTo handle schema evolution, Delta provides mechanisms for making schema changes and evolving tables. It is important to carefully consider when to use schema evolution to avoid dropped fields or failed pipelines. For details on manually or automatically updating schemas, see [Update Delta Lake table schema](https:\/\/docs.databricks.com\/delta\/update-schema.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/validate.html"} +{"content":"# Transform data\n### Clean and validate data with batch or stream processing\n#### Table constraints\n\nConstraints can take the form of informational primary key and foreign key constraints, or enforced constraints. See [ADD CONSTRAINT clause](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table-add-constraint.html). \nTable constraints on Databricks are either enforced or informational. \nEnforced constraints include `NOT NULL` and `CHECK` constraints. \nInformational constraints include primary key and foreign key constraints. \nSee [Constraints on Databricks](https:\/\/docs.databricks.com\/tables\/constraints.html). \n### Deal with null or missing values \n**NOT NULL** can be enforced on Delta tables. 
It can only be enabled on an existing table if no existing records in the column are null, and prevents new records with null values from being inserted into a table. \n### Pattern enforcement \nRegular expressions (regex) can be used to enforce expected patterns in a data field. This is particularly useful when dealing with textual data that needs to adhere to specific formats or patterns. \nTo enforce a pattern using regex, you can use the `REGEXP` or `RLIKE` functions in SQL. These functions allow you to match a data field against a specified regex pattern. \nHere\u2019s an example of how to use the `CHECK` constraint with regex for pattern enforcement in SQL: \n```\nCREATE TABLE table_name (\ncolumn_name STRING CHECK (column_name REGEXP '^[A-Za-z0-9]+$')\n);\n\n``` \n### Value enforcement \nConstraints can be used to enforce value ranges on columns in a table. This ensures that only valid values within the specified range are allowed to be inserted or updated. \nTo enforce a value range constraint, you can use the `CHECK` constraint in SQL. The `CHECK` constraint allows you to define a condition that must be true for every row in the table. \nHere\u2019s an example of how to use the `CHECK` constraint to enforce a value range on a column: \n```\nCREATE TABLE table_name (\ncolumn_name INT CHECK (column_name >= 0 AND column_name <= 100)\n);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/validate.html"} +{"content":"# Transform data\n### Clean and validate data with batch or stream processing\n#### Define and configure expectations using Delta Live Tables.\n\nDelta Live Tables allows you to define expectations when declaring materialized views or streaming tables. You can choose to configure expectations to warn you about violations, drop violating records, or fail workloads based on violations. 
See [Manage data quality with Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html).\n\n### Clean and validate data with batch or stream processing\n#### Data monitoring\n\nDatabricks provides data quality monitoring services, which let you monitor the statistical properties and quality of the data in all of the tables in your account. See [Introduction to Databricks Lakehouse Monitoring](https:\/\/docs.databricks.com\/lakehouse-monitoring\/index.html).\n\n### Clean and validate data with batch or stream processing\n#### Cast data types\n\nWhen inserting or updating data in a table, Databricks casts data types when it can do so safely without losing information. \nSee the following articles for details about casting behaviors: \n* [cast function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/cast.html)\n* [SQL data type rules](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-datatype-rules.html)\n* [ANSI compliance in Databricks Runtime](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-ansi-compliance.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/validate.html"} +{"content":"# Transform data\n### Clean and validate data with batch or stream processing\n#### Custom business logic\n\nYou can use filters and `WHERE` clauses to define custom logic that quarantines bad records and prevents them from propagating to downstream tables. `CASE WHEN ... ELSE` expressions allow you to define conditional logic that handles records that violate expectations in predictable ways. 
 \n```\nDECLARE current_time = now();\n\nINSERT INTO silver_table\nSELECT * FROM bronze_table\nWHERE event_timestamp <= current_time AND quantity >= 0;\n\nINSERT INTO quarantine_table\nSELECT * FROM bronze_table\nWHERE event_timestamp > current_time OR quantity < 0;\n\n``` \nNote \nDatabricks recommends always processing filtered data as a separate write operation, especially when using Structured Streaming. Using `.foreachBatch` to write to multiple tables can lead to inconsistent results. \nFor example, you might have an upstream system that isn\u2019t capable of encoding `NULL` values, and so the placeholder value `-1` is used to represent missing data. Rather than writing custom logic for all downstream queries in Databricks to ignore records containing `-1`, you could use a `CASE WHEN` expression to replace the placeholder with `NULL` as a transformation. \n```\nINSERT INTO silver_table\nSELECT\n* EXCEPT weight,\nCASE\nWHEN weight = -1 THEN NULL\nELSE weight\nEND AS weight\nFROM bronze_table;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/validate.html"} +{"content":"# Transform data\n### Data modeling\n\nThis article introduces considerations, caveats, and recommendations for data modeling on Databricks. It is targeted toward users who are setting up new tables or authoring ETL workloads, with an emphasis on understanding Databricks behaviors that influence transforming raw data into a new data model. Data modeling decisions depend on how your organization and workloads use tables. The data model you choose impacts query performance, compute costs, and storage costs. This includes an introduction to the foundational concepts in database design with Databricks. \nImportant \nThis article exclusively applies to tables backed by Delta Lake, which includes all Unity Catalog managed tables. \nYou can use Databricks to query other external data sources, including tables registered with Lakehouse Federation. 
Each external data source has different limitations, semantics, and transactional guarantees. See [Query data](https:\/\/docs.databricks.com\/query\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/data-modeling.html"} +{"content":"# Transform data\n### Data modeling\n#### Database management concepts\n\nA lakehouse built with Databricks shares many components and concepts with other enterprise data warehousing systems. Consider the following concepts and features while designing your data model. \n### Transactions on Databricks \nDatabricks scopes transactions to individual tables. This means that Databricks does not support multi-table statements (also called multi-statement transactions). \nFor data modeling workloads, this translates to having to perform multiple independent transactions when ingesting a source record requires inserting or updating rows into two or more tables. Each of these transactions can succeed or fail independent of other transactions, and downstream queries need to be tolerant of state mismatch due to failed or delayed transactions. \n### Primary and foreign keys on Databricks \nPrimary and foreign keys are informational and not enforced. This model is common in many enterprise cloud-based database systems, but differs from many traditional relational database systems. See [Constraints on Databricks](https:\/\/docs.databricks.com\/tables\/constraints.html). \n### Joins on Databricks \nJoins can introduce processing bottlenecks in any database design. When processing data on Databricks, the query optimizer seeks to optimize the plan for joins, but can struggle when an individual query must join results from many tables. The optimizer can also fail to skip records in a table when filter parameters are on a field in another table, which can result in a full table scan. \nSee [Work with joins on Databricks](https:\/\/docs.databricks.com\/transform\/join.html). 
\nNote \nYou can use materialized views to incrementally compute the results for some join operations, but other joins are not compatible with materialized views. See [Use materialized views in Databricks SQL](https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html). \n### Working with nested and complex data types \nDatabricks supports working with semi-structured data sources including JSON, Avro, and Protobuf, and storing\ncomplex data as structs, JSON strings, and maps and arrays. See [Model semi-structured data](https:\/\/docs.databricks.com\/transform\/semi-structured.html). \n### Normalized data models \nDatabricks can work well with any data model. If you have an existing data model that you need to query from or migrate to Databricks, you should evaluate performance before rearchitecting your data. \nIf you are architecting a new lakehouse or adding datasets to an existing environment, Databricks recommends against using a heavily normalized model such as third normal form (3NF). \nModels like the star schema or snowflake schema perform well on Databricks, as there are fewer joins present in standard queries and fewer keys to keep in sync. In addition, having more data fields in a single table allows the query optimizer to skip large amounts of data using file-level statistics. For more on data skipping, see [Data skipping for Delta Lake](https:\/\/docs.databricks.com\/delta\/data-skipping.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/data-modeling.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n#### TensorBoard\n\n[TensorBoard](https:\/\/www.tensorflow.org\/tensorboard) is a suite of visualization tools for debugging, optimizing, and understanding TensorFlow, PyTorch, Hugging Face Transformers, and other machine learning programs.\n\n#### TensorBoard\n##### Use TensorBoard\n\nStarting TensorBoard in Databricks is no different than starting it in a Jupyter notebook on your local computer. \n1. 
Load the `%tensorboard` magic command and define your log directory. \n```\n%load_ext tensorboard\nexperiment_log_dir = <log-directory>\n\n```\n2. Invoke the `%tensorboard` magic command. \n```\n%tensorboard --logdir $experiment_log_dir\n\n``` \nThe TensorBoard server starts and displays the user interface inline in the notebook. It also provides a link to open TensorBoard in a new tab. \nThe following screenshot shows the TensorBoard UI started in a populated log directory. \n![TensorBoard UI started in populated log directory](https:\/\/docs.databricks.com\/_images\/tensorboard.png) \nYou can also start TensorBoard by using TensorBoard\u2019s notebook module directly. \n```\nfrom tensorboard import notebook\nnotebook.start(\"--logdir {}\".format(experiment_log_dir))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/tensorboard.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n#### TensorBoard\n##### TensorBoard logs and directories\n\nTensorBoard visualizes your machine learning programs by reading logs generated by TensorBoard callbacks and functions in [TensorBoard](https:\/\/www.tensorflow.org\/tensorboard\/get_started) or [PyTorch](https:\/\/pytorch.org\/docs\/stable\/tensorboard.html). To generate logs for other machine learning libraries, you can directly write logs using TensorFlow file writers (see [Module: tf.summary](https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/summary) for TensorFlow 2.x and see [Module: tf.compat.v1.summary](https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/compat\/v1\/summary) for the older API in TensorFlow 1.x ). \nTo make sure that your experiment logs are reliably stored, Databricks recommends writing logs to cloud storage rather than on the ephemeral cluster file system. For each experiment, start TensorBoard in a unique directory. 
For each run of your machine learning code in the experiment that generates logs, set the TensorBoard callback or file writer to write to a subdirectory of the experiment directory. That way, the data in the TensorBoard UI is separated into runs. \nRead the official [TensorBoard documentation](https:\/\/www.tensorflow.org\/tensorboard\/get_started) to get started using TensorBoard to log information for your machine learning program.\n\n#### TensorBoard\n##### Manage TensorBoard processes\n\nTensorBoard processes started within a Databricks notebook are not terminated when the notebook is detached or the REPL is restarted (for example, when you clear the state of the notebook). To manually kill a TensorBoard process, send it a termination signal using `%sh kill -15 pid`. Improperly killed TensorBoard processes might corrupt `notebook.list()`. \nTo list the TensorBoard servers currently running on your cluster, with their corresponding log directories and process IDs, run `notebook.list()` from the TensorBoard notebook module.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/tensorboard.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n#### TensorBoard\n##### Known issues\n\n* The inline TensorBoard UI is inside an iframe. Browser security features prevent external links within the UI from working unless you open the link in a new tab.\n* The `--window_title` option of TensorBoard is overridden on Databricks.\n* By default, TensorBoard scans a port range to select a port to listen on. If there are too many TensorBoard processes running on the cluster, all ports in the port range might be unavailable. You can work around this limitation by specifying a port number with the `--port` argument. The specified port should be between 6006 and 6106.\n* For download links to work, you must open TensorBoard in a tab.\n* When using TensorBoard 1.15.0, the Projector tab is blank. 
As a workaround, to visit the projector page directly, you can replace `#projector` in the URL with `data\/plugin\/projector\/projector_binary.html`.\n* TensorBoard 2.4.0 has a [known issue](https:\/\/github.com\/tensorflow\/tensorboard\/issues\/4421) that might affect TensorBoard rendering if upgraded.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/tensorboard.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Dynamic file pruning\n\nDynamic file pruning can significantly improve the performance of many queries on Delta Lake tables. Dynamic file pruning is triggered for queries that contain filter statements or `WHERE` clauses. You must use Photon-enabled compute to use dynamic file pruning in `MERGE`, `UPDATE`, and `DELETE` statements. Only `SELECT` statements leverage dynamic file pruning when Photon is not used. \nDynamic file pruning is especially efficient for non-partitioned tables, or for joins on non-partitioned columns. The performance impact of dynamic file pruning is often correlated to the clustering of data, so consider using Z-Ordering to maximize the benefit. \nFor background and use cases for dynamic file pruning, see [Faster SQL queries on Delta Lake with dynamic file pruning](https:\/\/databricks.com\/blog\/2020\/04\/30\/faster-sql-queries-on-delta-lake-with-dynamic-file-pruning.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/dynamic-file-pruning.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Dynamic file pruning\n##### Configuration\n\nDynamic file pruning is controlled by the following Apache Spark configuration options: \n* `spark.databricks.optimizer.dynamicFilePruning` (default is `true`): The main flag that directs the optimizer to push down filters. 
When set to `false`, dynamic file pruning will not be in effect.\n* `spark.databricks.optimizer.deltaTableSizeThreshold` (default is `10,000,000,000 bytes (10 GB)`): Represents the minimum size (in bytes) of the Delta table on the probe side of the join required to trigger dynamic file pruning. If the probe side is not very large, it is probably not worthwhile to push down the filters, and scanning the whole table is simpler. You can find the size of a Delta table by running the `DESCRIBE DETAIL table_name` command and then looking at the `sizeInBytes` column.\n* `spark.databricks.optimizer.deltaTableFilesThreshold` (default is `10`): Represents the minimum number of files of the Delta table on the probe side of the join required to trigger dynamic file pruning. When the probe side table contains fewer files than the threshold value, dynamic file pruning is not triggered. If a table has only a few files, it is probably not worthwhile to enable dynamic file pruning. You can find the number of files in a Delta table by running the `DESCRIBE DETAIL table_name` command and then looking at the `numFiles` column.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/dynamic-file-pruning.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect Tableau to Databricks\n\nThis article shows you how to connect Databricks to Tableau Desktop and includes information about other Tableau editions. You can connect through [Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/index.html) or you can connect manually. \nWhen you use Databricks as a data source with Tableau, you can provide powerful interactive analytics, bringing the contributions of your data scientists and data engineers to your business analysts by scaling to massive datasets. 
\nTo learn more on how to use Tableau Desktop to build reports and visualizations, please read [Tutorial: Get Started with Tableau Desktop](https:\/\/help.tableau.com\/current\/guides\/get-started-tutorial\/en-us\/get-started-tutorial-home.htm).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/tableau.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect Tableau to Databricks\n##### Requirements\n\n### Connect data managed by Databricks Unity Catalog to Tableau \n* [Tableau Desktop](https:\/\/www.tableau.com\/products\/desktop) 2021.4 or above. Download and install Tableau Desktop on your computer.\n* [Databricks ODBC driver](https:\/\/databricks.com\/spark\/odbc-drivers-download) version 2.6.19 or above. Install the driver using the downloaded installation file on your desktop. Follow [instructions provided by Tableau](https:\/\/help.tableau.com\/current\/pro\/desktop\/en-us\/examples_databricks.htm) to set up the connection to Databricks. Please refer to [Tableau and ODBC](https:\/\/help.tableau.com\/current\/pro\/desktop\/en-us\/odbc_tableau.htm) on more details about how Tableau Desktop works with ODBC driver. \n### Connect data managed by the legacy Databricks Hive metastore to Tableau \n* Tableau Desktop 2019.3 or above.\n* Databricks ODBC Driver 2.6.15 or above. \n### Authentication options \nUse one of the following authentication options: \n* (Recommended) Tableau enabled as an OAuth application in your account. Tableau Desktop is enabled by default. To enable Tableau Cloud or Tableau Server, see [Configure Databricks sign-on from Tableau Server](https:\/\/docs.databricks.com\/integrations\/configure-oauth-tableau.html). \nOnly users enrolled in Tableau\u2019s internal identity provider (IdP) can authenticate using single sign-on (SSO). OAuth tokens for Tableau expire after 90 days.\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html#pat-user). 
\nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n* A Databricks [username](https:\/\/docs.databricks.com\/admin\/users-groups\/users.html) (typically your email address) and password. \nUsername and password authentication may be disabled if your Databricks workspace is [enabled for single sign-on (SSO)](https:\/\/docs.databricks.com\/admin\/users-groups\/single-sign-on\/index.html). If so, use a Databricks personal access token instead. \n* The connection details for a cluster or SQL warehouse, specifically the **Server Hostname** and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/tableau.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect Tableau to Databricks\n##### Connect to Tableau Desktop using Partner Connect\n\nYou can use Partner Connect to connect a cluster or SQL warehouse with Tableau Desktop in just a few clicks. \n1. Make sure your Databricks account, workspace, and the signed-in user all meet the [requirements](https:\/\/docs.databricks.com\/partner-connect\/index.html#requirements) for Partner Connect.\n2. 
In the sidebar, click ![Partner Connect button](https:\/\/docs.databricks.com\/_images\/partner-connect.png) **Partner Connect**.\n3. Click the **Tableau** tile.\n4. In the **Connect to partner** dialog, for **Compute**, choose the name of the Databricks compute resource that you want to connect.\n5. Choose **Download connection file**.\n6. Open the downloaded connection file, which starts Tableau Desktop.\n7. In Tableau Desktop, enter your authentication credentials, and then click **Sign In**: \n* To use a Databricks personal access token, enter **token** for **Username** and your personal access token for **Password**.\n* To use a Databricks username and password, enter your username for **Username** and your password for **Password**. \nAfter you successfully connect with Tableau Desktop, you can stop here. The remaining information in this article covers additional information about Tableau, such as connecting manually with Tableau Desktop, setting up Tableau Server on Linux, how to use Tableau Online, and best practices and troubleshooting with Tableau.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/tableau.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect Tableau to Databricks\n##### Connect to Tableau Desktop manually\n\nFollow these instructions to connect to a cluster or SQL warehouse with Tableau Desktop. \nNote \nTo connect faster with Tableau Desktop, use Partner Connect. \n1. Start Tableau Desktop.\n2. Click **File > New**.\n3. On the **Data** tab, click **Connect to Data**.\n4. In the list of connectors, click **Databricks**.\n5. Enter the **Server Hostname** and **HTTP Path**.\n6. For **Authentication**, choose your authentication method, enter your authentication credentials, and then click **Sign in**. 
\n* To use a Databricks personal access token, select **Personal Access Token** and enter your personal access token for **Password**.\n* To use a Databricks username and password, select **Username \/ Password** and enter your username for **Username** and your password for **Password**.\n* **OAuth\/Microsoft Entra ID**. For **OAuth endpoint**, enter `https:\/\/{<server-hostname>}\/oidc`, where `<server-hostname>` is the **Server Hostname** for your cluster or SQL warehouse. A browser window opens and prompts you to sign in to your IdP. \nIf Unity Catalog is enabled for your workspace, additionally set the default catalog. In the **Advanced** tab, for **Connection properties**, add `Catalog=<catalog-name>`. To change the default catalog, in the **Initial SQL** tab, enter `USE CATALOG <catalog-name>`. \nAfter you successfully connect with Tableau Desktop, you can stop here. The remaining information in this article covers additional information about Tableau, such as setting up Tableau Server on Linux, how to use Tableau Online, and best practices and troubleshooting with Tableau.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/tableau.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect Tableau to Databricks\n##### Tableau Server on Linux\n\nEdit `\/etc\/odbcinst.ini` to include the following: \n```\n[Simba Spark ODBC Driver 64-bit]\nDescription=Simba Spark ODBC Driver (64-bit)\nDriver=\/opt\/simba\/spark\/lib\/64\/libsparkodbc_sb64.so\n\n``` \nNote \nTableau Server on Linux recommends 64-bit processing architecture.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/tableau.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect Tableau to Databricks\n##### Publish and refresh a workbook on Tableau Online\n\nThis article shows how to publish a workbook from Tableau Desktop to [Tableau Online](https:\/\/www.tableau.com\/products\/cloud-bi) and 
keep it updated when the data source changes. You need a [workbook](https:\/\/help.tableau.com\/current\/pro\/desktop\/en-us\/environ_workbooksandsheets_workbooks.htm) in Tableau Desktop and a [Tableau Online](https:\/\/www.tableau.com\/tableau-login-hub) account. \n1. Extract the workbook\u2019s data from Tableau Desktop: in Tableau Desktop, with the workbook that you want to publish displayed, click **Data >** `<data-source-name>` **> Extract Data**.\n2. In the **Extract Data** dialog box, click **Extract**.\n3. Browse to a location on your local machine where you want to save the extracted data, and then click **Save**.\n4. Publish the workbook\u2019s data source to Tableau Online: in Tableau Desktop, click **Server > Publish Data Source >** `<data-source-name>`.\n5. If the **Tableau Server Sign In** dialog box displays, click the **Tableau Online** link, and follow the on-screen directions to sign in to Tableau Online.\n6. In the **Publish Data Source to Tableau Online** dialog box, next to **Refresh Not Enabled**, click the **Edit** link.\n7. In the flyout box that displays, for **Authentication**, change **Refresh not enabled** to **Allow refresh access**.\n8. Click anywhere outside of this flyout to hide it.\n9. Select **Update workbook to use the published data source**.\n10. Click **Publish**. The data source displays in Tableau Online.\n11. In Tableau Online, in the **Publishing Complete** dialog box, click **Schedule**, and follow the on-screen directions.\n12. Publish the workbook to Tableau Online: in Tableau Desktop, with the workbook you want to publish displayed, click **Server > Publish Workbook**.\n13. In the **Publish Workbook to Tableau Online** dialog box, click **Publish**. The workbook displays in Tableau Online. \nTableau Online checks for changes to the data source according to the schedule you set, and updates the published workbook if changes are detected. 
\nFor more information, see the following on the Tableau website: \n* [Publish a Data Source](https:\/\/help.tableau.com\/current\/pro\/desktop\/en-us\/publish_datasources.htm)\n* [Comprehensive Steps to Publish a Workbook](https:\/\/help.tableau.com\/current\/pro\/desktop\/en-us\/publish_workbooks_howto.htm)\n* [Schedule Extract Refreshes as You Publish a Workbook](https:\/\/help.tableau.com\/current\/pro\/desktop\/en-us\/publish_workbooks_schedules.htm)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/tableau.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect Tableau to Databricks\n##### Best practices and troubleshooting\n\nThe two fundamental actions to optimize Tableau queries are: \n* Reduce the number of records being queried and visualized in a single chart or dashboard.\n* Reduce the number of queries being sent by Tableau in a single chart or dashboard. \nDeciding which to try first depends on your dashboard. If you have a number of different charts for individual users all in the same dashboard, it\u2019s likely that Tableau is sending too many queries to Databricks. If you only have a couple of charts but they take a long time to load, there are probably too many records being returned by Databricks to load effectively. \nTableau performance recording, available on both Tableau Desktop and Tableau Server, can help you understand where performance bottlenecks are by identifying processes that are causing latency when you run a particular workflow or dashboard. \n### Enable performance recording to debug any Tableau issue \nFor instance, if query execution is the problem, you know it has to do with the data engine process or the data source that you are querying. If the visual layout is performing slowly, you know that it is the VizQL. 
\nIf the performance recording says that the latency is in executing query, it is likely that too much time is taken by Databricks returning the results or by the ODBC\/Connector overlay processing the data into SQL for VizQL. When this occurs, you should analyze what you are returning and attempt to change the analytical pattern to have a dashboard per group, segment, or article instead of trying to cram everything into one dashboard and relying on Quick Filters. \nIf the poor performance is caused by sorting or visual layout, the problem may be the number of marks the dashboard is trying to return. Databricks can return one million records quickly, but Tableau may not be able to compute the layout and sort the results. If this is a problem, aggregate the query and drill into the lower levels. You can also try a bigger machine, since Tableau is only constrained by physical resources on the machine on which it is running. \nFor an in-depth tutorial on the performance recorder, see [Create a Performance Recording](https:\/\/help.tableau.com\/current\/server-linux\/en-us\/perf_record_create_server.htm). \n### Performance on Tableau Server versus Tableau Desktop \nIn general, a workflow that runs on Tableau Desktop is not any faster on Tableau Server. A dashboard that doesn\u2019t execute on Tableau Desktop won\u2019t execute on Tableau Server. This is important to keep in mind. \nIn fact, getting things working on Desktop is a much better troubleshooting technique, because Tableau Server has more processes to consider when you troubleshoot. And if things work in Tableau Desktop but not in Tableau Server, then you can safely narrow the issue down to the processes in Tableau Server that aren\u2019t in Tableau Desktop. \n### Configuration \nBy default, the parameters from the connection URL override those in the\nSimba ODBC DSN. There are two ways you can customize the ODBC configurations\nfrom Tableau: \n* `.tds` file for a single data source: \n1. 
Follow the instructions in [Save Data Sources](https:\/\/help.tableau.com\/current\/pro\/desktop\/en-us\/export_connection.htm) to export the `.tds` file for the data source.\n2. Find the property line `odbc-connect-string-extras=''` in the `.tds` file and set the parameters. For example, to enable `AutoReconnect` and `UseNativeQuery`, you can change the line to `odbc-connect-string-extras='AutoReconnect=1,UseNativeQuery=1'`.\n3. Reload the `.tds` file by reconnecting the connection. The compute resource is optimized to use less heap memory for collecting large results, so it can serve more rows per fetch block than Simba ODBC\u2019s default. Append `RowsFetchedPerBlock=100000` to the value of the `odbc-connect-string-extras` property.\n* `.tdc` file for all data sources: \n1. If you have never created a `.tdc` file, you can add [TableauTdcExample.tdc](https:\/\/docs.databricks.com\/_static\/examples\/TableauTdcExample.tdc) to the folder `Document\/My Tableau Repository\/Datasources`.\n2. Add the file to all developers\u2019 Tableau Desktop installations, so that it works when the dashboards are shared. \n### Optimize charts (worksheets) \nThere are a number of tactical chart optimizations that can help you improve the performance of your Tableau worksheets. \nFor filters that don\u2019t change often and are not meant to be interacted with, use context filters, which speed up execution time.\nAnother good rule of thumb is to use `if\/else` statements instead of `case\/when` statements in your queries. \nTableau can push down filters into data sources, which can greatly speed up queries. See [Filtering Across Multiple Data Sources Using a Parameter](https:\/\/kb.tableau.com\/articles\/howto\/filter-multiple-data-sources-using-parameter) and [Filter Data Across Multiple Data Sources](https:\/\/help.tableau.com\/current\/pro\/desktop\/en-us\/filter_across_datasources.htm) for more information about data source push down filters. 
\nIt is best to avoid table calculations if you can because they need to scan the full dataset. For more information about table calculations, see [Transform Values with Table Calculations](https:\/\/help.tableau.com\/current\/pro\/desktop\/en-us\/calculations_tablecalculations.htm). \n### Optimize dashboards \nHere are a number of tips and troubleshooting exercises you can apply to improve the performance of your Tableau dashboard. \nA common source of issues with Tableau dashboards connected to Databricks is the use of quick filters on individual dashboards that serve a number of different users, functions, or segments. You can attach global quick filters to all of the charts on the dashboard. It\u2019s a great feature, but one that can quickly cause problems. One global quick filter on a dashboard with five charts causes a minimum of 10 queries to be sent to Databricks. This can spiral to greater numbers as more filters are added and can cause massive performance problems, because Spark is not built to handle many concurrent queries starting at the same exact moment. This becomes more problematic when the Databricks cluster or SQL warehouse that you are using is not large enough to handle the high volume of queries. \nAs a first step, we recommend that you use Tableau performance recording to troubleshoot what might be causing the issue. \nIf the poor performance is caused by *sorting* or *visual layout*, the problem may be the number of marks the dashboard is trying to return. Databricks can return one million records quickly, but Tableau may not be able to compute the layout and sort the results. If this is a problem, aggregate the query and drill into the lower levels. You can also try a bigger machine, since Tableau is only constrained by physical resources on the machine on which it is running. 
\nFor information about drilling down in Tableau, see [Drill down into the details](https:\/\/help.tableau.com\/current\/guides\/get-started-tutorial\/en-us\/get-started-tutorial-drilldown.htm). \nIn general, seeing many granular marks is often a poor analytical pattern, because it doesn\u2019t provide insight. Drilling down from higher levels of aggregation makes more sense and reduces the number of records that need to be processed and visualized. \n#### Use actions to optimize dashboards \nTo drill from group to segment to article in order to obtain the same analysis and information as the \u201cocean boiled\u201d dashboard, you can use Tableau *actions*. Actions allow you to click a mark (for example a state on a map) and be sent to another dashboard that filters based on the state you click. This reduces the need to have many filters on one dashboard and reduces the number of records that need to be generated, because you can set an action to not generate records until it gets a predicate to filter on. \nFor more information, see [Actions](https:\/\/help.tableau.com\/current\/pro\/desktop\/en-us\/actions.htm) and [6 Tips to Make Your Dashboards More Performant](https:\/\/www.tableau.com\/about\/blog\/2016\/1\/5-tips-make-your-dashboards-more-performant-48574). \n### Caching \nCaching data is a good way to improve the performance of worksheets or dashboards. \n#### Caching in Tableau \nTableau has four layers of caching before it goes back to the data, whether that data is in a live connection or an extract: \n* **Tiles**: If someone is loading the exact same dashboard and nothing changes, then Tableau tries to reuse the same tiles for the charts. This is similar to Google Maps tiles.\n* **Model**: There are mathematical calculations used to generate visualizations in the event that tiles can\u2019t be used. Tableau Server attempts to use the same models.\n* **Abstract**: Aggregate results of queries are stored as well. 
This is the third \u201cdefense\u201d level. If a query returns Sum(Sales), Count(orders), Sum(Cost), in a previous query and a future query wants just Sum(Sales), then Tableau grabs that result and uses it.\n* **Native Cache**: If the query is the exact same as another one, Tableau uses the same results. This is the last level of caching. If this fails, then Tableau goes to the data. \n#### Caching frequency in Tableau \nTableau has administrative settings for caching more or less often. If the server is set to **Refresh Less Often**, Tableau keeps data in the cache for up to 12 hours. If it is set to **Refresh More Often**, Tableau goes back to the data on every page refresh. \nCustomers who have the same dashboard being used over again\u2014for example, \u201cMonday morning pipeline reports\u201d\u2014should be on a server set to Refresh Less Often so that the dashboards all use the same cache. \n#### Cache warming in Tableau \nIn Tableau you can warm the cache by setting a subscription for the dashboard to be sent before you want the dashboard viewed.\nThis is because the dashboard needs to be rendered in order to generate the image for the subscription email. See\n[Warming the Tableau Server Cache Using Subscriptions](https:\/\/kb.tableau.com\/articles\/HowTo\/warming-the-tableau-server-cache-using-subscriptions). \n### Tableau Desktop: The error `The drivers... are not properly installed` displays \n**Issue**: When you try to connect Tableau Desktop to Databricks, Tableau displays an error message in the connection dialog with a link to the driver download page, where you can find driver links and installation instructions. \n**Cause**: Your installation of Tableau Desktop is not running a supported driver. \n**Resolution**: Download the [Databricks ODBC driver](https:\/\/databricks.com\/spark\/odbc-drivers-download) version 2.6.15 or above. 
\n**See also**: [Error \u201cThe drivers\u2026 are not properly installed\u201d](https:\/\/kb.tableau.com\/articles\/issue\/error-the-drivers-are-not-properly-installed) on the Tableau website.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/tableau.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect Tableau to Databricks\n##### Additional resources\n\n* [Tableau Products](https:\/\/www.tableau.com\/products)\n* [Tableau resources](https:\/\/www.tableau.com\/resources)\n* [Databricks](https:\/\/help.tableau.com\/current\/pro\/desktop\/en-us\/examples_databricks.htm)\n* [Support](https:\/\/www.tableau.com\/support)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/tableau.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Automate Unity Catalog setup using Terraform\n\nYou can automate Unity Catalog setup by using the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html). This article provides links to the Terraform provider Unity Catalog deployment guide and resource reference documentation, along with requirements (\u201cBefore you begin\u201d) and validation and deployment tips.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/automate.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Automate Unity Catalog setup using Terraform\n##### Before you begin\n\nTo automate Unity Catalog setup using Terraform, you must have the following: \n* Your Databricks account must be on the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n* In AWS, you must have the ability to create Amazon S3 buckets, AWS IAM roles, AWS IAM policies, and cross-account trust relationships.\n* You must have at least one Databricks workspace that you want to use with Unity Catalog. 
See [Manually create a workspace (existing Databricks accounts)](https:\/\/docs.databricks.com\/admin\/workspace\/create-workspace.html). \nTo use the Databricks Terraform provider to configure a metastore for Unity Catalog, storage for the metastore, any external storage, and all of their related access credentials, you must have the following: \n* An AWS account.\n* A Databricks on AWS account.\n* A service principal that has the account admin role in your Databricks account.\n* The Terraform CLI. See [Download Terraform](https:\/\/www.terraform.io\/downloads.html) on the Terraform website.\n* The following seven Databricks environment variables: \n+ `DATABRICKS_CLIENT_ID`, set to the value of the client ID, also known as the application ID, of the service principal. See [OAuth machine-to-machine (M2M) authentication](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html).\n+ `DATABRICKS_CLIENT_SECRET`, set to the value of the client secret of the service principal. See [OAuth machine-to-machine (M2M) authentication](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html).\n+ `DATABRICKS_ACCOUNT_ID`, set to the value of the ID of your Databricks account. You can find this value in the corner of your [Databricks account console](https:\/\/accounts.cloud.databricks.com).\n+ `TF_VAR_databricks_account_id`, also set to the value of the ID of your Databricks account.\n+ `AWS_ACCESS_KEY_ID`, set to the value of your AWS user\u2019s access key ID. See [Programmatic access](https:\/\/docs.aws.amazon.com\/general\/latest\/gr\/aws-sec-cred-types.html#access-keys-and-secret-access-keys) in the AWS General Reference.\n+ `AWS_SECRET_ACCESS_KEY`, set to the value of your AWS user\u2019s secret access key. See [Programmatic access](https:\/\/docs.aws.amazon.com\/general\/latest\/gr\/aws-sec-cred-types.html#access-keys-and-secret-access-keys) in the AWS General Reference.\n+ `AWS_REGION`, set to the value of the AWS Region code for your Databricks account. 
See [Regional endpoints](https:\/\/docs.aws.amazon.com\/general\/latest\/gr\/rande.html#regional-endpoints) in the AWS General Reference. \nNote \nAn account admin\u2019s username and password can also be used to authenticate to the Terraform provider. Databricks strongly recommends that you use OAuth for service principals. To use a username and password, you must have the following environment variables: \n+ `DATABRICKS_USERNAME`, set to the value of your Databricks account-level admin username.\n+ `DATABRICKS_PASSWORD`, set to the value of the password for your Databricks account-level admin user. \nTo set these environment variables, see your operating system\u2019s documentation. \nTo use the Databricks Terraform provider to configure all other Unity Catalog infrastructure components, you must have the following: \n* A Databricks workspace.\n* On your local development machine, you must have: \n+ The Terraform CLI. See [Download Terraform](https:\/\/www.terraform.io\/downloads.html) on the Terraform website.\n+ One of the following: \n- Databricks CLI version 0.205 or above, configured with your Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html) by running `databricks configure --host <workspace-url> --profile <some-unique-profile-name>`. See [Install or update the Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/install.html) and [Databricks personal access token authentication](https:\/\/docs.databricks.com\/dev-tools\/cli\/authentication.html#token-auth). \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). 
\nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n- The following Databricks environment variables: \n* `DATABRICKS_HOST`, set to the value of your Databricks [workspace instance URL](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url), for example `https:\/\/dbc-1234567890123456.cloud.databricks.com`\n* `DATABRICKS_CLIENT_ID`, set to the value of the client ID, also known as the application ID, of the service principal. See [OAuth machine-to-machine (M2M) authentication](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html).\n* `DATABRICKS_CLIENT_SECRET`, set to the value of the client secret of the service principal. See [OAuth machine-to-machine (M2M) authentication](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nAlternatively, you can use a personal access token instead of a service principal\u2019s client ID and client secret: \n* `DATABRICKS_TOKEN`, set to the value of your Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). See also [Monitor and manage personal access tokens](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html). \nTo set these environment variables, see your operating system\u2019s documentation. \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). 
\nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/automate.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Automate Unity Catalog setup using Terraform\n##### Terraform provider Unity Catalog deployment guide and resource reference documentation\n\nTo learn how to deploy all prerequisites and enable Unity Catalog for a workspace, see [Deploying pre-requisite resources and enabling Unity Catalog](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/guides\/unity-catalog) in the Databricks Terraform provider documentation. \nIf you already have some Unity Catalog infrastructure components in place, you can use Terraform to deploy additional Unity Catalog infrastructure components as needed. See each section of the guide referenced in the previous paragraph and the [Unity Catalog section of the Databricks Terraform provider documentation](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs).\n\n#### Automate Unity Catalog setup using Terraform\n##### Validate, plan, deploy, or destroy the resources\n\n* To validate the syntax of the Terraform configurations without deploying them, run the `terraform validate` command.\n* To show the actions that Terraform would take to deploy the configurations, run the `terraform plan` command. 
This command does not actually deploy the configurations.\n* To deploy the configurations, run the `terraform apply` command.\n* To delete the deployed resources, run the `terraform destroy` command.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/automate.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### How do paths work for data managed by Unity Catalog?\n\nThis article explains restrictions around path overlaps in Unity Catalog, details path-based access patterns for data files in Unity Catalog objects, and describes how Unity Catalog manages paths for tables and volumes. \nNote \nVolumes are only supported on Databricks Runtime 13.3 LTS and above. In Databricks Runtime 12.2 LTS and below, operations against `\/Volumes` paths might succeed, but can write data to ephemeral storage disks attached to compute clusters rather than persisting data to Unity Catalog volumes as expected.\n\n#### How do paths work for data managed by Unity Catalog?\n##### Paths for Unity Catalog objects cannot overlap\n\nUnity Catalog enforces data governance by preventing managed directories of data from overlapping. Unity Catalog enforces the following rules: \n* External locations cannot overlap other external locations.\n* Tables and volumes store data files in external locations or the metastore root location.\n* Volumes cannot overlap other volumes.\n* Tables cannot overlap other tables.\n* Tables and volumes cannot overlap each other.\n* Managed storage locations cannot overlap each other. See [Managed storage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#managed-storage).\n* External volumes cannot overlap managed storage locations.\n* External tables cannot overlap managed storage locations. 
\nThese rules mean that the following restrictions exist in Unity Catalog: \n* You cannot define an external location within another external location.\n* You cannot define a volume within another volume.\n* You cannot define a table within another table.\n* You cannot define a table on any data files or directories within a volume.\n* You cannot define a volume on a directory within a table. \nNote \nYou can always use path-based access to write or read data files from volumes, including Delta Lake. You cannot register these data files as tables in the Unity Catalog metastore.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/paths.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### How do paths work for data managed by Unity Catalog?\n##### Paths for managed tables and managed volumes are fully-managed by Unity Catalog\n\nWhen you create a managed table or a managed volume, Unity Catalog creates a new directory in the Unity Catalog-configured storage location associated with the containing schema. The name of this directory is randomly generated to avoid any potential collision with other directories already present. \nThis behavior differs from how Hive metastore creates managed tables. Databricks recommends always interacting with Unity Catalog managed tables using table names and Unity Catalog managed volumes using volume paths.\n\n#### How do paths work for data managed by Unity Catalog?\n##### Paths for external tables and external volumes are governed by Unity Catalog\n\nWhen you create an external table or an external volume, you specify a path within an external location governed by Unity Catalog. \nImportant \nDatabricks recommends never creating an external volume or external table at the root of an external location. Instead, create external volumes and external tables in sub-directories within an external location. These recommendations should help avoid accidentally overlapping paths. 
See [Paths for Unity Catalog objects cannot overlap](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/paths.html#path-overlap). \nFor ease of use, Databricks recommends interacting with Unity Catalog external tables using table names and Unity Catalog external volumes using volume paths. \nImportant \nUnity Catalog manages all privileges for access using cloud URIs to data associated with external tables or external volumes. These privileges override any privileges associated with external locations. See [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/paths.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### How do paths work for data managed by Unity Catalog?\n##### How can you access data in Unity Catalog?\n\nUnity Catalog objects provide access to data through object identifiers, volume paths, or cloud URIs. You can access data associated with some objects through multiple methods. 
\nUnity Catalog tables are accessed using a three-tier identifier with the following pattern: \n```\n<catalog_name>.<schema_name>.<table_name>\n\n``` \nVolumes provide a file path to access data files with the following pattern: \n```\n\/Volumes\/<catalog_name>\/<schema_name>\/<volume_name>\/<path_to_file>\n\n``` \nCloud URIs require users to provide the driver, storage container identifier, and full path to the target files, as in the following example: \n```\ns3:\/\/<bucket_name>\/<path>\n\n``` \nThe following table shows the access methods allowed for Unity Catalog objects: \n| Object | Object identifier | File path | Cloud URI |\n| --- | --- | --- | --- |\n| External location | no | no | yes |\n| Managed table | yes | no | no |\n| External table | yes | no | yes |\n| Managed volume | no | yes | no |\n| External volume | no | yes | yes | \nNote \nUnity Catalog volumes use three-tier object identifiers with the following pattern for management commands (such as `CREATE VOLUME` and `DROP VOLUME`): \n```\n<catalog_name>.<schema_name>.<volume_name>\n\n``` \nTo actually work with files in volumes, you must use path-based access.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/paths.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Looker\n\nThis article describes how to use Looker with a Databricks cluster or Databricks SQL warehouse (formerly Databricks SQL endpoint). \nImportant \nWhen persistent derived tables (PDTs) are enabled, by default Looker regenerates PDTs every 5 minutes by connecting to the associated database. Databricks recommends that you change the default frequency to avoid incurring excess compute costs. 
For more information, see [Enable and manage persistent derived tables (PDTs)](https:\/\/docs.databricks.com\/partners\/bi\/looker.html#persistent-derived-tables).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/looker.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Looker\n##### Requirements\n\nBefore you connect to Looker manually, you need the following: \n* A cluster or SQL warehouse in your Databricks workspace. \n+ [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n+ [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* The connection details for your cluster or SQL warehouse, specifically the **Server Hostname**, **Port**, and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). To create a personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**.\nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. 
If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. See your workspace administrator or the following: \n+ [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n+ [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/looker.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Looker\n##### Connect to Looker manually\n\nTo connect to Looker manually, do the following: \n1. In Looker, click **Admin > Connections > Add Connection**. \n![Connection parameters](https:\/\/docs.databricks.com\/_images\/looker-databricks-connect.png)\n2. Enter a unique **Name** for the connection. 
\nTip \nConnection names should contain only lowercase letters, numbers, and underscores. Other characters might be accepted but could cause unexpected results later.\n3. For **Dialect**, select **Databricks**.\n4. For **Remote Host**, enter the **Server Hostname** from the requirements.\n5. For **Port**, enter the **Port** from the requirements.\n6. For **Database**, enter the name of the database in the workspace that you want to access through the connection (for example, `default`).\n7. For **Username**, enter the word `token`.\n8. For **Password**, enter your personal access token from the requirements.\n9. For **Additional Params**, enter `transportMode=http;ssl=1;httpPath=<http-path>`, replacing `<http-path>` with the **HTTP Path** value from the requirements. \nIf Unity Catalog is enabled for your workspace, additionally set a default catalog. Enter `ConnCatalog=<catalog-name>`, replacing `<catalog-name>` with the name of a catalog.\n10. For **PDT And Datagroup Maintenance Schedule**, enter a valid `cron` expression to change the default frequency for regenerating PDTs. The default frequency is every five minutes.\n11. If you want to translate queries into other time zones, adjust **Query Time Zone**.\n12. For the remaining fields, keep the defaults, in particular: \n* Keep the **Max Connections** and **Connection Pool Timeout** defaults.\n* Leave **Database Time Zone** blank (assuming that you are storing everything in UTC).\n13. Click **Test These Settings**.\n14. If the test succeeds, click **Add Connection**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/looker.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Looker\n##### Model your database in Looker\n\nThis section creates a project and runs the generator. The following steps assume that there are permanent tables stored in the database for your connection. \n1. On the **Develop** menu, turn on **Development Mode**.\n2. 
Click **Develop > Manage LookML Projects**.\n3. Click **New LookML Project**.\n4. Enter a unique **Project Name**. \nTip \nProject names should contain only lowercase letters, numbers, and underscores. Other characters might be accepted but could produce unexpected results later.\n5. For **Connection**, select the name of the connection from Step 2.\n6. For **Schemas**, enter `default`, unless you have other databases to model through the connection.\n7. For the remaining fields, keep the defaults, in particular: \n* Leave **Starting Point** set to **Generate Model from Database Schema**.\n* Leave **Build Views From** set to **All Tables**.\n8. Click **Create Project**. \nAfter you create the project and the generator runs, Looker displays a user interface with one `.model` file and multiple `.view` files. The `.model` file shows the tables in the schema and any discovered join relations between them, and the `.view` files list each dimension (column) available for each table in the schema.\n\n#### Connect to Looker\n##### Next steps\n\nTo begin working with your project, see the following resources on the Looker website: \n* [Exploring data in Looker](https:\/\/docs.looker.com\/exploring-data\/exploring-data)\n* [Creating Visualizations and Graphs](https:\/\/docs.looker.com\/exploring-data\/visualizing-query-results)\n* [Retrieve and chart data](https:\/\/docs.looker.com\/exploring-data\/retrieve-chart-intro)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/looker.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Looker\n##### Enable and manage persistent derived tables (PDTs)\n\nLooker can reduce query times and database loads by creating *persistent derived tables* (PDTs). A PDT is a derived table that Looker writes into a scratch schema in your database. Looker then regenerates the PDT on the schedule that you specify. 
For more information, see [Persistent derived tables (PDTs)](https:\/\/docs.looker.com\/data-modeling\/learning-lookml\/derived-tables#persistent_derived_table) in the Looker documentation. \nTo enable PDTs for a database connection, select **Persistent Derived Tables** for that connection and complete the on-screen instructions. For more information, see [Persistent Derived Tables](https:\/\/docs.looker.com\/setup-and-management\/connecting-to-db#persistent_derived_tables) and [Configuring Separate Login Credentials for PDT Processes](https:\/\/docs.looker.com\/setup-and-management\/connecting-to-db#pdt-overrides) in the Looker documentation. \nWhen PDTs are enabled, by default Looker regenerates PDTs every 5 minutes by connecting to the associated database. Looker restarts the associated Databricks resource if it is stopped. Databricks recommends that you change the default frequency by setting the **PDT And Datagroup Maintenance Schedule** field for your database connection to a valid `cron` expression. For more information, see [PDT and Datagroup Maintenance Schedule](https:\/\/docs.looker.com\/setup-and-management\/connecting-to-db#pdt_maintenance_schedule) in the Looker documentation. \nTo enable PDTs or to change the PDT regeneration frequency for an existing database connection, click **Admin > Database Connections**, click **Edit** next to your database connection, and follow the preceding instructions.\n\n#### Connect to Looker\n##### Additional resources\n\n[Looker support](https:\/\/help.looker.com\/hc)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/looker.html"} +{"content":"# \n### Infrastructure and Unity Catalog assets created by RAG Studio\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. 
\n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html) \nRAG Studio will create the following items in your configured Databricks Workspace and Unity Catalog schema. \nWarning \nInfrastructure will incur charges per your Databricks contract. \nNote \nBelow, any value that appears in curly brackets `{}` is replaced by the value for that configuration setting in `rag-config.yml`. For example, `{name}` will be replaced by the RAG Application name that you configured.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/created-infra.html"} +{"content":"# \n### Infrastructure and Unity Catalog assets created by RAG Studio\n#### Infrastructure created in the Databricks Workspace\n\n| Component | Default Value | Configurable Value |\n| --- | --- | --- |\n| **Created ONCE for the RAG Application** | |\n| Vector Search Endpoint | `{name}__vs_endpoint` | `global_config.vector_search_endpoint` |\n| MLflow Experiment name | `\/Shared\/{name}__experiment\/` | `global_config.mlflow_experiment_name` |\n| **Created FOR EACH `Version`** | |\n| [Model Serving Endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html) to host the `\ud83d\udd17 Chain` | `rag_studio_{name}_{env_name}` | not configurable |\n| **Created FOR EACH `Environment`** | |\n| Databricks Jobs | `[{name}][{env_name}] {job_name}` | not configurable | \nNote \nThe [MLflow experiment name](https:\/\/docs.databricks.com\/mlflow\/experiments.html) must start with either `\/Shared` or `\/Users`. 
\nNote \nThe Databricks Jobs will be replaced with [Databricks managed serverless compute services](https:\/\/docs.databricks.com\/getting-started\/overview.html#serverless) in future versions.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/created-infra.html"} +{"content":"# \n### Infrastructure and Unity Catalog assets created by RAG Studio\n#### Assets created in the Unity Catalog schema\n\n| Component | Default Value | Configurable Value |\n| --- | --- | --- |\n| **Created FOR EACH `Version`** |\n| [Unity Catalog Delta Table](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html) with the output from the `\ud83d\uddc3\ufe0f Data Processor` | `{name}__embedded_docs__{random_uuid}` | `data_processors[].destination_table.name` (if specified, table will be called `{provided_value}__{version_id}`) |\n| [Unity Catalog Vector Index](https:\/\/docs.databricks.com\/generative-ai\/vector-search.html) of the Delta Table created by the `\ud83d\uddc3\ufe0f Data Processor` | `{name}__embedded_docs_index__{random_uuid}` | `data_processors[].destination_vector_index.databricks_vector_search.index_name` (if specified, index will be called `{provided_value}__{version_id}`) |\n| [Unity Catalog Model](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html) for the `\ud83d\udd17 Chain` | `{name}__chain__{version_id}` | not configurable |\n| Inference Table for the `\ud83d\udd17 Chain`\u2019s [Model Serving Endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html) | `rag_studio_{name}_{env_name}_payload` | not configurable |\n| **Created FOR EACH `Environment` ([details](https:\/\/docs.databricks.com\/rag-studio\/details\/environments.html))** |\n| `\ud83d\udc4d Assessment & Evaluation Results Log` | `rag_studio_{name}_{env_name}_assessment_log` | not configurable |\n| `\ud83d\uddc2\ufe0f Request Log` for user traffic | 
`rag_studio_{name}_{env_name}_request_log` | not configurable |\n| **Created FOR EACH `\ud83d\udcd6 Evaluation Set`** |\n| `\ud83d\udc4d Assessment & Evaluation Results Log` | `{name_of_evaluation_set_table}_assessment_log` | not configurable |\n| `\ud83d\uddc2\ufe0f Request Log` | `{name_of_evaluation_set_table}_request_log` | not configurable |\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/created-infra.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Orchestrate Databricks jobs with Apache Airflow\n\nThis article describes the Apache Airflow support for orchestrating data pipelines with Databricks, has instructions for installing and configuring Airflow locally, and provides an example of deploying and running a Databricks workflow with Airflow.\n\n##### Orchestrate Databricks jobs with Apache Airflow\n###### Job orchestration in a data pipeline\n\nDeveloping and deploying a data processing pipeline often requires managing complex dependencies between tasks. For example, a pipeline might read data from a source, clean the data, transform the cleaned data, and write the transformed data to a target. You also need support for testing, scheduling, and troubleshooting errors when you operationalize a pipeline. \nWorkflow systems address these challenges by allowing you to define dependencies between tasks, schedule when pipelines run, and monitor workflows. [Apache Airflow](https:\/\/airflow.apache.org\/) is an open source solution for managing and scheduling data pipelines. Airflow represents data pipelines as directed acyclic graphs (DAGs) of operations. You define a workflow in a Python file, and Airflow manages the scheduling and execution. 
The Airflow Databricks connection lets you combine the optimized Spark engine offered by Databricks with the scheduling features of Airflow.\n\n##### Orchestrate Databricks jobs with Apache Airflow\n###### Requirements\n\n* The integration between Airflow and Databricks requires Airflow version 2.5.0 or later. The examples in this article are tested with Airflow version 2.6.1.\n* Airflow requires Python 3.8, 3.9, 3.10, or 3.11. The examples in this article are tested with Python 3.8.\n* The instructions in this article to install and run Airflow require [pipenv](https:\/\/pipenv.pypa.io\/en\/latest\/) to create a [Python virtual environment](https:\/\/realpython.com\/python-virtual-environments-a-primer\/).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-airflow-with-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Orchestrate Databricks jobs with Apache Airflow\n###### Airflow operators for Databricks\n\nAn Airflow DAG is composed of tasks, where each task runs an Airflow [Operator](https:\/\/airflow.apache.org\/docs\/apache-airflow\/stable\/core-concepts\/operators.html). Airflow operators supporting the integration with Databricks are implemented in the [Databricks provider](https:\/\/airflow.apache.org\/docs\/apache-airflow-providers-databricks\/stable\/connections\/databricks.html). 
\nThe Databricks provider includes operators to run a number of tasks against a Databricks workspace, including [importing data into a table](https:\/\/airflow.apache.org\/docs\/apache-airflow-providers-databricks\/stable\/_api\/airflow\/providers\/databricks\/operators\/databricks_sql\/index.html#airflow.providers.databricks.operators.databricks_sql.DatabricksCopyIntoOperator), [running SQL queries](https:\/\/airflow.apache.org\/docs\/apache-airflow-providers-databricks\/stable\/_api\/airflow\/providers\/databricks\/operators\/databricks_sql\/index.html#airflow.providers.databricks.operators.databricks_sql.DatabricksSqlOperator), and working with [Databricks Git folders](https:\/\/airflow.apache.org\/docs\/apache-airflow-providers-databricks\/stable\/_api\/airflow\/providers\/databricks\/operators\/databricks_repos\/index.html). \nThe Databricks provider implements two operators for triggering jobs: \n* The [DatabricksRunNowOperator](https:\/\/airflow.apache.org\/docs\/apache-airflow-providers-databricks\/stable\/_api\/airflow\/providers\/databricks\/operators\/databricks\/index.html#airflow.providers.databricks.operators.databricks.DatabricksRunNowOperator) requires an existing Databricks job and uses the [POST \/api\/2.1\/jobs\/run-now](https:\/\/docs.databricks.com\/api\/workspace\/jobs\/runnow) API request to trigger a run. 
Databricks recommends using the `DatabricksRunNowOperator` because it reduces duplication of job definitions, and job runs triggered with this operator can be found in the [Jobs UI](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#view-job-run-list).\n* The [DatabricksSubmitRunOperator](https:\/\/airflow.apache.org\/docs\/apache-airflow-providers-databricks\/stable\/_api\/airflow\/providers\/databricks\/operators\/databricks\/index.html#airflow.providers.databricks.operators.databricks.DatabricksSubmitRunOperator) does not require a job to exist in Databricks and uses the [POST \/api\/2.1\/jobs\/runs\/submit](https:\/\/docs.databricks.com\/api\/workspace\/jobs\/submit) API request to submit the job specification and trigger a run. \nTo create a new Databricks job or reset an existing job, the Databricks provider implements the [DatabricksCreateJobsOperator](https:\/\/airflow.apache.org\/docs\/apache-airflow-providers-databricks\/stable\/_api\/airflow\/providers\/databricks\/operators\/databricks\/index.html#airflow.providers.databricks.operators.databricks.DatabricksCreateJobsOperator). The `DatabricksCreateJobsOperator` uses the [POST \/api\/2.1\/jobs\/create](https:\/\/docs.databricks.com\/api\/workspace\/jobs\/create) and [POST \/api\/2.1\/jobs\/reset](https:\/\/docs.databricks.com\/api\/workspace\/jobs\/reset) API requests. You can use the `DatabricksCreateJobsOperator` with the `DatabricksRunNowOperator` to create and run a job. \nNote \nUsing the Databricks operators to trigger a job requires providing credentials in the Databricks connection configuration. See [Create a Databricks personal access token for Airflow](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-airflow-with-jobs.html#create-token). \nThe Databricks Airflow operators write the job run page URL to the Airflow logs every `polling_period_seconds` (the default is 30 seconds). 
For more information, see the [apache-airflow-providers-databricks](https:\/\/airflow.apache.org\/docs\/apache-airflow-providers-databricks\/stable\/index.html) package page on the Airflow website.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-airflow-with-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Orchestrate Databricks jobs with Apache Airflow\n###### Install the Airflow Databricks integration locally\n\nTo install Airflow and the Databricks provider locally for testing and development, use the following steps. For other Airflow installation options, including creating a production installation, see [installation](https:\/\/airflow.apache.org\/docs\/apache-airflow\/stable\/installation\/index.html) in the Airflow documentation. \nOpen a terminal and run the following commands: \n```\nmkdir airflow\ncd airflow\npipenv --python 3.8\npipenv shell\nexport AIRFLOW_HOME=$(pwd)\npipenv install apache-airflow\npipenv install apache-airflow-providers-databricks\nmkdir dags\nairflow db init\nairflow users create --username admin --firstname <firstname> --lastname <lastname> --role Admin --email <email>\n\n``` \nReplace `<firstname>`, `<lastname>`, and `<email>` with your first name, last name, and email address. You will be prompted to enter a password for the admin user. Make sure to save this password because it is required to log in to the Airflow UI. \nThis script performs the following steps: \n1. Creates a directory named `airflow` and changes into that directory.\n2. Uses `pipenv` to create and spawn a Python virtual environment. Databricks recommends using a Python virtual environment to isolate package versions and code dependencies to that environment. This isolation helps reduce unexpected package version mismatches and code dependency collisions.\n3. 
Initializes an environment variable named `AIRFLOW_HOME` set to the path of the `airflow` directory.\n4. Installs Airflow and the Airflow Databricks provider packages.\n5. Creates an `airflow\/dags` directory. Airflow uses the `dags` directory to store DAG definitions.\n6. Initializes a SQLite database that Airflow uses to track metadata. In a production Airflow deployment, you would configure Airflow with a standard database. The SQLite database and default configuration for your Airflow deployment are initialized in the `airflow` directory.\n7. Creates an admin user for Airflow. \nTip \nTo confirm the installation of the Databricks provider, run the following command in the Airflow installation directory: \n```\nairflow providers list\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-airflow-with-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Orchestrate Databricks jobs with Apache Airflow\n###### Start the Airflow web server and scheduler\n\nThe Airflow web server is required to view the Airflow UI. To start the web server, open a terminal in the Airflow installation directory and run the following commands: \nNote \nIf the Airflow web server fails to start because of a port conflict, you can change the default port in the [Airflow configuration](https:\/\/airflow.apache.org\/docs\/apache-airflow\/stable\/configurations-ref.html). \n```\npipenv shell\nexport AIRFLOW_HOME=$(pwd)\nairflow webserver\n\n``` \nThe scheduler is the Airflow component that schedules DAGs. 
To start the scheduler, open a new terminal in the Airflow installation directory and run the following commands: \n```\npipenv shell\nexport AIRFLOW_HOME=$(pwd)\nairflow scheduler\n\n```\n\n##### Orchestrate Databricks jobs with Apache Airflow\n###### Test the Airflow installation\n\nTo verify the Airflow installation, you can run one of the example DAGs included with Airflow: \n1. In a browser window, open `http:\/\/localhost:8080\/home`. Log in to the Airflow UI with the username and password you created when installing Airflow. The Airflow **DAGs** page appears.\n2. Click the **Pause\/Unpause DAG** toggle to unpause one of the example DAGs, for example, the `example_python_operator`.\n3. Trigger the example DAG by clicking the **Trigger DAG** button.\n4. Click the DAG name to view details, including the run status of the DAG.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-airflow-with-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Orchestrate Databricks jobs with Apache Airflow\n###### Create a Databricks personal access token for Airflow\n\nAirflow connects to Databricks using a Databricks personal access token (PAT). To create a PAT: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**. \nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. 
If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. See your workspace administrator or the following: \n* [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n* [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \nYou can also authenticate to Databricks using Databricks OAuth for service principals. 
See [Databricks Connection](https:\/\/airflow.apache.org\/docs\/apache-airflow-providers-databricks\/stable\/connections\/databricks.html) in the Airflow documentation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-airflow-with-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Orchestrate Databricks jobs with Apache Airflow\n###### Configure a Databricks connection\n\nYour Airflow installation contains a default connection for Databricks. To update the connection to connect to your workspace using the personal access token you created above: \n1. In a browser window, open `http:\/\/localhost:8080\/connection\/list\/`. If prompted to sign in, enter your admin username and password.\n2. Under **Conn ID**, locate **databricks\\_default** and click the **Edit record** button.\n3. Replace the value in the **Host** field with the [workspace instance name](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url) of your Databricks deployment, for example, `https:\/\/adb-123456789.cloud.databricks.com`.\n4. In the **Password** field, enter your Databricks personal access token.\n5. Click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-airflow-with-jobs.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Orchestrate Databricks jobs with Apache Airflow\n###### Example: Create an Airflow DAG to run a Databricks job\n\nThe following example demonstrates how to create a simple Airflow deployment that runs on your local machine and deploys an example DAG to trigger runs in Databricks. In this example, you will: \n1. Create a new notebook and add code to print a greeting based on a configured parameter.\n2. 
Create a Databricks job with a single task that runs the notebook.\n3. Configure an Airflow connection to your Databricks workspace.\n4. Create an Airflow DAG to trigger the notebook job. You define the DAG in a Python script using `DatabricksRunNowOperator`.\n5. Use the Airflow UI to trigger the DAG and view the run status. \n### Create a notebook \nThis example uses a notebook containing two cells: \n* The first cell contains a [Databricks Utilities text widget](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#dbutils-widgets-text) defining a variable named `greeting` set to the default value `world`.\n* The second cell prints the value of the `greeting` variable prefixed by `hello`. \nTo create the notebook: \n1. Go to your Databricks workspace, click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar, and select **Notebook**.\n2. Give your notebook a name, such as **Hello Airflow**, and make sure the default language is set to **Python**.\n3. Copy the following Python code and paste it into the first cell of the notebook. \n```\ndbutils.widgets.text(\"greeting\", \"world\", \"Greeting\")\ngreeting = dbutils.widgets.get(\"greeting\")\n\n```\n4. Add a new cell below the first cell and copy and paste the following Python code into the new cell: \n```\nprint(\"hello {}\".format(greeting))\n\n``` \n### Create a job \n1. Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar.\n2. Click ![Create Job Button](https:\/\/docs.databricks.com\/_images\/create-job.png). \nThe **Tasks** tab appears with the create task dialog. \n![Create first task dialog](https:\/\/docs.databricks.com\/_images\/create-job-dialog.png)\n3. Replace **Add a name for your job\u2026** with your job name.\n4. In the **Task name** field, enter a name for the task, for example, **greeting-task**.\n5. In the **Type** drop-down menu, select **Notebook**.\n6. 
In the **Source** drop-down menu, select **Workspace**.\n7. Click the **Path** text box and use the file browser to find the notebook you created, click the notebook name, and click **Confirm**.\n8. Click **Add** under **Parameters**. In the **Key** field, enter `greeting`. In the **Value** field, enter `Airflow user`.\n9. Click **Create task**. \nIn the **Job details** panel, copy the **Job ID** value. This value is required to trigger the job from Airflow. \n### Run the job \nTo test your new job in the Databricks Workflows UI, click ![Run Now Button](https:\/\/docs.databricks.com\/_images\/run-now-button.png) in the upper right corner. When the run completes, you can verify the output by viewing the [job run details](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#job-run-details). \n### Create a new Airflow DAG \nYou define an Airflow DAG in a Python file. To create a DAG to trigger the example notebook job: \n1. In a text editor or IDE, create a new file named `databricks_dag.py` with the following contents: \n```\nfrom airflow import DAG\nfrom airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator\nfrom airflow.utils.dates import days_ago\n\ndefault_args = {\n    'owner': 'airflow'\n}\n\n# Define a DAG with no schedule, so it runs only when triggered manually\nwith DAG('databricks_dag',\n         start_date = days_ago(2),\n         schedule_interval = None,\n         default_args = default_args\n         ) as dag:\n\n    # Trigger a run of the existing Databricks job\n    opr_run_now = DatabricksRunNowOperator(\n        task_id = 'run_now',\n        databricks_conn_id = 'databricks_default',\n        job_id = JOB_ID\n    )\n\n``` \nReplace `JOB_ID` with the value of the job ID saved earlier.\n2. Save the file in the `airflow\/dags` directory. Airflow automatically reads and installs DAG files stored in `airflow\/dags\/`. \n### Install and verify the DAG in Airflow \nTo trigger and verify the DAG in the Airflow UI: \n1. In a browser window, open `http:\/\/localhost:8080\/home`. The Airflow **DAGs** screen appears.\n2. 
Locate `databricks_dag` and click the **Pause\/Unpause DAG** toggle to unpause the DAG.\n3. Trigger the DAG by clicking the **Trigger DAG** button.\n4. Click a run in the **Runs** column to view the status and details of the run.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-airflow-with-jobs.html"} +{"content":"# What is Delta Lake?\n### Clone a table on Databricks\n\nYou can create a copy of an existing Delta Lake table on Databricks at a specific version using the `clone` command. Clones can be either deep or shallow. \nDatabricks also supports cloning Parquet and Iceberg tables. See [Incrementally clone Parquet and Iceberg tables to Delta Lake](https:\/\/docs.databricks.com\/delta\/clone-parquet.html). \nFor details on using clone with Unity Catalog, see [Shallow clone for Unity Catalog tables](https:\/\/docs.databricks.com\/delta\/clone-unity-catalog.html). \nNote \nDatabricks recommends using Delta Sharing to provide read-only access to tables across different organizations. See [Share data and AI assets securely using Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html).\n\n### Clone a table on Databricks\n#### Clone types\n\n* A *deep clone* is a clone that copies the source table data to the clone target in addition to the metadata of the existing table. Additionally, stream metadata is also cloned such that a stream that writes to the Delta table can be stopped on a source table and continued on the target of a clone from where it left off.\n* A *shallow clone* is a clone that does not copy the data files to the clone target. The table metadata is equivalent to the source. These clones are cheaper to create. \nThe metadata that is cloned includes: schema, partitioning information, invariants, nullability. For deep clones only, stream and [COPY INTO](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-copy-into.html) metadata are also cloned. 
Metadata that is not cloned includes the table description and [user-defined commit metadata](https:\/\/docs.databricks.com\/delta\/custom-metadata.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clone.html"} +{"content":"# What is Delta Lake?\n### Clone a table on Databricks\n#### What are the semantics of Delta clone operations?\n\nIf you are working with a Delta table registered to the Hive metastore or a collection of files not registered as a table, clone has the following semantics: \nImportant \nIn Databricks Runtime 13.3 LTS and above, Unity Catalog managed tables have support for shallow clones. Clone semantics for Unity Catalog tables differ significantly from Delta Lake clone semantics in other environments. See [Shallow clone for Unity Catalog tables](https:\/\/docs.databricks.com\/delta\/clone-unity-catalog.html). \n* Any changes made to either deep or shallow clones affect only the clones themselves and not the source table.\n* Shallow clones reference data files in the source directory. If you run `vacuum` on the source table, clients can no longer read the referenced data files and a `FileNotFoundException` is thrown. In this case, running clone with replace over the shallow clone repairs the clone. If this occurs often, consider using a deep clone instead, which does not depend on the source table.\n* Deep clones do not depend on the source from which they were cloned, but are expensive to create because a deep clone copies the data as well as the metadata.\n* Cloning with `replace` to a target that already has a table at that path creates a Delta log if one does not exist at that path. You can clean up any existing data by running `vacuum`.\n* For existing Delta tables, a new commit is created that includes the new metadata and new data from the source table. This new commit is incremental, meaning that only new changes since the last clone are committed to the table.\n* Cloning a table is not the same as `Create Table As Select` or `CTAS`. 
A clone copies the metadata of the source table in addition to the data. Cloning also has simpler syntax: you don\u2019t need to specify partitioning, format, invariants, nullability and so on as they are taken from the source table.\n* A cloned table has an independent history from its source table. Time travel queries on a cloned table do not work with the same inputs as they work on its source table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clone.html"} +{"content":"# What is Delta Lake?\n### Clone a table on Databricks\n#### Example clone syntax\n\nThe following code examples demonstrate syntax for creating deep and shallow clones: \n```\nCREATE TABLE delta.`\/data\/target\/` CLONE delta.`\/data\/source\/` -- Create a deep clone of \/data\/source at \/data\/target\n\nCREATE OR REPLACE TABLE db.target_table CLONE db.source_table -- Replace the target\n\nCREATE TABLE IF NOT EXISTS delta.`\/data\/target\/` CLONE db.source_table -- No-op if the target table exists\n\nCREATE TABLE db.target_table SHALLOW CLONE delta.`\/data\/source`\n\nCREATE TABLE db.target_table SHALLOW CLONE delta.`\/data\/source` VERSION AS OF version\n\nCREATE TABLE db.target_table SHALLOW CLONE delta.`\/data\/source` TIMESTAMP AS OF timestamp_expression -- timestamp can be like \u201c2019-01-01\u201d or like date_sub(current_date(), 1)\n\n``` \n```\nfrom delta.tables import *\n\ndeltaTable = DeltaTable.forPath(spark, \"\/path\/to\/table\") # path-based tables, or\ndeltaTable = DeltaTable.forName(spark, \"source_table\") # Hive metastore-based tables\n\ndeltaTable.clone(target=\"target_table\", isShallow=True, replace=False) # clone the source at latest version\n\ndeltaTable.cloneAtVersion(version=1, target=\"target_table\", isShallow=True, replace=False) # clone the source at a specific version\n\n# clone the source at a specific timestamp such as timestamp=\"2019-01-01\"\ndeltaTable.cloneAtTimestamp(timestamp=\"2019-01-01\", target=\"target_table\", isShallow=True, 
replace=False)\n\n``` \n```\nimport io.delta.tables._\n\nval deltaTable = DeltaTable.forPath(spark, \"\/path\/to\/table\")\nval deltaTable = DeltaTable.forName(spark, \"source_table\")\n\ndeltaTable.clone(target=\"target_table\", isShallow=true, replace=false) \/\/ clone the source at latest version\n\ndeltaTable.cloneAtVersion(version=1, target=\"target_table\", isShallow=true, replace=false) \/\/ clone the source at a specific version\n\ndeltaTable.cloneAtTimestamp(timestamp=\"2019-01-01\", target=\"target_table\", isShallow=true, replace=false) \/\/ clone the source at a specific timestamp\n\n``` \nFor syntax details, see [CREATE TABLE CLONE](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-clone.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clone.html"} +{"content":"# What is Delta Lake?\n### Clone a table on Databricks\n#### Clone metrics\n\n`CLONE` reports the following metrics as a single row DataFrame once the operation is complete: \n* `source_table_size`: Size of the source table that\u2019s being cloned in bytes.\n* `source_num_of_files`: The number of files in the source table.\n* `num_removed_files`: If the table is being replaced, how many files are removed from the current table.\n* `num_copied_files`: Number of files that were copied from the source (0 for shallow clones).\n* `removed_files_size`: Size in bytes of the files that are being removed from the current table.\n* `copied_files_size`: Size in bytes of the files copied to the table. \n![Clone metrics example](https:\/\/docs.databricks.com\/_images\/clone-metrics.png)\n\n### Clone a table on Databricks\n#### Permissions\n\nYou must configure permissions for Databricks table access control and your cloud provider. 
\n### Table access control \nThe following permissions are required for both deep and shallow clones: \n* `SELECT` permission on the source table.\n* If you are using `CLONE` to create a new table, `CREATE` permission on the database in which you are creating the table.\n* If you are using `CLONE` to replace a table, you must have `MODIFY` permission on the table. \n### Cloud provider permissions \nIf you have created a deep clone, any user that reads the deep clone must have read access to the clone\u2019s directory. To make changes to the clone, users must have write access to the clone\u2019s directory. \nIf you have created a shallow clone, any user that reads the shallow clone needs permission to read the files in the original table, since the data files remain in the source table with shallow clones, as well as the clone\u2019s directory. To make changes to the clone, users will need write access to the clone\u2019s directory.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clone.html"} +{"content":"# What is Delta Lake?\n### Clone a table on Databricks\n#### Use clone for data archiving\n\nYou can use deep clone to preserve the state of a table at a certain point in time for archival purposes. You can sync deep clones incrementally to maintain an updated state of a source table for disaster recovery. \n```\n-- Every month run\nCREATE OR REPLACE TABLE delta.`\/some\/archive\/path` CLONE my_prod_table\n\n```\n\n### Clone a table on Databricks\n#### Use clone for ML model reproduction\n\nWhen doing machine learning, you may want to archive a certain version of a table on which you trained an ML model. Future models can be tested using this archived data set. 
\n```\n-- Trained model on version 15 of Delta table\nCREATE TABLE delta.`\/model\/dataset` CLONE entire_dataset VERSION AS OF 15\n\n```\n\n### Clone a table on Databricks\n#### Use clone for short-term experiments on a production table\n\nTo test a workflow on a production table without corrupting the table, you can easily create a shallow clone. This allows you to run arbitrary workflows on the cloned table that contains all the production data but does not affect any production workloads. \n```\n-- Perform shallow clone\nCREATE OR REPLACE TABLE my_test SHALLOW CLONE my_prod_table;\n\nUPDATE my_test SET invalid=true WHERE user_id is null;\n-- Run a bunch of validations. Once happy:\n\n-- This should leverage the update information in the clone to prune to only\n-- changed files in the clone if possible\nMERGE INTO my_prod_table\nUSING my_test\nON my_test.user_id <=> my_prod_table.user_id\nWHEN MATCHED AND my_test.user_id is null THEN UPDATE SET *;\n\nDROP TABLE my_test;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clone.html"} +{"content":"# What is Delta Lake?\n### Clone a table on Databricks\n#### Use clone to override table properties\n\nNote \nAvailable in Databricks Runtime 7.5 and above. \nTable property overrides are particularly useful for: \n* Annotating tables with owner or user information when sharing data with different business units.\n* Archiving Delta tables when table history or time travel is required. You can specify the data and log retention periods independently for the archive table. 
For example: \n```\nCREATE OR REPLACE TABLE archive.my_table CLONE prod.my_table\nTBLPROPERTIES (\n  delta.logRetentionDuration = '3650 days',\n  delta.deletedFileRetentionDuration = '3650 days'\n)\nLOCATION 'xx:\/\/archive\/my_table'\n\n``` \n```\nfrom delta.tables import DeltaTable\n\ndt = DeltaTable.forName(spark, \"prod.my_table\")\ntblProps = {\n    \"delta.logRetentionDuration\": \"3650 days\",\n    \"delta.deletedFileRetentionDuration\": \"3650 days\"\n}\n# Pass the overrides by keyword; a positional argument cannot follow keyword arguments\ndt.clone('xx:\/\/archive\/my_table', isShallow=False, replace=True, properties=tblProps)\n\n``` \n```\nimport io.delta.tables._\n\nval dt = DeltaTable.forName(spark, \"prod.my_table\")\nval tblProps = Map(\n  \"delta.logRetentionDuration\" -> \"3650 days\",\n  \"delta.deletedFileRetentionDuration\" -> \"3650 days\"\n)\ndt.clone(\"xx:\/\/archive\/my_table\", isShallow = false, replace = true, properties = tblProps)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/clone.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Access data shared with you using Delta Sharing (for recipients)\n\nThis article shows how to access data that has been shared with you using Delta Sharing.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/recipient.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Access data shared with you using Delta Sharing (for recipients)\n#### Delta Sharing and data recipients\n\nDelta Sharing is an open standard for secure data sharing. A Databricks user, called a *data provider*, can use Delta Sharing to share data with a person or group outside of their organization, called a *data recipient*. \n### Databricks-to-Databricks sharing and open sharing \nHow you access the data depends on whether you are a Databricks user and whether your data provider configured the data being shared with you for *Databricks-to-Databricks* sharing or *open sharing*. 
\n**In the Databricks-to-Databricks model**, you must be a user on a Databricks workspace that is enabled for [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). A member of your team provides the data provider with a unique identifier for your Unity Catalog metastore, and the data provider uses that to create a secure sharing connection. The shared data becomes available for access in your workspace. If necessary, a member of your team configures granular access control on that data. \n**In the open sharing model**, you can use any tool you like (including Databricks) to access the shared data. The data provider sends you an activation URL over a secure channel. You follow it to download a credential file that lets you access the data shared with you. \n### Terms of use \nThe shared data is not provided by Databricks directly but by data providers running on Databricks. \nNote \nBy accessing a data provider\u2019s shared data as a data recipient, data recipient represents that it has been authorized to access the data share(s) provided to it by the data provider and acknowledges that (1) Databricks has no liability for such data or data recipient\u2019s use of such shared data, and (2) Databricks may collect information about data recipient\u2019s use of and access to the shared data (including identifying any individual or company who accesses the data using the credential file in connection with such information) and may share it with the applicable data provider.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/recipient.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Access data shared with you using Delta Sharing (for recipients)\n#### Get access to the data shared with you\n\nHow you access the data depends on whether your data provider shared data with you using the open sharing protocol or the Databricks-to-Databricks sharing protocol. 
See [Databricks-to-Databricks sharing and open sharing](https:\/\/docs.databricks.com\/data-sharing\/recipient.html#open-sharing-vs-db-to-db). \n### Get access in the Databricks-to-Databricks model \nIn the Databricks-to-Databricks model: \n1. The data provider sends you instructions for finding a unique identifier for the [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) metastore associated with your Databricks workspace, and you send it to them. \nThe sharing identifier is a string consisting of the metastore\u2019s cloud, region, and UUID (the unique identifier for the metastore), in the format `<cloud>:<region>:<uuid>`. For example, `aws:eu-west-1:b0c978c8-3e68-4cdf-94af-d05c120ed1ef`. \nTo get the sharing identifier using Catalog Explorer: \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared with me**.\n3. Above the Providers tab, click the **Sharing identifier** copy icon. \nTo get the sharing identifier using a notebook or Databricks SQL query, use the default SQL function `CURRENT_METASTORE`. If you use a notebook, it must run on a [shared or single-user cluster](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode) in the workspace you will use to access the shared data. \n```\nSELECT CURRENT_METASTORE();\n\n```\n2. The data provider creates: \n* A *recipient* in their Databricks account to represent you and the users in your organization who will access the data.\n* A *share*, which is a representation of the tables, volumes, and views to be shared with you.\n3. You access the data shared with you. You or someone on your team can, if necessary, configure granular data access on that data for your users. See [Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html). 
\n### Get access in the open sharing model \nIn the open sharing model: \n1. The data provider creates: \n* A *recipient* in their Databricks account to represent you and the users in your organization who will access the data. A token and credential file are generated as part of this configuration.\n* A *share*, which is a representation of the tables and partitions to be shared with you.\n2. The data provider sends you an activation URL over a secure channel. You follow it to download a credential file that lets you access the data shared with you. \nImportant \nDon\u2019t share the activation link with anyone. You can download a credential file only once. If you visit the activation link again after the credential file has already downloaded, the **Download Credential File** button is disabled. \nIf you lose the activation link before you use it, contact the data provider.\n3. Store the credential file in a secure location. \nDon\u2019t share the credential file with anyone outside the group of users who should have access to the shared data. If you need to share it with someone in your organization, Databricks recommends using a password manager.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/recipient.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Access data shared with you using Delta Sharing (for recipients)\n#### Read the shared data\n\nHow you read data that has been shared securely with you using Delta Sharing depends on whether you received a credential file (the open sharing model) or you are using a Databricks workspace and you provided the data provider with your sharing identifier (the Databricks-to-Databricks model). \n### Read shared data using a credential file (open sharing) \nIf data has been shared with you using the Delta Sharing open sharing protocol, you use the credential file that you downloaded to authenticate to the data provider\u2019s Databricks account and read the shared data. 
Access persists as long as the underlying token is valid and the provider continues to share the data. Providers manage token expiration and rotation. Updates to the data are available to you in near real time. You can read and make copies of the shared data, but you can\u2019t modify the source data. \nTo learn how to access and read shared data using the credential file in Databricks, Apache Spark, pandas, and Power BI, see [Read data shared using Delta Sharing open sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html). \n### Read shared data using Databricks-to-Databricks sharing \nIf data has been shared with you using the Databricks-to-Databricks model, then no credential file is required to access the shared data. Databricks takes care of the secure connection, and the shared data is automatically discoverable in your Databricks workspace. \nTo learn how to find, read, and manage that shared data in your Databricks workspace, see [Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/recipient.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Access data shared with you using Delta Sharing (for recipients)\n#### Audit usage of shared data\n\nIf you have access to a Databricks workspace, you can use Databricks audit logs to understand who in your organization is accessing which data using Delta Sharing. 
See [Audit and monitor data sharing](https:\/\/docs.databricks.com\/data-sharing\/audit-logs.html).\n\n### Access data shared with you using Delta Sharing (for recipients)\n#### Next steps\n\n* [Learn more about Databricks](https:\/\/docs.databricks.com\/introduction\/index.html)\n* [Learn more about Delta Sharing](https:\/\/delta.io\/sharing\/)\n* [Learn more about Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/recipient.html"} +{"content":"# Get started: Account and workspace setup\n### Start a Databricks free trial on AWS\n\nNote \nTo create your Databricks account, you need an existing AWS account. If you don\u2019t have an AWS account, you can sign up for an AWS Free Tier account at <https:\/\/aws.amazon.com\/free\/>. \nThis article tells you how to sign up for a Databricks free trial and how to cancel the free trial.\n\n### Start a Databricks free trial on AWS\n#### How does the free Databricks trial work?\n\nDuring the 14-day free trial, all Databricks usage is free, but Databricks uses compute and S3 storage resources in your AWS account. \nWhen your Databricks free trial ends, you receive an email informing you that you are automatically enrolled in the Databricks plan that you selected when you signed up for the trial, but you won\u2019t be billed without your authorization. You can upgrade to a [higher tier plan](https:\/\/databricks.com\/product\/pricing\/platform-addons) at any time after your trial ends.\n\n### Start a Databricks free trial on AWS\n#### Cancel the free trial\n\nYou can cancel your subscription at any time. 
See [Cancel your Databricks subscription](https:\/\/docs.databricks.com\/admin\/account-settings\/account.html#cancel).\n\n### Start a Databricks free trial on AWS\n#### Sign-up choice: Databricks or AWS\n\nYou can sign up through [AWS Marketplace](https:\/\/aws.amazon.com\/marketplace\/pp\/prodview-wtyi5lgtce6n6) or through the [Databricks website](https:\/\/databricks.com\/try-databricks). The main difference is how you\u2019ll be billed after the free trial ends: \n* If you signed up through AWS Marketplace, your Databricks usage will be billed through your AWS account.\n* If you signed up through Databricks, you\u2019ll pay with a credit card and manage billing through the Databricks account console.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/free-trial.html"} +{"content":"# Get started: Account and workspace setup\n### Start a Databricks free trial on AWS\n#### Sign up through Databricks\n\nIf you sign up for the 14-day free trial directly with Databricks, you must provide credit card information, but you won\u2019t be billed during the trial period. Billing won\u2019t start after your trial ends without your authorization. \n1. Navigate to the [Try Databricks](https:\/\/databricks.com\/try-databricks) page.\n2. Enter your name, company, email, and title, and click **Continue**.\n3. Select **Amazon Web Services** as your cloud provider and click **Get started**. \n![Try Databricks](https:\/\/docs.databricks.com\/_images\/try.png)\nThe Databricks trial is free, but you must have an AWS account because Databricks uses compute and storage resources in your AWS account.\n4. Look for the welcome email and click the link to verify your email address. You are prompted to create your Databricks password.\n5. 
After you create your password, you\u2019re redirected to the Databricks account console, where you can set up your Databricks [account and create a workspace](https:\/\/docs.databricks.com\/admin\/workspace\/quick-start.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/free-trial.html"} +{"content":"# Get started: Account and workspace setup\n### Start a Databricks free trial on AWS\n#### Sign up through AWS Marketplace\n\nNote \nAn AWS account can have only one active AWS Marketplace subscription to Databricks. \n1. Log in to your AWS account as a user with the Purchaser role.\n2. Go to [AWS Marketplace](https:\/\/aws.amazon.com\/marketplace\/pp\/prodview-wtyi5lgtce6n6). \nYou can also follow this [direct link](https:\/\/aws.amazon.com\/marketplace\/pp\/prodview-wtyi5lgtce6n6).\n3. On the initial subscription page, click **View purchase options**.\n4. On the next page, read the terms and click **Subscribe**.\n5. On the pop-up dialog, click **Set Up Your Account**. \nA Databricks sign-up page appears.\n6. Enter your email address, first name, last name, and company, and click **Sign up**. \nThis email address becomes your Databricks account owner username.\n7. When you\u2019ve finished, look for two emails: \n* An email from Amazon confirming your Databricks subscription in AWS Marketplace.\n* An email from Databricks welcoming you and asking you to verify your email address.\n8. In the Databricks welcome email, click the link to verify your email address. You are prompted to create your Databricks password.\n9. 
After you create your password, you\u2019re redirected to the Databricks account console, where you can set up your Databricks [account and create a workspace](https:\/\/docs.databricks.com\/admin\/workspace\/quick-start.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/free-trial.html"} +{"content":"# Get started: Account and workspace setup\n### Start a Databricks free trial on AWS\n#### Manage credit card billing\n\nIf you signed up for a 14-day free trial paying with a credit card, you can continue to use your Databricks account when the trial is over by adding billing information. You will receive an email to remind you. \n1. Log in to the [account console](https:\/\/accounts.cloud.databricks.com\/) as the account owner or an account admin.\n2. Click the **Settings** icon in the sidebar and click the **Subscription & Billing** tab.\n3. Click the **Add billing information** button.\n4. On the **Billing** page, add your billing information and click **Save**. \nYou will be billed monthly until you cancel. To switch from monthly credit-card billing to invoice or commit billing, contact a Databricks representative.\n\n### Start a Databricks free trial on AWS\n#### Manage AWS Marketplace subscription billing\n\nIf you signed up for the free trial using AWS Marketplace, charges appear on the [AWS Billing](https:\/\/docs.aws.amazon.com\/awsaccountbilling\/latest\/aboutv2\/billing-what-is.html) & Cloud Management dashboard alongside your other AWS charges. 
After the free trial period, you are billed only for the resources you use.\n\n### Start a Databricks free trial on AWS\n#### Next steps\n\nFor a 30-minute setup guide for your first Databricks workspace, including setting up connections with your cloud storage, see [Get started: Databricks workspace onboarding](https:\/\/docs.databricks.com\/getting-started\/onboarding-account.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/free-trial.html"} +{"content":"# \n### Initialize a RAG Application\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html) \nThe following guide walks you through initializing a RAG Studio application. \nImportant \nThe steps in this tutorial are done *once per app* to initialize your application\u2019s code base. The steps in the remainder of the tutorials are *repeated throughout your development process* as you iterate on versions of the application.\n\n### Initialize a RAG Application\n#### Step 1: Initialize your development environment\n\n1. Follow the steps in [Development Environment Setup](https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-dev.html) to configure your development environment.\n2. Extract the RAG Studio Asset Bundle template to your home directory. \n```\nmkdir ~\/.rag_studio\ncd ~\/.rag_studio\ncurl -O <URL to zip file provided by your Databricks contact>\nunzip rag-studio-0.0.0a2.zip\n\n``` \nNote \nThis step is only required due to the Private Preview status of the product.\n\n### Initialize a RAG Application\n#### Step 2: Configure the required infrastructure\n\nFollow the steps in the [Infrastructure Setup guide](https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-infra.html) to create the required infrastructure. 
\nBy default, RAG Studio provisions new job clusters for tasks like data ingestion, RAG chain creation, and evaluation. For more information on cluster requirements, including instructions for using an interactive cluster instead, see [Clusters](https:\/\/docs.databricks.com\/rag-studio\/setup\/clusters.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1-create-sample-app.html"} +{"content":"# \n### Initialize a RAG Application\n#### Step 3: Initialize the application\n\n1. Open the terminal on your development machine and change to the directory you want to hold the application\u2019s code base.\n2. Initialize the sample application using the Asset Bundle template. \nNote \nIf you have multiple Databricks CLI profiles configured in `~\/.databrickscfg`, RAG Studio will use the default profile for creating your app. \n```\ndatabricks bundle init ~\/.rag_studio\/rag-studio-0.0.0a2\/\n\n> RAG Studio app name (e.g. hr-bot): databricks-docs-bot\n> Existing catalog name: <catalog-name>\n> Existing Unity Catalog schema name: <schema-name>\n> The secret scope used to access vector search endpoint: <secret-scope>\n> The secret key used to access vector search endpoint: <secret-key>\n\n\u2728 Your Rag Studio workspace has been created in the 'databricks-docs-bot' directory!\n\nPlease refer to the README.md of your project for further instructions on getting started.\n\n``` \nNote \nRead the [directory structure](https:\/\/docs.databricks.com\/rag-studio\/details\/directory-structure.html) reference doc to understand how the code base is structured.\n3. Change to the (new) folder `databricks-docs-bot` inside `~\/rag_studio`. The folder is named based on the app name you provided. \n```\ncd databricks-docs-bot\n\n```\n4. Install the required Python libraries. \n```\npip install -r requirements.txt\n\n```\n5. At this point, your environment is set up and you are ready to start development. 
Before we continue, let\u2019s understand the `.\/rag` command-line interface, which is used to run RAG Studio\u2019s workflows and tasks. Throughout the tutorial, we will show you how to use these commands, but you can always run `.\/rag --help` or `.\/rag name-of-command --help` to understand how to use a specific command. \n```\n.\/rag --help\n\nUsage: rag [OPTIONS] COMMAND [ARGS]...\n\nOptions:\n-h, --help Show this message and exit.\n\nCommands:\ncreate-rag-version Create and deploy a new version of the RAG chain...\ndeploy-chain Deploy the chain model for a given RAG version\nexplore-eval Run the exploration notebook on the evaluation results\ningest-data (Dev environment only) Ingest raw data from data...\nrun-offline-eval Run offline evaluation for a given chain version...\nrun-online-eval Run online evaluation on the currently-logged...\nsetup-prod-env Set up the EnvironmentName.REVIEWERS and...\nstart-review Start the review process for a given chain model...\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1-create-sample-app.html"} +{"content":"# \n### Initialize a RAG Application\n#### Understanding how RAG Studio deployment jobs work\n\nDeployment jobs in RAG Studio are managed with the `.\/rag` command-line utility. When you start a deployment job, the job is run using Databricks Jobs on a compute cluster. In future releases, these deployment steps will not require provisioning a compute cluster. \nWarning \nIf you press CTRL+C to terminate a deployment command after the deployment job has started, the job keeps running in the background. To stop the job, go to the URL printed in the console and stop it in the Databricks Workflow UI. \nNote \nBy default, RAG Studio will provision new compute for each deployment job. If you prefer to use an existing compute cluster, you can pass `-c <cluster-id>` to any `.\/rag` command. 
Alternatively, you can set a `cluster_id` in `config\/rag-config.yml` in `environment_config.development.cluster_id`. Note that this *only* works in the development `Environment`. \nTo find `<cluster-id>` for your compute, open the cluster in the Databricks UI. In the cluster\u2019s URL, the `<cluster-id>` is `0126-194718-ucabc7oi`: `https:\/\/<workspace-url>\/?o=123456798#setting\/clusters\/0126-194718-ucabc7oi\/configuration`. \nIf you opt to use your own compute, ensure the cluster is using the `MLR 13.3` runtime.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1-create-sample-app.html"} +{"content":"# \n### Initialize a RAG Application\n#### Step 4: Initialize the `Environments`\n\n1. Run the following command to initialize these `Environments`. This command takes about 10 minutes to run. \n```\n.\/rag setup-prod-env\n\n``` \nNote \nSee [Infrastructure and Unity Catalog assets created by RAG Studio](https:\/\/docs.databricks.com\/rag-studio\/details\/created-infra.html) for details of what is created in your Workspace and Unity Catalog schema. \nImportant \nYou can safely ignore warnings about multiple versions of the Databricks CLI. \n`Databricks CLI v0.212.1 found at \/opt\/homebrew\/bin\/databricks` \n`Your current $PATH prefers running CLI v0.18.0 at \/<your env path>\/bin\/databricks` \n`Because both are installed and available in $PATH, I assume you are trying to run the newer version.`\n`If you want to disable this behavior you can set DATABRICKS_CLI_DO_NOT_EXECUTE_NEWER_VERSION=1.`\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1-create-sample-app.html"} +{"content":"# \n### Initialize a RAG Application\n#### Step 5: Ingest sample data for your application\n\nNote \nThe default `\ud83d\udce5 Data Ingestor` downloads the Databricks documentation. 
You can modify the code in `src\/notebooks\/ingest_data.py` to ingest from another source or adjust `config\/rag-config.yml` to use data that already exists in a Unity Catalog Volume. The default `\ud83d\uddc3\ufe0f Data Processor` that ships with RAG Studio supports only HTML files. If you have other file types in your Unity Catalog Volume, follow the steps in [Creating a \ud83d\uddc3\ufe0f Data Processor version](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-data-processor.html) to adjust the `\ud83d\uddc3\ufe0f Data Processor` code. \n1. Run the following command to start the data ingestion process. The default application will download the Databricks documentation to a Unity Catalog volume in your configured UC schema. This step takes approximately 10 minutes. \nNote \nThe Unity Catalog catalog and schema are the ones that you configured in [Step 3](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1-create-sample-app.html#step-3-initialize-the-application). \n```\n.\/rag ingest-data -e dev\n\n```\n2. You will see the following message in your console when the ingestion completes. 
\n```\n-------------------------\nRun URL: <URL to the deployment Databricks Job>\n\n<timestamp> \"[dev e] [databricks-docs-bot][dev] ingest_data\" RUNNING\n<timestamp> \"[dev e] [databricks-docs-bot][dev] ingest_data\" TERMINATED SUCCESS\nSuccessfully downloaded and uploaded Databricks documentation articles to UC Volume '`catalog`.`schema`.`raw_databricks_docs`'%\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1-create-sample-app.html"} +{"content":"# \n### Initialize a RAG Application\n#### Follow the next tutorial!\n\n[Ingest or connect raw data](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1b-ingest-data.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1-create-sample-app.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### HIPAA compliance features\n\nPreview \nThe ability for admins to add Enhanced Security and Compliance features is a feature in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). The compliance security profile and support for compliance standards are generally available (GA). \nHIPAA compliance features require enabling the *compliance security profile*, which adds monitoring agents, enforces instance types for inter-node encryption, provides a hardened compute image, and includes other features. For technical details, see [Compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html). It is your responsibility to [confirm that each workspace has the compliance security profile enabled](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html#verify). \nTo use the compliance security profile, your Databricks account must include the Enhanced Security and Compliance add-on. For details, see the [pricing page](https:\/\/databricks.com\/product\/aws-pricing). 
\nThis feature requires your workspace to be on the Enterprise pricing tier. \nEnsure that sensitive information is never entered in customer-defined input fields, such as workspace names, cluster names, and job names.\n\n####### HIPAA compliance features\n######## Which compute resources get enhanced security\n\nThe compliance security profile enhancements apply to compute resources in the [classic compute plane](https:\/\/docs.databricks.com\/getting-started\/overview.html) in all regions. \nServerless SQL warehouse support for the compliance security profile varies by region. See [Serverless SQL warehouses support the compliance security profile in some regions](https:\/\/docs.databricks.com\/admin\/sql\/serverless.html#security-profile).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/hipaa.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### HIPAA compliance features\n######## HIPAA overview\n\nThe Health Insurance Portability and Accountability Act of 1996 (HIPAA), the Health Information Technology for Economic and Clinical Health (HITECH), and the regulations issued under HIPAA are a set of US healthcare laws. Among other provisions, these laws establish requirements for the use, disclosure, and safeguarding of protected health information (PHI). \nHIPAA applies to [covered entities and business associates](https:\/\/www.hhs.gov\/hipaa\/for-professionals\/covered-entities\/index.html) that create, receive, maintain, transmit, or access PHI. When a covered entity or business associate engages the services of a cloud service provider (CSP), such as Databricks, the CSP becomes a business associate under HIPAA. \nHIPAA regulations require that covered entities and their business associates enter into a contract called a Business Associate Agreement (BAA) to ensure the business associates will protect PHI adequately. 
Among other things, a BAA establishes the permitted and required uses and disclosures of PHI by the business associate, based on the relationship between the parties and the activities and services being performed by the business associate.\n\n####### HIPAA compliance features\n######## Does Databricks permit the processing of PHI data on Databricks?\n\nYes, if you enable the compliance security profile and add the HIPAA compliance standard as part of the compliance security profile configuration. Contact your Databricks account team for more information. It is your responsibility to have a BAA in place with Databricks before you process PHI data.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/hipaa.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### HIPAA compliance features\n######## Enable HIPAA on a workspace\n\nThis section assumes you are on the E2 version of the Databricks platform. \nIf you are an existing HIPAA customer and your account is **not** yet on the E2 version of the Databricks platform, note the following: \n* The E2 platform is a multi-tenant platform, and your choice to deploy HIPAA on E2 will be treated as a waiver of any provision in your contract that would be in conflict with our ability to provide you HIPAA on the E2 platform. \nTo configure your workspace to support processing of data regulated by the HIPAA compliance standard, the workspace must have the [compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html) enabled. You can enable it and add the HIPAA compliance standard across all workspaces or only on some workspaces. 
\n* To enable the compliance security profile and add the HIPAA compliance standard for an existing workspace, see [Enable enhanced security and compliance features on a workspace](https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-compliance.html#aws-workspace-config).\n* To set an account-level setting to enable the compliance security profile and HIPAA for new workspaces, see [Set account-level defaults for new workspaces](https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-compliance.html#aws-account-level-defaults). \nImportant \n* You are wholly responsible for ensuring your own compliance with all applicable laws and regulations. Information provided in Databricks online documentation does not constitute legal advice, and you should consult your legal advisor for any questions regarding regulatory compliance.\n* Databricks does not support the use of preview features for the processing of PHI on the HIPAA on E2 platform, with the exception of the features listed in [Preview features that are supported for processing of PHI data](https:\/\/docs.databricks.com\/security\/privacy\/hipaa.html#supported-preview-features).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/hipaa.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### HIPAA compliance features\n######## Preview features that are supported for processing of PHI data\n\nThe following preview features are supported for processing of PHI: \n* [SCIM provisioning](https:\/\/docs.databricks.com\/admin\/users-groups\/scim\/index.html)\n* [IAM passthrough](https:\/\/docs.databricks.com\/archive\/credential-passthrough\/iam-passthrough.html)\n* [Secret paths in environment variables](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html#spark-conf-env-var)\n* [System tables](https:\/\/docs.databricks.com\/admin\/system-tables\/index.html)\n* [Serverless SQL warehouse 
usage when compliance security profile is enabled](https:\/\/docs.databricks.com\/admin\/sql\/serverless.html#security-profile), with support in some regions\n* [Filtering sensitive table data with row filters and column masks](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html)\n* [Unified login](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/single-sign-on\/index.html#unified-login)\n* [Lakehouse Federation to Redshift](https:\/\/docs.databricks.com\/query-federation\/redshift.html)\n* [Liquid clustering for Delta tables](https:\/\/docs.databricks.com\/delta\/clustering.html)\n* [Unity Catalog-enabled DLT pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html)\n* [Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html)\n* Scala support for shared clusters\n* Delta Live Tables Hive metastore to Unity Catalog clone API\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/hipaa.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### HIPAA compliance features\n######## Shared responsibility of HIPAA compliance\n\nComplying with HIPAA has three major areas, with different responsibilities. While each party has numerous responsibilities, below we enumerate key responsibilities of ours, along with your responsibilities. \nThis article uses the Databricks terms *control plane* and *compute plane*, which are two main parts of how Databricks works: \n* The Databricks [control plane](https:\/\/docs.databricks.com\/getting-started\/overview.html) includes the backend services that Databricks manages in its own AWS account.\n* The compute plane is where your data lake is processed. 
The [classic compute plane](https:\/\/docs.databricks.com\/getting-started\/overview.html) includes a VPC in your AWS account and clusters of compute resources to process your notebooks, jobs, and pro or classic SQL warehouses. \nImportant \nFor workspaces with HIPAA compliance features enabled, *compute plane* refers to the classic compute plane in your own AWS account. As of this release, [serverless compute](https:\/\/docs.databricks.com\/getting-started\/overview.html#serverless) features are disabled on a workspace with HIPAA compliance features enabled. \nKey responsibilities of AWS include: \n* Perform its obligations as a business associate under your BAA with AWS.\n* Provide you the EC2 machines under your contract with AWS that support HIPAA compliance.\n* Provide hardware-accelerated encryption at rest and in-transit encryption within the AWS Nitro Instances that is adequate under HIPAA.\n* Delete encryption keys and data when Databricks releases the EC2 instances. \nKey responsibilities of Databricks include: \n* Encrypt in-transit PHI data that is transmitted to or from the control plane.\n* Encrypt PHI data at rest in the control plane.\n* Limit the set of instance types to the AWS Nitro instance types that enforce in-transit encryption and encryption at rest. For the list of supported instance types, see AWS Nitro System and HIPAA compliance features. Databricks limits the instance types both in the account console and through the API.\n* Deprovision EC2 instances when you indicate in Databricks that they are to be deprovisioned, for example through auto-termination or manual termination, so that AWS can wipe them. 
\nYour key responsibilities include: \n* Configure your workspace to use either [customer-managed keys for managed services](https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html#managed-services) or the [Store interactive notebook results in customer account](https:\/\/docs.databricks.com\/admin\/workspace-settings\/notebook-results.html) feature.\n* Do not use preview features within Databricks to process PHI other than features listed in [Preview features that are supported for processing of PHI data](https:\/\/docs.databricks.com\/security\/privacy\/hipaa.html#supported-preview-features).\n* Follow [security](https:\/\/docs.databricks.com\/security\/index.html) best practices, such as disabling unnecessary egress from the compute plane and using the Databricks [secrets](https:\/\/docs.databricks.com\/api\/workspace\/secrets) feature (or other similar functionality) to store access keys that provide access to PHI.\n* Enter into a business associate agreement with AWS to cover all data processed within the VPC where the EC2 instances are deployed.\n* Do not do anything within a virtual machine that would violate HIPAA, such as directing Databricks to send unencrypted PHI to an endpoint.\n* Ensure that all data that may contain PHI is encrypted at rest when you store it in locations that the Databricks platform may interact with. This includes setting the encryption settings on each workspace\u2019s root S3 bucket that is part of workspace creation. You are responsible for ensuring the encryption (as well as performing backups) for this storage and all other data sources.\n* Ensure that all data that may contain PHI is encrypted in transit between Databricks and any of your data storage locations or external locations you access from a compute plane machine. 
For example, any APIs that you use in a notebook that might connect to an external data source must use appropriate encryption on any outgoing connections. \nNote the following about [customer-managed](https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html) keys: \n* You can add customer-managed keys for your workspace\u2019s root S3 bucket using the customer-managed keys for workspace storage feature, but Databricks does not require you to do so.\n* As an optional part of the customer-managed keys for workspace storage feature, you can add customer-managed keys for EBS volumes, but this is not necessary for HIPAA compliance. 
\nNote \nIf you are an existing HIPAA customer and your workspace is **not** on the E2 version of the Databricks platform, to create a cluster, see the legacy article [Create and verify a cluster for legacy HIPAA support](https:\/\/docs.databricks.com\/archive\/security\/hipaa-legacy-cluster.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/hipaa.html"} +{"content":"# \n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-data-processor.html"} +{"content":"# \n### Creating a `\ud83d\uddc3\ufe0f Data Processor` version\n#### Conceptual overview\n\nThe `\ud83d\uddc3\ufe0f Data Processor` is a data pipeline that parses, chunks, and embeds unstructured documents from a `\ud83d\udce5 Data Ingestor` destination [UC Volume](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html) into chunks stored in a Delta Table and synced to a [Unity Catalog Vector Index](https:\/\/docs.databricks.com\/generative-ai\/vector-search.html). A `\ud83d\uddc3\ufe0f Data Processor` is associated with 1+ `\ud83d\udce5 Data Ingestor` and can be associated with any number of `\ud83d\udd0d Retriever`s. \nA `\ud83d\uddc3\ufe0f Data Processor` consists of: \n1. Configuration stored in the `data_processors` section of `rag-config.yml`\n2. Code stored in `app-directory\/src\/process_data.py` \nTo parse & chunk documents, you can define any custom Python code, including the use of LangChain TextSplitters. \nTo simplify experimentation with different settings, Databricks suggests parameterizing your `\ud83d\uddc3\ufe0f Data Processor` using the `key:value configuration` settings in `rag-config.yml`. By default, Databricks provides a `chunk_size` and `chunk_overlap` configuration, but you can create any custom parameter. \nTo embed documents, configure an embedding model in `rag-config.yml`. 
This embedding model can be any [Foundation Model APIs pay-per-token](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#what-are-databricks-foundation-model-apis), [Foundation Model APIs provisioned throughput](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#what-are-databricks-foundation-model-apis), or [External Model](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html) Endpoint that supports the [`llm\/v1\/embeddings`](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#embedding) task. \nThe downstream `\ud83d\udd0d Retriever`s and `\ud83d\udd17 Chain`s reference the `\ud83d\uddc3\ufe0f Data Processor`\u2019s configuration to access this embedding model. \nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** Support for multiple `\ud83d\uddc3\ufe0f Data Processor`s per RAG Application. In v2024-01-19, only one `\ud83d\uddc3\ufe0f Data Processor` can be created per RAG Application.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-data-processor.html"} +{"content":"# \n### Creating a `\ud83d\uddc3\ufe0f Data Processor` version\n#### Data flows\n\n![legend](https:\/\/docs.databricks.com\/_images\/data-flow-processor.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-data-processor.html"} +{"content":"# \n### Creating a `\ud83d\uddc3\ufe0f Data Processor` version\n#### Step-by-step instructions\n\n1. Open the `rag-config.yml` in your IDE\/code editor.\n2. Edit the `data_processors` configuration. \n```\ndata_processors:\n- name: spark-docs-processor\ndescription: Parse, chunk, embed Spark documentation\n# explicit link to the data ingestors that this processor uses.\ndata_ingestors:\n- name: spark-docs-ingestor\n# Optional. 
The Unity Catalog table where the embedded, chunked docs are stored.\n# If not specified, will default to `{name}__embedded_docs__{version_number}`\n# If specified, will default to `{provided_value}__{version_number}`\ndestination_table:\nname: databricks_docs_chunked\ndestination_vector_index:\ndatabricks_vector_search:\n# Optional. The Unity Catalog table where the embedded, chunked docs are stored.\n# If not specified, will default to `{name}__embedded_docs_index__{version_number}`\n# If specified, will default to `{provided_value}__{version_number}`\nindex_name: databricks_docs_index\nembedding_model:\nendpoint_name: databricks-bge-large-en\ninstructions:\nembedding: \"\"\nquery: \"Represent this sentence for searching relevant passages:\"\n# You can specify arbitrary key-value pairs as `configurations`\nconfigurations:\nchunk_size: 500\nchunk_overlap: 50\n\n```\n3. Edit the `src\/my_rag_builder\/document_processor.py` to modify the default code or add custom code. \nNote \nYou can modify this file in any way you see fit, as long as after the code finishes running, `destination_table.name` contains the following columns: \n* `chunk_id` - A unique identifier of the chunk, typically a UUID.\n* doc\\_uri - A unique identifier of the source document, for example a URL.\nand this data is synchronized to `databricks_vector_search.index_name`.\n4. 
You can run the `document_processor.py` file in a Databricks Notebook or using [Databricks Connect](https:\/\/docs.databricks.com\/dev-tools\/databricks-connect\/index.html) to test the processor.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-data-processor.html"} +{"content":"# Get started: Account and workspace setup\n## Best practice articles\n","doc_uri":"https:\/\/docs.databricks.com\/cheat-sheet\/compute.html"} +{"content":"# Get started: Account and workspace setup\n## Best practice articles\n#### Compute creation cheat sheet\n\nThis article aims to provide clear and opinionated guidance for compute creation. By using the right compute types for your workflow, you can improve performance and save on costs. \n| Best Practice | Impact | Docs |\n| --- | --- | --- |\n| If you are new to Databricks, start by using general all-purpose instance types | Selecting the appropriate instance type for the workload results in higher efficiency. | * [Create a cluster](https:\/\/docs.databricks.com\/compute\/configure.html) |\n| Use shared access mode unless your required functionality isn\u2019t supported | Compute with shared access mode can be used by multiple users with data isolation among users. | * [Access modes](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode) |\n| Use Graviton instance types if they are available | Instance types with Graviton processors have the best price-to-performance ratio of any instance type, according to AWS. | * [AWS Graviton instance types](https:\/\/docs.databricks.com\/compute\/configure.html#graviton) |\n| Use the latest generation instance types if there is enough availability | The latest generation of instance types provides the best performance and latest features. 
| * [Amazon EC2 Instance Types](https:\/\/aws.amazon.com\/ec2\/instance-types\/) |\n| Set your on-demand and spot-instance balance based on how quickly you need your workload to run | Spot instances save on cost but can affect the overall run time of an operation if the spot instances are reclaimed. | * [Spot instances](https:\/\/docs.databricks.com\/compute\/configure.html#aws-spot) |\n| Choose the size of your nodes and the number of workers based on the types of operations your workload performs | For example, if you expect a lot of shuffles, it can be more efficient to use a large single node instead of multiple smaller nodes. | * [Compute sizing considerations](https:\/\/docs.databricks.com\/compute\/cluster-config-best-practices.html#cluster-sizing) |\n| Run vacuum on a cluster with auto-scaling set for 1-4 workers, where each worker has 8 cores. Select a driver with between 8 and 32 cores. Increase the size of the driver if you get out-of-memory (OOM) errors. | Vacuum statements happen in two phases, the second of which is driver-heavy. If you don\u2019t use the right-sized cluster, the operation could cause a slowdown and might not succeed. | * [What size cluster does vacuum need?](https:\/\/docs.databricks.com\/delta\/vacuum.html#cluster-size) * [VACUUM best practices](https:\/\/kb.databricks.com\/en_US\/delta\/vacuum-best-practices-on-delta-lake) |\n| Assess whether your batch workflow would benefit from Photon | Photon provides faster queries and reduces your total cost per workload. | * [Photon advantages](https:\/\/docs.databricks.com\/compute\/photon.html) |\n\n","doc_uri":"https:\/\/docs.databricks.com\/cheat-sheet\/compute.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Load data with Delta Live Tables\n\nYou can load data from any data source supported by Apache Spark on Databricks using Delta Live Tables. 
You can define datasets (tables and views) in Delta Live Tables against any query that returns a Spark DataFrame, including streaming DataFrames and Pandas for Spark DataFrames. For data ingestion tasks, Databricks recommends using streaming tables for most use cases. Streaming tables are good for ingesting data from cloud object storage using Auto Loader or from message buses like Kafka. The examples below demonstrate some common patterns. \nImportant \nNot all data sources have SQL support. You can mix SQL and Python notebooks in a Delta Live Tables pipeline to use SQL for all operations beyond ingestion. \nFor details on working with libraries not packaged in Delta Live Tables by default, see [Manage Python dependencies for Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/external-dependencies.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/load.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Load data with Delta Live Tables\n###### Load files from cloud object storage\n\nDatabricks recommends using Auto Loader with Delta Live Tables for most data ingestion tasks from cloud object storage. Auto Loader and Delta Live Tables are designed to incrementally and idempotently load ever-growing data as it arrives in cloud storage. The following examples use Auto Loader to create datasets from CSV and JSON files: \nNote \nTo load files with Auto Loader in a Unity Catalog enabled pipeline, you must use [external locations](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html). To learn more about using Unity Catalog with Delta Live Tables, see [Use Unity Catalog with your Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html). 
\n```\nimport dlt\n\n@dlt.table\ndef customers():\n    return (\n        spark.readStream.format(\"cloudFiles\")\n        .option(\"cloudFiles.format\", \"csv\")\n        .load(\"\/databricks-datasets\/retail-org\/customers\/\")\n    )\n\n@dlt.table\ndef sales_orders_raw():\n    return (\n        spark.readStream.format(\"cloudFiles\")\n        .option(\"cloudFiles.format\", \"json\")\n        .load(\"\/databricks-datasets\/retail-org\/sales_orders\/\")\n    )\n\n``` \n```\nCREATE OR REFRESH STREAMING TABLE customers\nAS SELECT * FROM cloud_files(\"\/databricks-datasets\/retail-org\/customers\/\", \"csv\")\n\nCREATE OR REFRESH STREAMING TABLE sales_orders_raw\nAS SELECT * FROM cloud_files(\"\/databricks-datasets\/retail-org\/sales_orders\/\", \"json\")\n\n``` \nSee [What is Auto Loader?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html) and [Auto Loader SQL syntax](https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html#auto-loader-sql). \nWarning \nIf you use Auto Loader with file notifications and run a full refresh for your pipeline or streaming table, you must manually clean up your resources. You can use the [CloudFilesResourceManager](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html#cloud-resource-management) in a notebook to perform cleanup.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/load.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Load data with Delta Live Tables\n###### Load data from a message bus\n\nYou can configure Delta Live Tables pipelines to ingest data from message buses with streaming tables. Databricks recommends combining streaming tables with continuous execution and enhanced autoscaling to provide the most efficient ingestion for low-latency loading from message buses. See [Optimize the cluster utilization of Delta Live Tables pipelines with Enhanced Autoscaling](https:\/\/docs.databricks.com\/delta-live-tables\/auto-scaling.html). 
\nFor example, the following code configures a streaming table to ingest data from Kafka: \n```\nimport dlt\n\n@dlt.table\ndef kafka_raw():\n    return (\n        spark.readStream\n        .format(\"kafka\")\n        .option(\"kafka.bootstrap.servers\", \"<server:ip>\")\n        .option(\"subscribe\", \"topic1\")\n        .option(\"startingOffsets\", \"latest\")\n        .load()\n    )\n\n``` \nYou can write downstream operations in pure SQL to perform streaming transformations on this data, as in the following example: \n```\nCREATE OR REFRESH STREAMING TABLE streaming_silver_table\nAS SELECT\n*\nFROM\nSTREAM(LIVE.kafka_raw)\nWHERE ...\n\n``` \nFor an example of working with Event Hubs, see [Use Azure Event Hubs as a Delta Live Tables data source](https:\/\/docs.databricks.com\/delta-live-tables\/event-hubs.html). \nSee [Configure streaming data sources](https:\/\/docs.databricks.com\/connect\/streaming\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/load.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Load data with Delta Live Tables\n###### Load data from external systems\n\nDelta Live Tables supports loading data from any data source supported by Databricks. See [Connect to data sources](https:\/\/docs.databricks.com\/connect\/index.html). You can also load external data using Lakehouse Federation for [supported data sources](https:\/\/docs.databricks.com\/query-federation\/index.html#connection-types). Because Lakehouse Federation requires Databricks Runtime 13.3 LTS or above, to use Lakehouse Federation your pipeline must be configured to use the [preview channel](https:\/\/docs.databricks.com\/delta-live-tables\/properties.html#config-settings). \nSome data sources do not have equivalent support in SQL. If you cannot use Lakehouse Federation with one of these data sources, you can use a standalone Python notebook to ingest data from the source. 
This notebook can then be added as a source library with SQL notebooks to build a Delta Live Tables pipeline. The following example declares a materialized view to access the current state of data in a remote PostgreSQL table: \n```\nimport dlt\n\n@dlt.table\ndef postgres_raw():\n    return (\n        spark.read\n        .format(\"postgresql\")\n        .option(\"dbtable\", table_name)\n        .option(\"host\", database_host_url)\n        .option(\"port\", 5432)\n        .option(\"database\", database_name)\n        .option(\"user\", username)\n        .option(\"password\", password)\n        .load()\n    )\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/load.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Load data with Delta Live Tables\n###### Load small or static datasets from cloud object storage\n\nYou can load small or static datasets using Apache Spark load syntax. Delta Live Tables supports all of the file formats supported by Apache Spark on Databricks. For a full list, see [Data format options](https:\/\/docs.databricks.com\/query\/formats\/index.html). \nThe following examples demonstrate loading JSON to create Delta Live Tables tables: \n```\n@dlt.table\ndef clickstream_raw():\n    return (spark.read.format(\"json\").load(\"\/databricks-datasets\/wikipedia-datasets\/data-001\/clickstream\/raw-uncompressed-json\/2015_2_clickstream.json\"))\n\n``` \n```\nCREATE OR REFRESH LIVE TABLE clickstream_raw\nAS SELECT * FROM json.`\/databricks-datasets\/wikipedia-datasets\/data-001\/clickstream\/raw-uncompressed-json\/2015_2_clickstream.json`;\n\n``` \nNote \nThe `SELECT * FROM format.`path`;` SQL construct is common to all SQL environments on Databricks. 
It is the recommended pattern for direct file access using SQL with Delta Live Tables.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/load.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Load data with Delta Live Tables\n###### Securely access storage credentials with secrets in a pipeline\n\nYou can use Databricks [secrets](https:\/\/docs.databricks.com\/security\/secrets\/index.html) to store credentials such as access keys or passwords. To configure the secret in your pipeline, use a Spark property in the pipeline settings cluster configuration. See [Configure your compute settings](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#cluster-config). \nThe following example uses a secret to store an access key required to read input data from an Azure Data Lake Storage Gen2 (ADLS Gen2) storage account using [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html). You can use this same method to configure any secret required by your pipeline, for example, AWS keys to access S3, or the password to an Apache Hive metastore. \nTo learn more about working with Azure Data Lake Storage Gen2, see [Connect to Azure Data Lake Storage Gen2 and Blob Storage](https:\/\/docs.databricks.com\/connect\/storage\/azure-storage.html). \nNote \nYou must add the `spark.hadoop.` prefix to the `spark_conf` configuration key that sets the secret value. 
\n```\n{\n\"id\": \"43246596-a63f-11ec-b909-0242ac120002\",\n\"clusters\": [\n{\n\"spark_conf\": {\n\"spark.hadoop.fs.azure.account.key.<storage-account-name>.dfs.core.windows.net\": \"{{secrets\/<scope-name>\/<secret-name>}}\"\n},\n\"autoscale\": {\n\"min_workers\": 1,\n\"max_workers\": 5,\n\"mode\": \"ENHANCED\"\n}\n}\n],\n\"development\": true,\n\"continuous\": false,\n\"libraries\": [\n{\n\"notebook\": {\n\"path\": \"\/Users\/user@databricks.com\/DLT Notebooks\/Delta Live Tables quickstart\"\n}\n}\n],\n\"name\": \"DLT quickstart using ADLS2\"\n}\n\n``` \nReplace \n* `<storage-account-name>` with the ADLS Gen2 storage account name.\n* `<scope-name>` with the Databricks secret scope name.\n* `<secret-name>` with the name of the key containing the Azure storage account access key. \n```\nimport dlt\n\njson_path = \"abfss:\/\/<container-name>@<storage-account-name>.dfs.core.windows.net\/<path-to-input-dataset>\"\n@dlt.create_table(\ncomment=\"Data ingested from an ADLS2 storage account.\"\n)\ndef read_from_ADLS2():\nreturn (\nspark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"json\")\n.load(json_path)\n)\n\n``` \nReplace \n* `<container-name>` with the name of the Azure storage account container that stores the input data.\n* `<storage-account-name>` with the ADLS Gen2 storage account name.\n* `<path-to-input-dataset>` with the path to the input dataset.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/load.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Load data with Delta Live Tables\n###### Load data from Azure Event Hubs\n\nAzure Event Hubs is a data streaming service that provides an Apache Kafka compatible interface. You can use the Structured Streaming Kafka connector, included in the Delta Live Tables runtime, to load messages from Azure Event Hubs. 
To learn more about loading and processing messages from Azure Event Hubs, see [Use Azure Event Hubs as a Delta Live Tables data source](https:\/\/docs.databricks.com\/delta-live-tables\/event-hubs.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/load.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/testing.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Testing the Databricks ODBC Driver\n\nThis article describes how to test code that uses the [Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/index.html). \nTo test code that uses the Databricks ODBC Driver along with DSNs or DSN-less connection strings, you can use popular test frameworks for programming languages that support ODBC. For instance, the following example Python code uses [pyodbc](https:\/\/docs.databricks.com\/dev-tools\/pyodbc.html), [pytest](https:\/\/docs.pytest.org\/), and [unittest.mock](https:\/\/docs.python.org\/3\/library\/unittest.mock.html) to automate and test the Databricks ODBC Driver using a DSN. This example code is based on the example code in [Connect Python and pyodbc to Databricks](https:\/\/docs.databricks.com\/dev-tools\/pyodbc.html). 
\nThe following example code file named `helpers.py` contains several functions that automate the Databricks ODBC Driver using a DSN: \n* The `connect_to_dsn` function uses a DSN to open a connection through a Databricks compute resource.\n* The `get_cursor_from_connection` function uses the connection to obtain a *cursor*, which enables fetch operations on the data through the compute resource.\n* The `select_from_nyctaxi_trips` function uses the cursor to select the specified number of data rows from the `trips` table in the `samples` catalog\u2019s `nyctaxi` schema.\n* The `print_rows` function prints the data rows\u2019 content to the screen. \n```\n# helpers.py\n\nfrom pyodbc import connect, Connection, Cursor\n\ndef connect_to_dsn(\n    connstring: str,\n    autocommit: bool\n) -> Connection:\n\n    connection = connect(\n        connstring,\n        autocommit = autocommit\n    )\n\n    return connection\n\ndef get_cursor_from_connection(\n    connection: Connection\n) -> Cursor:\n\n    cursor = connection.cursor()\n    return cursor\n\ndef select_from_nyctaxi_trips(\n    cursor: Cursor,\n    num_rows: int\n) -> Cursor:\n\n    select_cursor = cursor.execute(f\"SELECT * FROM samples.nyctaxi.trips LIMIT {num_rows}\")\n    return select_cursor\n\ndef print_rows(cursor: Cursor):\n    for row in cursor.fetchall():\n        print(row)\n\n``` \nThe following example code file named `main.py` calls the functions in the `helpers.py` file: \n```\n# main.py\n\nfrom helpers import *\n\nconnection = connect_to_dsn(\n    connstring = \"DSN=<your-dsn-name>\",\n    autocommit = True\n)\n\ncursor = get_cursor_from_connection(\n    connection = connection)\n\nselect_cursor = select_from_nyctaxi_trips(\n    cursor = cursor,\n    num_rows = 2\n)\n\nprint_rows(\n    cursor = select_cursor\n)\n\n``` \nThe following example code file named `test_helpers.py` uses `pytest` to test the functions in the `helpers.py` file. 
Instead of using the time and cost of actual compute resources to call the functions in the `helpers.py` file, the following example code uses `unittest.mock` to simulate these calls. These simulated calls are typically completed in just a few seconds, increasing your confidence in the quality of your code while not changing the state of your existing Databricks accounts or workspaces. \n```\n# test_helpers.py\n\nfrom pyodbc import SQL_DBMS_NAME\nfrom helpers import *\nfrom unittest.mock import patch\nimport datetime\n\n@patch(\"helpers.connect_to_dsn\")\ndef test_connect_to_dsn(mock_connection):\nmock_connection.return_value.getinfo.return_value = \"Spark SQL\"\n\nmock_connection = connect_to_dsn(\nconnstring = \"DSN=<your-dsn-name>\",\nautocommit = True\n)\n\nassert mock_connection.getinfo(SQL_DBMS_NAME) == \"Spark SQL\"\n\n@patch('helpers.get_cursor_from_connection')\ndef test_get_cursor_from_connection(mock_connection):\nmock_cursor = mock_connection.return_value.cursor\nmock_cursor.return_value.rowcount = -1\n\nmock_connection = connect_to_dsn(\nconnstring = \"DSN=<your-dsn-name>\",\nautocommit = True\n)\n\nmock_cursor = get_cursor_from_connection(\nconnection = mock_connection\n)\n\nassert mock_cursor.rowcount == -1\n\n@patch('helpers.select_from_nyctaxi_trips')\ndef test_select_from_nyctaxi_trips(mock_connection):\nmock_cursor = mock_connection.return_value.cursor\nmock_get_cursor = mock_cursor.return_value.execute\nmock_select_cursor = mock_get_cursor.return_value.arraysize = 1\n\nmock_connection = connect_to_dsn(\nconnstring = \"DSN=<your-dsn-name>\",\nautocommit = True\n)\n\nmock_get_cursor = get_cursor_from_connection(\nconnection = mock_connection\n)\n\nmock_select_cursor = select_from_nyctaxi_trips(\ncursor = mock_get_cursor,\nnum_rows = 2\n)\n\nassert mock_select_cursor.arraysize == 1\n\n@patch('helpers.print_rows')\ndef test_print_rows(mock_connection, capsys):\nmock_cursor = mock_connection.return_value.cursor\nmock_get_cursor = 
mock_cursor.return_value.execute\nmock_select_cursor = mock_get_cursor.return_value.fetchall.return_value = [\n(datetime.datetime(2016, 2, 14, 16, 52, 13), datetime.datetime(2016, 2, 14, 17, 16, 4), 4.94, 19.0, 10282, 10171),\n(datetime.datetime(2016, 2, 4, 18, 44, 19), datetime.datetime(2016, 2, 4, 18, 46), 0.28, 3.5, 10110, 10110)\n]\n\nmock_connection = connect_to_dsn(\nconnstring = \"DSN=<your-dsn-name>\",\nautocommit = True\n)\n\nmock_get_cursor = get_cursor_from_connection(\nconnection = mock_connection\n)\n\nmock_select_cursor = select_from_nyctaxi_trips(\ncursor = mock_get_cursor,\nnum_rows = 2\n)\n\nprint_rows(\ncursor = mock_select_cursor\n)\n\ncaptured = capsys.readouterr()\nassert captured.out == \"(datetime.datetime(2016, 2, 14, 16, 52, 13), datetime.datetime(2016, 2, 14, 17, 16, 4), 4.94, 19.0, 10282, 10171)\\n\" \\\n\"(datetime.datetime(2016, 2, 4, 18, 44, 19), datetime.datetime(2016, 2, 4, 18, 46), 0.28, 3.5, 10110, 10110)\\n\"\n\n``` \nBecause the `select_from_nyctaxi_trips` function contains a `SELECT` statement and therefore does not change the state of the `trips` table, mocking is not absolutely required in this example. However, mocking enables you to quickly run your tests without waiting for an actual connection to be made with the compute resource. Also, mocking enables you to run simulated tests multiple times for functions that might change a table\u2019s state, such as `INSERT INTO`, `UPDATE`, and `DELETE FROM`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/testing.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Cost-based optimizer\n\nSpark SQL can use a cost-based optimizer (CBO) to improve query plans. 
This is especially useful for queries with multiple joins.\nFor this to work it is critical to collect table and column statistics and keep them up to date.\n\n#### Cost-based optimizer\n##### Collect statistics\n\nTo get the full benefit of the CBO it is important to collect both *column statistics* and *table statistics*.\nStatistics can be collected using the ANALYZE TABLE command. \nTip \nTo keep the statistics up-to-date, run `ANALYZE TABLE` after writing to the table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/cbo.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Cost-based optimizer\n##### Verify query plans\n\nThere are several ways to verify the query plan. \n### `EXPLAIN` command \nTo check if the plan uses statistics, use the SQL commands \n* Databricks Runtime 7.x and above: [EXPLAIN](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-qry-explain.html) \nIf statistics are missing then the query plan might not be optimal. 
\n```\n== Optimized Logical Plan ==\nAggregate [s_store_sk], [s_store_sk, count(1) AS count(1)L], Statistics(sizeInBytes=20.0 B, rowCount=1, hints=none)\n+- Project [s_store_sk], Statistics(sizeInBytes=18.5 MB, rowCount=1.62E+6, hints=none)\n+- Join Inner, (d_date_sk = ss_sold_date_sk), Statistics(sizeInBytes=30.8 MB, rowCount=1.62E+6, hints=none)\n:- Project [ss_sold_date_sk, s_store_sk], Statistics(sizeInBytes=39.1 GB, rowCount=2.63E+9, hints=none)\n: +- Join Inner, (s_store_sk = ss_store_sk), Statistics(sizeInBytes=48.9 GB, rowCount=2.63E+9, hints=none)\n: :- Project [ss_store_sk, ss_sold_date_sk], Statistics(sizeInBytes=39.1 GB, rowCount=2.63E+9, hints=none)\n: : +- Filter (isnotnull(ss_store_sk) && isnotnull(ss_sold_date_sk)), Statistics(sizeInBytes=39.1 GB, rowCount=2.63E+9, hints=none)\n: : +- Relation[ss_store_sk,ss_sold_date_sk] parquet, Statistics(sizeInBytes=134.6 GB, rowCount=2.88E+9, hints=none)\n: +- Project [s_store_sk], Statistics(sizeInBytes=11.7 KB, rowCount=1.00E+3, hints=none)\n: +- Filter isnotnull(s_store_sk), Statistics(sizeInBytes=11.7 KB, rowCount=1.00E+3, hints=none)\n: +- Relation[s_store_sk] parquet, Statistics(sizeInBytes=88.0 KB, rowCount=1.00E+3, hints=none)\n+- Project [d_date_sk], Statistics(sizeInBytes=12.0 B, rowCount=1, hints=none)\n+- Filter ((((isnotnull(d_year) && isnotnull(d_date)) && (d_year = 2000)) && (d_date = 2000-12-31)) && isnotnull(d_date_sk)), Statistics(sizeInBytes=38.0 B, rowCount=1, hints=none)\n+- Relation[d_date_sk,d_date,d_year] parquet, Statistics(sizeInBytes=1786.7 KB, rowCount=7.30E+4, hints=none)\n\n``` \nImportant \nThe `rowCount` statistic is especially important for queries with multiple joins. If `rowCount` is missing, it means there is not enough information to calculate it (that is, some required columns do not have statistics). \n### Spark SQL UI \nUse the Spark SQL UI page to see the executed plan and accuracy of the statistics. 
\n![Missing estimate](https:\/\/docs.databricks.com\/_images\/docs-cbo-nostats.png)\nMissing estimate \nA line such as `rows output: 2,451,005 est: N\/A` means that this operator produces approximately 2M rows and there were no statistics available. \n![Good estimate](https:\/\/docs.databricks.com\/_images\/docs-cbo-goodstats.png)\nGood estimate \nA line such as `rows output: 2,451,005 est: 1616404 (1X)` means that this operator produces approx. 2M rows, while the estimate was approx. 1.6M and the estimation error factor was 1. \n![Bad estimate](https:\/\/docs.databricks.com\/_images\/docs-cbo-badstats.png)\nBad estimate \nA line such as `rows output: 2,451,005 est: 2626656323` means that this operator produces approximately 2M rows while the estimate was 2B rows, so the estimation error factor was 1000.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/cbo.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Cost-based optimizer\n##### Disable the Cost-Based Optimizer\n\nThe CBO is enabled by default. You disable the CBO by changing the `spark.sql.cbo.enabled` flag. \n```\nspark.conf.set(\"spark.sql.cbo.enabled\", false)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/cbo.html"} +{"content":"# Security and compliance guide\n## Networking\n### Users to Databricks networking\n#### Manage IP access lists\n###### Configure IP access lists for the account console\n\nThis article describes how to configure IP access lists for the Databricks account console UI. You can also use the [Account IP Access Lists API](https:\/\/docs.databricks.com\/api\/account\/AccountsIpAccessLists). 
IP access lists for the account console do *not* affect [IP access lists for workspaces](https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-workspace.html).\n\n###### Configure IP access lists for the account console\n####### Requirements\n\n* This feature requires the [Enterprise pricing tier](https:\/\/www.databricks.com\/product\/pricing\/platform-addons). \n* IP access lists support only Internet Protocol version 4 (IPv4) addresses.\n\n###### Configure IP access lists for the account console\n####### Enable IP access lists\n\nAccount admins can enable and disable IP access lists for the account console. When IP access lists are enabled, users can only access the account console through IPs on the allow list. When IP access lists are disabled, all existing allow lists or block lists are ignored and all IP addresses can access the account console. By default, new IP access lists for the account console take effect within a few minutes. \n1. As an account admin, go to the account console.\n2. In the sidebar, click **Settings**.\n3. On the **Security** tab, click **IP Access List**. \n![IP access lists for account console main settings](https:\/\/docs.databricks.com\/_images\/ip-access-lists-ui.png)\n4. Set the **Enabled\/disabled** toggle to **Enabled**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-account.html"} +{"content":"# Security and compliance guide\n## Networking\n### Users to Databricks networking\n#### Manage IP access lists\n###### Configure IP access lists for the account console\n####### Add an IP access list\n\n1. As an account admin, go to the account console.\n2. In the sidebar, click **Settings**.\n3. In the **Security** tab, click **IP Access List**.\n4. Click **Add rule**. \n![IP access lists add rule](https:\/\/docs.databricks.com\/_images\/add-ip-access-rule.png)\n5. Choose whether to make an **ALLOW** or **BLOCK** list.\n6. In the label field, add a human-readable label.\n7. 
Add one or more IP addresses or CIDR IP ranges, with commas separating them.\n8. Click **Add rule**.\n\n###### Configure IP access lists for the account console\n####### Delete an IP access list\n\n1. As an account admin, go to the account console.\n2. In the sidebar, click **Settings**.\n3. In the **Security** tab, click **IP Access List**.\n4. On the row for the rule, click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) on the right, and select **Delete**.\n5. Confirm deletion in the confirmation popup that appears.\n\n###### Configure IP access lists for the account console\n####### Disable an IP access list\n\n1. As an account admin, go to the account console.\n2. In the sidebar, click **Settings**.\n3. In the **Security** tab, click **IP Access List**.\n4. In the **Status** column, click the **Enabled** button to toggle between enabled and disabled.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-account.html"} +{"content":"# Security and compliance guide\n## Networking\n### Users to Databricks networking\n#### Manage IP access lists\n###### Configure IP access lists for the account console\n####### Update an IP access list\n\n1. As an account admin, go to the account console.\n2. In the sidebar, click **Settings**.\n3. In the **Security** tab, click **IP Access List**.\n4. On the row for the rule, click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) on the right, and select **Update**.\n5. Update any fields.\n6. 
Click **Update rule**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-account.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Load and transform data with Delta Live Tables\n\nThe articles in this section provide common patterns, recommendations, and examples of data ingestion and transformation in Delta Live Tables pipelines. When ingesting source data to create the initial datasets in a pipeline, these initial datasets are commonly called *bronze* tables and often perform simple transformations. By contrast, the final tables in a pipeline, commonly referred to as *gold* tables, often require complicated aggregations or reading from sources that are the targets of an `APPLY CHANGES INTO` operation.\n\n#### Load and transform data with Delta Live Tables\n##### Load data\n\nYou can load data from any data source supported by Apache Spark on Databricks using Delta Live Tables. For examples of patterns for loading data from different sources, including cloud object storage, message buses like Kafka, and external systems like PostgreSQL, see [Load data with Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/load.html). These examples feature recommendations like using streaming tables with Auto Loader in Delta Live Tables for an optimized ingestion experience.\n\n#### Load and transform data with Delta Live Tables\n##### Data flows\n\nIn Delta Live Tables, a *flow* is a streaming query that processes source data incrementally to update a target streaming table. Many streaming queries needed to implement a Delta Live Tables pipeline create an implicit flow as part of the query definition. Delta Live Tables also supports explicitly declaring flows when more specialized processing is required. 
To learn more about Delta Live Tables flows and see examples of using flows to implement data processing tasks, see [Load and process data incrementally with Delta Live Tables flows](https:\/\/docs.databricks.com\/delta-live-tables\/flows.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/load-and-transform.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Load and transform data with Delta Live Tables\n##### Change data capture (CDC)\n\nDelta Live Tables simplifies change data capture (CDC) with the `APPLY CHANGES` API. By automatically handling out-of-sequence records, the `APPLY CHANGES` API in Delta Live Tables ensures correct processing of CDC records and removes the need to develop complex logic for handling out-of-sequence records. See [APPLY CHANGES API: Simplify change data capture in Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html).\n\n#### Load and transform data with Delta Live Tables\n##### Transform data\n\nWith Delta Live Tables, you can declare transformations on datasets and specify how records are processed through query logic. For examples of common transformation patterns when building out Delta Live Tables pipelines, including usage of streaming tables, materialized views, stream-static joins, and MLflow models in pipelines, see [Transform data with Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/transform.html).\n\n#### Load and transform data with Delta Live Tables\n##### Optimize stateful processing in Delta Live Tables with watermarks\n\nTo effectively manage data kept in state, you can use watermarks when performing stateful stream processing in Delta Live Tables, including aggregations, joins, and deduplication. In stream processing, a watermark is an Apache Spark feature that can define a time-based threshold for processing data when performing stateful operations. 
Data arriving is processed until the threshold is reached, at which point the time window defined by the threshold is closed. Watermarks can be used to avoid problems during query processing, mainly when processing larger datasets or long-running processing. \nFor examples and recommendations, see [Optimize stateful processing in Delta Live Tables with watermarks](https:\/\/docs.databricks.com\/delta-live-tables\/stateful-processing.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/load-and-transform.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Manage configuration of Delta Live Tables pipelines\n\nBecause Delta Live Tables automates operational complexities such as infrastructure management, task orchestration, error recovery, and performance optimization, many of your pipelines can run with minimal manual configuration. However, Delta Live Tables also allows you to manage configuration for pipelines requiring non-default configurations or to optimize performance and resource usage. These articles provide details on managing configurations for your Delta Live Tables pipelines, including settings that determine how pipelines are run, options for the compute that runs a pipeline, and management of external dependencies such as Python libraries.\n\n#### Manage configuration of Delta Live Tables pipelines\n##### Manage pipeline settings\n\nThe configuration for a Delta Live Tables pipeline includes settings that define the source code implementing the pipeline. It also includes settings that control pipeline infrastructure, dependency management, how updates are processed, and how tables are saved in the workspace. Most configurations are optional, but some require careful attention. \nTo learn about the configuration options for pipelines and how to use them, see [Configure pipeline settings for Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html). 
\nFor detailed specifications of Delta Live Tables settings, properties that control how tables are managed, and non-settable compute options, see [Delta Live Tables properties reference](https:\/\/docs.databricks.com\/delta-live-tables\/properties.html).\n\n#### Manage configuration of Delta Live Tables pipelines\n##### Manage external dependencies for pipelines that use Python\n\nDelta Live Tables supports using external dependencies in your pipelines such as Python packages and libraries. To learn about options and recommendations for using dependencies, see [Manage Python dependencies for Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/external-dependencies.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/manage-pipeline-configurations.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Manage configuration of Delta Live Tables pipelines\n##### Use Python modules stored in your Databricks workspace\n\nIn addition to implementing your Python code in Databricks notebooks, you can use Databricks Git Folders or workspace files to store your code as Python modules. Storing your code as Python modules is especially useful when you have common functionality that you want to use in multiple pipelines or multiple notebooks in the same pipeline. To learn how to use Python modules with your pipelines, see [Import Python modules from Git folders or workspace files](https:\/\/docs.databricks.com\/delta-live-tables\/import-workspace-files.html).\n\n#### Manage configuration of Delta Live Tables pipelines\n##### Optimize pipeline compute utilization\n\nUse Enhanced Autoscaling to optimize the cluster utilization of your pipelines. Enhanced Autoscaling adds additional resources only if the system determines those resources will increase pipeline processing speed. Resources are freed when no longer needed, and clusters are shut down as soon as all pipeline updates are complete. 
\nTo learn more about Enhanced Autoscaling, including configuration details, see [Optimize the cluster utilization of Delta Live Tables pipelines with Enhanced Autoscaling](https:\/\/docs.databricks.com\/delta-live-tables\/auto-scaling.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/manage-pipeline-configurations.html"} +{"content":"# \n### RAG Studio user personas\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html) \nThere are 3 personas who interact with a RAG Application and RAG Studio. \n* **`\ud83e\udde0 Expert Users`:** Stakeholders within your organization who are subject matter experts in the app\u2019s domain. An expert user\u2019s primary responsibility is to give the `\ud83d\udc69\ud83d\udcbb RAG App Developer` feedback on the bot\u2019s responses. \n+ *For example, for an HR Bot, the expert users could be a group of HR professionals who were appointed to help build the bot.*\n* **`\ud83d\udc64 End Users`:** Any user who interacts with the bot once it is in production. 
\n+ *In the HR bot example, this would be any employee with an HR question who uses the bot.*\n* **`\ud83d\udc69\ud83d\udcbb RAG App Developer`:** You and your team!\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/concepts\/user-personas.html"} +{"content":"# Databricks data engineering\n### Work with files on Databricks\n\nDatabricks provides multiple utilities and APIs for interacting with files in the following locations: \n* Unity Catalog volumes\n* Workspace files\n* Cloud object storage\n* DBFS mounts and DBFS root\n* Ephemeral storage attached to the driver node of the cluster \nThis article provides examples for interacting with files in these locations for the following tools: \n* Apache Spark\n* Spark SQL and Databricks SQL\n* Databricks file system utilities (`dbutils.fs` or `%fs`)\n* Databricks CLI\n* Databricks REST API\n* Bash shell commands (`%sh`)\n* Notebook-scoped library installs using `%pip`\n* Pandas\n* OSS Python file management and processing utilities \nImportant \nFile operations that require FUSE access to data cannot directly access cloud object storage using URIs. Databricks recommends using Unity Catalog volumes to configure access to these locations for FUSE. \nScala does not support FUSE for Unity Catalog volumes or workspace files on compute configured with single user access mode or clusters without Unity Catalog. Scala supports FUSE for Unity Catalog volumes and workspace files on compute configured with Unity Catalog and shared access mode.\n\n### Work with files on Databricks\n#### Do I need to provide a URI scheme to access data?\n\nData access paths in Databricks follow one of the following standards: \n* **URI-style paths** include a URI scheme. For Databricks-native data access solutions, URI schemes are optional for most use cases. When you directly access data in cloud object storage, you must provide the correct URI scheme for the storage type. 
\n![URI paths diagram](https:\/\/docs.databricks.com\/_images\/uri-paths-aws.png)\n* **POSIX-style paths** provide data access relative to the driver root (`\/`). POSIX-style paths never require a scheme. You can use Unity Catalog volumes or DBFS mounts to provide POSIX-style access to data in cloud object storage. Many ML frameworks and other OSS Python modules require FUSE and can only use POSIX-style paths. \n![POSIX paths diagram](https:\/\/docs.databricks.com\/_images\/posix-paths.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/index.html"} +{"content":"# Databricks data engineering\n### Work with files on Databricks\n#### Work with files in Unity Catalog volumes\n\nDatabricks recommends using Unity Catalog volumes to configure access to non-tabular data files stored in cloud object storage. See [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html). \n| Tool | Example |\n| --- | --- |\n| Apache Spark | `spark.read.format(\"json\").load(\"\/Volumes\/my_catalog\/my_schema\/my_volume\/data.json\").show()` |\n| Spark SQL and Databricks SQL | `SELECT * FROM csv.`\/Volumes\/my_catalog\/my_schema\/my_volume\/data.csv`;` `LIST '\/Volumes\/my_catalog\/my_schema\/my_volume\/';` |\n| Databricks file system utilities | `dbutils.fs.ls(\"\/Volumes\/my_catalog\/my_schema\/my_volume\/\")` `%fs ls \/Volumes\/my_catalog\/my_schema\/my_volume\/` |\n| Databricks CLI | `databricks fs cp \/path\/to\/local\/file dbfs:\/Volumes\/my_catalog\/my_schema\/my_volume\/` |\n| Databricks REST API | `POST https:\/\/<databricks-instance>\/api\/2.1\/jobs\/create` `{\"name\": \"A multitask job\", \"tasks\": [{...\"libraries\": [{\"jar\": \"\/Volumes\/dev\/environment\/libraries\/logging\/Logging.jar\"}],},...]}` |\n| Bash shell commands | `%sh curl http:\/\/<address>\/text.zip -o \/Volumes\/my_catalog\/my_schema\/my_volume\/tmp\/text.zip` |\n| Library installs | `%pip install \/Volumes\/my_catalog\/my_schema\/my_volume\/my_library.whl` 
|\n| Pandas | `df = pd.read_csv('\/Volumes\/my_catalog\/my_schema\/my_volume\/data.csv')` |\n| OSS Python | `os.listdir('\/Volumes\/my_catalog\/my_schema\/my_volume\/path\/to\/directory')` | \nNote \nThe `dbfs:\/` scheme is required when working with the Databricks CLI. \n### Volumes limitations \nVolumes have the following limitations: \n* Direct-append or non-sequential (random) writes, such as writing Zip and Excel files, are not supported. For direct-append or random-write workloads, perform the operations on a local disk first and then copy the results to Unity Catalog volumes. For example: \n```\n# python\nimport xlsxwriter\nfrom shutil import copyfile\n\nworkbook = xlsxwriter.Workbook('\/local_disk0\/tmp\/excel.xlsx')\nworksheet = workbook.add_worksheet()\nworksheet.write(0, 0, \"Key\")\nworksheet.write(0, 1, \"Value\")\nworkbook.close()\n\ncopyfile('\/local_disk0\/tmp\/excel.xlsx', '\/Volumes\/my_catalog\/my_schema\/my_volume\/excel.xlsx')\n\n```\n* Sparse files are not supported. To copy sparse files, use `cp --sparse=never`: \n```\n$ cp sparse.file \/Volumes\/my_catalog\/my_schema\/my_volume\/sparse.file\nerror writing '\/dbfs\/sparse.file': Operation not supported\n$ cp --sparse=never sparse.file \/Volumes\/my_catalog\/my_schema\/my_volume\/sparse.file\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/index.html"} +{"content":"# Databricks data engineering\n### Work with files on Databricks\n#### Work with workspace files\n\nDatabricks [workspace files](https:\/\/docs.databricks.com\/files\/workspace.html) are the set of files in a workspace that are not notebooks. You can use workspace files to store and access data and other files saved alongside notebooks and other workspace assets. Because workspace files have size restrictions, Databricks recommends only storing small data files here primarily for development and testing. 
\n| Tool | Example |\n| --- | --- |\n| Apache Spark | `spark.read.format(\"json\").load(\"file:\/Workspace\/Users\/<user-folder>\/data.json\").show()` |\n| Spark SQL and Databricks SQL | `SELECT * FROM json.`file:\/Workspace\/Users\/<user-folder>\/file.json`;` |\n| Databricks file system utilities | `dbutils.fs.ls(\"file:\/Workspace\/Users\/<user-folder>\/\")` `%fs ls file:\/Workspace\/Users\/<user-folder>\/` |\n| Databricks CLI | `databricks workspace list` |\n| Databricks REST API | `POST https:\/\/<databricks-instance>\/api\/2.0\/workspace\/delete` `{\"path\": \"\/Workspace\/Shared\/code.py\", \"recursive\": \"false\"}` |\n| Bash shell commands | `%sh curl http:\/\/<address>\/text.zip -o \/Workspace\/Users\/<user-folder>\/text.zip` |\n| Library installs | `%pip install \/Workspace\/Users\/<user-folder>\/my_library.whl` |\n| Pandas | `df = pd.read_csv('\/Workspace\/Users\/<user-folder>\/data.csv')` |\n| OSS Python | `os.listdir('\/Workspace\/Users\/<user-folder>\/path\/to\/directory')` | \nNote \nThe `file:\/` scheme is required when working with Databricks Utilities, Apache Spark, or SQL. \n### Workspace files limitations \nWorkspace files have the following limitations: \n* Workspace file size is limited to 500MB. Operations that attempt to download or create files larger than this limit will fail. \n* If your workflow uses source code located in a [remote Git repository](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-repos.html), you cannot write to the current directory or write using a relative path. Write data to other location options.\n* You cannot use `git` commands when you save to workspace files. 
The creation of `.git` directories is not allowed in workspace files.\n* There is limited support for workspace file operations from **serverless compute**.\n* Executors cannot write to workspace files.\n* Symlinks are not supported.\n* Workspace files can\u2019t be accessed from [user-defined functions (UDFs)](https:\/\/docs.databricks.com\/udf\/index.html) on clusters with [shared access mode](https:\/\/docs.databricks.com\/compute\/configure.html#access-modes). \n### Where do deleted workspace files go? \nDeleting a workspace file sends it to the trash. You can either recover or permanently delete files from the trash using the UI. \nSee [Delete an object](https:\/\/docs.databricks.com\/workspace\/workspace-objects.html#delete-object).\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/index.html"} +{"content":"# Databricks data engineering\n### Work with files on Databricks\n#### Work with files in cloud object storage\n\nDatabricks recommends using Unity Catalog volumes to configure secure access to files in cloud object storage. If you choose to directly access data in cloud object storage using URIs, you must configure permissions. See [Manage external locations, external tables, and external volumes](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html#manage-external). 
\nThe following examples use URIs to access data in cloud object storage: \n| Tool | Example |\n| --- | --- |\n| Apache Spark | `spark.read.format(\"json\").load(\"s3:\/\/<bucket>\/path\/file.json\").show()` |\n| Spark SQL and Databricks SQL | `SELECT * FROM json.`s3:\/\/<bucket>\/path\/file.json`;` `LIST 's3:\/\/<bucket>\/path';` |\n| Databricks file system utilities | `dbutils.fs.ls(\"s3:\/\/<bucket>\/path\/\")` `%fs ls s3:\/\/<bucket>\/path\/` |\n| Databricks CLI | Not supported |\n| Databricks REST API | Not supported |\n| Bash shell commands | Not supported |\n| Library installs | `%pip install s3:\/\/bucket-name\/path\/to\/library.whl` |\n| Pandas | Not supported |\n| OSS Python | Not supported | \nNote \nCloud object storage does not support Amazon S3 mounts with client-side encryption enabled.\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/index.html"} +{"content":"# Databricks data engineering\n### Work with files on Databricks\n#### Work with files in DBFS mounts and DBFS root\n\nDBFS mounts are not securable using Unity Catalog and are no longer recommended by Databricks. Data stored in the DBFS root is accessible by all users in the workspace. Databricks recommends against storing any sensitive or production code or data in the DBFS root. See [What is DBFS?](https:\/\/docs.databricks.com\/dbfs\/index.html). 
\n| Tool | Example |\n| --- | --- |\n| Apache Spark | `spark.read.format(\"json\").load(\"\/mnt\/path\/to\/data.json\").show()` |\n| Spark SQL and Databricks SQL | `SELECT * FROM json.`\/mnt\/path\/to\/data.json`;` |\n| Databricks file system utilities | `dbutils.fs.ls(\"\/mnt\/path\")` `%fs ls \/mnt\/path` |\n| Databricks CLI | `databricks fs cp dbfs:\/mnt\/path\/to\/remote\/file \/path\/to\/local\/file` |\n| Databricks REST API | `POST https:\/\/<host>\/api\/2.0\/dbfs\/delete --data '{ \"path\": \"\/tmp\/HelloWorld.txt\" }'` |\n| Bash shell commands | `%sh curl http:\/\/<address>\/text.zip > \/dbfs\/mnt\/tmp\/text.zip` |\n| Library installs | `%pip install \/dbfs\/mnt\/path\/to\/my_library.whl` |\n| Pandas | `df = pd.read_csv('\/dbfs\/mnt\/path\/to\/data.csv')` |\n| OSS Python | `os.listdir('\/dbfs\/mnt\/path\/to\/directory')` | \nNote \nThe `dbfs:\/` scheme is required when working with the Databricks CLI.\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/index.html"} +{"content":"# Databricks data engineering\n### Work with files on Databricks\n#### Work with files in ephemeral storage attached to the driver node\n\nThe ephemeral storage attached to the driver node is block storage with native POSIX-based path access. Any data stored in this location disappears when a cluster terminates or restarts. \n| Tool | Example |\n| --- | --- |\n| Apache Spark | Not supported |\n| Spark SQL and Databricks SQL | Not supported |\n| Databricks file system utilities | `dbutils.fs.ls(\"file:\/path\")` `%fs ls file:\/path` |\n| Databricks CLI | Not supported |\n| Databricks REST API | Not supported |\n| Bash shell commands | `%sh curl http:\/\/<address>\/text.zip > \/tmp\/text.zip` |\n| Library installs | Not supported |\n| Pandas | `df = pd.read_csv('\/path\/to\/data.csv')` |\n| OSS Python | `os.listdir('\/path\/to\/directory')` | \nNote \nThe `file:\/` scheme is required when working with Databricks Utilities. 
\n### Move data from ephemeral storage to volumes \nYou might want to access data downloaded or saved to ephemeral storage using Apache Spark. Because ephemeral storage is attached to the driver and Spark is a distributed processing engine, not all operations can directly access data here. If you need to move data from the driver filesystem to Unity Catalog volumes, you can copy files using [magic commands](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#language-magic) or the [Databricks utilities](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html), as in the following examples: \n```\ndbutils.fs.cp (\"file:\/<path>\", \"\/Volumes\/<catalog>\/<schema>\/<volume>\/<path>\")\n\n``` \n```\n%sh cp \/<path> \/Volumes\/<catalog>\/<schema>\/<volume>\/<path>\n\n``` \n```\n%fs cp file:\/<path> \/Volumes\/<catalog>\/<schema>\/<volume>\/<path>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/index.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Spark driver overloaded\n\nSo you\u2019ve determined that your driver is overloaded. The most common reason for this is that there are too many concurrent things running on the cluster. This could be too many streams, queries, or Spark jobs (some customers use threads to run many Spark jobs concurrently). \nIt could also be that you\u2019re running non-Spark code on your cluster that is keeping the driver busy. If you see gaps in your timeline caused by running non-Spark code, this means your workers are all idle and likely wasting money during the gaps. Maybe this is intentional and unavoidable, but if you can write this code to use Spark you will fully utilize the cluster. Start with [this tutorial](https:\/\/docs.databricks.com\/getting-started\/quick-start.html) to learn how to work with Spark. 
\nIf you have too many things running on the cluster simultaneously, then you have three options: \n* Increase the size of your driver\n* Reduce the concurrency\n* Spread the load over multiple clusters \nDatabricks recommends you first try doubling the size of the driver and see how that impacts your job.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/spark-driver-overloaded.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/index.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n######## Databricks JDBC Driver\n\nDatabricks provides a [JDBC driver](https:\/\/www.databricks.com\/spark\/jdbc-drivers-download) that enables you to connect participating apps, tools, clients, SDKs, and APIs to Databricks through Java Database Connectivity (JDBC), an industry-standard specification for accessing database management systems. \nThis article and its related articles supplement the information in the [Databricks JDBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Databricks-JDBC-Driver-Install-and-Configuration-Guide.pdf), available online in PDF format and in your JDBC Driver download\u2019s `docs` directory. \nNote \nDatabricks also provides an ODBC Driver. See [Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/index.html). \nThe process for using the JDBC driver is as follows: \n1. Download and reference the JDBC driver, depending on your target operating system. See [Download and reference the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/download.html).\n2. 
Gather and store configuration settings for your target Databricks compute resource (a Databricks cluster or a Databricks SQL warehouse), your target Databricks authentication type, and any special or advanced driver capabilities, as a JDBC connection URL or as a programmatic collection of JDBC connection properties. Whether you use a connection URL or a collection of connection properties will depend on the requirements of your target app, tool, client, SDK, or API. See: \n* [Compute settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/compute.html)\n* [Authentication settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html)\n* [Driver capability settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html)\n3. To use your connection URL or collection of connection properties with your target app, tool, client, SDK, or API, see [Technology partners](https:\/\/docs.databricks.com\/integrations\/index.html) or your provider\u2019s documentation. \nFor more information, view the [Databricks JDBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Databricks-JDBC-Driver-Install-and-Configuration-Guide.pdf) in PDF format. 
This guide is also included as a PDF file named `Databricks JDBC Driver Install and Configuration Guide.pdf` in your JDBC driver download\u2019s `docs` directory.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/index.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n######## Databricks JDBC Driver\n######### Additional resources\n\n* [Databricks JDBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Databricks-JDBC-Driver-Install-and-Configuration-Guide.pdf) \n* [DataGrip integration with Databricks](https:\/\/docs.databricks.com\/dev-tools\/datagrip.html)\n* [DBeaver integration with Databricks](https:\/\/docs.databricks.com\/dev-tools\/dbeaver.html)\n* [Connect to SQL Workbench\/J](https:\/\/docs.databricks.com\/partners\/bi\/workbenchj.html) \n* [Connect to Infoworks](https:\/\/docs.databricks.com\/partners\/ingestion\/infoworks.html)\n* [Connect to Qlik Replicate](https:\/\/docs.databricks.com\/partners\/ingestion\/qlik.html) \n* [Connect to Stitch](https:\/\/docs.databricks.com\/partners\/ingestion\/stitch.html) \n* [Connect to StreamSets](https:\/\/docs.databricks.com\/partners\/ingestion\/streamsets.html)\n* [Connect to Syncsort](https:\/\/docs.databricks.com\/partners\/ingestion\/syncsort.html)\n* [Connect to MicroStrategy](https:\/\/docs.databricks.com\/partners\/bi\/microstrategy.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Third-party online stores\n\nThis article describes how to work with third-party online stores for real-time serving of feature values. You can also use Databricks online tables for real-time feature serving with much less setup required. See [Databricks Online Tables](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html). 
\nWith real-time serving, you publish feature tables to a low-latency database and deploy the model or feature spec to a REST endpoint. \nDatabricks Feature Store also supports automatic feature lookup. In this case, the input values provided by the client include values that are only available at the time of inference. The model incorporates logic to automatically fetch the feature values it needs from the provided input values. \nThe diagram illustrates the relationship between MLflow and Feature Store components for real-time serving. \n![Feature Store workflow with online lookup](https:\/\/docs.databricks.com\/_images\/fs-flow-online-lookup.png) \nDatabricks Feature Store supports these online stores: \n| Online store provider | Publish with Feature Engineering in Unity Catalog | Publish with Workspace Feature Store | Feature lookup in Legacy MLflow Model Serving | Feature lookup in Model Serving |\n| --- | --- | --- | --- | --- |\n| Amazon DynamoDB | X | X (Feature Store client v0.3.8 and above) | X | X |\n| Amazon Aurora (MySQL-compatible) | | X | X | |\n| Amazon RDS MySQL | | X | X | |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-feature-stores.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Third-party online stores\n##### Start using online stores\n\nSee the following articles to get started with online stores: \n* [Authentication for working with online stores](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/fs-authentication.html)\n* [Publish features to an online store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/publish-features.html)\n* [Automatic feature lookup with MLflow models on Databricks](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/automatic-feature-lookup.html) (includes example notebook)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-feature-stores.html"} 
+{"content":"# Share data and AI assets securely using Delta Sharing\n### Restrict Delta Sharing recipient access using IP access lists (open sharing)\n\nThis article describes how data providers can assign IP access lists to control recipient access to shared data. \nIf you, as a data provider, are using the [open Delta Sharing protocol](https:\/\/docs.databricks.com\/data-sharing\/index.html#open), you can limit a recipient to a restricted set of IP addresses when they access data that you share. This list is independent of [Workspace IP access lists](https:\/\/docs.databricks.com\/api\/workspace\/ipaccesslists). Only allow lists are supported. \nThe IP access list affects the following: \n* Delta Sharing OSS Protocol REST API access\n* Delta Sharing activation URL access\n* Delta Sharing credential file download \nEach recipient supports a maximum of 100 IP\/CIDR values, where one CIDR counts as a single value. Only IPv4 addresses are supported.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/access-list.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Restrict Delta Sharing recipient access using IP access lists (open sharing)\n#### Assign an IP access list to a recipient\n\nYou can assign an IP access list to a recipient using Catalog Explorer or the Databricks Unity Catalog CLI. \n**Permissions required**: If you are assigning an IP access list when you create a recipient, you must be a metastore admin or user with the `CREATE_RECIPIENT` privilege. If you are assigning an IP access list to an existing recipient, you must be the recipient object owner. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Recipients** tab, select the recipient.\n4. 
On the **IP access list** tab, click **Add IP address\/CIDRs** for each IP address (in single IP address format, like 8.8.8.8) or range of IP addresses (in CIDR format, like 8.8.8.4\/10). \nTo add an IP access list when you create a new recipient, run the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html), replacing `<recipient-name>` and the IP address values. \n```\ndatabricks recipients create \\\n--json='{\n\"name\": \"<recipient-name>\",\n\"authentication_type\": \"<authentication-type>\",\n\"ip_access_list\": {\n\"allowed_ip_addresses\": [\n\"8.8.8.8\",\n\"8.8.8.4\/10\"\n]\n}\n}'\n\n``` \nTo add an IP access list to an existing recipient, run the following command, replacing `<recipient-name>` and the IP address values. \n```\ndatabricks recipients update \\\n--json='{\n\"name\": \"<recipient-name>\",\n\"ip_access_list\": {\n\"allowed_ip_addresses\": [\n\"8.8.8.8\",\n\"8.8.8.4\/10\"\n]\n}\n}'\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/access-list.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Restrict Delta Sharing recipient access using IP access lists (open sharing)\n#### Remove an IP access list\n\nYou can remove a recipient\u2019s IP access list using Catalog Explorer or the Databricks Unity Catalog CLI. If you remove all IP addresses from the list, the recipient can access the shared data from anywhere. \n**Permissions required**: Recipient object owner. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Recipients** tab, select the recipient.\n4. On the **IP access list** tab, click the trash can icon next to the IP address you want to delete. 
\nUse the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) to pass in an empty IP access list: \n```\ndatabricks recipients update \\\n--json='{\n\"name\": \"<recipient-name>\",\n\"ip_access_list\": {}\n}'\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/access-list.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Restrict Delta Sharing recipient access using IP access lists (open sharing)\n#### View a recipient\u2019s IP access list\n\nYou can view a recipient\u2019s IP access list using Catalog Explorer, the Databricks Unity Catalog CLI, or the `DESCRIBE RECIPIENT` SQL command in a notebook or Databricks SQL query. \n**Permissions required**: Metastore admin, user with the `USE RECIPIENT` privilege, or the recipient object owner. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Recipients** tab, find and select the recipient.\n4. View allowed IP addresses on the **IP access list** tab. \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). \n```\ndatabricks recipients get <recipient-name>\n\n``` \nRun the following command in a notebook or the Databricks SQL query editor. 
\n```\nDESCRIBE RECIPIENT <recipient-name>;\n\n```\n\n### Restrict Delta Sharing recipient access using IP access lists (open sharing)\n#### Audit logging for Delta Sharing IP access lists\n\nThe following operations trigger audit logs related to IP access lists: \n* Recipient management operations: create, update\n* Denial of access to any of the Delta Sharing OSS Protocol REST API calls\n* Denial of access to Delta Sharing activation URL (open sharing only)\n* Denial of access to Delta Sharing credential file download (open sharing only) \nTo learn more about how to enable and read audit logs for Delta Sharing, see [Audit and monitor data sharing](https:\/\/docs.databricks.com\/data-sharing\/audit-logs.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/access-list.html"} +{"content":"# \n### Deploy a version of a RAG Application\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html) \nThe following guide walks you through deploying a version of the application in the development `Environment` so you can chat with it through the `\ud83d\udcac Review UI`. \nNote \nThe default RAG Studio template ships with a fully functioning application. You can deploy the code as-is. See [Create versions of your RAG application to iterate on the app\u2019s quality](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-app-versions.html) to understand how to create a new `Version`.\n\n### Deploy a version of a RAG Application\n#### Step 1: Build and deploy the sample application\n\nNote \nThis step will run the `\ud83d\uddc3\ufe0f Data Processor`, package the `\ud83d\udd17 Chain` into a Unity Catalog model, and then deploy the `\ud83d\udd17 Chain` to Model Serving. \n1. 
Deploy the application to your workspace by running the following command in your console. This step will take approximately 15-30 minutes. \n```\n.\/rag create-rag-version -e dev\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1c-deploy-version.html"} +{"content":"# \n### Deploy a version of a RAG Application\n#### Step 2: View your RAG Application in the `\ud83d\udcac Review UI`\n\n1. Congrats! You have deployed a fully functioning RAG application, complete with logging, the ability to collect feedback from users and LLM-Judges, and automated quality\/cost\/latency metric computation. \nNote \nWhile we are only exploring this application for the purposes of getting started with RAG Studio, this application is ready to be deployed to your production environment.\n2. In the console, you will see output similar to below. Open the URL in your web browser to open the `\ud83d\udcac Review UI`. \n```\n...truncated for clarity of docs...\n=======\nTask deploy_chain_task:\nYour Review UI is now available. Open the Review UI here: https:\/\/<workspace-url>\/ml\/review\/model\/catalog.schema.rag_studio_databricks-docs-bot\/version\/1\/environment\/dev\n\n```\n3. You can now interact with the RAG application! 
\n![RAG application](https:\/\/docs.databricks.com\/_images\/review-ui-1.png)\n\n### Deploy a version of a RAG Application\n#### Data flow\n\n![RAG review app](https:\/\/docs.databricks.com\/_images\/data-flow-review-ui.png)\n\n### Deploy a version of a RAG Application\n#### Follow the next tutorial!\n\n[View logs & assessments](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/2-view-logs.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1c-deploy-version.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### Create a training run using the Foundation Model Training UI\n\nImportant \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). Reach out to your Databricks account team to enroll in the Public Preview. \nThis article describes how to create and configure a training run using the Foundation Model Training UI. You can also create a run using the API. For instructions, see [Create a training run using the Foundation Model Training API](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/create-fine-tune-run.html).\n\n#### Create a training run using the Foundation Model Training UI\n##### Requirements\n\nSee [Requirements](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html#required).\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/ui.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Foundation Model Training\n#### Create a training run using the Foundation Model Training UI\n##### Create a training run using the UI\n\nFollow these steps to create a training run using the UI. \n1. In the left sidebar, click **Experiments**.\n2. On the **Foundation Model Training** card, click **Create Foundation Model Experiment**. 
\n![Foundation model experiment form](https:\/\/docs.databricks.com\/_images\/finetuning-expt.png)\n3. The **Foundation Model Training** form opens. Items marked with an asterisk are required. Make your selections, and then click **Start Training**. \n**Type**: Select the task to perform. \n| Task | Description |\n| --- | --- |\n| Instruction Finetuning | Continue training a foundation model with prompt-and-response input to optimize the model for a specific task. |\n| Continued Pre-training | Continue training a foundation model to give it domain-specific knowledge. |\n| Chat Completion | Continue training a foundation model with chat logs to optimize it for Q&A or conversation applications. | \n**Select Foundation Model**: Select the model to tune or train. For a list of supported models, see [Supported models](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html#supported-models). \n**Training data**: Click **Browse** to select a table in Unity Catalog, or enter the full URL for a Hugging Face dataset. For data size recommendations, see [Recommended data size for model training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html#data-size). \nIf you select a table in Unity Catalog, you must also select the compute to use to read the table. \n**Register to location**: Select the Unity Catalog catalog and schema from the drop-down menus. The trained model is saved to this location. \n**Model name**: The model is saved with this name in the catalog and schema you specified. A default name appears in this field, which you can change if desired. \n**Advanced options**: For more customization, you can configure optional settings for evaluation, hyperparameter tuning, or train from an existing proprietary model. \n| Setting | Description |\n| --- | --- |\n| Training duration | Duration of the training run, specified in epochs (for example, `10ep`) or tokens (for example, `1000000tok`). 
Default is `1ep`. |\n| Learning rate | The learning rate for model training. Default is `5e-7`. The optimizer is DecoupledLionW with betas of 0.99 and 0.95 and no weight decay. The learning rate scheduler is LinearWithWarmupSchedule with a warmup of 2% of the total training duration and a final learning rate multiplier of 0. |\n| Context length | The maximum sequence length of a data sample. Data longer than this setting is truncated. The default depends on the model selected. |\n| Evaluation data | Click **Browse** to select a table in Unity Catalog, or enter the full URL for a Hugging Face dataset. If you leave this field blank, no evaluation is performed. |\n| Model evaluation prompts | Type optional prompts to use to evaluate the model. |\n| Experiment name | By default, a new, automatically generated name is assigned for each run. You can optionally enter a custom name or select an existing experiment from the drop-down list. |\n| Custom weights | By default, training begins by using the original weights of the selected model. To start with custom weights from a [Composer checkpoint](https:\/\/github.com\/mosaicml\/composer\/), enter the path to the Unity Catalog table that contains the checkpoint values. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/ui.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n#### Cost optimization for the data lakehouse\n\nThis article covers architectural principles of the **cost optimization** pillar, aimed at enabling cost management in a way that maximizes the value delivered. Given a budget, cost efficiency is driven by business objectives and return on investment. Cost optimization principles can help achieve both business objectives and cost justification. 
\n![Cost optimization lakehouse architecture diagram for Databricks.](https:\/\/docs.databricks.com\/_images\/cost-optimization.png)\n\n#### Cost optimization for the data lakehouse\n##### Principles of cost optimization\n\n1. **Choose the correct resources** \nChoose the right resources that align with business goals and can handle workload performance. When onboarding new workloads, explore the different deployment options and choose the one with the best price\/performance ratio.\n2. **Dynamically allocate and deallocate resources** \nDynamically allocate and release resources to match performance requirements. Identify unused or underutilized resources and reconfigure, consolidate, or turn them off.\n3. **Monitor and control cost** \nThe cost of your workloads depends on the amount of resources consumed and the rates charged for those resources. To understand the cost of these workloads, monitor them for each resource involved. This provides a baseline for controlling consumption and costs.\n4. **Analyze and attribute expenditure** \nThe lakehouse makes it easy to identify workload usage and costs accurately. This enables the transparent allocation of costs to individual workload owners. They can then measure return on investment and optimize their resources to reduce costs if necessary.\n5. **Optimize workloads, aim for scalable costs** \nA key advantage of the lakehouse is its ability to scale dynamically. As a starting point, usage and performance metrics are analyzed to determine the initial number of instances. 
With auto-scaling, additional costs can be saved by choosing smaller instances for a highly variable workload, or by scaling out rather than up to achieve the required level of performance.\n\n#### Cost optimization for the data lakehouse\n##### Next: Best practices for cost optimization\n\nSee [Best practices for cost optimization](https:\/\/docs.databricks.com\/lakehouse-architecture\/cost-optimization\/best-practices.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/cost-optimization\/index.html"} +{"content":"# \n### Collect feedback from `\ud83e\udde0 Expert Users`\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html) \nThis tutorial walks you through deploying your RAG Application to the Reviewers `Environment` in order to allow `\ud83e\udde0 Expert Users` to test the application and provide feedback.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/4-collect-feedback.html"} +{"content":"# \n### Collect feedback from `\ud83e\udde0 Expert Users`\n#### Step 1: Create the Reviewers & End Users `Environment`\n\n1. If you did not already run this command in [Initialize a RAG Application](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1-create-sample-app.html), run the following command to initialize these `Environments`. This command takes about 10 minutes to run. \n```\n.\/rag setup-prod-env\n\n``` \nNote \nSee [Infrastructure and Unity Catalog assets created by RAG Studio](https:\/\/docs.databricks.com\/rag-studio\/details\/created-infra.html) for details of what is created in your Workspace and Unity Catalog schema.\n2. Run the following command to deploy the version to the Reviewers `Environment`. This command takes about 10 minutes to run. 
\n```\n.\/rag deploy-chain -v 1 -e reviewers\n\n```\n3. In the console, you will see output similar to the following. Open the URL in your web browser to open the `\ud83d\udcac Review UI`. You can share this URL with your `\ud83e\udde0 Expert Users`. \n```\n...truncated for clarity of docs...\n=======\nTask deploy_chain_task:\nYour Review UI is now available. Open the Review UI here: https:\/\/<workspace-url>\/ml\/review\/model\/catalog.schema.rag_studio_databricks-docs-bot\/version\/1\/environment\/reviewers\n\n```\n4. Add permissions to the deployed version so your `\ud83e\udde0 Expert Users` can access the above URL. \n* Give each Databricks user who needs access `read` permissions to \n+ the MLflow Experiment\n+ the Model Serving endpoint\n+ the Unity Catalog Model\nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** Support for adding any corporate SSO to access the `\ud83d\udcac Review UI` e.g., no requirements for a Databricks account.\n5. Now, every time one of your `\ud83e\udde0 Expert Users` chats with your RAG Application, the `\ud83d\uddc2\ufe0f Request Log` and `\ud83d\udc4d Assessment & Evaluation Results Log` will be populated.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/4-collect-feedback.html"} +{"content":"# \n### Collect feedback from `\ud83e\udde0 Expert Users`\n#### Follow the next tutorial!\n\n[Create an \ud83d\udcd6 Evaluation Set](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/5-create-eval-set.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/4-collect-feedback.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n### Set up query federation for non-Unity-Catalog workspaces\n##### Query federation for Microsoft SQL Server in Databricks SQL (Experimental)\n\nExperimental \nThe configurations described in this article are [Experimental](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
Experimental features are provided as-is and are not supported by Databricks through customer technical support. **To get full query federation support, you should instead use [Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html), which enables your Databricks users to take advantage of Unity Catalog syntax and data governance tools.** \nThis article describes how to configure read-only query federation to SQL Server on Serverless and Pro SQL warehouses. \nYou configure connections to SQL Server at the table level. You can use [secrets](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/secret.html) to store and access text credentials without displaying them in plaintext. See the following example: \n```\nDROP TABLE IF EXISTS sqlserver_table;\nCREATE TABLE sqlserver_table\nUSING sqlserver\nOPTIONS (\ndbtable '<table-name>',\nhost '<database-host-url>',\nport '1433',\ndatabase '<database-name>',\nuser secret('sqlserver_creds', 'my_username'),\npassword secret('sqlserver_creds', 'my_password')\n);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/sql-server-no-uc.html"} +{"content":"# \n### Delegated authentication to third-party services\n\nDatabricks can log you into third-party services, such as the [Ideas Portal](https:\/\/ideas.databricks.com) (powered by Aha!) and the [Help Center](https:\/\/help.databricks.com\/s\/) (powered by Salesforce), using your Databricks username. These third-party services delegate authentication to Databricks, essentially putting Databricks in the role of single sign-on (SSO) provider. \nFor example, with delegated authentication enabled, when you go to the help ![Help icon](https:\/\/docs.databricks.com\/_images\/help-icon.png) menu in your Databricks workspace and select **Feedback**, you\u2019ll be logged into the Ideas Portal immediately, without having to provide credentials again. 
\nNote \nDelegated authentication is enabled by default for all Databricks accounts, but your administrator may choose to [disable it](https:\/\/docs.databricks.com\/admin\/access-control\/auth-external.html). \nThis article describes how each of the third-party services that delegate authentication to Databricks works.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delegated-auth.html"} +{"content":"# \n### Delegated authentication to third-party services\n#### Ideas Portal\n\nThe [Ideas Portal](https:\/\/ideas.databricks.com) is available only through delegated authentication. \n### Sign in to the Ideas Portal the first time \nGo to the help ![Help icon](https:\/\/docs.databricks.com\/_images\/help-icon.png) menu in your Databricks workspace and select **Feedback**. You will be asked to validate your email address. Wait for the validation email and click the link in the email to complete the validation process and gain access to the Ideas Portal. \n### Sign in to the Ideas Portal as a returning user \nGo to the help ![Help icon](https:\/\/docs.databricks.com\/_images\/help-icon.png) menu in your Databricks workspace and select **Feedback**. Databricks launches the Ideas Portal and signs you in. \nYou can also log in by going directly to [ideas.databricks.com](https:\/\/ideas.databricks.com). If you have an active session, you will be logged in automatically. If you do not, you will be prompted to enter your workspace domain and sign into your workspace. Once you are signed in, you will be redirected to the Ideas Portal. \nFor more information about the Ideas Portal and the feedback process, see [Submit product feedback](https:\/\/docs.databricks.com\/resources\/ideas.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delegated-auth.html"} +{"content":"# \n### Delegated authentication to third-party services\n#### Help Center\n\nAnyone can access the [Help Center](https:\/\/help.databricks.com\/s\/) to learn how to use Databricks and get answers to questions. 
If your organization has a support contract with Databricks and you are an authorized support contact for your organization, you can also sign in to the Help Center to create, view, and modify support cases. \nSupport contacts have two ways of signing in to the Help Center: using Databricks delegated authentication, and using their Help Center credentials (managed by Salesforce). In either case, a Databricks support representative or a designated support contact from your organization *must already have added you* as a support contact to the Databricks Salesforce account. \n### Sign in using delegated authentication from your Databricks workspace \nGo to the help ![Help icon](https:\/\/docs.databricks.com\/_images\/help-icon.png) menu in your Databricks workspace and select **Support**. \nIf you are logging in to the Help Center for the first time, you will be asked to validate your email address. Wait for the validation email and click the link in the email to complete the validation process and gain access to the support case features of the Help Center. \nIf you are a returning user, you will be logged into the Help Center automatically. \n### Sign in using delegated authentication from the Help Center \n1. Click the **Login** button on the upper right corner of the Help Center home page. \n![help center login button](https:\/\/docs.databricks.com\/_images\/help-center-login-button.png)\n2. On the login page, click the **Sign In** button. \n![login page](https:\/\/docs.databricks.com\/_images\/login-main.png) \nIf you are logging in to the Help Center for the first time, you will be prompted to sign into your Databricks workspace and asked to validate your email address. Wait for the validation email and click the link in the email to complete the validation process and gain access to the support case features of the Help Center. 
\nIf you are a returning user and you already have an active Databricks workspace session on your browser, you will be logged into the Help Center automatically. If you don\u2019t have an active session, you will be prompted to log in to your workspace and then logged in automatically to the Help Center. \n### Sign in using your Databricks support credentials from the Help Center \nIf you do not have a Databricks user account, or it does not share the same email address as your registered support contact user: \n1. Click the **Login** button on the upper right corner of the Help Center home page.\n2. On the login page, click the **Click Here** link next to **Don\u2019t have a Databricks Workspace account?**.\n3. On the secondary login page, enter your Databricks support (Salesforce) credentials. \n![secondary login page](https:\/\/docs.databricks.com\/_images\/login-username-pwd.png) \nFor more information about the Help Center and the support process, see [Support](https:\/\/docs.databricks.com\/resources\/support.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delegated-auth.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### Expand and read Zip compressed files\n\nYou can use the `unzip` Bash command to expand files or directories of files that have been Zip compressed. If you download or encounter a file or directory ending with `.zip`, expand the data before trying to continue. \nNote \nApache Spark provides native codecs for interacting with compressed Parquet files. 
Most Parquet files written by Databricks end with `.snappy.parquet`, indicating they use snappy compression.\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/unzip-files.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### Expand and read Zip compressed files\n##### How to unzip data\n\nThe Databricks `%sh` [magic command](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#language-magic) enables execution of arbitrary Bash code, including the `unzip` command. \nThe following example uses a zipped CSV file downloaded from the internet. See [Download data from the internet](https:\/\/docs.databricks.com\/files\/download-internet-files.html). \nNote \nYou can use the Databricks Utilities to move files to the ephemeral storage attached to the driver before expanding them. You cannot expand zip files while they reside in Unity Catalog volumes. See [Databricks Utilities (dbutils) reference](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html). \nThe following code uses `curl` to download and then `unzip` to expand the data: \n```\n%sh curl https:\/\/resources.lendingclub.com\/LoanStats3a.csv.zip --output \/tmp\/LoanStats3a.csv.zip\nunzip \/tmp\/LoanStats3a.csv.zip\n\n``` \nUse dbutils to move the expanded file to a Unity Catalog volume, as follows: \n```\ndbutils.fs.mv(\"file:\/LoanStats3a.csv\", \"\/Volumes\/my_catalog\/my_schema\/my_volume\/LoanStats3a.csv\")\n\n``` \nIn this example, the downloaded data has a comment in the first row and a header in the second. 
Now that the data has been expanded and moved, use standard options for reading CSV files, as in the following example: \n```\ndf = spark.read.format(\"csv\").option(\"skipRows\", 1).option(\"header\", True).load(\"\/Volumes\/my_catalog\/my_schema\/my_volume\/LoanStats3a.csv\")\ndisplay(df)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/unzip-files.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n####### Databricks ODBC and JDBC Drivers\n\nDatabricks provides an [ODBC driver](https:\/\/docs.databricks.com\/integrations\/odbc\/index.html) and a [JDBC driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/index.html) to connect your tools or clients to Databricks. For tool or client specific connection instructions, see [Technology partners](https:\/\/docs.databricks.com\/integrations\/index.html) or your tool\u2019s or client\u2019s documentation. \n* To get started with the ODBC driver, see [Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/index.html).\n* To get started with the JDBC driver, see [Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/index.html).\n\n####### Databricks ODBC and JDBC Drivers\n######## Additional resources\n\n* [Databricks ODBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Simba-Apache-Spark-ODBC-Connector-Install-and-Configuration-Guide.pdf)\n* [Databricks JDBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Databricks-JDBC-Driver-Install-and-Configuration-Guide.pdf)\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc-odbc-bi.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to Prophecy\n\nProphecy helps teams be successful and productive on Apache Spark and Apache Airflow with low-code development, scheduling, and metadata. 
\nYou can integrate your Databricks clusters with Prophecy. \nFor a general overview and demonstration of Prophecy, watch the following YouTube video (26 minutes). \nNote \nProphecy does not integrate with Databricks SQL warehouses.\n\n#### Connect to Prophecy\n##### Connect to Prophecy using Partner Connect\n\nFor an overview of the Partner Connect procedure, watch this YouTube video (3 minutes). \n### Steps to connect \n1. [Connect to data prep partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/prep.html).\n2. In your Prophecy account, on the navigation bar, click **Metadata**.\n3. Click **Fabrics**. The fabric for your workspace should be displayed.\n4. Do one of the following: \n* If the fabric is displayed, skip ahead to next steps.\n* If the fabric is not displayed, you can troubleshoot by connecting manually.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/prophecy.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to Prophecy\n##### Connect to Prophecy manually\n\nUse the steps in this section to connect Prophecy to your workspace. \nNote \nTo connect faster to Prophecy, use Partner Connect. \n### Requirements \nTo complete this series of steps, you need a Databricks [personal access token](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement). \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. 
To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \n### Steps to connect \nTo connect to Prophecy manually, do the following: \n1. Sign in to your Prophecy account, or create a new Prophecy account, at <https:\/\/app.prophecy.io>.\n2. On the navigation bar, click **Metadata**.\n3. Click the **Fabrics** tab. \nImportant \nIf you sign in to your organization\u2019s Prophecy account, there may already be a list of existing fabric entries. *These entries might contain connection details for workspaces that are separate from yours.* If you still want to reuse one of these fabrics, and you trust the workspace and have access to it, then skip ahead to next steps.\n4. Click the plus (**+**) icon.\n5. Click the **Fabric** tab.\n6. Enter a **Name**, select a **Team**, and enter an optional **Description**. (Do not change the **Execution Url** value.)\n7. Click **Create Fabric**.\n8. In the list of fabrics, click the name of the fabric that you just added. 
\n[Learn about Fabric usage patterns](https:\/\/docs.prophecy.io\/concepts\/fabrics).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/prophecy.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to Prophecy\n##### Next steps\n\n* See [Creating a Pipeline](https:\/\/docs.prophecy.io\/concepts\/project\/pipeline\/#creating-a-pipeline) in the Prophecy documentation.\n\n#### Connect to Prophecy\n##### Additional resources\n\n[Prophecy website](https:\/\/prophecy.io)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/prophecy.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Pass context about job runs into job tasks\n\nYou can use *dynamic value references* to pass context about a job or task run, such as the job or task name, the identifier of a run, or the start time of a job run. Dynamic value references are templated variables that are replaced with the appropriate values when the job task runs.\nWhen a job runs, the task parameter variable surrounded by double curly braces is replaced and appended to an optional string value included as part of the value. For example, to pass a parameter named `MyJobID` with a value of `my-job-6` for any run of job ID 6, add the following task parameter: \n```\n{\n\"MyJobID\": \"my-job-{{job.id}}\"\n}\n\n``` \nThe contents of the double curly braces are not evaluated as expressions, so you cannot use operations or functions inside the double curly braces. \nUser-provided identifiers, for example, task names, task value keys, or job parameter names containing special characters must be escaped by surrounding the identifiers with backticks ( `` `` ). Only alphanumeric and underscore characters can be used without escaping. 
\n```\n{\n\"VariableWithSpecialChars\": \"{{job.parameters.`param$@`}}\"\n}\n\n``` \nSyntax errors in references (for example, a missing brace) are ignored and the value is treated as a literal string. For example, `{{my.value}` is passed as the string `\"{{my.value}\"`. However, entering an invalid reference that belongs to a known namespace (for example, `{{job.naem}}`) is not allowed. An error message is displayed if an invalid reference belonging to a known namespace is entered in the UI. \nAfter a task completes, you can see resolved values for parameters under **Parameters** on the [run details page](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#job-run-details).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Pass context about job runs into job tasks\n##### Supported value references\n\nThe following dynamic value references are supported: \n| Reference | Description |\n| --- | --- |\n| `{{job.id}}` | The unique identifier assigned to the job. |\n| `{{job.name}}` | The name of the job at the time of the job run. |\n| `{{job.run_id}}` | The unique identifier assigned to the job run. |\n| `{{job.repair_count}}` | The number of repair attempts on the current job run. |\n| `{{job.start_time.[argument]}}` | A value based on the time (in UTC timezone) that the job run started. The return value is based on the `argument` option. See [Options for date and time values](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html#time-return-options). |\n| `{{job.parameters.[name]}}` | The value of the job-level parameter with the key `[name]`. |\n| `{{job.trigger.type}}` | The trigger type of the job run. The possible values are `periodic`, `onetime`, `run_job_task`, `file_arrival`, `continuous`, and `table`. 
|\n| `{{job.trigger.file_arrival.location}}` | If a file arrival trigger is configured for this job, the value of the storage location. |\n| `{{job.trigger.time.[argument]}}` | A value based on the time (in UTC timezone) that the job run was triggered, rounded down to the closest minute for jobs with a cron schedule. The return value is based on the `argument` option. See [Options for date and time values](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html#time-return-options). |\n| `{{task.name}}` | The name of the current task. |\n| `{{task.run_id}}` | The unique identifier of the current task run. |\n| `{{task.execution_count}}` | The number of times the current task was run (including retries and repairs). |\n| `{{task.notebook_path}}` | The notebook path of the current notebook task. |\n| `{{tasks.[task_name].run_id}}` | The unique identifier assigned to the task run for `[task_name]`. |\n| `{{tasks.[task_name].result_state}}` | The result state of task `[task_name]`. The possible values are `success`, `failed`, `excluded`, `canceled`, `evicted`, `timedout`, `upstream_canceled`, `upstream_evicted`, and `upstream_failed`. |\n| `{{tasks.[task_name].error_code}}` | The error code for task `[task_name]` if an error occurred running the task. Examples of possible values are `RunExecutionError`, `ResourceNotFound`, and `UnauthorizedError`. For successful tasks, this evaluates to an empty string. |\n| `{{tasks.[task_name].execution_count}}` | The number of times the task `[task_name]` was run (including retries and repairs). |\n| `{{tasks.[task_name].notebook_path}}` | The path to the notebook for the notebook task `[task_name]`. |\n| `{{tasks.[task_name].values.[value_name]}}` | The task value with the key `[value_name]` that was set by task `[task_name]`. |\n| `{{workspace.id}}` | The unique identifier assigned to the workspace. |\n| `{{workspace.url}}` | The URL of the workspace. 
| \nYou can set these references with any task when you [Create a job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#job-create), [Edit a job](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#job-edit), or [Run a job with different parameters](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#job-run-with-different-params). \nYou can also pass parameters between tasks in a job with *task values*. See [Share information between tasks in a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/share-task-context.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Pass context about job runs into job tasks\n##### Options for date and time values\n\nUse the following arguments to specify the return value from time based parameter variables. All return values are based on a timestamp in UTC timezone. \n| Argument | Description |\n| --- | --- |\n| `iso_weekday` | Returns a digit from 1 to 7, representing the day of the week of the timestamp. |\n| `is_weekday` | Returns `true` if the timestamp is on a weekday. |\n| `iso_date` | Returns the date in ISO format. |\n| `iso_datetime` | Returns the date and time in ISO format. |\n| `year` | Returns the year part of the timestamp. |\n| `month` | Returns the month part of the timestamp. |\n| `day` | Returns the day part of the timestamp. |\n| `hour` | Returns the hour part of the timestamp. |\n| `minute` | Returns the minute part of the timestamp. |\n| `second` | Returns the second part of the timestamp. |\n| `timestamp_ms` | Returns the timestamp in milliseconds. 
|\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Pass context about job runs into job tasks\n##### Deprecated parameter variables\n\nThe following parameter variables are deprecated. Although they are still supported, any new jobs or updates to existing jobs should use the [supported value references](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html#supported-references). The recommended replacement reference is included in the description of each variable. \n| Variable | Description |\n| --- | --- |\n| `{{job_id}}` | The unique identifier assigned to a job. Use `job.id` instead. |\n| `{{run_id}}` | The unique identifier assigned to a task run. Use `task.run_id` instead. |\n| `{{start_date}}` | The date a task run started. The format is yyyy-MM-dd in UTC timezone. Use `job.start_time.[argument]` instead. |\n| `{{start_time}}` | The timestamp of the run\u2019s start of execution after the cluster is created and ready. The format is milliseconds since UNIX epoch in UTC timezone, as returned by `System.currentTimeMillis()`. Use `job.start_time.[argument]` instead. |\n| `{{task_retry_count}}` | The number of retries that have been attempted to run a task if the first attempt fails. The value is 0 for the first attempt and increments with each retry. Use `task.execution_count` instead. |\n| `{{parent_run_id}}` | The unique identifier assigned to the run of a job with multiple tasks. Use `job.run_id` instead. |\n| `{{task_key}}` | The unique name assigned to a task that\u2019s part of a job with multiple tasks. Use `task.name` instead. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html"} +{"content":"# Compute\n## What are Databricks pools?\n#### Pool best practices\n\nThis article explains what pools are, and how you can best configure them. 
For information on creating a pool, see [Pool configuration reference](https:\/\/docs.databricks.com\/compute\/pools.html).\n\n#### Pool best practices\n##### Pool considerations\n\nConsider the following when creating a pool: \n* Create pools using instance types and Databricks runtimes based on target workloads.\n* When possible, populate pools with spot instances to reduce costs.\n* Populate pools with on-demand instances for jobs with short execution times and strict execution time requirements.\n* Use pool tags and cluster tags to manage billing.\n* Pre-populate pools to make sure instances are available when clusters need them.\n\n#### Pool best practices\n##### Create pools based on workloads\n\nIf your driver node and worker nodes have different requirements, create a different pool for each. \nYou can minimize instance acquisition time by creating a pool for each instance type and Databricks runtime your organization commonly uses. For example, if most data engineering clusters use instance type A, data science clusters use instance type B, and analytics clusters use instance type C, create a pool with each instance type. \nConfigure pools to use on-demand instances for jobs with short execution times and strict execution time requirements. Use on-demand instances to prevent acquired instances from being lost to a higher bidder on the spot market. \nConfigure pools to use spot instances for clusters that support interactive development or jobs that prioritize cost savings over reliability.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/pool-best-practices.html"} +{"content":"# Compute\n## What are Databricks pools?\n#### Pool best practices\n##### Tag pools to manage cost and billing\n\nTagging pools to the correct cost center allows you to manage cost and usage chargeback. You can use multiple custom tags to associate multiple cost centers to a pool. 
However, it\u2019s important to understand how tags are propagated when a cluster is created from pools. Tags from pools propagate to the underlying cloud provider instances, but the cluster\u2019s tags do not. Apply all custom tags required for managing chargeback of the cloud provider compute cost to the pool. \nPool tags and cluster tags both propagate to Databricks billing. You can use the combination of cluster and pool tags to manage chargeback of Databricks Units. \nTo learn more, see [Monitor usage using tags](https:\/\/docs.databricks.com\/admin\/account-settings\/usage-detail-tags.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/pool-best-practices.html"} +{"content":"# Compute\n## What are Databricks pools?\n#### Pool best practices\n##### Configure pools to control cost\n\nYou can use the following configuration options to help control the cost of pools: \n* Set the [Min Idle](https:\/\/docs.databricks.com\/compute\/pools.html#minimum-idle-instances) instances to 0 to avoid paying for running instances that aren\u2019t doing work. The tradeoff is a possible increase in time when a cluster needs to acquire a new instance.\n* Set the [Idle Instance Auto Termination](https:\/\/docs.databricks.com\/compute\/pools.html#idle-instance-auto-termination) time to provide a buffer between when the instance is released from the cluster and when it\u2019s dropped from the pool. Set this to a period that allows you to minimize cost while ensuring the availability of instances for scheduled jobs. For example, job A is scheduled to run at 8:00 AM and takes 40 minutes to complete. Job B is scheduled to run at 9:00 AM and takes 30 minutes to complete. Set the Idle Instance Auto Termination value to 20 minutes to ensure that instances returned to the pool when job A completes are available when job B starts. 
Unless they are claimed by another cluster, those instances are terminated 20 minutes after job B ends.\n* Set the [Max Capacity](https:\/\/docs.databricks.com\/compute\/pools.html#maximum-capacity) based on anticipated usage. This sets the ceiling for the maximum number of used and idle instances in the pool. If a job or cluster requests an instance from a pool at its maximum capacity, the request fails, and the cluster doesn\u2019t acquire more instances. Therefore, Databricks recommends that you set the maximum capacity only if there is a strict instance quota or budget constraint.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/pool-best-practices.html"} +{"content":"# Compute\n## What are Databricks pools?\n#### Pool best practices\n##### Pre-populate pools\n\nTo benefit fully from pools, you can pre-populate newly created pools. Set **Min Idle** instances to a value greater than zero in the pool configuration. Alternatively, if you\u2019re following the recommendation to set this value to zero, use a starter job to ensure that newly created pools have available instances for clusters to access. \nWith the starter job approach, schedule a job with flexible execution time requirements to run before jobs with stricter performance requirements or before users start using interactive clusters. After the job finishes, the instances used for the job are released back to the pool. Set the **Min Idle** instances setting to 0 and set the **Idle Instance Auto Termination** time high enough to ensure that idle instances remain available for subsequent jobs. 
\nUsing a starter job allows the pool instances to spin up, populate the pool, and remain available for downstream job or interactive clusters.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/pool-best-practices.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Configure schema inference and evolution in Auto Loader\n\nYou can configure Auto Loader to automatically detect the schema of loaded data, allowing you to initialize tables without explicitly declaring the data schema and evolve the table schema as new columns are introduced. This eliminates the need to manually track and apply schema changes over time. \nAuto Loader can also \u201crescue\u201d data that was unexpected (for example, of differing data types) in a JSON blob column, that you can choose to access later using the [semi-structured data access APIs](https:\/\/docs.databricks.com\/optimizations\/semi-structured.html). \nThe following formats are supported for schema inference and evolution: \n| File format | Supported versions |\n| --- | --- |\n| `JSON` | All versions |\n| `CSV` | All versions |\n| `XML` | Databricks Runtime 14.3 LTS and above |\n| `Avro` | Databricks Runtime 10.4 LTS and above |\n| `Parquet` | Databricks Runtime 11.3 LTS and above |\n| `ORC` | Unsupported |\n| `Text` | Not applicable (fixed-schema) |\n| `Binaryfile` | Not applicable (fixed-schema) |\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Configure schema inference and evolution in Auto Loader\n##### Syntax for schema inference and evolution\n\nSpecifying a target directory for the option `cloudFiles.schemaLocation` enables schema inference and evolution. You can choose to use the same directory you specify for the `checkpointLocation`. 
If you use [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html), Databricks manages schema location and other checkpoint information automatically. \nNote \nIf you have more than one source data location being loaded into the target table, each Auto Loader ingestion workload requires a separate streaming checkpoint. \nThe following example uses `parquet` for the `cloudFiles.format`. Use `csv`, `avro`, or `json` for other file sources. All other settings for read and write stay the same for the default behaviors for each format. \n```\n(spark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"parquet\")\n# The schema location directory keeps track of your data schema over time\n.option(\"cloudFiles.schemaLocation\", \"<path-to-checkpoint>\")\n.load(\"<path-to-source-data>\")\n.writeStream\n.option(\"checkpointLocation\", \"<path-to-checkpoint>\")\n.start(\"<path_to_target>\")\n)\n\n``` \n```\nspark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"parquet\")\n\/\/ The schema location directory keeps track of your data schema over time\n.option(\"cloudFiles.schemaLocation\", \"<path-to-checkpoint>\")\n.load(\"<path-to-source-data>\")\n.writeStream\n.option(\"checkpointLocation\", \"<path-to-checkpoint>\")\n.start(\"<path_to_target>\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Configure schema inference and evolution in Auto Loader\n##### How does Auto Loader schema inference work?\n\nTo infer the schema when first reading data, Auto Loader samples the first 50 GB or 1000 files that it discovers, whichever limit is crossed first. Auto Loader stores the schema information in a directory `_schemas` at the configured `cloudFiles.schemaLocation` to track schema changes to the input data over time. 
\nNote \nTo change the size of the sample that\u2019s used you can set the SQL configurations: \n```\nspark.databricks.cloudFiles.schemaInference.sampleSize.numBytes\n\n``` \n(byte string, for example `10gb`) \nand \n```\nspark.databricks.cloudFiles.schemaInference.sampleSize.numFiles\n\n``` \n(integer) \nBy default, Auto Loader schema inference seeks to avoid schema evolution issues due to type mismatches. For formats that don\u2019t encode data types (JSON, CSV, and XML), Auto Loader infers all columns as strings (including nested fields in JSON files). For formats with typed schema (Parquet and Avro), Auto Loader samples a subset of files and merges the schemas of individual files. This behavior is summarized in the following table: \n| File format | Default inferred data type |\n| --- | --- |\n| `JSON` | String |\n| `CSV` | String |\n| `XML` | String |\n| `Avro` | Types encoded in Avro schema |\n| `Parquet` | Types encoded in Parquet schema | \nThe Apache Spark DataFrameReader uses different behavior for schema inference, selecting data types for columns in JSON, CSV, and XML sources based on sample data. To enable this behavior with Auto Loader, set the option `cloudFiles.inferColumnTypes` to `true`. \nNote \nWhen inferring the schema for CSV data, Auto Loader assumes that the files contain headers. If your CSV files do not contain headers, provide the option `.option(\"header\", \"false\")`. In addition, Auto Loader merges the schemas of all the files in the sample to come up with a global schema. Auto Loader can then read each file according to its header and parse the CSV correctly. \nNote \nWhen a column has different data types in two Parquet files, Auto Loader chooses the widest type. You can use [schemaHints](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html#override-schema-inference-with-schema-hints) to override this choice. 
When you specify schema hints, Auto Loader doesn\u2019t cast the column to the specified type, but rather tells the Parquet reader to read the column as the specified type. In the case of a mismatch, the column is rescued in the [rescued data column](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html#rescue).\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Configure schema inference and evolution in Auto Loader\n##### How does Auto Loader schema evolution work?\n\nAuto Loader detects the addition of new columns as it processes your data. When Auto Loader detects a new column, the stream stops with an `UnknownFieldException`. Before your stream throws this error, Auto Loader performs schema inference on the latest micro-batch of data and updates the schema location with the latest schema by merging new columns to the end of the schema. The data types of existing columns remain unchanged. \nDatabricks recommends configuring Auto Loader streams with [workflows](https:\/\/docs.databricks.com\/workflows\/index.html) to restart automatically after such schema changes. \nAuto Loader supports the following modes for schema evolution, which you set in the option `cloudFiles.schemaEvolutionMode`: \n| Mode | Behavior on reading new column |\n| --- | --- |\n| `addNewColumns` (default) | Stream fails. New columns are added to the schema. Existing columns do not evolve data types. |\n| `rescue` | Schema is never evolved and stream does not fail due to schema changes. All new columns are recorded in the [rescued data column](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html#rescue). |\n| `failOnNewColumns` | Stream fails. Stream does not restart unless the provided schema is updated, or the offending data file is removed. 
|\n| `none` | Does not evolve the schema, new columns are ignored, and data is not rescued unless the `rescuedDataColumn` option is set. Stream does not fail due to schema changes. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Configure schema inference and evolution in Auto Loader\n##### How do partitions work with Auto Loader?\n\nAuto Loader attempts to infer partition columns from the underlying directory structure of the data if the data is laid out in Hive style partitioning. For example, the file path `base_path\/event=click\/date=2021-04-01\/f0.json` results in the inference of `date` and `event` as partition columns. If the underlying directory structure contains conflicting Hive partitions or doesn\u2019t contain Hive style partitioning, partition columns are ignored. \nBinary file (`binaryFile`) and `text` file formats have fixed data schemas, but support partition column inference. Databricks recommends setting `cloudFiles.schemaLocation` for these file formats. This avoids any potential errors or information loss and prevents inference of partition columns each time an Auto Loader stream begins. \nPartition columns are not considered for schema evolution. If you had an initial directory structure like `base_path\/event=click\/date=2021-04-01\/f0.json`, and then start receiving new files as `base_path\/event=click\/date=2021-04-01\/hour=01\/f1.json`, Auto Loader ignores the hour column. To capture information for new partition columns, set `cloudFiles.partitionColumns` to `event,date,hour`. \nNote \nThe option `cloudFiles.partitionColumns` takes a comma-separated list of column names. 
Only columns that exist as `key=value` pairs in your directory structure are parsed.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Configure schema inference and evolution in Auto Loader\n##### What is the rescued data column?\n\nWhen Auto Loader infers the schema, a rescued data column is automatically added to your schema as `_rescued_data`. You can rename the column or include it in cases where you provide a schema by setting the option `rescuedDataColumn`. \nThe rescued data column ensures that columns that don\u2019t match with the schema are rescued instead of being dropped. The rescued data column contains any data that isn\u2019t parsed for the following reasons: \n* The column is missing from the schema.\n* Type mismatches.\n* Case mismatches. \nThe rescued data column contains a JSON containing the rescued columns and the source file path of the record. \nNote \nThe JSON and CSV parsers support three modes when parsing records: `PERMISSIVE`, `DROPMALFORMED`, and `FAILFAST`. When used together with `rescuedDataColumn`, data type mismatches do not cause records to be dropped in `DROPMALFORMED` mode or throw an error in `FAILFAST` mode. Only corrupt records are dropped or throw errors, such as incomplete or malformed JSON or CSV. If you use `badRecordsPath` when parsing JSON or CSV, data type mismatches are not considered as bad records when using the `rescuedDataColumn`. 
Only incomplete and malformed JSON or CSV records are stored in `badRecordsPath`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Configure schema inference and evolution in Auto Loader\n##### Change case-sensitive behavior\n\nUnless case sensitivity is enabled, the columns `abc`, `Abc`, and `ABC` are considered the same column for the purposes of schema inference. The case that is chosen is arbitrary and depends on the sampled data. You can use [schema hints](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html#schema-hints) to enforce which case should be used. Once a selection has been made and the schema is inferred, Auto Loader does not consider the casing variants that were not selected consistent with the schema. \nWhen [rescued data column](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html#rescue) is enabled, fields named in a case other than that of the schema are loaded to the `_rescued_data` column. Change this behavior by setting the option `readerCaseSensitive` to false, in which case Auto Loader reads data in a case-insensitive way.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Configure schema inference and evolution in Auto Loader\n##### Override schema inference with schema hints\n\nYou can use schema hints to enforce the schema information that you know and expect on an inferred schema. 
When you know that a column is of a specific data type, or if you want to choose a more general data type (for example, a `double` instead of an `integer`), you can provide an arbitrary number of hints for column data types as a string using SQL schema specification syntax, such as the following: \n```\n.option(\"cloudFiles.schemaHints\", \"tags map<string,string>, version int\")\n\n``` \nSee the documentation on [data types](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-datatypes.html#language-mappings) for the list of supported data types. \nIf a column is not present at the start of the stream, you can also use schema hints to add that column to the inferred schema. \nHere is an example of an inferred schema to see the behavior with schema hints. \nInferred schema: \n```\n|-- date: string\n|-- quantity: int\n|-- user_info: struct\n||-- id: string\n||-- name: string\n||-- dob: string\n|-- purchase_options: struct\n||-- delivery_address: string\n\n``` \nBy specifying the following schema hints: \n```\n.option(\"cloudFiles.schemaHints\", \"date DATE, user_info.dob DATE, purchase_options MAP<STRING,STRING>, time TIMESTAMP\")\n\n``` \nyou get: \n```\n|-- date: string -> date\n|-- quantity: int\n|-- user_info: struct\n||-- id: string\n||-- name: string\n||-- dob: string -> date\n|-- purchase_options: struct -> map<string,string>\n|-- time: timestamp\n\n``` \nNote \nArray and Map schema hints support is available in [Databricks Runtime 9.1 LTS](https:\/\/docs.databricks.com\/release-notes\/runtime\/9.1lts.html) and above. \nHere is an example of an inferred schema with complex datatypes to see the behavior with schema hints. 
\nInferred schema: \n```\n|-- products: array<string>\n|-- locations: array<string>\n|-- users: array<struct>\n||-- users.element: struct\n|| |-- id: string\n|| |-- name: string\n|| |-- dob: string\n|-- ids: map<string,string>\n|-- names: map<string,string>\n|-- prices: map<string,string>\n|-- discounts: map<struct,string>\n||-- discounts.key: struct\n|| |-- id: string\n||-- discounts.value: string\n|-- descriptions: map<string,struct>\n||-- descriptions.key: string\n||-- descriptions.value: struct\n|| |-- content: int\n\n``` \nBy specifying the following schema hints: \n```\n.option(\"cloudFiles.schemaHints\", \"products ARRAY<INT>, locations.element STRING, users.element.id INT, ids MAP<STRING,INT>, names.key INT, prices.value INT, discounts.key.id INT, descriptions.value.content STRING\")\n\n``` \nyou get: \n```\n|-- products: array<string> -> array<int>\n|-- locations: array<string>\n|-- users: array<struct>\n||-- users.element: struct\n|| |-- id: string -> int\n|| |-- name: string\n|| |-- dob: string\n|-- ids: map<string,string> -> map<string,int>\n|-- names: map<string,string> -> map<int,string>\n|-- prices: map<string,string> -> map<string,int>\n|-- discounts: map<struct,string>\n||-- discounts.key: struct\n|| |-- id: string -> int\n||-- discounts.value: string\n|-- descriptions: map<string,struct>\n||-- descriptions.key: string\n||-- descriptions.value: struct\n|| |-- content: int -> string\n\n``` \nNote \nSchema hints are used only if you *do not* provide a schema to Auto Loader. You can use schema hints whether `cloudFiles.inferColumnTypes` is enabled or disabled.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Transform complex data types\n\nWhile working with nested data types, Databricks optimizes certain transformations out-of-the-box. 
The following code examples demonstrate patterns for working with complex and nested data types in Databricks.\n\n#### Transform complex data types\n##### Dot notation for accessing nested data\n\nYou can use dot notation (`.`) to access a nested field. \n```\ndf.select(\"column_name.nested_field\")\n\n``` \n```\nSELECT column_name.nested_field FROM table_name\n\n```\n\n#### Transform complex data types\n##### Select all nested fields\n\nUse the star operator (`*`) to select all fields within a given field. \nNote \nThis only unpacks nested fields at the specified depth. \n```\ndf.select(\"column_name.*\")\n\n``` \n```\nSELECT column_name.* FROM table_name\n\n```\n\n#### Transform complex data types\n##### Create a new nested field\n\nUse the `struct()` function to create a new nested field. \n```\nfrom pyspark.sql.functions import struct, col\n\ndf.select(struct(col(\"field_to_nest\").alias(\"nested_field\")).alias(\"column_name\"))\n\n``` \n```\nSELECT struct(field_to_nest AS nested_field) AS column_name FROM table_name\n\n```\n\n#### Transform complex data types\n##### Nest all fields into a column\n\nUse the star operator (`*`) to nest all fields from a data source as a single column. \n```\nfrom pyspark.sql.functions import struct\n\ndf.select(struct(\"*\").alias(\"column_name\"))\n\n``` \n```\nSELECT struct(*) AS column_name FROM table_name\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/complex-types.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Transform complex data types\n##### Select a named field from a nested column\n\nUse square brackets `[]` to select nested fields from a column. 
\n```\nfrom pyspark.sql.functions import col\n\ndf.select(col(\"column_name\")[\"field_name\"])\n\n``` \n```\nSELECT column_name[\"field_name\"] FROM table_name\n\n```\n\n#### Transform complex data types\n##### Explode nested elements from a map or array\n\nUse the `explode()` function to unpack values from `ARRAY` and `MAP` type columns. \n`ARRAY` columns store values as a list. When unpacked with `explode()`, each value becomes a row in the output. \n```\nfrom pyspark.sql.functions import explode\n\ndf.select(explode(\"array_name\").alias(\"column_name\"))\n\n``` \n```\nSELECT explode(array_name) AS column_name FROM table_name\n\n``` \n`MAP` columns store values as key-value pairs. When unpacked with `explode()`, each key-value pair becomes a row in the output, with the key in one column and the value in another. \n```\nfrom pyspark.sql.functions import explode\n\ndf.select(explode(\"map_name\").alias(\"column1_name\", \"column2_name\"))\n\n``` \n```\nSELECT explode(map_name) AS (column1_name, column2_name) FROM table_name\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/complex-types.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Transform complex data types\n##### Create an array from a list or set\n\nUse the functions `collect_list()` or `collect_set()` to transform the values of a column into an array. `collect_list()` collects all values in the column, while `collect_set()` collects only unique values. \nNote \nSpark does not guarantee the order of items in the array resulting from either operation. 
\n```\nfrom pyspark.sql.functions import collect_list, collect_set\n\ndf.select(collect_list(\"column_name\").alias(\"array_name\"))\ndf.select(collect_set(\"column_name\").alias(\"set_name\"))\n\n``` \n```\nSELECT collect_list(column_name) AS array_name FROM table_name;\nSELECT collect_set(column_name) AS set_name FROM table_name;\n\n```\n\n#### Transform complex data types\n##### Select a column from a map in an array\n\nYou can also use dot notation (`.`) to access fields in maps that are contained within an array. This returns an array of all values for the specified field. \nConsider the following data structure: \n```\n{\n\"column_name\": [\n{\"field1\": 1, \"field2\":\"a\"},\n{\"field1\": 2, \"field2\":\"b\"}\n]\n}\n\n``` \nYou can return the values from `field1` as an array with the following query: \n```\ndf.select(\"column_name.field1\")\n\n``` \n```\nSELECT column_name.field1 FROM table_name\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/complex-types.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Transform complex data types\n##### Transform nested data to JSON\n\nUse the `to_json` function to convert a complex data type to JSON. \n```\nfrom pyspark.sql.functions import to_json\n\ndf.select(to_json(\"column_name\").alias(\"json_name\"))\n\n``` \n```\nSELECT to_json(column_name) AS json_name FROM table_name\n\n``` \nTo encode all contents of a query or DataFrame, combine this with `struct(*)`. 
\n```\nfrom pyspark.sql.functions import to_json, struct\n\ndf.select(to_json(struct(\"*\")).alias(\"json_name\"))\n\n``` \n```\nSELECT to_json(struct(*)) AS json_name FROM table_name\n\n``` \nNote \nDatabricks also supports `to_avro` and `to_protobuf` for transforming complex data types for interoperability with integrated systems.\n\n#### Transform complex data types\n##### Transform JSON data to complex data\n\nUse the `from_json` function to convert JSON data to native complex data types. \nNote \nYou must specify the schema for the JSON data. \n```\nfrom pyspark.sql.functions import from_json\n\nschema = \"column1 STRING, column2 DOUBLE\"\n\ndf.select(from_json(\"json_name\", schema).alias(\"column_name\"))\n\n``` \n```\nSELECT from_json(json_name, \"column1 STRING, column2 DOUBLE\") AS column_name FROM table_name\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/complex-types.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Transform complex data types\n##### Notebook: transform complex data types\n\nThe following notebooks provide examples for working with complex data types for Python, Scala, and SQL. 
\n### Transforming complex data types Python notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/transform-complex-data-types-python.html) \n### Transforming complex data types Scala notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/transform-complex-data-types-scala.html) \n### Transforming complex data types SQL notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/transform-complex-data-types-sql.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/complex-types.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Manage and work with foreign catalogs\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes how to manage foreign catalogs and work with data in foreign catalogs. \nA foreign catalog is a securable object in Unity Catalog that mirrors a database in an external data system, enabling you to perform read-only queries on that data system in your Databricks workspace, managing access using Unity Catalog. \nYou work with foreign catalogs in the same way that you work with any catalog managed by Unity Catalog: \n* View information about a foreign catalog: see [View catalog details](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#view-catalog). 
\nIn addition to the information that is displayed for standard Unity Catalog catalogs, Catalog Explorer also displays the **Connection** used by the foreign catalog and the external database or catalog that is mirrored by the foreign catalog.\n* Run read-only queries on tables in foreign catalogs: see [Query data](https:\/\/docs.databricks.com\/query\/index.html). \nWhen you write queries in Databricks, you must use Apache Spark data types. For a mapping of Spark data types to your external database data types, see the article for your connection type, listed in the table of contents in the left navigation pane of this documentation site.\n* Grant read-only access to data in foreign catalogs: see [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html). \n* Get data lineage for tables in foreign catalogs: see [Capture and view data lineage using Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html). \nTo learn how to create foreign catalogs, see [Create a foreign catalog](https:\/\/docs.databricks.com\/query-federation\/index.html#foreign-catalog).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/foreign-catalogs.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization types\n\nThis article outlines the types of visualizations available to use in Databricks notebooks and in Databricks SQL, and shows you how to create an example of each visualization type.\n\n#### Visualization types\n##### Bar chart\n\nBar charts represent the change in metrics over time or show proportionality, similar to a [pie](https:\/\/docs.databricks.com\/visualizations\/visualization-types.html#pie) chart. \nNote \nBar charts support backend aggregations, providing support for queries returning more than 64K rows of data without truncation of the result set. 
\n![Bar chart example](https:\/\/docs.databricks.com\/_images\/stacked-bar-chart.png) \n**Configuration values**: For this bar chart visualization, the following values were set: \n* X column: \n+ Dataset column: `o_orderdate`\n+ Date level: `Months`\n* Y columns: \n+ Dataset column: `o_totalprice`\n+ Aggregation type: `Sum`\n* Group by (dataset column): `o_orderpriority`\n* Stacking: `Stack`\n* X axis name (override default value): `Order month`\n* Y axis name (override default value): `Total price` \n**Configuration options**: For bar chart configuration options, see [chart configuration options](https:\/\/docs.databricks.com\/visualizations\/charts.html#options). \n**SQL query**: For this bar chart visualization, the following SQL query was used to generate the data set. \n```\nselect * from samples.tpch.orders\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/visualization-types.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization types\n##### Line chart\n\nLine charts present the change in one or more metrics over time. \nNote \nLine charts support backend aggregations, providing support for queries returning more than 64K rows of data without truncation of the result set. \n![Line chart example](https:\/\/docs.databricks.com\/_images\/line-chart1.png) \n**Configuration values**: For this line chart visualization, the following values were set: \n* X column: \n+ Dataset column: `o_orderdate`\n+ Date level: `Years`\n* Y columns: \n+ Dataset column: `o_totalprice`\n+ Aggregation type: `Average`\n* Group by (dataset column): `o_orderpriority`\n* X axis name (override default value): `Order year`\n* Y axis name (override default value): `Average price` \n**Configuration options**: For line chart configuration options, see [chart configuration options](https:\/\/docs.databricks.com\/visualizations\/charts.html#options). 
\n**SQL query**: For this line chart visualization, the following SQL query was used to generate the data set. \n```\nselect * from samples.tpch.orders\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/visualization-types.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization types\n##### Area chart\n\nArea charts combine the line and bar chart to show how one or more groups\u2019 numeric values change over the progression of a second variable, typically that of time. They are often used to show sales funnel changes through time. \nNote \nArea charts support backend aggregations, providing support for queries returning more than 64K rows of data without truncation of the result set. \n![Area chart example](https:\/\/docs.databricks.com\/_images\/stacked-area-chart.png) \n**Configuration values**: For this area chart visualization, the following values were set: \n* X column: \n+ Dataset column: `o_orderdate`\n+ Date level: `Years`\n* Y columns: \n+ Dataset column: `o_totalprice`\n+ Aggregation type: `Sum`\n* Group by (dataset column): `o_orderpriority`\n* Stacking: `Stack`\n* X axis name (override default value): `Order year`\n* Y axis name (override default value): `Total price` \n**Configuration options**: For area chart configuration options, see [chart configuration options](https:\/\/docs.databricks.com\/visualizations\/charts.html#options). \n**SQL query**: For this area chart visualization, the following SQL query was used to generate the data set. \n```\nselect * from samples.tpch.orders\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/visualization-types.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization types\n##### Pie charts\n\nPie charts show proportionality between metrics. They are *not* meant for conveying time series data. 
\nNote \nPie charts support backend aggregations, providing support for queries returning more than 64K rows of data without truncation of the result set. \n![Pie chart example](https:\/\/docs.databricks.com\/_images\/pie-chart.png) \n**Configuration values**: For this pie chart visualization, the following values were set: \n* X column (dataset column): `o_orderpriority`\n* Y columns: \n+ Dataset column: `o_totalprice`\n+ Aggregation type: `Sum`\n* Label (override default value): `Total price` \n**Configuration options**: For pie chart configuration options, see [chart configuration options](https:\/\/docs.databricks.com\/visualizations\/charts.html#options). \n**SQL query**: For this pie chart visualization, the following SQL query was used to generate the data set. \n```\nselect * from samples.tpch.orders\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/visualization-types.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization types\n##### Histogram charts\n\nA histogram plots the frequency that a given value occurs in a dataset. A histogram helps you to understand whether a dataset has values that are clustered around a small number of ranges or are more spread out. A histogram is displayed as a bar chart in which you control the number of distinct bars (also called bins). \nNote \nHistogram charts support backend aggregations, providing support for queries returning more than 64K rows of data without truncation of the result set. 
\n![Histogram chart example](https:\/\/docs.databricks.com\/_images\/histogram.png) \n**Configuration values**: For this histogram chart visualization, the following values were set: \n* X column (dataset column): `o_totalprice`\n* Number of bins: 20\n* X axis name (override default value): `Total price` \n**Configuration options**: For histogram chart configuration options, see [histogram chart configuration options](https:\/\/docs.databricks.com\/visualizations\/histogram.html#options). \n**SQL query**: For this histogram chart visualization, the following SQL query was used to generate the data set. \n```\nselect * from samples.tpch.orders\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/visualization-types.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization types\n##### Heatmap chart\n\nHeatmap charts blend features of bar charts, stacking, and bubble charts allowing you to visualize numerical data using colors. A common color palette for a heatmap shows the highest values using warmer colors, like orange or red, and the lowest values using cooler colors, like blue or purple. \nFor example, consider the following heatmap that visualizes the most frequently occurring distances of taxi rides on each day and groups the results by the day of the week, distance, and the total fare. \nNote \nHeatmap charts support backend aggregations, providing support for queries returning more than 64K rows of data without truncation of the result set. 
\n![Heatmap example](https:\/\/docs.databricks.com\/_images\/heatmap.png) \n**Configuration values**: For this heatmap chart visualization, the following values were set: \n* X column (dataset column): `o_orderpriority`\n* Y columns (dataset column): `o_orderstatus`\n* Color column: \n+ Dataset column: `o_totalprice`\n+ Aggregation type: `Average`\n* X axis name (override default value): `Order priority`\n* Y axis name (override default value): `Order status`\n* Color scheme (override default value): `YlGnBu` \n**Configuration options**: For heatmap configuration options, see [heatmap chart configuration options](https:\/\/docs.databricks.com\/visualizations\/heatmap.html#options). \n**SQL query**: For this heatmap chart visualization, the following SQL query was used to generate the data set. \n```\nselect * from samples.tpch.orders\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/visualization-types.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization types\n##### Scatter chart\n\nScatter visualizations are commonly used to show the relationship between two numerical variables. Additionally, a third dimension can be encoded with color to show how the numerical variables are different across groups. \nNote \nScatter charts support backend aggregations, providing support for queries returning more than 64K rows of data without truncation of the result set. 
\n![Scatter example](https:\/\/docs.databricks.com\/_images\/scatter1.png) \n**Configuration values**: For this scatter chart visualization, the following values were set: \n* X column (dataset column): `l_quantity`\n* Y column (dataset column): `l_extendedprice`\n* Group by (dataset column): `l_returnflag`\n* X axis name (override default value): `Quantity`\n* Y axis name (override default value): `Extended price` \n**Configuration options**: For scatter chart configuration options, see [chart configuration options](https:\/\/docs.databricks.com\/visualizations\/charts.html#options). \n**SQL query**: For this scatter chart visualization, the following SQL query was used to generate the data set. \n```\nselect * from samples.tpch.lineitem\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/visualization-types.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization types\n##### Bubble chart\n\nBubble charts are scatter charts where the size of each point marker reflects a relevant metric. \nNote \nBubble charts support backend aggregations, providing support for queries returning more than 64K rows of data without truncation of the result set. \n![Bubble example](https:\/\/docs.databricks.com\/_images\/bubble.png) \n**Configuration values**: For this bubble chart visualization, the following values were set: \n* X (dataset column): `l_quantity`\n* Y columns (dataset column): `l_extendedprice`\n* Group by (dataset column): `l_returnflag`\n* Bubble size column (dataset column): `l_tax`\n* Bubble size coefficient: 20\n* X axis name (override default value): `Quantity`\n* Y axis name (override default value): `Extended price` \n**Configuration options**: For bubble chart configuration options, see [chart configuration options](https:\/\/docs.databricks.com\/visualizations\/charts.html#options). 
\n**SQL query**: For this bubble chart visualization, the following SQL query was used to generate the data set. \n```\nselect * from samples.tpch.lineitem\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/visualization-types.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization types\n##### Box chart\n\nThe box chart visualization shows the distribution summary of numerical data, optionally grouped by category. Using a box chart visualization, you can quickly compare the value ranges across categories and visualize the locality, spread, and skewness of the values through their quartiles. In each box, the darker line shows the interquartile range. For more information about interpreting box plot visualizations, see the [Box plot article](https:\/\/en.wikipedia.org\/wiki\/Box_plot) on Wikipedia. \nNote \nBox charts only support aggregation for up to 64,000 rows. If a dataset is larger than 64,000 rows, data will be truncated. \n![Box chart example](https:\/\/docs.databricks.com\/_images\/box.png) \n**Configuration values**: For this box chart visualization, the following values were set: \n* X column (dataset column): `l_returnflag`\n* Y columns (dataset column): `l_extendedprice`\n* Group by (dataset column): `l_shipmode`\n* X axis name (override default value): `Return flag`\n* Y axis name (override default value): `Extended price` \n**Configuration options**: For box chart configuration options, see [box chart configuration options](https:\/\/docs.databricks.com\/visualizations\/boxplot.html#options). \n**SQL query**: For this box chart visualization, the following SQL query was used to generate the data set. 
\n```\nselect * from samples.tpch.lineitem\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/visualization-types.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization types\n##### Combo chart\n\nCombo charts combine [line](https:\/\/docs.databricks.com\/visualizations\/visualization-types.html#line) and [bar](https:\/\/docs.databricks.com\/visualizations\/visualization-types.html#bar) charts to present the changes over time with proportionality. \nNote \nCombo charts support backend aggregations, providing support for queries returning more than 64K rows of data without truncation of the result set. \n![Combo example](https:\/\/docs.databricks.com\/_images\/combo.png) \n**Configuration values**: For this combo chart visualization, the following values were set: \n* X column (dataset column): `l_shipdate`\n* Y columns: \n+ First dataset column: `l_extendedprice`\n+ Aggregation type: average\n+ Second dataset column: `l_quantity`\n+ Aggregation type: average\n* X axis name (override default value): `Ship date`\n* Left Y axis name (override default value): `Quantity`\n* Right Y axis name (override default value): `Average price`\n* Series: \n+ Order1 (dataset column): `AVG(l_extendedprice)`\n+ Y axis: right\n+ Type: Line\n+ Order2 (dataset column): `AVG(l_quantity)`\n+ Y axis: left\n+ Type: Bar \n**Configuration options**: For combo chart configuration options, see [chart configuration options](https:\/\/docs.databricks.com\/visualizations\/charts.html#options). \n**SQL query**: For this combo chart visualization, the following SQL query was used to generate the data set. 
\n```\nselect * from samples.tpch.lineitem\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/visualization-types.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization types\n##### Cohort analysis\n\nA cohort analysis examines the outcomes of predetermined groups, called cohorts, as they progress through a set of stages. The cohort visualization only aggregates over dates (it allows for monthly aggregations). It does not do any other aggregations of data within the result set. All other aggregations are done within the query itself. \n![Cohort example](https:\/\/docs.databricks.com\/_images\/cohort.png) \n**Configuration values**: For this cohort visualization, the following values were set: \n* Date (bucket) (database column): `cohort_month`\n* Stage (database column): `months`\n* Bucket population size (database column): `size`\n* Stage value (database column): `active`\n* Time interval: `monthly` \n**Configuration options**: For cohort configuration options, see [cohort chart configuration options](https:\/\/docs.databricks.com\/visualizations\/cohorts.html#options). \n**SQL query**: For this cohort visualization, the following SQL query was used to generate the data set. 
\n```\n-- match each customer with its cohort by month\nwith cohort_dates as (\nSELECT o_custkey, min(date_trunc('month', o_orderdate)) as cohort_month\nFROM samples.tpch.orders\nGROUP BY 1\n),\n-- find the size of each cohort\ncohort_size as (\nSELECT cohort_month, count(distinct o_custkey) as size\nFROM cohort_dates\nGROUP BY 1\n)\n-- for each cohort and month thereafter, find the number of active customers\nSELECT\ncohort_dates.cohort_month,\nceil(months_between(date_trunc('month', samples.tpch.orders.o_orderdate), cohort_dates.cohort_month)) as months,\ncount(distinct samples.tpch.orders.o_custkey) as active,\nfirst(size) as size\nFROM samples.tpch.orders\nleft join cohort_dates on samples.tpch.orders.o_custkey = cohort_dates.o_custkey\nleft join cohort_size on cohort_dates.cohort_month = cohort_size.cohort_month\nWHERE datediff(date_trunc('month', samples.tpch.orders.o_orderdate), cohort_dates.cohort_month) != 0\nGROUP BY 1, 2\nORDER BY 1, 2\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/visualization-types.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization types\n##### Counter display\n\nCounters display a single value prominently, with an option to compare it against a target value. To use counters, specify which row of data to display on the counter visualization for the **Value Column** and **Target Column**. \nNote \nCounter only supports aggregation for up to 64,000 rows. If a dataset is larger than 64,000 rows, data will be truncated. 
\n![Counter example](https:\/\/docs.databricks.com\/_images\/counter1.png) \n**Configuration values**: For this counter visualization, the following values were set: \n* Value column \n+ Dataset column: `avg(o_totalprice)`\n+ Row: 1\n* Target column: \n+ Dataset column: `avg(o_totalprice)`\n+ Row: 2\n* Format target value: Enable \n**SQL query**: For this counter visualization, the following SQL query was used to generate the data set. \n```\nselect o_orderdate, avg(o_totalprice)\nfrom samples.tpch.orders\nGROUP BY 1\nORDER BY 1 DESC\n\n```\n\n#### Visualization types\n##### Funnel visualization\n\nThe funnel visualization helps analyze the change in a metric at different stages. To use the funnel, specify a `step` and a `value` column. \nNote \nFunnel only supports aggregation for up to 64,000 rows. If a dataset is larger than 64,000 rows, data will be truncated. \n![Funnel example](https:\/\/docs.databricks.com\/_images\/funnel.png) \n**Configuration values**: For this funnel visualization, the following values were set: \n* Step column (dataset column): `o_orderstatus`\n* Value column (dataset column): `Revenue` \n**SQL query**: For this funnel visualization, the following SQL query was used to generate the data set. \n```\nSELECT o_orderstatus, sum(o_totalprice) as Revenue\nFROM samples.tpch.orders\nGROUP BY 1\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/visualization-types.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization types\n##### Choropleth map visualization\n\nIn choropleth visualizations, geographic localities, such as countries or states, are colored according to the aggregate values of each key column. The query must return geographic locations by name. \nNote \nChoropleth visualizations do not do any aggregations of data within the result set. All aggregations must be computed within the query itself. 
\n![Map choropleth example](https:\/\/docs.databricks.com\/_images\/choropleth.png) \n**Configuration values**: For this choropleth visualization, the following values were set: \n* Map (dataset column): `Countries`\n* Geographic column (dataset column): `Nation`\n* Geographic type: Short name\n* Value column (dataset column): `revenue`\n* Clustering mode: equidistant \n**Configuration options**: For choropleth configuration options, see [choropleth configuration options](https:\/\/docs.databricks.com\/visualizations\/maps.html#map-choropleth-options). \n**SQL query**: For this choropleth visualization, the following SQL query was used to generate the data set. \n```\nSELECT\ninitcap(n_name) as Nation,\nsum(c_acctbal) as revenue\nFROM samples.tpch.customer\njoin samples.tpch.nation on n_nationkey = c_nationkey\nGROUP BY 1\n\n```\n\n#### Visualization types\n##### Marker map visualization\n\nIn marker visualizations, a marker is placed at a set of coordinates on the map. The query result must return latitude and longitude pairs. \nNote \nMarker visualizations do not do any aggregations of data within the result set. All aggregations must be computed within the query itself. \n![Map marker example](https:\/\/docs.databricks.com\/_images\/marker.png) \nThis marker example is generated from a dataset that includes both latitude and longitude values, which are not available in the Databricks sample datasets. For marker configuration options, see [marker configuration options](https:\/\/docs.databricks.com\/visualizations\/maps.html#map-marker-options).\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/visualization-types.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization types\n##### Pivot table visualization\n\nA pivot table visualization aggregates records from a query result into a new tabular display. It\u2019s similar to `PIVOT` or `GROUP BY` statements in SQL. 
You configure the pivot table visualization with drag-and-drop fields. \nNote \nPivot tables support backend aggregations, providing support for queries returning more than 64K rows of data without truncation of the result set. However, Pivot table (legacy) only supports aggregation for up to 64,000 rows. If a dataset is larger than 64,000 rows, data will be truncated. \n![Pivot table example](https:\/\/docs.databricks.com\/_static\/images\/visualizations\/pivot-table.png) \n**Configuration values**: For this pivot table visualization, the following values were set: \n* Select rows (dataset column): `l_returnflag`\n* Select columns (dataset column): `l_shipmode`\n* Cell \n+ Dataset column: `l_quantity`\n+ Aggregation type: Sum \n**SQL query**: For this pivot table visualization, the following SQL query was used to generate the data set. \n```\nselect * from samples.tpch.lineitem\n\n```\n\n#### Visualization types\n##### Sankey\n\nA sankey diagram visualizes the flow from one set of values to another. \nNote \nSankey visualizations do not do any aggregations of data within the result set. All aggregations must be computed within the query itself. \n![Sankey example](https:\/\/docs.databricks.com\/_images\/sankey.png) \n**SQL query**: For this Sankey visualization, the following SQL query was used to generate the data set. \n```\nSELECT pickup_zip as stage1, dropoff_zip as stage2, sum(fare_amount) as value\nFROM samples.nyctaxi.trips\nGROUP BY 1, 2\nORDER BY 3 DESC\nLIMIT 10\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/visualization-types.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### Visualization types\n##### Sunburst sequence\n\nA sunburst diagram helps visualize hierarchical data using concentric circles. \nNote \nSunburst sequence does not do any aggregations of data within the result set. All aggregations must be computed within the query itself. 
\n![Sunburst example](https:\/\/docs.databricks.com\/_images\/sunburst-sequence.png) \n**SQL query**: For this sunburst visualization, the following SQL query was used to generate the data set. \n```\nSELECT pickup_zip as stage1, dropoff_zip as stage2, sum(fare_amount) as value\nFROM samples.nyctaxi.trips\nGROUP BY 1, 2\nORDER BY 3 DESC\nLIMIT 10\n\n```\n\n#### Visualization types\n##### Table\n\nThe table visualization displays data in a standard table, but with the ability to manually reorder, hide, and format the data. See [Table options](https:\/\/docs.databricks.com\/visualizations\/tables.html). \nNote \nTable visualizations do not do any aggregations of data within the result set. All aggregations must be computed within the query itself. \nFor table configuration options, see [table configuration options](https:\/\/docs.databricks.com\/visualizations\/tables.html#options).\n\n#### Visualization types\n##### Word cloud\n\nA word cloud visually represents how frequently a word occurs in the data. \nNote \nWord cloud only supports aggregation for up to 64,000 rows. If a dataset is larger than 64,000 rows, data will be truncated. \n![Word cloud example](https:\/\/docs.databricks.com\/_images\/word-cloud.png) \n**Configuration values**: For this word cloud visualization, the following values were set: \n* Words column (dataset column): `o_comment`\n* Words Length Limit: 5\n* Frequencies limit: 2 \n**SQL query**: For this word cloud visualization, the following SQL query was used to generate the data set. 
\n```\nselect * from samples.tpch.orders\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/visualization-types.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use a Python wheel file in a Databricks job\n\nA Python [wheel file](https:\/\/peps.python.org\/pep-0427\/) is a standard way to package and distribute the files required to run a Python application. Using the Python wheel task, you can ensure fast and reliable installation of Python code in your Databricks jobs. This article provides an example of creating a Python wheel file and a job that runs the application packaged in the Python wheel file. In this example, you will: \n* Create the Python files defining an example application.\n* Bundle the example files into a Python wheel file.\n* Create a job to run the Python wheel file.\n* Run the job and view the results.\n\n##### Use a Python wheel file in a Databricks job\n###### Before you begin\n\nYou need the following to complete this example: \n* Python 3\n* The Python `wheel` and `setuptools` packages. You can use `pip` to install these packages. For example, you can run the following command to install these packages: \n```\npip install wheel setuptools\n\n```\n\n##### Use a Python wheel file in a Databricks job\n###### Step 1: Create a local directory for the example\n\nCreate a local directory to hold the example code and generated artifacts, for example, `databricks_wheel_test`.\n\n##### Use a Python wheel file in a Databricks job\n###### Step 2: Create the example Python script\n\nThe following Python example is a simple script that reads input arguments and prints out those arguments. Copy this script and save it to a path called `my_test_code\/__main__.py` in the directory you created in the previous step. 
\n```\n\"\"\"\nThe entry point of the Python Wheel\n\"\"\"\n\nimport sys\n\ndef main():\n    # This function prints the provided arguments\n    print('Hello from my func')\n    print('Got arguments:')\n    print(sys.argv)\n\nif __name__ == '__main__':\n    main()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-python-wheels-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use a Python wheel file in a Databricks job\n###### Step 3: Create a metadata file for the package\n\nThe following file contains metadata describing the package. Save this to a path called `my_test_code\/__init__.py` in the directory you created in step 1. \n```\n__version__ = \"0.0.1\"\n__author__ = \"Databricks\"\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-python-wheels-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use a Python wheel file in a Databricks job\n###### Step 4: Create the Python wheel file\n\nConverting the Python artifacts into a Python wheel file requires specifying package metadata such as the package name and entry points. The following script defines this metadata. \nNote \nThe `entry_points` defined in this script are used to run the package in the Databricks workflow. In each value in `entry_points`, the value before `=` (in this example, `run`) is the name of the entry point and is used to configure the Python wheel task. \n1. 
Save this script in a file named `setup.py` in the root of the directory you created in step 1: \n```\nfrom setuptools import setup, find_packages\n\nimport my_test_code\n\nsetup(\n    name='my_test_package',\n    version=my_test_code.__version__,\n    author=my_test_code.__author__,\n    url='https:\/\/databricks.com',\n    author_email='john.doe@databricks.com',\n    description='my test wheel',\n    packages=find_packages(include=['my_test_code']),\n    entry_points={\n        'group_1': 'run=my_test_code.__main__:main'\n    },\n    install_requires=[\n        'setuptools'\n    ]\n)\n\n```\n2. Change into the directory you created in step 1, and run the following command to package your code into the Python wheel distribution: \n```\npython3 setup.py bdist_wheel\n\n``` \nThis command creates the Python wheel file and saves it to the `dist\/my_test_package-0.0.1-py3-none-any.whl` file in your directory.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-python-wheels-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use a Python wheel file in a Databricks job\n###### Step 5: Create a Databricks job to run the Python wheel file\n\n1. Go to your Databricks landing page and do one of the following: \n* In the sidebar, click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** and click ![Create Job Button](https:\/\/docs.databricks.com\/_images\/create-job.png).\n* In the sidebar, click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** and select **Job** from the menu.\n2. In the task dialog box that appears on the **Tasks** tab, replace **Add a name for your job\u2026** with your job name, for example, `Python wheel example`.\n3. In **Task name**, enter a name for the task, for example, `python_wheel_task`.\n4. In **Type**, select **Python Wheel**.\n5. In **Package name**, enter `my_test_package`. 
The package name is the value assigned to the `name` variable in the `setup.py` script.\n6. In **Entry point**, enter `run`. The entry point is one of the values specified in the `entry_points` collection in the `setup.py` script. In this example, `run` is the only entry point defined.\n7. In **Cluster**, select a compatible cluster. See [Cluster-scoped libraries](https:\/\/docs.databricks.com\/libraries\/index.html#compatibility).\n8. Click **Add** under **Dependent Libraries**. In the **Add dependent library** dialog, with **Workspace** selected, drag the `my_test_package-0.0.1-py3-none-any.whl` file created in step 4 into the dialog\u2019s **Drop file here** area.\n9. Click **Add**.\n10. In **Parameters**, select **Positional arguments** or **Keyword arguments** to enter the key and the value of each parameter. Both positional and keyword arguments are passed to the Python wheel task as command-line arguments. \n* To enter positional arguments, enter parameters as a JSON-formatted array of strings, for example: `[\"first argument\",\"first value\",\"second argument\",\"second value\"]`.\n* To enter keyword arguments, click **+ Add** and enter a key and value. Click **+ Add** again to enter more arguments.\n11. Click **Save task**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-python-wheels-in-workflows.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n### Implement data processing and analysis workflows with Jobs\n##### Use a Python wheel file in a Databricks job\n###### Step 6: Run the job and view the job run details\n\nClick ![Run Now Button](https:\/\/docs.databricks.com\/_images\/run-now-button.png) to run the workflow. 
To view [details for the run](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#job-run-details), click **View run** in the **Triggered run** pop-up or click the link in the **Start time** column for the run in the [job runs](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#view-job-run-list) view. \nWhen the run completes, the output displays in the **Output** panel, including the arguments passed to the task.\n\n##### Use a Python wheel file in a Databricks job\n###### Next steps\n\nTo learn more about creating and running Databricks jobs, see [Create and run Databricks Jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-python-wheels-in-workflows.html"} +{"content":"# \n### ggplot2\n\nThe following notebook shows how to display [ggplot2](https:\/\/ggplot2.tidyverse.org\/) objects in R notebooks.\n\n### ggplot2\n#### ggplot2 R notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/ggplot2.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/ggplot2.html"} +{"content":"# Discover data\n### View frequent queries and users of a table\n\nYou can use the Insights tab in Catalog Explorer to view the most frequent recent queries and users of any table registered in Unity Catalog. The Insights tab reports on frequent queries and user access for the past 30 days. \nThis information can help you answer questions like: \n* Can I trust this data?\n* What are some good ways to use this data?\n* Which users can answer my questions about this data? 
\nNote \nThe queries and users listed on the Insights tab are limited to queries performed using Databricks SQL.\n\n### View frequent queries and users of a table\n#### Before you begin\n\nYou must have the following permissions to view frequent queries and user data on the Insights tab. \nIn Unity Catalog: \n* `SELECT` privilege on the table.\n* `USE SCHEMA` privilege on the table\u2019s parent schema.\n* `USE CATALOG` privilege on the table\u2019s parent catalog. \nMetastore admins have these privileges by default. See [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html). \nIn Databricks SQL: \n* CAN VIEW permissions on the queries. You will not see queries that you do not have permission to view. See [Query ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#query).\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/table-insights.html"} +{"content":"# Discover data\n### View frequent queries and users of a table\n#### View the Insights tab\n\n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** to open Catalog Explorer.\n2. Search for or navigate to the table you want insights on. \nSee [Search for workspace objects](https:\/\/docs.databricks.com\/search\/index.html) and [Explore database objects](https:\/\/docs.databricks.com\/discover\/database-objects.html).\n3. On the table page, click the **Insights** tab. \nQueries made on the table and users who accessed the table in the past 30 days are listed in order of frequency, with the most frequent on top. \nIn the **Insights** tab, you can view frequently used queries, dashboards, notebooks, and joined tables. 
\n![The Insights tab shows a table with the most frequent users, dashboards, and notebooks.](https:\/\/docs.databricks.com\/_images\/insights-tab.png) \nYou can find the most popular tables across catalogs by sorting tables in a schema by popularity. Popularity is determined by the number of interactive runs done against a table. \n![The Tables tab shows a list of popular tables, when they were created, and the owner.](https:\/\/docs.databricks.com\/_images\/insights-table.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/discover\/table-insights.html"} +{"content":"# AI and Machine Learning on Databricks\n### Deploy models for batch inference and prediction\n\nThis article describes how to deploy MLflow models for offline (batch and streaming) inference. Databricks recommends that you use MLflow to deploy machine learning models for batch or streaming inference. For general information about working with MLflow models, see [Log, load, register, and deploy MLflow models](https:\/\/docs.databricks.com\/mlflow\/models.html). \nFor information about real-time model serving on Databricks, see [Model serving with Databricks](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-inference\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n### Deploy models for batch inference and prediction\n#### Use MLflow for model inference\n\nMLflow helps you generate code for batch or streaming inference. 
\n* In the MLflow Model Registry, you can [automatically generate a notebook](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#generate-inference-nb) for batch or streaming inference via [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html).\n* In the [MLflow Run page](https:\/\/docs.databricks.com\/mlflow\/runs.html#run-details-screen) for your model, you can copy the generated code snippet for inference on pandas or Apache Spark DataFrames. \nYou can also customize the code generated by either of the above options. See the following notebooks for examples: \n* The [model inference example](https:\/\/docs.databricks.com\/mlflow\/model-example.html) uses a model trained with scikit-learn and previously logged to MLflow to show how to load a model and use it to make predictions on data in different formats. The notebook illustrates how to apply the model as a scikit-learn model to a pandas DataFrame, and how to apply the model as a PySpark UDF to a Spark DataFrame.\n* The [MLflow Model Registry example](https:\/\/docs.databricks.com\/mlflow\/models-in-uc-example.html) shows how to build, manage, and deploy a model with Model Registry. On that page, you can search for `.predict` to identify examples of offline (batch) predictions.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-inference\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n### Deploy models for batch inference and prediction\n#### Create a Databricks job\n\nTo run batch or streaming predictions as a job, create a notebook or JAR that includes the code used to perform the predictions. Then, execute the notebook or JAR as a Databricks [job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html). 
Jobs can be run either immediately or on a [schedule](https:\/\/docs.databricks.com\/workflows\/jobs\/schedule-jobs.html#job-schedule).\n\n### Deploy models for batch inference and prediction\n#### Streaming inference\n\nFrom the MLflow Model Registry, you can [automatically generate](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html#streaming-inference) a notebook that integrates the MLflow PySpark inference UDF with [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html). \nYou can also modify the generated inference notebook to use the Apache Spark [Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/index.html) API. \n### Inference with deep learning models \nFor information about and examples of deep learning model inference on Databricks, see the following articles: \n* [Deep learning model inference workflow](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/dl-model-inference.html)\n* [Deep learning model inference performance tuning guide](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/model-inference-performance.html)\n* [Reference solutions for machine learning](https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-inference\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n### Deploy models for batch inference and prediction\n#### Inference with MLlib and XGBoost4J models\n\nFor scalable model inference with MLlib and XGBoost4J models, use the native `transform` methods to perform inference directly on Spark DataFrames. 
The [MLlib example notebooks](https:\/\/docs.databricks.com\/machine-learning\/train-model\/mllib.html) include inference steps.\n\n### Deploy models for batch inference and prediction\n#### Customize and optimize model inference\n\nWhen you use the MLflow APIs to run inference on Spark DataFrames, you can load the model as a Spark UDF and apply it at scale using distributed computing. \nYou can customize your model to add pre-processing or post-processing and to optimize computational performance for large models. A good option for customizing models is the [MLflow pyfunc API](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.pyfunc.html#creating-custom-pyfunc-models), which allows you to wrap a model with custom logic. \nIf you need to do further customization, you can manually wrap your machine learning model in a Pandas UDF or a [pandas Iterator UDF](https:\/\/docs.databricks.com\/udf\/pandas.html). See the [deep learning examples](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/index.html#dl-examples). \nFor smaller datasets, you can also use the native model inference routines provided by the library. \n* [Model inference example](https:\/\/docs.databricks.com\/mlflow\/model-example.html)\n* [Deep learning model inference workflow](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/dl-model-inference.html)\n* [Deep learning model inference performance tuning guide](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/model-inference-performance.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-inference\/index.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to Matillion\n\nMatillion ETL is an ETL\/ELT tool built specifically for cloud database platforms including Databricks. Matillion ETL has a modern, browser-based UI, with powerful, push-down ETL\/ELT functionality. 
\nYou can integrate your Databricks SQL warehouses (formerly Databricks SQL endpoints) and Databricks clusters with Matillion.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/matillion.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to Matillion\n##### Connect to Matillion using Partner Connect\n\nThis section describes how to use Partner Connect to simplify the process of connecting an existing SQL warehouse or cluster in your Databricks workspace to Matillion. \n### Requirements \nSee the [requirements](https:\/\/docs.databricks.com\/partner-connect\/index.html#requirements) for using Partner Connect. \n### Steps to connect \nTo connect to Matillion using Partner Connect, follow the steps in this section. \nTip \nIf you have an existing Matillion account, Databricks recommends that you connect to Matillion manually. This is because the connection experience in Partner Connect is optimized for new partner accounts. \n1. In the sidebar, click ![Partner Connect button](https:\/\/docs.databricks.com\/_images\/partner-connect.png) **Partner Connect**.\n2. Click the **Matillion** tile. \nThe **Email** box displays the email address for your Databricks account. Matillion uses this email address to prompt you to either create a new Matillion account or sign in to your existing Matillion account.\n3. Click **Connect to Matillion ETL** or **Sign in**. \nA new tab opens in your browser that displays the Matillion Hub.\n4. Complete the on-screen instructions in Matillion to create your 14-day trial Matillion account or to sign in to your existing Matillion account. \nImportant \nIf an error displays stating that someone from your organization has already created an account with Matillion, contact one of your organization\u2019s administrators and have them add you to your organization\u2019s Matillion account. After they add you, sign in to your existing Matillion account.\n5. 
Complete the on-screen instructions to provide your job details, then click **Continue**.\n6. Complete the on-screen instructions to create an organization, then click **Continue**.\n7. Click the organization you created, then click **Add Matillion ETL instance**.\n8. Click **Continue in AWS**. \nThe Amazon EC2 console opens.\n9. Follow [Launching Matillion ETL using Amazon Machine Image](https:\/\/documentation.matillion.com\/docs\/2568307) in the Matillion ETL documentation, starting with step 5. Then follow [Accessing Matillion ETL on Amazon Web Services (EC2)](https:\/\/documentation.matillion.com\/docs\/2957722#accessing-matillion-etl-on-amazon-web-services-ec2) in the Matillion ETL documentation.\n10. Follow the instructions in the [Matillion ETL documentation](https:\/\/docs.matillion.com\/metl\/docs\/associating-matillion-etl-instances\/). \nMatillion ETL opens in your browser, and the **Create Project** dialog box displays.\n11. Follow [Create a Delta Lake on Databricks project](https:\/\/documentation.matillion.com\/docs\/7422791#creating-a-delta-lake-on-databricks-project-on-aws) in the Matillion documentation. \nFor the settings in the **Delta Lake Connection** section within these instructions, enter the following information: \n* For **Workspace ID**, enter the ID of your Databricks workspace. See [Workspace instance names, URLs, and IDs](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url).\n* For **Username**, enter the word `token`.\n* For **Password**, enter the value of a Databricks personal access token. \nTo get the **Workspace ID** and generate a personal access token, do the following: \n1. Return to the Partner Connect tab in your browser.\n2. Take note of the **Workspace ID**.\n3. Click **Generate a new token**. \nA new tab opens in your browser that displays the **Settings** page of the Databricks UI.\n4. Click **Generate new token**.\n5. Optionally enter a description (comment) and expiration period.\n6. 
Click **Generate**.\n7. Copy the generated personal access token and store it in a secure location.\n8. Return to the Matillion tab in your browser. \nFor the settings in the **Delta Lake Defaults** section within these instructions, for **Cluster**, choose the name of the SQL warehouse or cluster.\n12. Continue with [Next steps](https:\/\/docs.databricks.com\/partners\/prep\/matillion.html#next-steps).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/matillion.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to Matillion\n##### Connect to Matillion manually\n\nThis section describes how to connect an existing SQL warehouse or cluster in your Databricks workspace to Matillion manually. \nNote \nYou can connect to Matillion using Partner Connect to simplify the experience. \n### Requirements \nBefore you integrate with Matillion manually, you must have the following: \n* A [registered Matillion Hub account](https:\/\/docs.matillion.com\/data-productivity-cloud\/hub\/docs\/registration\/).\n* A Matillion ETL instance, which you can launch by using [AWS CloudFormation](https:\/\/documentation.matillion.com\/docs\/2568306), an [Amazon Machine Image (AMI)](https:\/\/documentation.matillion.com\/docs\/2568307), or the [AWS Marketplace](https:\/\/documentation.matillion.com\/docs\/2127001).\n* A Databricks personal access token. \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. 
To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \n### Steps to connect \nTo connect to Matillion manually, do the following: \n1. Get the name of the existing compute resource that you want to use (a SQL warehouse or cluster) within your workspace. Later, you will choose that name to complete the connection between your compute resource and your Matillion ETL instance. \n* To view SQL warehouses in your workspace, click ![Endpoints Icon](https:\/\/docs.databricks.com\/_images\/warehouses-icon.png) **SQL Warehouses** in the sidebar. To create a new SQL warehouse, see [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* To view the clusters in your workspace, click ![compute icon](https:\/\/docs.databricks.com\/_images\/clusters-icon.png) **Compute** in the sidebar. To create a cluster, see [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n2. Follow [Connect to your Matillion ETL instance and log in to it](https:\/\/documentation.matillion.com\/docs\/2957722) in the Matillion documentation.\n3. Follow [Create a Delta Lake on Databricks project](https:\/\/documentation.matillion.com\/docs\/7422791#creating-a-delta-lake-on-databricks-project-on-aws) in the Matillion documentation. \nFor the settings in the **Delta Lake Connection** section within these instructions, enter the following information: \n* For **Workspace ID**, enter the ID of your Databricks workspace. See [Workspace instance names, URLs, and IDs](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url).\n* For **Username**, enter the word `token`.\n* For **Password**, enter the Databricks personal access token. \nFor the settings in the **Delta Lake Defaults** section within these instructions, for **Cluster**, choose the name of the SQL warehouse or cluster.\n4. 
Continue with [Next steps](https:\/\/docs.databricks.com\/partners\/prep\/matillion.html#next-steps).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/matillion.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n#### Connect to Matillion\n##### Next steps\n\nExplore one or more of the following resources on the Matillion website: \n* [Matillion ETL Product Overview](https:\/\/documentation.matillion.com\/docs\/1975061)\n* [UI and Basic Functions](https:\/\/documentation.matillion.com\/docs\/2694747)\n* [Documentation](https:\/\/documentation.matillion.com\/docs)\n* [Support](https:\/\/support.matillion.com\/s\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/prep\/matillion.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n### Pandas API on Spark\n##### Can you use pandas on Databricks?\n\nDatabricks Runtime includes pandas as one of the standard Python packages, allowing you to create and leverage pandas DataFrames in Databricks notebooks and jobs. \nIn Databricks Runtime 10.4 LTS and above, [Pandas API on Spark](https:\/\/docs.databricks.com\/pandas\/pandas-on-spark.html) provides familiar pandas commands on top of PySpark DataFrames. You can also [convert DataFrames between pandas and PySpark](https:\/\/docs.databricks.com\/pandas\/pyspark-pandas-conversion.html). \nApache Spark includes Arrow-optimized execution of Python logic in the form of [pandas function APIs](https:\/\/docs.databricks.com\/pandas\/pandas-function-apis.html), which allow users to apply pandas transformations directly to PySpark DataFrames. 
Apache Spark also supports [pandas UDFs](https:\/\/docs.databricks.com\/udf\/pandas.html), which use similar Arrow-optimizations for arbitrary user functions defined in Python.\n\n","doc_uri":"https:\/\/docs.databricks.com\/pandas\/index.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n### Pandas API on Spark\n##### Can you use pandas on Databricks?\n###### Where does pandas store data on Databricks?\n\nYou can use pandas to store data in many different locations on Databricks. Your ability to store and load data from some locations depends on configurations set by workspace administrators. \nNote \nDatabricks recommends storing production data on cloud object storage. See [Connect to Google Cloud Storage](https:\/\/docs.databricks.com\/connect\/storage\/gcs.html). \nFor quick exploration and data without sensitive information, you can safely save data using either relative paths or the [DBFS](https:\/\/docs.databricks.com\/dbfs\/index.html), as in the following examples: \n```\nimport pandas as pd\n\ndf = pd.DataFrame([[\"a\", 1], [\"b\", 2], [\"c\", 3]])\n\ndf.to_csv(\".\/relative_path_test.csv\")\ndf.to_csv(\"\/dbfs\/dbfs_test.csv\")\n\n``` \nYou can explore files written to the DBFS with the `%fs` magic command, as in the following example. Note that the `\/dbfs` directory is the root path for these commands. \n```\n%fs ls\n\n``` \nWhen you save to a relative path, the location of your file depends on where you execute your code. If you\u2019re using a Databricks notebook, your data file saves to the volume storage attached to the driver of your cluster. Data stored in this location is permanently deleted when the cluster terminates. If you\u2019re using [Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html) with arbitrary file support enabled, your data saves to the root of your current project. 
In either case, you can explore the files written using the `%sh` magic command, which allows simple bash operations relative to your current root directory, as in the following example: \n```\n%sh ls\n\n``` \nFor more information on how Databricks stores various files, see [Work with files on Databricks](https:\/\/docs.databricks.com\/files\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/pandas\/index.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n### Pandas API on Spark\n##### Can you use pandas on Databricks?\n###### How do you load data with pandas on Databricks?\n\nDatabricks provides a number of options to facilitate uploading data to the workspace for exploration. The preferred method to load data with pandas varies depending on how you load your data to the workspace. \nIf you have small data files stored alongside notebooks on your local machine, you can upload your data and code together with [Git folders](https:\/\/docs.databricks.com\/files\/workspace.html). You can then use relative paths to load data files. \nDatabricks provides extensive [UI-based options for data loading](https:\/\/docs.databricks.com\/ingestion\/add-data\/index.html). Most of these options store your data as Delta tables. You can [read a Delta table](https:\/\/docs.databricks.com\/delta\/tutorial.html#read) to a Spark DataFrame, and then [convert that to a pandas DataFrame](https:\/\/docs.databricks.com\/pandas\/pyspark-pandas-conversion.html). \nIf you have saved data files using DBFS or relative paths, you can use DBFS or relative paths to reload those data files. The following code provides an example: \n```\nimport pandas as pd\n\ndf = pd.read_csv(\".\/relative_path_test.csv\")\ndf = pd.read_csv(\"\/dbfs\/dbfs_test.csv\")\n\n``` \nDatabricks recommends storing production data on cloud object storage. See [Connect to Amazon S3](https:\/\/docs.databricks.com\/connect\/storage\/amazon-s3.html). 
\nIf you\u2019re in a Unity Catalog-enabled workspace, you can access cloud storage with external locations. See [Create an external location to connect cloud storage to Databricks](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html). \nYou can load data directly from S3 using pandas and a fully qualified URL. You need to provide cloud credentials to access cloud data. \n```\ndf = pd.read_csv(\nf\"s3:\/\/{bucket_name}\/{file_path}\",\nstorage_options={\n\"key\": aws_access_key_id,\n\"secret\": aws_secret_access_key,\n\"token\": aws_session_token\n}\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/pandas\/index.html"} +{"content":"# Technology partners\n### Connect to data governance partners using Partner Connect\n\nTo connect your Databricks workspace to a data governance partner solution using Partner Connect, you typically follow the steps in this article. \nImportant \nBefore you follow the steps in this article, see the appropriate partner article for important partner-specific information. There might be differences in the connection steps between partner solutions. For example, some partner solutions allow you to connect Databricks SQL warehouses (formerly Databricks SQL endpoints) or Databricks clusters, but not both.\n\n### Connect to data governance partners using Partner Connect\n#### Requirements\n\nSee the [requirements](https:\/\/docs.databricks.com\/partner-connect\/index.html#requirements) for using Partner Connect. \nImportant \nFor partner-specific requirements, see the appropriate partner article.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/data-governance.html"} +{"content":"# Technology partners\n### Connect to data governance partners using Partner Connect\n#### Steps to connect to a data governance partner\n\nTo connect your Databricks workspace to a data governance partner solution, do the following: \n1. In the sidebar, click **Partner Connect**.\n2. Click the partner tile. 
\nIf the partner tile has a check mark icon, a workspace admin has already used Partner Connect to connect your workspace to the partner. Click **Sign in** to sign in to your existing partner account and skip the rest of the steps in this section.\n3. If there are no SQL warehouses in your workspace, do the following: \n1. Click **Create warehouse**. A new tab opens in your browser that displays the **New SQL Warehouse** page in the Databricks SQL UI.\n2. Follow the steps in [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n3. Return to the Partner Connect tab in your browser, then close the partner tile.\n4. Re-open the partner tile.\n4. Select a SQL warehouse from the drop-down list. If your SQL warehouse is stopped, click **Start**.\n5. Select a catalog and a schema from the drop-down lists, then click **Add**. You can repeat this step to add multiple schemas. \nNote \nIf a partner doesn\u2019t support Unity Catalog with Partner Connect, the default catalog for your Unity Catalog enabled workspace is used. If your workspace isn\u2019t Unity Catalog enabled, the legacy Hive metastore (`hive_metastore`) is used.\n6. Click **Next**. 
\nPartner Connect creates the following resources in your workspace: \n* A Databricks service principal named **`<PARTNER>_USER`**.\n* A Databricks personal access token that is associated with the **`<PARTNER>_USER`** service principal. \nPartner Connect also grants the following privileges to the **`<PARTNER>_USER`** service principal: \n* (Unity Catalog) `USE CATALOG`: Required to interact with objects within the selected catalog.\n* (Unity Catalog) `USE SCHEMA`: Required to interact with objects within the selected schema.\n* (Legacy Hive metastore) `USAGE`: Required to grant the `SELECT` and `READ_METADATA` privileges for the schemas you selected.\n* `SELECT`: Grants the ability to read the schemas you selected.\n* (Legacy Hive metastore) `READ_METADATA`: Grants the ability to read metadata for the schemas you selected.\n* CAN USE: Grants permissions to use the SQL warehouse you selected.\n7. Click **Next**.\n8. Click **Connect to `<Partner>`**. \nA new tab that displays the partner website opens in your web browser.\n9. Complete the on-screen instructions on the partner website to create your trial partner account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/data-governance.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Hevo Data\n\nHevo Data is an end-to-end data pipeline platform that allows you to ingest data from 150+ sources, load it into the Databricks lakehouse, then transform it to derive business insights. \nYou can connect to Hevo Data using a Databricks SQL warehouse (formerly Databricks SQL endpoints) or a Databricks cluster.\n\n#### Connect to Hevo Data\n##### Connect to Hevo Data using Partner Connect\n\nTo connect to Hevo Data using Partner Connect, see [Connect to ingestion partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/ingestion.html). \nNote \nPartner Connect only supports SQL warehouses for Hevo Data. 
To connect using a cluster, do so manually.\n\n#### Connect to Hevo Data\n##### Connect to Hevo Data manually\n\nThis section describes how to connect to Hevo Data manually. \nNote \nYou can use Partner Connect to simplify the connection experience with a SQL warehouse. \nTo connect to Hevo Data manually, complete the following steps in the Hevo Data documentation: \n1. [Create a Hevo Data account](https:\/\/docs.hevodata.com\/getting-started\/creating-your-hevo-account\/creating-an-account\/) or sign in to your existing Hevo account.\n2. [Configure Databricks as a Destination](https:\/\/docs.hevodata.com\/destinations\/data-warehouses\/databricks\/).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/hevo.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Hevo Data\n##### Next steps\n\nFollow the steps in the Hevo Data documentation to do the following: \n1. [Create a Pipeline](https:\/\/docs.hevodata.com\/pipelines\/working-with-pipelines\/creating-a-pipeline\/) to move your data from a source system to the Databricks lakehouse.\n2. Create [Models](https:\/\/docs.hevodata.com\/transform\/models\/working-with-models\/) and [Workflows](https:\/\/docs.hevodata.com\/transform\/workflows\/working-with-workflows\/#creating-a-workflow) to transform your data in the Databricks lakehouse for analysis and reporting.\n\n#### Connect to Hevo Data\n##### Additional resources\n\nExplore the following Hevo Data resources: \n* [Website](https:\/\/hevodata.com\/)\n* [Documentation](https:\/\/docs.hevodata.com\/)\n* [Support](https:\/\/docs.hevodata.com\/introduction\/support\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/hevo.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Large language models (LLMs) on Databricks\n\nDatabricks makes it simple to access and build off of publicly available large language models. 
\nDatabricks Runtime for Machine Learning includes libraries like Hugging Face Transformers and LangChain that allow you to integrate existing pre-trained models or other open-source libraries into your workflow. From here, you can leverage Databricks platform capabilities to fine-tune LLMs using your own data for better domain performance. \nIn addition, Databricks offers built-in functionality for SQL users to access and experiment with LLMs like Azure OpenAI and OpenAI using AI functions.\n\n### Large language models (LLMs) on Databricks\n#### Foundation Model Training\n\nImportant \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). Reach out to your Databricks account team to enroll in the Public Preview. \nFoundation Model Training is a simple interface to the Databricks training stack to perform full model fine-tuning. \nYou can do the following using Foundation Model Training: \n* Fine-tune a model with your custom data, with the checkpoints saved to MLflow. You retain complete control of the fine-tuned model.\n* Automatically register the model to Unity Catalog, allowing easy deployment with model serving.\n* Fine-tune a completed, proprietary model by loading the weights of a previously fine-tuned model. \nSee [Foundation Model Training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/index.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Large language models (LLMs) on Databricks\n#### Hugging Face Transformers\n\nWith Hugging Face Transformers on Databricks you can scale out your natural language processing (NLP) batch applications and fine-tune models for large-language model applications. \nThe Hugging Face `transformers` library comes preinstalled on Databricks Runtime 10.4 LTS ML and above. 
Many of the popular NLP models work best on GPU hardware, so you might get the best performance using recent GPU hardware unless you use a model specifically optimized for use on CPUs. \n* [What are Hugging Face Transformers?](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/index.html)\n* [Fine-tune Hugging Face models for a single GPU](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/fine-tune-model.html)\n* [Model inference using Hugging Face Transformers for NLP](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/model-inference-nlp.html)\n\n### Large language models (LLMs) on Databricks\n#### LangChain\n\nLangChain is available as an experimental MLflow flavor which allows LangChain customers to leverage the robust tools and experiment tracking capabilities of MLflow directly from the Databricks environment. \nLangChain is a software framework designed to help create applications that utilize large language models (LLMs) and combine them with external data to bring more training context for your LLMs. \nDatabricks Runtime ML includes `langchain` in Databricks Runtime 13.1 ML and above. \nLearn about [Databricks specific LangChain integrations](https:\/\/docs.databricks.com\/large-language-models\/langchain.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/index.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Large language models (LLMs) on Databricks\n#### AI functions\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
\n[AI functions](https:\/\/docs.databricks.com\/large-language-models\/ai-functions.html) are built-in SQL functions that allow SQL users to: \n* Use Databricks Foundation Model APIs to complete various tasks on your company\u2019s data.\n* Access external models like GPT-4 from OpenAI and experiment with them.\n* Query models hosted by Databricks model serving endpoints from SQL queries.\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/index.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Connect BI tools to Unity Catalog\n\nThis article provides information about connecting business intelligence (BI) tools to Unity Catalog.\n\n#### Connect BI tools to Unity Catalog\n##### JDBC\n\nTo access data registered in Unity Catalog over JDBC, use [Simba JDBC driver version 2.6.21 or above](https:\/\/databricks.com\/spark\/jdbc-drivers-download). \nSee [Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/index.html).\n\n#### Connect BI tools to Unity Catalog\n##### ODBC\n\nTo access data registered in Unity Catalog over ODBC, use [Simba ODBC driver version 2.6.19 or above](https:\/\/databricks.com\/spark\/odbc-drivers-download). \nSee [Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/index.html).\n\n#### Connect BI tools to Unity Catalog\n##### Looker\n\nTo use data managed by Unity Catalog in Looker, use the [Simba JDBC driver version 2.6.21 or above](https:\/\/databricks.com\/spark\/jdbc-drivers-download). See [Connect to Looker](https:\/\/docs.databricks.com\/partners\/bi\/looker.html).\n\n#### Connect BI tools to Unity Catalog\n##### Power BI\n\nTo access data registered in Unity Catalog using Power BI, use Power BI Desktop version 2.98.683.0 or above (October 2021 release). 
\nSee [Connect Power BI to Databricks](https:\/\/docs.databricks.com\/partners\/bi\/power-bi.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/business-intelligence.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Connect BI tools to Unity Catalog\n##### Tableau\n\nTo access data registered in Unity Catalog using Tableau, use Tableau Desktop version 2021.4 with [Simba ODBC driver version 2.6.19 or above](https:\/\/databricks.com\/spark\/odbc-drivers-download). \nSee [Connect Tableau to Databricks](https:\/\/docs.databricks.com\/partners\/bi\/tableau.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/business-intelligence.html"} +{"content":"# Get started: Account and workspace setup\n## Best practice articles\n","doc_uri":"https:\/\/docs.databricks.com\/cheat-sheet\/jobs.html"} +{"content":"# Get started: Account and workspace setup\n## Best practice articles\n#### Production job scheduling cheat sheet\n\nThis article aims to provide clear and opinionated guidance for production job scheduling. Using best practices can help reduce costs, improve performance, and tighten security. \n| Best Practice | Impact | Docs |\n| --- | --- | --- |\n| Use jobs clusters for automated workflows | **Cost**: Jobs clusters are billed at lower rates than interactive clusters. | * [Create a cluster](https:\/\/docs.databricks.com\/compute\/configure.html) * [All-purpose and job clusters](https:\/\/docs.databricks.com\/compute\/index.html). |\n| Restart long-running clusters | **Security**: Restart clusters to take advantage of patches and bug fixes to the Databricks Runtime. 
| * [Restart a cluster to update it with the latest images](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#restart) |\n| Use service principals instead of user accounts to run production jobs | **Security**: If jobs are owned by individual users, when those users leave the org, these jobs may stop running. | * [Manage service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) |\n| Use Databricks Workflows for orchestration whenever possible | **Cost**: There\u2019s no need to use external tools to orchestrate if you are only orchestrating workloads on Databricks. | * [Introduction to Databricks Workflows](https:\/\/docs.databricks.com\/workflows\/index.html) |\n| Use latest LTS version of Databricks Runtime | **Performance and cost**: Databricks is always improving Databricks Runtime for usability, performance, and security. | * [Compute](https:\/\/docs.databricks.com\/compute\/index.html) * [Databricks runtime support lifecycles](https:\/\/docs.databricks.com\/release-notes\/runtime\/databricks-runtime-ver.html) |\n| Don\u2019t store production data in DBFS root | **Security**: When data is stored in the DBFS root, all users can access it. | * [What is DBFS?](https:\/\/docs.databricks.com\/dbfs\/index.html) * [Recommendations for working with DBFS root](https:\/\/docs.databricks.com\/dbfs\/dbfs-root.html) |\n\n","doc_uri":"https:\/\/docs.databricks.com\/cheat-sheet\/jobs.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Notebook-scoped Python libraries\n\nNotebook-scoped libraries let you create, modify, save, reuse, and share custom Python environments that are specific to a notebook. When you install a notebook-scoped library, only the current notebook and any jobs associated with that notebook have access to that library. Other notebooks attached to the same cluster are not affected. \nNotebook-scoped libraries do not persist across sessions. 
You must reinstall notebook-scoped libraries at the beginning of each session, or whenever the notebook is detached from a cluster. \nDatabricks recommends using the `%pip` magic command to install notebook-scoped Python libraries. \nYou can use `%pip` in notebooks scheduled as jobs. If you need to manage the Python environment in a Scala, SQL, or R notebook, use the `%python` magic command in conjunction with `%pip`. \nYou might experience more traffic to the driver node when working with notebook-scoped library installs. See [How large should the driver node be when working with notebook-scoped libraries?](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html#driver). \nTo install libraries for all notebooks attached to a cluster, use cluster libraries. See [Cluster libraries](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html). \nNote \nOn Databricks Runtime 10.4 LTS and below, you can use the (legacy) Databricks library utility. The library utility is supported only on Databricks Runtime, not Databricks Runtime ML. See [Library utility (dbutils.library) (legacy)](https:\/\/docs.databricks.com\/archive\/dev-tools\/dbutils-library.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Notebook-scoped Python libraries\n##### Manage libraries with `%pip` commands\n\nThe `%pip` command is equivalent to the [pip](https:\/\/pip.pypa.io\/en\/stable\/user_guide\/) command and supports the same API. The following sections show examples of how you can use `%pip` commands to manage your environment. For more information on installing Python packages with `pip`, see the [pip install documentation](https:\/\/pip.pypa.io\/en\/stable\/reference\/pip_install\/) and related pages. \nImportant \n* Starting with Databricks Runtime 13.0 `%pip` commands do not automatically restart the Python process. 
If you install a new package or update an existing package, you may need to use `dbutils.library.restartPython()` to see the new packages. See [Restart the Python process on Databricks](https:\/\/docs.databricks.com\/libraries\/restart-python-process.html).\n* On Databricks Runtime 12.2 LTS and below, Databricks recommends placing all `%pip` commands at the beginning of the notebook. The notebook state is reset after any `%pip` command that modifies the environment. If you create Python methods or variables in a notebook, and then use `%pip` commands in a later cell, the methods or variables are lost.\n* Upgrading, modifying, or uninstalling core Python packages (such as IPython) with `%pip` may cause some features to stop working as expected. If you experience such problems, reset the environment by detaching and re-attaching the notebook or by restarting the cluster.\n\n#### Notebook-scoped Python libraries\n##### Install a library with `%pip`\n\n```\n%pip install matplotlib\n\n```\n\n#### Notebook-scoped Python libraries\n##### Install a Python wheel package with `%pip`\n\n```\n%pip install \/path\/to\/my_package.whl\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Notebook-scoped Python libraries\n##### Uninstall a library with `%pip`\n\nNote \nYou cannot uninstall a library that is included in [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html) or a library that has been installed as a [cluster library](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html). 
If you have installed a different library version than the one included in Databricks Runtime or the one installed on the cluster, you can use `%pip uninstall` to revert the library to the default version in Databricks Runtime or the version installed on the cluster, but you cannot use a `%pip` command to uninstall the version of a library included in Databricks Runtime or installed on the cluster. \n```\n%pip uninstall -y matplotlib\n\n``` \nThe `-y` option is required.\n\n#### Notebook-scoped Python libraries\n##### Install a library from a version control system with `%pip`\n\n```\n%pip install git+https:\/\/github.com\/databricks\/databricks-cli\n\n``` \nYou can add parameters to the URL to specify things like the version or git subdirectory. See the [VCS support](https:\/\/pip.pypa.io\/en\/stable\/topics\/vcs-support\/) for more information and for examples using other version control systems.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Notebook-scoped Python libraries\n##### Install a private package with credentials managed by Databricks secrets with `%pip`\n\nPip supports installing packages from private sources with [basic authentication](https:\/\/pip.pypa.io\/en\/stable\/user_guide\/#basic-authentication-credentials), including private version control systems and private package repositories, such as [Nexus](https:\/\/www.sonatype.com\/nexus\/repository-pro) and [Artifactory](https:\/\/jfrog.com\/artifactory\/). Secret management is available via the Databricks Secrets API, which allows you to store authentication tokens and passwords. Use the [DBUtils API](https:\/\/docs.databricks.com\/dev-tools\/databricks-connect\/python\/databricks-utilities.html) to access secrets from your notebook. Note that you can use `$variables` in magic commands. 
\nTo install a package from a private repository, specify the repository URL with the `--index-url` option to `%pip install` or add it to the `pip` config file at `~\/.pip\/pip.conf`. \n```\ntoken = dbutils.secrets.get(scope=\"scope\", key=\"key\")\n\n``` \n```\n%pip install --index-url https:\/\/<user>:$token@<your-package-repository>.com\/<path\/to\/repo> <package>==<version> --extra-index-url https:\/\/pypi.org\/simple\/\n\n``` \nSimilarly, you can use secret management with magic commands to install private packages from version control systems. \n```\ntoken = dbutils.secrets.get(scope=\"scope\", key=\"key\")\n\n``` \n```\n%pip install git+https:\/\/<user>:$token@<gitprovider>.com\/<path\/to\/repo>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Notebook-scoped Python libraries\n##### Install a package from DBFS with `%pip`\n\nImportant \nAny workspace user can modify files stored in DBFS. Databricks recommends storing files in workspaces or on Unity Catalog volumes. \nYou can use `%pip` to install a private package that has been saved on DBFS. \nWhen you upload a file to DBFS, it automatically renames the file, replacing spaces, periods, and hyphens with underscores. For Python wheel files, `pip` requires that the name of the file use periods in the version (for example, 0.1.0) and hyphens instead of spaces or underscores, so these filenames are not changed. \n```\n%pip install \/dbfs\/mypackage-0.0.1-py3-none-any.whl\n\n```\n\n#### Notebook-scoped Python libraries\n##### Install a package from a volume with `%pip`\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nWith Databricks Runtime 13.3 LTS and above, you can use `%pip` to install a private package that has been saved to a volume. 
\nWhen you upload a file to a volume, it automatically renames the file, replacing spaces, periods, and hyphens with underscores. For Python wheel files, `pip` requires that the name of the file use periods in the version (for example, 0.1.0) and hyphens instead of spaces or underscores, so these filenames are not changed. \n```\n%pip install \/Volumes\/<catalog>\/<schema>\/<path-to-library>\/mypackage-0.0.1-py3-none-any.whl\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Notebook-scoped Python libraries\n##### Install a package stored as a workspace file with `%pip`\n\nWith Databricks Runtime 11.3 LTS and above, you can use `%pip` to install a private package that has been saved as a workspace file. \n```\n%pip install \/Workspace\/<path-to-whl-file>\/mypackage-0.0.1-py3-none-any.whl\n\n```\n\n#### Notebook-scoped Python libraries\n##### Save libraries in a requirements file\n\n```\n%pip freeze > \/Workspace\/shared\/prod_requirements.txt\n\n``` \nAny subdirectories in the file path must already exist. If you run `%pip freeze > \/Workspace\/<new-directory>\/requirements.txt`, the command fails if the directory `\/Workspace\/<new-directory>` does not already exist.\n\n#### Notebook-scoped Python libraries\n##### Use a requirements file to install libraries\n\nA [requirements file](https:\/\/pip.pypa.io\/en\/stable\/user_guide\/#requirements-files) contains a list of packages to be installed using `pip`. 
An example of using a requirements file is: \n```\n%pip install -r \/Workspace\/shared\/prod_requirements.txt\n\n``` \nSee [Requirements File Format](https:\/\/pip.pypa.io\/en\/stable\/reference\/pip_install\/#requirements-file-format) for more information on `requirements.txt` files.\n\n#### Notebook-scoped Python libraries\n##### How large should the driver node be when working with notebook-scoped libraries?\n\nUsing notebook-scoped libraries might result in more traffic to the driver node as it works to keep the environment consistent across executor nodes. \nWhen you use a cluster with 10 or more nodes, Databricks recommends these specs as a minimum requirement for the driver node: \n* For a 100 node CPU cluster, use i3.8xlarge.\n* For a 10 node GPU cluster, use p2.xlarge. \nFor larger clusters, use a larger driver node.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Notebook-scoped Python libraries\n##### Can I use `%sh pip`, `!pip`, or `pip`? What is the difference?\n\n`%sh` and `!` execute a shell command in a notebook; the former is a Databricks [auxiliary magic command](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#language-magic) while the latter is a feature of IPython. `pip` is a shorthand for `%pip` when [automagic](https:\/\/ipython.readthedocs.io\/en\/stable\/interactive\/magics.html#magic-automagic) is enabled, which is the default in Databricks Python notebooks. \nOn Databricks Runtime 11.3 LTS and above, `%pip`, `%sh pip`, and `!pip` all install a library as a notebook-scoped Python library. On Databricks Runtime 10.4 LTS and below, Databricks recommends using only `%pip` or `pip` to install notebook-scoped libraries. 
The behavior of `%sh pip` and `!pip` is not consistent in Databricks Runtime 10.4 LTS and below.\n\n#### Notebook-scoped Python libraries\n##### Known issues\n\n* On Databricks Runtime 9.1 LTS, notebook-scoped libraries are incompatible with batch streaming jobs. Databricks recommends using [cluster libraries](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html) or the [IPython kernel](https:\/\/docs.databricks.com\/notebooks\/ipython-kernel.html) instead.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Cost optimization for the data lakehouse\n##### Best practices for cost optimization\n\nThis article covers best practices supporting principles of **cost optimization**, organized by principle.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/cost-optimization\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Cost optimization for the data lakehouse\n##### Best practices for cost optimization\n###### 1. Choose the correct resources\n\n### Use Delta Lake \nDelta Lake comes with many performance improvements that can significantly speed up a workload (compared to using Parquet, ORC, and JSON). See [Optimization recommendations on Databricks](https:\/\/docs.databricks.com\/optimizations\/index.html). If the workload also runs on a job cluster, this directly leads to a shorter runtime of the cluster and lower costs. \n### Use job clusters \nA job is a way to run non-interactive code in a Databricks cluster. For example, you can run an extract, transform, and load (ETL) workload interactively or on a schedule. Of course, you can also run jobs interactively in the notebook UI. 
However, on job clusters, the non-interactive workloads will cost significantly less than on all-purpose clusters. See the [pricing overview](https:\/\/www.databricks.com\/product\/aws-pricing) to compare \u201cJobs Compute\u201d and \u201cAll-Purpose Compute\u201d. \nAn additional advantage is that every job or workflow runs on a new cluster, isolating workloads from one another. \nNote \nMultitask workflows can reuse compute resources for all tasks, so that the cluster startup time only appears once per workflow. See [Use Databricks compute with your jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/use-compute.html). \n### Use SQL warehouse for SQL workloads \nFor interactive SQL workloads, a [Databricks SQL warehouse](https:\/\/docs.databricks.com\/admin\/sql\/warehouse-types.html) is the most cost-efficient engine. See the [pricing overview](https:\/\/www.databricks.com\/product\/aws-pricing). \n### Use up-to-date runtimes for your workloads \nThe Databricks platform provides different runtimes that are optimized for data engineering tasks ([Databricks Runtime](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html)) or for Machine Learning ([Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html)). The runtimes are built to provide the best selection of libraries for the tasks and ensure that all provided libraries are up-to-date and work together optimally. Databricks Runtime is released on a regular cadence and offers performance improvements between major releases. These improvements in performance often lead to cost savings due to more efficient usage of cluster resources. \n### Only use GPUs for the right workloads \nVirtual machines with GPUs can dramatically speed up computational processes for deep learning, but have a significantly higher price than CPU-only machines. Use GPU instances only for workloads that have GPU-accelerated libraries. 
\nWorkloads that do not use GPU-accelerated libraries do not benefit from GPU-enabled instances. Workspace admins can restrict GPU machines and clusters to prevent unnecessary use. See the blog post [\u201cAre GPUs Really Expensive? Benchmarking GPUs for Inference on Databricks Clusters\u201d](https:\/\/www.databricks.com\/blog\/2021\/12\/15\/are-gpus-really-expensive-benchmarking-gpus-for-inference-on-the-databricks-clusters.html). \n### Balance between on-demand and capacity excess instances \n[Spot instances](https:\/\/docs.databricks.com\/compute\/cluster-config-best-practices.html) use cloud virtual machine excess resources that are available at a cheaper price. To save cost, Databricks supports creating clusters using spot instances. It is recommended to always have the first instance (Spark driver) as an on-demand virtual machine. Spot instances are a good choice for workloads that can tolerate a longer runtime when one or more spot instances are evicted by the cloud provider.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/cost-optimization\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Cost optimization for the data lakehouse\n##### Best practices for cost optimization\n###### 2. Dynamically allocate and de-allocate resources\n\n### Leverage auto-scaling compute \nAutoscaling allows your workloads to use the right amount of compute required to complete your jobs. \nNote \nCompute auto-scaling has limitations scaling down cluster size for Structured Streaming workloads. Databricks recommends using Delta Live Tables with Enhanced Autoscaling for streaming workloads. See [Optimize the cluster utilization of Delta Live Tables pipelines with Enhanced Autoscaling](https:\/\/docs.databricks.com\/delta-live-tables\/auto-scaling.html). 
\nSee [Reliability - Design for auto scaling](https:\/\/docs.databricks.com\/lakehouse-architecture\/reliability\/best-practices.html#3-design-for-autoscaling): \n* Enable autoscaling for batch workloads.\n* Enable autoscaling for SQL warehouse.\n* Use Delta Live Tables Enhanced Autoscaling. \n### Use auto termination \nDatabricks provides a number of features to help control costs by reducing idle resources and controlling when compute resources can be deployed. \n* Configure auto termination for all interactive clusters. After a specified idle time, the cluster shuts down. See [Automatic termination](https:\/\/docs.databricks.com\/compute\/configure.html#automatic-termination).\n* For use cases where clusters are only needed during business hours, the clusters can be configured with auto termination, and a scheduled process can restart the cluster (and potentially prewarm data if required) in the morning before users are back at their desktops. See [CACHE SELECT](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-cache.html).\n* If a starting time that is significantly shorter than a full cluster start would be acceptable, consider using cluster pools. See [Pool best practices](https:\/\/docs.databricks.com\/compute\/pool-best-practices.html). Databricks pools reduce cluster start and auto-scaling times by maintaining a set of idle, ready-to-use instances. When a cluster is attached to a pool, cluster nodes are created using the pool\u2019s idle instances. If the pool has no idle instances, the pool expands by allocating a new instance from the instance provider in order to accommodate the cluster\u2019s request. When a cluster releases an instance, it returns to the pool and is free for another cluster to use. Only clusters attached to a pool can use that pool\u2019s idle instances. \nDatabricks does not charge DBUs while instances are idle in the pool, resulting in cost savings. Instance provider billing does apply. 
\n### Use cluster policies to control costs \nCluster policies can enforce many cost-specific restrictions for clusters. See [Operational Excellence - Use cluster policies](https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/best-practices.html#use-cluster-policies). For example: \n* Enable [cluster autoscaling](https:\/\/docs.databricks.com\/lakehouse-architecture\/reliability\/best-practices.html#3-design-for-autoscaling) with a set minimum number of worker nodes.\n* Enable [cluster auto termination](https:\/\/docs.databricks.com\/lakehouse-architecture\/cost-optimization\/best-practices.html#use-auto-termination) with a reasonable value (for example, 1 hour) to avoid paying for idle times.\n* Ensure that only cost-efficient VM instances can be selected. Follow the best practices for cluster configuration. See [Compute configuration best practices](https:\/\/docs.databricks.com\/compute\/cluster-config-best-practices.html).\n* Apply a [spot instance strategy](https:\/\/docs.databricks.com\/lakehouse-architecture\/cost-optimization\/best-practices.html#balance-between-on-demand-and-capacity-excess-instances).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/cost-optimization\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Cost optimization for the data lakehouse\n##### Best practices for cost optimization\n###### 3. Monitor and control cost\n\n### Monitor costs \nThe [account console](https:\/\/accounts.cloud.databricks.com\/login) allows [viewing the billable usage](https:\/\/docs.databricks.com\/admin\/account-settings\/usage.html). As a Databricks account owner or account admin, you can also use the account console to [download billable usage logs](https:\/\/docs.databricks.com\/admin\/account-settings\/usage.html#usage-downloads). 
To access this data programmatically, you can also use the [Account API](https:\/\/docs.databricks.com\/api\/account\/billableusage\/download) to download the logs. Alternatively, you can configure [daily delivery of billable usage logs](https:\/\/docs.databricks.com\/admin\/account-settings\/billable-usage-delivery.html) in CSV file format to an AWS S3 storage bucket. \nAs a best practice, the full costs (including VMs, storage, and network infrastructure) should be monitored. This can be achieved by cloud provider cost management tools or by adding third party tools. \n### Evaluate Photon for your workloads \n[Photon](https:\/\/docs.databricks.com\/compute\/photon.html) provides extremely fast query performance at low cost \u2013 from data ingestion, ETL, streaming, data science and interactive queries \u2013 directly on your data lake. Photon is compatible with Apache Spark APIs, so getting started is as easy as turning it on \u2013 no code changes and no lock-in.\nCompared to Apache Spark, Photon provides an additional 2x speedup as measured by the TPC-DS 1TB benchmark. Customers have observed 3x\u20138x speedups on average, based on their workloads, compared to the latest DBR versions. \nFrom a cost perspective, Photon workloads use about 2x\u20133x more DBUs per hour than Spark workloads. Given the observed speedup, this could lead to significant cost savings, so jobs that run regularly should be evaluated to determine whether they are not only faster but also cheaper with Photon. \n### Use serverless for your workloads \nBI workloads typically use data in bursts and generate multiple concurrent queries. For example, someone using a BI tool might update a dashboard, write a query, or simply analyze query results without interacting further with the platform. 
This example demonstrates two requirements: \n* Terminate clusters during idle periods to save costs.\n* Have compute resources available quickly (for both start-up and scale-up) to satisfy user queries when they request new or updated data with the BI tool. \nNon-serverless Databricks SQL warehouses have a startup time of minutes, so many users tend to accept the higher cost and do not terminate them during idle periods. On the other hand, serverless SQL warehouses start and scale up in seconds, so both immediate availability and termination during idle times can be achieved. This results in a great user experience and overall cost savings. \nAdditionally, serverless SQL warehouses scale down earlier than non-serverless warehouses, resulting in lower costs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/cost-optimization\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Cost optimization for the data lakehouse\n##### Best practices for cost optimization\n###### 4. Analyze and attribute expenditure\n\n### Tag clusters for cost attribution \nTo monitor cost and accurately attribute Databricks usage to your organization\u2019s business units and teams (for example, for chargebacks), you can tag clusters and pools. These tags propagate to detailed DBU usage reports and to cloud provider VMs and blob storage instances for cost analysis. \nEnsure that cost control and attribution are already in mind when setting up workspaces and clusters for teams and use cases. This streamlines tagging and improves the accuracy of cost attributions. \nFor the overall costs, DBU, virtual machine, disk, and any associated network costs must be considered. For serverless SQL warehouses this is simpler since the DBU costs already include virtual machine and disk costs. 
\nSee [Monitor usage using tags](https:\/\/docs.databricks.com\/admin\/account-settings\/usage-detail-tags.html). \n### Share cost reports regularly \nCreate cost reports every month to track growth and anomalies in consumption. Share these reports broken down to use cases or teams with the teams that own the respective workloads by using [cluster tagging](https:\/\/docs.databricks.com\/lakehouse-architecture\/cost-optimization\/best-practices.html#tag-clusters-for-cost-attribution). This avoids surprises and allows teams to proactively adapt their workloads if costs get too high.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/cost-optimization\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Cost optimization for the data lakehouse\n##### Best practices for cost optimization\n###### 5. Optimize workloads, aim for scalable costs\n\n### Balance always-on and triggered streaming \nTraditionally, when people think about streaming, terms such as \u201creal-time,\u201d \u201c24\/7,\u201d or \u201calways on\u201d come to mind. If data ingestion happens in \u201creal-time\u201d, the underlying cluster needs to run 24\/7, producing consumption costs every single hour of the day. \nHowever, not every use case that is based on a continuous stream of events needs these events to be added to the analytics data set immediately. If the business requirement for the use case only needs fresh data every few hours or every day, then this requirement can be achieved with only several runs a day, leading to a significant cost reduction for the workload. Databricks recommends using Structured Streaming with trigger `AvailableNow` for incremental workloads that do not have low latency requirements. 
See [Configuring incremental batch processing](https:\/\/docs.databricks.com\/structured-streaming\/triggers.html#configuring-incremental-batch-processing). \n### Choose the most efficient cluster size \nDatabricks runs one executor per worker node. Therefore, the terms executor and worker are used interchangeably in the context of the Databricks architecture. People often think of cluster size in terms of the number of workers, but there are other important factors to consider: \n* Total executor cores (compute): The total number of cores across all executors. This determines the maximum parallelism of a cluster.\n* Total executor memory: The total amount of RAM across all executors. This determines how much data can be stored in memory before spilling it to disk.\n* Executor local storage: The type and amount of local disk storage. Local disk is primarily used in the case of spills during shuffles and caching. \nAdditional considerations include worker instance type and size, which also influence the preceding factors. When sizing your cluster, consider the following: \n* How much data will your workload consume?\n* What\u2019s the computational complexity of your workload?\n* Where are you reading data from?\n* How is the data partitioned in external storage?\n* How much parallelism do you need? 
\nDetails and examples can be found under [Cluster sizing considerations](https:\/\/docs.databricks.com\/compute\/cluster-config-best-practices.html#cluster-sizing).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/cost-optimization\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Data governance for the data lakehouse\n##### Best practices for data governance\n\nThis article covers best practices of **data governance**, organized by architectural principles listed in the following sections.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/data-governance\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Data governance for the data lakehouse\n##### Best practices for data governance\n###### 1. Unify data management\n\n### Manage metadata for all data assets in one place \nAs a best practice, run the lakehouse in a single account with one [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). The top-level container of objects in Unity Catalog is a [metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html). It stores data assets (such as tables and views) and the permissions that govern access to them. Use a single metastore per cloud region and do not access metastores across regions to avoid latency issues. 
\nThe metastore provides a three-level namespace: \n* [Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html)\n* [Schema](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-schemas.html)\n* [Table](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html)\/[view](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-views.html). \nDatabricks recommends using [catalogs to provide segregation across your organization\u2019s information architecture](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html#organize-your-data). Often this means that catalogs can correspond to software development environment scope, team, or business unit. \n### Track data lineage to drive visibility of the data \nData lineage is a powerful tool that helps data leaders drive greater visibility and understanding of the data in their organizations. It describes the transformation and refinement of data from source to insight. Lineage includes the capture of all relevant metadata and events associated with the data in its lifecycle, including the source of the data set, what other data sets were used to create it, who created it and when, what transformations were performed, what other data sets use it, and many other events and attributes. Data lineage can be used for many data-related use cases: \n* **Compliance and audit readiness**: Data lineage helps organizations trace the source of tables and fields. This is important for meeting the requirements of many compliance regulations, such as General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), Health Insurance Portability and Accountability Act (HIPAA), Basel Committee on Banking Supervision (BCBS) 239, and Sarbanes-Oxley Act (SOX).\n* **Impact analysis\/change management**: Data goes through multiple transformations from the source to the final business-ready table. 
Understanding the potential impact of data changes on downstream users becomes important from a risk-management perspective. This impact can be easily determined using the data lineage collected by Unity Catalog.\n* **Data quality assurance**: Understanding where a data set came from and what transformations have been applied provides much better context for data scientists and analysts, enabling them to gain better and more accurate insights.\n* **Debugging and diagnostics**: In the event of an unexpected result, data lineage helps data teams perform root cause analysis by tracing the error back to its source. This dramatically reduces debugging time. \nUnity Catalog captures runtime [data lineage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html) across queries run on Databricks. Lineage is supported for all languages and is captured down to the column level. Lineage data includes notebooks, workflows, and dashboards related to the query. Lineage can be visualized in [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html) in near real time and accessed using [system tables](https:\/\/docs.databricks.com\/admin\/system-tables\/lineage.html) (preferred) or the Databricks [Data Lineage REST API](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html#data-lineage-api). \n### Discover data and related information using Catalog Explorer \nEasy data discovery enables data scientists, data analysts, and data engineers to quickly discover and reference relevant data and accelerate time to value. Databricks [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html) provides a UI to explore and manage data, schemas (databases), tables, permissions, data owners, external locations, and credentials. 
Additionally, you can use the Insights tab in Catalog Explorer to [view the most frequent recent queries](https:\/\/docs.databricks.com\/discover\/table-insights.html) and users of any table registered in Unity Catalog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/data-governance\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Data governance for the data lakehouse\n##### Best practices for data governance\n###### 2. Unify data security\n\n### Centralize access control \nThe Databricks Data Intelligence Platform provides methods for data access control, mechanisms that describe which groups or individuals can access what data. These are statements of policy that can be extremely granular and specific, right down to definitions of every record that each individual has access to. Or they can be very expressive and broad, such as all finance users can see all financial data. \nUnity Catalog centralizes access controls for files, tables, and views. Each securable object in Unity Catalog has an owner. An object\u2019s owner has all privileges on the object, as well as the permission to grant privileges on the securable object to other principals. Unity Catalog allows you to [manage privileges](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html) and to [configure access control](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html#configure-access-control) by using SQL DDL statements. \nUnity Catalog uses dynamic views for fine-grained access controls so that you can restrict access to rows and columns to the users and groups who are authorized to query them. See [Create a dynamic view](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-views.html#dynamic-view). 
\nFor further information see [Security, compliance & privacy - Manage identity and access using least privilege](https:\/\/docs.databricks.com\/lakehouse-architecture\/security-compliance-and-privacy\/best-practices.html#1-manage-identity-and-access-using-least-privilege). \n### Configure audit logging \nDatabricks provides access to [audit logs](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html) of activities performed by Databricks users, allowing your enterprise to monitor detailed Databricks usage patterns. There are two types of logs: Workspace-level audit logs with workspace-level events and account-level audit logs with account-level events. \n### Audit Unity Catalog events \nUnity Catalog [captures an audit log](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/audit.html) of actions performed against the metastore. This enables admins to access fine-grained details about who accessed a given dataset and what actions they performed. \n### Audit data sharing events \nFor secure sharing with Delta Sharing, Databricks provides [audit logs to monitor Delta Sharing events](https:\/\/docs.databricks.com\/data-sharing\/audit-logs.html), including: \n* When someone creates, modifies, updates, or deletes a share or a recipient.\n* When a recipient accesses an activation link and downloads the credential.\n* When a recipient accesses shares or data in shared tables.\n* When a recipient\u2019s credential is rotated or expires.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/data-governance\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Data governance for the data lakehouse\n##### Best practices for data governance\n###### 3. 
Manage data quality\n\nThe Databricks Data Intelligence Platform provides robust data quality management with built-in quality controls, testing, monitoring, and enforcement to ensure accurate and useful data is available for downstream BI, analytics, and machine learning workloads. \nSee [Reliability - Manage data quality](https:\/\/docs.databricks.com\/lakehouse-architecture\/reliability\/best-practices.html#2-manage-data-quality).\n\n##### Best practices for data governance\n###### 4. Share data securely and in real-time\n\n### Use the open Delta Sharing protocol for sharing data with partners \n[Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html) provides an [open solution for securely sharing live data](https:\/\/docs.databricks.com\/data-sharing\/share-data-open.html) from your lakehouse to any computing platform. Recipients do not need to be on the Databricks platform, on the same cloud, or on any cloud at all. Delta Sharing is natively integrated with Unity Catalog, enabling organizations to centrally manage and audit shared data across the enterprise and confidently share data assets while meeting security and compliance requirements. \nData providers can share live data from where it resides in their cloud storage without replicating or moving it to another system. This approach reduces the operational costs of data sharing because data providers don\u2019t have to replicate data multiple times across clouds, geographies, or data platforms to each of their data consumers. \n### Use Databricks-to-Databricks Delta Sharing between Databricks users \nIf you want to share data with users who don\u2019t have access to your Unity Catalog metastore, you can use [Databricks-to-Databricks Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/share-data-databricks.html), as long as the recipients have access to a Databricks workspace that is enabled for Unity Catalog. 
Databricks-to-Databricks sharing lets you share data with users in other Databricks accounts, across cloud regions, and across cloud providers. It\u2019s also a great way to securely share data across different Unity Catalog metastores in your own Databricks account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/data-governance\/best-practices.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Higher-order functions\n\nDatabricks provides dedicated primitives for manipulating arrays in Apache Spark SQL; these make working with arrays much easier and more concise and do away with the large amounts of boilerplate code typically required. The primitives revolve around two functional programming constructs: higher-order functions and anonymous (lambda) functions. These work together to allow you to define functions that manipulate arrays in SQL. A *higher-order function* takes an array and defines how the array is processed and what the result of the computation will be. It delegates to a *lambda function* how to process each item in the array.\n\n#### Higher-order functions\n##### Introduction to higher-order functions notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/higher-order-functions.html)\n\n#### Higher-order functions\n##### Higher-order functions tutorial Python notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/higher-order-functions-tutorial-python.html)\n\n#### Higher-order functions\n##### Apache Spark built-in functions\n\nApache Spark has built-in functions for manipulating complex types (for example, array types), including higher-order functions. 
\nThe following notebook illustrates Apache Spark built-in functions. \n### Apache Spark built-in functions notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/apache-spark-functions.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/higher-order-lambda-functions.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Structured Streaming writes to Azure Synapse\n\nThe Azure Synapse connector offers efficient and scalable Structured Streaming write support for Azure Synapse that provides a user experience consistent with batch writes and uses `COPY` for large data transfers between a Databricks cluster and an Azure Synapse instance. \nStructured Streaming support between Databricks and Synapse provides simple semantics for configuring incremental ETL jobs. The model used to load data from Databricks to Synapse introduces latency that might not meet SLA requirements for near real-time workloads. See [Query data in Azure Synapse Analytics](https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html).\n\n#### Structured Streaming writes to Azure Synapse\n##### Supported output modes for streaming writes to Synapse\n\nThe Azure Synapse connector supports `Append` and `Complete` output modes for record appends and aggregations. 
For more details on output modes and the compatibility matrix, see the [Structured Streaming guide](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html#output-modes).\n\n#### Structured Streaming writes to Azure Synapse\n##### Synapse fault tolerance semantics\n\nBy default, Azure Synapse Streaming offers an end-to-end *exactly-once* guarantee for writing data into an Azure Synapse table by reliably tracking the progress of the query using a combination of a checkpoint location in DBFS, a checkpoint table in Azure Synapse, and a locking mechanism to ensure that streaming can handle any type of failure, retry, and query restart. \nOptionally, you can select less restrictive at-least-once semantics for Azure Synapse Streaming by setting the `spark.databricks.sqldw.streaming.exactlyOnce.enabled` option to `false`, in which case data duplication could occur in the event of intermittent connection failures to Azure Synapse or unexpected query termination.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/synapse.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Structured Streaming writes to Azure Synapse\n##### Structured Streaming syntax for writing to Azure Synapse\n\nThe following code examples demonstrate streaming writes to Synapse using Structured Streaming in Scala and Python: \n```\n\/\/ Set up the Blob storage account access key in the notebook session conf.\nspark.conf.set(\n\"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net\",\n\"<your-storage-account-access-key>\")\n\n\/\/ Prepare streaming source; this could be Kafka or a simple rate stream.\nval df: DataFrame = spark.readStream\n.format(\"rate\")\n.option(\"rowsPerSecond\", \"100000\")\n.option(\"numPartitions\", \"16\")\n.load()\n\n\/\/ Apply some transformations to the data then use\n\/\/ Structured Streaming API to continuously write the data to a table in Azure 
Synapse.\n\ndf.writeStream\n.format(\"com.databricks.spark.sqldw\")\n.option(\"url\", \"jdbc:sqlserver:\/\/<the-rest-of-the-connection-string>\")\n.option(\"tempDir\", \"abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>\")\n.option(\"forwardSparkAzureStorageCredentials\", \"true\")\n.option(\"dbTable\", \"<your-table-name>\")\n.option(\"checkpointLocation\", \"\/tmp_checkpoint_location\")\n.start()\n\n``` \n```\n# Set up the Blob storage account access key in the notebook session conf.\nspark.conf.set(\n\"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net\",\n\"<your-storage-account-access-key>\")\n\n# Prepare streaming source; this could be Kafka or a simple rate stream.\ndf = spark.readStream \\\n.format(\"rate\") \\\n.option(\"rowsPerSecond\", \"100000\") \\\n.option(\"numPartitions\", \"16\") \\\n.load()\n\n# Apply some transformations to the data then use\n# Structured Streaming API to continuously write the data to a table in Azure Synapse.\n\ndf.writeStream \\\n.format(\"com.databricks.spark.sqldw\") \\\n.option(\"url\", \"jdbc:sqlserver:\/\/<the-rest-of-the-connection-string>\") \\\n.option(\"tempDir\", \"abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>\") \\\n.option(\"forwardSparkAzureStorageCredentials\", \"true\") \\\n.option(\"dbTable\", \"<your-table-name>\") \\\n.option(\"checkpointLocation\", \"\/tmp_checkpoint_location\") \\\n.start()\n\n``` \nFor a full list of configurations, see [Query data in Azure Synapse Analytics](https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/synapse.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Structured Streaming writes to Azure Synapse\n##### Synapse streaming checkpoint table management\n\nThe Azure Synapse connector *does not* delete the streaming 
checkpoint table that is created when a new streaming query is started. This behavior is consistent with the `checkpointLocation` normally specified on object storage. Databricks recommends you periodically delete checkpoint tables for queries that are not going to be run in the future. \nBy default, all checkpoint tables have the name `<prefix>_<query-id>`, where `<prefix>` is a configurable prefix with the default value `databricks_streaming_checkpoint` and `<query-id>` is a streaming query ID with `_` characters removed. \nTo find all checkpoint tables for stale or deleted streaming queries, run the query: \n```\nSELECT * FROM sys.tables WHERE name LIKE 'databricks_streaming_checkpoint%'\n\n``` \nYou can configure the prefix with the Spark SQL configuration option `spark.databricks.sqldw.streaming.exactlyOnce.checkpointTableNamePrefix`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/synapse.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Structured Streaming writes to Azure Synapse\n##### Databricks Synapse connector streaming options reference\n\nThe `OPTIONS` provided in Spark SQL support the following options for streaming in addition to the [batch options](https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html#parameters): \n| Parameter | Required | Default | Notes |\n| --- | --- | --- | --- |\n| `checkpointLocation` | Yes | No default | Location on DBFS that will be used by Structured Streaming to write metadata and checkpoint information. See [Recovering from Failures with Checkpointing](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing) in the Structured Streaming programming guide. |\n| `numStreamingTempDirsToKeep` | No | 0 | Indicates how many (latest) temporary directories to keep for periodic cleanup of micro batches in streaming. 
When set to `0`, directory deletion is triggered immediately after a micro batch is committed. Otherwise, the provided number of latest micro batches is kept and the remaining directories are removed. Use `-1` to disable periodic cleanup. | \nNote \n`checkpointLocation` and `numStreamingTempDirsToKeep` are relevant only for streaming writes from Databricks to a new table in Azure Synapse.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/synapse.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/download.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Download and reference the Databricks JDBC Driver\n\nThis article describes how to download and reference the [Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/index.html). \nReview the [JDBC ODBC driver license](https:\/\/databricks.com\/jdbc-odbc-driver-license) before you download and reference the JDBC driver. \nSome apps, clients, SDKs, APIs, and tools such as [DataGrip](https:\/\/docs.databricks.com\/dev-tools\/datagrip.html), [DBeaver](https:\/\/docs.databricks.com\/dev-tools\/dbeaver.html), and [SQL Workbench\/J](https:\/\/docs.databricks.com\/partners\/bi\/workbenchj.html) require you to manually download the JDBC driver before you can set up a connection to Databricks. If you use Java build tools such as Maven or Gradle, these build tools can automatically download the JDBC driver. If you do not need to manually download the JDBC driver, skip ahead to [Next steps](https:\/\/docs.databricks.com\/integrations\/jdbc\/download.html#next-steps). 
\nTo manually download the JDBC driver, do the following: \n1. Go to the [All JDBC Driver Versions](https:\/\/www.databricks.com\/spark\/jdbc-drivers-archive) download page.\n2. Click the **Download** button for the latest version of the JDBC driver. The driver is packaged as a `.jar` file. This file does not require installation. \nIf you use Java code, you can reference the JDBC driver from your code in one of the following ways: \n* If you manually downloaded the `.jar` file, you can add the downloaded `.jar` file to the Java classpath.\n* For Maven projects, you can add the following dependency to the project\u2019s `pom.xml` file to instruct Maven to automatically download the JDBC driver with the specified version: \n```\n<dependency>\n<groupId>com.databricks<\/groupId>\n<artifactId>databricks-jdbc<\/artifactId>\n<version>2.6.36<\/version>\n<\/dependency>\n\n```\n* For Gradle projects, you can add the following dependency to the project\u2019s build file to instruct Gradle to automatically download the JDBC driver with the specified version: \n```\nimplementation 'com.databricks:databricks-jdbc:2.6.36'\n\n``` \nTo view the dependency syntax for other project types, and to get the latest version number of the JDBC driver, see the [Maven Central Repository](https:\/\/central.sonatype.com\/artifact\/com.databricks\/databricks-jdbc).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/download.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Download and reference the Databricks JDBC Driver\n########## Next steps\n\nTo configure a Databricks connection for the Databricks JDBC Driver, see the following articles: \n* [Compute settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/compute.html)\n* [Authentication settings for the Databricks JDBC 
Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html)\n* [Driver capability settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/download.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### What does it mean to build a single source of truth?\n\nThe Databricks lakehouse eliminates the need for creating and syncing copies of data across multiple systems by unifying data access and storage in a single system, establishing the lakehouse as the single source of truth (SSOT). Duplicating data often results in data silos, meaning that different teams within an organization may be working with versions of the same data that differ in quality and freshness.\n\n#### What does it mean to build a single source of truth?\n##### How does the lakehouse control transactions and data access?\n\nDelta Lake transactions use log files stored alongside data files to provide ACID guarantees at a table level. Because the data and log files backing Delta Lake tables live together in cloud object storage, reading and writing data can occur simultaneously without risk of many queries resulting in performance degradation or deadlock for business-critical workloads. This means that users and applications throughout the enterprise environment can connect to the same single copy of the data to drive diverse workloads, with all viewers guaranteed to receive the most current version of the data at the time their query executes.\n\n#### What does it mean to build a single source of truth?\n##### Manage access to production data\n\nUnity Catalog provides a centralized data governance solution that allows data stewards to provide fine-grained access control to users, groups, and service principals. 
Unity Catalog governs permissions using access control lists (ACLs) that provide both flexibility and specificity in configuring resources. Some configurable permissions include: \n* Read-only access to a handful of tables.\n* Table creation and modification permissions for a database.\n* Ability to read or modify data in a specific cloud storage location.\n* Access to many cloud resources through Unity Catalog managed storage credentials. \nFor more information, see [What is Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/ssot.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### What does it mean to build a single source of truth?\n##### Leverage views in the lakehouse\n\nViews on Databricks represent saved queries against data stored in tables somewhere in the lakehouse. Whereas the queries that result in tables are executed at write time, views execute defining logic each time a query against a view runs. This means that views can provide up-to-date access to data from a variety of sources, and that compute is only spent to update results as they are needed. \nYou can use Unity Catalog to secure and share views alongside other data objects, allowing individuals and teams to share the logic that drives key business decisions across the organization. \nFor more information, see [Data objects in the Databricks lakehouse](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html).\n\n#### What does it mean to build a single source of truth?\n##### Share data with collaborators\n\nWhile the ACLs in Unity Catalog cover a wide range of use cases for sharing data within an enterprise organization, Delta Sharing further expands this by managing read-only access to datasets that can be shared with collaborators anywhere. 
Use cases supported by Unity Catalog include: \n* Providing real-time access to regional analytics for isolated regions of multinational corporations.\n* Sharing datasets across isolated businesses that exist under the same corporate umbrella.\n* Providing secure access to customer-curated datasets for third-party consumers. \nOn Databricks, Delta Sharing comes built-in with Unity Catalog, but it is also part of [open source Delta Lake](https:\/\/delta.io\/sharing). For more information, see [Share data and AI assets securely using Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/ssot.html"} +{"content":"# Compute\n### What are Databricks pools?\n\nDatabricks pools are a set of idle, ready-to-use instances. When cluster nodes are created using the idle instances, cluster start and auto-scaling times are reduced. If the pool has no idle instances, the pool expands by allocating a new instance from the instance provider in order to accommodate the cluster\u2019s request. \nWhen a cluster releases an instance, it returns to the pool and is free for another cluster to use. Only clusters attached to a pool can use that pool\u2019s idle instances. \nDatabricks does not charge DBUs while instances are idle in the pool. Instance provider billing does apply. See [pricing](https:\/\/aws.amazon.com\/ec2\/pricing\/). \nYou can manage pools using the UI or by calling the [Instance Pools API](https:\/\/docs.databricks.com\/api\/workspace\/instancepools).\n\n### What are Databricks pools?\n#### Create a pool\n\nTo create a pool, you must have permission to create pools. By default, only workspace admins have pool creation permissions. Groups can be assigned the `allow-instance-pool-create` entitlement using the [Group API](https:\/\/docs.databricks.com\/api\/workspace\/groups\/update). \nTo create a pool using the UI: \n1. 
Click ![compute icon](https:\/\/docs.databricks.com\/_images\/clusters-icon.png) **Compute** in the sidebar.\n2. Click the **Pools** tab.\n3. Click the **Create Pool** button.\n4. Specify the pool configuration.\n5. Click the **Create** button.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/pool-index.html"} +{"content":"# Compute\n### What are Databricks pools?\n#### Attach a cluster to a pool\n\nTo attach a cluster to a pool using the [cluster creation UI](https:\/\/docs.databricks.com\/compute\/configure.html), select the pool from the **Driver Type** or **Worker Type** dropdown when you configure the cluster. Available pools are listed at the top of each dropdown list. You can use the same pool or different pools for the driver node and worker nodes. \nIf you use the [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters), you must specify `driver_instance_pool_id` for the driver node and `instance_pool_id` for the worker nodes. \nFor more best practices related to pools, see [Pool best practices](https:\/\/docs.databricks.com\/compute\/pool-best-practices.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/pool-index.html"} +{"content":"# Compute\n### What are Databricks pools?\n#### Pool permissions\n\nThere are three permission levels for a pool: NO PERMISSIONS, CAN ATTACH TO, and CAN MANAGE. The table lists the abilities for each permission. \n| Ability | NO PERMISSIONS | CAN ATTACH TO | CAN MANAGE |\n| --- | --- | --- | --- |\n| Attach cluster to pool | | x | x |\n| Delete pool | | | x |\n| Edit pool | | | x |\n| Modify permissions | | | x | \nWorkspace admins have the CAN MANAGE permission on all pools in their workspace. Users automatically have the CAN MANAGE permission on pools they create. \n### Configure pool permissions \nThis section describes how to manage permissions using the workspace UI. 
You can also use the [Permissions API](https:\/\/docs.databricks.com\/api\/workspace\/permissions) or [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html). \nYou must have the CAN MANAGE permission on a pool to configure permissions. \n1. In the sidebar, click **Compute**.\n2. Click the **Pools** tab.\n3. Select the pool you want to update.\n4. Click the **Permissions** button.\n5. In **Permission Settings**, click the **Select user, group or service principal\u2026** drop-down menu and select a user, group, or service principal. \n![Set pool permissions](https:\/\/docs.databricks.com\/_images\/pool-acl.png)\n6. Select a permission from the permission drop-down menu.\n7. Click **Add**, then click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/pool-index.html"} +{"content":"# Compute\n### What are Databricks pools?\n#### Delete a pool\n\nDeleting a pool terminates the pool\u2019s idle instances and removes its configuration. To delete a pool, click the ![Delete Icon](https:\/\/docs.databricks.com\/_images\/delete-icon.png) icon in the actions on the Pools page. If you delete a pool: \n* Running clusters attached to the pool continue to run, but cannot allocate instances during resize or up-scaling.\n* Terminated clusters attached to the pool will fail to start. \nImportant \nYou cannot undo this action.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/pool-index.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Filter sensitive table data using row filters and column masks\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
\nThis article provides guidance and examples for using row filters, column masks, and mapping tables to filter sensitive data in your tables.\n\n#### Filter sensitive table data using row filters and column masks\n##### What are row filters?\n\nRow filters allow you to apply a filter to a table so that subsequent queries only return rows for which the filter predicate evaluates to true. A row filter is implemented as a SQL user-defined function (UDF). \nTo create a row filter, first write a [SQL UDF](https:\/\/docs.databricks.com\/udf\/index.html) to define the filter policy and then apply it to a table with an `ALTER TABLE` statement. Alternatively, you can specify a row filter for a table in the initial `CREATE TABLE` statement. Each table can have only one row filter. A row filter accepts zero or more input parameters where each input parameter binds to one column of the corresponding table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Filter sensitive table data using row filters and column masks\n##### What is the difference between these filters and dynamic views?\n\nThe [dynamic view](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-views.html#dynamic-view) is an abstracted, read-only view of one or more source tables. The user can access the dynamic view without having access to the source tables directly. Creating a dynamic view defines a new table name that must not match the name of any source tables or other tables and views present in the same schema. \nOn the other hand, associating a row filter or column mask to a target table applies the corresponding logic directly to the table itself without introducing any new table names. Subsequent queries may continue referring directly to the target table using its original name. 
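\nAs a minimal sketch of that difference, assuming a hypothetical `sales` table with a `region` column and a hypothetical `admin` account group (the view name `sales_us_view` is also an assumption for illustration), the same logic could be expressed either way: \n```\n-- Option 1: a dynamic view introduces a new object name that users query instead of the table.\nCREATE VIEW sales_us_view AS\nSELECT * FROM sales\nWHERE is_account_group_member('admin') OR region = 'US';\n\n-- Option 2: a row filter attaches the logic to the table itself, which keeps its original name.\nCREATE FUNCTION us_filter(region STRING)\nRETURN IF(is_account_group_member('admin'), true, region = 'US');\nALTER TABLE sales SET ROW FILTER us_filter ON (region);\n\n``` \nWith option 1, users query `sales_us_view`; with option 2, they continue to query `sales`. 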
\nDynamic views, row filters, and column masks all let you apply complex logic to tables and process their filtering decisions at query runtime. \nUse dynamic views if you need to apply transformation logic such as filters and masks to read-only tables, and if it is acceptable for users to refer to the dynamic views using different names. Use row filters and column masks if you want to filter or compute expressions over specific data but still provide users access to the tables using their original names.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Filter sensitive table data using row filters and column masks\n##### Row Filter Syntax\n\nTo create a row filter and add it to an existing table, use the following syntax: \nCreate the row filter: \n```\nCREATE FUNCTION <function_name> (<parameter_name> <parameter_type>, ...)\nRETURN {filter clause whose output must be a boolean};\n\n``` \nApply the row filter to a table: \n```\nALTER TABLE <table_name> SET ROW FILTER <function_name> ON (<column_name>, ...);\n\n``` \nRemove a row filter from a table: \n```\nALTER TABLE <table_name> DROP ROW FILTER;\n\n``` \nModify a row filter: \nEither `DROP` the existing function, or use `CREATE OR REPLACE FUNCTION` to replace it. \nDelete a row filter: \n```\nALTER TABLE <table_name> DROP ROW FILTER;\nDROP FUNCTION <function_name>;\n\n``` \nNote \nYou must perform the `ALTER TABLE ... DROP ROW FILTER` command before dropping the function or the table will be in an inaccessible state. 
\nIf the table becomes inaccessible in this way, alter the table and drop the orphaned row filter reference using `ALTER TABLE <table_name> DROP ROW FILTER;`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Filter sensitive table data using row filters and column masks\n##### Row filter examples\n\nCreate a SQL user-defined function that gives members of the `admin` group access to all rows and restricts everyone else to rows in the `US` region. \nWith this function, members of the `admin` group can access all records in the table. If the function is called by a non-admin, the `IF` condition evaluates to false and the `region='US'` expression is evaluated, filtering the table to only show records in the `US` region. \n```\nCREATE FUNCTION us_filter(region STRING)\nRETURN IF(IS_ACCOUNT_GROUP_MEMBER('admin'), true, region='US');\n\n``` \nApply the function to a table as a row filter. Subsequent queries from the `sales` table then return a subset of rows. \n```\nCREATE TABLE sales (region STRING, id INT);\nALTER TABLE sales SET ROW FILTER us_filter ON (region);\n\n``` \nDisable the row filter. Future user queries from the `sales` table then return all of the rows in the table. \n```\nALTER TABLE sales DROP ROW FILTER;\n\n``` \nCreate a table with the function applied as a row filter as part of the CREATE TABLE statement. Future queries from the `sales` table then each return a subset of rows. \n```\nCREATE TABLE sales (region STRING, id INT)\nWITH ROW FILTER us_filter ON (region);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Filter sensitive table data using row filters and column masks\n##### What are column masks?\n\nColumn masks let you apply a masking function to a table column. 
The masking function gets evaluated at query runtime, substituting each reference of the target column with the results of the masking function. For most use cases, column masks determine whether to return the original column value or redact it based on the identity of the invoking user. Column masks are expressions written as SQL UDFs. \nEach table column can optionally have one masking function applied to it. The masking function takes the unmasked value of the column as input and returns the masked value as its result. The return value of the masking function should be the same type as the column being masked. The masking function can also take additional columns as input parameters and use them in its masking logic. \nTo apply column masks, create a function and apply it to a table column using an `ALTER TABLE` statement. Alternatively, you can apply the masking function when you create the table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Filter sensitive table data using row filters and column masks\n##### Column mask syntax\n\nWithin the `MASK` clause, you can use any of the Databricks built-in runtime functions or call other user-defined functions. Common use cases include inspecting the identity of the invoking user running the function using `current_user( )` or which groups they are a member of using `is_account_group_member( )`. 
\nCreate a column mask: \n```\nCREATE FUNCTION <function_name> (<parameter_name> <parameter_type>, ...)\nRETURN {expression with the same type as the first parameter};\n\n``` \nApply column mask to a column in an existing table: \n```\nALTER TABLE <table_name> ALTER COLUMN <col_name> SET MASK <mask_func_name> [USING COLUMNS <additional_columns>];\n\n``` \nRemove a column mask from a column in a table: \n```\nALTER TABLE <table_name> ALTER COLUMN <column where mask is applied> DROP MASK;\n\n``` \nModify a column mask: \nEither `DROP` the existing function, or use `CREATE OR REPLACE FUNCTION`. \nDelete a column mask: \n```\nALTER TABLE <table_name> ALTER COLUMN <column where mask is applied> DROP MASK;\nDROP FUNCTION <function_name>;\n\n``` \nNote \nYou must perform the `ALTER TABLE` command before dropping the function or the table will be in an inaccessible state. \nIf the table becomes inaccessible in this way, alter the table and drop the orphaned mask reference using `ALTER TABLE <table_name> ALTER COLUMN <column where mask is applied> DROP MASK;`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Filter sensitive table data using row filters and column masks\n##### Column mask examples\n\nIn this example, you create a user-defined function that masks the `ssn` column so that only users who are members of the `HumanResourceDept` group can view values in that column. \n```\nCREATE FUNCTION ssn_mask(ssn STRING)\nRETURN CASE WHEN is_member('HumanResourceDept') THEN ssn ELSE '***-**-****' END;\n\n``` \nApply the new function to a table as a column mask. You can add the column mask when you create the table or after. 
\n```\n--Create the `users` table and apply the column mask in a single step:\n\nCREATE TABLE users (\nname STRING,\nssn STRING MASK ssn_mask);\n\n``` \n```\n--Create the `users` table and apply the column mask after:\n\nCREATE TABLE users\n(name STRING, ssn STRING);\n\nALTER TABLE users ALTER COLUMN ssn SET MASK ssn_mask;\n\n``` \nQueries on that table now return masked `ssn` column values when the querying user is not a member of the `HumanResourceDept` group: \n```\nSELECT * FROM users;\nJames ***-**-****\n\n``` \nTo disable the column mask so that queries return the original values in the `ssn` column: \n```\nALTER TABLE users ALTER COLUMN ssn DROP MASK;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Filter sensitive table data using row filters and column masks\n##### Use mapping tables to create an access-control list\n\nTo achieve row-level security, consider defining a mapping table (or access-control list). Each mapping table encodes which data rows in the original table are accessible to certain users or groups. Mapping tables are useful because they offer simple integration with your fact tables through direct joins. \nThis methodology proves beneficial in addressing many use cases with custom requirements. Examples include: \n* Imposing restrictions based on the logged-in user while accommodating different rules for specific user groups.\n* Creating intricate hierarchies, such as organizational structures, requiring diverse sets of rules.\n* Replicating complex security models from external source systems. 
\nBy adopting mapping tables in this way, you can effectively tackle these challenging scenarios and ensure robust row-level and column-level security implementations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Filter sensitive table data using row filters and column masks\n##### Mapping table examples\n\nUse a mapping table to check if the current user is in a list: \n```\nUSE CATALOG main;\n\n``` \nCreate a new mapping table: \n```\nDROP TABLE IF EXISTS valid_users;\n\nCREATE TABLE valid_users(username string);\nINSERT INTO valid_users\nVALUES\n('fred@databricks.com'),\n('barney@databricks.com');\n\n``` \nCreate a new filter: \nNote \nAll filters run with definer\u2019s rights except for functions that check user context (for example, the `CURRENT_USER` and `IS_MEMBER` functions) which run as the invoker. \nIn this example the function checks to see if the current user is in the `valid_users` table. If the user is found, the function returns true. \n```\nDROP FUNCTION IF EXISTS row_filter;\n\nCREATE FUNCTION row_filter()\nRETURN EXISTS(\nSELECT 1 FROM valid_users v\nWHERE v.username = CURRENT_USER()\n);\n\n``` \nThe example below applies the row filter during table creation. You can also add the filter later using an `ALTER TABLE` statement. When applying to a whole table use the `ON ()` syntax. For a specific row use `ON (row);`. \n```\nDROP TABLE IF EXISTS data_table;\n\nCREATE TABLE data_table\n(x INT, y INT, z INT)\nWITH ROW FILTER row_filter ON ();\n\nINSERT INTO data_table VALUES\n(1, 2, 3),\n(4, 5, 6),\n(7, 8, 9);\n\n``` \nSelect data from the table. This should only return data if the user is in the `valid_users` table. 
\n```\nSELECT * FROM data_table;\n\n``` \nCreate a mapping table comprising accounts that should always have access to view all the rows in the table, regardless of the column values: \n```\nCREATE TABLE valid_accounts(account string);\nINSERT INTO valid_accounts\nVALUES\n('admin'),\n('cstaff');\n\n``` \nNow create a SQL UDF that returns `true` if the values of all columns in the row are less than five, or if the invoking user is a member of the above mapping table. \n```\nCREATE FUNCTION row_filter_small_values (x INT, y INT, z INT)\nRETURN (x < 5 AND y < 5 AND z < 5)\nOR EXISTS(\nSELECT 1 FROM valid_accounts v\nWHERE IS_ACCOUNT_GROUP_MEMBER(v.account));\n\n``` \nFinally, apply the SQL UDF to the table as a row filter: \n```\nALTER TABLE data_table SET ROW FILTER row_filter_small_values ON (x, y, z);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Filter sensitive table data using row filters and column masks\n##### Supportability\n\n* Databricks SQL and Databricks notebooks for SQL workloads are supported.\n* DML commands by users with MODIFY privileges are supported. Filters and masks are applied to the data read by UPDATEs and DELETEs and are not applied to data that is written (including INSERTed data).\n* Supported formats: Delta and Parquet. Parquet is supported for only managed or external tables.\n* Views on tables with column masks or row filters are supported.\n* Delta Lake change data feeds are supported as long as the schema is compatible with the row filters and column masks that may apply to the target table.\n* Foreign tables are supported.\n\n#### Filter sensitive table data using row filters and column masks\n##### Limitations\n\n* Databricks Runtime versions below 12.2 LTS do not support row filters or column masks. 
These runtimes fail securely, meaning if you try to access tables from unsupported versions of these runtimes, no data is returned.\n* Delta Live Tables materialized views and streaming tables don\u2019t support row filters or column masks.\n* Python and Scala UDFs are not supported as row filter or column mask functions directly. However, it is possible to refer to these in SQL UDFs as long as their definitions are permanently stored in the catalog (in other words, not temporary to the session).\n* Delta Sharing does not work with row-level security or column masks.\n* Time travel does not work with row-level security or column masks.\n* Table sampling does not work with row-level security or column masks.\n* Path-based access to files in tables with policies is not currently supported.\n* Row-filter or column-mask policies with circular dependencies back to the original policies are not supported.\n* `MERGE` and shallow clones are not supported.\n\n#### Filter sensitive table data using row filters and column masks\n##### Single user clusters limitation\n\nDo not add row filters or column masks to any table that you are accessing from single user clusters. This is commonly done in the context of Databricks Jobs. During the public preview, you will be unable to access the table from a single user cluster once a filter or mask has been applied.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create a Unity Catalog metastore\n\nThis article shows how to create a Unity Catalog metastore and link it to workspaces. \nImportant \nFor workspaces that were enabled for Unity Catalog automatically, the instructions in this article are unnecessary. 
Databricks began to enable new workspaces for Unity Catalog automatically on November 8, 2023, with a rollout proceeding gradually across accounts. You must follow the instructions in this article only if you have a workspace and don\u2019t already have a metastore in your workspace region. To determine whether a metastore already exists in your region, see [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement). \nA metastore is the top-level container for data in Unity Catalog. Unity Catalog metastores register metadata about securable objects (such as tables, volumes, external locations, and shares) and the permissions that govern access to them. Each metastore exposes a three-level namespace (`catalog`.`schema`.`table`) by which data can be organized. You must have one metastore for each region in which your organization operates. To work with Unity Catalog, users must be on a workspace that is attached to a metastore in their region. \nTo create a metastore, you do the following: \n1. In your AWS account, optionally create a storage location for metastore-level storage of managed tables and volumes. \nFor information to help you decide whether you need metastore-level storage, see [(Optional) Create metastore-level storage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#metastore-storage) and [Data is physically separated in storage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html#physically-separate).\n2. In your AWS account, create an IAM role that gives access to that storage location.\n3. In Databricks, create the metastore, attaching the storage location, and assign workspaces to the metastore. 
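\nAs a minimal sketch of the three-level namespace described above (the catalog `main`, schema `finance`, and table `sales` are hypothetical names used only for illustration), a table in a workspace attached to a metastore can be referenced like this: \n```\n-- Fully qualified reference: catalog.schema.table\nSELECT * FROM main.finance.sales LIMIT 10;\n\n-- The same table, after setting the current catalog and schema\nUSE CATALOG main;\nUSE SCHEMA finance;\nSELECT * FROM sales LIMIT 10;\n\n``` 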
\nNote \nIn addition to the approaches described in this article, you can also create a metastore by using the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html), specifically the [databricks\\_metastore](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/metastore) resource. To enable Unity Catalog to access the metastore, use [databricks\\_metastore\\_data\\_access](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/metastore_data_access). To link workspaces to a metastore, use [databricks\\_metastore\\_assignment](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/metastore_assignment).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create a Unity Catalog metastore\n##### Before you begin\n\nBefore you begin, you should familiarize yourself with the basic Unity Catalog concepts, including metastores and managed storage. See [What is Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). 
\nYou should also confirm that you meet the following requirements for all setup steps: \n* You must be a Databricks account admin.\n* Your Databricks account must be on the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n* If you want to set up metastore-level root storage, you must have the ability to create S3 buckets, IAM roles, IAM policies, and cross-account trust relationships in your AWS account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create a Unity Catalog metastore\n##### Step 1 (Optional): Create an S3 bucket for metastore-level managed storage in AWS\n\nIn this step, which is optional, you create the S3 bucket required by Unity Catalog to store managed table and volume data at the metastore level. You create the S3 bucket in your own AWS account. To determine whether you need metastore-level storage, see [(Optional) Create metastore-level storage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#metastore-storage). \n1. In AWS, create an S3 bucket. \nThis S3 bucket will be the metastore-level storage location for managed tables and managed volumes in Unity Catalog. This storage location can be overridden at the catalog and schema levels. See [Managed storage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#managed-storage) \nRequirements: \n* If you have more than one metastore, you should use a dedicated S3 bucket for each one.\n* Locate the bucket in the same region as the workspaces you want to access the data from.\n* The bucket name cannot include dot notation (for example, `incorrect.bucket.name.notation`). For more bucket naming guidance, see the [AWS bucket naming rules](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/bucketnamingrules.html).\n2. 
Make a note of the S3 bucket path, which starts with `s3:\/\/`.\n3. If you enable KMS encryption on the S3 bucket, make a note of the name of the KMS encryption key.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create a Unity Catalog metastore\n##### Step 2 (Optional): Create an IAM role to access the storage location\n\nIn this step, which is required only if you completed step 1, you create the IAM role required by Unity Catalog to access the S3 bucket that you created in the previous step. \nRole creation is a two-step process. First you simply create the role, adding a *temporary* trust relationship policy that you then modify in a later step. You must modify the trust policy *after* you create the role because your role must be self-assuming\u2014that is, it must be configured to trust itself. The role must therefore exist before you add the self-assumption statement. For information about self-assuming roles, see this [Amazon blog article](https:\/\/aws.amazon.com\/blogs\/security\/announcing-an-update-to-iam-role-trust-policy-behavior\/). \n1. Find your Databricks account ID. \n1. Log in to the Databricks [account console](https:\/\/accounts.cloud.databricks.com).\n2. Click your username.\n3. From the menu, copy the **Account ID** value.\n2. In AWS, create an IAM role with a **Custom Trust Policy**.\n3. In the **Custom Trust Policy** field, paste the following policy JSON, replacing `<DATABRICKS-ACCOUNT-ID>` with the Databricks account ID you found in step 1 (not your AWS account ID). \nThis policy establishes a cross-account trust relationship so that Unity Catalog can assume the role to access the data in the bucket on behalf of Databricks users. This is specified by the ARN in the `Principal` section. It is a static value that references a role created by Databricks. Do not modify it. 
\n```\n{\n\"Version\": \"2012-10-17\",\n\"Statement\": [{\n\"Effect\": \"Allow\",\n\"Principal\": {\n\"AWS\": [\n\"arn:aws:iam::414351767826:role\/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL\"\n]\n},\n\"Action\": \"sts:AssumeRole\",\n\"Condition\": {\n\"StringEquals\": {\n\"sts:ExternalId\": \"<DATABRICKS-ACCOUNT-ID>\"\n}\n}\n}]\n}\n\n``` \nIf you are using [AWS GovCloud](https:\/\/docs.databricks.com\/security\/privacy\/gov-cloud.html), use the policy below: \n```\n{\n\"Version\": \"2012-10-17\",\n\"Statement\": [{\n\"Effect\": \"Allow\",\n\"Principal\": {\n\"AWS\": [\n\"arn:aws-us-gov:iam::044793339203:role\/unity-catalog-prod-UCMasterRole-1QRFA8SGY15OJ\"\n]\n},\n\"Action\": \"sts:AssumeRole\",\n\"Condition\": {\n\"StringEquals\": {\n\"sts:ExternalId\": \"<DATABRICKS-ACCOUNT-ID>\"\n}\n}\n}]\n}\n\n```\n4. Skip the permissions policy configuration. You\u2019ll go back to add that in a later step.\n5. Save the IAM role.\n6. Modify the trust relationship policy to make it \u201cself-assuming.\u201d \n1. Return to your saved IAM role and go to the **Trust Relationships** tab.\n2. Edit the trust relationship policy, adding the following ARN to the \u201cAllow\u201d statement. \nReplace `<YOUR-AWS-ACCOUNT-ID>` and `<THIS-ROLE-NAME>` with your actual IAM role values. \n```\n\"arn:aws:iam::<YOUR-AWS-ACCOUNT-ID>:role\/<THIS-ROLE-NAME>\"\n\n``` \nYour policy should now look like this (with replacement text updated to use your Databricks account ID and IAM role values): \n```\n{\n\"Version\": \"2012-10-17\",\n\"Statement\": [\n{\n\"Effect\": \"Allow\",\n\"Principal\": {\n\"AWS\": [\n\"arn:aws:iam::414351767826:role\/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL\",\n\"arn:aws:iam::<YOUR-AWS-ACCOUNT-ID>:role\/<THIS-ROLE-NAME>\"\n]\n},\n\"Action\": \"sts:AssumeRole\",\n\"Condition\": {\n\"StringEquals\": {\n\"sts:ExternalId\": \"<DATABRICKS-ACCOUNT-ID>\"\n}\n}\n}\n]\n}\n\n```\n7. In AWS, create an IAM policy in the same AWS account as the S3 bucket. 
\nTo avoid unexpected issues, you must use the following sample policy, replacing the following values: \n* `<BUCKET>`: The name of the S3 bucket you created in the previous step.\n* `<KMS-KEY>`: Optional. If encryption is enabled, provide the name of the KMS key that encrypts the S3 bucket contents. **If encryption is disabled, remove the entire KMS section of the IAM policy.**\n* `<AWS-ACCOUNT-ID>`: The Account ID of the current AWS account (not your Databricks account).\n* `<AWS-IAM-ROLE-NAME>`: The name of the AWS IAM role that you created in the previous step.\n```\n{\n\"Version\": \"2012-10-17\",\n\"Statement\": [\n{\n\"Action\": [\n\"s3:GetObject\",\n\"s3:PutObject\",\n\"s3:DeleteObject\",\n\"s3:ListBucket\",\n\"s3:GetBucketLocation\"\n],\n\"Resource\": [\n\"arn:aws:s3:::<BUCKET>\/*\",\n\"arn:aws:s3:::<BUCKET>\"\n],\n\"Effect\": \"Allow\"\n},\n{\n\"Action\": [\n\"kms:Decrypt\",\n\"kms:Encrypt\",\n\"kms:GenerateDataKey*\"\n],\n\"Resource\": [\n\"arn:aws:kms:<KMS-KEY>\"\n],\n\"Effect\": \"Allow\"\n},\n{\n\"Action\": [\n\"sts:AssumeRole\"\n],\n\"Resource\": [\n\"arn:aws:iam::<AWS-ACCOUNT-ID>:role\/<AWS-IAM-ROLE-NAME>\"\n],\n\"Effect\": \"Allow\"\n}\n]\n}\n\n``` \nNote \nIf you need a more restrictive IAM policy for Unity Catalog, contact your Databricks representative for assistance.\n8. Attach the IAM policy to the IAM role. \nOn the IAM role\u2019s **Permissions** tab, attach the IAM policy that you just created.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create a Unity Catalog metastore\n##### Step 3: Create the metastore and attach a workspace\n\nEach Databricks region requires its own Unity Catalog metastore. \nYou create a metastore for each region in which your organization operates. You can link each of these regional metastores to any number of workspaces in that region. 
Each linked workspace has the same view of the data in the metastore, and data access control can be managed across workspaces. You can access data in other metastores using [Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html). \nIf you chose to create metastore-level storage, the metastore will use the S3 bucket and IAM role that you created in the previous steps. \nTo create a metastore: \n1. Log in to the Databricks [account console](https:\/\/accounts.cloud.databricks.com\/).\n2. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. Click **Create metastore**.\n4. Enter the following: \n* A name for the metastore.\n* The region where you want to deploy the metastore. \nThis must be in the same region as the workspaces you want to use to access the data. Make sure that this matches the region of the storage bucket you created earlier.\n* (Optional) The S3 bucket path (you can omit `s3:\/\/`) and IAM role name for the bucket and role you created in [Step 1 (Optional): Create an S3 bucket for metastore-level managed storage in AWS](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html#cloud-tenant-setup-aws).\n5. Click **Create**.\n6. When prompted, select workspaces to link to the metastore. \nFor details, see [Enable a workspace for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html).\n7. Transfer the metastore admin role to a group. \nThe user who creates a metastore is its owner, also called the metastore admin. The metastore admin can create top-level objects in the metastore such as catalogs and can manage access to tables and other objects. Databricks recommends that you reassign the metastore admin role to a group. See [Assign a metastore admin](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html#assign-metastore-admin).\n8. 
Enable Databricks management of uploads to managed volumes. \nDatabricks uses cross-origin resource sharing (CORS) to upload data to [managed volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html) in Unity Catalog. See [Configure Unity Catalog storage account for CORS](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/storage-cors.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create a Unity Catalog metastore\n##### Next steps\n\n* [Create and manage catalogs](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html)\n* [Create and manage schemas (databases)](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-schemas.html)\n* [Create tables in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html)\n* Learn more about [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create a Unity Catalog metastore\n##### Add managed storage to an existing metastore\n\nMetastore-level managed storage is optional, and it is not included for metastores that were created automatically. You might want to add metastore-level storage to your metastore if you prefer a data isolation model that stores data centrally for multiple workspaces. You need metastore-level storage if you want to share notebooks using Delta Sharing or if you are a Databricks partner who uses personal staging locations. \nSee also [Managed storage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#managed-storage). 
\n### Requirements \n* You must have at least one workspace attached to the Unity Catalog metastore.\n* Databricks permissions required: \n+ To create an external location, you must be a metastore admin or user with the `CREATE EXTERNAL LOCATION` and `CREATE STORAGE CREDENTIAL` privileges.\n+ To add the storage location to the metastore definition, you must be an account admin.\n* AWS permissions required: the ability to create S3 buckets, IAM roles, IAM policies, and cross-account trust relationships. \n### Step 1: Create the storage location \nFollow the instructions in [Step 1 (Optional): Create an S3 bucket for metastore-level managed storage in AWS](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html#cloud-tenant-setup-aws) to create a dedicated S3 bucket in an AWS account in the same region as your metastore. \n### Step 2: Create an external location in Unity Catalog \nIn this step, you create an external location in Unity Catalog that represents the bucket that you just created. \n1. Open a workspace that is attached to the metastore.\n2. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** to open Catalog Explorer.\n3. Click the **+ Add** button and select **Add an external location**.\n4. On the **Create a new external location** dialog, click **AWS Quickstart (Recommended)** and click **Next**. \nThe AWS Quickstart configures the external location and creates a storage credential for you. If you choose to use the **Manual** option, you must manually create an IAM role that gives access to the S3 bucket and create the storage credential in Databricks yourself.\n5. On the **Create external location with Quickstart** dialog, enter the path to the S3 bucket in the **Bucket Name** field.\n6. Click **Generate new token** to generate the personal access token that you will use to authenticate between Databricks and your AWS account.\n7. Copy the token and click **Launch in Quickstart**.\n8. 
In the AWS CloudFormation template that launches (labeled **Quick create stack**), paste the token into the **Databricks Account Credentials** field.\n9. Accept the terms at the bottom of the page (**I acknowledge that AWS CloudFormation might create IAM resources with custom names**).\n10. Click **Create stack**. \nIt may take a few minutes for the CloudFormation template to finish creating the external location object in Databricks.\n11. Return to your Databricks workspace and go to the **External locations** pane in **Catalog Explorer**. \nIn the left pane of Catalog Explorer, scroll down and click **External Data > External Locations**.\n12. Confirm that a new external location has been created. \nAutomatically-generated external locations use the naming syntax `db_s3_external_databricks-S3-ingest-<id>`.\n13. Grant yourself the `CREATE MANAGED STORAGE` privilege on the external location. \n1. Click the external location name to open the details pane.\n2. On the **Permissions** tab, click **Grant**.\n3. On the **Grant on `<external location>`** dialog, select yourself in the **Principals** field and select `CREATE MANAGED STORAGE`.\n4. Click **Grant**. \n### Step 3: Add the storage location to the metastore \nAfter you have created an external location that represents the metastore storage bucket, you can add it to the metastore. \n1. As an account admin, log in to the [account console](https:\/\/accounts.cloud.databricks.com).\n2. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. Click the metastore name.\n4. Confirm that you are the **Metastore Admin**. \nIf you are not, click **Edit** and assign yourself as the metastore admin. You can unassign yourself when you are done with this procedure.\n5. On the **Configuration** tab, next to **S3 bucket path**, click **Set**.\n6. On the **Set metastore root** dialog, enter the S3 bucket path that you used to create the external location, and click **Update**. 
\nYou cannot modify this path once you set it.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create a Unity Catalog metastore\n##### Delete a metastore\n\nIf you are closing your Databricks account or have another reason to delete access to data managed by your Unity Catalog metastore, you can delete the metastore. \nWarning \nAll objects managed by the metastore will become inaccessible using Databricks workspaces. This action cannot be undone. \n[Managed table](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#managed-table) data and metadata will be auto-deleted after 30 days. External table data in your cloud storage is not affected by metastore deletion. \nTo delete a metastore: \n1. As a metastore admin, log in to the [account console](https:\/\/accounts.cloud.databricks.com).\n2. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. Click the metastore name.\n4. On the **Configuration** tab, click the three-button menu at the far upper right and select **Delete**.\n5. On the confirmation dialog, enter the name of the metastore and click **Delete**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n### What are workspace files?\n##### Work with Python and R modules\n\nThis article describes how you can use relative paths to import custom Python and R modules stored in workspace files alongside your Databricks notebooks. 
Workspace files can facilitate tighter development lifecycles, allowing you to [modularize your code](https:\/\/docs.databricks.com\/files\/workspace-modules.html#refactor-code), [convert %run commands to import statements](https:\/\/docs.databricks.com\/files\/workspace-modules.html#migrate-run), and [refactor Python wheel files to co-versioned modules](https:\/\/docs.databricks.com\/files\/workspace-modules.html#migrate-whl). You can also use the built-in Databricks [web terminal to test your code](https:\/\/docs.databricks.com\/files\/workspace-modules.html#terminal-test). \nNote \nIn Databricks Runtime 14.0 and above, the default current working directory (CWD) for code executed locally is the directory containing the notebook or script being run. This is a change in behavior from Databricks Runtime 13.3 LTS and below. See [What is the default current working directory?](https:\/\/docs.databricks.com\/files\/cwd-dbr-14.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/workspace-modules.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n### What are workspace files?\n##### Work with Python and R modules\n###### Import Python and R modules\n\nImportant \nIn Databricks Runtime 13.3 LTS and above, directories added to the Python `sys.path`, or directories that are structured as [Python packages](https:\/\/docs.python.org\/3\/tutorial\/modules.html#packages), are automatically distributed to all executors in the cluster. In Databricks Runtime 12.2 LTS and below, libraries added to the `sys.path` must be explicitly installed on executors. \nIn Databricks Runtime 11.3 LTS and above, the current working directory of your notebook is automatically added to the Python path. If you\u2019re using Git folders, the root repo directory is added. \nTo import modules from another directory, you must add the directory containing the module to `sys.path`. 
You can specify directories using a relative path, as in the following example: \n```\nimport sys\nimport os\nsys.path.append(os.path.abspath('..'))\n\n``` \nYou import functions from a module stored in workspace files just as you would from a module saved as a cluster library or notebook-scoped library: \n```\nfrom sample import power\npower.powerOfTwo(3)\n\n``` \n```\nsource(\"sample.R\")\npower.powerOfTwo(3)\n\n``` \nImportant \nWhen you use an `import` statement, Databricks follows a set precedence if multiple libraries of the same name exist. See [Python library precedence](https:\/\/docs.databricks.com\/libraries\/index.html#precedence).\n\n##### Work with Python and R modules\n###### Autoreload for Python modules\n\nIf you are editing multiple files while developing Python code, you can use the following commands in any notebook cell or python file to force a reload of all modules: \n```\n%load_ext autoreload\n%autoreload 2\n\n``` \nNote that autoreload only works on the driver and does not reload code into the executor for UDFs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/workspace-modules.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n### What are workspace files?\n##### Work with Python and R modules\n###### Refactor code\n\nA best practice for code development is to modularize code so it can be easily reused. You can create custom Python files with workspace files and make the code in those files available to a notebook using the `import` statement. \nTo refactor notebook code into reusable files: \n1. Create a new source code file for your code.\n2. 
Add Python import statements to the notebook to make the code in your new file available to the notebook.\n\n##### Work with Python and R modules\n###### Migrate from `%run` commands\n\nIf you are using `%run` commands to make Python or R functions defined in a notebook available to another notebook, or are installing custom `.whl` files on a cluster, consider including those custom modules as workspace files. In this way, you can keep your notebooks and other code modules in sync, ensuring that your notebook always uses the correct version. \n`%run` commands let you include one notebook within another and are often used to make supporting Python or R code available to a notebook. In this example, a notebook named `power.py` includes the code below. \n```\n# This code is in a notebook named \"power.py\".\ndef n_to_mth(n, m):\n    print(n, \"to the\", m, \"th power is\", n**m)\n\n``` \nYou can then make functions defined in `power.py` available to a different notebook with a `%run` command: \n```\n# This notebook uses a %run command to access the code in \"power.py\".\n%run .\/power\nn_to_mth(3, 4)\n\n``` \nUsing workspace files, you can directly import the module that contains the Python code and run the function. \n```\nfrom power import n_to_mth\nn_to_mth(3, 4)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/workspace-modules.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n### What are workspace files?\n##### Work with Python and R modules\n###### Refactor Python `.whl` files to relative libraries\n\nYou can install custom `.whl` files onto a cluster and then import them into a notebook attached to that cluster. For code that is frequently updated, this process might be cumbersome and error-prone. Workspace files let you keep these Python files in the same directory with the notebooks that use the code, ensuring that your notebook always uses the correct version. 
\nFor more information about packaging Python projects, see this [tutorial](https:\/\/packaging.python.org\/tutorials\/packaging-projects\/).\n\n##### Work with Python and R modules\n###### Use Databricks web terminal for testing\n\nYou can use Databricks web terminal to test modifications to your Python or R code without having to import the file to a notebook and execute the notebook. \n1. Open [web terminal](https:\/\/docs.databricks.com\/compute\/web-terminal.html).\n2. Change to the directory: `cd \/Workspace\/Users\/<path-to-directory>\/`.\n3. Run the Python or R file: `python file_name.py` or `Rscript file_name.r`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/workspace-modules.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Work with features in Workspace Feature Store\n##### Control access to feature tables\n\nThis article describes how to control access to feature tables in workspaces that are not enabled for Unity Catalog. If your workspace is enabled for Unity Catalog, use [Unity Catalog privileges](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html) instead. \nYou can configure Feature Store access control to grant fine-grained permissions on feature table metadata. You can control a user\u2019s ability to view a feature table in the UI, edit its description, manage other users\u2019 permissions on the table, and delete the table. \nNote \nFeature Store access control does not govern access to the underlying [Delta table](https:\/\/docs.databricks.com\/delta\/index.html), which is governed by [table access control](https:\/\/docs.databricks.com\/data-governance\/table-acls\/index.html). \nYou can assign three permission levels to feature table metadata: CAN VIEW METADATA, CAN EDIT METADATA, and CAN MANAGE. Any user can create a new feature table. The table lists the abilities for each permission. 
\n| Ability | CAN VIEW METADATA | CAN EDIT METADATA | CAN MANAGE |\n| --- | --- | --- | --- |\n| Read feature table | X | X | X |\n| Search feature table | X | X | X |\n| Publish feature table to online store | X | X | X |\n| Write features to feature table | | X | X |\n| Update description of feature table | | X | X |\n| Modify permissions on feature table | | | X |\n| Delete feature table | | | X | \nBy default, when a feature table is created: \n* The creator has CAN MANAGE permission\n* Workspace admins have CAN MANAGE permission\n* Other users have NO PERMISSIONS\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/access-control.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### Work with features in Workspace Feature Store\n##### Control access to feature tables\n###### Configure permissions for a feature table\n\n1. On the feature table page, click the arrow to the right of the name of the feature table and select **Permissions**. If you do not have CAN MANAGE permission for the feature table, you will not see this option. \n![Select permissions from drop-down menu](https:\/\/docs.databricks.com\/_images\/feature-store-permissions.png)\n2. Edit the permissions and click **Save**.\n\n##### Control access to feature tables\n###### Configure permissions for all feature tables in Feature Store\n\nWorkspace administrators can use the Feature Store UI to set permission levels on all feature tables for specific users or groups. \nNote \n* A user with CAN MANAGE permission for the Feature Store can change Feature Store permissions for all other users.\n* Permissions set from the feature store page also apply to all future feature tables. \n1. On the feature store page, click **Permissions**. This button is only available for workspace administrators and users with CAN MANAGE permission for the Feature Store. 
\n![Drop-down menu where you select permissions](https:\/\/docs.databricks.com\/_images\/feature-store-wide-permissions.png)\n2. Edit the permissions and click **Save**. \nPermissions set on the Feature Store page can only be removed from that page. On the feature table page, you can override settings from the Feature Store page to add permissions, but you cannot set more restrictive permissions. \nWhen you navigate to a specific feature table page, permissions set from the feature store page are marked \u201cSome permissions cannot be removed because they are inherited\u201d.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/access-control.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n#### Configure encryption for S3 with KMS\n\nThis article covers how to configure server-side encryption with a KMS key for writing files in `s3a:\/\/` paths. To encrypt your workspace\u2019s root S3 bucket, see [Customer-managed keys for workspace storage](https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html#workspace-storage).\n\n#### Configure encryption for S3 with KMS\n##### Step 1: Configure an instance profile\n\nIn Databricks, create an [instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html).\n\n#### Configure encryption for S3 with KMS\n##### Step 2: Add the instance profile as a key user for the KMS key provided in the configuration\n\n1. In AWS, go to the KMS service.\n2. Click the key that you want to add permission to.\n3. In the Key Users section, click **Add**.\n4. Select the checkbox next to the IAM role.\n5. 
Click **Add**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/kms-s3.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n#### Configure encryption for S3 with KMS\n##### Step 3: Set up encryption properties\n\nSet up global KMS encryption properties in a [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration) setting or using an [init script](https:\/\/docs.databricks.com\/init-scripts\/index.html).\nConfigure the `spark.hadoop.fs.s3a.server-side-encryption.key` key with your own key ARN. \n### Spark configuration \n```\nspark.hadoop.fs.s3a.server-side-encryption.key arn:aws:kms:<region>:<aws-account-id>:key\/<bbbbbbbb-ddd-ffff-aaa-bdddddddddd>\nspark.hadoop.fs.s3a.server-side-encryption-algorithm SSE-KMS\n\n``` \nYou can also configure per-bucket KMS encryption. For example, you can configure each bucket individually using the following keys: \n```\n# Set up authentication and endpoint for a specific bucket\nspark.hadoop.fs.s3a.bucket.<bucket-name>.aws.credentials.provider <aws-credentials-provider-class>\nspark.hadoop.fs.s3a.bucket.<bucket-name>.endpoint <aws-endpoint>\n\n# Configure a different KMS encryption key for a specific bucket\nspark.hadoop.fs.s3a.bucket.<bucket-name>.server-side-encryption.key <aws-kms-encryption-key>\n\n``` \nFor more information, see [Per-bucket configuration](https:\/\/docs.databricks.com\/connect\/storage\/amazon-s3.html#per-bucket-configuration). \n### Init script \nConfigure the global encryption setting by running the following code in a notebook cell to create the init script `set-kms.sh` and [configure a cluster](https:\/\/docs.databricks.com\/init-scripts\/index.html) to run the script. 
\n```\ndbutils.fs.put(\"\/databricks\/scripts\/set-kms.sh\", \"\"\"\n#!\/bin\/bash\n\ncat >\/databricks\/driver\/conf\/aes-encrypt-custom-spark-conf.conf <<EOL\n[driver] {\n\"spark.hadoop.fs.s3a.server-side-encryption.key\" = \"arn:aws:kms:<region>:<aws-account-id>:key\/<bbbbbbbb-ddd-ffff-aaa-bdddddddddd>\"\n\"spark.hadoop.fs.s3a.server-side-encryption-algorithm\" = \"SSE-KMS\"\n}\nEOL\n\"\"\", True)\n\n``` \nOnce you verify that encryption is working, configure encryption on all clusters by adding a cluster-scoped init script to cluster policies.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/kms-s3.html"} +{"content":"# What is Delta Lake?\n### Tutorial: Delta Lake\n\nThis tutorial introduces common Delta Lake operations on Databricks, including the following: \n* [Create a table.](https:\/\/docs.databricks.com\/delta\/tutorial.html#create)\n* [Upsert to a table.](https:\/\/docs.databricks.com\/delta\/tutorial.html#upsert)\n* [Read from a table.](https:\/\/docs.databricks.com\/delta\/tutorial.html#read)\n* [Display table history.](https:\/\/docs.databricks.com\/delta\/tutorial.html#display-history)\n* [Query an earlier version of a table.](https:\/\/docs.databricks.com\/delta\/tutorial.html#time-travel)\n* [Optimize a table.](https:\/\/docs.databricks.com\/delta\/tutorial.html#optimize)\n* [Add a Z-order index.](https:\/\/docs.databricks.com\/delta\/tutorial.html#z-order)\n* [Vacuum unreferenced files.](https:\/\/docs.databricks.com\/delta\/tutorial.html#vacuum) \nYou can run the example Python, Scala, and SQL code in this article from within a [notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html) attached to a Databricks compute resource such as a [cluster](https:\/\/docs.databricks.com\/compute\/index.html). 
You can also run the SQL code in this article from within a [query](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html) associated with a [SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html) in [Databricks SQL](https:\/\/docs.databricks.com\/sql\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tutorial.html"} +{"content":"# What is Delta Lake?\n### Tutorial: Delta Lake\n#### Prepare the source data\n\nThis tutorial relies on a dataset called People 10 M. It contains 10 million fictitious records that hold facts about people, like first and last names, date of birth, and salary. This tutorial assumes that this dataset is in a Unity Catalog [volume](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html) that is associated with your target Databricks workspace. \nTo get the People 10 M dataset for this tutorial, do the following: \n1. Go to the [People 10 M](https:\/\/www.kaggle.com\/datasets\/asthamular\/people-10-m) page in Kaggle.\n2. Click **Download** to download a file named `archive.zip` to your local machine.\n3. Extract the file named `export.csv` from the `archive.zip` file. The `export.csv` file contains the data for this tutorial. \nTo upload the `export.csv` file into the volume, do the following: \n1. On the sidebar, click **Catalog**.\n2. In **Catalog Explorer**, browse to and open the volume where you want to upload the `export.csv` file.\n3. Click **Upload to this volume**.\n4. Drag and drop, or browse to and select, the `export.csv` file on your local machine.\n5. Click **Upload**. \nIn the following code examples, replace `\/Volumes\/main\/default\/my-volume\/export.csv` with the path to the `export.csv` file in your target volume.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tutorial.html"} +{"content":"# What is Delta Lake?\n### Tutorial: Delta Lake\n#### Create a table\n\nAll tables created on Databricks use Delta Lake by default. 
Databricks recommends using Unity Catalog managed tables. \nIn the following code examples, replace the table name `main.default.people_10m` with your target three-part catalog, schema, and table name in Unity Catalog. \nNote \nDelta Lake is the default for all reads, writes, and table creation commands on Databricks. \n```\nfrom pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType\n\nschema = StructType([\nStructField(\"id\", IntegerType(), True),\nStructField(\"firstName\", StringType(), True),\nStructField(\"middleName\", StringType(), True),\nStructField(\"lastName\", StringType(), True),\nStructField(\"gender\", StringType(), True),\nStructField(\"birthDate\", TimestampType(), True),\nStructField(\"ssn\", StringType(), True),\nStructField(\"salary\", IntegerType(), True)\n])\n\ndf = spark.read.format(\"csv\").option(\"header\", True).schema(schema).load(\"\/Volumes\/main\/default\/my-volume\/export.csv\")\n\n# Create the table if it does not exist. Otherwise, replace the existing table.\ndf.writeTo(\"main.default.people_10m\").createOrReplace()\n\n# If you know the table does not already exist, you can call this instead:\n# df.saveAsTable(\"main.default.people_10m\")\n\n``` \n```\nimport org.apache.spark.sql.types._\n\nval schema = StructType(Array(\nStructField(\"id\", IntegerType, nullable = true),\nStructField(\"firstName\", StringType, nullable = true),\nStructField(\"middleName\", StringType, nullable = true),\nStructField(\"lastName\", StringType, nullable = true),\nStructField(\"gender\", StringType, nullable = true),\nStructField(\"birthDate\", TimestampType, nullable = true),\nStructField(\"ssn\", StringType, nullable = true),\nStructField(\"salary\", IntegerType, nullable = true)\n))\n\nval df = spark.read.format(\"csv\").option(\"header\", \"true\").schema(schema).load(\"\/Volumes\/main\/default\/my-volume\/export.csv\")\n\n\/\/ Create the table if it does not exist. 
Otherwise, replace the existing table.\ndf.writeTo(\"main.default.people_10m\").createOrReplace()\n\n\/\/ If you know that the table doesn't exist, call this instead:\n\/\/ df.saveAsTable(\"main.default.people_10m\")\n\n``` \n```\nCREATE OR REPLACE TABLE main.default.people_10m (\nid INT,\nfirstName STRING,\nmiddleName STRING,\nlastName STRING,\ngender STRING,\nbirthDate TIMESTAMP,\nssn STRING,\nsalary INT\n);\n\nCOPY INTO main.default.people_10m\nFROM '\/Volumes\/main\/default\/my-volume\/export.csv'\nFILEFORMAT = CSV\nFORMAT_OPTIONS ( 'header' = 'true', 'inferSchema' = 'true' );\n\n``` \nThe preceding operations create a new managed table. For information about available options when you create a Delta table, see [CREATE TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-table.html). \nIn Databricks Runtime 13.3 LTS and above, you can use [CREATE TABLE LIKE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-table-like.html) to create a new empty Delta table that duplicates the schema and table properties for a source Delta table. This can be especially useful when promoting tables from a development environment into production, as shown in the following code example: \n```\nCREATE TABLE main.default.people_10m_prod LIKE main.default.people_10m\n\n``` \nTo create an empty table, you can also use the `DeltaTableBuilder` API in Delta Lake for [Python](https:\/\/docs.delta.io\/latest\/api\/python\/spark\/index.html) and [Scala](https:\/\/docs.delta.io\/latest\/api\/scala\/spark\/io\/delta\/tables\/DeltaTableBuilder.html). Compared to equivalent DataFrameWriter APIs, these APIs make it easier to specify additional information like column comments, table properties, and [generated columns](https:\/\/docs.databricks.com\/delta\/generated-columns.html). \nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
\n```\nDeltaTable.createIfNotExists(spark)\n.tableName(\"main.default.people_10m\")\n.addColumn(\"id\", \"INT\")\n.addColumn(\"firstName\", \"STRING\")\n.addColumn(\"middleName\", \"STRING\")\n.addColumn(\"lastName\", \"STRING\", comment = \"surname\")\n.addColumn(\"gender\", \"STRING\")\n.addColumn(\"birthDate\", \"TIMESTAMP\")\n.addColumn(\"ssn\", \"STRING\")\n.addColumn(\"salary\", \"INT\")\n.execute()\n\n``` \n```\nDeltaTable.createOrReplace(spark)\n.tableName(\"main.default.people_10m\")\n.addColumn(\"id\", \"INT\")\n.addColumn(\"firstName\", \"STRING\")\n.addColumn(\"middleName\", \"STRING\")\n.addColumn(\nDeltaTable.columnBuilder(\"lastName\")\n.dataType(\"STRING\")\n.comment(\"surname\")\n.build())\n.addColumn(\"gender\", \"STRING\")\n.addColumn(\"birthDate\", \"TIMESTAMP\")\n.addColumn(\"ssn\", \"STRING\")\n.addColumn(\"salary\", \"INT\")\n.execute()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tutorial.html"} +{"content":"# What is Delta Lake?\n### Tutorial: Delta Lake\n#### Upsert to a table\n\nTo merge a set of updates and insertions into an existing Delta table, you use the `DeltaTable.merge` method for [Python](https:\/\/docs.delta.io\/latest\/api\/python\/spark\/index.html) and [Scala](https:\/\/docs.delta.io\/latest\/api\/scala\/spark\/io\/delta\/tables\/DeltaTable.html), and the [MERGE INTO](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-merge-into.html) statement for SQL. The following example takes data from the source table and merges it into the target Delta table. When there is a matching row in both tables, Delta Lake updates the data column using the given expression. When there is no matching row, Delta Lake adds a new row. This operation is known as an *upsert*. 
\n```\nfrom pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType\nfrom datetime import date\n\nschema = StructType([\nStructField(\"id\", IntegerType(), True),\nStructField(\"firstName\", StringType(), True),\nStructField(\"middleName\", StringType(), True),\nStructField(\"lastName\", StringType(), True),\nStructField(\"gender\", StringType(), True),\nStructField(\"birthDate\", DateType(), True),\nStructField(\"ssn\", StringType(), True),\nStructField(\"salary\", IntegerType(), True)\n])\n\ndata = [\n(9999998, 'Billy', 'Tommie', 'Luppitt', 'M', date.fromisoformat('1992-09-17'), '953-38-9452', 55250),\n(9999999, 'Elias', 'Cyril', 'Leadbetter', 'M', date.fromisoformat('1984-05-22'), '906-51-2137', 48500),\n(10000000, 'Joshua', 'Chas', 'Broggio', 'M', date.fromisoformat('1968-07-22'), '988-61-6247', 90000),\n(20000001, 'John', '', 'Doe', 'M', date.fromisoformat('1978-01-14'), '345-67-8901', 55500),\n(20000002, 'Mary', '', 'Smith', 'F', date.fromisoformat('1982-10-29'), '456-78-9012', 98250),\n(20000003, 'Jane', '', 'Doe', 'F', date.fromisoformat('1981-06-25'), '567-89-0123', 89900)\n]\n\npeople_10m_updates = spark.createDataFrame(data, schema)\npeople_10m_updates.createTempView(\"people_10m_updates\")\n\n# ...\n\nfrom delta.tables import DeltaTable\n\ndeltaTable = DeltaTable.forName(spark, 'main.default.people_10m')\n\n(deltaTable.alias(\"people_10m\")\n.merge(\npeople_10m_updates.alias(\"people_10m_updates\"),\n\"people_10m.id = people_10m_updates.id\")\n.whenMatchedUpdateAll()\n.whenNotMatchedInsertAll()\n.execute()\n)\n\n``` \n```\nimport org.apache.spark.sql.types._\nimport org.apache.spark.sql.Row\nimport java.sql.Timestamp\n\nval schema = StructType(Array(\nStructField(\"id\", IntegerType, nullable = true),\nStructField(\"firstName\", StringType, nullable = true),\nStructField(\"middleName\", StringType, nullable = true),\nStructField(\"lastName\", StringType, nullable = true),\nStructField(\"gender\", StringType, nullable = 
true),\nStructField(\"birthDate\", TimestampType, nullable = true),\nStructField(\"ssn\", StringType, nullable = true),\nStructField(\"salary\", IntegerType, nullable = true)\n))\n\nval data = Seq(\nRow(9999998, \"Billy\", \"Tommie\", \"Luppitt\", \"M\", Timestamp.valueOf(\"1992-09-17 00:00:00\"), \"953-38-9452\", 55250),\nRow(9999999, \"Elias\", \"Cyril\", \"Leadbetter\", \"M\", Timestamp.valueOf(\"1984-05-22 00:00:00\"), \"906-51-2137\", 48500),\nRow(10000000, \"Joshua\", \"Chas\", \"Broggio\", \"M\", Timestamp.valueOf(\"1968-07-22 00:00:00\"), \"988-61-6247\", 90000),\nRow(20000001, \"John\", \"\", \"Doe\", \"M\", Timestamp.valueOf(\"1978-01-14 00:00:00\"), \"345-67-8901\", 55500),\nRow(20000002, \"Mary\", \"\", \"Smith\", \"F\", Timestamp.valueOf(\"1982-10-29 00:00:00\"), \"456-78-9012\", 98250),\nRow(20000003, \"Jane\", \"\", \"Doe\", \"F\", Timestamp.valueOf(\"1981-06-25 00:00:00\"), \"567-89-0123\", 89900)\n)\n\nval people_10m_updates = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)\npeople_10m_updates.createOrReplaceTempView(\"people_10m_updates\")\n\n\/\/ ...\n\nimport io.delta.tables.DeltaTable\n\nval deltaTable = DeltaTable.forName(spark, \"main.default.people_10m\")\n\ndeltaTable.as(\"people_10m\")\n.merge(\npeople_10m_updates.as(\"people_10m_updates\"),\n\"people_10m.id = people_10m_updates.id\"\n)\n.whenMatched()\n.updateAll()\n.whenNotMatched()\n.insertAll()\n.execute()\n\n``` \n```\nCREATE OR REPLACE TEMP VIEW people_10m_updates (\nid, firstName, middleName, lastName, gender, birthDate, ssn, salary\n) AS VALUES\n(9999998, 'Billy', 'Tommie', 'Luppitt', 'M', '1992-09-17T04:00:00.000+0000', '953-38-9452', 55250),\n(9999999, 'Elias', 'Cyril', 'Leadbetter', 'M', '1984-05-22T04:00:00.000+0000', '906-51-2137', 48500),\n(10000000, 'Joshua', 'Chas', 'Broggio', 'M', '1968-07-22T04:00:00.000+0000', '988-61-6247', 90000),\n(20000001, 'John', '', 'Doe', 'M', '1978-01-14T04:00:00.000+0000', '345-67-8901', 55500),\n(20000002, 'Mary', '', 
'Smith', 'F', '1982-10-29T01:00:00.000+0000', '456-78-9012', 98250),\n(20000003, 'Jane', '', 'Doe', 'F', '1981-06-25T04:00:00.000+0000', '567-89-0123', 89900);\n\nMERGE INTO people_10m\nUSING people_10m_updates\nON people_10m.id = people_10m_updates.id\nWHEN MATCHED THEN UPDATE SET *\nWHEN NOT MATCHED THEN INSERT *;\n\n``` \nIn SQL, if you specify `*`, this updates or inserts all columns in the target table, assuming that the source table has the same columns as the target table. If the source table doesn\u2019t have the same columns, the query throws an analysis error. \nYou must specify a value for every column in your table when you perform an insert operation (for example, when there is no matching row in the existing dataset). However, you do not need to update all values. \nTo see the results, query the table. \n```\ndf = spark.read.table(\"main.default.people_10m\")\ndf_filtered = df.filter(df[\"id\"] >= 9999998)\ndisplay(df_filtered)\n\n``` \n```\nval df = spark.read.table(\"main.default.people_10m\")\nval df_filtered = df.filter($\"id\" >= 9999998)\ndisplay(df_filtered)\n\n``` \n```\nSELECT * FROM main.default.people_10m WHERE id >= 9999998\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tutorial.html"} +{"content":"# What is Delta Lake?\n### Tutorial: Delta Lake\n#### Read a table\n\nYou access data in Delta tables by the table name or the table path, as shown in the following examples: \n```\npeople_df = spark.read.table(\"main.default.people_10m\")\ndisplay(people_df)\n\n``` \n```\nval people_df = spark.read.table(\"main.default.people_10m\")\ndisplay(people_df)\n\n``` \n```\nSELECT * FROM main.default.people_10m;\n\n```\n\n### Tutorial: Delta Lake\n#### Write to a table\n\nDelta Lake uses standard syntax for writing data to tables. 
\nTo atomically add new data to an existing Delta table, use the append mode as shown in the following examples: \n```\ndf.write.mode(\"append\").saveAsTable(\"main.default.people_10m\")\n\n``` \n```\ndf.write.mode(\"append\").saveAsTable(\"main.default.people_10m\")\n\n``` \n```\nINSERT INTO main.default.people_10m SELECT * FROM main.default.more_people\n\n``` \nTo replace all the data in a table, use the overwrite mode as in the following examples: \n```\ndf.write.mode(\"overwrite\").saveAsTable(\"main.default.people_10m\")\n\n``` \n```\ndf.write.mode(\"overwrite\").saveAsTable(\"main.default.people_10m\")\n\n``` \n```\nINSERT OVERWRITE TABLE main.default.people_10m SELECT * FROM main.default.more_people\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tutorial.html"} +{"content":"# What is Delta Lake?\n### Tutorial: Delta Lake\n#### Update a table\n\nYou can update data that matches a predicate in a Delta table. For example, in the example `people_10m` table, to change an abbreviation in the `gender` column from `M` or `F` to `Male` or `Female`, you can run the following: \n```\nfrom delta.tables import *\nfrom pyspark.sql.functions import *\n\ndeltaTable = DeltaTable.forName(spark, \"main.default.people_10m\")\n\n# Declare the predicate by using a SQL-formatted string.\ndeltaTable.update(\ncondition = \"gender = 'F'\",\nset = { \"gender\": \"'Female'\" }\n)\n\n# Declare the predicate by using Spark SQL functions.\ndeltaTable.update(\ncondition = col('gender') == 'M',\nset = { 'gender': lit('Male') }\n)\n\n``` \n```\nimport io.delta.tables._\n\nval deltaTable = DeltaTable.forName(spark, \"main.default.people_10m\")\n\n\/\/ Declare the predicate by using a SQL-formatted string.\ndeltaTable.updateExpr(\n\"gender = 'F'\",\nMap(\"gender\" -> \"'Female'\")\n)\n\nimport org.apache.spark.sql.functions._\nimport spark.implicits._\n\n\/\/ Declare the predicate by using Spark SQL functions and implicits.\ndeltaTable.update(\ncol(\"gender\") === 
\"M\",\nMap(\"gender\" -> lit(\"Male\")));\n\n``` \n```\nUPDATE main.default.people_10m SET gender = 'Female' WHERE gender = 'F';\nUPDATE main.default.people_10m SET gender = 'Male' WHERE gender = 'M';\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tutorial.html"} +{"content":"# What is Delta Lake?\n### Tutorial: Delta Lake\n#### Delete from a table\n\nYou can remove data that matches a predicate from a Delta table. For instance, in the example `people_10m` table, to delete all rows corresponding to people with a value in the `birthDate` column from before `1955`, you can run the following: \n```\nfrom delta.tables import *\nfrom pyspark.sql.functions import *\n\ndeltaTable = DeltaTable.forName(spark, \"main.default.people_10m\")\n\n# Declare the predicate by using a SQL-formatted string.\ndeltaTable.delete(\"birthDate < '1955-01-01'\")\n\n# Declare the predicate by using Spark SQL functions.\ndeltaTable.delete(col('birthDate') < '1955-01-01')\n\n``` \n```\nimport io.delta.tables._\n\nval deltaTable = DeltaTable.forName(spark, \"main.default.people_10m\")\n\n\/\/ Declare the predicate by using a SQL-formatted string.\ndeltaTable.delete(\"birthDate < '1955-01-01'\")\n\nimport org.apache.spark.sql.functions._\nimport spark.implicits._\n\n\/\/ Declare the predicate by using Spark SQL functions and implicits.\ndeltaTable.delete(col(\"birthDate\") < \"1955-01-01\")\n\n``` \n```\nDELETE FROM main.default.people_10m WHERE birthDate < '1955-01-01'\n\n``` \nImportant \nDeletion removes the data from the latest version of the Delta table but does not remove it from the physical storage until the old versions are explicitly vacuumed. 
See [vacuum](https:\/\/docs.databricks.com\/delta\/vacuum.html) for details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tutorial.html"} +{"content":"# What is Delta Lake?\n### Tutorial: Delta Lake\n#### Display table history\n\nTo view the history of a table, you use the `DeltaTable.history` method for [Python](https:\/\/docs.delta.io\/latest\/api\/python\/spark\/index.html) and [Scala](https:\/\/docs.delta.io\/latest\/api\/scala\/spark\/io\/delta\/tables\/DeltaTable.html), and the [DESCRIBE HISTORY](https:\/\/docs.databricks.com\/delta\/history.html) statement in SQL, which provides provenance information, including the table version, operation, user, and so on, for each write to a table. \n```\nfrom delta.tables import *\n\ndeltaTable = DeltaTable.forName(spark, \"main.default.people_10m\")\ndisplay(deltaTable.history())\n\n``` \n```\nimport io.delta.tables._\n\nval deltaTable = DeltaTable.forName(spark, \"main.default.people_10m\")\ndisplay(deltaTable.history())\n\n``` \n```\nDESCRIBE HISTORY main.default.people_10m\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tutorial.html"} +{"content":"# What is Delta Lake?\n### Tutorial: Delta Lake\n#### Query an earlier version of the table (time travel)\n\nDelta Lake time travel allows you to query an older snapshot of a Delta table. \nTo query an older version of a table, specify the table\u2019s version or timestamp. 
For example, to query version 0 or timestamp `2024-05-15T22:43:15.000+00:00` from the preceding history, use the following: \n```\nfrom delta.tables import *\n\ndeltaTable = DeltaTable.forName(spark, \"main.default.people_10m\")\ndeltaHistory = deltaTable.history()\n\ndisplay(deltaHistory.where(\"version == 0\"))\n# Or:\ndisplay(deltaHistory.where(\"timestamp == '2024-05-15T22:43:15.000+00:00'\"))\n\n``` \n```\nimport io.delta.tables._\n\nval deltaTable = DeltaTable.forName(spark, \"main.default.people_10m\")\nval deltaHistory = deltaTable.history()\n\ndisplay(deltaHistory.where(\"version == 0\"))\n\/\/ Or:\ndisplay(deltaHistory.where(\"timestamp == '2024-05-15T22:43:15.000+00:00'\"))\n\n``` \n```\nSELECT * FROM main.default.people_10m VERSION AS OF 0\n-- Or:\nSELECT * FROM main.default.people_10m TIMESTAMP AS OF '2024-05-15 22:43:15'\n\n``` \nFor timestamps, only date or timestamp strings are accepted, for example, `\"2024-05-15T22:43:15.000+00:00\"` or `\"2024-05-15 22:43:15\"`. \nDataFrameReader options allow you to create a DataFrame from a Delta table that is fixed to a specific version or timestamp of the table, for example: \n```\ndf = spark.read.option('versionAsOf', 0).table(\"main.default.people_10m\")\n# Or:\ndf = spark.read.option('timestampAsOf', '2024-05-15T22:43:15.000+00:00').table(\"main.default.people_10m\")\n\ndisplay(df)\n\n``` \n```\nval df = spark.read.option(\"versionAsOf\", 0).table(\"main.default.people_10m\")\n\/\/ Or:\nval df = spark.read.option(\"timestampAsOf\", \"2024-05-15T22:43:15.000+00:00\").table(\"main.default.people_10m\")\n\ndisplay(df)\n\n``` \n```\nSELECT * FROM main.default.people_10m VERSION AS OF 0\n-- Or:\nSELECT * FROM main.default.people_10m TIMESTAMP AS OF '2024-05-15T22:43:15.000+00:00'\n\n``` \nFor details, see [Work with Delta Lake table history](https:\/\/docs.databricks.com\/delta\/history.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tutorial.html"} +{"content":"# What is Delta Lake?\n### 
Tutorial: Delta Lake\n#### Optimize a table\n\nAfter you have performed multiple changes to a table, you might have a lot of small files. To improve the speed of read queries, you can use the optimize operation to collapse small files into larger ones: \n```\nfrom delta.tables import *\n\ndeltaTable = DeltaTable.forName(spark, \"main.default.people_10m\")\ndeltaTable.optimize()\n\n``` \n```\nimport io.delta.tables._\n\nval deltaTable = DeltaTable.forName(spark, \"main.default.people_10m\")\ndeltaTable.optimize()\n\n``` \n```\nOPTIMIZE main.default.people_10m\n\n```\n\n### Tutorial: Delta Lake\n#### Z-order by columns\n\nTo improve read performance further, you can collocate related information in the same set of files by z-ordering. Delta Lake data-skipping algorithms use this collocation to dramatically reduce the amount of data that needs to be read. To z-order data, you specify the columns to order on in the z-order by operation. For example, to collocate by `gender`, run: \n```\nfrom delta.tables import *\n\ndeltaTable = DeltaTable.forName(spark, \"main.default.people_10m\")\ndeltaTable.optimize().executeZOrderBy(\"gender\")\n\n``` \n```\nimport io.delta.tables._\n\nval deltaTable = DeltaTable.forName(spark, \"main.default.people_10m\")\ndeltaTable.optimize().executeZOrderBy(\"gender\")\n\n``` \n```\nOPTIMIZE main.default.people_10m\nZORDER BY (gender)\n\n``` \nFor the full set of options available when running the optimize operation, see [Compact data files with optimize on Delta Lake](https:\/\/docs.databricks.com\/delta\/optimize.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tutorial.html"} +{"content":"# What is Delta Lake?\n### Tutorial: Delta Lake\n#### Clean up snapshots with `VACUUM`\n\nDelta Lake provides snapshot isolation for reads, which means that it is safe to run an optimize operation even while other users or jobs are querying the table. Eventually however, you should clean up old snapshots. 
You can do this by running the vacuum operation: \n```\nfrom delta.tables import *\n\ndeltaTable = DeltaTable.forName(spark, \"main.default.people_10m\")\ndeltaTable.vacuum()\n\n``` \n```\nimport io.delta.tables._\n\nval deltaTable = DeltaTable.forName(spark, \"main.default.people_10m\")\ndeltaTable.vacuum()\n\n``` \n```\nVACUUM main.default.people_10m\n\n``` \nFor details on using the vacuum operation effectively, see [Remove unused data files with vacuum](https:\/\/docs.databricks.com\/delta\/vacuum.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tutorial.html"} +{"content":"# AI and Machine Learning on Databricks\n## Reference solutions for machine learning\n#### Data labeling\n\nLabeling additional training data is an important step for many machine learning\nworkflows, such as classification or computer vision applications. Databricks does not directly support data labeling; however, the Databricks partnership with [Labelbox](https:\/\/labelbox.com) simplifies the process. \nFor more information about data labeling integration, see [Partner Connect documentation for Labelbox](https:\/\/docs.databricks.com\/partners\/ml\/labelbox.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/data-labeling.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create views\n\nThis article shows how to create views in Unity Catalog. \nA view is a read-only object composed from one or more tables and views in a metastore. It resides in the third layer of Unity Catalog\u2019s three-level namespace. A view can be created from tables and other views in multiple schemas and catalogs. \n[Dynamic views](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-views.html#dynamic-view) can be used to provide row- and column-level access control, in addition to data masking. 
\nExample syntax for creating a view: \n```\nCREATE VIEW main.default.experienced_employee\n(id COMMENT 'Unique identification number', Name)\nCOMMENT 'View for experienced employees'\nAS SELECT id, name\nFROM all_employee\nWHERE working_years > 5;\n\n``` \nNote \nViews might have different execution semantics if they\u2019re backed by data sources other than Delta tables. Databricks recommends that you always define views by referencing data sources using a table or view name. Defining views against datasets by specifying a path or URI can lead to confusing data governance requirements.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-views.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create views\n##### Requirements\n\nTo create a view: \n* You must have the `USE CATALOG` permission on the parent catalog and the `USE SCHEMA` and `CREATE TABLE` permissions on the parent schema. A metastore admin or the catalog owner can grant you all of these privileges. A schema owner can grant you `USE SCHEMA` and `CREATE TABLE` privileges on the schema.\n* You must be able to read the tables and views referenced in the view (`SELECT` on the table or view, as well as `USE CATALOG` on the catalog and `USE SCHEMA` on the schema).\n* If a view references tables in the workspace-local Hive metastore, the view can be accessed only from the workspace that contains the workspace-local tables. For this reason, Databricks recommends creating views only from tables or views that are in the Unity Catalog metastore.\n* You cannot create a view that references a view that has been shared with you using Delta Sharing. See [Share data and AI assets securely using Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html). 
\nTo read a view, the permissions required depend on the compute type and access mode: \n* For shared clusters and SQL warehouses, you need `SELECT` on the view itself, `USE CATALOG` on its parent catalog, and `USE SCHEMA` on its parent schema.\n* For single-user clusters, you must also have `SELECT` on all tables and views that the view references, in addition to `USE CATALOG` on their parent catalogs and `USE SCHEMA` on their parent schemas. \nTo create or read dynamic views: \n* Requirements for dynamic views are the same as those listed in the preceding sections, except that you must use a shared cluster or SQL warehouse to create or read a dynamic view. You cannot use single-user clusters.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-views.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create views\n##### Create a view\n\nTo create a view, run the following SQL command. Items in brackets are optional. Replace the placeholder values: \n* `<catalog-name>`: The name of the catalog.\n* `<schema-name>`: The name of the schema.\n* `<view-name>`: A name for the view.\n* `<query>`: The query, columns, and tables and views used to compose the view. 
\n```\nCREATE VIEW <catalog-name>.<schema-name>.<view-name> AS\nSELECT <query>;\n\n``` \n```\nspark.sql(\"CREATE VIEW <catalog-name>.<schema-name>.<view-name> AS \"\n\"SELECT <query>\")\n\n``` \n```\nlibrary(SparkR)\n\nsql(paste(\"CREATE VIEW <catalog-name>.<schema-name>.<view-name> AS \",\n\"SELECT <query>\",\nsep = \"\"))\n\n``` \n```\nspark.sql(\"CREATE VIEW <catalog-name>.<schema-name>.<view-name> AS \" +\n\"SELECT <query>\")\n\n``` \nFor example, to create a view named `sales_redacted` from columns in the `sales_raw` table: \n```\nCREATE VIEW sales_metastore.sales.sales_redacted AS\nSELECT\nuser_id,\nemail,\ncountry,\nproduct,\ntotal\nFROM sales_metastore.sales.sales_raw;\n\n``` \n```\nspark.sql(\"CREATE VIEW sales_metastore.sales.sales_redacted AS \"\n\"SELECT \"\n\" user_id, \"\n\" email, \"\n\" country, \"\n\" product, \"\n\" total \"\n\"FROM sales_metastore.sales.sales_raw\")\n\n``` \n```\nlibrary(SparkR)\n\nsql(paste(\"CREATE VIEW sales_metastore.sales.sales_redacted AS \",\n\"SELECT \",\n\" user_id, \",\n\" email, \",\n\" country, \",\n\" product, \",\n\" total \",\n\"FROM sales_metastore.sales.sales_raw\",\nsep = \"\"))\n\n``` \n```\nspark.sql(\"CREATE VIEW sales_metastore.sales.sales_redacted AS \" +\n\"SELECT \" +\n\" user_id, \" +\n\" email, \" +\n\" country, \" +\n\" product, \" +\n\" total \" +\n\"FROM sales_metastore.sales.sales_raw\")\n\n``` \nYou can also create a view by using the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_table](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/table). 
You can retrieve a list of view full names by using [databricks\\_views](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/data-sources\/views).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-views.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create views\n##### Create a dynamic view\n\nIn Unity Catalog, you can use dynamic views to configure fine-grained access control, including: \n* Security at the level of columns or rows.\n* Data masking. \nNote \nFine-grained access control using dynamic views is not available on clusters with **Single User** [access mode](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode). \nUnity Catalog introduces the following functions, which allow you to dynamically limit which users can access a row, column, or record in a view: \n* `current_user()`: Returns the current user\u2019s email address.\n* `is_account_group_member()`: Returns `TRUE` if the current user is a member of a specific account-level group. Recommended for use in dynamic views against Unity Catalog data.\n* `is_member()`: Returns `TRUE` if the current user is a member of a specific workspace-level group. This function is provided for compatibility with the existing Hive metastore. Avoid using it with views against Unity Catalog data, because it does not evaluate account-level group membership. \nDatabricks recommends that you do not grant users the ability to read the tables and views referenced in the view. \nThe following examples illustrate how to create dynamic views in Unity Catalog. \n### Column-level permissions \nWith a dynamic view, you can limit the columns a specific user or group can access. In the following example, only members of the `auditors` group can access email addresses from the `sales_raw` table. 
During query analysis, Apache Spark replaces the `CASE` statement with either the literal string `REDACTED` or the actual contents of the email address column. Other columns are returned as normal. This strategy has no negative impact on the query performance. \n```\n-- Alias the field 'email' to itself (as 'email') to prevent the\n-- permission logic from showing up directly in the column name results.\nCREATE VIEW sales_redacted AS\nSELECT\nuser_id,\nCASE WHEN\nis_account_group_member('auditors') THEN email\nELSE 'REDACTED'\nEND AS email,\ncountry,\nproduct,\ntotal\nFROM sales_raw\n\n``` \n```\n# Alias the field 'email' to itself (as 'email') to prevent the\n# permission logic from showing up directly in the column name results.\nspark.sql(\"CREATE VIEW sales_redacted AS \"\n\"SELECT \"\n\" user_id, \"\n\" CASE WHEN \"\n\" is_account_group_member('auditors') THEN email \"\n\" ELSE 'REDACTED' \"\n\" END AS email, \"\n\" country, \"\n\" product, \"\n\" total \"\n\"FROM sales_raw\")\n\n``` \n```\nlibrary(SparkR)\n\n# Alias the field 'email' to itself (as 'email') to prevent the\n# permission logic from showing up directly in the column name results.\nsql(paste(\"CREATE VIEW sales_redacted AS \",\n\"SELECT \",\n\" user_id, \",\n\" CASE WHEN \",\n\" is_account_group_member('auditors') THEN email \",\n\" ELSE 'REDACTED' \",\n\" END AS email, \",\n\" country, \",\n\" product, \",\n\" total \",\n\"FROM sales_raw\",\nsep = \"\"))\n\n``` \n```\n\/\/ Alias the field 'email' to itself (as 'email') to prevent the\n\/\/ permission logic from showing up directly in the column name results.\nspark.sql(\"CREATE VIEW sales_redacted AS \" +\n\"SELECT \" +\n\" user_id, \" +\n\" CASE WHEN \" +\n\" is_account_group_member('auditors') THEN email \" +\n\" ELSE 'REDACTED' \" +\n\" END AS email, \" +\n\" country, \" +\n\" product, \" +\n\" total \" +\n\"FROM sales_raw\")\n\n``` \n### Row-level permissions \nWith a dynamic view, you can specify permissions down to the row or field level. 
In the following example, only members of the `managers` group can view transaction amounts when they exceed $1,000,000. Matching results are filtered out for other users. \n```\nCREATE VIEW sales_redacted AS\nSELECT\nuser_id,\ncountry,\nproduct,\ntotal\nFROM sales_raw\nWHERE\nCASE\nWHEN is_account_group_member('managers') THEN TRUE\nELSE total <= 1000000\nEND;\n\n``` \n```\nspark.sql(\"CREATE VIEW sales_redacted AS \"\n\"SELECT \"\n\" user_id, \"\n\" country, \"\n\" product, \"\n\" total \"\n\"FROM sales_raw \"\n\"WHERE \"\n\"CASE \"\n\" WHEN is_account_group_member('managers') THEN TRUE \"\n\" ELSE total <= 1000000 \"\n\"END\")\n\n``` \n```\nlibrary(SparkR)\n\nsql(paste(\"CREATE VIEW sales_redacted AS \",\n\"SELECT \",\n\" user_id, \",\n\" country, \",\n\" product, \",\n\" total \",\n\"FROM sales_raw \",\n\"WHERE \",\n\"CASE \",\n\" WHEN is_account_group_member('managers') THEN TRUE \",\n\" ELSE total <= 1000000 \",\n\"END\",\nsep = \"\"))\n\n``` \n```\nspark.sql(\"CREATE VIEW sales_redacted AS \" +\n\"SELECT \" +\n\" user_id, \" +\n\" country, \" +\n\" product, \" +\n\" total \" +\n\"FROM sales_raw \" +\n\"WHERE \" +\n\"CASE \" +\n\" WHEN is_account_group_member('managers') THEN TRUE \" +\n\" ELSE total <= 1000000 \" +\n\"END\")\n\n``` \n### Data masking \nBecause views in Unity Catalog use Spark SQL, you can implement advanced data masking by using more complex SQL expressions and regular expressions. In the following example, all users can analyze email domains, but only members of the `auditors` group can view a user\u2019s entire email address. 
\n```\n-- The regexp_extract function takes an email address such as\n-- user.x.lastname@example.com and extracts 'example.com', allowing\n-- analysts to query the domain name.\n\nCREATE VIEW sales_redacted AS\nSELECT\nuser_id,\nregion,\nCASE\nWHEN is_account_group_member('auditors') THEN email\nELSE regexp_extract(email, '^.*@(.*)$', 1)\nEND\nFROM sales_raw\n\n``` \n```\n# The regexp_extract function takes an email address such as\n# user.x.lastname@example.com and extracts 'example.com', allowing\n# analysts to query the domain name.\n\nspark.sql(\"CREATE VIEW sales_redacted AS \"\n\"SELECT \"\n\" user_id, \"\n\" region, \"\n\" CASE \"\n\" WHEN is_account_group_member('auditors') THEN email \"\n\" ELSE regexp_extract(email, '^.*@(.*)$', 1) \"\n\" END \"\n\" FROM sales_raw\")\n\n``` \n```\nlibrary(SparkR)\n\n# The regexp_extract function takes an email address such as\n# user.x.lastname@example.com and extracts 'example.com', allowing\n# analysts to query the domain name.\n\nsql(paste(\"CREATE VIEW sales_redacted AS \",\n\"SELECT \",\n\" user_id, \",\n\" region, \",\n\" CASE \",\n\" WHEN is_account_group_member('auditors') THEN email \",\n\" ELSE regexp_extract(email, '^.*@(.*)$', 1) \",\n\" END \",\n\" FROM sales_raw\",\nsep = \"\"))\n\n``` \n```\n\/\/ The regexp_extract function takes an email address such as\n\/\/ user.x.lastname@example.com and extracts 'example.com', allowing\n\/\/ analysts to query the domain name.\n\nspark.sql(\"CREATE VIEW sales_redacted AS \" +\n\"SELECT \" +\n\" user_id, \" +\n\" region, \" +\n\" CASE \" +\n\" WHEN is_account_group_member('auditors') THEN email \" +\n\" ELSE regexp_extract(email, '^.*@(.*)$', 1) \" +\n\" END \" +\n\" FROM sales_raw\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-views.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create views\n##### Drop a view\n\nYou must be the view\u2019s owner to drop a view. 
To drop a view, run the following SQL command: \n```\nDROP VIEW IF EXISTS catalog_name.schema_name.view_name;\n\n```\n\n#### Create views\n##### Next steps\n\n* [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-views.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### `renv` on Databricks\n\n[renv](https:\/\/rstudio.github.io\/renv\/articles\/renv.html) is an R package that lets users manage R dependencies specific to the notebook. \nUsing `renv`, you can create and manage the R library environment for your project, save the state of these libraries to a `lockfile`, and later restore libraries as required. Together, these tools can help make projects more isolated, portable, and reproducible.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/renv.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### `renv` on Databricks\n##### Basic `renv` workflow\n\nIn this section: \n* [Install `renv`](https:\/\/docs.databricks.com\/sparkr\/renv.html#install-renv)\n* [Initialize `renv` session with pre-installed R libraries](https:\/\/docs.databricks.com\/sparkr\/renv.html#initialize-renv-session-with-pre-installed-r-libraries)\n* [Use `renv` to install additional packages](https:\/\/docs.databricks.com\/sparkr\/renv.html#use-renv-to-install-additional-packages)\n* [Use `renv` to save your R notebook environment to DBFS](https:\/\/docs.databricks.com\/sparkr\/renv.html#use-renv-to-save-your-r-notebook-environment-to-dbfs)\n* [Reinstall a `renv` environment given a `lockfile` from DBFS](https:\/\/docs.databricks.com\/sparkr\/renv.html#reinstall-a-renv-environment-given-a-lockfile-from-dbfs) \n### [Install `renv`](https:\/\/docs.databricks.com\/sparkr\/renv.html#id1) \nYou can install `renv` as a [cluster-scoped 
library](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html) or as a [notebook-scoped library](https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html). To install `renv` as a notebook-scoped library, use: \n```\nrequire(devtools)\n\ninstall_version(\npackage = \"renv\",\nrepos = \"http:\/\/cran.us.r-project.org\"\n)\n\n``` \nDatabricks recommends using a CRAN snapshot as the repository to [fix the package version](https:\/\/kb.databricks.com\/r\/pin-r-packages.html). \n### [Initialize `renv` session with pre-installed R libraries](https:\/\/docs.databricks.com\/sparkr\/renv.html#id2) \nThe first step when using `renv` is to initialize a session using `renv::init()`. Set `libPaths` to change the default download location to be your [R notebook-scoped library path](https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html). \n```\nrenv::init(settings = list(external.libraries=.libPaths()))\n.libPaths(c(.libPaths()[2], .libPaths()))\n\n``` \n### [Use `renv` to install additional packages](https:\/\/docs.databricks.com\/sparkr\/renv.html#id3) \nYou can now use `renv`\u2019s API to install and remove R packages. For example, to install the latest version of `digest`, run the following inside of a notebook cell. \n```\nrenv::install(\"digest\")\n\n``` \nTo install an old version of `digest`, run the following inside of a notebook cell. \n```\nrenv::install(\"digest@0.6.18\")\n\n``` \nTo install `digest` from GitHub, run the following inside of a notebook cell. \n```\nrenv::install(\"eddelbuettel\/digest\")\n\n``` \nTo install a package from Bioconductor, run the following inside of a notebook cell. \n```\n# (note: requires the BiocManager package)\nrenv::install(\"bioc::Biobase\")\n\n``` \nNote that the `renv::install` API uses the [renv Cache](https:\/\/docs.databricks.com\/sparkr\/renv.html#cache). 
\n### [Use `renv` to save your R notebook environment to DBFS](https:\/\/docs.databricks.com\/sparkr\/renv.html#id4) \nRun the following command once before saving the environment. \n```\nrenv::settings$snapshot.type(\"all\")\n\n``` \nThis sets `renv` to snapshot all packages that are installed into `libPaths`, not just the ones that are currently used in the notebook. See [renv documentation](https:\/\/rstudio.github.io\/renv\/reference\/snapshot.html#snapshot-type) for more information. \nNow you can run the following inside of a notebook cell to save the current state of your environment. \n```\nrenv::snapshot(lockfile=\"\/dbfs\/PATH\/TO\/WHERE\/YOU\/WANT\/TO\/SAVE\/renv.lock\", force=TRUE)\n\n``` \nThis updates the `lockfile` by capturing all packages installed on `libPaths`. It also moves your `lockfile` from the local filesystem to [DBFS](https:\/\/docs.databricks.com\/dbfs\/index.html), where it persists even if your cluster terminates or restarts. \n### [Reinstall a `renv` environment given a `lockfile` from DBFS](https:\/\/docs.databricks.com\/sparkr\/renv.html#id5) \nFirst, make sure that your new cluster is running an identical Databricks Runtime version as the one you first created the `renv` environment on. This ensures that the pre-installed R packages are identical. You can find a list of these in each runtime\u2019s [release notes](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). After you [Install renv](https:\/\/docs.databricks.com\/sparkr\/renv.html#installrenv), run the following inside of a notebook cell. \n```\nrenv::init(settings = list(external.libraries=.libPaths()))\n.libPaths(c(.libPaths()[2], .libPaths()))\nrenv::restore(lockfile=\"\/dbfs\/PATH\/TO\/WHERE\/YOU\/SAVED\/renv.lock\", exclude=c(\"Rserve\", \"SparkR\"))\n\n``` \nThis copies your `lockfile` from DBFS into the local file system and then restores any packages specified in the `lockfile`. 
\nNote \nTo avoid missing repository errors, exclude the `Rserve` and `SparkR` packages from package restoration. Both of these packages are pre-installed in all runtimes.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/renv.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### `renv` on Databricks\n##### `renv` Cache\n\nA very useful feature of `renv` is its [global package cache](https:\/\/rstudio.github.io\/renv\/articles\/renv.html#cache-1), which is shared across all `renv` projects on the cluster. It speeds up installation times and saves disk space. The `renv` cache does not cache packages downloaded via the `devtools` API or `install.packages()` with any additional arguments other than `pkgs`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/renv.html"} +{"content":"# Technology partners\n## Connect to semantic layer partners using Partner Connect\n#### Connect to AtScale\n\nAtScale\u2019s semantic layer delivers a single source of governed metrics and KPIs, tied to live lakehouse data and to BI tools including Excel, Tableau, and Power BI. \nYou can connect Databricks SQL warehouses (formerly Databricks SQL endpoints) and Databricks clusters to AtScale.\n\n#### Connect to AtScale\n##### Connect to AtScale using Partner Connect\n\nNote \nPartner Connect only supports SQL warehouses for AtScale connections. \nTo connect to AtScale using Partner Connect, see [Connect to semantic layer partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/semantic-layer.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/semantic-layer\/atscale.html"} +{"content":"# Technology partners\n## Connect to semantic layer partners using Partner Connect\n#### Connect to AtScale\n##### Connect to AtScale manually\n\nThis section describes how to connect to AtScale manually. \nNote \nTo connect to AtScale faster, use Partner Connect. Partner Connect only supports SQL warehouses for AtScale connections. 
\n### Requirements \nBefore you connect to AtScale manually, you must have the following: \n* An AtScale account. \n* A cluster or SQL warehouse in your Databricks workspace. \n+ [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n+ [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* The connection details for your cluster or SQL warehouse, specifically the **Server Hostname**, **Port**, and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). To create a personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**.\nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. 
\nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. See your workspace administrator or the following: \n+ [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n+ [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \n### Manual connection steps \nTo connect to AtScale manually, see [Adding Databricks Data Warehouses](https:\/\/www.atscale.com\/wp-content\/uploads\/2022\/11\/AtScale-Databricks-Documentation.pdf) on the AtScale website.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/semantic-layer\/atscale.html"} +{"content":"# Technology partners\n## Connect to semantic layer partners using Partner Connect\n#### Connect to AtScale\n##### Next steps\n\nRead the AtScale documentation: \n1. Log in to your AtScale account.\n2. Click the **Help** icon.\n3. 
Click **Documentation**.\n\n#### Connect to AtScale\n##### Additional resources\n\n* [AtScale website](https:\/\/www.atscale.com\/)\n* [Services and support](https:\/\/www.atscale.com\/services-support\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/semantic-layer\/atscale.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n##### Monitoring Structured Streaming queries on Databricks\n\nDatabricks provides built-in monitoring for Structured Streaming applications through the Spark UI under the **Streaming** tab.\n\n##### Monitoring Structured Streaming queries on Databricks\n###### Distinguish Structured Streaming queries in the Spark UI\n\nProvide your streams a unique query name by adding `.queryName(<query-name>)` to your `writeStream` code to easily distinguish which metrics belong to which stream in the Spark UI.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/stream-monitoring.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n##### Monitoring Structured Streaming queries on Databricks\n###### Push Structured Streaming metrics to external services\n\nStreaming metrics can be pushed to external services for alerting or dashboarding use cases by using Apache Spark\u2019s Streaming Query Listener interface. In Databricks Runtime 11.3 LTS and above, the Streaming Query Listener is available in Python and Scala. \nImportant \nCredentials and objects managed by Unity Catalog cannot be used in `StreamingQueryListener` logic. \nNote \nProcessing latency associated with listeners can adversely impact query processing. Databricks recommends minimizing processing logic in these listeners and writing to low latency sinks such as Kafka. 
\nThe following code provides basic examples of the syntax for implementing a listener: \n```\nimport org.apache.spark.sql.streaming.StreamingQueryListener\nimport org.apache.spark.sql.streaming.StreamingQueryListener._\n\nval myListener = new StreamingQueryListener {\n\n\/**\n* Called when a query is started.\n* @note This is called synchronously with\n* [[org.apache.spark.sql.streaming.DataStreamWriter `DataStreamWriter.start()`]].\n* `onQueryStart` calls on all listeners before\n* `DataStreamWriter.start()` returns the corresponding [[StreamingQuery]].\n* Do not block this method, as it blocks your query.\n*\/\ndef onQueryStarted(event: QueryStartedEvent): Unit = {}\n\n\/**\n* Called when there is some status update (ingestion rate updated, etc.)\n*\n* @note This method is asynchronous. The status in [[StreamingQuery]] returns the\n* latest status, regardless of when this method is called. The status of [[StreamingQuery]]\n* may change before or when you process the event. For example, you may find [[StreamingQuery]]\n* terminates when processing `QueryProgressEvent`.\n*\/\ndef onQueryProgress(event: QueryProgressEvent): Unit = {}\n\n\/**\n* Called when the query is idle and waiting for new data to process.\n*\/\ndef onQueryIdle(event: QueryProgressEvent): Unit = {}\n\n\/**\n* Called when a query is stopped, with or without error.\n*\/\ndef onQueryTerminated(event: QueryTerminatedEvent): Unit = {}\n}\n\n``` \n```\nclass MyListener(StreamingQueryListener):\ndef onQueryStarted(self, event):\n\"\"\"\nCalled when a query is started.\n\nParameters\n----------\nevent: :class:`pyspark.sql.streaming.listener.QueryStartedEvent`\nThe properties are available as the same as Scala API.\n\nNotes\n-----\nThis is called synchronously with\nmeth:`pyspark.sql.streaming.DataStreamWriter.start`,\nthat is, ``onQueryStart`` will be called on all listeners before\n``DataStreamWriter.start()`` returns the corresponding\n:class:`pyspark.sql.streaming.StreamingQuery`.\nDo not block in 
this method as it will block your query.\n\"\"\"\npass\n\ndef onQueryProgress(self, event):\n\"\"\"\nCalled when there is some status update (ingestion rate updated, etc.)\n\nParameters\n----------\nevent: :class:`pyspark.sql.streaming.listener.QueryProgressEvent`\nThe properties are available as the same as Scala API.\n\nNotes\n-----\nThis method is asynchronous. The status in\n:class:`pyspark.sql.streaming.StreamingQuery` returns the\nmost recent status, regardless of when this method is called. The status\nof :class:`pyspark.sql.streaming.StreamingQuery`.\nmay change before or when you process the event.\nFor example, you may find :class:`StreamingQuery`\nterminates when processing `QueryProgressEvent`.\n\"\"\"\npass\n\ndef onQueryIdle(self, event):\n\"\"\"\nCalled when the query is idle and waiting for new data to process.\n\"\"\"\npass\n\ndef onQueryTerminated(self, event):\n\"\"\"\nCalled when a query is stopped, with or without error.\n\nParameters\n----------\nevent: :class:`pyspark.sql.streaming.listener.QueryTerminatedEvent`\nThe properties are available as the same as Scala API.\n\"\"\"\npass\n\nmy_listener = MyListener()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/stream-monitoring.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n##### Monitoring Structured Streaming queries on Databricks\n###### Defining observable metrics in Structured Streaming\n\nObservable metrics are named arbitrary aggregate functions that can be defined on a query (DataFrame). As soon as the execution of a DataFrame reaches a completion point (that is, finishes a batch query or reaches a streaming epoch), a named event is emitted that contains the metrics for the data processed since the last completion point. \nYou can observe these metrics by attaching a listener to the Spark session. 
The listener depends on the execution mode: \n* **Batch mode**: Use `QueryExecutionListener`. \n`QueryExecutionListener` is called when the query completes. Access the metrics using the `QueryExecution.observedMetrics` map.\n* **Streaming, or micro-batch**: Use `StreamingQueryListener`. \n`StreamingQueryListener` is called when the streaming query completes an epoch. Access the metrics using the `StreamingQueryProgress.observedMetrics` map. Databricks does not support continuous execution streaming. \nFor example: \n```\n\/\/ Observe row count (rc) and error row count (erc) in the streaming Dataset\nval observed_ds = ds.observe(\"my_event\", count(lit(1)).as(\"rc\"), count($\"error\").as(\"erc\"))\nobserved_ds.writeStream.format(\"...\").start()\n\n\/\/ Monitor the metrics using a listener\nspark.streams.addListener(new StreamingQueryListener() {\noverride def onQueryProgress(event: QueryProgressEvent): Unit = {\nevent.progress.observedMetrics.get(\"my_event\").foreach { row =>\n\/\/ Trigger if the number of errors exceeds 5 percent\nval num_rows = row.getAs[Long](\"rc\")\nval num_error_rows = row.getAs[Long](\"erc\")\nval ratio = num_error_rows.toDouble \/ num_rows\nif (ratio > 0.05) {\n\/\/ Trigger alert\n}\n}\n}\n})\n\n``` \n```\n# Observe metric\nobserved_df = df.observe(\"metric\", count(lit(1)).alias(\"cnt\"), count(col(\"error\")).alias(\"malformed\"))\nobserved_df.writeStream.format(\"...\").start()\n\n# Define my listener.\nclass MyListener(StreamingQueryListener):\ndef onQueryStarted(self, event):\nprint(f\"'{event.name}' [{event.id}] got started!\")\ndef onQueryProgress(self, event):\nrow = event.progress.observedMetrics.get(\"metric\")\nif row is not None:\nif row.malformed \/ row.cnt > 0.5:\nprint(\"ALERT! Ouch! 
there are too many malformed \"\nf\"records {row.malformed} out of {row.cnt}!\")\nelse:\nprint(f\"{row.cnt} rows processed!\")\ndef onQueryTerminated(self, event):\nprint(f\"{event.id} got terminated!\")\n\n# Add my listener.\nspark.streams.addListener(MyListener())\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/stream-monitoring.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n##### Monitoring Structured Streaming queries on Databricks\n###### StreamingQueryListener object metrics\n\n| Metric | Description |\n| --- | --- |\n| `id` | Unique query ID that persists across restarts. See [StreamingQuery.id()](https:\/\/spark.apache.org\/docs\/3.1.1\/api\/python\/reference\/api\/pyspark.sql.streaming.StreamingQuery.id.html#pyspark.sql.streaming.StreamingQuery.id). |\n| `runId` | Unique query ID for every start or restart. See [StreamingQuery.runId()](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/reference\/pyspark.ss\/api\/pyspark.sql.streaming.StreamingQuery.runId.html). |\n| `name` | User-specified name of the query. Null if not specified. |\n| `timestamp` | Timestamp for the execution of the micro-batch. |\n| `batchId` | Unique ID for the current batch of data being processed. Note that in the case of retries after a failure, a given batch ID can be executed more than once. Similarly, when there is no data to be processed, the batch ID is not incremented. |\n| `numInputRows` | Aggregate (across all sources) number of records processed in a trigger. |\n| `inputRowsPerSecond` | Aggregate (across all sources) rate of arriving data. |\n| `processedRowsPerSecond` | Aggregate (across all sources) rate at which Spark is processing data. | \n### durationMs object \nInformation about the time it takes to complete various stages of the micro-batch execution process. 
\n| Metric | Description |\n| --- | --- |\n| `durationMs.addBatch` | Time taken to execute the microbatch. This excludes the time Spark takes to plan the microbatch. |\n| `durationMs.getBatch` | Time it takes to retrieve the metadata about the offsets from the source. |\n| `durationMs.latestOffset` | Time taken to retrieve the latest offset from the sources for the microbatch. |\n| `durationMs.queryPlanning` | Time taken to generate the execution plan. |\n| `durationMs.triggerExecution` | Time taken to plan and execute the microbatch. |\n| `durationMs.walCommit` | Time taken to commit the newly available offsets. | \n### eventTime object \nInformation about the event time value seen within the data being processed in the micro-batch. This data is used by the watermark to determine how to trim the state for processing stateful aggregations defined in the Structured Streaming job. \n| Metric | Description |\n| --- | --- |\n| `eventTime.avg` | Average event time seen in the trigger. |\n| `eventTime.max` | Maximum event time seen in the trigger. |\n| `eventTime.min` | Minimum event time seen in the trigger. |\n| `eventTime.watermark` | Value of the watermark used in the trigger. | \n### stateOperators object \nInformation about the stateful operations that are defined in the Structured Streaming job and the aggregations that are produced from them. \n| Metric | Description |\n| --- | --- |\n| `stateOperators.operatorName` | Name of the stateful operator that the metrics relate to. For example, `symmetricHashJoin`, `dedupe`, `stateStoreSave`. |\n| `stateOperators.numRowsTotal` | Number of rows in the state as a result of the stateful operator or aggregation. |\n| `stateOperators.numRowsUpdated` | Number of rows updated in the state as a result of the stateful operator or aggregation. |\n| `stateOperators.numRowsRemoved` | Number of rows removed from the state as a result of the stateful operator or aggregation. 
|\n| `stateOperators.commitTimeMs` | Time taken to commit all updates (puts and removes) and return a new version. |\n| `stateOperators.memoryUsedBytes` | Memory used by the state store. |\n| `stateOperators.numRowsDroppedByWatermark` | Number of rows that are considered too late to be included in the stateful aggregation. **Streaming aggregations only**: Number of rows dropped post-aggregation, and not raw input rows. The number is not precise, but it can indicate that late data is being dropped. |\n| `stateOperators.numShufflePartitions` | Number of shuffle partitions for this stateful operator. |\n| `stateOperators.numStateStoreInstances` | Actual number of state store instances that the operator has initialized and maintained. In many stateful operators, this is the same as the number of partitions, but stream-stream join initializes four state store instances per partition. | \n### stateOperators.customMetrics object \nInformation collected from RocksDB that captures metrics about its performance and operations with respect to the stateful values it maintains for the Structured Streaming job. For more information, see [Configure RocksDB state store on Databricks](https:\/\/docs.databricks.com\/structured-streaming\/rocksdb-state-store.html). \n| Metric | Description |\n| --- | --- |\n| `customMetrics.rocksdbBytesCopied` | Number of bytes copied as tracked by the RocksDB File Manager. |\n| `customMetrics.rocksdbCommitCheckpointLatency` | Time in milliseconds to take a snapshot of native RocksDB and write it to a local directory. |\n| `customMetrics.rocksdbCommitCompactLatency` | Time in milliseconds for compaction (optional) during the checkpoint commit. |\n| `customMetrics.rocksdbCommitFileSyncLatencyMs` | Time in milliseconds to sync the native RocksDB snapshot related files to an external storage (checkpoint location). |\n| `customMetrics.rocksdbCommitFlushLatency` | Time in milliseconds to flush the RocksDB in-memory changes to your local disk. 
|\n| `customMetrics.rocksdbCommitPauseLatency` | Time in milliseconds to stop the background worker threads (for example, for compaction) as part of the checkpoint commit. |\n| `customMetrics.rocksdbCommitWriteBatchLatency` | Time in milliseconds to apply the staged writes in the in-memory structure (`WriteBatch`) to native RocksDB. |\n| `customMetrics.rocksdbFilesCopied` | Number of files copied as tracked by the RocksDB File Manager. |\n| `customMetrics.rocksdbFilesReused` | Number of files reused as tracked by the RocksDB File Manager. |\n| `customMetrics.rocksdbGetCount` | Number of `get` calls to the DB. This doesn\u2019t include `gets` from `WriteBatch`, the in-memory batch used for staging writes. |\n| `customMetrics.rocksdbGetLatency` | Average time in nanoseconds for the underlying native `RocksDB::Get` call. |\n| `customMetrics.rocksdbReadBlockCacheHitCount` | Number of reads served from the RocksDB block cache, avoiding local disk reads. Compare with the miss count to judge how useful the block cache is. |\n| `customMetrics.rocksdbReadBlockCacheMissCount` | Number of reads that missed the RocksDB block cache and required local disk reads. Compare with the hit count to judge how useful the block cache is. |\n| `customMetrics.rocksdbSstFileSize` | Size of all SST files. SST stands for Static Sorted Table, which is the tabular structure RocksDB uses to store data. |\n| `customMetrics.rocksdbTotalBytesRead` | Number of uncompressed bytes read by `get` operations. |\n| `customMetrics.rocksdbTotalBytesReadByCompaction` | Number of bytes that the compaction process reads from the disk. |\n| `customMetrics.rocksdbTotalBytesReadThroughIterator` | Some of the stateful operations (for example, timeout processing in `FlatMapGroupsWithState` and watermarking) require reading data in DB through an iterator. This metric represents the size of uncompressed data read using the iterator. |\n| `customMetrics.rocksdbTotalBytesWritten` | Number of uncompressed bytes written by `put` operations. 
|\n| `customMetrics.rocksdbTotalBytesWrittenByCompaction` | Number of bytes the compaction process writes to the disk. |\n| `customMetrics.rocksdbTotalCompactionLatencyMs` | Time in milliseconds for RocksDB compactions, including background compactions and the optional compaction initiated during the commit. |\n| `customMetrics.rocksdbTotalFlushLatencyMs` | Flush time, including background flushing. Flush operations are processes by which the MemTable is flushed to storage once it\u2019s full. MemTables are the first level where data is stored in RocksDB. |\n| `customMetrics.rocksdbZipFileBytesUncompressed` | RocksDB File Manager manages the physical SST file disk space utilization and deletion. This metric represents the uncompressed zip files in bytes as reported by the File Manager. | \n### sources object (Kafka) \n| Metric | Description |\n| --- | --- |\n| `sources.description` | Name of the source the streaming query is reading from. For example, `\u201cKafkaV2[Subscribe[KAFKA_TOPIC_NAME_INPUT_A]]\u201d`. |\n| `sources.startOffset` object | Starting offset number within the Kafka topic that the streaming job started at. |\n| `sources.endOffset` object | Latest offset processed by the microbatch. This could be equal to `latestOffset` for an ongoing microbatch execution. |\n| `sources.latestOffset` object | Latest offset determined by the microbatch. When there is throttling, the micro-batching process might not process all offsets, causing `endOffset` and `latestOffset` to differ. |\n| `sources.numInputRows` | Number of input rows processed from this source. |\n| `sources.inputRowsPerSecond` | Rate at which data is arriving for processing for this source. |\n| `sources.processedRowsPerSecond` | Rate at which Spark is processing data for this source. 
| \n### sources.metrics object (Kafka) \n| Metric | Description |\n| --- | --- |\n| `sources.metrics.avgOffsetsBehindLatest` | Average number of offsets that the streaming query is behind the latest available offset among all the subscribed topics. |\n| `sources.metrics.estimatedTotalBytesBehindLatest` | Estimated number of bytes that the query process has not consumed from the subscribed topics. |\n| `sources.metrics.maxOffsetsBehindLatest` | Maximum number of offsets that the streaming query is behind the latest available offset among all the subscribed topics. |\n| `sources.metrics.minOffsetsBehindLatest` | Minimum number of offsets that the streaming query is behind the latest available offset among all the subscribed topics. | \n### sink object (Kafka) \n| Metric | Description |\n| --- | --- |\n| `sink.description` | Name of the sink the streaming query is writing to. For example, `\u201corg.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaTable@e04b100\u201d`. |\n| `sink.numOutputRows` | Number of rows that were written to the output table or sink as part of the microbatch. For some situations, this value can be \u201c-1\u201d and generally can be interpreted as \u201cunknown\u201d. | \n### sources object (Delta Lake) \n| Metric | Description |\n| --- | --- |\n| `sources.description` | Name of the source the streaming query is reading from. For example, `\u201cDeltaSource[table]\u201d`. |\n| `sources.[startOffset\/endOffset].sourceVersion` | Version of serialization that this offset is encoded with. |\n| `sources.[startOffset\/endOffset].reservoirId` | ID of the table you are reading from. This is used to detect misconfiguration when restarting a query. |\n| `sources.[startOffset\/endOffset].reservoirVersion` | Version of the table that you are currently processing. |\n| `sources.[startOffset\/endOffset].index` | Index in the sequence of `AddFiles` in this version. This is used to break large commits into multiple batches. 
This index is created by sorting on `modificationTimestamp` and `path`. |\n| `sources.[startOffset\/endOffset].isStartingVersion` | Whether this offset denotes a query that is starting rather than processing changes. When starting a new query, all data present in the table at the start is processed, and then new data that has arrived. |\n| `sources.latestOffset` | Latest offset processed by the microbatch query. |\n| `sources.numInputRows` | Number of input rows processed from this source. |\n| `sources.inputRowsPerSecond` | Rate at which data is arriving for processing for this source. |\n| `sources.processedRowsPerSecond` | Rate at which Spark is processing data for this source. |\n| `sources.metrics.numBytesOutstanding` | Size of the outstanding files (files tracked by RocksDB) combined. This is the backlog metric for Delta and Auto Loader as the streaming source. |\n| `sources.metrics.numFilesOutstanding` | Number of outstanding files to be processed. This is the backlog metric for Delta and Auto Loader as the streaming source. | \n### sink object (Delta Lake) \n| Metric | Description |\n| --- | --- |\n| `sink.description` | Name of the sink that the streaming query writes to. For example, `\u201cDeltaSink[table]\u201d`. |\n| `sink.numOutputRows` | Number of rows in this metric is \u201c-1\u201d because Spark can\u2019t infer output rows for DSv1 sinks, which is the classification for the Delta Lake sink. | \n### sources object (Kinesis) \n| Metric | Description |\n| --- | --- |\n| `description` | Name of the source that the streaming query reads from. For example, `\u201cKinesisV2[stream]\u201d`. | \nFor more information, see [What metrics does Kinesis report?](https:\/\/docs.databricks.com\/connect\/streaming\/kinesis.html#metrics). \n### sources metrics object (Kinesis) \n| Metric | Description |\n| --- | --- |\n| `avgMsBehindLatest` | Average number of milliseconds a consumer has fallen behind the beginning of a stream. 
|\n| `maxMsBehindLatest` | Maximum number of milliseconds a consumer has fallen behind the beginning of a stream. |\n| `minMsBehindLatest` | Minimum number of milliseconds a consumer has fallen behind the beginning of a stream. |\n| `totalPrefetchedBytes` | Number of bytes left to process. This is the backlog metric for Kinesis as a source. | \n### sink (Kinesis) \n| Metric | Description |\n| --- | --- |\n| `sink.description` | Name of the sink that the streaming query writes to. For example, `\u201cKinesisV2[stream]\u201d`. |\n| `sink.numOutputRows` | Number of rows that were written to the output table or sink as part of the microbatch. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/stream-monitoring.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n##### Monitoring Structured Streaming queries on Databricks\n###### Examples\n\n### Example Kafka-to-Kafka StreamingQueryListener event \n```\n{\n\"id\" : \"3574feba-646d-4735-83c4-66f657e52517\",\n\"runId\" : \"38a78903-9e55-4440-ad81-50b591e4746c\",\n\"name\" : \"STREAMING_QUERY_NAME_UNIQUE\",\n\"timestamp\" : \"2022-10-31T20:09:30.455Z\",\n\"batchId\" : 1377,\n\"numInputRows\" : 687,\n\"inputRowsPerSecond\" : 32.13433743393049,\n\"processedRowsPerSecond\" : 34.067241892293964,\n\"durationMs\" : {\n\"addBatch\" : 18352,\n\"getBatch\" : 0,\n\"latestOffset\" : 31,\n\"queryPlanning\" : 977,\n\"triggerExecution\" : 20165,\n\"walCommit\" : 342\n},\n\"eventTime\" : {\n\"avg\" : \"2022-10-31T20:09:18.070Z\",\n\"max\" : \"2022-10-31T20:09:30.125Z\",\n\"min\" : \"2022-10-31T20:09:09.793Z\",\n\"watermark\" : \"2022-10-31T20:08:46.355Z\"\n},\n\"stateOperators\" : [ {\n\"operatorName\" : \"stateStoreSave\",\n\"numRowsTotal\" : 208,\n\"numRowsUpdated\" : 73,\n\"allUpdatesTimeMs\" : 434,\n\"numRowsRemoved\" : 76,\n\"allRemovalsTimeMs\" : 515,\n\"commitTimeMs\" : 0,\n\"memoryUsedBytes\" : 
167069743,\n\"numRowsDroppedByWatermark\" : 0,\n\"numShufflePartitions\" : 20,\n\"numStateStoreInstances\" : 20,\n\"customMetrics\" : {\n\"rocksdbBytesCopied\" : 0,\n\"rocksdbCommitCheckpointLatency\" : 0,\n\"rocksdbCommitCompactLatency\" : 0,\n\"rocksdbCommitFileSyncLatencyMs\" : 0,\n\"rocksdbCommitFlushLatency\" : 0,\n\"rocksdbCommitPauseLatency\" : 0,\n\"rocksdbCommitWriteBatchLatency\" : 0,\n\"rocksdbFilesCopied\" : 0,\n\"rocksdbFilesReused\" : 0,\n\"rocksdbGetCount\" : 222,\n\"rocksdbGetLatency\" : 0,\n\"rocksdbPutCount\" : 0,\n\"rocksdbPutLatency\" : 0,\n\"rocksdbReadBlockCacheHitCount\" : 165,\n\"rocksdbReadBlockCacheMissCount\" : 41,\n\"rocksdbSstFileSize\" : 232729,\n\"rocksdbTotalBytesRead\" : 12844,\n\"rocksdbTotalBytesReadByCompaction\" : 0,\n\"rocksdbTotalBytesReadThroughIterator\" : 161238,\n\"rocksdbTotalBytesWritten\" : 0,\n\"rocksdbTotalBytesWrittenByCompaction\" : 0,\n\"rocksdbTotalCompactionLatencyMs\" : 0,\n\"rocksdbTotalFlushLatencyMs\" : 0,\n\"rocksdbWriterStallLatencyMs\" : 0,\n\"rocksdbZipFileBytesUncompressed\" : 0\n}\n}, {\n\"operatorName\" : \"dedupe\",\n\"numRowsTotal\" : 2454744,\n\"numRowsUpdated\" : 73,\n\"allUpdatesTimeMs\" : 4155,\n\"numRowsRemoved\" : 0,\n\"allRemovalsTimeMs\" : 0,\n\"commitTimeMs\" : 0,\n\"memoryUsedBytes\" : 137765341,\n\"numRowsDroppedByWatermark\" : 34,\n\"numShufflePartitions\" : 20,\n\"numStateStoreInstances\" : 20,\n\"customMetrics\" : {\n\"numDroppedDuplicateRows\" : 193,\n\"rocksdbBytesCopied\" : 0,\n\"rocksdbCommitCheckpointLatency\" : 0,\n\"rocksdbCommitCompactLatency\" : 0,\n\"rocksdbCommitFileSyncLatencyMs\" : 0,\n\"rocksdbCommitFlushLatency\" : 0,\n\"rocksdbCommitPauseLatency\" : 0,\n\"rocksdbCommitWriteBatchLatency\" : 0,\n\"rocksdbFilesCopied\" : 0,\n\"rocksdbFilesReused\" : 0,\n\"rocksdbGetCount\" : 146,\n\"rocksdbGetLatency\" : 0,\n\"rocksdbPutCount\" : 0,\n\"rocksdbPutLatency\" : 0,\n\"rocksdbReadBlockCacheHitCount\" : 3,\n\"rocksdbReadBlockCacheMissCount\" : 3,\n\"rocksdbSstFileSize\" : 
78959140,\n\"rocksdbTotalBytesRead\" : 0,\n\"rocksdbTotalBytesReadByCompaction\" : 0,\n\"rocksdbTotalBytesReadThroughIterator\" : 0,\n\"rocksdbTotalBytesWritten\" : 0,\n\"rocksdbTotalBytesWrittenByCompaction\" : 0,\n\"rocksdbTotalCompactionLatencyMs\" : 0,\n\"rocksdbTotalFlushLatencyMs\" : 0,\n\"rocksdbWriterStallLatencyMs\" : 0,\n\"rocksdbZipFileBytesUncompressed\" : 0\n}\n}, {\n\"operatorName\" : \"symmetricHashJoin\",\n\"numRowsTotal\" : 2583,\n\"numRowsUpdated\" : 682,\n\"allUpdatesTimeMs\" : 9645,\n\"numRowsRemoved\" : 508,\n\"allRemovalsTimeMs\" : 46,\n\"commitTimeMs\" : 21,\n\"memoryUsedBytes\" : 668544484,\n\"numRowsDroppedByWatermark\" : 0,\n\"numShufflePartitions\" : 20,\n\"numStateStoreInstances\" : 80,\n\"customMetrics\" : {\n\"rocksdbBytesCopied\" : 0,\n\"rocksdbCommitCheckpointLatency\" : 0,\n\"rocksdbCommitCompactLatency\" : 0,\n\"rocksdbCommitFileSyncLatencyMs\" : 0,\n\"rocksdbCommitFlushLatency\" : 0,\n\"rocksdbCommitPauseLatency\" : 0,\n\"rocksdbCommitWriteBatchLatency\" : 0,\n\"rocksdbFilesCopied\" : 0,\n\"rocksdbFilesReused\" : 0,\n\"rocksdbGetCount\" : 4218,\n\"rocksdbGetLatency\" : 3,\n\"rocksdbPutCount\" : 0,\n\"rocksdbPutLatency\" : 0,\n\"rocksdbReadBlockCacheHitCount\" : 3425,\n\"rocksdbReadBlockCacheMissCount\" : 149,\n\"rocksdbSstFileSize\" : 742827,\n\"rocksdbTotalBytesRead\" : 866864,\n\"rocksdbTotalBytesReadByCompaction\" : 0,\n\"rocksdbTotalBytesReadThroughIterator\" : 0,\n\"rocksdbTotalBytesWritten\" : 0,\n\"rocksdbTotalBytesWrittenByCompaction\" : 0,\n\"rocksdbTotalCompactionLatencyMs\" : 0,\n\"rocksdbTotalFlushLatencyMs\" : 0,\n\"rocksdbWriterStallLatencyMs\" : 0,\n\"rocksdbZipFileBytesUncompressed\" : 0\n}\n} ],\n\"sources\" : [ {\n\"description\" : \"KafkaV2[Subscribe[KAFKA_TOPIC_NAME_INPUT_A]]\",\n\"startOffset\" : {\n\"KAFKA_TOPIC_NAME_INPUT_A\" : {\n\"0\" : 349706380\n}\n},\n\"endOffset\" : {\n\"KAFKA_TOPIC_NAME_INPUT_A\" : {\n\"0\" : 349706672\n}\n},\n\"latestOffset\" : {\n\"KAFKA_TOPIC_NAME_INPUT_A\" : {\n\"0\" : 
349706672\n}\n},\n\"numInputRows\" : 292,\n\"inputRowsPerSecond\" : 13.65826278123392,\n\"processedRowsPerSecond\" : 14.479817514628582,\n\"metrics\" : {\n\"avgOffsetsBehindLatest\" : \"0.0\",\n\"estimatedTotalBytesBehindLatest\" : \"0.0\",\n\"maxOffsetsBehindLatest\" : \"0\",\n\"minOffsetsBehindLatest\" : \"0\"\n}\n}, {\n\"description\" : \"KafkaV2[Subscribe[KAFKA_TOPIC_NAME_INPUT_B]]\",\n\"startOffset\" : {\n\"KAFKA_TOPIC_NAME_INPUT_B\" : {\n\"2\" : 143147812,\n\"1\" : 129288266,\n\"0\" : 138102966\n}\n},\n\"endOffset\" : {\n\"KAFKA_TOPIC_NAME_INPUT_B\" : {\n\"2\" : 143147812,\n\"1\" : 129288266,\n\"0\" : 138102966\n}\n},\n\"latestOffset\" : {\n\"KAFKA_TOPIC_NAME_INPUT_B\" : {\n\"2\" : 143147812,\n\"1\" : 129288266,\n\"0\" : 138102966\n}\n},\n\"numInputRows\" : 0,\n\"inputRowsPerSecond\" : 0.0,\n\"processedRowsPerSecond\" : 0.0,\n\"metrics\" : {\n\"avgOffsetsBehindLatest\" : \"0.0\",\n\"maxOffsetsBehindLatest\" : \"0\",\n\"minOffsetsBehindLatest\" : \"0\"\n}\n} ],\n\"sink\" : {\n\"description\" : \"org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaTable@e04b100\",\n\"numOutputRows\" : 76\n}\n}\n\n``` \n### Example Delta Lake-to-Delta Lake StreamingQueryListener event \n```\n{\n\"id\" : \"aeb6bc0f-3f7d-4928-a078-ba2b304e2eaf\",\n\"runId\" : \"35d751d9-2d7c-4338-b3de-6c6ae9ebcfc2\",\n\"name\" : \"silverTransformFromBronze\",\n\"timestamp\" : \"2022-11-01T18:21:29.500Z\",\n\"batchId\" : 4,\n\"numInputRows\" : 0,\n\"inputRowsPerSecond\" : 0.0,\n\"processedRowsPerSecond\" : 0.0,\n\"durationMs\" : {\n\"latestOffset\" : 62,\n\"triggerExecution\" : 62\n},\n\"stateOperators\" : [ ],\n\"sources\" : [ {\n\"description\" : \"DeltaSource[dbfs:\/FileStore\/max.fisher@databricks.com\/ctc\/stateful-trade-analysis-demo\/table]\",\n\"startOffset\" : {\n\"sourceVersion\" : 1,\n\"reservoirId\" : \"84590dac-da51-4e0f-8eda-6620198651a9\",\n\"reservoirVersion\" : 3216,\n\"index\" : 3214,\n\"isStartingVersion\" : true\n},\n\"endOffset\" : {\n\"sourceVersion\" : 1,\n\"reservoirId\" 
: \"84590dac-da51-4e0f-8eda-6620198651a9\",\n\"reservoirVersion\" : 3216,\n\"index\" : 3214,\n\"isStartingVersion\" : true\n},\n\"latestOffset\" : null,\n\"numInputRows\" : 0,\n\"inputRowsPerSecond\" : 0.0,\n\"processedRowsPerSecond\" : 0.0,\n\"metrics\" : {\n\"numBytesOutstanding\" : \"0\",\n\"numFilesOutstanding\" : \"0\"\n}\n} ],\n\"sink\" : {\n\"description\" : \"DeltaSink[dbfs:\/user\/hive\/warehouse\/maxfisher.db\/trade_history_silver_delta_demo2]\",\n\"numOutputRows\" : -1\n}\n}\n\n``` \n### Example rate source to Delta Lake StreamingQueryListener event \n```\n{\n\"id\" : \"912ebdc1-edf2-48ec-b9fb-1a9b67dd2d9e\",\n\"runId\" : \"85de73a5-92cc-4b7f-9350-f8635b0cf66e\",\n\"name\" : \"dataGen\",\n\"timestamp\" : \"2022-11-01T18:28:20.332Z\",\n\"batchId\" : 279,\n\"numInputRows\" : 300,\n\"inputRowsPerSecond\" : 114.15525114155251,\n\"processedRowsPerSecond\" : 158.9825119236884,\n\"durationMs\" : {\n\"addBatch\" : 1771,\n\"commitOffsets\" : 54,\n\"getBatch\" : 0,\n\"latestOffset\" : 0,\n\"queryPlanning\" : 4,\n\"triggerExecution\" : 1887,\n\"walCommit\" : 58\n},\n\"stateOperators\" : [ ],\n\"sources\" : [ {\n\"description\" : \"RateStreamV2[rowsPerSecond=100, rampUpTimeSeconds=0, numPartitions=default\",\n\"startOffset\" : 560,\n\"endOffset\" : 563,\n\"latestOffset\" : 563,\n\"numInputRows\" : 300,\n\"inputRowsPerSecond\" : 114.15525114155251,\n\"processedRowsPerSecond\" : 158.9825119236884\n} ],\n\"sink\" : {\n\"description\" : \"DeltaSink[dbfs:\/user\/hive\/warehouse\/maxfisher.db\/trade_history_bronze_delta_demo]\",\n\"numOutputRows\" : -1\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/stream-monitoring.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n##### Recover from Structured Streaming query failures with workflows\n\nStructured Streaming provides fault-tolerance and data consistency for streaming queries; using Databricks workflows, 
you can easily configure your Structured Streaming queries to automatically restart on failure. By enabling checkpointing for a streaming query, you can restart the query after a failure. The restarted query continues where the failed one left off.\n\n##### Recover from Structured Streaming query failures with workflows\n###### Enable checkpointing for Structured Streaming queries\n\nDatabricks recommends that you always specify the `checkpointLocation` option as a cloud storage path before you start the query. For example: \n```\nstreamingDataFrame.writeStream\n.format(\"parquet\")\n.option(\"path\", \"\/path\/to\/table\")\n.option(\"checkpointLocation\", \"\/path\/to\/table\/_checkpoint\")\n.start()\n\n``` \nThis checkpoint location preserves all of the essential information that identifies a query. Each query must have a different checkpoint location. Multiple queries should never have the same location. For more information, see the [Structured Streaming Programming Guide](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html). \nNote \nWhile `checkpointLocation` is required for most types of output sinks, some sinks, such as memory sink, may automatically generate a temporary checkpoint location when you do not provide `checkpointLocation`. These temporary checkpoint locations do not ensure any fault tolerance or data consistency guarantees and may not get cleaned up properly. 
Avoid potential pitfalls by always specifying a `checkpointLocation`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/query-recovery.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n##### Recover from Structured Streaming query failures with workflows\n###### Configure Structured Streaming jobs to restart streaming queries on failure\n\nYou can create a Databricks [job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) with the notebook or JAR that has your streaming queries and configure it to: \n* Always use a new cluster.\n* Always retry on failure. \nAutomatically restarting on job failure is especially important when configuring streaming workloads with schema evolution. Schema evolution works on Databricks by raising an expected error when a schema change is detected, and then properly processing data using the new schema when the job restarts. Databricks recommends always configuring streaming tasks that contain queries with schema evolution to restart automatically in Databricks workflows. \n[Jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) have tight integration with Structured Streaming APIs and can monitor all streaming queries active in a run. This configuration ensures that if any part of the query fails, jobs automatically terminate the run (along with all the other queries) and start a new run in a new cluster. This re-runs the notebook or JAR code and restarts all of the queries again. This is the safest way to return to a good state. \nNote \n* Failure in any of the active streaming queries causes the active run to fail and terminate all the other streaming queries.\n* You do not need to use `streamingQuery.awaitTermination()` or `spark.streams.awaitAnyTermination()` at the end of your notebook. 
Jobs automatically prevent a run from completing when a streaming query is active.\n* Databricks recommends using jobs instead of `%run` and `dbutils.notebook.run()` when orchestrating Structured Streaming notebooks. See [Run a Databricks notebook from another notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-workflows.html). \nThe following is an example of a recommended job configuration. \n* **Cluster**: Set this always to use a new cluster and use the latest Spark version (or at least version 2.1). Queries started in Spark 2.1 and above are recoverable after query and Spark version upgrades.\n* **Notifications**: Set this if you want email notification on failures.\n* **Schedule**: *Do not set a schedule*.\n* **Timeout**: *Do not set a timeout.* Streaming queries run for an indefinitely long time.\n* **Maximum concurrent runs**: Set to **1**. There must be only one instance of each query concurrently active.\n* **Retries**: Set to **Unlimited**. \nSee [Create and run Databricks Jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) to understand these configurations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/query-recovery.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n##### Recover from Structured Streaming query failures with workflows\n###### Recover after changes in a Structured Streaming query\n\nThere are limitations on what changes in a streaming query are allowed between restarts from the same checkpoint location. Here are a few changes that are either not allowed or the effect of the change is not well-defined. 
For all of them: \n* The term *allowed* means you can do the specified change but whether the semantics of its effect is well-defined depends on the query and the change.\n* The term *not allowed* means you should not do the specified change as the restarted query is likely to fail with unpredictable errors.\n* `sdf` represents a streaming DataFrame\/Dataset generated with `sparkSession.readStream`. \n### Types of changes in Structured Streaming queries \n* **Changes in the number or type (that is, different source) of input sources**: This is not allowed.\n* **Changes in the parameters of input sources**: Whether this is allowed and whether the semantics of the change are well-defined depends on the source and the query. Here are a few examples. \n+ Addition, deletion, and modification of rate limits are allowed: \n```\nspark.readStream.format(\"kafka\").option(\"subscribe\", \"article\")\n\n``` \nto \n```\nspark.readStream.format(\"kafka\").option(\"subscribe\", \"article\").option(\"maxOffsetsPerTrigger\", ...)\n\n```\n+ Changes to subscribed topics and files are generally not allowed as the results are unpredictable: `spark.readStream.format(\"kafka\").option(\"subscribe\", \"article\")` to `spark.readStream.format(\"kafka\").option(\"subscribe\", \"newarticle\")`\n* **Changes in the trigger interval**: You can change triggers between incremental batches and time intervals. See [Changing trigger intervals between runs](https:\/\/docs.databricks.com\/structured-streaming\/triggers.html#change-interval).\n* **Changes in the type of output sink**: Changes between a few specific combinations of sinks are allowed. This needs to be verified on a case-by-case basis. Here are a few examples. \n+ File sink to Kafka sink is allowed. 
Kafka will see only the new data.\n+ Kafka sink to file sink is not allowed.\n+ Kafka sink changed to foreach, or vice versa, is allowed.\n* **Changes in the parameters of output sink**: Whether this is allowed and whether the semantics of the change are well-defined depends on the sink and the query. Here are a few examples. \n+ Changes to the output directory of a file sink are not allowed: `sdf.writeStream.format(\"parquet\").option(\"path\", \"\/somePath\")` to `sdf.writeStream.format(\"parquet\").option(\"path\", \"\/anotherPath\")`\n+ Changes to the output topic are allowed: `sdf.writeStream.format(\"kafka\").option(\"topic\", \"topic1\")` to `sdf.writeStream.format(\"kafka\").option(\"topic\", \"topic2\")`\n+ Changes to the user-defined foreach sink (that is, the `ForeachWriter` code) are allowed, but the semantics of the change depend on the code.\n* **Changes in projection \/ filter \/ map-like operations**: Some cases are allowed. For example: \n+ Addition \/ deletion of filters is allowed: `sdf.selectExpr(\"a\")` to `sdf.where(...).selectExpr(\"a\").filter(...)`.\n+ Changes in projections with the same output schema are allowed: `sdf.selectExpr(\"stringColumn AS json\").writeStream` to `sdf.select(to_json(...).as(\"json\")).writeStream`.\n+ Changes in projections with different output schema are conditionally allowed: `sdf.selectExpr(\"a\").writeStream` to `sdf.selectExpr(\"b\").writeStream` is allowed only if the output sink allows the schema change from `\"a\"` to `\"b\"`.\n* **Changes in stateful operations**: Some operations in streaming queries need to maintain state data in order to continuously update the result. Structured Streaming automatically checkpoints the state data to fault-tolerant storage (for example, DBFS, AWS S3, Azure Blob storage) and restores it after restart. However, this assumes that the schema of the state data remains the same across restarts. 
This means that *any changes (that is, additions, deletions, or schema modifications) to the stateful operations of a streaming query are not allowed between restarts*. Here is the list of stateful operations whose schema should not be changed between restarts in order to ensure state recovery: \n+ **Streaming aggregation**: For example, `sdf.groupBy(\"a\").agg(...)`. Any change in number or type of grouping keys or aggregates is not allowed.\n+ **Streaming deduplication**: For example, `sdf.dropDuplicates(\"a\")`. Any change in number or type of grouping keys or aggregates is not allowed.\n+ **Stream-stream join**: For example, `sdf1.join(sdf2, ...)` (i.e. both inputs are generated with `sparkSession.readStream`). Changes in the schema or equi-joining columns are not allowed. Changes in join type (outer or inner) are not allowed. Other changes in the join condition are ill-defined.\n+ **Arbitrary stateful operation**: For example, `sdf.groupByKey(...).mapGroupsWithState(...)` or `sdf.groupByKey(...).flatMapGroupsWithState(...)`. Any change to the schema of the user-defined state and the type of timeout is not allowed. Any change within the user-defined state-mapping function is allowed, but the semantic effect of the change depends on the user-defined logic. If you really want to support state schema changes, then you can explicitly encode\/decode your complex state data structures into bytes using an encoding\/decoding scheme that supports schema migration. 
For example, if you save your state as Avro-encoded bytes, then you can change the Avro-state-schema between query restarts as this restores the binary state.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/query-recovery.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Create your first workflow with a Databricks job\n\nThis article demonstrates a Databricks [job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) that orchestrates tasks to read and process a sample dataset. In this quickstart, you: \n1. Create a new notebook and add code to retrieve a sample dataset containing popular baby names by year.\n2. Save the sample dataset to Unity Catalog.\n3. Create a new notebook and add code to read the dataset from Unity Catalog, filter it by year, and display the results.\n4. Create a new job and configure two tasks using the notebooks.\n5. Run the job and view the results.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-quickstart.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Create your first workflow with a Databricks job\n##### Requirements\n\nIf your workspace is Unity Catalog-enabled and [Serverless Workflows](https:\/\/docs.databricks.com\/workflows\/jobs\/run-serverless-jobs.html) is enabled, by default, the job runs on Serverless compute. You do not need cluster creation permission to run your job with Serverless compute. \nOtherwise, you must have [cluster creation permission](https:\/\/docs.databricks.com\/compute\/use-compute.html#permissions) to create job compute or [permissions](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions) to all-purpose compute resources. \nYou must have a [volume](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html) in [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). 
This article uses a volume named `my-volume` in a schema named `default` within a catalog named `main`. Also, you must have the following permissions in Unity Catalog: \n* `READ VOLUME` and `WRITE VOLUME`, or `ALL PRIVILEGES`, for the `my-volume` volume.\n* `USE SCHEMA` or `ALL PRIVILEGES` for the `default` schema.\n* `USE CATALOG` or `ALL PRIVILEGES` for the `main` catalog. \nTo set these permissions, see your Databricks administrator or [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-quickstart.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Create your first workflow with a Databricks job\n##### Create the notebooks\n\n### Retrieve and save data \nTo create a notebook to retrieve the sample dataset and save it to Unity Catalog: \n1. Go to your Databricks landing page and click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar and select **Notebook**. Databricks creates and opens a new, blank notebook in your default folder. The default language is the language you most recently used, and the notebook is automatically attached to the compute resource that you most recently used.\n2. If necessary, [change the default language to Python](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#set-default-language).\n3. Copy the following Python code and paste it into the first cell of the notebook. \n```\nimport requests\n\nresponse = requests.get('https:\/\/health.data.ny.gov\/api\/views\/jxy9-yhdk\/rows.csv')\ncsvfile = response.content.decode('utf-8')\ndbutils.fs.put(\"\/Volumes\/main\/default\/my-volume\/babynames.csv\", csvfile, True)\n\n``` \n### Read and display filtered data \nTo create a notebook to read and present the data for filtering: \n1. 
Go to your Databricks landing page and click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar and select **Notebook**. Databricks creates and opens a new, blank notebook in your default folder. The default language is the language you most recently used, and the notebook is automatically attached to the compute resource that you most recently used.\n2. If necessary, [change the default language to Python](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#set-default-language).\n3. Copy the following Python code and paste it into the first cell of the notebook. \n```\nbabynames = spark.read.format(\"csv\").option(\"header\", \"true\").option(\"inferSchema\", \"true\").load(\"\/Volumes\/main\/default\/my-volume\/babynames.csv\")\nbabynames.createOrReplaceTempView(\"babynames_table\")\nyears = spark.sql(\"select distinct(Year) from babynames_table\").toPandas()['Year'].tolist()\nyears.sort()\ndbutils.widgets.dropdown(\"year\", \"2014\", [str(x) for x in years])\ndisplay(babynames.filter(babynames.Year == dbutils.widgets.get(\"year\")))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-quickstart.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Create your first workflow with a Databricks job\n##### Create a job\n\n1. Click ![Workflows Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows** in the sidebar.\n2. Click ![Create Job Button](https:\/\/docs.databricks.com\/_images\/create-job.png). \nThe **Tasks** tab displays with the create task dialog. \n![Create first task dialog](https:\/\/docs.databricks.com\/_images\/create-job-dialog.png)\n3. Replace **Add a name for your job\u2026** with your job name.\n4. In the **Task name** field, enter a name for the task; for example, **retrieve-baby-names**.\n5. In the **Type** drop-down menu, select **Notebook**.\n6. 
Use the file browser to find the first notebook you created, click the notebook name, and click **Confirm**.\n7. Click **Create task**.\n8. Click ![Add Task Button](https:\/\/docs.databricks.com\/_images\/add-task.png) below the task you just created to add another task.\n9. In the **Task name** field, enter a name for the task; for example, **filter-baby-names**.\n10. In the **Type** drop-down menu, select **Notebook**.\n11. Use the file browser to find the second notebook you created, click the notebook name, and click **Confirm**.\n12. Click **Add** under **Parameters**. In the **Key** field, enter `year`. In the **Value** field, enter `2014`.\n13. Click **Create task**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-quickstart.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Create your first workflow with a Databricks job\n##### Run the job\n\nTo run the job immediately, click ![Run Now Button](https:\/\/docs.databricks.com\/_images\/run-now-button.png) in the upper right corner. You can also run the job by clicking the **Runs** tab and clicking **Run Now** in the **Active Runs** table.\n\n#### Create your first workflow with a Databricks job\n##### View run details\n\n1. Click the **Runs** tab and click the link for the run in the **Active Runs** table or in the **Completed Runs (past 60 days)** table.\n2. Click either task to see the output and details. For example, click the **filter-baby-names** task to view the output and run details for the filter task: \n![View filter names results](https:\/\/docs.databricks.com\/_images\/quickstart-view-results.png)\n\n#### Create your first workflow with a Databricks job\n##### Run with different parameters\n\nTo re-run the job and filter baby names for a different year: \n1. 
Click ![Blue Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret-blue.png) next to **Run Now** and select **Run Now with Different Parameters** or click **Run Now with Different Parameters** in the [Active Runs](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#view-job-run-list) table.\n2. In the **Value** field, enter `2015`.\n3. Click **Run**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-quickstart.html"} +{"content":"# Technology partners\n## Databricks sign-on from partner solutions\n#### Override partner OAuth token lifetime policy\n\nThis article describes how to override the OAuth token lifetime policy for existing partner OAuth applications. \nNote \nUpdates to partner OAuth applications can take 30 minutes to process.\n\n#### Override partner OAuth token lifetime policy\n##### Before you begin\n\nBefore you override the OAuth token lifetime policy, do the following: \n* [Install the Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/install.html) and [set up authentication between the Databricks CLI and your Databricks account](https:\/\/docs.databricks.com\/dev-tools\/cli\/authentication.html).\n* [Locate your account ID](https:\/\/docs.databricks.com\/admin\/account-settings\/index.html#account-id).\n* Locate the integration ID of the OAuth application you want to modify. 
\n+ For dbt Core, Power BI, or Tableau Desktop, run the following command:\n```\ndatabricks account published-app-integration list\n\n``` \n+ For Tableau Cloud or Tableau Server, run the following command:\n```\ndatabricks account custom-app-integration list\n\n``` \nThe unique integration ID for each OAuth application is returned.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/manage-oauth.html"} +{"content":"# Technology partners\n## Databricks sign-on from partner solutions\n#### Override partner OAuth token lifetime policy\n##### Override the default token lifetime policy for dbt Core, Power BI, or Tableau Desktop\n\nTo override the default token lifecycle policy (`token_access_policy`) for dbt Core, Power BI, or Tableau Desktop, run the following command: \n```\ndatabricks account published-app-integration update <integration-id> --json '{\"token_access_policy\": {\"access_token_ttl_in_minutes\": <new-access-token-ttl>,\"refresh_token_ttl_in_minutes\":<new-refresh-token-ttl>}}'\n\n``` \n* Replace `<integration-id>` with either `databricks-dbt-adapter`, `power-bi`, or `tableau-desktop`.\n* Replace `<new-access-token-ttl>` with the new access token lifetime.\n* Replace `<new-refresh-token-ttl>` with the new refresh token lifetime.\n\n#### Override partner OAuth token lifetime policy\n##### Override the default token lifetime policy for Tableau Cloud or Tableau Server\n\nTo override the default token lifecycle policy (`token_access_policy`) for Tableau Cloud or Tableau Server, run the following command: \n```\ndatabricks account custom-app-integration update <integration-id> --json '{\"token_access_policy\": {\"access_token_ttl_in_minutes\": <new-access-token-ttl>,\"refresh_token_ttl_in_minutes\":<new-refresh-token-ttl>}}'\n\n``` \n* Replace `<integration-id>` with the integration ID of the OAuth application you want to modify.\n* Replace `<new-access-token-ttl>` with the new access token lifetime.\n* Replace `<new-refresh-token-ttl>` with the new 
refresh token lifetime.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/manage-oauth.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Audit and monitor data sharing\n\nThis article describes how data providers and recipients can use audit logs to monitor Delta Sharing events. Provider audit logs record actions taken by the provider and actions taken by recipients on the provider\u2019s shared data. Recipient audit logs record events related to the accessing of shares and the management of provider objects. \nTo view the list of Delta Sharing audit log events, see [Delta Sharing events](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html#ds).\n\n### Audit and monitor data sharing\n#### Requirements\n\nTo access audit logs, an account admin must enable the audit log system table for your Databricks account. See [Enable system tables](https:\/\/docs.databricks.com\/admin\/system-tables\/index.html#enable). For information on the audit log system table, see [Audit log system table reference](https:\/\/docs.databricks.com\/admin\/system-tables\/audit-logs.html). \nIf you are not an account admin or metastore admin, you must be given access to `system.access.audit` to read audit logs.\n\n### Audit and monitor data sharing\n#### View Delta Sharing events in the audit log\n\nIf your account has system tables enabled, audit logs are stored in `system.access.audit`. 
If, alternatively, your account has an [audit log delivery setup](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-log-delivery.html), you need to know the bucket and path where the logs are delivered.\n\n### Audit and monitor data sharing\n#### Logged events\n\nTo view the list of Delta Sharing audit log events, see [Delta Sharing events](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html#ds).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/audit-logs.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Audit and monitor data sharing\n#### View details of a recipient\u2019s query result\n\nIn the provider logs, the events returned as `deltaSharingQueriedTableChanges` and `deltaSharingQueriedTable` are logged after a data recipient\u2019s query gets a response. Providers can view the `response.result` field of these logs to see more details about what was shared with the recipient. The field can include the following values. This list is not exhaustive. 
\n```\n\"checkpointBytes\": \"0\",\n\"earlyTermination\": \"false\",\n\"maxRemoveFiles\": \"0\",\n\"path\": \"file: example\/s3\/path\/golden\/snapshot-data0\/_delta_log\",\n\"deltaSharingPartitionFilteringAccessed\": \"false\",\n\"deltaSharingRecipientId\": \"<redacted>\",\n\"deltaSharingRecipientIdHash\": \"<recipient-hash-id>\",\n\"jsonLogFileNum\": \"1\",\n\"scannedJsonLogActionNum\": \"5\",\n\"numRecords\": \"3\",\n\"deltaSharingRecipientMetastoreId\": \"<redacted>\",\n\"userAgent\": \"Delta-Sharing-Unity-Catalog-Databricks-Auth\/1.0 Linux\/4.15.0-2068-azure-fips OpenJDK_64-Bit_Server_VM\/11.0.7+10-jvmci-20.1-b02 java\/11.0.7 scala\/2.12.15 java_vendor\/GraalVM_Community\",\n\"jsonLogFileBytes\": \"2846\",\n\"checkpointFileNum\": \"0\",\n\"metastoreId\": \"<redacted>\",\n\"limitHint\": \"Some(1)\",\n\"tableName\": \"cookie_ingredients\",\n\"tableId\": \"1234567c-6d8b-45fd-9565-32e9fc23f8f3\",\n\"activeAddFiles\": \"2\", \/\/ number of AddFiles returned in the query\n\"numAddFiles\": \"2\", \/\/ number of AddFiles returned in the query\n\"numAddCDCFiles\": \"2\", \/\/ number of AddFiles returned in the CDF query\n\"numRemoveFiles\": \"2\", \/\/ number of RemoveFiles returned in the query\n\"numSeenAddFiles\": \"3\",\n\"scannedAddFileSize\": \"1300\", \/\/ file size in bytes for the AddFile returned in the query\n\"scannedAddCDCFileSize\": \"1300\", \/\/ file size in bytes for the AddCDCFile returned in the CDF query\n\"scannedRemoveFileSize\": \"1300\", \/\/ file size in bytes for the RemoveFile returned in the query\n\"scannedCheckpointActionNum\": \"0\",\n\"tableVersion\": \"0\"\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/audit-logs.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Audit and monitor data sharing\n#### Logged errors\n\nIf an attempted Delta Sharing action fails, the action is logged with the error message in the `response.error_message` field of the log. 
Items between `<` and `>` characters represent placeholder text. \n### Error messages in provider logs \nDelta Sharing logs the following errors for data providers: \n* Delta Sharing is not enabled on the selected metastore. \n```\nDatabricksServiceException: FEATURE_DISABLED:\nDelta Sharing is not enabled\n\n```\n* An operation was attempted on a catalog that does not exist. \n```\nDatabricksServiceException: CATALOG_DOES_NOT_EXIST:\nCatalog \u2018<catalog>\u2019 does not exist.\n\n```\n* A user who is not an account admin or metastore admin attempted to perform a privileged operation. \n```\nDatabricksServiceException: PERMISSION_DENIED:\nOnly administrators can <operation-name> <operation-target>\n\n```\n* An operation was attempted on a metastore from a workspace to which the metastore is not assigned. \n```\nDatabricksServiceException: INVALID_STATE:\nWorkspace <workspace-name> is no longer assigned to this metastore\n\n```\n* A request was missing the recipient name or share name. \n```\nDatabricksServiceException: INVALID_PARAMETER_VALUE: CreateRecipient\/CreateShare Missing required field: <recipient-name>\/<share-name>\n\n```\n* A request included an invalid recipient name or share name. \n```\nDatabricksServiceException: INVALID_PARAMETER_VALUE: CreateRecipient\/CreateShare <recipient-name>\/<share-name> is not a valid name\n\n```\n* A user attempted to share a table that is not in a Unity Catalog metastore. \n```\nDatabricksServiceException: INVALID_PARAMETER_VALUE: Only managed or external table on Unity Catalog can be added to a share\n\n```\n* A user attempted to rotate a recipient that was already in a rotated state and whose previous token had not yet expired. \n```\nDatabricksServiceException: INVALID_PARAMETER_VALUE: There are already two active tokens for recipient <recipient-name>\n\n```\n* A user attempted to create a new recipient or share with the same name as an existing one. 
\n```\nDatabricksServiceException: RECIPIENT_ALREADY_EXISTS\/SHARE_ALREADY_EXISTS: Recipient\/Share <name> already exists\n\n```\n* A user attempted to perform an operation on a recipient or share that does not exist. \n```\nDatabricksServiceException: RECIPIENT_DOES_NOT_EXIST\/SHARE_DOES_NOT_EXIST: Recipient\/Share '<name>' does not exist\n\n```\n* A user attempted to add a table to a share, but the table had already been added. \n```\nDatabricksServiceException: RESOURCE_ALREADY_EXISTS: Shared Table '<name>' already exists\n\n```\n* A user attempted to perform an operation that referenced a table that does not exist. \n```\nDatabricksServiceException: TABLE_DOES_NOT_EXIST: Table '<name>' does not exist\n\n```\n* A user attempted to perform an operation that referenced a schema that did not exist. \n```\nDatabricksServiceException: SCHEMA_DOES_NOT_EXIST: Schema '<name>' does not exist\n\n```\n* A user attempted to access a share that does not exist. \n```\nDatabricksServiceException: SHARE_DOES_NOT_EXIST: Share <share-name> does not exist.\n\n``` \n### Error messages in recipient logs \nDelta Sharing logs the following errors for data recipients: \n* The user attempted to access a share they do not have permission to access. \n```\nDatabricksServiceException: PERMISSION_DENIED:\nUser does not have SELECT on Share <share-name>\n\n```\n* The user attempted to access a share that does not exist. \n```\nDatabricksServiceException: SHARE_DOES_NOT_EXIST: Share <share-name> does not exist.\n\n```\n* The user attempted to access a table that does not exist in the share. \n```\nDatabricksServiceException: TABLE_DOES_NOT_EXIST: <table-name> does not exist.\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/audit-logs.html"} +{"content":"# What is data warehousing on Databricks?\n### Use materialized views in Databricks SQL\n\nPreview \nThis feature is in Public Preview. 
\nThis article describes how to create and use materialized views in Databricks SQL to improve performance and reduce the cost of your data processing and analysis workloads.\n\n### Use materialized views in Databricks SQL\n#### What are materialized views?\n\nIn Databricks SQL, materialized views are Unity Catalog managed tables that allow users to precompute results based on the latest version of data in source tables. Materialized views on Databricks differ from other implementations as the results returned reflect the state of data when the materialized view was last refreshed rather than always updating results when the materialized view is queried. You can manually refresh materialized views or schedule refreshes. \nMaterialized views are powerful for data processing workloads such as extract, transform, and load (ETL) processing. Materialized views provide a simple, declarative way to process data for compliance, corrections, aggregations, or general change data capture (CDC). Materialized views reduce cost and improve query latency by pre-computing slow queries and frequently used computations. Materialized views also enable easy-to-use transformations by cleaning, enriching, and denormalizing base tables. Materialized views can reduce costs while providing a simplified end-user experience because, in some cases, they can incrementally compute changes from the base tables. \nMaterialized views were first supported on the Databricks Data Intelligence Platform with the launch of [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html). When you create a materialized view in a Databricks SQL warehouse, a Delta Live Tables pipeline is created to process refreshes to the materialized view. You can monitor the status of refresh operations in the Delta Live Tables UI, the Delta Live Tables API, or the Delta Live Tables CLI. 
See [View the status of a materialized view refresh](https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html#view-status).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html"} +{"content":"# What is data warehousing on Databricks?\n### Use materialized views in Databricks SQL\n#### Requirements\n\n* You must use a Unity Catalog-enabled Databricks SQL warehouse to create and refresh materialized views. \n* You must have accepted the serverless [terms of use](https:\/\/docs.databricks.com\/admin\/sql\/serverless.html#accept-terms). \n* Your workspace must be in a [serverless-enabled region](https:\/\/docs.databricks.com\/resources\/supported-regions.html). \nTo learn about restrictions when using materialized views with Databricks SQL, see [Limitations](https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html#mv-limitations).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html"} +{"content":"# What is data warehousing on Databricks?\n### Use materialized views in Databricks SQL\n#### Create a materialized view\n\nTo create a materialized view, use the `CREATE MATERIALIZED VIEW` statement. See [CREATE MATERIALIZED VIEW](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-materialized-view.html) in the Databricks SQL reference. To submit a create statement, use the [SQL editor](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html) in the Databricks UI, the [Databricks SQL CLI](https:\/\/docs.databricks.com\/dev-tools\/databricks-sql-cli.html), or the [Databricks SQL API](https:\/\/docs.databricks.com\/dev-tools\/sql-execution-tutorial.html). 
\nNote \nThe user who creates a materialized view is the materialized view owner and needs to have the following permissions: \n* `SELECT` privilege on the base tables referenced by the materialized view.\n* `USE CATALOG` and `USE SCHEMA` privileges on the catalog and schema containing the source tables for the materialized view.\n* `USE CATALOG` and `USE SCHEMA` privileges on the target catalog and schema for the materialized view.\n* `CREATE TABLE` and `CREATE MATERIALIZED VIEW` privileges on the schema containing the materialized view. \nThe following example creates the materialized view `mv1` from the base table `table1`: \n```\nCREATE MATERIALIZED VIEW mv1\nAS SELECT\ndate, sum(sales) AS sum_of_sales\nFROM\ntable1\nGROUP BY\ndate;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html"} +{"content":"# What is data warehousing on Databricks?\n### Use materialized views in Databricks SQL\n#### How are materialized views created?\n\nDatabricks SQL materialized view `CREATE` operations use a Databricks SQL warehouse to create and load data in the materialized view. Because creating a materialized view is a synchronous operation in the Databricks SQL warehouse, the `CREATE MATERIALIZED VIEW` command blocks until the materialized view is created and the initial data load finishes. A Delta Live Tables pipeline is automatically created for every Databricks SQL materialized view. When the materialized view is [refreshed](https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html#mv-refresh), an update to the Delta Live Tables pipeline is started to process the refresh.\n\n### Use materialized views in Databricks SQL\n#### Load data from external systems\n\nDatabricks recommends loading external data using Lakehouse Federation for [supported data sources](https:\/\/docs.databricks.com\/query-federation\/index.html#connection-types). 
For information on loading data from sources not supported by Lakehouse Federation, see [Data format options](https:\/\/docs.databricks.com\/query\/formats\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html"} +{"content":"# What is data warehousing on Databricks?\n### Use materialized views in Databricks SQL\n#### Refresh a materialized view\n\nThe `REFRESH` operation refreshes the materialized view to reflect the latest changes to the base table. To refresh a materialized view, use the `REFRESH MATERIALIZED VIEW` statement. See [REFRESH (MATERIALIZED VIEW and STREAMING TABLE)](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-refresh-full.html) in the Databricks SQL reference. To submit a refresh statement, use the [SQL editor](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html) in the Databricks UI, the [Databricks SQL CLI](https:\/\/docs.databricks.com\/dev-tools\/databricks-sql-cli.html), or the [Databricks SQL API](https:\/\/docs.databricks.com\/dev-tools\/sql-execution-tutorial.html). \nOnly the owner can `REFRESH` the materialized view. \nThe following example refreshes the `mv1` materialized view: \n```\nREFRESH MATERIALIZED VIEW mv1;\n\n```\n\n### Use materialized views in Databricks SQL\n#### How are Databricks SQL materialized views refreshed?\n\nDatabricks SQL materialized views use Delta Live Tables for refresh operations. When the materialized view is refreshed, an [update](https:\/\/docs.databricks.com\/delta-live-tables\/updates.html) to the Delta Live Tables pipeline managing the materialized view is started to process the refresh. \nBecause the refresh is managed by a Delta Live Tables pipeline, the Databricks SQL warehouse used to create the materialized view is not used and does not need to be running during the refresh operation. \nSome queries can be incrementally refreshed. 
See [Refresh operations for materialized views](https:\/\/docs.databricks.com\/optimizations\/incremental-refresh.html). If an incremental refresh cannot be performed, a full refresh is performed instead.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html"} +{"content":"# What is data warehousing on Databricks?\n### Use materialized views in Databricks SQL\n#### Schedule materialized view refreshes\n\nYou can configure a Databricks SQL materialized view to refresh automatically based on a defined schedule. Configure this schedule with the `SCHEDULE` clause when you [create the materialized view](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-materialized-view.html) or add a schedule with the [ALTER VIEW](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-view.html) statement. When a schedule is created, a new Databricks job is automatically configured to process the update. You can view the schedule any time with the `DESCRIBE EXTENDED` statement.\n\n### Use materialized views in Databricks SQL\n#### Update the definition of a materialized view\n\nTo update the definition of a materialized view, you must first drop, then re-create the materialized view.\n\n### Use materialized views in Databricks SQL\n#### Drop a materialized view\n\nNote \nTo submit the command to drop a materialized view, you must be the owner of that materialized view. \nTo drop a materialized view, use the [DROP VIEW](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-drop-view.html) statement. To submit a `DROP` statement, you can use the [SQL editor](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html) in the Databricks UI, the [Databricks SQL CLI](https:\/\/docs.databricks.com\/dev-tools\/databricks-sql-cli.html), or the [Databricks SQL API](https:\/\/docs.databricks.com\/dev-tools\/sql-execution-tutorial.html). 
The following example drops the `mv1` materialized view: \n```\nDROP MATERIALIZED VIEW mv1;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html"} +{"content":"# What is data warehousing on Databricks?\n### Use materialized views in Databricks SQL\n#### Describe a materialized view\n\nTo retrieve the columns and data types for a materialized view, use the `DESCRIBE` statement. To retrieve the columns, data types, and metadata such as owner, location, creation time, and refresh status for a materialized view, use `DESCRIBE EXTENDED`. To submit a `DESCRIBE` statement, use the [SQL editor](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html) in the Databricks UI, the [Databricks SQL CLI](https:\/\/docs.databricks.com\/dev-tools\/databricks-sql-cli.html), or the [Databricks SQL API](https:\/\/docs.databricks.com\/dev-tools\/sql-execution-tutorial.html).\n\n### Use materialized views in Databricks SQL\n#### View the status of a materialized view refresh\n\nNote \nBecause a Delta Live Tables pipeline manages materialized view refreshes, there is latency incurred by the startup time for the pipeline. This time might be in the seconds to minutes, in addition to the time required to perform the refresh. \nYou can view the status of a materialized view refresh by viewing the pipeline that manages the materialized view in the Delta Live Tables UI or by viewing the **Refresh Information** returned by the `DESCRIBE EXTENDED` command for the materialized view. \nYou can also view the refresh history of a materialized view by querying the Delta Live Tables event log. 
See [View the refresh history for a materialized view](https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html#refresh-history).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html"} +{"content":"# What is data warehousing on Databricks?\n### Use materialized views in Databricks SQL\n#### View the refresh status in the Delta Live Tables UI\n\nBy default, the Delta Live Tables pipeline that manages a materialized view is not visible in the Delta Live Tables UI. To view the pipeline in the Delta Live Tables UI, you must directly access the link to the pipeline\u2019s **Pipeline details** page. To access the link: \n* If you submit the `REFRESH` command in the [SQL editor](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html), follow the link in the **Results** panel.\n* Follow the link returned by the `DESCRIBE EXTENDED` statement.\n* On the [lineage tab](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html) for the materialized view, click **Pipelines** and then click the pipeline link.\n\n### Use materialized views in Databricks SQL\n#### Stop an active refresh\n\nTo stop an active refresh in the Delta Live Tables UI, click **Stop** on the **Pipeline details** page to stop the pipeline update. You can also stop the refresh with the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) or the [POST \/api\/2.0\/pipelines\/{pipeline\\_id}\/stop](https:\/\/docs.databricks.com\/api\/workspace\/pipelines\/stop) operation in the Pipelines API.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html"} +{"content":"# What is data warehousing on Databricks?\n### Use materialized views in Databricks SQL\n#### Change the owner of a materialized view\n\nYou can change the owner of a materialized view if you are both a metastore admin and a workspace admin. 
Materialized views automatically create and use Delta Live Tables pipelines to process changes. Use the following steps to change a materialized view\u2019s owner: \n* Click ![Jobs Icon](https:\/\/docs.databricks.com\/_images\/workflows-icon.png) **Workflows**, then click the **Delta Live Tables** tab.\n* Click the name of the pipeline whose owner you want to change.\n* Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu to the right of the pipeline name and click **Permissions**. This opens the permissions dialog.\n* Click **x** to the right of the current owner\u2019s name to remove the current owner.\n* Start typing to filter the list of available users. Click the user who should be the new pipeline owner.\n* Click **Save** to save your changes and close the dialog. \nAll pipeline assets, including materialized views defined in the pipeline, are owned by the new pipeline owner. All future updates are run using the new owner\u2019s identity.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html"} +{"content":"# What is data warehousing on Databricks?\n### Use materialized views in Databricks SQL\n#### Control access to materialized views\n\nMaterialized views support rich access controls to support data-sharing while avoiding exposing potentially private data. A materialized view owner can grant `SELECT` privileges to other users. Users with `SELECT` access to the materialized view do not need `SELECT` access to the tables referenced by the materialized view. This access control enables data sharing while controlling access to the underlying data. \n### Grant privileges to a materialized view \nTo grant access to a materialized view, use the `GRANT` statement: \n```\nGRANT\nprivilege_type [, privilege_type ] ...\nON <mv_name> TO principal;\n\n``` \nThe privilege\\_type can be: \n* `SELECT` - the user can `SELECT` the materialized view.\n* `REFRESH` - the user can `REFRESH` the materialized view. 
Refreshes are run using the owner\u2019s permissions. \nThe following example creates a materialized view and grants select and refresh privileges to a user: \n```\nCREATE MATERIALIZED VIEW <mv_name> AS SELECT * FROM <base_table>;\nGRANT SELECT ON <mv_name> TO user;\nGRANT REFRESH ON <mv_name> TO user;\n\n``` \n### Revoke privileges from a materialized view \nTo revoke access from a materialized view, use the `REVOKE` statement: \n```\nREVOKE\nprivilege_type [, privilege_type ]\nON <name> FROM principal;\n\n``` \nWhen `SELECT` privileges on a base table are revoked from the materialized view owner or any other user who has been granted `SELECT` privileges to the materialized view, or the base table is dropped, the materialized view owner or user granted access is still able to query the materialized view. However, the following behavior occurs: \n* The materialized view owner or others who have lost access to a materialized view can no longer `REFRESH` that materialized view, and the materialized view will become stale.\n* If automated with a schedule, the next scheduled `REFRESH` fails or is not run. \nThe following example revokes the `SELECT` privilege on `mv1` from `user1`: \n```\nREVOKE SELECT ON mv1 FROM user1;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html"} +{"content":"# What is data warehousing on Databricks?\n### Use materialized views in Databricks SQL\n#### Enable change data feed\n\nChange data feed is required on the materialized view\u2019s base tables, except for certain advanced use cases. 
To enable change data feed on a base table, set the `delta.enableChangeDataFeed` table property using the following syntax: \n```\nALTER TABLE table1 SET TBLPROPERTIES (delta.enableChangeDataFeed = true);\n\n```\n\n### Use materialized views in Databricks SQL\n#### View the refresh history for a materialized view\n\nTo view the status of `REFRESH` operations on a materialized view, including current and past refreshes, query the Delta Live Tables event log: \n```\nSELECT\n*\nFROM\nevent_log(TABLE(<fully-qualified-table-name>))\nWHERE\nevent_type = \"update_progress\"\nORDER BY\ntimestamp desc;\n\n``` \nReplace `<fully-qualified-table-name>` with the fully qualified name of the materialized view, including the catalog and schema. \nSee [What is the Delta Live Tables event log?](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html#event-log).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html"} +{"content":"# What is data warehousing on Databricks?\n### Use materialized views in Databricks SQL\n#### Determining if an incremental or full refresh is used\n\nTo optimize the performance of materialized view refreshes, Databricks uses a cost model to select the technique used for the refresh. The following table describes these techniques: \n| Technique | Incremental refresh? | Description |\n| --- | --- | --- |\n| `FULL_RECOMPUTE` | No | The materialized view was fully recomputed |\n| `NO_OP` | Not applicable | The materialized view was not updated because no changes to the base table were detected. |\n| `ROW_BASED` or `PARTITION_OVERWRITE` | Yes | The materialized view was incrementally refreshed using the specified technique. 
| \nTo determine the technique used, query the Delta Live Tables event log where the `event_type` is `planning_information`: \n```\nSELECT\ntimestamp,\nmessage\nFROM\nevent_log(TABLE(<fully-qualified-table-name>))\nWHERE\nevent_type = 'planning_information'\nORDER BY\ntimestamp desc;\n\n``` \nReplace `<fully-qualified-table-name>` with the fully qualified name of the materialized view, including the catalog and schema. \nSee [What is the Delta Live Tables event log?](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html#event-log).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html"} +{"content":"# What is data warehousing on Databricks?\n### Use materialized views in Databricks SQL\n#### Limitations\n\n* There are restrictions on how materialized views can be managed and where they can be queried: \n+ Databricks SQL materialized views can only be created and refreshed in pro SQL warehouses and serverless SQL warehouses.\n+ A Databricks SQL materialized view can only be refreshed from the workspace that created it.\n+ Databricks SQL materialized views can only be queried from Databricks SQL warehouses, Delta Live Tables, and shared clusters running Databricks Runtime 11.3 or greater. You cannot query materialized views from Single User access mode clusters.\n* Materialized views do not support identity columns or surrogate keys.\n* If a materialized view uses a sum aggregate over a `NULL`-able column and only `NULL` values remain in that column, the materialized view\u2019s resultant aggregate value is zero instead of `NULL`.\n* The underlying files supporting materialized views might include data from upstream tables (including possible personally identifiable information) that do not appear in the materialized view definition. This data is automatically added to the underlying storage to support incremental refreshing of materialized views. 
Because the underlying files of a materialized view might risk exposing data from upstream tables not part of the materialized view schema, Databricks recommends not sharing the underlying storage with untrusted downstream consumers. For example, suppose the definition of a materialized view includes a `COUNT(DISTINCT field_a)` clause. Even though the materialized view definition only includes the aggregate `COUNT DISTINCT` clause, the underlying files will contain a list of the actual values of `field_a`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Read and write data from Snowflake\n\nDatabricks provides a Snowflake connector in the Databricks Runtime to support reading and writing data from Snowflake. \nNote \nYou may prefer Lakehouse Federation for managing queries on Snowflake data. See [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/snowflake.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Read and write data from Snowflake\n##### Query a Snowflake table in Databricks\n\nYou can configure a connection to Snowflake and then query data. Before you begin, check which version of Databricks Runtime your cluster runs on. The following code provides example syntax in Python, SQL, and Scala. 
\n```\n\n# The following example applies to Databricks Runtime 11.3 LTS and above.\n\nsnowflake_table = (spark.read\n.format(\"snowflake\")\n.option(\"host\", \"hostname\")\n.option(\"port\", \"port\") # Optional - will use default port 443 if not specified.\n.option(\"user\", \"username\")\n.option(\"password\", \"password\")\n.option(\"sfWarehouse\", \"warehouse_name\")\n.option(\"database\", \"database_name\")\n.option(\"schema\", \"schema_name\") # Optional - will use default schema \"public\" if not specified.\n.option(\"dbtable\", \"table_name\")\n.load()\n)\n\n# The following example applies to Databricks Runtime 10.4 and below.\n\nsnowflake_table = (spark.read\n.format(\"snowflake\")\n.option(\"dbtable\", table_name)\n.option(\"sfUrl\", database_host_url)\n.option(\"sfUser\", username)\n.option(\"sfPassword\", password)\n.option(\"sfDatabase\", database_name)\n.option(\"sfSchema\", schema_name)\n.option(\"sfWarehouse\", warehouse_name)\n.load()\n)\n\n``` \n```\n\/* The following example applies to Databricks Runtime 11.3 LTS and above. *\/\n\nDROP TABLE IF EXISTS snowflake_table;\nCREATE TABLE snowflake_table\nUSING snowflake\nOPTIONS (\nhost '<hostname>',\nport '<port>', \/* Optional - will use default port 443 if not specified. *\/\nuser '<username>',\npassword '<password>',\nsfWarehouse '<warehouse_name>',\ndatabase '<database-name>',\nschema '<schema-name>', \/* Optional - will use default schema \"public\" if not specified. *\/\ndbtable '<table-name>'\n);\nSELECT * FROM snowflake_table;\n\n\/* The following example applies to Databricks Runtime 10.4 LTS and below. 
*\/\n\nDROP TABLE IF EXISTS snowflake_table;\nCREATE TABLE snowflake_table\nUSING snowflake\nOPTIONS (\ndbtable '<table-name>',\nsfUrl '<database-host-url>',\nsfUser '<username>',\nsfPassword '<password>',\nsfDatabase '<database-name>',\nsfSchema '<schema-name>',\nsfWarehouse '<warehouse-name>'\n);\nSELECT * FROM snowflake_table;\n\n``` \n```\n\/\/ The following example applies to Databricks Runtime 11.3 LTS and above.\n\nval snowflake_table = spark.read\n.format(\"snowflake\")\n.option(\"host\", \"hostname\")\n.option(\"port\", \"port\") \/* Optional - will use default port 443 if not specified. *\/\n.option(\"user\", \"username\")\n.option(\"password\", \"password\")\n.option(\"sfWarehouse\", \"warehouse_name\")\n.option(\"database\", \"database_name\")\n.option(\"schema\", \"schema_name\") \/* Optional - will use default schema \"public\" if not specified. *\/\n.option(\"dbtable\", \"table_name\")\n.load()\n\n\/\/ The following example applies to Databricks Runtime 10.4 and below.\n\nval snowflake_table = spark.read\n.format(\"snowflake\")\n.option(\"dbtable\", table_name)\n.option(\"sfUrl\", database_host_url)\n.option(\"sfUser\", username)\n.option(\"sfPassword\", password)\n.option(\"sfDatabase\", database_name)\n.option(\"sfSchema\", schema_name)\n.option(\"sfWarehouse\", warehouse_name)\n.load()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/snowflake.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Read and write data from Snowflake\n##### Notebook example: Snowflake Connector for Spark\n\nThe following notebooks provide simple examples of how to write data to and read data from Snowflake. See [Using the Spark Connector](https:\/\/docs.snowflake.com\/en\/user-guide\/spark-connector-use.html) for more details. 
In particular, see [Setting Configuration Options for the Connector](https:\/\/docs.snowflake.com\/en\/user-guide\/spark-connector-use.html#setting-configuration-options-for-the-connector) for all configuration options. \nTip \nAvoid exposing your Snowflake username and password in notebooks by using [Secrets](https:\/\/docs.databricks.com\/security\/secrets\/index.html), which are demonstrated in the notebooks. \n### Snowflake Python notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/snowflake-python.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n#### Read and write data from Snowflake\n##### Notebook example: Save model training results to Snowflake\n\nThe following notebook walks through best practices for using the Snowflake Connector for Spark. It writes data to Snowflake, uses Snowflake for some basic data manipulation, trains a machine learning model in Databricks, and writes the results back to Snowflake. \n### Store ML training results in Snowflake notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/snowflake-ml.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/snowflake.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Read and write data from Snowflake\n##### Frequently asked questions (FAQ)\n\n### Why don\u2019t my Spark DataFrame columns appear in the same order in Snowflake? \nThe Snowflake Connector for Spark doesn\u2019t respect the order of the columns in the table being written to; you must explicitly specify the mapping between DataFrame and Snowflake columns. To specify this mapping, use the [columnmap parameter](https:\/\/docs.snowflake.net\/manuals\/user-guide\/spark-connector-use.html#setting-configuration-options-for-the-connector). 
\n### Why is `INTEGER` data written to Snowflake read back as `DECIMAL`? \nSnowflake represents all `INTEGER` types as `NUMBER`, which can cause a change in data type when you write data to and read data from Snowflake. For example, `INTEGER` data can be converted to `DECIMAL` when writing to Snowflake, because `INTEGER` and `DECIMAL` are semantically equivalent in Snowflake (see [Snowflake Numeric Data Types](https:\/\/docs.snowflake.net\/manuals\/sql-reference\/data-types-numeric.html#int-integer-bigint-smallint-tinyint-byteint)). \n### Why are the fields in my Snowflake table schema always uppercase? \nSnowflake uses uppercase fields by default, which means that the table schema is converted to uppercase.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/snowflake.html"} +{"content":"# \n### AI and Machine Learning on Databricks\n\nThis article describes the tools that Databricks provides to help you build and monitor AI and ML workflows. The diagram shows how these components work together to help you implement your model development and deployment process. \n![Machine learning diagram: Model development and deployment on Databricks](https:\/\/docs.databricks.com\/_images\/ml-diagram-model-development-deployment.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/index.html"} +{"content":"# \n### AI and Machine Learning on Databricks\n#### Why use Databricks for machine learning and deep learning?\n\nWith Databricks, you can implement the full ML lifecycle on a single platform with end-to-end governance throughout the ML pipeline. 
Databricks includes the following built-in tools to support ML workflows: \n* [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) for governance, discovery, versioning, and access control for data, features, models, and functions.\n* [Lakehouse Monitoring](https:\/\/docs.databricks.com\/lakehouse-monitoring\/index.html) for data monitoring.\n* [Feature engineering and serving](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html).\n* Support for the model lifecycle: \n+ [Databricks AutoML](https:\/\/docs.databricks.com\/machine-learning\/automl\/index.html) for automated model training.\n+ [MLflow for model development tracking](https:\/\/docs.databricks.com\/mlflow\/tracking.html).\n+ [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) for model management.\n+ [Databricks Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) for high-availability, low-latency model serving. 
This includes deploying LLMs using: \n- [Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html) which allow you to access and query state-of-the-art open models from a serving endpoint.\n- [External models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html) which allow you to access models hosted outside of Databricks.\n+ [Lakehouse Monitoring](https:\/\/docs.databricks.com\/lakehouse-monitoring\/index.html) to track model prediction quality and drift.\n* [Databricks Workflows](https:\/\/docs.databricks.com\/workflows\/index.html) for automated workflows and production-ready ETL pipelines.\n* [Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html) for code management and Git integration.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/index.html"} +{"content":"# \n### AI and Machine Learning on Databricks\n#### Deep learning on Databricks\n\nConfiguring infrastructure for deep learning applications can be difficult. \n[Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html#dbr-ml) takes care of that for you, with clusters that have built-in compatible versions of the most common deep learning libraries like TensorFlow, PyTorch, and Keras, and supporting libraries such as Petastorm, Hyperopt, and Horovod. Databricks Runtime ML clusters also include pre-configured GPU support with drivers and supporting libraries. It also supports libraries like [Ray](https:\/\/docs.databricks.com\/machine-learning\/ray-integration.html) to parallelize compute processing for scaling ML workflows and AI applications. \n[Databricks Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) enables creation of scalable GPU endpoints for deep learning models with no extra configuration. 
\nFor machine learning applications, Databricks recommends using a cluster running Databricks Runtime for Machine Learning. See [Create a cluster using Databricks Runtime ML](https:\/\/docs.databricks.com\/machine-learning\/index.html#create-ml-cluster). \nTo get started with deep learning on Databricks, see: \n* [Best practices for deep learning on Databricks](https:\/\/docs.databricks.com\/machine-learning\/train-model\/dl-best-practices.html)\n* [Deep learning on Databricks](https:\/\/docs.databricks.com\/machine-learning\/train-model\/deep-learning.html)\n* [Reference solutions for deep learning](https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/index.html"} +{"content":"# \n### AI and Machine Learning on Databricks\n#### Large language models (LLMs) and generative AI on Databricks\n\n[Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html#dbr-ml) includes libraries like [Hugging Face Transformers](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/index.html) and [LangChain](https:\/\/docs.databricks.com\/large-language-models\/langchain.html) that allow you to integrate existing pre-trained models or other open-source libraries into your workflow. The Databricks MLflow integration makes it easy to use the MLflow tracking service with transformer pipelines, models, and processing components. In addition, you can integrate [OpenAI](https:\/\/platform.openai.com\/docs\/introduction) models or solutions from partners like [John Snow Labs](https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/natural-language-processing.html#john-snow-labs) in your Databricks workflows. \nWith Databricks, you can customize a LLM on your data for your specific task. 
With the support of open source tooling, such as Hugging Face and DeepSpeed, you can efficiently take a foundation LLM and train it with your own data to improve its accuracy for your specific domain and workload. You can then leverage the custom LLM in your generative AI applications. \nIn addition, Databricks provides [Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html) and [external models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html), which allow you to access and query state-of-the-art open models from a serving endpoint. Using Foundation Model APIs, developers can quickly and easily build applications that leverage a high-quality generative AI model without maintaining their own model deployment. \nFor SQL users, Databricks provides AI functions that SQL data analysts can use to access LLMs, including models from OpenAI, directly within their data pipelines and workflows. See [AI Functions on Databricks](https:\/\/docs.databricks.com\/large-language-models\/ai-functions.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/index.html"} +{"content":"# \n### AI and Machine Learning on Databricks\n#### Databricks Runtime for Machine Learning\n\nDatabricks Runtime for Machine Learning (Databricks Runtime ML) automates the creation of a cluster with pre-built machine learning and deep learning infrastructure including the most common ML and DL libraries. For the full list of libraries in each version of Databricks Runtime ML, see the [release notes](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). \nTo access data in Unity Catalog for machine learning workflows, the access mode for the cluster must be single user (assigned). Shared clusters are not compatible with Databricks Runtime for Machine Learning. 
In addition, Databricks Runtime ML is not supported on [TableACLs clusters](https:\/\/docs.databricks.com\/data-governance\/table-acls\/index.html) or clusters with the `spark.databricks.pyspark.enableProcessIsolation` config set to `true`. \n### Create a cluster using Databricks Runtime ML \nWhen you [create a cluster](https:\/\/docs.databricks.com\/compute\/configure.html), select a Databricks Runtime ML version from the **Databricks runtime version** drop-down menu. Both CPU and GPU-enabled ML runtimes are available. \n![Select Databricks Runtime ML](https:\/\/docs.databricks.com\/_images\/mlruntime-dbr-dropdown.png) \nIf you [select a cluster from the drop-down menu in the notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach-a-notebook-to-a-cluster), the Databricks Runtime version appears at the right of the cluster name: \n![View Databricks Runtime ML version](https:\/\/docs.databricks.com\/_images\/cluster-attach.png) \nIf you select a GPU-enabled ML runtime, you are prompted to select a compatible **Driver type** and **Worker type**. Incompatible instance types are grayed out in the drop-down menu. GPU-enabled instance types are listed under the **GPU accelerated** label. \nNote \nTo access data in Unity Catalog for machine learning workflows, the [access mode](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode) for the cluster must be single user (assigned). Shared clusters are not compatible with Databricks Runtime for Machine Learning. \n### Photon and Databricks Runtime ML \nWhen you create a CPU cluster running Databricks Runtime 15.2 ML or above, you can choose to enable [Photon](https:\/\/docs.databricks.com\/compute\/photon.html). Photon improves performance for applications using Spark SQL, Spark DataFrames, feature engineering, GraphFrames, and xgboost4j. It is not expected to improve performance on applications using Spark RDDs, Pandas UDFs, and non-JVM languages such as Python. 
Thus, Python packages such as XGBoost, PyTorch, and TensorFlow will not see an improvement with Photon. \nSpark RDD APIs and [Spark MLlib](https:\/\/docs.databricks.com\/machine-learning\/train-model\/mllib.html) have limited compatibility with Photon. When processing large datasets using Spark RDD or Spark MLlib, you may experience Spark memory issues. See [Spark memory issues](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/spark-memory-issues.html). \n### Libraries included in Databricks Runtime ML \nDatabricks Runtime ML includes a variety of popular ML libraries. The libraries are updated with each release to include new features and fixes. \nDatabricks has designated a subset of the supported libraries as top-tier libraries. For these libraries, Databricks provides a faster update cadence, updating to the latest package releases with each runtime release (barring dependency conflicts). Databricks also provides advanced support, testing, and embedded optimizations for top-tier libraries. 
\nFor a full list of top-tier and other provided libraries, see the [release notes](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html) for Databricks Runtime ML.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/index.html"} +{"content":"# \n### AI and Machine Learning on Databricks\n#### Next steps\n\nTo get started, see: \n* [Tutorials: Get started with ML](https:\/\/docs.databricks.com\/machine-learning\/ml-tutorials.html) \nFor a recommended MLOps workflow on Databricks Machine Learning, see: \n* [MLOps workflows on Databricks](https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html) \nTo learn about key Databricks Machine Learning features, see: \n* [What is AutoML?](https:\/\/docs.databricks.com\/machine-learning\/automl\/index.html)\n* [What is a feature store?](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html)\n* [Model serving with Databricks](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html)\n* [Lakehouse Monitoring](https:\/\/docs.databricks.com\/lakehouse-monitoring\/index.html)\n* [Manage model lifecycle](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html)\n* [MLflow experiment tracking](https:\/\/docs.databricks.com\/mlflow\/tracking.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Hyperparameter tuning\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-best-practices.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Hyperparameter tuning\n##### Best practices: Hyperparameter tuning with Hyperopt\n###### Best practices\n\n* Bayesian approaches can be much more efficient than grid search and random search. 
Hence, with the Hyperopt Tree of Parzen Estimators (TPE) algorithm, you can explore more hyperparameters and larger ranges. Using domain knowledge to restrict the search domain can optimize tuning and produce better results.\n* When you use `hp.choice()`, Hyperopt returns the index of the choice list. Therefore the parameter logged in MLflow is also the index. Use `hyperopt.space_eval()` to retrieve the parameter values.\n* For models with long training times, start experimenting with small datasets and many hyperparameters. Use MLflow to identify the best performing models and determine which hyperparameters can be fixed. In this way, you can reduce the parameter space as you prepare to tune at scale.\n* Take advantage of Hyperopt support for conditional dimensions and hyperparameters. For example, when you evaluate multiple flavors of gradient descent, instead of limiting the hyperparameter space to just the common hyperparameters, you can have Hyperopt include conditional hyperparameters\u2014the ones that are only appropriate for a subset of the flavors. For more information about using conditional parameters, see [Defining a search space](http:\/\/hyperopt.github.io\/hyperopt\/getting-started\/search_spaces\/).\n* When using `SparkTrials`, configure parallelism appropriately for CPU-only versus GPU-enabled clusters. In Databricks, CPU and GPU clusters use different numbers of executor threads per worker node. CPU clusters use multiple executor threads per node. GPU clusters use only one executor thread per node to avoid conflicts among multiple Spark tasks trying to use the same GPU. While this is generally optimal for libraries written for GPUs, it means that maximum parallelism is reduced on GPU clusters, so be aware of how many GPUs each trial can use when selecting GPU instance types. See [GPU-enabled Clusters](https:\/\/docs.databricks.com\/compute\/gpu.html) for details.\n* Do not use `SparkTrials` on autoscaling clusters. 
Hyperopt selects the parallelism value when execution begins. If the cluster later autoscales, Hyperopt will not be able to take advantage of the new cluster size.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-best-practices.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Hyperparameter tuning\n##### Best practices: Hyperparameter tuning with Hyperopt\n###### Troubleshooting\n\n* A reported loss of NaN (not a number) usually means the objective function passed to `fmin()` returned NaN. This does not affect other runs and you can safely ignore it. To prevent this result, try adjusting the hyperparameter space or modifying the objective function.\n* Because Hyperopt uses stochastic search algorithms, the loss usually does not decrease monotonically with each run. However, these methods often find the best hyperparameters more quickly than other methods.\n* Both Hyperopt and Spark incur overhead that can dominate the trial duration for short trial runs (low tens of seconds). The speedup you observe may be small or even zero.\n\n##### Best practices: Hyperparameter tuning with Hyperopt\n###### Example notebook: Best practices for datasets of different sizes\n\n`SparkTrials` runs the trials on Spark worker nodes. This notebook provides guidelines on how to move datasets of different orders of magnitude to worker nodes when using `SparkTrials`. 
\n### Handle datasets of different orders of magnitude notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/hyperopt-spark-data.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-best-practices.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### Distributed training with DeepSpeed distributor\n\nThis article describes how to perform distributed training on PyTorch ML models using the [DeepSpeed distributor](https:\/\/www.deepspeed.ai\/training\/). \nThe DeepSpeed distributor is built on top of [TorchDistributor](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/spark-pytorch-distributor.html) and is a recommended solution for customers with models that require higher compute power, but are limited by memory constraints. \nThe [DeepSpeed](https:\/\/deepspeed.readthedocs.io\/en\/latest\/training.html) library is an open-source library developed by Microsoft and is available in Databricks Runtime 14.0 ML or above. It offers optimized memory usage, reduced communication overhead, and advanced pipeline parallelism that allow for scaling of models and training procedures that would otherwise be unattainable on standard hardware. \nThe following are example scenarios where the DeepSpeed distributor is beneficial: \n* Low GPU memory.\n* Large model training.\n* Large input data, like during batch inference.\n\n##### Distributed training with DeepSpeed distributor\n###### Example notebook for distributed training with DeepSpeed\n\nThe following notebook example demonstrates how to perform distributed training with DeepSpeed distributor. 
\n### Fine-tune Llama 2 7B Chat with `DeepspeedTorchDistributor` notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/fine-tune-llama2-deepspeed.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/deepspeed.html"} +{"content":"# Security and compliance guide\n## Secret management\n#### Secret scopes\n\nManaging secrets begins with creating a secret scope. A secret scope is a collection of secrets identified by a name. \nA workspace is limited to a maximum of 1000 secret scopes. Contact your Databricks support team if you need more. \nNote \nDatabricks recommends aligning secret scopes to roles or applications rather than individuals.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/secrets\/secret-scopes.html"} +{"content":"# Security and compliance guide\n## Secret management\n#### Secret scopes\n##### Overview\n\nA Databricks-backed secret scope is stored in (backed by) an encrypted database owned and managed by Databricks. The secret scope name: \n* Must be unique within a workspace.\n* Must consist of alphanumeric characters, dashes, underscores, `@`, and periods, and may not exceed 128 characters. \nThe names are considered non-sensitive and are readable by all users in the workspace. \nYou create a Databricks-backed secret scope using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) (version 0.205 and above). Alternatively, you can use the [Secrets API](https:\/\/docs.databricks.com\/api\/workspace\/secrets). \n### Scope permissions \nScopes are created with permissions controlled by [secret ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#secrets). 
By default, scopes are created with MANAGE permission for the user who created the scope (the \u201ccreator\u201d), which lets the creator read secrets in the scope, write secrets to the scope, and change ACLs for the scope. If your account has the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons), you can assign granular permissions at any time after you create the scope. For details, see [Secret ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#secrets). \nYou can also override the default and explicitly grant MANAGE permission to all users when you create the scope. In fact, you *must* do this if your account does not have the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/secrets\/secret-scopes.html"} +{"content":"# Security and compliance guide\n## Secret management\n#### Secret scopes\n##### Create a Databricks-backed secret scope\n\nSecret scope names are case insensitive. \nTo create a scope using the Databricks CLI: \n```\ndatabricks secrets create-scope <scope-name>\n\n``` \nBy default, scopes are created with MANAGE permission for the user who created the scope. If your account does not have the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons), you *must* override that default and explicitly grant the MANAGE permission to \u201cusers\u201d (all users) when you create the scope: \n```\ndatabricks secrets create-scope <scope-name> --initial-manage-principal users\n\n``` \nYou can also create a Databricks-backed secret scope using the [Secrets API](https:\/\/docs.databricks.com\/api\/workspace\/secrets). \nIf your account has the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons), you can change permissions at any time after you create the scope. 
For details, see [Secret ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#secrets). \nOnce you have created a Databricks-backed secret scope, you can [add secrets](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html).\n\n#### Secret scopes\n##### List secret scopes\n\nTo list the existing scopes in a workspace using the CLI: \n```\ndatabricks secrets list-scopes\n\n``` \nYou can also list existing scopes using the [Secrets API](https:\/\/docs.databricks.com\/api\/workspace\/secrets).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/secrets\/secret-scopes.html"} +{"content":"# Security and compliance guide\n## Secret management\n#### Secret scopes\n##### Delete a secret scope\n\nDeleting a secret scope deletes all secrets and ACLs applied to the scope. To delete a scope using the CLI, run the following: \n```\ndatabricks secrets delete-scope <scope-name>\n\n``` \nYou can also delete a secret scope using the [Secrets API](https:\/\/docs.databricks.com\/api\/workspace\/secrets).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/secrets\/secret-scopes.html"} +{"content":"# Security and compliance guide\n## Networking\n#### Users to Databricks networking\n\nThis guide introduces features to customize network access between users and their Databricks workspaces.\n\n#### Users to Databricks networking\n##### Why customize networking from users to Databricks?\n\nBy default, users and applications can connect to Databricks from any IP address. Users might access critical data sources using Databricks. If a user\u2019s credentials are compromised through phishing or a similar attack, securing network access dramatically reduces the risk of an account takeover. Configurations like private connectivity, IP access lists, and firewalls help keep your critical data secure. 
\nYou can also configure authentication and access control features to protect your users\u2019 credentials. See [Authentication and access control](https:\/\/docs.databricks.com\/security\/auth-authz\/index.html). \nNote \nUsers to Databricks secure networking features require the [Enterprise plan](https:\/\/www.databricks.com\/product\/aws-pricing).\n\n#### Users to Databricks networking\n##### Private connectivity\n\nBetween Databricks users and the control plane, PrivateLink provides strong controls that limit the source for inbound requests. If your organization routes traffic through an AWS environment, you can use PrivateLink to ensure the communication between users and the Databricks control plane does not traverse public IP addresses. See [Configure private connectivity to Databricks](https:\/\/docs.databricks.com\/security\/network\/front-end\/front-end-private-connect.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/front-end\/index.html"} +{"content":"# Security and compliance guide\n## Networking\n#### Users to Databricks networking\n##### IP access lists\n\nAuthentication proves user identity, but it does not enforce the network location of the users. Accessing a cloud service from an unsecured network poses security risks, especially when the user may have authorized access to sensitive or personal data. Using IP access lists, you can configure Databricks workspaces so that users connect to the service only through existing networks with a secure perimeter. \nAdmins can specify the IP addresses that are allowed access to Databricks. You can also specify IP addresses or subnets to block. For details, see [Manage IP access lists](https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list.html). 
\nYou can also use PrivateLink to block all public internet access to a Databricks workspace.\n\n#### Users to Databricks networking\n##### Firewall rules\n\nMany organizations use firewalls to block traffic based on domain names. You must add Databricks domain names to the allow list to ensure access to Databricks resources. For more information, see [Configure domain name firewall rules](https:\/\/docs.databricks.com\/resources\/firewall-rules.html). \nDatabricks also performs host header validation for both public and private connections to ensure that requests originate from the intended host. This protects against potential HTTP host header attacks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/front-end\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with Databricks AutoML Python API\n\nThis article demonstrates how to train a model with Databricks AutoML using the API. Learn more about [What is AutoML?](https:\/\/docs.databricks.com\/machine-learning\/automl\/index.html).\nThe Python API provides functions to start classification, regression, and forecasting AutoML runs. Each function call trains a set of models and generates a trial notebook for each model. \nThe following steps describe generally how to set up an AutoML experiment using the API: \n1. [Create a notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) and attach it to a cluster running Databricks Runtime ML.\n2. Identify which table you want to use from your existing data source or [upload a data file to DBFS](https:\/\/docs.databricks.com\/archive\/legacy\/data-tab.html#import-data) and create a table.\n3. 
To start an AutoML run, pass the table name to the appropriate API specification: [classification](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html#classification), [regression](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html#regression), or [forecasting](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html#forecasting).\n4. When the AutoML run begins, an MLflow experiment URL appears in the console. Use this URL to monitor the progress of the run. Refresh the MLflow experiment to see the trials as they are completed.\n5. After the AutoML run completes: \n* Use the links in the output summary to navigate to the MLflow experiment or to the notebook that generated the best results.\n* Use the link to the data exploration notebook to get some insights into the data passed to AutoML. You can also attach this notebook to the same cluster and re-run the notebook to reproduce the results or do additional data analysis.\n* Use the summary object returned from the AutoML call to explore more details about the trials or to load a model trained by a given trial. Learn more about the [AutoMLSummary object](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html#return).\n* Clone any generated notebook from the trials and re-run the notebook by attaching it to the same cluster to reproduce the results. 
You can also make necessary edits and re-run them to train additional models and log them to the same experiment.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with Databricks AutoML Python API\n##### Requirements\n\nSee [Requirements](https:\/\/docs.databricks.com\/machine-learning\/automl\/index.html#requirement) for AutoML experiments.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with Databricks AutoML Python API\n##### Classification specification\n\nThe following code example configures an AutoML run for training a classification model. For additional parameters to further customize your AutoML run see [Classification and regression parameters](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html#classification-regression). \nNote \nThe `max_trials` parameter is deprecated in Databricks Runtime 10.4 ML and is not supported in Databricks Runtime 11.0 ML and above. Use `timeout_minutes` to control the duration of an AutoML run. 
\n```\ndatabricks.automl.classify(\ndataset: Union[pyspark.sql.DataFrame, pandas.DataFrame, pyspark.pandas.DataFrame, str],\n*,\ntarget_col: str,\ndata_dir: Optional[str] = None,\nexclude_cols: Optional[List[str]] = None, # <DBR> 10.3 ML and above\nexclude_frameworks: Optional[List[str]] = None, # <DBR> 10.3 ML and above\nexperiment_dir: Optional[str] = None, # <DBR> 10.4 LTS ML and above\nexperiment_name: Optional[str] = None, # <DBR> 12.1 ML and above\nfeature_store_lookups: Optional[List[Dict]] = None, # <DBR> 11.3 LTS ML and above\nimputers: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None, # <DBR> 10.4 LTS ML and above\nmax_trials: Optional[int] = None, # <DBR> 10.5 ML and below\npos_label: Optional[Union[int, bool, str]] = None, # <DBR> 11.1 ML and above\nprimary_metric: str = \"f1\",\ntime_col: Optional[str] = None,\ntimeout_minutes: Optional[int] = None,\n) -> AutoMLSummary\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with Databricks AutoML Python API\n##### Regression specification\n\nThe following code example configures an AutoML run for training a regression model. For additional parameters to further customize your AutoML run see [Classification and regression parameters](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html#classification-regression). \nNote \nThe `max_trials` parameter is deprecated in Databricks Runtime 10.4 ML and is not supported in Databricks Runtime 11.0 ML and above. Use `timeout_minutes` to control the duration of an AutoML run. 
\n```\ndatabricks.automl.regress(\ndataset: Union[pyspark.sql.DataFrame, pandas.DataFrame, pyspark.pandas.DataFrame, str],\n*,\ntarget_col: str,\ndata_dir: Optional[str] = None,\nexclude_cols: Optional[List[str]] = None, # <DBR> 10.3 ML and above\nexclude_frameworks: Optional[List[str]] = None, # <DBR> 10.3 ML and above\nexperiment_dir: Optional[str] = None, # <DBR> 10.4 LTS ML and above\nexperiment_name: Optional[str] = None, # <DBR> 12.1 ML and above\nfeature_store_lookups: Optional[List[Dict]] = None, # <DBR> 11.3 LTS ML and above\nimputers: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None, # <DBR> 10.4 LTS ML and above\nmax_trials: Optional[int] = None, # <DBR> 10.5 ML and below\nprimary_metric: str = \"r2\",\ntime_col: Optional[str] = None,\ntimeout_minutes: Optional[int] = None,\n) -> AutoMLSummary\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with Databricks AutoML Python API\n##### Forecasting specification\n\nThe following code example configures an AutoML run for training a forecasting model. For additional detail about parameters for your AutoML run see [Forecasting parameters](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html#forecast-parameters).\nTo use Auto-ARIMA, the time series must have a regular frequency (that is, the interval between any two points must be the same throughout the time series). The frequency must match the frequency unit specified in the API call. AutoML handles missing time steps by filling in those values with the previous value. 
\n```\ndatabricks.automl.forecast(\ndataset: Union[pyspark.sql.DataFrame, pandas.DataFrame, pyspark.pandas.DataFrame, str],\n*,\ntarget_col: str,\ntime_col: str,\ncountry_code: str = \"US\", # <DBR> 12.0 ML and above\ndata_dir: Optional[str] = None,\nexclude_frameworks: Optional[List[str]] = None,\nexperiment_dir: Optional[str] = None,\nexperiment_name: Optional[str] = None, # <DBR> 12.1 ML and above\nfeature_store_lookups: Optional[List[Dict]] = None, # <DBR> 12.2 LTS ML and above\nfrequency: str = \"D\",\nhorizon: int = 1,\nidentity_col: Optional[Union[str, List[str]]] = None,\noutput_database: Optional[str] = None, # <DBR> 10.5 ML and above\nprimary_metric: str = \"smape\",\ntimeout_minutes: Optional[int] = None,\n) -> AutoMLSummary\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with Databricks AutoML Python API\n##### Classification and regression parameters\n\nNote \nFor classification and regression problems only, you can: \n* Specify which columns to include in training.\n* Select custom imputation methods. \n| Field Name | Type | Description |\n| --- | --- | --- |\n| dataset | str pandas.DataFrame pyspark.DataFrame pyspark.sql.DataFrame | Input table name or DataFrame that contains training features and target. Table name can be in format \u201c..\u201d or \u201c.\u201d for non Unity Catalog tables |\n| target\\_col | str | Column name for the target label. |\n| data\\_dir | str of format `dbfs:\/<folder-name>` | (Optional) [DBFS](https:\/\/docs.databricks.com\/dbfs\/index.html) path used to store the training dataset. This path is visible to both driver and worker nodes. Databricks recommends leaving this field empty, so AutoML can save the training dataset as an MLflow artifact. If a custom path is specified, the dataset does not inherit the AutoML experiment\u2019s access permissions. 
|\n| exclude\\_cols | List[str] | (Optional) List of columns to ignore during AutoML calculations. Default: [] |\n| exclude\\_ frameworks | List[str] | (Optional) List of algorithm frameworks that AutoML should not consider as it develops models. Possible values: empty list, or one or more of \u201csklearn\u201d, \u201clightgbm\u201d, \u201cxgboost\u201d. Default: [] (all frameworks are considered) |\n| experiment\\_dir | str | (Optional) Path to the directory in the workspace to save the generated notebooks and experiments. Default: `\/Users\/<username>\/databricks_automl\/` |\n| experiment\\_name | str | (Optional) Name for the MLflow experiment that AutoML creates. Default: Name is automatically generated. |\n| feature\\_store\\_ lookups | List[Dict] | (Optional) List of dictionaries that represent features from Feature Store for data augmentation. Valid keys in each dictionary are:* table\\_name (str): Required. Name of the feature table. * lookup\\_key (list or str): Required. Column name(s) to use as key when joining the feature table with the data passed in the `dataset` param. The order of the column names must match the order of the primary keys of the feature table. * timestamp\\_lookup\\_key (str): Required if the specified table is a [time series feature table](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html). The column name to use when performing point-in-time lookup on the feature table with the data passed in the `dataset` param. Default: [] |\n| imputers | Dict[str, Union[str, Dict[str, Any]]] | (Optional) Dictionary where each key is a column name, and each value is a string or dictionary describing the imputation strategy. If specified as a string, the value must be one of \u201cmean\u201d, \u201cmedian\u201d, or \u201cmost\\_frequent\u201d. To impute with a known value, specify the value as a dictionary `{\"strategy\": \"constant\", \"fill_value\": <desired value>}`. 
You can also specify string options as dictionaries, for example {\u201cstrategy\u201d: \u201cmean\u201d}. If no imputation strategy is provided for a column, AutoML selects a default strategy based on column type and content. If you specify a non-default imputation method, AutoML does not perform [semantic type detection](https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html#semantic-detection). Default: {} |\n| max\\_trials | int | (Optional) Maximum number of trials to run. This parameter is available in Databricks Runtime 10.5 ML and below, but is deprecated starting in Databricks Runtime 10.3 ML. In Databricks Runtime 11.0 ML and above, this parameter is not supported. Default: 20 If timeout\\_minutes=None, AutoML runs the maximum number of trials. |\n| pos\\_label | Union[int, bool, str] | (Classification only) The positive class. This is useful for calculating metrics such as precision and recall. Should only be specified for binary classification problems. |\n| primary\\_metric | str | Metric used to evaluate and rank model performance. Supported metrics for regression: \u201cr2\u201d (default), \u201cmae\u201d, \u201crmse\u201d, \u201cmse\u201d Supported metrics for classification: \u201cf1\u201d (default), \u201clog\\_loss\u201d, \u201cprecision\u201d, \u201caccuracy\u201d, \u201croc\\_auc\u201d |\n| time\\_col | str | Available in Databricks Runtime 10.1 ML and above. (Optional) Column name for a time column. If provided, AutoML tries to split the dataset into training, validation, and test sets chronologically, using the earliest points as training data and the latest points as a test set. Accepted column types are timestamp and integer. With Databricks Runtime 10.2 ML and above, string columns are also supported. If column type is string, AutoML tries to convert it to timestamp using semantic detection. If the conversion fails, the AutoML run fails. 
|\n| timeout\\_minutes | int | (Optional) Maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy. Default: 120 minutes Minimum value: 5 minutes An error is reported if the timeout is too short to allow at least one trial to complete. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with Databricks AutoML Python API\n##### Forecasting parameters\n\n| Field Name | Type | Description |\n| --- | --- | --- |\n| dataset | str pandas.DataFrame pyspark.DataFrame pyspark.sql.DataFrame | Input table name or DataFrame that contains training features and target. Table name can be in format \u201c..\u201d or \u201c.\u201d for non Unity Catalog tables |\n| target\\_col | str | Column name for the target label. |\n| time\\_col | str | Name of the time column for forecasting. |\n| frequency | str | Frequency of the time series for forecasting. This is the period with which events are expected to occur. The default setting is \u201cD\u201d or daily data. Be sure to change the setting if your data has a different frequency. Possible values: \u201cW\u201d (weeks) \u201cD\u201d \/ \u201cdays\u201d \/ \u201cday\u201d \u201chours\u201d \/ \u201chour\u201d \/ \u201chr\u201d \/ \u201ch\u201d \u201cm\u201d \/ \u201cminute\u201d \/ \u201cmin\u201d \/ \u201cminutes\u201d \/ \u201cT\u201d \u201cS\u201d \/ \u201cseconds\u201d \/ \u201csec\u201d \/ \u201csecond\u201d The following are only available with Databricks Runtime 12.0 ML and above: \u201cM\u201d \/ \u201cmonth\u201d \/ \u201cmonths\u201d \u201cQ\u201d \/ \u201cquarter\u201d \/ \u201cquarters\u201d \u201cY\u201d \/ \u201cyear\u201d \/ \u201cyears\u201d Default: \u201cD\u201d |\n| horizon | int | Number of periods into the future for which forecasts should be returned. 
The units are the time series frequency. Default: 1 |\n| data\\_dir | str of format `dbfs:\/<folder-name>` | (Optional) [DBFS](https:\/\/docs.databricks.com\/dbfs\/index.html) path used to store the training dataset. This path is visible to both driver and worker nodes. Databricks recommends leaving this field empty, so AutoML can save the training dataset as an MLflow artifact. If a custom path is specified, the dataset does not inherit the AutoML experiment\u2019s access permissions. |\n| exclude\\_ frameworks | List[str] | (Optional) List of algorithm frameworks that AutoML should not consider as it develops models. Possible values: empty list, or one or more of \u201cprophet\u201d, \u201carima\u201d. Default: [] (all frameworks are considered) |\n| experiment\\_dir | str | (Optional) Path to the directory in the workspace to save the generated notebooks and experiments. Default: `\/Users\/<username>\/databricks_automl\/` |\n| experiment\\_name | str | (Optional) Name for the MLflow experiment that AutoML creates. Default: Name is automatically generated. |\n| feature\\_store\\_ lookups | List[Dict] | (Optional) List of dictionaries that represent features from Feature Store for data augmentation. Valid keys in each dictionary are:* table\\_name (str): Required. Name of the feature table. * lookup\\_key (list or str): Required. Column name(s) to use as key when joining the feature table with the data passed in the `dataset` param. The order of the column names must match the order of the primary keys of the feature table. * timestamp\\_lookup\\_key (str): Required if the specified table is a [time series feature table](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html). The column name to use when performing point-in-time lookup on the feature table with the data passed in the `dataset` param. Default: [] |\n| identity\\_col | Union[str, list] | (Optional) Column(s) that identify the time series for multi-series forecasting. 
AutoML groups by these column(s) and the time column for forecasting. |\n| output\\_database | str | (Optional) If provided, AutoML saves predictions of the best model to a new table in the specified database. Default: Predictions are not saved. |\n| primary\\_metric | str | Metric used to evaluate and rank model performance. Supported metrics: \u201csmape\u201d(default) \u201cmse\u201d, \u201crmse\u201d, \u201cmae\u201d, or \u201cmdape\u201d. |\n| timeout\\_minutes | int | (Optional) Maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy. Default: 120 minutes Minimum value: 5 minutes An error is reported if the timeout is too short to allow at least one trial to complete. |\n| country\\_code | str | Available in Databricks Runtime 12.0 ML and above. Only supported by the Prophet forecasting model. (Optional) Two-letter country code that indicates which country\u2019s holidays the forecasting model should use. To ignore holidays, set this parameter to an empty string (\u201c\u201d). [Supported countries](https:\/\/pypi.org\/project\/holidays\/). Default: US (United States holidays). |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with Databricks AutoML Python API\n##### Returns\n\n### `AutoMLSummary` \nSummary object for an AutoML run that describes the metrics, parameters, and other details for each of the trials. You also use this object to load the model trained by a specific trial. \n| Property | Type | Description |\n| --- | --- | --- |\n| experiment | mlflow.entities.Experiment | The MLflow experiment used to log the trials. |\n| trials | List[TrialInfo] | A list containing information about all the trials that were run. 
|\n| best\\_trial | TrialInfo | Info about the trial that resulted in the best weighted score for the primary metric. |\n| metric\\_distribution | str | The distribution of weighted scores for the primary metric across all trials. |\n| output\\_table\\_name | str | Used with forecasting only and only if output\\_database is provided. Name of the table in output\\_database containing the model\u2019s predictions. | \n### `TrialInfo` \nSummary object for each individual trial. \n| Property | Type | Description |\n| --- | --- | --- |\n| notebook\\_path | Optional[str] | The path to the generated notebook for this trial in the workspace. For classification and regression, this value is set only for the best trial, while all other trials have the value set to `None`. For forecasting, this value is present for all trials. |\n| notebook\\_url | Optional[str] | The URL of the generated notebook for this trial. For classification and regression, this value is set only for the best trial, while all other trials have the value set to `None`. For forecasting, this value is present for all trials. |\n| artifact\\_uri | Optional[str] | The MLflow artifact URI for the generated notebook. |\n| mlflow\\_run\\_id | str | The MLflow run ID associated with this trial run. |\n| metrics | Dict[str, float] | The metrics logged in MLflow for this trial. |\n| params | Dict[str, str] | The parameters logged in MLflow that were used for this trial. |\n| model\\_path | str | The MLflow artifact URL of the model trained in this trial. |\n| model\\_description | str | Short description of the model and the hyperparameters used for training this model. |\n| duration | str | Training duration in minutes. |\n| preprocessors | str | Description of the preprocessors run before training the model. |\n| evaluation\\_metric\\_score | float | Score of primary metric, evaluated for the validation dataset. 
| \n| Method | Description |\n| --- | --- |\n| load\\_model() | Load the model generated in this trial, logged as an MLflow artifact. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with Databricks AutoML Python API\n##### Import a notebook\n\nTo import a notebook that has been saved as an MLflow artifact, use the `databricks.automl.import_notebook` Python API. \n```\ndef import_notebook(artifact_uri: str, path: str, overwrite: bool = False) -> ImportNotebookResult:\n\"\"\"\nImport a trial notebook, saved as an MLflow artifact, into the workspace.\n\n:param artifact_uri: The URI of the MLflow artifact that contains the trial notebook.\n:param path: The path in the Databricks workspace where the notebook should be imported. This must be an absolute path. The directory will be created if it does not exist.\n:param overwrite: Whether to overwrite the notebook if it already exists. It is `False` by default.\n\n:return: ImportNotebookResult contains two fields, `path` and `url`, referring to the imported notebook\n\"\"\"\n\n``` \nA usage example: \n```\nsummary = databricks.automl.classify(...)\nresult = databricks.automl.import_notebook(summary.trials[5].artifact_uri, \"\/Users\/you@yourcompany.com\/path\/to\/directory\")\nprint(result.path)\nprint(result.url)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with Databricks AutoML Python API\n##### Register and deploy a model\n\nYou can register and deploy your AutoML trained model just like any registered model in the MLflow model registry, see [Log, load, register, and deploy MLflow models](https:\/\/docs.databricks.com\/mlflow\/models.html). 
\n### No module named \u2018pandas.core.indexes.numeric\u2019 \nWhen serving a model built using AutoML with Model Serving, you may get the error: `No module named 'pandas.core.indexes.numeric'`. \nThis is due to an incompatible `pandas` version between AutoML and the model serving endpoint environment. You can resolve this error by running the [add-pandas-dependency.py script](https:\/\/docs.databricks.com\/_extras\/documents\/add-pandas-dependency.py). The script edits the `requirements.txt` and `conda.yaml` for your logged model to include the appropriate `pandas` dependency version: `pandas==1.5.3`. \n1. Modify the script to include the `run_id` of the MLflow run where your model was logged.\n2. Re-register the model in the MLflow model registry.\n3. Try serving the new version of the MLflow model.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is AutoML?\n#### Train ML models with Databricks AutoML Python API\n##### Notebook examples\n\nReview these notebooks to get started with AutoML. \nThe following notebook shows how to do classification with AutoML. \n### AutoML classification example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/automl-classification-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nThe following notebook shows how to do regression with AutoML. \n### AutoML regression example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/automl-regression-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nThe following notebook shows how to do forecasting with AutoML. 
\n### AutoML forecasting example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/automl-forecasting-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nThe following notebook shows how to train an ML model with AutoML and Feature Store feature tables. \n### AutoML experiment with Feature Store example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/automl-feature-store-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n### Manage privileges in Unity Catalog\n##### Admin privileges in Unity Catalog\n\nThis article describes privileges that Databricks account admins, workspace admins, and metastore admins have for managing Unity Catalog. \nNote \nIf your workspace was enabled for Unity Catalog automatically, workspace admins have default privileges on the attached metastore and the workspace catalog, if a workspace catalog was provisioned. See [Workspace admin privileges when workspaces are enabled for Unity Catalog automatically](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html#workspace-admins-auto).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n### Manage privileges in Unity Catalog\n##### Admin privileges in Unity Catalog\n###### Metastore admins\n\nThe metastore admin is a highly privileged user or group in Unity Catalog. 
Metastore admins have the following privileges on the metastore by default: \n* `CREATE CATALOG`: Allows a user to [create catalogs](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html) in the metastore.\n* `CREATE CONNECTION`: Allows a user to [create a connection](https:\/\/docs.databricks.com\/query-federation\/index.html#connection) to an external database in a Lakehouse Federation scenario.\n* `CREATE EXTERNAL LOCATION`: Allows a user to [create external locations](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html).\n* `CREATE STORAGE CREDENTIAL`: Allows a user to [create storage credentials](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html).\n* `CREATE FOREIGN CATALOG`: Allows a user to [create foreign catalogs](https:\/\/docs.databricks.com\/query-federation\/index.html#foreign-catalog) using a connection to an external database in a Lakehouse Federation scenario.\n* `CREATE SHARE`: Allows a data provider user to [create a share in Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/create-share.html).\n* `CREATE RECIPIENT`: Allows a data provider user to [create a recipient in Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html).\n* `CREATE PROVIDER`: Allows a data recipient user to [create a provider in Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/manage-provider.html).\n* `MANAGE ALLOWLIST`: Allows a user to [update allowlists](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/allowlist.html) that manage cluster access to init scripts and libraries. \n* `CREATE MATERIALIZED VIEW`: Allows a user to [create materialized views](https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html). 
\nMetastore admins are also the owners of the metastore, which grants them the following privileges: \n* Manage the privileges or transfer ownership of any object within the metastore, including storage credentials, external locations, connections, shares, recipients, and providers.\n* Grant themselves read and write access to any data in the metastore. \nMetastore admins have this ability indirectly, through their ability to transfer ownership of all objects. There is no direct access by default. Granting of permissions is audit-logged.\n* Read and update the metadata of all objects in the metastore.\n* Delete the metastore. \nMetastore admins are the only users who can grant privileges on the metastore itself. \n### Who has metastore admin privileges? \nIf an account admin creates the metastore manually, that account admin is the metastore\u2019s initial owner and metastore admin. All metastores created before November 8, 2023 were created manually by an account admin. \nIf the metastore was provisioned as part of automatic Unity Catalog enablement, the metastore was created without a metastore admin. Workspace admins in that case are automatically granted privileges that make the metastore admin optional. If needed, account admins can assign the metastore admin role to a user, service principal, or group. Groups are strongly recommended. See [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement). \n### Assign a metastore admin \nMetastore admin is a highly privileged role that you should distribute carefully. It is optional. \nAccount admins can assign the metastore admin role. Databricks recommends nominating a group as the metastore admin. By doing this, any member of the group is automatically a metastore admin. \nTo assign the metastore admin role to a group: \n1. As an account admin, log in to the [account console](https:\/\/accounts.cloud.databricks.com).\n2. 
Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. Click the name of a metastore to open its properties.\n4. Under **Metastore Admin**, click **Edit**.\n5. Select a group from the drop-down. You can enter text in the field to search for options.\n6. Click **Save**. \nImportant \nIt can take up to 30 seconds for a metastore admin assignment change to be reflected in your account, and it may take longer to take effect in some workspaces than others. This delay is due to caching protocols.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n### Manage privileges in Unity Catalog\n##### Admin privileges in Unity Catalog\n###### Account admins\n\nAccount admin is a highly privileged role that you should distribute carefully. Account admins have the following privileges: \n* Can create metastores, and by default become the initial metastore admin.\n* Can link metastores to workspaces.\n* Can assign the metastore admin role.\n* Can grant privileges on metastores.\n* Can enable Delta Sharing for a metastore.\n* Can configure storage credentials.\n* Can enable system tables and delegate access to them.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n### Manage privileges in Unity Catalog\n##### Admin privileges in Unity Catalog\n###### Workspace admins\n\nWorkspace admin is a highly privileged role that you should distribute carefully. Workspace admins have the following privileges: \n* Can add users, service principals, and groups to a workspace.\n* Can delegate other workspace admins.\n* Can manage job ownership. 
See [Control access to a job](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#jobs_acl_user_guide).\n* Can manage the job **Run as** setting. See [Run a job as a service principal](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#run-as-sp).\n* Can view and manage notebooks, dashboards, queries, and other workspace objects. See [Access control lists](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html). \nAccount admins can restrict workspace admin privileges using the `RestrictWorkspaceAdmins` setting. See [Restrict workspace admins](https:\/\/docs.databricks.com\/admin\/workspace-settings\/restrict-workspace-admins.html). \n### Workspace admin privileges when workspaces are enabled for Unity Catalog automatically \nIf your workspace was enabled for Unity Catalog automatically, the workspace is attached to a metastore by default. For more information, see [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement). \nIf your workspace was enabled for Unity Catalog automatically, workspace admins have the following privileges on the attached metastore by default: \n* `CREATE CATALOG`\n* `CREATE EXTERNAL LOCATION`\n* `CREATE STORAGE CREDENTIAL`\n* `CREATE CONNECTION`\n* `CREATE SHARE`\n* `CREATE RECIPIENT`\n* `CREATE PROVIDER` \n* `CREATE MATERIALIZED VIEW` \nWorkspace admins are the default owners of the workspace catalog, if a workspace catalog was provisioned for your workspace. Ownership of this catalog grants the following privileges: \n* Manage the privileges for or transfer ownership of any object within the workspace catalog. \nThis includes the ability to grant themselves read and write access to all data in the catalog (no direct access by default; granting permissions is audit-logged).\n* Transfer ownership of the workspace catalog itself. 
\nAll workspace users receive the `USE CATALOG` privilege on the workspace catalog. Workspace users also receive the `USE SCHEMA`, `CREATE TABLE`, `CREATE VOLUME`, `CREATE MODEL`, `CREATE FUNCTION`, and `CREATE MATERIALIZED VIEW` privileges on the `default` schema in the catalog. \nNote \nThe default privileges granted on the attached metastore and workspace catalog are not maintained across workspaces (if, for example, the workspace catalog is also bound to another workspace).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Import Python modules from Git folders or workspace files\n\nYou can store Python code in [Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html) or in [workspace files](https:\/\/docs.databricks.com\/files\/workspace.html) and then import that Python code into your Delta Live Tables pipelines. For more information about working with modules in Git folders or workspace files, see [Work with Python and R modules](https:\/\/docs.databricks.com\/files\/workspace-modules.html). \nNote \nYou cannot import source code from a notebook stored in a Databricks Git folder or a workspace file. Instead, add the notebook directly when you create or edit a pipeline. See [Create a pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#create-pipeline).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/import-workspace-files.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Import Python modules from Git folders or workspace files\n###### Import a Python module to a Delta Live Tables pipeline\n\nThe following example demonstrates importing data set queries as Python modules from workspace files. 
Although this example describes using workspace files to store the pipeline source code, you can also use it with source code stored in a Git folder. \nTo run this example, use the following steps: \n1. Click ![Workspaces Icon](https:\/\/docs.databricks.com\/_images\/workspaces-icon-account.png) **Workspace** in the sidebar of your Databricks workspace to open the workspace browser.\n2. Use the workspace browser to select a directory for the Python modules.\n3. Click ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) in the rightmost column of the selected directory and click **Create > File**.\n4. Enter a name for the file, for example, `clickstream_raw_module.py`. The file editor opens. To create a module to read source data into a table, enter the following in the editor window: \n```\nfrom dlt import *\n\njson_path = \"\/databricks-datasets\/wikipedia-datasets\/data-001\/clickstream\/raw-uncompressed-json\/2015_2_clickstream.json\"\n\ndef create_clickstream_raw_table(spark):\n  @table\n  def clickstream_raw():\n    return (\n      spark.read.json(json_path)\n    )\n\n```\n5. 
To create a module that creates a new table containing prepared data, create a new file in the same directory, enter a name for the file, for example, `clickstream_prepared_module.py`, and enter the following in the new editor window: \n```\nfrom clickstream_raw_module import *\nfrom dlt import read\nfrom pyspark.sql.functions import *\nfrom pyspark.sql.types import *\n\ndef create_clickstream_prepared_table(spark):\n  create_clickstream_raw_table(spark)\n  @table\n  @expect(\"valid_current_page_title\", \"current_page_title IS NOT NULL\")\n  @expect_or_fail(\"valid_count\", \"click_count > 0\")\n  def clickstream_prepared():\n    return (\n      read(\"clickstream_raw\")\n        .withColumn(\"click_count\", expr(\"CAST(n AS INT)\"))\n        .withColumnRenamed(\"curr_title\", \"current_page_title\")\n        .withColumnRenamed(\"prev_title\", \"previous_page_title\")\n        .select(\"current_page_title\", \"click_count\", \"previous_page_title\")\n    )\n\n```\n6. Next, create a pipeline notebook. Go to your Databricks landing page and select **Create a notebook**, or click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar and select **Notebook**. You can also create the notebook in the workspace browser by clicking ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) and then clicking **Create > Notebook**.\n7. Name your notebook and confirm **Python** is the default language.\n8. Click **Create**.\n9. Enter the example code in the notebook. \nNote \nIf your notebook imports modules or packages from a workspace files path or a Git folders path different from the notebook directory, you must manually append the path to the files using `sys.path.append()`. \nIf you are importing a file from a Git folder, you must prepend `\/Workspace\/` to the path. For example, `sys.path.append('\/Workspace\/...')`. Omitting `\/Workspace\/` from the path results in an error. 
\nIf the modules or packages are stored in the same directory as the notebook, you do not need to append the path manually. You also do not need to manually append the path when importing from the root directory of a Git folder because the root directory is automatically appended to the path. \n```\nimport sys, os\n# You can omit the sys.path.append() statement when the imports are from the same directory as the notebook.\nsys.path.append(os.path.abspath('<module-path>'))\n\nimport dlt\nfrom clickstream_prepared_module import *\nfrom pyspark.sql.functions import *\nfrom pyspark.sql.types import *\n\ncreate_clickstream_prepared_table(spark)\n\n@dlt.table(\n  comment=\"A table containing the top pages linking to the Apache Spark page.\"\n)\ndef top_spark_referrers():\n  return (\n    dlt.read(\"clickstream_prepared\")\n      .filter(expr(\"current_page_title == 'Apache_Spark'\"))\n      .withColumnRenamed(\"previous_page_title\", \"referrer\")\n      .sort(desc(\"click_count\"))\n      .select(\"referrer\", \"click_count\")\n      .limit(10)\n  )\n\n``` \nReplace `<module-path>` with the path to the directory containing the Python modules to import.\n10. Create a pipeline using the new notebook.\n11. To run the pipeline, in the **Pipeline details** page, click **Start**. \nYou can also import Python code as a package. The following code snippet from a Delta Live Tables notebook imports the `test_utils` package from the `dlt_packages` directory inside the same directory as the notebook. 
The `dlt_packages` directory contains the files `test_utils.py` and `__init__.py`, and `test_utils.py` defines the function `create_test_table()`: \n```\nimport dlt\n\n@dlt.table\ndef my_table():\n  return dlt.read(...)\n\n# ...\n\nimport dlt_packages.test_utils as test_utils\ntest_utils.create_test_table(spark)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/import-workspace-files.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### Download data from the internet\n\nYou can use Databricks notebooks to download data from public URLs. Databricks does not provide any native tools for downloading data from the internet, but you can use open source tools in supported languages. If you are accessing data from cloud object storage, accessing data directly with Apache Spark provides better results. See [Connect to data sources](https:\/\/docs.databricks.com\/connect\/index.html). \nDatabricks clusters provide general compute, allowing you to run arbitrary code in addition to Apache Spark commands. Arbitrary commands store results on ephemeral storage attached to the driver by default. You must move downloaded data to a new location before reading it with Apache Spark, as Apache Spark cannot read from ephemeral storage. See [Work with files on Databricks](https:\/\/docs.databricks.com\/files\/index.html). \nDatabricks recommends using Unity Catalog volumes for storing all non-tabular data. You can optionally specify a volume as your destination during download, or move data to a volume after download. Volumes do not support random writes, so download files and unzip them to ephemeral storage before moving them to volumes. See [Expand and read Zip compressed files](https:\/\/docs.databricks.com\/files\/unzip-files.html). \nNote \nSome workspace configurations might prevent access to the public internet. 
Consult your workspace administrator if you need expanded network access.\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/download-internet-files.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### Download data from the internet\n##### Download a file to a volume\n\nDatabricks recommends storing all non-tabular data in Unity Catalog volumes. \nThe following examples use packages for Bash, Python, and Scala to download a file to a Unity Catalog volume: \n```\n%sh curl https:\/\/data.cityofnewyork.us\/api\/views\/kk4q-3rt2\/rows.csv --output \/Volumes\/my_catalog\/my_schema\/my_volume\/curl-subway.csv\n\n``` \n```\nimport urllib.request\nurllib.request.urlretrieve(\"https:\/\/data.cityofnewyork.us\/api\/views\/kk4q-3rt2\/rows.csv\", \"\/Volumes\/my_catalog\/my_schema\/my_volume\/python-subway.csv\")\n\n``` \n```\nimport java.net.URL\nimport java.io.File\nimport org.apache.commons.io.FileUtils\n\nFileUtils.copyURLToFile(new URL(\"https:\/\/data.cityofnewyork.us\/api\/views\/kk4q-3rt2\/rows.csv\"), new File(\"\/Volumes\/my_catalog\/my_schema\/my_volume\/scala-subway.csv\"))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/download-internet-files.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### Download data from the internet\n##### Download a file to ephemeral storage\n\nThe following examples use packages for Bash, Python, and Scala to download a file to ephemeral storage attached to the driver: \n```\n%sh curl https:\/\/data.cityofnewyork.us\/api\/views\/kk4q-3rt2\/rows.csv --output \/tmp\/curl-subway.csv\n\n``` \n```\nimport urllib.request\nurllib.request.urlretrieve(\"https:\/\/data.cityofnewyork.us\/api\/views\/kk4q-3rt2\/rows.csv\", \"\/tmp\/python-subway.csv\")\n\n``` \n```\nimport java.net.URL\nimport java.io.File\nimport org.apache.commons.io.FileUtils\n\nFileUtils.copyURLToFile(new URL(\"https:\/\/data.cityofnewyork.us\/api\/views\/kk4q-3rt2\/rows.csv\"), new 
File(\"\/tmp\/scala-subway.csv\"))\n\n``` \nBecause these files are downloaded to ephemeral storage attached to the driver, use `%sh` to see these files, as in the following example: \n```\n%sh ls \/tmp\/\n\n``` \nYou can use Bash commands to preview the contents of files downloaded this way, as in the following example: \n```\n%sh head \/tmp\/curl-subway.csv\n\n``` \n### Move data with dbutils \nTo access data with Apache Spark, you must move it from ephemeral storage to cloud object storage. Databricks recommends using volumes for managing all access to cloud object storage. See [Connect to data sources](https:\/\/docs.databricks.com\/connect\/index.html). \nThe [Databricks Utilities](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html) (`dbutils`) allow you to move files from ephemeral storage attached to the driver to other locations, including Unity Catalog volumes. The following example moves data to an example volume: \n```\ndbutils.fs.mv(\"file:\/tmp\/curl-subway.csv\", \"\/Volumes\/my_catalog\/my_schema\/my_volume\/subway.csv\")\n\n``` \n### Read downloaded data \nAfter you move the data to a volume, you can read the data as normal. The following code reads in the CSV data moved to a volume: \n```\ndf = spark.read.format(\"csv\").option(\"header\", True).load(\"\/Volumes\/my_catalog\/my_schema\/my_volume\/subway.csv\")\ndisplay(df)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/download-internet-files.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Specify a managed storage location in Unity Catalog\n\nA managed storage location specifies a location in cloud object storage for storing data for managed tables and managed volumes. \nYou can associate a managed storage location with a metastore, catalog, or schema. Managed storage locations at lower levels in the hierarchy override storage locations defined at higher levels when managed tables or managed volumes are created. 
\nWhen an account admin creates a metastore, they can associate a storage location in an AWS S3 or Cloudflare R2 bucket in your cloud provider account to use as a managed storage location. Managed storage locations at the catalog and schema levels are optional, but Databricks recommends assigning managed storage locations at the catalog level for logical data isolation. See [Data governance and data isolation building blocks](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html#building-blocks). \nImportant \nIf your workspace was enabled for Unity Catalog automatically, the Unity Catalog metastore was created without a metastore-level managed storage location. You should assign a managed storage location at the catalog or schema level. See [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement) and [Data governance and data isolation building blocks](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html#building-blocks).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/managed-storage.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Specify a managed storage location in Unity Catalog\n##### What is a managed storage location?\n\nManaged storage locations have the following properties: \n* Managed tables and managed volumes store data and metadata files in managed storage locations.\n* Managed storage locations cannot overlap with external tables or external volumes. \nThe following table describes how a managed storage location is declared and associated with Unity Catalog objects: \n| Associated Unity Catalog object | How to set | Relation to external locations |\n| --- | --- | --- |\n| Metastore | Configured by account admin during metastore creation. | Cannot overlap an external location. 
|\n| Catalog | Specified during catalog creation using the `MANAGED LOCATION` keyword. | Must be contained within an external location. |\n| Schema | Specified during schema creation using the `MANAGED LOCATION` keyword. | Must be contained within an external location. | \nThe managed storage location that stores data and metadata for managed tables and managed volumes is determined by the following rules: \n* If the containing schema has a managed location, the data is stored in the schema managed location.\n* If the containing schema does not have a managed location but the catalog has a managed location, the data is stored in the catalog managed location.\n* If neither the containing schema nor the containing catalog has a managed location, the data is stored in the metastore managed location. \nUnity Catalog prevents overlap of location governance. See [How do paths work for data managed by Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/paths.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/managed-storage.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Specify a managed storage location in Unity Catalog\n##### Managed storage location, storage root, and storage location\n\nWhen you specify a `MANAGED LOCATION` for a catalog or schema, the provided location is tracked as the **Storage Root** in Unity Catalog. To ensure that all managed entities have a unique location, Unity Catalog adds hashed subdirectories to the specified location, using the following format: \n| Object | Path |\n| --- | --- |\n| Schema | `<storage-root>\/__unitystorage\/schemas\/00000000-0000-0000-0000-000000000000` |\n| Catalog | `<storage-root>\/__unitystorage\/catalogs\/00000000-0000-0000-0000-000000000000` | \nThe fully qualified path for the managed storage location is tracked as the **Storage Location** in Unity Catalog. 
\nYou can specify the same managed storage location for multiple schemas and catalogs.\n\n#### Specify a managed storage location in Unity Catalog\n##### Required privileges\n\nUsers who have the `CREATE MANAGED STORAGE` privilege on an external location can configure managed storage locations during catalog or schema creation. \nManaged storage locations set at the metastore level must be configured by account admins during metastore creation.\n\n#### Specify a managed storage location in Unity Catalog\n##### Set a managed storage location for a catalog\n\nSet a managed storage location for a catalog by using the `MANAGED LOCATION` keyword during catalog creation, as in the following example: \n```\nCREATE CATALOG <catalog-name>\nMANAGED LOCATION 's3:\/\/<external-location-bucket-path>\/<directory>';\n\n```\n\n#### Specify a managed storage location in Unity Catalog\n##### Set a managed storage location for a schema\n\nSet a managed storage location for a schema by using the `MANAGED LOCATION` keyword during schema creation, as in the following example: \n```\nCREATE SCHEMA <catalog>.<schema-name>\nMANAGED LOCATION 's3:\/\/<external-location-bucket-path>\/<directory>';\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/managed-storage.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Specify a managed storage location in Unity Catalog\n##### Next steps\n\nManaged storage locations are used for creating managed tables and managed volumes. 
See [Create tables in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html) and [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/managed-storage.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/migrate.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Lakehouse Federation: Migrate legacy query federation connections\n\nIf you have set up [legacy query federation connections](https:\/\/docs.databricks.com\/query-federation\/non-uc.html), Databricks recommends that you migrate them to use [Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). \nLegacy query federation involved creating tables in Databricks that referenced an external data source. To \u201cmove\u201d those tables into Unity Catalog using Lakehouse Federation, you must create a Lakehouse Federation connection and foreign catalog for the database that includes the table. You can then grant user access to the catalog, or to schemas and tables in the catalog, using Unity Catalog. \nA single foreign catalog may be able to cover multiple tables that you have set up for legacy query federation. \nIn the following example: \n* The first code block (\u201cLegacy syntax\u201d) shows the syntax that was used to create a table named `postgresql_table` in Databricks that references the `my-postgres-table` table in the `my-postgres-database` database on the `postgres-demo.lb123.us-west-2.rds.amazonaws.com:5432` server.\n* The second code block (\u201cLakehouse Federation\u201d) shows the creation of a connection to the `postgres-demo.lb123.us-west-2.rds.amazonaws.com:5432` server, followed by the creation of a foreign catalog, `my-postgres-catalog`, that maps to the `my-postgres-database` database. 
\n```\nCREATE TABLE postgresql_table\nUSING postgresql\nOPTIONS (\n  dbtable 'my-postgres-table',\n  host 'postgres-demo.lb123.us-west-2.rds.amazonaws.com',\n  port '5432',\n  database 'my-postgres-database',\n  user 'postgres_user',\n  password 'password123'\n);\n\n``` \n```\n-- Create a connection. Identifiers that contain hyphens must be quoted with backticks.\nCREATE CONNECTION `postgres-connection` TYPE postgresql\nOPTIONS (\n  host 'postgres-demo.lb123.us-west-2.rds.amazonaws.com',\n  port '5432',\n  user 'postgres_user',\n  password 'password123'\n);\n\n-- Create a foreign catalog that mirrors the database:\nCREATE FOREIGN CATALOG `my-postgres-catalog` USING CONNECTION `postgres-connection`\nOPTIONS (database 'my-postgres-database');\n\n``` \nThe foreign catalog will surface `my-postgres-table` and all of the other tables in `my-postgres-database`, and you can use Unity Catalog to manage access to those tables from your Databricks workspace. \nNote \nYour original query federation configuration may include options that are not available in Lakehouse Federation. You might not need those options when you move to Lakehouse Federation, but if you do need them, you can continue to use the legacy query federation connection rather than migrating. \nDetailed instructions for creating connections and foreign catalogs are available for each supported connection type. 
See the article for your connection type, listed in the table of contents in this documentation site\u2019s left navigation pane.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/migrate.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Package custom artifacts for Model Serving\n\nThis article describes how to ensure your model\u2019s file and artifact dependencies are available on your [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) endpoint.\n\n#### Package custom artifacts for Model Serving\n##### Requirements\n\nMLflow 1.29 and above\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-custom-artifacts.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Package custom artifacts for Model Serving\n##### Package artifacts with models\n\nWhen your model requires files or artifacts during inference, you can package them into the model artifact when you log the model. \nIf you\u2019re working with Databricks Notebooks, a common practice is to have such files reside in [DBFS](https:\/\/docs.databricks.com\/dbfs\/index.html). Models are also sometimes configured to download artifacts from the internet (such as HuggingFace Tokenizers). Real-time workloads at scale perform best when all required dependencies are statically captured at deployment time. For this reason, Model Serving requires DBFS artifacts be packaged into the model artifact itself and uses MLflow interfaces to do so. Network artifacts loaded with the model should be packaged with the model whenever possible. \nWith the MLflow command [log\\_model()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.pyfunc.html#mlflow.pyfunc.log_model) you can log a model and its dependent artifacts with the `artifacts` parameter. 
\n```\nmlflow.pyfunc.log_model(\n    ...\n    artifacts={\"model-weights\": \"\/dbfs\/path\/to\/file\", \"tokenizer_cache\": \".\/tokenizer_cache\"},\n    ...\n)\n\n``` \nIn PyFunc models, these artifacts\u2019 paths are accessible from the `context` object under `context.artifacts`, and they can be loaded in the standard way for that file type. \nFor example, in a custom MLflow model: \n```\nimport mlflow\nimport torch\nimport transformers\n\nclass ModelPyfunc(mlflow.pyfunc.PythonModel):\n    def load_context(self, context):\n        # Load each packaged artifact from its local path in context.artifacts.\n        self.model = torch.load(context.artifacts[\"model-weights\"])\n        self.tokenizer = transformers.BertweetTokenizer.from_pretrained(\"model-base\", local_files_only=True, cache_dir=context.artifacts[\"tokenizer_cache\"])\n    ...\n\n``` \nAfter your files and artifacts are packaged within your model artifact, you can serve your model to a [Model Serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-custom-artifacts.html"} +{"content":"# AI and Machine Learning on Databricks\n## GraphFrames\n#### GraphFrames user guide - Python\n\nTo learn more about GraphFrames, try importing and running this notebook in your workspace.\n\n#### GraphFrames user guide - Python\n##### GraphFrames Python notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/graphframes-user-guide-py.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/graphframes\/user-guide-python.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n### Connect to dbt Core\n##### Tutorial: Create, run, and test dbt models locally\n\nThis tutorial walks you through how to create, run, and test dbt models locally. You can also run dbt projects as Databricks job tasks. 
For more information, see [Use dbt transformations in a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbt-in-workflows.html).\n\n##### Tutorial: Create, run, and test dbt models locally\n###### Before you begin\n\nTo follow this tutorial, you must first connect your Databricks workspace to dbt Core. For more information, see [Connect to dbt Core](https:\/\/docs.databricks.com\/partners\/prep\/dbt.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/dbt-core-tutorial.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n### Connect to dbt Core\n##### Tutorial: Create, run, and test dbt models locally\n###### Step 1: Create and run models\n\nIn this step, you use your favorite text editor to create *models*, which are `select` statements that create either a new view (the default) or a new table in a database, based on existing data in that same database. This procedure creates a model based on the sample `diamonds` table from the [Sample datasets](https:\/\/docs.databricks.com\/discover\/databricks-datasets.html). \nUse the following code to create this table. \n```\nDROP TABLE IF EXISTS diamonds;\n\nCREATE TABLE diamonds USING CSV OPTIONS (path \"\/databricks-datasets\/Rdatasets\/data-001\/csv\/ggplot2\/diamonds.csv\", header \"true\")\n\n``` \n1. In the project\u2019s `models` directory, create a file named `diamonds_four_cs.sql` with the following SQL statement. This statement selects only the carat, cut, color, and clarity details for each diamond from the `diamonds` table. The `config` block instructs dbt to create a table in the database based on this statement. 
\n```\n{{ config(\n  materialized='table',\n  file_format='delta'\n) }}\n\n``` \n```\nselect carat, cut, color, clarity\nfrom diamonds\n\n``` \nTip \nFor additional `config` options such as using the Delta file format and the `merge` incremental strategy, see [Databricks configurations](https:\/\/docs.getdbt.com\/reference\/resource-configs\/databricks-configs) in the dbt documentation.\n2. In the project\u2019s `models` directory, create a second file named `diamonds_list_colors.sql` with the following SQL statement. This statement selects unique values from the `color` column in the `diamonds_four_cs` table, sorting the results in ascending alphabetical order. Because there is no `config` block, this model instructs dbt to create a view in the database based on this statement. \n```\nselect distinct color\nfrom {{ ref('diamonds_four_cs') }}\norder by color asc\n\n```\n3. In the project\u2019s `models` directory, create a third file named `diamonds_prices.sql` with the following SQL statement. This statement averages diamond prices by color, sorting the results by average price from highest to lowest. This model instructs dbt to create a view in the database based on this statement. \n```\nselect color, avg(price) as price\nfrom diamonds\ngroup by color\norder by price desc\n\n```\n4. With the virtual environment activated, run the `dbt run` command with the paths to the three preceding files. In the `default` database (as specified in the `profiles.yml` file), dbt creates one table named `diamonds_four_cs` and two views named `diamonds_list_colors` and `diamonds_prices`. dbt gets these view and table names from their related `.sql` file names. \n```\ndbt run --model models\/diamonds_four_cs.sql models\/diamonds_list_colors.sql models\/diamonds_prices.sql\n\n``` \n```\n...\n... | 1 of 3 START table model default.diamonds_four_cs.................... [RUN]\n... | 1 of 3 OK created table model default.diamonds_four_cs............... [OK ...]\n... 
| 2 of 3 START view model default.diamonds_list_colors................. [RUN]\n... | 2 of 3 OK created view model default.diamonds_list_colors............ [OK ...]\n... | 3 of 3 START view model default.diamonds_prices...................... [RUN]\n... | 3 of 3 OK created view model default.diamonds_prices................. [OK ...]\n... |\n... | Finished running 1 table model, 2 view models ...\n\nCompleted successfully\n\nDone. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3\n\n```\n5. Run the following SQL code to list information about the new views and to select all rows from the table and views. \nIf you are connecting to a cluster, you can run this SQL code from a [notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL warehouse, you can run this SQL code from a [query](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html#create-a-query). 
\n```\nSHOW views IN default;\n\n``` \n```\n+-----------+----------------------+-------------+\n| namespace | viewName | isTemporary |\n+===========+======================+=============+\n| default | diamonds_list_colors | false |\n+-----------+----------------------+-------------+\n| default | diamonds_prices | false |\n+-----------+----------------------+-------------+\n\n``` \n```\nSELECT * FROM diamonds_four_cs;\n\n``` \n```\n+-------+---------+-------+---------+\n| carat | cut | color | clarity |\n+=======+=========+=======+=========+\n| 0.23 | Ideal | E | SI2 |\n+-------+---------+-------+---------+\n| 0.21 | Premium | E | SI1 |\n+-------+---------+-------+---------+\n...\n\n``` \n```\nSELECT * FROM diamonds_list_colors;\n\n``` \n```\n+-------+\n| color |\n+=======+\n| D |\n+-------+\n| E |\n+-------+\n...\n\n``` \n```\nSELECT * FROM diamonds_prices;\n\n``` \n```\n+-------+---------+\n| color | price |\n+=======+=========+\n| J | 5323.82 |\n+-------+---------+\n| I | 5091.87 |\n+-------+---------+\n...\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/dbt-core-tutorial.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n### Connect to dbt Core\n##### Tutorial: Create, run, and test dbt models locally\n###### Step 2: Create and run more complex models\n\nIn this step, you create more complex models for a set of related data tables. These data tables contain information about a fictional sports league of three teams playing a season of six games. This procedure creates the data tables, creates the models, and runs the models. \n1. Run the following SQL code to create the necessary data tables. \nIf you are connecting to a cluster, you can run this SQL code from a [notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) that is connected to the cluster, specifying SQL as the default language for the notebook. 
If you are connecting to a SQL warehouse, you can run this SQL code from a [query](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html#create-a-query). \nThe tables and views in this step start with `zzz_` to help identify them as part of this example. You do not need to follow this pattern for your own tables and views. \n```\nDROP TABLE IF EXISTS zzz_game_opponents;\nDROP TABLE IF EXISTS zzz_game_scores;\nDROP TABLE IF EXISTS zzz_games;\nDROP TABLE IF EXISTS zzz_teams;\n\nCREATE TABLE zzz_game_opponents (\ngame_id INT,\nhome_team_id INT,\nvisitor_team_id INT\n) USING DELTA;\n\nINSERT INTO zzz_game_opponents VALUES (1, 1, 2);\nINSERT INTO zzz_game_opponents VALUES (2, 1, 3);\nINSERT INTO zzz_game_opponents VALUES (3, 2, 1);\nINSERT INTO zzz_game_opponents VALUES (4, 2, 3);\nINSERT INTO zzz_game_opponents VALUES (5, 3, 1);\nINSERT INTO zzz_game_opponents VALUES (6, 3, 2);\n\n-- Result:\n-- +---------+--------------+-----------------+\n-- | game_id | home_team_id | visitor_team_id |\n-- +=========+==============+=================+\n-- | 1 | 1 | 2 |\n-- +---------+--------------+-----------------+\n-- | 2 | 1 | 3 |\n-- +---------+--------------+-----------------+\n-- | 3 | 2 | 1 |\n-- +---------+--------------+-----------------+\n-- | 4 | 2 | 3 |\n-- +---------+--------------+-----------------+\n-- | 5 | 3 | 1 |\n-- +---------+--------------+-----------------+\n-- | 6 | 3 | 2 |\n-- +---------+--------------+-----------------+\n\nCREATE TABLE zzz_game_scores (\ngame_id INT,\nhome_team_score INT,\nvisitor_team_score INT\n) USING DELTA;\n\nINSERT INTO zzz_game_scores VALUES (1, 4, 2);\nINSERT INTO zzz_game_scores VALUES (2, 0, 1);\nINSERT INTO zzz_game_scores VALUES (3, 1, 2);\nINSERT INTO zzz_game_scores VALUES (4, 3, 2);\nINSERT INTO zzz_game_scores VALUES (5, 3, 0);\nINSERT INTO zzz_game_scores VALUES (6, 3, 1);\n\n-- Result:\n-- +---------+-----------------+--------------------+\n-- | game_id | home_team_score | visitor_team_score |\n-- 
+=========+=================+====================+\n-- | 1 | 4 | 2 |\n-- +---------+-----------------+--------------------+\n-- | 2 | 0 | 1 |\n-- +---------+-----------------+--------------------+\n-- | 3 | 1 | 2 |\n-- +---------+-----------------+--------------------+\n-- | 4 | 3 | 2 |\n-- +---------+-----------------+--------------------+\n-- | 5 | 3 | 0 |\n-- +---------+-----------------+--------------------+\n-- | 6 | 3 | 1 |\n-- +---------+-----------------+--------------------+\n\nCREATE TABLE zzz_games (\ngame_id INT,\ngame_date DATE\n) USING DELTA;\n\nINSERT INTO zzz_games VALUES (1, '2020-12-12');\nINSERT INTO zzz_games VALUES (2, '2021-01-09');\nINSERT INTO zzz_games VALUES (3, '2020-12-19');\nINSERT INTO zzz_games VALUES (4, '2021-01-16');\nINSERT INTO zzz_games VALUES (5, '2021-01-23');\nINSERT INTO zzz_games VALUES (6, '2021-02-06');\n\n-- Result:\n-- +---------+------------+\n-- | game_id | game_date |\n-- +=========+============+\n-- | 1 | 2020-12-12 |\n-- +---------+------------+\n-- | 2 | 2021-01-09 |\n-- +---------+------------+\n-- | 3 | 2020-12-19 |\n-- +---------+------------+\n-- | 4 | 2021-01-16 |\n-- +---------+------------+\n-- | 5 | 2021-01-23 |\n-- +---------+------------+\n-- | 6 | 2021-02-06 |\n-- +---------+------------+\n\nCREATE TABLE zzz_teams (\nteam_id INT,\nteam_city VARCHAR(15)\n) USING DELTA;\n\nINSERT INTO zzz_teams VALUES (1, \"San Francisco\");\nINSERT INTO zzz_teams VALUES (2, \"Seattle\");\nINSERT INTO zzz_teams VALUES (3, \"Amsterdam\");\n\n-- Result:\n-- +---------+---------------+\n-- | team_id | team_city |\n-- +=========+===============+\n-- | 1 | San Francisco |\n-- +---------+---------------+\n-- | 2 | Seattle |\n-- +---------+---------------+\n-- | 3 | Amsterdam |\n-- +---------+---------------+\n\n```\n2. In the project\u2019s `models` directory, create a file named `zzz_game_details.sql` with the following SQL statement. 
This statement creates a table that provides the details of each game, such as team names and scores. The `config` block instructs dbt to create a table in the database based on this statement. \n```\n-- Create a table that provides full details for each game, including\n-- the game ID, the home and visiting teams' city names and scores,\n-- the game winner's city name, and the game date.\n\n``` \n```\n{{ config(\nmaterialized='table',\nfile_format='delta'\n) }}\n\n``` \n```\n-- Step 4 of 4: Replace the visitor team IDs with their city names.\nselect\ngame_id,\nhome,\nt.team_city as visitor,\nhome_score,\nvisitor_score,\n-- Step 3 of 4: Display the city name for each game's winner.\ncase\nwhen\nhome_score > visitor_score\nthen\nhome\nwhen\nvisitor_score > home_score\nthen\nt.team_city\nend as winner,\ngame_date as date\nfrom (\n-- Step 2 of 4: Replace the home team IDs with their actual city names.\nselect\ngame_id,\nt.team_city as home,\nhome_score,\nvisitor_team_id,\nvisitor_score,\ngame_date\nfrom (\n-- Step 1 of 4: Combine data from various tables (for example, game and team IDs, scores, dates).\nselect\ng.game_id,\ngo.home_team_id,\ngs.home_team_score as home_score,\ngo.visitor_team_id,\ngs.visitor_team_score as visitor_score,\ng.game_date\nfrom\nzzz_games as g,\nzzz_game_opponents as go,\nzzz_game_scores as gs\nwhere\ng.game_id = go.game_id and\ng.game_id = gs.game_id\n) as all_ids,\nzzz_teams as t\nwhere\nall_ids.home_team_id = t.team_id\n) as visitor_ids,\nzzz_teams as t\nwhere\nvisitor_ids.visitor_team_id = t.team_id\norder by game_date desc\n\n```\n3. In the project\u2019s `models` directory, create a file named `zzz_win_loss_records.sql` with the following SQL statement. This statement creates a view that lists team win-loss records for the season. 
\n```\n-- Create a view that summarizes the season's win and loss records by team.\n\n-- Step 2 of 2: Calculate the number of wins and losses for each team.\nselect\nwinner as team,\ncount(winner) as wins,\n-- Each team played in 4 games.\n(4 - count(winner)) as losses\nfrom (\n-- Step 1 of 2: Determine the winner and loser for each game.\nselect\ngame_id,\nwinner,\ncase\nwhen\nhome = winner\nthen\nvisitor\nelse\nhome\nend as loser\nfrom {{ ref('zzz_game_details') }}\n)\ngroup by winner\norder by wins desc\n\n```\n4. With the virtual environment activated, run the `dbt run` command with the paths to the two preceding files. In the `default` database (as specified in the `profiles.yml` file), dbt creates one table named `zzz_game_details` and one view named `zzz_win_loss_records`. dbt gets these view and table names from their related `.sql` file names. \n```\ndbt run --model models\/zzz_game_details.sql models\/zzz_win_loss_records.sql\n\n``` \n```\n...\n... | 1 of 2 START table model default.zzz_game_details.................... [RUN]\n... | 1 of 2 OK created table model default.zzz_game_details............... [OK ...]\n... | 2 of 2 START view model default.zzz_win_loss_records................. [RUN]\n... | 2 of 2 OK created view model default.zzz_win_loss_records............ [OK ...]\n... |\n... | Finished running 1 table model, 1 view model ...\n\nCompleted successfully\n\nDone. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2\n\n```\n5. Run the following SQL code to list information about the new view and to select all rows from the table and view. \nIf you are connecting to a cluster, you can run this SQL code from a [notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL warehouse, you can run this SQL code from a [query](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html#create-a-query). 
\n```\nSHOW VIEWS FROM default LIKE 'zzz_win_loss_records';\n\n``` \n```\n+-----------+----------------------+-------------+\n| namespace | viewName | isTemporary |\n+===========+======================+=============+\n| default | zzz_win_loss_records | false |\n+-----------+----------------------+-------------+\n\n``` \n```\nSELECT * FROM zzz_game_details;\n\n``` \n```\n+---------+---------------+---------------+------------+---------------+---------------+------------+\n| game_id | home | visitor | home_score | visitor_score | winner | date |\n+=========+===============+===============+============+===============+===============+============+\n| 1 | San Francisco | Seattle | 4 | 2 | San Francisco | 2020-12-12 |\n+---------+---------------+---------------+------------+---------------+---------------+------------+\n| 2 | San Francisco | Amsterdam | 0 | 1 | Amsterdam | 2021-01-09 |\n+---------+---------------+---------------+------------+---------------+---------------+------------+\n| 3 | Seattle | San Francisco | 1 | 2 | San Francisco | 2020-12-19 |\n+---------+---------------+---------------+------------+---------------+---------------+------------+\n| 4 | Seattle | Amsterdam | 3 | 2 | Seattle | 2021-01-16 |\n+---------+---------------+---------------+------------+---------------+---------------+------------+\n| 5 | Amsterdam | San Francisco | 3 | 0 | Amsterdam | 2021-01-23 |\n+---------+---------------+---------------+------------+---------------+---------------+------------+\n| 6 | Amsterdam | Seattle | 3 | 1 | Amsterdam | 2021-02-06 |\n+---------+---------------+---------------+------------+---------------+---------------+------------+\n\n``` \n```\nSELECT * FROM zzz_win_loss_records;\n\n``` \n```\n+---------------+------+--------+\n| team | wins | losses |\n+===============+======+========+\n| Amsterdam | 3 | 1 |\n+---------------+------+--------+\n| San Francisco | 2 | 2 |\n+---------------+------+--------+\n| Seattle | 1 | 3 
|\n+---------------+------+--------+\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/dbt-core-tutorial.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n### Connect to dbt Core\n##### Tutorial: Create, run, and test dbt models locally\n###### Step 3: Create and run tests\n\nIn this step, you create *tests*, which are assertions you make about your models. When you run these tests, dbt tells you if each test in your project passes or fails. \nThere are two types of tests. *Schema tests*, applied in YAML, return the number of records that do not pass an assertion. When this number is zero, all records pass, and therefore the test passes. *Data tests* are specific queries that must return zero records to pass. \n1. In the project\u2019s `models` directory, create a file named `schema.yml` with the following content. This file includes schema tests that determine whether the specified columns have unique values, are not null, have only the specified values, or a combination of these. \n```\nversion: 2\n\nmodels:\n  - name: zzz_game_details\n    columns:\n      - name: game_id\n        tests:\n          - unique\n          - not_null\n      - name: home\n        tests:\n          - not_null\n          - accepted_values:\n              values: ['Amsterdam', 'San Francisco', 'Seattle']\n      - name: visitor\n        tests:\n          - not_null\n          - accepted_values:\n              values: ['Amsterdam', 'San Francisco', 'Seattle']\n      - name: home_score\n        tests:\n          - not_null\n      - name: visitor_score\n        tests:\n          - not_null\n      - name: winner\n        tests:\n          - not_null\n          - accepted_values:\n              values: ['Amsterdam', 'San Francisco', 'Seattle']\n      - name: date\n        tests:\n          - not_null\n  - name: zzz_win_loss_records\n    columns:\n      - name: team\n        tests:\n          - unique\n          - not_null\n          - relationships:\n              to: ref('zzz_game_details')\n              field: home\n      - name: wins\n        tests:\n          - not_null\n      - name: losses\n        tests:\n          - not_null\n\n```\n2. In the project\u2019s `tests` directory, create a file named `zzz_game_details_check_dates.sql` with the following SQL statement. 
This file includes a data test to determine whether any games happened outside of the regular season. \n```\n-- This season's games happened between 2020-12-12 and 2021-02-06.\n-- For this test to pass, this query must return no results.\n\nselect date\nfrom {{ ref('zzz_game_details') }}\nwhere date < '2020-12-12'\nor date > '2021-02-06'\n\n```\n3. In the project\u2019s `tests` directory, create a file named `zzz_game_details_check_scores.sql` with the following SQL statement. This file includes a data test to determine whether any scores were negative or any games were tied. \n```\n-- This sport allows no negative scores or tie games.\n-- For this test to pass, this query must return no results.\n\nselect home_score, visitor_score\nfrom {{ ref('zzz_game_details') }}\nwhere home_score < 0\nor visitor_score < 0\nor home_score = visitor_score\n\n```\n4. In the project\u2019s `tests` directory, create a file named `zzz_win_loss_records_check_records.sql` with the following SQL statement. This file includes a data test to determine whether any teams had negative win or loss records, had more win or loss records than games played, or played more games than were allowed. \n```\n-- Each team participated in 4 games this season.\n-- For this test to pass, this query must return no results.\n\nselect wins, losses\nfrom {{ ref('zzz_win_loss_records') }}\nwhere wins < 0 or wins > 4\nor losses < 0 or losses > 4\nor (wins + losses) > 4\n\n```\n5. With the virtual environment activated, run the `dbt test` command. \n```\ndbt test --models zzz_game_details zzz_win_loss_records\n\n``` \n```\n...\n... | 1 of 19 START test accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [RUN]\n... | 1 of 19 PASS accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [PASS ...]\n...\n... |\n... | Finished running 19 tests ...\n\nCompleted successfully\n\nDone. 
PASS=19 WARN=0 ERROR=0 SKIP=0 TOTAL=19\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/dbt-core-tutorial.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n### Connect to dbt Core\n##### Tutorial: Create, run, and test dbt models locally\n###### Step 4: Clean up\n\nYou can delete the tables and views you created for this example by running the following SQL code. \nIf you are connecting to a cluster, you can run this SQL code from a [notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL warehouse, you can run this SQL code from a [query](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html#create-a-query). \n```\nDROP TABLE zzz_game_opponents;\nDROP TABLE zzz_game_scores;\nDROP TABLE zzz_games;\nDROP TABLE zzz_teams;\nDROP TABLE zzz_game_details;\nDROP VIEW zzz_win_loss_records;\n\nDROP TABLE diamonds;\nDROP TABLE diamonds_four_cs;\nDROP VIEW diamonds_list_colors;\nDROP VIEW diamonds_prices;\n\n```\n\n##### Tutorial: Create, run, and test dbt models locally\n###### Troubleshooting\n\nFor information about common issues when using dbt Core with Databricks and how to resolve them, see [Getting help](https:\/\/docs.getdbt.com\/docs\/guides\/getting-help) on the dbt Labs website.\n\n##### Tutorial: Create, run, and test dbt models locally\n###### Next steps\n\nRun dbt Core projects as Databricks job tasks. 
See [Use dbt transformations in a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbt-in-workflows.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/dbt-core-tutorial.html"} +{"content":"# Technology partners\n## Connect to data prep partners using Partner Connect\n### Connect to dbt Core\n##### Tutorial: Create, run, and test dbt models locally\n###### Additional resources\n\nExplore the following resources on the dbt Labs website: \n* [dbt models](https:\/\/docs.getdbt.com\/docs\/building-a-dbt-project\/building-models)\n* [Test dbt projects](https:\/\/docs.getdbt.com\/docs\/building-a-dbt-project\/tests)\n* [Use Jinja, a templating language, for programming SQL in your dbt projects](https:\/\/docs.getdbt.com\/docs\/building-a-dbt-project\/jinja-macros)\n* [dbt best practices](https:\/\/docs.getdbt.com\/docs\/guides\/best-practices)\n* [dbt Cloud, a hosted version of dbt](https:\/\/docs.getdbt.com\/docs\/dbt-cloud\/cloud-overview)\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/dbt-core-tutorial.html"} +{"content":"# Query data\n## Data format options\n#### Read and write to CSV files\n\nThis article provides examples for reading and writing to CSV files with Databricks using Python, Scala, R, and SQL. \nNote \nDatabricks recommends the [read\\_files table-valued function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/read_files.html) for SQL users to read CSV files. `read_files` is available in Databricks Runtime 13.3 LTS and above. \nYou can also use a temporary view. If you use SQL to read CSV data directly without using temporary views or `read_files`, the following limitations apply: \n* You can\u2019t [specify data source options](https:\/\/docs.databricks.com\/query\/formats\/csv.html#options). 
\n+ You can\u2019t [specify the schema](https:\/\/docs.databricks.com\/query\/formats\/csv.html#specify-schema) for the data.\n\n#### Read and write to CSV files\n##### Options\n\nYou can configure several options for CSV file data sources. See the following Apache Spark reference articles for supported read and write options. \n* Read \n+ [Python](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrameReader.csv.html#pyspark.sql.DataFrameReader.csv)\n+ [Scala](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/sql\/DataFrameReader.html#csv(path:String):Unit)\n* Write \n+ [Python](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrameWriter.csv.html#pyspark.sql.DataFrameWriter.csv)\n+ [Scala](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/sql\/DataFrameWriter.html#csv(path:String):Unit)\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/csv.html"} +{"content":"# Query data\n## Data format options\n#### Read and write to CSV files\n##### Work with malformed CSV records\n\nWhen reading CSV files with a specified schema, the data in the files might not match the schema. For example, a field containing the name of a city does not parse as an integer. The consequences depend on the mode that the parser runs in: \n* `PERMISSIVE` (default): nulls are inserted for fields that could not be parsed correctly\n* `DROPMALFORMED`: drops lines that contain fields that could not be parsed\n* `FAILFAST`: aborts the reading if any malformed data is found \nTo set the mode, use the `mode` option. 
\n```\ndiamonds_df = (spark.read\n.format(\"csv\")\n.option(\"mode\", \"PERMISSIVE\")\n.load(\"\/databricks-datasets\/Rdatasets\/data-001\/csv\/ggplot2\/diamonds.csv\")\n)\n\n``` \nIn the `PERMISSIVE` mode it is possible to inspect the rows that could not be parsed correctly using one of the following methods: \n* You can provide a custom path to the option `badRecordsPath` to record corrupt records to a file.\n* You can add the column `_corrupt_record` to the schema provided to the DataFrameReader to review corrupt records in the resultant DataFrame. \nNote \nThe `badRecordsPath` option takes precedence over `_corrupt_record`, meaning that malformed rows written to the provided path do not appear in the resultant DataFrame. \nDefault behavior for malformed records changes when using the [rescued data column](https:\/\/docs.databricks.com\/query\/formats\/csv.html#rescued-data). \n### Find malformed rows notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/read-csv-corrupt-record.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/csv.html"} +{"content":"# Query data\n## Data format options\n#### Read and write to CSV files\n##### Rescued data column\n\nNote \nThis feature is supported in [Databricks Runtime 8.3 (unsupported)](https:\/\/docs.databricks.com\/archive\/runtime-release-notes\/8.3.html) and above. \nWhen using the `PERMISSIVE` mode, you can enable the rescued data column to capture any data that wasn\u2019t parsed because one or more fields in a record have one of the following issues: \n* Absent from the provided schema.\n* Does not match the data type of the provided schema.\n* Has a case mismatch with the field names in the provided schema. \nThe rescued data column is returned as a JSON document containing the columns that were rescued, and the source file path of the record. 
To remove the source file path from the rescued data column, you can set the SQL configuration `spark.conf.set(\"spark.databricks.sql.rescuedDataColumn.filePath.enabled\", \"false\")`. You can enable the rescued data column by setting the option `rescuedDataColumn` to a column name when reading data, such as `_rescued_data` with `spark.read.option(\"rescuedDataColumn\", \"_rescued_data\").format(\"csv\").load(<path>)`. \nThe CSV parser supports three modes when parsing records: `PERMISSIVE`, `DROPMALFORMED`, and `FAILFAST`. When used together with `rescuedDataColumn`, data type mismatches do not cause records to be dropped in `DROPMALFORMED` mode or throw an error in `FAILFAST` mode. Only corrupt records\u2014that is, incomplete or malformed CSV\u2014are dropped or throw errors. \nWhen `rescuedDataColumn` is used in `PERMISSIVE` mode, the following rules apply to [corrupt records](https:\/\/docs.databricks.com\/query\/formats\/csv.html#corrupt-records): \n* The first row of the file (either a header row or a data row) sets the expected row length.\n* A row with a different number of columns is considered incomplete.\n* Data type mismatches are not considered corrupt records.\n* Only incomplete and malformed CSV records are considered corrupt and recorded to the `_corrupt_record` column or `badRecordsPath`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/csv.html"} +{"content":"# Query data\n## Data format options\n#### Read and write to CSV files\n##### SQL example: Read CSV file\n\nThe following SQL example reads a CSV file using `read_files`. 
\n```\n-- mode \"FAILFAST\" aborts file parsing with a RuntimeException if malformed lines are encountered\nSELECT * FROM read_files(\n's3:\/\/<bucket>\/<path>\/<file>.csv',\nformat => 'csv',\nheader => true,\nmode => 'FAILFAST')\n\n```\n\n#### Read and write to CSV files\n##### Scala, R, and Python examples: Read CSV file\n\nThe following notebook shows how to read a file, display sample data, and print the data schema using Scala, R, and Python. The examples in this section use the [diamonds dataset](https:\/\/docs.databricks.com\/discover\/databricks-datasets.html). Specify the path to the dataset as well as any options that you would like. \n### Read CSV files notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/read-csv-files.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/csv.html"} +{"content":"# Query data\n## Data format options\n#### Read and write to CSV files\n##### Example: Specify schema\n\nWhen the schema of the CSV file is known, you can specify the desired schema to the CSV reader with the `schema` option. \n### Read CSV files with schema notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/read-csv-schema.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nSQL example using `read_files`: \n```\nSELECT * FROM read_files(\n's3:\/\/<bucket>\/<path>\/<file>.csv',\nformat => 'csv',\nheader => false,\nschema => 'id string, date date, event_time timestamp')\n\n```\n\n#### Read and write to CSV files\n##### Example: Pitfalls of reading a subset of columns\n\nThe behavior of the CSV parser depends on the set of columns that are read. If the specified schema is incorrect, the results might differ considerably depending on the subset of columns that is accessed. 
The following notebook presents the most common pitfalls. \n### Caveats of reading a subset of columns of a CSV file notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/read-csv-column-subset.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/csv.html"} +{"content":"# Query data\n## Data format options\n#### LZO compressed file\n\nDue to licensing restrictions, the LZO compression codec is not available by default on Databricks clusters. To read an LZO compressed file, you must use [an init script](https:\/\/docs.databricks.com\/init-scripts\/index.html) to install the codec on your cluster at launch time.\n\n#### LZO compressed file\n##### Notebook example: Init LZO compressed files\n\nThe following notebook: \n* Builds the LZO codec.\n* Creates an init script that: \n+ Installs the LZO compression libraries and the `lzop` command, and copies the LZO codec to proper class path.\n+ Configures Spark to use the LZO compression codec. 
\n### Init LZO compressed files notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/init-lzo-compressed-files.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n#### LZO compressed file\n##### Notebook example: Read LZO compressed files\n\nThe following notebook reads LZO compressed files using the codec installed by the init script: \n### Read LZO compressed files notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/read-lzo-compressed-files.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/lzo.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Delta table streaming reads and writes\n\nDelta Lake is deeply integrated with [Spark Structured Streaming](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html) through `readStream` and `writeStream`. Delta Lake overcomes many of the limitations typically associated with streaming systems and files, including: \n* Coalescing small files produced by low latency ingest.\n* Maintaining \u201cexactly-once\u201d processing with more than one stream (or concurrent batch jobs).\n* Efficiently discovering which files are new when using files as the source for a stream. \nNote \nThis article describes using Delta Lake tables as streaming sources and sinks. To learn how to load data using streaming tables in Databricks SQL, see [Load data using streaming tables in Databricks SQL](https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html). 
\nFor information on stream-static joins with Delta Lake, see [Stream-static joins](https:\/\/docs.databricks.com\/transform\/join.html#stream-static).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Delta table streaming reads and writes\n##### Delta table as a source\n\nStructured Streaming incrementally reads Delta tables. While a streaming query is active against a Delta table, new records are processed idempotently as new table versions commit to the source table. \nThe following code examples show how to configure a streaming read using either the table name or the file path. \n```\nspark.readStream.table(\"table_name\")\n\nspark.readStream.load(\"\/path\/to\/table\")\n\n``` \n```\nspark.readStream.table(\"table_name\")\n\nspark.readStream.load(\"\/path\/to\/table\")\n\n``` \nImportant \nIf the schema for a Delta table changes after a streaming read begins against the table, the query fails. For most schema changes, you can restart the stream to resolve schema mismatch and continue processing. \nIn Databricks Runtime 12.2 LTS and below, you cannot stream from a Delta table with column mapping enabled that has undergone non-additive schema evolution such as renaming or dropping columns. For details, see [Streaming with column mapping and schema changes](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html#schema-tracking).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Delta table streaming reads and writes\n##### Limit input rate\n\nThe following options are available to control micro-batches: \n* `maxFilesPerTrigger`: How many new files are considered in every micro-batch. The default is 1000.\n* `maxBytesPerTrigger`: How much data gets processed in each micro-batch. 
This option sets a \u201csoft max\u201d, meaning that a batch processes approximately this amount of data and may process more than the limit in order to make the streaming query move forward in cases when the smallest input unit is larger than this limit. This is not set by default. \nIf you use `maxBytesPerTrigger` in conjunction with `maxFilesPerTrigger`, the micro-batch processes data until either the `maxFilesPerTrigger` or `maxBytesPerTrigger` limit is reached. \nNote \nIn cases when the source table transactions are cleaned up due to the `logRetentionDuration` [configuration](https:\/\/docs.databricks.com\/delta\/history.html#data-retention) and the streaming query tries to process those versions, by default the query fails to avoid data loss. You can set the option `failOnDataLoss` to `false` to ignore lost data and continue processing.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Delta table streaming reads and writes\n##### Stream a Delta Lake change data capture (CDC) feed\n\nDelta Lake [change data feed](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html) records changes to a Delta table, including updates and deletes. When enabled, you can stream from a change data feed and write logic to process inserts, updates, and deletes into downstream tables. Although change data feed data output differs slightly from the Delta table it describes, this provides a solution for propagating incremental changes to downstream tables in a [medallion architecture](https:\/\/docs.databricks.com\/lakehouse\/medallion.html). \nImportant \nIn Databricks Runtime 12.2 LTS and below, you cannot stream from the change data feed for a Delta table with column mapping enabled that has undergone non-additive schema evolution such as renaming or dropping columns. 
See [Streaming with column mapping and schema changes](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html#schema-tracking).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Delta table streaming reads and writes\n##### Ignore updates and deletes\n\nStructured Streaming does not handle input that is not an append and throws an exception if any modifications occur on the table being used as a source. There are two main strategies for dealing with changes that cannot be automatically propagated downstream: \n* You can delete the output and checkpoint and restart the stream from the beginning.\n* You can set either of these two options: \n+ `ignoreDeletes`: ignore transactions that delete data at partition boundaries.\n+ `skipChangeCommits`: ignore transactions that delete or modify existing records. `skipChangeCommits` subsumes `ignoreDeletes`. \nNote \nIn Databricks Runtime 12.2 LTS and above, `skipChangeCommits` deprecates the previous setting `ignoreChanges`. In Databricks Runtime 11.3 LTS and lower, `ignoreChanges` is the only supported option. \nThe semantics for `ignoreChanges` differ greatly from `skipChangeCommits`. With `ignoreChanges` enabled, rewritten data files in the source table are re-emitted after a data changing operation such as `UPDATE`, `MERGE INTO`, `DELETE` (within partitions), or `OVERWRITE`. Unchanged rows are often emitted alongside new rows, so downstream consumers must be able to handle duplicates. Deletes are not propagated downstream. `ignoreChanges` subsumes `ignoreDeletes`. \n`skipChangeCommits` disregards file changing operations entirely. Data files that are rewritten in the source table due to data changing operation such as `UPDATE`, `MERGE INTO`, `DELETE`, and `OVERWRITE` are ignored entirely. 
In order to reflect changes in upstream source tables, you must implement separate logic to propagate these changes. \nWorkloads configured with `ignoreChanges` continue to operate using known semantics, but Databricks recommends using `skipChangeCommits` for all new workloads. Migrating workloads using `ignoreChanges` to `skipChangeCommits` requires refactoring logic. \n### Example \nFor example, suppose you have a table `user_events` with `date`, `user_email`, and `action` columns that is partitioned by `date`. You stream out of the `user_events` table and you need to delete data from it due to GDPR. \nWhen you delete at partition boundaries (that is, the `WHERE` is on a partition column), the files are already segmented by value so the delete just drops those files from the metadata. When you delete an entire partition of data, you can use the following: \n```\nspark.readStream.format(\"delta\")\n.option(\"ignoreDeletes\", \"true\")\n.load(\"\/tmp\/delta\/user_events\")\n\n``` \nIf you delete data in multiple partitions (in this example, filtering on `user_email`), use the following syntax: \n```\nspark.readStream.format(\"delta\")\n.option(\"skipChangeCommits\", \"true\")\n.load(\"\/tmp\/delta\/user_events\")\n\n``` \nIf you update a `user_email` with the `UPDATE` statement, the file containing the `user_email` in question is rewritten. Use `skipChangeCommits` to ignore the changed data files.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Delta table streaming reads and writes\n##### Specify initial position\n\nYou can use the following options to specify the starting point of the Delta Lake streaming source without processing the entire table. \n* `startingVersion`: The Delta Lake version to start from. Databricks recommends omitting this option for most workloads. 
When not set, the stream starts from the latest available version including a complete snapshot of the table at that moment. \nIf specified, the stream reads all changes to the Delta table starting with the specified version (inclusive). If the specified version is no longer available, the stream fails to start. You can obtain the commit versions from the `version` column of the [DESCRIBE HISTORY](https:\/\/docs.databricks.com\/delta\/history.html) command output. \nTo return only the latest changes, specify `latest`.\n* `startingTimestamp`: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) are read by the streaming reader. If the provided timestamp precedes all table commits, the streaming read begins with the earliest available timestamp. One of: \n+ A timestamp string. For example, `\"2019-01-01T00:00:00.000Z\"`.\n+ A date string. For example, `\"2019-01-01\"`. \nYou cannot set both options at the same time. They take effect only when starting a new streaming query. If a streaming query has started and the progress has been recorded in its checkpoint, these options are ignored. \nImportant \nAlthough you can start the streaming source from a specified version or timestamp, the schema of the streaming source is always the latest schema of the Delta table. You must ensure there is no incompatible schema change to the Delta table after the specified version or timestamp. Otherwise, the streaming source may return incorrect results when reading the data with an incorrect schema. \n### Example \nFor example, suppose you have a table `user_events`. 
If you want to read changes since version 5, use: \n```\nspark.readStream.format(\"delta\")\n.option(\"startingVersion\", \"5\")\n.load(\"\/tmp\/delta\/user_events\")\n\n``` \nIf you want to read changes since 2018-10-18, use: \n```\nspark.readStream.format(\"delta\")\n.option(\"startingTimestamp\", \"2018-10-18\")\n.load(\"\/tmp\/delta\/user_events\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Delta table streaming reads and writes\n##### Process initial snapshot without data being dropped\n\nNote \nThis feature is available on Databricks Runtime 11.3 LTS and above. This feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nWhen using a Delta table as a stream source, the query first processes all of the data present in the table. The Delta table at this version is called the initial snapshot. By default, the Delta table\u2019s data files are processed based on which file was last modified. However, the last modification time does not necessarily represent the record event time order. \nIn a stateful streaming query with a defined watermark, processing files by modification time can result in records being processed in the wrong order. The watermark can then treat those records as late events and drop them. \nYou can avoid the data drop issue by enabling the following option: \n* `withEventTimeOrder`: Whether the initial snapshot should be processed with event time order. \nWith event time order enabled, the event time range of initial snapshot data is divided into time buckets. Each micro-batch processes a bucket by filtering data within the time range. The `maxFilesPerTrigger` and `maxBytesPerTrigger` configuration options still control the micro-batch size, but only approximately, due to the nature of the processing. 
\nThe graphic below shows this process: \n![Initial Snapshot](https:\/\/docs.databricks.com\/_images\/delta-initial-snapshot-data-drop.png) \nNotable information about this feature: \n* The data drop issue only happens when the initial Delta snapshot of a stateful streaming query is processed in the default order.\n* You cannot change `withEventTimeOrder` once the stream query is started while the initial snapshot is still being processed. To restart with `withEventTimeOrder` changed, you need to delete the checkpoint.\n* If you are running a stream query with withEventTimeOrder enabled, you cannot downgrade it to a DBR version which doesn\u2019t support this feature until the initial snapshot processing is completed. If you need to downgrade, you can wait for the initial snapshot to finish, or delete the checkpoint and restart the query.\n* This feature is not supported in the following uncommon scenarios: \n+ The event time column is a generated column and there are non-projection transformations between the Delta source and watermark.\n+ There is a watermark that has more than one Delta source in the stream query.\n* With event time order enabled, the performance of the Delta initial snapshot processing might be slower.\n* Each micro batch scans the initial snapshot to filter data within the corresponding event time range. For faster filter action, it is advised to use a Delta source column as the event time so that data skipping can be applied (check [Data skipping for Delta Lake](https:\/\/docs.databricks.com\/delta\/data-skipping.html) for when it\u2019s applicable). Additionally, table partitioning along the event time column can further speed the processing. You can check Spark UI to see how many delta files are scanned for a specific micro batch. \n### Example \nSuppose you have a table `user_events` with an `event_time` column. Your streaming query is an aggregation query. 
If you want to ensure no data drop during the initial snapshot processing, you can use: \n```\nspark.readStream.format(\"delta\")\n.option(\"withEventTimeOrder\", \"true\")\n.load(\"\/tmp\/delta\/user_events\")\n.withWatermark(\"event_time\", \"10 seconds\")\n\n``` \nNote \nYou can also enable this with Spark config on the cluster which will apply to all streaming queries: `spark.databricks.delta.withEventTimeOrder.enabled true`\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Delta table streaming reads and writes\n##### Delta table as a sink\n\nYou can also write data into a Delta table using Structured Streaming. The transaction log enables Delta Lake to guarantee exactly-once processing, even when there are other streams or batch queries running concurrently against the table. \nNote \nThe Delta Lake `VACUUM` function removes all files not managed by Delta Lake but skips any directories that begin with `_`. You can safely store checkpoints alongside other data and metadata for a Delta table using a directory structure such as `<table-name>\/_checkpoints`. \n### Metrics \nYou can find out the number of bytes and number of files yet to be processed in a [streaming query process](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html#reading-metrics-interactively) as the `numBytesOutstanding` and `numFilesOutstanding` metrics. Additional metrics include: \n* `numNewListedFiles`: Number of Delta Lake files that were listed in order to calculate the backlog for this batch. \n+ `backlogEndOffset`: The table version used to calculate the backlog. 
\nIf you are running the stream in a notebook, you can see these metrics under the **Raw Data** tab in the streaming query progress dashboard: \n```\n{\n\"sources\" : [\n{\n\"description\" : \"DeltaSource[file:\/path\/to\/source]\",\n\"metrics\" : {\n\"numBytesOutstanding\" : \"3456\",\n\"numFilesOutstanding\" : \"8\"\n}\n}\n]\n}\n\n``` \n### Append mode \nBy default, streams run in append mode, which adds new records to the table. \nYou can use the path method: \n```\n(events.writeStream\n.format(\"delta\")\n.outputMode(\"append\")\n.option(\"checkpointLocation\", \"\/tmp\/delta\/events\/_checkpoints\/\")\n.start(\"\/tmp\/delta\/events\")\n)\n\n``` \n```\nevents.writeStream\n.format(\"delta\")\n.outputMode(\"append\")\n.option(\"checkpointLocation\", \"\/tmp\/delta\/events\/_checkpoints\/\")\n.start(\"\/tmp\/delta\/events\")\n\n``` \nor the `toTable` method, as follows: \n```\n(events.writeStream\n.format(\"delta\")\n.outputMode(\"append\")\n.option(\"checkpointLocation\", \"\/tmp\/delta\/events\/_checkpoints\/\")\n.toTable(\"events\")\n)\n\n``` \n```\nevents.writeStream\n.outputMode(\"append\")\n.option(\"checkpointLocation\", \"\/tmp\/delta\/events\/_checkpoints\/\")\n.toTable(\"events\")\n\n``` \n### Complete mode \nYou can also use Structured Streaming to replace the entire table with every batch. 
One example use case is to compute a summary using aggregation: \n```\n(spark.readStream\n.format(\"delta\")\n.load(\"\/tmp\/delta\/events\")\n.groupBy(\"customerId\")\n.count()\n.writeStream\n.format(\"delta\")\n.outputMode(\"complete\")\n.option(\"checkpointLocation\", \"\/tmp\/delta\/eventsByCustomer\/_checkpoints\/\")\n.start(\"\/tmp\/delta\/eventsByCustomer\")\n)\n\n``` \n```\nspark.readStream\n.format(\"delta\")\n.load(\"\/tmp\/delta\/events\")\n.groupBy(\"customerId\")\n.count()\n.writeStream\n.format(\"delta\")\n.outputMode(\"complete\")\n.option(\"checkpointLocation\", \"\/tmp\/delta\/eventsByCustomer\/_checkpoints\/\")\n.start(\"\/tmp\/delta\/eventsByCustomer\")\n\n``` \nThe preceding example continuously updates a table that contains the aggregate number of events by customer. \nFor applications with more lenient latency requirements, you can save computing resources with one-time triggers. Use these to update summary aggregation tables on a given schedule, processing only new data that has arrived since the last update.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Delta table streaming reads and writes\n##### Upsert from streaming queries using `foreachBatch`\n\nYou can use a combination of `merge` and `foreachBatch` to write complex upserts from a streaming query into a Delta table. See [Use foreachBatch to write to arbitrary data sinks](https:\/\/docs.databricks.com\/structured-streaming\/foreach.html). 
\nThis pattern has many applications, including the following: \n* **Write streaming aggregates in Update Mode**: This is much more efficient than Complete Mode.\n* **Write a stream of database changes into a Delta table**: The [merge query for writing change data](https:\/\/docs.databricks.com\/delta\/merge.html#merge-in-cdc) can be used in `foreachBatch` to continuously apply a stream of changes to a Delta table.\n* **Write a stream of data into Delta table with deduplication**: The [insert-only merge query for deduplication](https:\/\/docs.databricks.com\/delta\/merge.html#dedupe) can be used in `foreachBatch` to continuously write data (with duplicates) to a Delta table with automatic deduplication. \nNote \n* Make sure that your `merge` statement inside `foreachBatch` is idempotent as restarts of the streaming query can apply the operation on the same batch of data multiple times.\n* When `merge` is used in `foreachBatch`, the input data rate of the streaming query (reported through `StreamingQueryProgress` and visible in the notebook rate graph) may be reported as a multiple of the actual rate at which data is generated at the source. This is because `merge` reads the input data multiple times causing the input metrics to be multiplied. If this is a bottleneck, you can cache the batch DataFrame before `merge` and then uncache it after `merge`. 
\nThe following example demonstrates how you can use SQL within `foreachBatch` to accomplish this task: \n```\n\/\/ Function to upsert microBatchOutputDF into Delta table using merge\ndef upsertToDelta(microBatchOutputDF: DataFrame, batchId: Long) {\n\/\/ Set the dataframe to view name\nmicroBatchOutputDF.createOrReplaceTempView(\"updates\")\n\n\/\/ Use the view name to apply MERGE\n\/\/ NOTE: You have to use the SparkSession that has been used to define the `updates` dataframe\nmicroBatchOutputDF.sparkSession.sql(s\"\"\"\nMERGE INTO aggregates t\nUSING updates s\nON s.key = t.key\nWHEN MATCHED THEN UPDATE SET *\nWHEN NOT MATCHED THEN INSERT *\n\"\"\")\n}\n\n\/\/ Write the output of a streaming aggregation query into Delta table\nstreamingAggregatesDF.writeStream\n.format(\"delta\")\n.foreachBatch(upsertToDelta _)\n.outputMode(\"update\")\n.start()\n\n``` \n```\n# Function to upsert microBatchOutputDF into Delta table using merge\ndef upsertToDelta(microBatchOutputDF, batchId):\n# Set the dataframe to view name\nmicroBatchOutputDF.createOrReplaceTempView(\"updates\")\n\n# Use the view name to apply MERGE\n# NOTE: You have to use the SparkSession that has been used to define the `updates` dataframe\n\n# In Databricks Runtime 10.5 and below, you must use the following:\n# microBatchOutputDF._jdf.sparkSession().sql(\"\"\"\nmicroBatchOutputDF.sparkSession.sql(\"\"\"\nMERGE INTO aggregates t\nUSING updates s\nON s.key = t.key\nWHEN MATCHED THEN UPDATE SET *\nWHEN NOT MATCHED THEN INSERT *\n\"\"\")\n\n# Write the output of a streaming aggregation query into Delta table\n(streamingAggregatesDF.writeStream\n.format(\"delta\")\n.foreachBatch(upsertToDelta)\n.outputMode(\"update\")\n.start()\n)\n\n``` \nYou can also choose to use the Delta Lake APIs to perform streaming upserts, as in the following example: \n```\nimport io.delta.tables._\n\nval deltaTable = DeltaTable.forPath(spark, \"\/data\/aggregates\")\n\n\/\/ Function to upsert microBatchOutputDF into Delta table using 
merge\ndef upsertToDelta(microBatchOutputDF: DataFrame, batchId: Long): Unit = {\ndeltaTable.as(\"t\")\n.merge(\nmicroBatchOutputDF.as(\"s\"),\n\"s.key = t.key\")\n.whenMatched().updateAll()\n.whenNotMatched().insertAll()\n.execute()\n}\n\n\/\/ Write the output of a streaming aggregation query into Delta table\nstreamingAggregatesDF.writeStream\n.format(\"delta\")\n.foreachBatch(upsertToDelta _)\n.outputMode(\"update\")\n.start()\n\n``` \n```\nfrom delta.tables import DeltaTable\n\ndeltaTable = DeltaTable.forPath(spark, \"\/data\/aggregates\")\n\n# Function to upsert microBatchOutputDF into Delta table using merge\ndef upsertToDelta(microBatchOutputDF, batchId):\n    (deltaTable.alias(\"t\").merge(\n        microBatchOutputDF.alias(\"s\"),\n        \"s.key = t.key\")\n        .whenMatchedUpdateAll()\n        .whenNotMatchedInsertAll()\n        .execute()\n    )\n\n# Write the output of a streaming aggregation query into Delta table\n(streamingAggregatesDF.writeStream\n    .format(\"delta\")\n    .foreachBatch(upsertToDelta)\n    .outputMode(\"update\")\n    .start()\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Delta table streaming reads and writes\n##### Idempotent table writes in `foreachBatch`\n\nNote \nDatabricks recommends configuring a separate streaming write for each sink you wish to update. Using `foreachBatch` to write to multiple tables serializes writes, which reduces parallelization and increases overall latency. \nDelta tables support the following `DataFrameWriter` options to make writes to multiple tables within `foreachBatch` idempotent: \n* `txnAppId`: A unique string that you can pass on each DataFrame write. For example, you can use the StreamingQuery ID as `txnAppId`.\n* `txnVersion`: A monotonically increasing number that acts as the transaction version. \nDelta Lake uses the combination of `txnAppId` and `txnVersion` to identify duplicate writes and ignore them. 
\nIf a batch write is interrupted with a failure, re-running the batch uses the same application and batch ID to help the runtime correctly identify duplicate writes and ignore them. Application ID (`txnAppId`) can be any user-generated unique string and does not have to be related to the stream ID. See [Use foreachBatch to write to arbitrary data sinks](https:\/\/docs.databricks.com\/structured-streaming\/foreach.html). \nWarning \nIf you delete the streaming checkpoint and restart the query with a new checkpoint, you must provide a different `txnAppId`. New checkpoints start with a batch ID of `0`. Delta Lake uses the batch ID and `txnAppId` as a unique key, and skips batches with already-seen values. \nThe following code example demonstrates this pattern: \n```\napp_id = ... # A unique string that is used as an application ID.\n\ndef writeToDeltaLakeTableIdempotent(batch_df, batch_id):\n    batch_df.write.format(...).option(\"txnVersion\", batch_id).option(\"txnAppId\", app_id).save(...) # location 1\n    batch_df.write.format(...).option(\"txnVersion\", batch_id).option(\"txnAppId\", app_id).save(...) # location 2\n\nstreamingDF.writeStream.foreachBatch(writeToDeltaLakeTableIdempotent).start()\n\n``` \n```\nval appId = ... \/\/ A unique string that is used as an application ID.\nstreamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>\nbatchDF.write.format(...).option(\"txnVersion\", batchId).option(\"txnAppId\", appId).save(...) \/\/ location 1\nbatchDF.write.format(...).option(\"txnVersion\", batchId).option(\"txnAppId\", appId).save(...) \/\/ location 2\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html"} +{"content":"# \n### Handle bad records and files\n\nDatabricks provides a number of options for dealing with files that contain bad records. Examples of bad data include: \n* **Incomplete or corrupt records**: Mainly observed in text-based file formats like JSON and CSV. 
For example, a JSON record that doesn\u2019t have a closing brace or a CSV record that doesn\u2019t have as many columns as the header or first record of the CSV file.\n* **Mismatched data types**: When the value for a column doesn\u2019t have the specified or inferred data type.\n* **Bad field names**: Can happen in all file formats, when the column name specified in the file or record has a different casing than the specified or inferred schema.\n* **Corrupted files**: When a file cannot be read, which might be due to metadata or data corruption in binary file types such as Avro, Parquet, and ORC. On rare occasion, might be caused by long-lasting transient failures in the underlying storage system.\n* **Missing files**: A file that was discovered during query analysis time and no longer exists at processing time.\n\n### Handle bad records and files\n#### Use `badRecordsPath`\n\nWhen you set `badRecordsPath`, the specified path records exceptions for bad records or files encountered during data loading. \nIn addition to corrupt records and files, errors indicating deleted files, network connection exception, IO exception, and so on are ignored and recorded under the `badRecordsPath`. \nNote \nUsing the `badRecordsPath` option in a file-based data source has a few important limitations: \n* It is non-transactional and can lead to inconsistent results.\n* Transient errors are treated as failures.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/bad-records.html"} +{"content":"# \n### Handle bad records and files\n#### Unable to find input file\n\n```\nval df = spark.read\n.option(\"badRecordsPath\", \"\/tmp\/badRecordsPath\")\n.format(\"parquet\").load(\"\/input\/parquetFile\")\n\n\/\/ Delete the input parquet file '\/input\/parquetFile'\ndbutils.fs.rm(\"\/input\/parquetFile\")\n\ndf.show()\n\n``` \nIn the above example, since `df.show()` is unable to find the input file, Spark creates an exception file in JSON format to record the error. 
For example, `\/tmp\/badRecordsPath\/20170724T101153\/bad_files\/xyz` is the path of the exception file. This file is under the specified `badRecordsPath` directory, `\/tmp\/badRecordsPath`. `20170724T101153` is the creation time of this `DataFrameReader`. `bad_files` is the exception type. `xyz` is a file that contains a JSON record, which has the path of the bad file and the exception\/reason message.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/bad-records.html"} +{"content":"# \n### Handle bad records and files\n#### Input file contains bad record\n\n```\n\/\/ Creates a json file containing both parsable and corrupted records\nSeq(\"\"\"{\"a\": 1, \"b\": 2}\"\"\", \"\"\"{bad-record\"\"\").toDF().write.format(\"text\").save(\"\/tmp\/input\/jsonFile\")\n\nval df = spark.read\n.option(\"badRecordsPath\", \"\/tmp\/badRecordsPath\")\n.schema(\"a int, b int\")\n.format(\"json\")\n.load(\"\/tmp\/input\/jsonFile\")\n\ndf.show()\n\n``` \nIn this example, the DataFrame contains only the first parsable record (`{\"a\": 1, \"b\": 2}`). The second bad record (`{bad-record`) is recorded in the exception file, which is a JSON file located in `\/tmp\/badRecordsPath\/20170724T114715\/bad_records\/xyz`. The exception file contains the bad record, the path of the file containing the record, and the exception\/reason message. After you locate the exception files, you can use a JSON reader to process them.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/bad-records.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n#### Reliability for the data lakehouse\n\nThe architectural principles of the **reliability** pillar address the ability of a system to recover from failures and continue to function. 
\n![Reliability lakehouse architecture diagram for Databricks.](https:\/\/docs.databricks.com\/_images\/reliability.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reliability\/index.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n#### Reliability for the data lakehouse\n##### Principles of reliability\n\n1. **Design for failure** \nIn a highly distributed environment, outages can occur. For both the platform and the various workloads - such as streaming jobs, batch jobs, model training, and BI queries - failures must be anticipated and resilient solutions must be developed to increase reliability. The focus is on designing applications to recover quickly and, in the best case, automatically.\n2. **Manage data quality** \nData quality is fundamental to deriving accurate and meaningful insights from data. Data quality has many dimensions, including completeness, accuracy, validity, and consistency. It must be actively managed to improve the quality of the final data sets so that the data serves as reliable and trustworthy information for business users.\n3. **Design for autoscaling** \nStandard ETL processes, business reports, and dashboards often have predictable resource requirements in terms of memory and compute. However, new projects, seasonal tasks, or advanced approaches such as model training (for churn, forecasting, and maintenance) create spikes in resource requirements. For an organization to handle all of these workloads, it needs a scalable storage and compute platform. Adding new resources as needed must be easy, and only actual consumption should be charged for. Once the peak is over, resources can be freed up and costs reduced accordingly. This is often referred to as horizontal scaling (number of nodes) and vertical scaling (size of nodes).\n4. 
**Test recovery procedures** \nAn enterprise-wide disaster recovery strategy for most applications and systems requires an assessment of priorities, capabilities, limitations, and costs. A reliable disaster recovery approach regularly tests how workloads fail and validates recovery procedures. Automation can be used to simulate different failures or recreate scenarios that have caused failures in the past.\n5. **Automate deployments and workloads** \nAutomating deployments and workloads for the lakehouse helps standardize these processes, eliminate human error, improve productivity, and provide greater repeatability. This includes using \u201cconfiguration as code\u201d to avoid configuration drift, and \u201cinfrastructure as code\u201d to automate the provisioning of all required lakehouse and cloud services.\n6. **Set up monitoring, alerting & logging** \nWorkloads in the lakehouse typically integrate Databricks platform services and external cloud services, for example as data sources or targets. Successful execution can only occur if each service in the execution chain is functioning properly. When this is not the case, monitoring, alerting, and logging are important to detect and track problems and understand system behavior.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reliability\/index.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n#### Reliability for the data lakehouse\n##### Next: Best practices for reliability\n\nSee [Best practices for reliability](https:\/\/docs.databricks.com\/lakehouse-architecture\/reliability\/best-practices.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/reliability\/index.html"} +{"content":"# \n### What is Delta Lake?\n\nDelta Lake is the optimized storage layer that provides the foundation for tables in a lakehouse on Databricks. 
Delta Lake is [open source software](https:\/\/delta.io) that extends Parquet data files with a file-based transaction log for [ACID transactions](https:\/\/docs.databricks.com\/lakehouse\/acid.html) and scalable metadata handling. Delta Lake is fully compatible with Apache Spark APIs, and was developed for tight integration with Structured Streaming, allowing you to easily use a single copy of data for both batch and streaming operations and providing incremental processing at scale. \nDelta Lake is the default format for all operations on Databricks. Unless otherwise specified, all tables on Databricks are Delta tables. Databricks originally developed the Delta Lake protocol and continues to actively contribute to the open source project. Many of the optimizations and products in the Databricks platform build upon the guarantees provided by Apache Spark and Delta Lake. For information on optimizations on Databricks, see [Optimization recommendations on Databricks](https:\/\/docs.databricks.com\/optimizations\/index.html). \nFor reference information on Delta Lake SQL commands, see [Delta Lake statements](https:\/\/docs.databricks.com\/sql\/language-manual\/index.html#delta-lake-statements). \nThe Delta Lake transaction log has a well-defined open protocol that can be used by any system to read the log. See [Delta Transaction Log Protocol](https:\/\/github.com\/delta-io\/delta\/blob\/master\/PROTOCOL.md).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/index.html"} +{"content":"# \n### What is Delta Lake?\n#### Getting started with Delta Lake\n\nAll tables on Databricks are Delta tables by default. Whether you\u2019re using Apache Spark [DataFrames](https:\/\/docs.databricks.com\/getting-started\/dataframes.html) or SQL, you get all the benefits of Delta Lake just by saving your data to the lakehouse with default settings. 
\nFor examples of basic Delta Lake operations such as creating tables, reading, writing, and updating data, see [Tutorial: Delta Lake](https:\/\/docs.databricks.com\/delta\/tutorial.html). \nDatabricks has many recommendations for [best practices for Delta Lake](https:\/\/docs.databricks.com\/delta\/best-practices.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/index.html"} +{"content":"# \n### What is Delta Lake?\n#### Converting and ingesting data to Delta Lake\n\nDatabricks provides a number of products to accelerate and simplify loading data to your lakehouse. \n* Delta Live Tables: \n+ [Tutorial: Run your first ETL workload on Databricks](https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html)\n+ [Load data using streaming tables (Python\/SQL notebook)](https:\/\/docs.databricks.com\/ingestion\/onboard-data.html)\n+ [Load data using streaming tables in Databricks SQL](https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html)\n* [COPY INTO](https:\/\/docs.databricks.com\/ingestion\/copy-into\/index.html)\n* [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html)\n* [Add data UI](https:\/\/docs.databricks.com\/ingestion\/add-data\/index.html)\n* [Incrementally convert Parquet or Iceberg data to Delta Lake](https:\/\/docs.databricks.com\/delta\/clone-parquet.html)\n* [One-time conversion of Parquet or Iceberg data to Delta Lake](https:\/\/docs.databricks.com\/delta\/convert-to-delta.html)\n* [Third-party partners](https:\/\/docs.databricks.com\/integrations\/index.html) \nFor a full list of ingestion options, see [Ingest data into a Databricks lakehouse](https:\/\/docs.databricks.com\/ingestion\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/index.html"} +{"content":"# \n### What is Delta Lake?\n#### Updating and modifying Delta Lake tables\n\nAtomic transactions with Delta Lake provide many options for updating data and metadata. 
Databricks recommends that you do not interact directly with data and transaction log files in Delta Lake file directories, because doing so can corrupt your tables. \n* Delta Lake supports upserts using the merge operation. See [Upsert into a Delta Lake table using merge](https:\/\/docs.databricks.com\/delta\/merge.html).\n* Delta Lake provides numerous options for selective overwrites based on filters and partitions. See [Selectively overwrite data with Delta Lake](https:\/\/docs.databricks.com\/delta\/selective-overwrite.html).\n* You can manually or automatically update your table schema without rewriting data. See [Update Delta Lake table schema](https:\/\/docs.databricks.com\/delta\/update-schema.html).\n* Enable column mapping to rename or delete columns without rewriting data. See [Rename and drop columns with Delta Lake column mapping](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html).\n\n### What is Delta Lake?\n#### Incremental and streaming workloads on Delta Lake\n\nDelta Lake is optimized for Structured Streaming on Databricks. [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html) extends native capabilities with simplified infrastructure deployment, enhanced scaling, and managed data dependencies. \n* [Delta table streaming reads and writes](https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html)\n* [Use Delta Lake change data feed on Databricks](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html)\n\n### What is Delta Lake?\n#### Querying previous versions of a table\n\nEach write to a Delta table creates a new table version. You can use the transaction log to review modifications to your table and query previous table versions. 
See [Work with Delta Lake table history](https:\/\/docs.databricks.com\/delta\/history.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/index.html"} +{"content":"# \n### What is Delta Lake?\n#### Delta Lake schema enhancements\n\nDelta Lake validates schema on write, ensuring that all data written to a table matches the requirements you\u2019ve set. \n* [Schema enforcement](https:\/\/docs.databricks.com\/tables\/schema-enforcement.html)\n* [Constraints on Databricks](https:\/\/docs.databricks.com\/tables\/constraints.html)\n* [Delta Lake generated columns](https:\/\/docs.databricks.com\/delta\/generated-columns.html)\n* [Enrich Delta Lake tables with custom metadata](https:\/\/docs.databricks.com\/delta\/custom-metadata.html)\n\n### What is Delta Lake?\n#### Managing files and indexing data with Delta Lake\n\nDatabricks sets many default parameters for Delta Lake that impact the size of data files and number of table versions that are retained in history. Delta Lake uses a combination of metadata parsing and physical data layout to reduce the number of files scanned to fulfill any query. \n* [Use liquid clustering for Delta tables](https:\/\/docs.databricks.com\/delta\/clustering.html)\n* [Data skipping for Delta Lake](https:\/\/docs.databricks.com\/delta\/data-skipping.html)\n* [Compact data files with optimize on Delta Lake](https:\/\/docs.databricks.com\/delta\/optimize.html)\n* [Remove unused data files with vacuum](https:\/\/docs.databricks.com\/delta\/vacuum.html)\n* [Configure Delta Lake to control data file size](https:\/\/docs.databricks.com\/delta\/tune-file-size.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/index.html"} +{"content":"# \n### What is Delta Lake?\n#### Configuring and reviewing Delta Lake settings\n\nDatabricks stores all data and metadata for Delta Lake tables in cloud object storage. Many configurations can be set at either the table level or within the Spark session. 
You can review the details of the Delta table to discover what options are configured. \n* [Review Delta Lake table details with describe detail](https:\/\/docs.databricks.com\/delta\/table-details.html)\n* [Delta table properties reference](https:\/\/docs.databricks.com\/delta\/table-properties.html)\n\n### What is Delta Lake?\n#### Data pipelines using Delta Lake and Delta Live Tables\n\nDatabricks encourages users to leverage a [medallion architecture](https:\/\/docs.databricks.com\/lakehouse\/medallion.html) to process data through a series of tables as data is cleaned and enriched. [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html) simplifies ETL workloads through optimized execution and automated infrastructure deployment and scaling.\n\n### What is Delta Lake?\n#### Delta Lake feature compatibility\n\nNot all Delta Lake features are available in all versions of Databricks Runtime. For information about Delta Lake versioning, see [How does Databricks manage Delta Lake feature compatibility?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/index.html"} +{"content":"# \n### What is Delta Lake?\n#### Delta Lake API documentation\n\nFor most read and write operations on Delta tables, you can use [Spark SQL](https:\/\/docs.databricks.com\/sql\/language-manual\/index.html) or Apache Spark [DataFrame](https:\/\/docs.databricks.com\/getting-started\/dataframes.html) APIs. \nFor Delta Lake-specific SQL statements, see [Delta Lake statements](https:\/\/docs.databricks.com\/sql\/language-manual\/index.html#delta-lake-statements). \nDatabricks ensures binary compatibility with Delta Lake APIs in Databricks Runtime. To view the Delta Lake API version packaged in each Databricks Runtime version, see the **System environment** section in the relevant article in the [Databricks Runtime release notes](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). 
For documentation on Delta Lake APIs for Python, Scala, and Java, see the [OSS Delta Lake documentation](https:\/\/docs.delta.io\/latest\/delta-apidoc.html#delta-spark).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/index.html"} +{"content":"# \n### `\ud83d\uddc2\ufe0f Request Log` schema\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/request-log.html"} +{"content":"# \n### `\ud83d\uddc2\ufe0f Request Log` schema\n#### `df.printSchema()`\n\n```\nroot\n|-- request: struct (nullable = false)\n| |-- request_id: string (nullable = true)\n| |-- conversation_id: string (nullable = true)\n| |-- timestamp: timestamp (nullable = true)\n| |-- messages: array (nullable = true)\n| | |-- element: struct (containsNull = true)\n| | | |-- role: string (nullable = true)\n| | | |-- content: string (nullable = true)\n| |-- last_input: string (nullable = true)\n|-- trace: struct (nullable = true)\n| |-- app_version_id: string (nullable = true)\n| |-- start_timestamp: timestamp (nullable = true)\n| |-- end_timestamp: timestamp (nullable = true)\n| |-- is_truncated: boolean (nullable = true)\n| |-- steps: array (nullable = true)\n| | |-- element: struct (containsNull = true)\n| | | |-- step_id: string (nullable = true)\n| | | |-- name: string (nullable = true)\n| | | |-- type: string (nullable = true)\n| | | |-- start_timestamp: timestamp (nullable = true)\n| | | |-- end_timestamp: timestamp (nullable = true)\n| | | |-- retrieval: struct (nullable = true)\n| | | | |-- query_text: string (nullable = true)\n| | | | |-- chunks: array (nullable = true)\n| | | | | |-- element: struct (containsNull = true)\n| | | | | | |-- chunk_id: string (nullable = true)\n| | | | | 
| |-- parent_doc_id: string (nullable = true)\n| | | | | | |-- content: string (nullable = true)\n| | | |-- text_generation: struct (nullable = true)\n| | | | |-- prompt: string (nullable = true)\n| | | | |-- generated_text: string (nullable = true)\n|-- output: struct (nullable = false)\n| |-- choices: array (nullable = true)\n| | |-- element: struct (containsNull = true)\n| | | |-- message: struct (nullable = true)\n| | | | |-- role: string (nullable = true)\n| | | | |-- content: string (nullable = true)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/request-log.html"} +{"content":"# \n### `\ud83d\uddc2\ufe0f Request Log` schema\n#### `df.schema`\n\n```\nfrom pyspark.sql.types import *\n\nschema = StructType(\n[\nStructField(\n\"request\",\nStructType(\n[\nStructField(\"request_id\", StringType(), True),\nStructField(\"conversation_id\", StringType(), True),\nStructField(\"timestamp\", TimestampType(), True),\nStructField(\n\"messages\",\nArrayType(\nStructType(\n[\nStructField(\"role\", StringType(), True),\nStructField(\"content\", StringType(), True),\n]\n),\nTrue,\n),\nTrue,\n),\nStructField(\"last_input\", StringType(), True),\n]\n),\nFalse,\n),\nStructField(\n\"trace\",\nStructType(\n[\nStructField(\"app_version_id\", StringType(), True),\nStructField(\"start_timestamp\", TimestampType(), True),\nStructField(\"end_timestamp\", TimestampType(), True),\nStructField(\"is_truncated\", BooleanType(), True),\nStructField(\n\"steps\",\nArrayType(\nStructType(\n[\nStructField(\"step_id\", StringType(), True),\nStructField(\"name\", StringType(), True),\nStructField(\"type\", StringType(), True),\nStructField(\n\"start_timestamp\", TimestampType(), True\n),\nStructField(\"end_timestamp\", TimestampType(), True),\nStructField(\n\"retrieval\",\nStructType(\n[\nStructField(\n\"query_text\", StringType(), 
True\n),\nStructField(\n\"chunks\",\nArrayType(\nStructType(\n[\nStructField(\n\"chunk_id\",\nStringType(),\nTrue,\n),\nStructField(\n\"parent_doc_id\",\nStringType(),\nTrue,\n),\nStructField(\n\"content\",\nStringType(),\nTrue,\n),\n]\n),\nTrue,\n),\nTrue,\n),\n]\n),\nTrue,\n),\nStructField(\n\"text_generation\",\nStructType(\n[\nStructField(\n\"prompt\", StringType(), True\n),\nStructField(\n\"generated_text\", StringType(), True\n),\n]\n),\nTrue,\n),\n]\n),\nTrue,\n),\nTrue,\n),\n]\n),\nTrue,\n),\nStructField(\n\"output\",\nStructType(\n[\nStructField(\n\"choices\",\nArrayType(\nStructType(\n[\nStructField(\n\"message\",\nStructType(\n[\nStructField(\"role\", StringType(), True),\nStructField(\n\"content\", StringType(), True\n),\n]\n),\nTrue,\n)\n]\n),\nTrue,\n),\nTrue,\n)\n]\n),\nFalse,\n),\n]\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/request-log.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n#### What are Hugging Face Transformers?\n\nThis article provides an introduction to Hugging Face Transformers on Databricks. It includes guidance on why to use Hugging Face Transformers and how to install it on your cluster.\n\n#### What are Hugging Face Transformers?\n##### Background for Hugging Face Transformers\n\n[Hugging Face Transformers](https:\/\/huggingface.co\/docs\/transformers\/index) is an open-source framework for deep learning created by Hugging Face. It provides APIs and tools to download state-of-the-art pre-trained models and further tune them to maximize performance. These models support common tasks in different modalities, such as natural language processing, computer vision, audio, and multi-modal applications. \nNote \n[Apache License 2.0](https:\/\/github.com\/huggingface\/transformers\/blob\/main\/LICENSE). 
\n[Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html) includes Hugging Face `transformers` in Databricks Runtime 10.4 LTS ML and above, and includes Hugging Face [datasets](https:\/\/huggingface.co\/docs\/datasets\/index), [accelerate](https:\/\/huggingface.co\/docs\/accelerate\/index), and [evaluate](https:\/\/huggingface.co\/docs\/evaluate\/index) in Databricks Runtime 13.0 ML and above. \nTo check which version of Hugging Face is included in your configured Databricks Runtime ML version, see the Python libraries section on the relevant [release notes](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/index.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n#### What are Hugging Face Transformers?\n##### Why use Hugging Face Transformers?\n\nFor many applications, such as sentiment analysis and text summarization, pre-trained models work well without any additional model training. \nHugging Face Transformers pipelines encode best practices and have default models selected for different tasks, making it easy to get started. Pipelines make it easy to use GPUs when available and allow batching of items sent to the GPU for better throughput performance. \nHugging Face provides: \n* A [model hub](https:\/\/huggingface.co\/models) containing many pre-trained models.\n* The [\ud83e\udd17 Transformers library](https:\/\/huggingface.co\/docs\/transformers\/index) that supports the download and use of these models for NLP applications and fine-tuning. 
It is common to need both a tokenizer and a model for natural language processing tasks.\n* [\ud83e\udd17 Transformers pipelines](https:\/\/huggingface.co\/docs\/transformers\/v4.26.1\/en\/pipeline_tutorial) that have a simple interface for most natural language processing tasks.\n\n#### What are Hugging Face Transformers?\n##### Install `transformers`\n\nIf the Databricks Runtime version on your cluster does not include Hugging Face `transformers`, you can install the latest Hugging Face `transformers` library as a [Databricks PyPI library](https:\/\/docs.databricks.com\/libraries\/index.html). \n```\n%pip install transformers\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/index.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n#### What are Hugging Face Transformers?\n##### Install model dependencies\n\nDifferent models may have different dependencies. Databricks recommends that you use [%pip magic commands](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html#manage-libraries-with-pip-commands) to install these dependencies as needed. 
\nThe following are common dependencies: \n* `librosa`: supports decoding audio files.\n* `soundfile`: required while generating some audio datasets.\n* `bitsandbytes`: required when using `load_in_8bit=True`.\n* `SentencePiece`: used as the tokenizer for NLP models.\n* `timm`: required by [DetrForSegmentation](https:\/\/huggingface.co\/docs\/transformers\/v4.27.2\/en\/model_doc\/detr#transformers.DetrForSegmentation).\n\n#### What are Hugging Face Transformers?\n##### Single node training\n\nTo test and migrate single-machine workflows, use a [Single Node cluster](https:\/\/docs.databricks.com\/compute\/configure.html#single-node).\n\n#### What are Hugging Face Transformers?\n##### Additional resources\n\nThe following articles include example notebooks and guidance for how to use Hugging Face `transformers` for large language model (LLM) fine-tuning and model inference on Databricks. \n* [Prepare data for fine tuning Hugging Face models](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/load-data.html)\n* [Fine-tune Hugging Face models for a single GPU](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/fine-tune-model.html)\n* [Model inference using Hugging Face Transformers for NLP](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/model-inference-nlp.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/index.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### Load data using the add data UI\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThe add data UI allows you to easily load data into Databricks from a variety of sources. 
To access the UI, click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New > Add data**.\n\n### Load data using the add data UI\n#### How do I add data to Databricks?\n\nThere are multiple ways to load data using the add data UI: \n* Load data from cloud object storage using Unity Catalog external locations. For more information, see [Load data using a Unity Catalog external location](https:\/\/docs.databricks.com\/ingestion\/add-data\/add-data-external-locations.html).\n* Select **Create or modify table** to load CSV, TSV, JSON, XML, Avro, Parquet, or text files into Delta Lake tables.\n* Select **Upload files to volume** to upload files in any format to a Unity Catalog volume, including structured, semi-structured, and unstructured data. For semi-structured or structured files, you can use Auto Loader or `COPY INTO` to create tables starting from the files. You can also run various machine learning and data science workloads on files uploaded to a volume. See [Upload files to a Unity Catalog volume](https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-to-volume.html).\n* Select **DBFS** to use the [legacy DBFS file upload](https:\/\/docs.databricks.com\/archive\/legacy\/data-tab.html).\n* Other icons launch sample notebooks to configure connections to many data sources. \nSee [Connect to data sources](https:\/\/docs.databricks.com\/connect\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/add-data\/index.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### Load data using the add data UI\n#### Add data with Partner Connect\n\nThe Databricks Partner Connect program provides integrations maintained by independent software vendors to easily connect to most enterprise data systems. 
You can configure these connections through the add data UI using the following instructions: \n* [Connect to Fivetran](https:\/\/docs.databricks.com\/partners\/ingestion\/fivetran.html) \nNote \nYou must be a Databricks workspace admin to create the connection to Fivetran. If the connection exists, a Fivetran Account Administrator must add you to the account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/add-data\/index.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Databricks Foundation Model APIs\n\nThis article provides an overview of the Foundation Model APIs in Databricks. It includes requirements for use, supported models, and limitations.\n\n#### Databricks Foundation Model APIs\n##### What are Databricks Foundation Model APIs?\n\n[Databricks Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) now supports Foundation Model APIs which allow you to access and query state-of-the-art open models from a serving endpoint. With Foundation Model APIs, you can quickly and easily build applications that leverage a high-quality generative AI model without maintaining your own model deployment. \nThe Foundation Model APIs are provided in two pricing modes: \n* Pay-per-token: This is the easiest way to start accessing foundation models on Databricks and is recommended for beginning your journey with Foundation Model APIs. This mode is not designed for high-throughput applications or performant production workloads.\n* Provisioned throughput: This mode is recommended for all production workloads, especially those that require high throughput, performance guarantees, fine-tuned models, or have additional security requirements. Provisioned throughput endpoints are available with compliance certifications like HIPAA. 
\nSee [Use Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#use-foundation-apis) for guidance on how to use these two modes and the supported models. \nUsing the Foundation Model APIs you can: \n* Query a generalized LLM to verify a project\u2019s validity before investing more resources.\n* Query a generalized LLM in order to create a quick proof-of-concept for an LLM-based application before investing in training and deploying a custom model.\n* Use a foundation model, along with a vector database, to build a chatbot using retrieval augmented generation (RAG).\n* Replace proprietary models with open alternatives to optimize for cost and performance.\n* Efficiently compare LLMs to see which is the best candidate for your use case, or swap a production model with a better performing one.\n* Build an LLM application for development or production on top of a scalable, SLA-backed LLM serving solution that can support your production traffic spikes.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Databricks Foundation Model APIs\n##### Requirements\n\n* Databricks API token to authenticate endpoint requests.\n* Serverless compute (for provisioned throughput models).\n* A workspace in a supported region: \n+ [Pay-per-token regions](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html#regions).\n+ [Provisioned throughput regions](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html#regions). 
\nNote \nFor provisioned throughput workloads that use the DBRX Base model, see [Foundation Model APIs limits](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html#fmapi-limits) for region availability.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Databricks Foundation Model APIs\n##### Use Foundation Model APIs\n\nYou have multiple options for using the Foundation Model APIs. \nThe APIs are compatible with OpenAI, so you can even use the OpenAI client for querying. You can also use the UI, the Foundation Models APIs Python SDK, the MLflow Deployments SDK, or the REST API for querying supported models. Databricks recommends using the MLflow Deployments SDK or REST API for extended interactions and the UI for trying out the feature. \nSee [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html) for scoring examples. \n### Pay-per-token Foundation Model APIs \nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nPay-per-token models are accessible in your Databricks workspace, and are recommended for getting started. To access them in your workspace, navigate to the **Serving** tab in the left sidebar. The Foundation Model APIs are located at the top of the Endpoints list view. \n![Serving endpoints list](https:\/\/docs.databricks.com\/_images\/serving-endpoints-list.png) \nThe following table summarizes the supported models for pay-per-token. See [Supported models for pay-per-token](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html) for additional model information. \nIf you want to test out and chat with these models you can do so using the AI Playground. 
See [Chat with supported LLMs using AI Playground](https:\/\/docs.databricks.com\/large-language-models\/ai-playground.html). \n| Model | Task type | Endpoint |\n| --- | --- | --- |\n| DBRX Instruct | Chat | `databricks-dbrx-instruct` |\n| Meta-Llama-3-70B-Instruct | Chat | `databricks-meta-llama-3-70b-instruct` |\n| Meta-Llama-2-70B-Chat | Chat | `databricks-llama-2-70b-chat` |\n| Mixtral-8x7B Instruct | Chat | `databricks-mixtral-8x7b-instruct` |\n| MPT 7B Instruct | Completion | `databricks-mpt-7b-instruct` |\n| MPT 30B Instruct | Completion | `databricks-mpt-30b-instruct` |\n| BGE Large (English) | Embedding | `databricks-bge-large-en` | \n* See [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html) for guidance on how to query Foundation Model APIs.\n* See [Foundation model REST API reference](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html) for required parameters and syntax. \n### Provisioned throughput Foundation Model APIs \nProvisioned throughput is generally available and Databricks recommends provisioned throughput for production workloads. Provisioned throughput provides endpoints with optimized inference for foundation model workloads that require performance guarantees. See [Provisioned throughput Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html) for a step-by-step guide on how to deploy Foundation Model APIs in provisioned throughput mode. \nProvisioned throughput support includes: \n* **Base models of all sizes**, such as DBRX Base. Base models can be accessed using the Databricks Marketplace, or you can alternatively download them from Hugging Face or another external source and register them in Unity Catalog. 
The latter approach works with any fine-tuned variant of the supported models, irrespective of the fine-tuning method employed.\n* **Fine-tuned variants of base models**, such as LlamaGuard-7B. This includes models that are fine-tuned on proprietary data.\n* **Fully custom weights and tokenizers**, such as those trained from scratch or continued pretrained or **other variations using the base model architecture** (such as CodeLlama, Yi-34B-Chat, or SOLAR-10.7B). \nThe following table summarizes the supported model architectures for provisioned throughput. \n| Model architecture | Task types | Notes |\n| --- | --- | --- |\n| DBRX | Chat or Completion | See [Foundation Model APIs limits](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html#fmapi-limits) for region availability. |\n| Meta Llama 3 | Chat or Completion | |\n| Meta Llama 2 | Chat or Completion | |\n| Mistral | Chat or Completion | |\n| Mixtral | Chat or Completion | |\n| MPT | Chat or Completion | |\n| BGE v1.5 (English) | Embedding | |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Databricks Foundation Model APIs\n##### Limitations\n\nSee [Model Serving limits and regions](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html).\n\n#### Databricks Foundation Model APIs\n##### Additional resources\n\n* [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html)\n* [Provisioned throughput Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html)\n* [Batch inference using Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/fmapi-batch-inference.html)\n* [Supported models for 
pay-per-token](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html)\n* [Foundation model REST API reference](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### What is Auto Loader?\n\nAuto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any additional setup.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### What is Auto Loader?\n#### How does Auto Loader work?\n\nAuto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. Auto Loader can load data files from AWS S3 (`s3:\/\/`), Azure Data Lake Storage Gen2 (ADLS Gen2, `abfss:\/\/`), Google Cloud Storage (GCS, `gs:\/\/`), Azure Blob Storage (`wasbs:\/\/`), ADLS Gen1 (`adl:\/\/`), and Databricks File System (DBFS, `dbfs:\/`). Auto Loader can ingest `JSON`, `CSV`, `XML`, `PARQUET`, `AVRO`, `ORC`, `TEXT`, and `BINARYFILE` file formats. \nNote \n* The legacy Windows Azure Storage Blob driver (WASB) has been deprecated. ABFS has numerous benefits over WASB. See [Azure documentation on ABFS](https:\/\/learn.microsoft.com\/azure\/storage\/blobs\/data-lake-storage-abfs-driver). For documentation for working with the legacy WASB driver, see [Connect to Azure Blob Storage with WASB (legacy)](https:\/\/docs.databricks.com\/archive\/storage\/wasb-blob.html).\n* Azure has announced the pending retirement of [Azure Data Lake Storage Gen1](https:\/\/learn.microsoft.com\/azure\/data-lake-store\/data-lake-store-overview). Databricks recommends migrating all data from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2. 
If you have not yet migrated, see [Accessing Azure Data Lake Storage Gen1 from Databricks](https:\/\/docs.databricks.com\/archive\/storage\/azure-datalake.html). \nAuto Loader provides a Structured Streaming source called `cloudFiles`. Given an input directory path on the cloud file storage, the `cloudFiles` source automatically processes new files as they arrive, with the option of also processing existing files in that directory. Auto Loader has support for both Python and SQL in Delta Live Tables. \nYou can use Auto Loader to process billions of files to migrate or backfill a table. Auto Loader scales to support near real-time ingestion of millions of files per hour.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### What is Auto Loader?\n#### How does Auto Loader track ingestion progress?\n\nAs files are discovered, their metadata is persisted in a scalable key-value store (RocksDB) in the *checkpoint location* of your Auto Loader pipeline. This key-value store ensures that data is processed exactly once. \nIn case of failures, Auto Loader can resume from where it left off by information stored in the checkpoint location and continue to provide exactly-once guarantees when writing data into Delta Lake. You don\u2019t need to maintain or manage any state yourself to achieve fault tolerance or exactly-once semantics.\n\n### What is Auto Loader?\n#### Incremental ingestion using Auto Loader with Delta Live Tables\n\nDatabricks recommends Auto Loader in Delta Live Tables for incremental data ingestion. 
Delta Live Tables extends functionality in Apache Spark Structured Streaming and allows you to write just a few lines of declarative Python or SQL to deploy a production-quality data pipeline with: \n* Autoscaling compute infrastructure for cost savings\n* Data quality checks with [expectations](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html)\n* Automatic [schema evolution](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html) handling\n* Monitoring via metrics in the [event log](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html#event-log) \nYou do not need to provide a schema or checkpoint location because Delta Live Tables automatically manages these settings for your pipelines. See [Load data with Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/load.html). \nDatabricks also recommends Auto Loader whenever you use Apache Spark Structured Streaming to ingest data from cloud object storage. APIs are available in Python and Scala.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### What is Auto Loader?\n#### Get started with Databricks Auto Loader\n\nSee the following articles to get started configuring incremental data ingestion using Auto Loader with Delta Live Tables: \n* [Tutorial: Run your first ETL workload on Databricks using sample data (Python, SQL notebook)](https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html)\n* [Load data from cloud object storage into streaming tables using Auto Loader (Notebook: Python, SQL)](https:\/\/docs.databricks.com\/ingestion\/onboard-data.html) \n* [Load data from cloud object storage into streaming tables using Auto Loader (Databricks SQL Editor)](https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html)\n\n### What is Auto Loader?\n#### Examples: Common Auto Loader patterns\n\nFor examples of common Auto Loader patterns, see [Common 
data loading patterns](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/patterns.html).\n\n### What is Auto Loader?\n#### Configure Auto Loader options\n\nYou can tune Auto Loader based on data volume, variety, and velocity. \n* [Configure schema inference and evolution in Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html)\n* [Configure Auto Loader for production workloads](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/production.html) \nFor a full list of Auto Loader options, see: \n* [Auto Loader options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html) \nIf you encounter unexpected performance, see the [FAQ](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/faq.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### What is Auto Loader?\n#### Configure Auto Loader file detection modes\n\nAuto Loader supports two [file detection modes](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-detection-modes.html). See: \n* [What is Auto Loader directory listing mode?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/directory-listing-mode.html)\n* [What is Auto Loader file notification mode?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/file-notification-mode.html)\n\n### What is Auto Loader?\n#### Benefits of Auto Loader over using Structured Streaming directly on files\n\nIn Apache Spark, you can read files incrementally using `spark.readStream.format(fileFormat).load(directory)`. Auto Loader provides the following benefits over the file source: \n* Scalability: Auto Loader can discover billions of files efficiently. Backfills can be performed asynchronously to avoid wasting any compute resources.\n* Performance: The cost of discovering files with Auto Loader scales with the number of files that are being ingested instead of the number of directories that the files may land in. 
See [What is Auto Loader directory listing mode?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/directory-listing-mode.html).\n* Schema inference and evolution support: Auto Loader can detect schema drifts, notify you when schema changes happen, and rescue data that would have been otherwise ignored or lost. See [How does Auto Loader schema inference work?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html#schema-inference).\n* Cost: Auto Loader uses native cloud APIs to get lists of files that exist in storage. In addition, Auto Loader\u2019s file notification mode can help reduce your cloud costs further by avoiding directory listing altogether. Auto Loader can automatically set up file notification services on storage to make file discovery much cheaper.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Create an ODBC DSN-less connection string for the Databricks ODBC Driver\n\nThis article describes how to create an ODBC Data Source Name (DSN)-less connection string for the [Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/index.html). \nTo create an ODBC DSN-less connection string, construct the string in the following format. Line breaks have been added for readability. 
The string must not contain these line breaks: \n```\nDriver=<path-to-driver>;\nHost=<server-hostname>;\nPort=443;\nHTTPPath=<http-path>;\nSSL=1;\nThriftTransport=2;\n<setting1>=<value1>;\n<setting2>=<value2>;\n<settingN>=<valueN>\n\n``` \n* To get the value for `<path-to-driver>`, see [Download and install the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html).\n* To get the values for `<server-hostname>` and `<http-path>`, see [Compute settings for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/compute.html).\n* `<setting>=<value>` is one or more pairs of [authentication settings](https:\/\/docs.databricks.com\/integrations\/odbc\/authentication.html) and any special or advanced [driver capability settings](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html).\n\n######### Create an ODBC DSN-less connection string for the Databricks ODBC Driver\n########## Next steps\n\nTo use your DSN-less connection string with your target app, tool, client, SDK, or API, see [Technology partners](https:\/\/docs.databricks.com\/integrations\/index.html) or your provider\u2019s documentation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/dsn-less.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Couchbase\n\n[Couchbase](https:\/\/www.couchbase.com\/) provides an enterprise-class, multi-cloud to edge database that offers the robust capabilities required for business-critical applications on a highly scalable and available platform. 
\nThe following notebook shows how to set up Couchbase with Databricks.\n\n#### Couchbase\n##### Couchbase notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/couchbase.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/couchbase.html"} +{"content":"# Technology partners\n## Databricks sign-on from partner solutions\n#### Configure Databricks sign-on from Tableau Server\n\nThis article describes how to configure Databricks sign-on from Tableau Server. After you complete this one-time configuration as a Databricks account admin, users can connect from Tableau Server using SSO authentication. \nThe steps in this article aren\u2019t needed for Tableau Desktop and Tableau Cloud, which are enabled as OAuth applications in your Databricks account by default. \nYou can configure Tableau login with SSO using OIDC and SAML. See [Configure Tableau and PowerBI OAuth with SAML SSO](https:\/\/docs.databricks.com\/release-notes\/product\/2023\/september.html#saml-sso). OAuth tokens for Tableau expire after 90 days. \nThis article is specific to custom Tableau Server OAuth application creation. 
For generic custom OAuth application creation steps, see the following: \n* [Enable custom OAuth applications using the Databricks UI](https:\/\/docs.databricks.com\/integrations\/enable-disable-oauth.html#enable-custom-app-ui) \n* [Enable custom OAuth applications using the CLI](https:\/\/docs.databricks.com\/integrations\/enable-disable-oauth.html#enable-custom-app-cli)\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/configure-oauth-tableau.html"} +{"content":"# Technology partners\n## Databricks sign-on from partner solutions\n#### Configure Databricks sign-on from Tableau Server\n##### Before you begin\n\nBefore you configure Databricks sign-on from Tableau Server: \n* You must be a Databricks account administrator.\n* [Install the Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/install.html) and [set up authentication between the Databricks CLI and your Databricks account](https:\/\/docs.databricks.com\/dev-tools\/cli\/authentication.html). \n* (Optional) To use a custom identity provider (IdP) for Tableau OAuth login, see [SSO in your Databricks account console](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/single-sign-on\/index.html). \nYou must also meet the following Tableau requirements: \n* You have a Tableau Server installation with one of the following versions: \n+ 2021.4.13 or above\n+ 2022.1.9 or above\n+ 2022.3.1 or above\n* You\u2019re a Tableau Server administrator. \n### Add Tableau Server as an OAuth application \nTo add Tableau Server as an OAuth application to your Databricks account, do the following: \n1. [Locate your account ID](https:\/\/docs.databricks.com\/admin\/account-settings\/index.html#locate-your-account-id).\n2. Locate your Tableau Server URL.\n3. 
Run the following command: \n```\ndatabricks account custom-app-integration create --confidential --json '{\"name\":\"<name>\", \"redirect_urls\":[\"<redirect-url>\"], \"scopes\":[\"all-apis\", \"offline_access\", \"openid\", \"profile\", \"email\"]}'\n\n``` \n* Replace `<name>` with a name for your custom OAuth application.\n* For `<redirect-url>`, append `\/auth\/add_oauth_token` to your Tableau Server URL. For example, `https:\/\/example.tableauserver.com\/auth\/add_oauth_token`. For more information about supported values, see [POST \/api\/2.0\/accounts\/{account\\_id}\/oauth2\/custom-app-integrations](https:\/\/docs.databricks.com\/api\/account\/customappintegration\/create) in the REST API reference. \nA client ID and a client secret are generated, and the following output is returned: \n```\n{\"integration_id\":\"<integration-id>\",\"client_id\":\"<client-id>\",\"client_secret\":\"<client-secret>\"}\n\n``` \nNote \nEnabling an OAuth application can take 30 minutes to process.\n4. Securely store the client secret. \nImportant \nYou can\u2019t retrieve the client secret later.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/configure-oauth-tableau.html"} +{"content":"# Technology partners\n## Databricks sign-on from partner solutions\n#### Configure Databricks sign-on from Tableau Server\n##### Configure OAuth in Tableau Server\n\nTo configure OAuth in Tableau Server, do the following: \n1. Sign in to Tableau Server as a server administrator.\n2. In the sidebar, click **Settings** > **OAuth Client Registry** > **Add OAuth client**.\n3. For **Connection Type**, select **Databricks**.\n4. For **Client ID**, enter the client ID that was generated in [Add Tableau Server as an OAuth application](https:\/\/docs.databricks.com\/integrations\/configure-oauth-tableau.html#add-custom-oauth).\n5. 
For **Client Secret**, enter the client secret that was generated in [Add Tableau Server as an OAuth application](https:\/\/docs.databricks.com\/integrations\/configure-oauth-tableau.html#add-custom-oauth).\n6. For **Redirect URL**, enter the redirect URL from [Add Tableau Server as an OAuth application](https:\/\/docs.databricks.com\/integrations\/configure-oauth-tableau.html#add-custom-oauth).\n7. Click **Add OAuth client**.\n\n#### Configure Databricks sign-on from Tableau Server\n##### Troubleshoot OAuth configuration\n\nThis section describes how to resolve common issues with OAuth configuration. \n### 404 error from your IdP \n**Issue**: When you try to authenticate to Tableau Server, you see a 404 error. \n**Cause**: OAuth is misconfigured. \n**Solution**: Ensure that you have correctly [configured OAuth](https:\/\/docs.databricks.com\/integrations\/configure-oauth-tableau.html#configure-oauth-in-tableau-server).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/configure-oauth-tableau.html"} +{"content":"# Technology partners\n## Databricks sign-on from partner solutions\n#### Configure Databricks sign-on from Tableau Server\n##### Next steps\n\nUsers can now use SSO to authenticate to Databricks from Tableau Server. See [Connect Tableau to Databricks](https:\/\/docs.databricks.com\/partners\/bi\/tableau.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/configure-oauth-tableau.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Track model development using MLflow\n#### Track ML and deep learning training runs\n###### Build dashboards with the MLflow Search API\n\nYou can pull aggregate metrics on your MLflow runs using the [mlflow.search\\_runs](https:\/\/www.mlflow.org\/docs\/latest\/search-syntax.html#programmatically-searching-runs) API and display them in a dashboard. Regularly reviewing such metrics can provide insight into your progress and productivity. 
For example, you can track improvement of a goal metric like revenue or accuracy over time, across many runs and\/or experiments. \nThis notebook demonstrates how to build the following custom dashboard using the [mlflow.search\\_runs](https:\/\/www.mlflow.org\/docs\/latest\/search-syntax.html#programmatically-searching-runs) API: \n![Search API Dashboard](https:\/\/docs.databricks.com\/_images\/search_api_dashboard.png) \nYou can either run the notebook on your own experiments or against autogenerated mock experiment data.\n\n###### Build dashboards with the MLflow Search API\n####### Dashboard comparing MLflow runs notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/custom-dashboards-mlflow-search-api.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/build-dashboards.html"} +{"content":"# \n### `\ud83e\udd16 LLM Judge`\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html)\n\n### `\ud83e\udd16 LLM Judge`\n#### Conceptual overview\n\n`\ud83e\udd16 LLM Judge` provides LLM-judged feedback on your RAG Application. This enables you to gain additional insight into your application\u2019s quality.\n\n### `\ud83e\udd16 LLM Judge`\n#### Configuring `\ud83e\udd16 LLM Judge`\n\n1. Open the `rag-config.yml` in your IDE\/code editor.\n2. Edit the `global_config.evaluation.assessment_judges` configuration. 
\n```\nevaluation:\n  # Configure the LLM judges for assessments\n  assessment_judges:\n    - judge_name: LLaMa2-70B-Chat\n      endpoint_name: databricks-llama-2-70b-chat # Model Serving endpoint name\n      assessments: # pre-defined list based on the names of metrics\n        - harmful\n        - answer_correct\n        - faithful_to_context\n        - relevant_to_question_and_context\n\n``` \nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** Support for customer-defined `\ud83e\udd16 LLM Judge` assessments.\n3. RAG Studio automatically computes `\ud83e\udd16 LLM Judge` assessments for every invocation of your `\ud83d\udd17 Chain`. \nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** Configuration to adjust when the `\ud83e\udd16 LLM Judge` is or isn\u2019t run, including only sampling x% of responses.\n\n### `\ud83e\udd16 LLM Judge`\n#### Data flows\n\n### Online evaluation \n![online](https:\/\/docs.databricks.com\/_images\/llm-judge-online.png)\n\n### `\ud83e\udd16 LLM Judge`\n#### Offline evaluation\n\n![offline](https:\/\/docs.databricks.com\/_images\/llm-judge-offline.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/llm-judge.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Restart the Python process on Databricks\n\nYou can programmatically restart the Python process on Databricks to ensure that locally installed or upgraded libraries function correctly in the Python kernel for your current SparkSession. \nWhen you restart the Python process, you lose Python state information. Databricks recommends installing all session-scoped libraries at the beginning of a notebook and running `dbutils.library.restartPython()` to clean up the Python process before proceeding. 
\nYou can use this process in interactive notebooks or for Python tasks scheduled with workflows.\n\n#### Restart the Python process on Databricks\n##### What is `dbutils.library.restartPython`?\n\nThe helper function `dbutils.library.restartPython()` is the recommended way to restart the Python process in a Databricks notebook. \nNote \nMost functions in the `dbutils.library` submodule are deprecated. Databricks strongly recommends using `%pip` to manage all notebook-scoped library installations. See [Notebook-scoped Python libraries](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html).\n\n#### Restart the Python process on Databricks\n##### When should you restart your Python process?\n\nIt is a good idea to restart your Python process anytime you perform a local installation that includes any of the following: \n* Specifying a version of a package included in Databricks Runtime.\n* Installing a custom version of a package included in Databricks Runtime.\n* Explicitly updating a library to the newest version using `%pip install <library-name> --upgrade`.\n* Configuring a custom environment from a local `requirements.txt` file.\n* Installing a library that requires changing the versions of dependent libraries that are included in Databricks Runtime.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/restart-python-process.html"} +{"content":"# What is Databricks?\n### Databricks architecture overview\n\nThis article provides a high-level overview of Databricks architecture, including its enterprise architecture, in combination with AWS.\n\n### Databricks architecture overview\n#### High-level architecture\n\nDatabricks operates out of a **control plane** and a **compute plane**. \n* The **control plane** includes the backend services that Databricks manages in your Databricks account. The web application is in the control plane.\n* The **compute plane** is where your data is processed. 
There are two types of compute planes depending on the compute that you are using. \n+ For serverless compute, the serverless compute resources run in a *serverless compute plane* in your Databricks account.\n+ For classic Databricks compute, the compute resources are in your AWS account in what is called the *classic compute plane*. This refers to the network in your AWS account and its resources. \nThe following diagram describes the overall Databricks architecture. \n![Diagram: Databricks architecture](https:\/\/docs.databricks.com\/_images\/architecture.png)\n\n### Databricks architecture overview\n#### Serverless compute plane\n\nIn the serverless compute plane, Databricks compute resources run in a compute layer within your Databricks account. Databricks creates a serverless compute plane in the same AWS region as your workspace\u2019s classic compute plane. \nTo protect customer data within the serverless compute plane, serverless compute runs within a network boundary for the workspace, with various layers of security to isolate different Databricks customer workspaces and additional network controls between clusters of the same customer. \nTo learn more about networking in the serverless compute plane, see [Serverless compute plane networking](https:\/\/docs.databricks.com\/security\/network\/serverless-network-security\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/overview.html"} +{"content":"# What is Databricks?\n### Databricks architecture overview\n#### Classic compute plane\n\nIn the classic compute plane, Databricks compute resources run in your AWS account. New compute resources are created within each workspace\u2019s virtual network in the customer\u2019s AWS account. \nA classic compute plane has natural isolation because it runs in each customer\u2019s own AWS account. 
To learn more about networking in the classic compute plane, see [Classic compute plane networking](https:\/\/docs.databricks.com\/security\/network\/classic\/index.html). \nFor regional support, see [Databricks clouds and regions](https:\/\/docs.databricks.com\/resources\/supported-regions.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/overview.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### PCI-DSS compliance controls\n\nPreview \nThe ability for admins to add Enhanced Security and Compliance features is a feature in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). The compliance security profile and support for compliance standards are generally available (GA). \nPCI-DSS compliance controls provide enhancements that help you with payment card industry (PCI) compliance for your workspace. \nPCI-DSS compliance controls require enabling the *compliance security profile*, which adds monitoring agents, enforces instance types for inter-node encryption, provides a hardened compute image, and other features. For technical details, see [Compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html). It is your responsibility to [confirm that each workspace has the compliance security profile enabled](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html#verify) if it\u2019s needed.\n\n####### PCI-DSS compliance controls\n######## Which compute resources get enhanced security\n\nThe compliance security profile enhancements apply to compute resources in the [classic compute plane](https:\/\/docs.databricks.com\/getting-started\/overview.html) in all regions. \nSupport for serverless SQL warehouses for the compliance security profile varies by region. 
See [Serverless SQL warehouses support the compliance security profile in some regions](https:\/\/docs.databricks.com\/admin\/sql\/serverless.html#security-profile).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/pci.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### PCI-DSS compliance controls\n######## Requirements\n\n* Your Databricks account must include the Enhanced Security and Compliance add-on. For details, see the [pricing page](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n* Your Databricks workspace is on the Enterprise pricing tier.\n* [Single sign-on (SSO)](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/single-sign-on\/index.html) authentication is configured for the workspace.\n* Your workspace enables the [compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html) and adds the PCI-DSS compliance standard as part of the compliance security profile configuration.\n* You must use the following VM instance types: \n+ **General purpose:** `M-fleet`, `Md-fleet`, `M5dn`, `M5n`, `M5zn`, `M7g`, `M7gd`, `M6i`, `M7i`, `M6id`, `M6in`, `M6idn`, `M6a`, `M7a`\n+ **Compute optimized:** `C5a`, `C5ad`, `C5n`, `C6gn`, `C7g`, `C7gd`, `C7gn`, `C6i`, `C6id`, `C7i`, `C6in`, `C6a`, `C7a`\n+ **Memory optimized:** `R-fleet`, `Rd-fleet`, `R7g`, `R7gd`, `R6i`, `R7i`, `R7iz`, `R6id`, `R6in`, `R6idn`, `R6a`, `R7a`\n+ **Storage optimized:** `D3`, `D3en`, `P3dn`, `R5dn`, `R5n`, `I4i`, `I4g`, `I3en`, `Im4gn`, `Is4gen`\n+ **Accelerated computing:** `G4dn`, `G5`, `P4d`, `P4de`, `P5`\n* Ensure that sensitive information is never entered in customer-defined input fields, such as workspace names, cluster names, and job names.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/pci.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### 
Compliance security profile\n####### PCI-DSS compliance controls\n######## Enable PCI-DSS compliance controls on a workspace\n\nTo configure your workspace to support processing of data regulated by the PCI-DSS standard, the workspace must have the [compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html) enabled. You can enable the compliance security profile and add the PCI-DSS compliance standard across all workspaces or only on some workspaces. \nTo enable the compliance security profile and add the PCI-DSS compliance standard for an existing workspace, see [Enable enhanced security and compliance features on a workspace](https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-compliance.html#aws-workspace-config). \nTo set an account-level setting to enable the compliance security profile and PCI-DSS for new workspaces, see [Set account-level defaults for new workspaces](https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-compliance.html#aws-account-level-defaults).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/pci.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### PCI-DSS compliance controls\n######## Preview features that are supported for processing credit card payment data\n\nThe following preview features are supported for processing credit card payment data: \n* [SCIM provisioning](https:\/\/docs.databricks.com\/admin\/users-groups\/scim\/index.html)\n* [IAM passthrough](https:\/\/docs.databricks.com\/archive\/credential-passthrough\/iam-passthrough.html)\n* [Secret paths in environment variables](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html#spark-conf-env-var)\n* [System tables](https:\/\/docs.databricks.com\/admin\/system-tables\/index.html)\n* [Serverless SQL warehouse usage when compliance security profile is 
enabled](https:\/\/docs.databricks.com\/admin\/sql\/serverless.html#security-profile), with support in some regions\n* [Filtering sensitive table data with row filters and column masks](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html)\n* [Unified login](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/single-sign-on\/index.html#unified-login)\n* [Lakehouse Federation to Redshift](https:\/\/docs.databricks.com\/query-federation\/redshift.html)\n* [Liquid clustering for Delta tables](https:\/\/docs.databricks.com\/delta\/clustering.html)\n* [Unity Catalog-enabled DLT pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html)\n* [Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html)\n* Scala support for shared clusters\n* Delta Live Tables Hive metastore to Unity Catalog clone API\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/pci.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### PCI-DSS compliance controls\n######## Does Databricks permit the processing of credit card payment data on Databricks?\n\nYes, if you comply with the [requirements](https:\/\/docs.databricks.com\/security\/privacy\/pci.html#requirements), enable the compliance security profile, and add the PCI-DSS compliance standard as part of the compliance security profile configuration.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/pci.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n### Distributed training\n##### HorovodRunner examples\n\nThe examples in this section demonstrate how to use HorovodRunner to perform distributed training using a convolutional neural network model on the [MNIST](https:\/\/en.wikipedia.org\/wiki\/MNIST_database) dataset, a large database of handwritten digits. 
\n* [Deep learning using TensorFlow with HorovodRunner for MNIST](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/mnist-tensorflow-keras.html)\n* [Adapt single node PyTorch to distributed deep learning](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/mnist-pytorch.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-runner-examples.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Hex\n\nHex is a modern data workspace. It makes it easy to connect to data, analyze it in collaborative SQL and Python-powered notebooks, and share work as interactive data apps and stories. Hex has three major elements: \n* Logic View, a notebook-based interface where you can develop your analysis.\n* App Builder, an integrated interface builder, where you can arrange elements from the Logic View into an interactive app.\n* Share, where you can invite stakeholders, customers, and team members to collaborate on and interact with your analysis. \nYou can integrate your Databricks SQL warehouses (formerly Databricks SQL endpoints) with Hex. \nNote \nHex does not integrate with Databricks clusters.\n\n#### Connect to Hex\n##### Connect to Hex using Partner Connect\n\nTo connect your Databricks workspace to Hex using Partner Connect, see [Connect to BI partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/bi.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/hex.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Hex\n##### Connect to Hex manually\n\nThis section describes how to connect an existing SQL warehouse to Hex manually. 
\n### Requirements \nBefore you integrate with Hex manually, you must have the following: \n* A Databricks SQL [warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html) in your Databricks workspace.\n* The **JDBC URL** value for your SQL warehouse. To get this value, see [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks personal access token. \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \n### Steps to connect \nTo connect to Hex manually, do the following: \n1. [Create a new Hex account](https:\/\/app.hex.tech\/signup), or [sign in to your existing Hex account](https:\/\/app.hex.tech\/login).\n2. On the **Projects** page, click **New project**.\n3. In the **New Project** dialog, enter a name for the project, and then click **Create project**.\n4. In the sidebar, click the database (**Data sources**) icon.\n5. Within **Data Connections**, click the **Databricks** icon.\n6. In the **Add a data connection to Databricks** dialog, enter a name for the connection.\n7. For **JDBC URL**, enter the **JDBC URL** value for your SQL warehouse.\n8. For **Access Token**, enter the Databricks personal access token.\n9. 
Click **Create connection**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/hex.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Hex\n##### Next steps\n\nExplore one or more of the following resources on the [Hex website](https:\/\/hex.tech): \n* [What is Hex?](https:\/\/learn.hex.tech\/docs)\n* [Hex documentation](https:\/\/learn.hex.tech\/docs)\n* [Tutorials](https:\/\/learn.hex.tech\/tutorials)\n* [Introduction to projects](https:\/\/learn.hex.tech\/docs\/getting-started\/intro-to-projects)\n* [Support](mailto:hello%40hex.tech)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/hex.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Query and visualize data from a notebook\n\nThis get started article walks you through using a Databricks notebook to query sample data stored in Unity Catalog using SQL, Python, Scala, and R and then visualize the query results in the notebook.\n\n### Get started: Query and visualize data from a notebook\n#### Requirements\n\nTo complete the tasks in this article, you must meet the following requirements: \n* Your workspace must have [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) enabled. For information on getting started with Unity Catalog, see [Set up and manage Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html).\n* You must have permission to use an existing compute resource or create a new compute resource. See [Get started: Account and workspace setup](https:\/\/docs.databricks.com\/getting-started\/index.html) or see your Databricks administrator.\n\n### Get started: Query and visualize data from a notebook\n#### Step 1: Create a new notebook\n\nTo create a notebook in your workspace: \n1. Click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar, then click **Notebook**.\n2. 
On the Create Notebook page: \n* Specify a unique name for your notebook.\n* Set the default language for your notebook and then click **Confirm** if prompted.\n* Use the **Connect** dropdown menu to select a compute resource. To create a new compute resource, see [Use compute](https:\/\/docs.databricks.com\/compute\/use-compute.html). \nTo learn more about creating and managing notebooks, see [Manage notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/quick-start.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Query and visualize data from a notebook\n#### Step 2: Query a table\n\nQuery the `samples.nyctaxi.trips` table in Unity Catalog using the language of your choice. \n1. Copy and paste the following code into the new empty notebook cell. This code displays the results from querying the `samples.nyctaxi.trips` table in Unity Catalog. \n```\nSELECT * FROM samples.nyctaxi.trips\n\n``` \n```\ndisplay(spark.read.table(\"samples.nyctaxi.trips\"))\n\n``` \n```\ndisplay(spark.read.table(\"samples.nyctaxi.trips\"))\n\n``` \n```\nlibrary(SparkR)\ndisplay(sql(\"SELECT * FROM samples.nyctaxi.trips\"))\n\n```\n2. Press `Shift+Enter` to run the cell and then move to the next cell. \nThe query results appear in the notebook.\n\n### Get started: Query and visualize data from a notebook\n#### Step 3: Display the data\n\nDisplay the average fare amount by trip distance, grouped by the pickup zip code. \n1. Next to the **Table** tab, click **+** and then click **Visualization**. \nThe visualization editor displays.\n2. In the **Visualization Type** drop-down, verify that **Bar** is selected.\n3. Select `fare_amount` for the **X column**.\n4. Select `trip_distance` for the **Y column**.\n5. Select `Average` as the aggregation type.\n6. Select `pickup_zip` as the **Group by** column. \n![Bar chart](https:\/\/docs.databricks.com\/_images\/trip_distance.png)\n7. 
Click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/quick-start.html"} +{"content":"# Get started: Account and workspace setup\n### Get started: Query and visualize data from a notebook\n#### Next steps\n\n* To learn about adding data from a CSV file to Unity Catalog and visualizing data, see [Get started: Import and visualize CSV data from a notebook](https:\/\/docs.databricks.com\/getting-started\/import-visualize-data.html).\n* To learn how to load data into Databricks using Apache Spark, see [Tutorial: Load and transform data using Apache Spark DataFrames](https:\/\/docs.databricks.com\/getting-started\/dataframes.html).\n* To learn more about ingesting data into Databricks, see [Ingest data into a Databricks lakehouse](https:\/\/docs.databricks.com\/ingestion\/index.html).\n* To learn more about querying data with Databricks, see [Query data](https:\/\/docs.databricks.com\/query\/index.html).\n* To learn more about visualizations, see [Visualizations in Databricks notebooks](https:\/\/docs.databricks.com\/visualizations\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/quick-start.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Stitch\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nStitch helps you consolidate all your business data from different databases and SaaS applications (Salesforce, Hubspot, Marketo, and so on) into Delta Lake. \nHere are the steps for using Stitch with Databricks.\n\n#### Connect to Stitch\n##### Step 1: Generate a Databricks personal access token\n\nStitch authenticates with Databricks using a Databricks personal access token. \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). 
\nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/stitch.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Stitch\n##### Step 2: Set up a cluster to support integration needs\n\nStitch will write data to an S3 bucket and the Databricks integration cluster will read data from that location. Therefore the integration cluster requires secure access to the S3 bucket. \n### Secure access to an S3 bucket \nTo access AWS resources, you can launch the Databricks integration cluster with an instance profile. The instance profile should have access to the staging S3 bucket and the target S3 bucket where you want to write the Delta tables. To create an instance profile and configure the integration cluster to use the role, follow the instructions in [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html). \nAs an alternative, you can use [IAM credential passthrough](https:\/\/docs.databricks.com\/archive\/credential-passthrough\/iam-passthrough.html), which enables user-specific access to S3 data from a shared cluster. \n### Specify the cluster configuration \n1. Set **Cluster Mode** to **Standard**.\n2. Set **Databricks Runtime Version** to Runtime: 6.3 or above.\n3. 
Enable [optimized writes and auto compaction](https:\/\/docs.databricks.com\/delta\/tune-file-size.html) by adding the following properties to your [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration): \n```\nspark.databricks.delta.optimizeWrite.enabled true\nspark.databricks.delta.autoCompact.enabled true\n\n```\n4. Configure your cluster depending on your integration and scaling needs. \nFor cluster configuration details, see [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html). \nSee [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html) for the steps to obtain the JDBC URL and HTTP path.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/stitch.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Stitch\n##### Step 3: Configure Stitch with Databricks\n\nGo to the [Stitch](https:\/\/www.stitchdata.com\/signup\/?utm_source=partner&utm_medium=app&utm_campaign=databricks) login page and follow the instructions.\n\n#### Connect to Stitch\n##### Additional resources\n\n[Support](https:\/\/www.stitchdata.com\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/stitch.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Share data using the Delta Sharing Databricks-to-Databricks protocol (for providers)\n\nThis article gives an overview of how to use Databricks-to-Databricks Delta Sharing to share data securely with any Databricks user, regardless of account or cloud host, as long as that user has access to a workspace enabled for [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). 
\nNote \nIf you are a data recipient (a user or group of users with whom Databricks data is being shared), see [Access data shared with you using Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/recipient.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/share-data-databricks.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Share data using the Delta Sharing Databricks-to-Databricks protocol (for providers)\n#### Who should use Databricks-to-Databricks Delta Sharing?\n\nThere are three ways to share data using Delta Sharing. \n1. **The Databricks-to-Databricks sharing protocol**, covered in this article, lets you share data from your Unity Catalog-enabled workspace with users who also have access to a Unity Catalog-enabled Databricks workspace. \nThis approach uses the Delta Sharing server that is built into Databricks and provides support for notebook sharing, Unity Catalog data governance, auditing, and usage tracking for both providers and recipients. The integration with Unity Catalog simplifies setup and governance for both providers and recipients and improves performance.\n2. **The Databricks open sharing protocol** lets you share data that you manage in a Unity Catalog-enabled Databricks workspace with users on any computing platform. \nSee [Share data using the Delta Sharing open sharing protocol (for providers)](https:\/\/docs.databricks.com\/data-sharing\/share-data-open.html).\n3. **A customer-managed implementation of the open-source Delta Sharing server** lets you share from any platform to any platform, whether Databricks or not. \nSee [github.com\/delta-io\/delta-sharing](https:\/\/github.com\/delta-io\/delta-sharing). 
\nFor an introduction to Delta Sharing and more information about these three approaches, see [Share data and AI assets securely using Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/share-data-databricks.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Share data using the Delta Sharing Databricks-to-Databricks protocol (for providers)\n#### Databricks-to-Databricks Delta Sharing workflow\n\nThis section provides a high-level overview of the Databricks-to-Databricks sharing workflow, with links to detailed documentation for each step. \nIn the Databricks-to-Databricks Delta Sharing model: \n1. A data *recipient* gives a data *provider* the unique *sharing identifier* for the Databricks Unity Catalog metastore that is attached to the Databricks workspace that the recipient (which represents a user or group of users) will use to access the data that the data provider is sharing. \nFor details, see [Step 1: Request the recipient\u2019s sharing identifier](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#request-uuid).\n2. The data provider creates a *share* in the provider\u2019s Unity Catalog metastore. This named object contains a collection of tables, views, volumes, and notebooks registered in the metastore. \nFor details, see [Create and manage shares for Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/create-share.html).\n3. The data provider creates a recipient object in the provider\u2019s Unity Catalog metastore. This named object represents the user or group of users who will access the data included in the share, along with the sharing identifier of the Unity Catalog metastore that is attached to the workspace that the user or group of users will use to access the share. The sharing identifier is the key identifier that enables the secure connection. 
\nFor details, see [Step 2: Create the recipient](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#create-recipient-db-to-db).\n4. The data provider grants the recipient access to the share. \nFor details, see [Manage access to Delta Sharing data shares (for providers)](https:\/\/docs.databricks.com\/data-sharing\/grant-access.html).\n5. The share becomes available in the recipient\u2019s Databricks workspace, and users can access it using Catalog Explorer, the Databricks CLI, or SQL commands in a Databricks notebook or the Databricks SQL query editor. \nTo access the tables, views, volumes, and notebooks in a share, a metastore admin or [privileged user](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html#access-data) must create a catalog from the share. Then that user or another user who is granted the appropriate privilege can give other users access to the catalog and objects in the catalog. Granting permissions on shared catalogs and data assets works just like it does with any other assets registered in Unity Catalog, with the important distinction being that users can be granted only read access on objects in catalogs that are created from Delta Sharing shares. \nShared notebooks live at the catalog level, and any user with the `USE CATALOG` privilege on the catalog can access them. \nFor details, see [Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/share-data-databricks.html"} +{"content":"# AI and Machine Learning on Databricks\n### What is AutoML?\n\nDatabricks AutoML helps you automatically apply machine learning to a dataset. You provide the dataset and identify the prediction target, while AutoML prepares the dataset for model training. AutoML then performs and records a set of trials that creates, tunes, and evaluates multiple models. 
After model evaluation, AutoML displays the results and provides a Python notebook with the source code for each trial run so you can review, reproduce, and modify the code. AutoML also calculates summary statistics on your dataset and saves this information in a notebook that you can review later. \nYou can use Databricks AutoML for regression, classification, and forecasting problems. Learn more about [How Databricks AutoML works](https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n### What is AutoML?\n#### Requirements\n\n* Databricks Runtime 9.1 ML or above. For the general availability (GA) version, Databricks Runtime 10.4 LTS ML or above. \n+ For time series forecasting, Databricks Runtime 10.0 ML or above.\n+ With Databricks Runtime 9.1 LTS ML and above, AutoML depends on the `databricks-automl-runtime` package, which contains components that are useful outside of AutoML, and also helps simplify the notebooks generated by AutoML training. `databricks-automl-runtime` is available on [PyPI](https:\/\/pypi.org\/project\/databricks-automl-runtime\/).\n* No additional libraries other than those that are preinstalled in Databricks Runtime for Machine Learning should be installed on the cluster. 
\n+ Any modification (removal, upgrades, or downgrades) to existing library versions results in run failures due to incompatibility.\n* AutoML is not compatible with [shared access mode clusters](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode).\n* To use [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) with AutoML, the [cluster access mode](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode) must be **Single User**, and you must be the designated single user of the cluster.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n### What is AutoML?\n#### Next steps\n\n* [Train ML models with the Databricks AutoML UI](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-ui.html)\n* [Train ML models with Databricks AutoML Python API](https:\/\/docs.databricks.com\/machine-learning\/automl\/train-ml-model-automl-api.html)\n* [How Databricks AutoML works](https:\/\/docs.databricks.com\/machine-learning\/automl\/how-automl-works.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl\/index.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Tips, recommendations, and features for developing and testing Delta Live Tables pipelines\n\nThis article describes patterns you can use to develop and test Delta Live Tables pipelines. Through the pipeline settings, Delta Live Tables allows you to specify configurations to isolate pipelines in developing, testing, and production environments. 
The recommendations in this article are applicable for both SQL and Python code development.\n\n##### Tips, recommendations, and features for developing and testing Delta Live Tables pipelines\n###### Use development mode to run pipeline updates\n\nDelta Live Tables provides a UI toggle to control whether your pipeline updates run in development or production mode. This mode controls how pipeline updates are processed, including: \n* Development mode does not immediately terminate compute resources after an update succeeds or fails. You can reuse the same compute resources to run multiple updates of the pipeline without waiting for a cluster to start.\n* Development mode does not automatically retry on task failure, allowing you to immediately detect and fix logical or syntactic errors in your pipeline. \nDatabricks recommends using development mode during development and testing and always switching to production mode when deploying to a production environment. \nSee [Development and production modes](https:\/\/docs.databricks.com\/delta-live-tables\/updates.html#optimize-execution).\n\n##### Tips, recommendations, and features for developing and testing Delta Live Tables pipelines\n###### Test pipeline source code without waiting for tables to update\n\nTo check for problems with your pipeline source code, such as syntax and analysis errors, during development and testing, you can run a [Validate update](https:\/\/docs.databricks.com\/delta-live-tables\/updates.html#validate-update). 
Because a `Validate` update only verifies the correctness of pipeline source code without running an actual update on any tables, you can more quickly identify and fix issues before running an actual pipeline update.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/testing.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Tips, recommendations, and features for developing and testing Delta Live Tables pipelines\n###### Specify a target schema during all development lifecycle phases\n\nAll datasets in a Delta Live Tables pipeline reference the `LIVE` virtual schema, which is not accessible outside the pipeline. If a target schema is specified, the `LIVE` virtual schema points to the target schema. To review the results written out to each table during an update, you must specify a target schema. \nYou must specify a target schema that is unique to your environment. Each table in a given schema can only be updated by a single pipeline. \nBy creating separate pipelines for development, testing, and production with different targets, you can keep these environments isolated. Using the target schema parameter allows you to remove logic that uses string interpolation or other widgets or parameters to control data sources and targets. \nSee [Publish data from Delta Live Tables to the Hive metastore](https:\/\/docs.databricks.com\/delta-live-tables\/publish.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/testing.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Tips, recommendations, and features for developing and testing Delta Live Tables pipelines\n###### Use Databricks Git folders to manage Delta Live Tables pipelines\n\nDatabricks recommends using Git folders during Delta Live Tables pipeline development, testing, and deployment to production. 
Git folders enable the following: \n* Keeping track of how code is changing over time.\n* Merging changes that are being made by multiple developers.\n* Software development practices such as code reviews. \nDatabricks recommends configuring a single Git repository for all code related to a pipeline. \nEach developer should have their own Databricks Git folder configured for development. During development, the user configures their own pipeline from their Databricks Git folder and tests new logic using development datasets and isolated schemas and locations. As development work is completed, the user commits and pushes changes back to their branch in the central Git repository and opens a pull request against the testing or QA branch. \nThe resulting branch should be checked out in a Databricks Git folder and a pipeline configured using test datasets and a development schema. Assuming logic runs as expected, a pull request or release branch should be prepared to push the changes to production. \nWhile Git folders can be used to synchronize code across environments, pipeline settings need to be kept up to date either manually or using tools like Terraform. \nThis workflow is similar to using Git folders for CI\/CD in all Databricks jobs. See [CI\/CD techniques with Git and Databricks Git folders (Repos)](https:\/\/docs.databricks.com\/repos\/ci-cd-techniques-with-repos.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/testing.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Tips, recommendations, and features for developing and testing Delta Live Tables pipelines\n###### Segment libraries for ingestion and transformation steps\n\nDatabricks recommends isolating queries that ingest data from transformation logic that enriches and validates data. 
You can then organize libraries used for ingesting data from development or testing data sources in a separate directory from production data ingestion logic, allowing you to easily configure pipelines for various environments. You can then use smaller datasets for testing, accelerating development. See [Create sample datasets for development and testing](https:\/\/docs.databricks.com\/delta-live-tables\/testing.html#sample-data). \nYou can also use parameters to control data sources for development, testing, and production. See [Control data sources with parameters](https:\/\/docs.databricks.com\/delta-live-tables\/testing.html#parameters). \nBecause Delta Live Tables pipelines use the `LIVE` virtual schema for managing all dataset relationships, by configuring development and testing pipelines with ingestion libraries that load sample data, you can substitute sample datasets using production table names to test code. The same transformation logic can be used in all environments.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/testing.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Tips, recommendations, and features for developing and testing Delta Live Tables pipelines\n###### Create sample datasets for development and testing\n\nDatabricks recommends creating development and test datasets to test pipeline logic with both expected data and potential malformed or corrupt records. There are multiple ways to create datasets that can be useful for development and testing, including the following: \n* Select a subset of data from a production dataset.\n* Use anonymized or artificially generated data for sources containing PII.\n* Create test data with well-defined outcomes based on downstream transformation logic.\n* Anticipate potential data corruption, malformed records, and upstream data changes by creating records that break data schema expectations. 
\nFor example, if you have a notebook that defines a dataset using the following code: \n```\nCREATE OR REFRESH STREAMING TABLE input_data AS SELECT * FROM cloud_files(\"\/production\/data\", \"json\")\n\n``` \nYou could create a sample dataset containing specific records using a query like the following: \n```\nCREATE OR REFRESH LIVE TABLE input_data AS\nSELECT \"2021\/09\/04\" AS date, 22.4 as sensor_reading UNION ALL\nSELECT \"2021\/09\/05\" AS date, 21.5 as sensor_reading\n\n``` \nThe following example demonstrates filtering published data to create a subset of the production data for development or testing: \n```\nCREATE OR REFRESH LIVE TABLE input_data AS SELECT * FROM prod.input_data WHERE date > current_date() - INTERVAL 1 DAY\n\n``` \nTo use these different datasets, create multiple pipelines with the notebooks implementing the transformation logic. Each pipeline can read data from the `LIVE.input_data` dataset but is configured to include the notebook that creates the dataset specific to the environment.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/testing.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Tips, recommendations, and features for developing and testing Delta Live Tables pipelines\n###### Control data sources with parameters\n\nYou can reference parameters set during pipeline configuration from within your libraries. These parameters are set as key-value pairs in the **Compute > Advanced > Configurations** portion of the pipeline settings UI. This pattern allows you to specify different data sources in different configurations of the same pipeline. 
\nFor example, you can specify different paths in development, testing, and production configurations for a pipeline using the variable `data_source_path` and then reference it using the following code: \n```\nCREATE STREAMING TABLE bronze\nAS (\nSELECT\n*,\n_metadata.file_path AS source_file_path\nFROM cloud_files( '${data_source_path}', 'csv',\nmap(\"header\", \"true\"))\n)\n\n``` \n```\nimport dlt\nfrom pyspark.sql.functions import col\n\ndata_source_path = spark.conf.get(\"data_source_path\")\n\n@dlt.table\ndef bronze():\n    return (spark.readStream\n        .format(\"cloudFiles\")\n        .option(\"cloudFiles.format\", \"csv\")\n        .option(\"header\", True)\n        .load(data_source_path)\n        .select(\"*\", col(\"_metadata.file_path\").alias(\"source_file_path\"))\n    )\n\n``` \nThis pattern is especially useful if you need to test how ingestion logic might handle changes to schema or malformed data during initial ingestion. You can use the identical code throughout your entire pipeline in all environments while switching out datasets.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/testing.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Foundation Model Training\n\nImportant \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). Reach out to your Databricks account team to enroll in the Public Preview. \nWith Foundation Model Training, you can use your own data to customize a foundation model to optimize its performance for your specific application. By fine-tuning or continuing training of a foundation model, you can train your own model using significantly less data, time, and compute resources than training a model from scratch. \nWith Databricks you have everything in a single platform: your own data to use for training, the foundation model to train, checkpoints saved to MLflow, and the model registered in Unity Catalog and ready to deploy. 
\nThis article gives an overview of Foundation Model Training on Databricks. For details on how to use it, see the following: \n* [Create a training run using the Foundation Model Training API](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/create-fine-tune-run.html)\n* [Tutorial: Create and deploy a training run using Foundation Model Training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/fine-tune-run-tutorial.html)\n* [Create a training run using the Foundation Model Training UI](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/ui.html)\n* [View, manage, and analyze Foundation Model Training runs](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/view-manage-runs.html)\n* [Prepare data for Foundation Model Training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/data-preparation.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Foundation Model Training\n#### What is Foundation Model Training?\n\nFoundation Model Training lets you use the Databricks API or UI to tune or further train a foundation model. \nUsing Foundation Model Training, you can: \n* Train a model with your custom data, with the checkpoints saved to MLflow. You retain complete control of the trained model.\n* Automatically register the model to Unity Catalog, allowing easy deployment with model serving.\n* Further train a completed, proprietary model by loading the weights of a previously trained model. 
\nDatabricks recommends that you try Foundation Model Training if: \n* You have tried few-shot learning and want better results.\n* You have tried prompt engineering on an existing model and want better results.\n* You want full ownership over a custom model for data privacy.\n* You are latency-sensitive or cost-sensitive and want to use a smaller, cheaper model with your task-specific data.\n\n### Foundation Model Training\n#### Supported tasks\n\nFoundation Model Training supports the following use cases: \n* **Supervised fine-tuning**: Train your model on structured prompt-response data. Use this to adapt your model to a new task, change its response style, or add instruction-following capabilities.\n* **Continued pre-training**: Train your model with additional text data. Use this to add new knowledge to a model or focus a model on a specific domain.\n* **Chat completion**: Train your model on chat logs between a user and an AI assistant. This format can be used both for actual chat logs, and as a standard format for question answering and conversational text. The text is automatically formatted into the appropriate chat format for the specific model.\n\n### Foundation Model Training\n#### Requirements\n\n* A Databricks workspace in one of the following AWS regions: `us-east-1`, `us-west-2`.\n* Foundation Model Training APIs installed using `pip install databricks_genai`.\n* Databricks Runtime 12.2 LTS ML or above if your data is in a Delta table. 
\nSee [Prepare data for Foundation Model Training](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/data-preparation.html) for information about required input data formats.\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Foundation Model Training\n#### Recommended data size for model training\n\nFor supervised fine-tuning and chat completion, you should provide enough tokens for at least one full context length of the model. For example, 4096 tokens for `meta-llama\/Llama-2-7b-chat-hf` or 32768 tokens for `mistralai\/Mistral-7B-v0.1`. \nFor continued pre-training, Databricks recommends a minimum of 1.5 million samples to get a higher quality model that learns your custom data.\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Foundation Model Training\n#### Supported models\n\nImportant \nLlama 3 is licensed under the [LLAMA 3 Community License](https:\/\/llama.meta.com\/llama3\/license\/), Copyright \u00a9 Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring compliance with applicable model licenses. \nLlama 2 and Code Llama models are licensed under the [LLAMA 2 Community License](https:\/\/ai.meta.com\/llama\/license\/), Copyright \u00a9 Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring compliance with applicable model licenses. \nDBRX is provided under and subject to the [Databricks Open Model License](https:\/\/www.databricks.com\/legal\/open-model-license), Copyright \u00a9 Databricks, Inc. All rights reserved. 
Customers are responsible for ensuring compliance with applicable model licenses, including the [Databricks Acceptable Use policy](https:\/\/www.databricks.com\/legal\/acceptable-use-policy-open-model). \n| Model | Maximum context length |\n| --- | --- |\n| `databricks\/dbrx-base` | 4096 |\n| `databricks\/dbrx-instruct` | 4096 |\n| `meta-llama\/Meta-Llama-3-70B` | 8192 |\n| `meta-llama\/Meta-Llama-3-70B-Instruct` | 8192 |\n| `meta-llama\/Meta-Llama-3-8B` | 8192 |\n| `meta-llama\/Meta-Llama-3-8B-Instruct` | 8192 |\n| `meta-llama\/Llama-2-7b-hf` | 4096 |\n| `meta-llama\/Llama-2-13b-hf` | 4096 |\n| `meta-llama\/Llama-2-70b-hf` | 4096 |\n| `meta-llama\/Llama-2-7b-chat-hf` | 4096 |\n| `meta-llama\/Llama-2-13b-chat-hf` | 4096 |\n| `meta-llama\/Llama-2-70b-chat-hf` | 4096 |\n| `codellama\/CodeLlama-7b-hf` | 16384 |\n| `codellama\/CodeLlama-13b-hf` | 16384 |\n| `codellama\/CodeLlama-34b-hf` | 16384 |\n| `codellama\/CodeLlama-7b-Instruct-hf` | 16384 |\n| `codellama\/CodeLlama-13b-Instruct-hf` | 16384 |\n| `codellama\/CodeLlama-34b-Instruct-hf` | 16384 |\n| `codellama\/CodeLlama-7b-Python-hf` | 16384 |\n| `codellama\/CodeLlama-13b-Python-hf` | 16384 |\n| `codellama\/CodeLlama-34b-Python-hf` | 16384 |\n| `mistralai\/Mistral-7B-v0.1` | 32768 |\n| `mistralai\/Mistral-7B-Instruct-v0.2` | 32768 |\n| `mistralai\/Mixtral-8x7B-v0.1` | 32768 |\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Foundation Model Training\n#### Use Foundation Model Training\n\nFoundation Model Training is accessible using the `databricks_genai` SDK. The following example creates and launches a training run that uses data from Unity Catalog Volumes. See the [Create a training run using the Foundation Model Training API](https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/create-fine-tune-run.html) for configuration details. 
\n```\nfrom databricks.model_training import foundation_model as fm\n\nmodel = 'meta-llama\/Llama-2-7b-chat-hf'\n# UC Volume with JSONL formatted data\ntrain_data_path = 'dbfs:\/Volumes\/main\/mydirectory\/ift\/train.jsonl'\nregister_to = 'main.mydirectory'\nrun = fm.create(\n    model=model,\n    train_data_path=train_data_path,\n    register_to=register_to,\n)\n\n```\n\n### Foundation Model Training\n#### Limitations\n\n* Large datasets (10B+ tokens) are not supported due to compute availability.\n* PrivateLink is not supported.\n* For continued pre-training, workloads are limited to 60-256MB files. Files larger than 1GB may cause longer processing times.\n* Databricks strives to make the latest state-of-the-art models available for customization using Foundation Model Training. As we make new models available, we might remove the ability to access older models from the API and\/or UI, deprecate older models, or update supported models. If a foundation model will be removed from the API and\/or UI or deprecated, Databricks will take the following steps to notify customers at least three months before the removal and\/or deprecation date: \n+ Display a warning message in the model card from the **Experiments > Foundation Model Training** page of your Databricks workspace indicating that the model is scheduled for deprecation.\n+ Update our documentation to include a notice indicating that the model is scheduled for deprecation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/foundation-model-training\/index.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Tutorial: Deploy and query a custom model\n\nThis article provides the basic steps for using Databricks model serving to deploy and query a custom model (that is, a traditional ML model) that is installed in Unity Catalog or registered in the workspace model registry. 
\nThe following are guides that describe serving and deploying a [foundation model](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/foundation-models.html) for generative AI and LLM: \n* [Foundation Model API](https:\/\/docs.databricks.com\/en\/machine-learning\/foundation-models\/query-foundation-model-apis.html)\n* [External Models](https:\/\/docs.databricks.com\/en\/generative-ai\/external-models\/external-models-tutorial.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-intro.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Tutorial: Deploy and query a custom model\n##### Step 1: Log the model to the model registry\n\nThere are different ways to log your model for model serving: \n| Logging technique | Description |\n| --- | --- |\n| Autologging | This is automatically turned on when you use Databricks Runtime for machine learning. It\u2019s the easiest way but gives you less control. |\n| Logging using MLflow\u2019s built-in flavors | You can manually log the model with [MLflow\u2019s built-in model flavors](https:\/\/mlflow.org\/docs\/latest\/models.html#built-in-model-flavors). |\n| Custom logging with `pyfunc` | Use this if you have a custom model or if you need extra steps before or after inference. | \nThe following example shows how to log your MLflow model using the `transformer` flavor and specify parameters you need for your model. 
\n```\nimport mlflow\n\nwith mlflow.start_run():\n    model_info = mlflow.transformers.log_model(\n        transformers_model=text_generation_pipeline,\n        artifact_path=\"my_sentence_generator\",\n        inference_config=inference_config,\n        registered_model_name='gpt2',\n        input_example=input_example,\n        signature=signature\n    )\n\n``` \nAfter your model is logged, be sure to check that it is registered in either [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) or the workspace [Model Registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-intro.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Tutorial: Deploy and query a custom model\n##### Step 2: Create endpoint using the Serving UI\n\nAfter your registered model is logged and you are ready to serve it, you can create a model serving endpoint using the **Serving** UI. \n1. Click **Serving** in the sidebar to display the **Serving** UI.\n2. Click **Create serving endpoint**. \n![Model serving pane in Databricks UI](https:\/\/docs.databricks.com\/_images\/serving-pane.png)\n3. In the **Name** field, provide a name for your endpoint.\n4. In the **Served entities** section \n1. Click into the **Entity** field to open the **Select served entity** form.\n2. Select the type of model you want to serve. The form dynamically updates based on your selection.\n3. Select which model and model version you want to serve.\n4. Select the percentage of traffic to route to your served model.\n5. Select what size compute to use. You can use CPU or GPU computes for your workloads. Support for model serving on GPU is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
See [GPU workload types](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html#gpu) for more information on available GPU computes.\n6. Under **Compute Scale-out**, select the size of the compute scale-out that corresponds with the number of requests this served model can process at the same time. This number should be roughly equal to QPS x model execution time. \n1. Available sizes are **Small** for 0-4 requests, **Medium** for 8-16 requests, and **Large** for 16-64 requests.\n7. Specify if the endpoint should scale to zero when not in use.\n5. Click **Create**. The **Serving endpoints** page appears with **Serving endpoint state** shown as Not Ready. \n![Create a model serving endpoint](https:\/\/docs.databricks.com\/_images\/create-endpoint1.png) \nIf you prefer to create an endpoint programmatically with the Databricks Serving API, see [Create custom model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-intro.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Tutorial: Deploy and query a custom model\n##### Step 3: Query the endpoint\n\nThe easiest and fastest way to test and send scoring requests to your served model is to use the **Serving** UI. \n1. From the **Serving endpoint** page, select **Query endpoint**.\n2. Insert the model input data in JSON format and click **Send Request**. If the model has been logged with an input example, click **Show Example** to load the input example. \n```\n{\n\"inputs\" : [\"Hello, I'm a language model,\"],\n\"params\" : {\"max_new_tokens\": 10, \"temperature\": 1}\n}\n\n``` \nTo send scoring requests, construct a JSON payload with one of the supported keys and a JSON object corresponding to the input format. 
See [Query serving endpoints for custom models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-custom-model-endpoints.html) for supported formats and guidance on how to send scoring requests using the API. \nIf you plan to access your serving endpoint outside of the Databricks Serving UI, you need a `DATABRICKS_API_TOKEN`. \nImportant \nAs a security best practice for production scenarios, Databricks recommends that you use [machine-to-machine OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html) for authentication. \nFor testing and development, Databricks recommends using a personal access token belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-intro.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Tutorial: Deploy and query a custom model\n##### Example notebooks\n\nSee the following notebook for serving an MLflow `transformers` model with Model Serving. \n### Deploy a Hugging Face transformers model notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/deploy-transformers-model-serving.html) \nSee the following notebook for serving an MLflow `pyfunc` model with Model Serving. For additional details on customizing your model deployments, see [Deploy Python code with Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/deploy-custom-models.html). 
\n### Deploy an MLflow pyfunc model notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/deploy-mlflow-pyfunc-model-serving.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-intro.html"} +{"content":"# \n### Cluster Configuration for RAG Studio\n\nThis article describes the clusters that RAG Studio provisions to automate tasks including data ingestion, RAG chain creation, and RAG evaluation. \nBy default, RAG Studio provisions new job clusters specifically for these tasks.\n\n### Cluster Configuration for RAG Studio\n#### Default cluster provisioning\n\nThe default clusters provisioned by RAG Studio use the following configuration: \n* Access Mode: Assigned\n* Databricks Runtime Version: 13.3 LTS ML \nThis setup is optimized for stability and performance. \n### Permissions requirement \nTo allow RAG Studio to provision these clusters automatically, ensure that your Databricks account has the necessary permissions to create job clusters with the above properties.\n\n### Cluster Configuration for RAG Studio\n#### Use an existing interactive cluster\n\nIf you prefer to use an existing interactive cluster for RAG Studio tasks, specify the cluster ID when you run `rag`, for example: \n```\n.\/rag create-rag-version -e dev --cluster-id <your-cluster-id>\n\n``` \nTo identify a cluster\u2019s ID, see [Cluster URL and ID](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#cluster-url-and-id). \nAlternatively, you can specify a cluster ID in your `rag-config.yml` configuration file. This method is useful for setting a default cluster for all RAG Studio operations within a specific environment. 
Add the `cluster_id` field under the appropriate environment section, as shown below: \n```\ndevelopment:\n- name: dev\n...\ncluster_id: <your_cluster_id>\n\n``` \nCluster override is only supported for the `dev` environment.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/setup\/clusters.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n#### PySpark on Databricks\n\nThis article describes the fundamentals of PySpark, a Python API for Spark, on Databricks. \nDatabricks is built on top of [Apache Spark](https:\/\/docs.databricks.com\/spark\/index.html), a unified analytics engine for big data and machine learning. PySpark helps you interface with Apache Spark using the Python programming language, which is a flexible language that is easy to learn, implement, and maintain. It also provides many options for [data visualization](https:\/\/docs.databricks.com\/visualizations\/index.html) in Databricks. PySpark combines the power of Python and Apache Spark.\n\n","doc_uri":"https:\/\/docs.databricks.com\/pyspark\/index.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n#### PySpark on Databricks\n##### APIs and libraries\n\nAs with all APIs for Spark, PySpark comes equipped with many APIs and libraries that enable and support powerful functionality, including: \n* Processing of structured data with relational queries with **Spark SQL and DataFrames**. Spark SQL allows you to mix SQL queries with Spark programs. With Spark DataFrames, you can efficiently read, write, transform, and analyze data using Python and SQL, which means you are always leveraging the full power of Spark. See [PySpark Getting Started](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/getting_started\/index.html).\n* Scalable processing of streams with **Structured Streaming**. 
You can express your streaming computation the same way you would express a batch computation on static data, and the Spark SQL engine runs it incrementally and continuously as streaming data continues to arrive. See [Structured Streaming Overview](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html#overview).\n* Pandas data structures and data analysis tools that work on Apache Spark with **Pandas API on Spark**. Pandas API on Spark allows you to scale your pandas workload to any size by running it distributed across multiple nodes, with a single codebase that works with pandas (tests, smaller datasets) and with Spark (production, distributed datasets). See [Pandas API on Spark Overview](https:\/\/spark.apache.org\/pandas-on-spark\/).\n* Machine learning algorithms with **Machine Learning (MLlib)**. MLlib is a scalable machine learning library built on Spark that provides a uniform set of APIs that help users create and tune practical machine learning pipelines. See [Machine Learning Library Overview](https:\/\/spark.apache.org\/docs\/latest\/ml-guide.html#overview).\n* Graphs and graph-parallel computation with **GraphX**. GraphX introduces a new directed multigraph with properties attached to each vertex and edge, and exposes graph computation operators, algorithms, and builders to simplify graph analytics tasks. See [GraphX Overview](https:\/\/spark.apache.org\/docs\/latest\/graphx-programming-guide.html#overview).\n\n","doc_uri":"https:\/\/docs.databricks.com\/pyspark\/index.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n#### PySpark on Databricks\n##### DataFrames, transformations, and lazy evaluation\n\nApache Spark DataFrames are datasets organized into named columns. They are two-dimensional labeled data structures with columns of different types. 
DataFrames provide a rich set of functions that allow you to solve common data analysis problems efficiently, and they make it easy to transform data with built-in methods to sort, filter, and aggregate data. \nFundamental to Apache Spark are two categories of data processing operations: transformations and actions. An action operation returns a value, such as `count`, `first`, and `collect`. A transformation operation, such as `filter` or `groupBy`, returns a DataFrame but it doesn\u2019t execute until an action triggers it. This is known as lazy evaluation. Lazy evaluation also allows you to chain multiple operations because Spark handles their execution in a deferred manner, rather than immediately executing them when they are defined.\n\n#### PySpark on Databricks\n##### Spark tutorials\n\nIn addition to the [Apache Spark Tutorial](https:\/\/docs.databricks.com\/getting-started\/dataframes.html) which walks you through loading and transforming data using DataFrames, the [Apache Spark documentation](https:\/\/spark.apache.org\/docs\/latest) also has quickstarts and guides for learning Spark, including the following articles: \n* [PySpark DataFrames QuickStart](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/getting_started\/quickstart_df.html)\n* [Spark SQL Getting Started](https:\/\/spark.apache.org\/docs\/latest\/sql-getting-started.html)\n* [Structured Streaming Programming Guide](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html)\n* [Pandas API on Spark QuickStart](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/getting_started\/quickstart_ps.html)\n* [Machine Learning Library Programming Guide](https:\/\/spark.apache.org\/docs\/latest\/ml-guide.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/pyspark\/index.html"} +{"content":"# Develop on Databricks\n## Databricks for Python developers\n#### PySpark on Databricks\n##### PySpark reference\n\nDatabricks maintains its own version of the PySpark APIs and 
corresponding reference, which can be found in these sections: \n* [Spark SQL Reference](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/index.html)\n* [Pandas API on Spark Reference](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.pandas\/index.html)\n* [Structured Streaming API Reference](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.ss\/index.html)\n* [MLlib (DataFrame-based) API Reference](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.ml.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/pyspark\/index.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Databricks from Microsoft Excel\n\nThis article describes how to use the Databricks ODBC driver to connect Databricks to Microsoft Excel. After you establish the connection, you can access the data in Databricks from Excel. You can also use Excel to further analyze the data.\n\n#### Connect to Databricks from Microsoft Excel\n##### Before you begin\n\n* [Workspace creation options](https:\/\/docs.databricks.com\/admin\/workspace\/index.html#create-a-workspace).\n* Create a Databricks cluster and associate data with your cluster. See [Run your first ETL workload on Databricks](https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html).\n* [Download the 64-bit version of the ODBC driver for your OS](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html).\n* Install and configure the ODBC driver ([Windows](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html#odbc-windows) | [MacOS](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html#odbc-mac) | [Linux](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html#odbc-linux)). 
This sets up a Data Source Name (DSN) configuration that you can use to connect Databricks to Microsoft Excel.\n* [Token management API](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement).\n* Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html#pat-user).\n* Install Microsoft Excel. You can [use a trial version](https:\/\/products.office.com\/excel).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/excel.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Databricks from Microsoft Excel\n##### Connect from Microsoft Excel\n\nThis section describes how to pull data from Databricks into Microsoft Excel using the DSN you created in the prerequisites. \n### Steps to connect using OAuth 2.0 \nNote \nThe steps in this section were tested using Excel for Microsoft 365 for Windows Server 2022 Datacenter 64 bit. \nThe following steps let a user connect to Databricks with a single sign-on experience. \n1. Launch **ODBC Data Sources**.\n2. Go to the **System DSN** tab and select the **Simba Spark** entry in the DSN list (or create a new DSN by following the instructions on the [Microsoft site](https:\/\/support.microsoft.com\/en-us\/office\/administer-odbc-data-sources-b19f856b-5b9b-48c9-8b93-07484bfab5a7)).\n3. Click the **Configure** button. The following pop-up window appears. \n![Spark DSN](https:\/\/docs.databricks.com\/_images\/spark-dsn.png) \n4. For **Mechanism**, select **OAuth 2.0**.\n5. Click the **OAuth Options** button. The following **OAuth Option** pop-up window appears. \n![OAuth Options](https:\/\/docs.databricks.com\/_images\/oauth-options.png) \n6. Select **Browser Based Authorization Code** and uncheck **IGNORE\\_SQLDRIVER\\_NOPROMPT**. Close the pop-up window.\n7. Click the **HTTP Options** button and enter the HTTP path in the pop-up window. 
\n![HTTP Options](https:\/\/docs.databricks.com\/_images\/http-options.png) \n8. Close the **HTTP Options** pop-up window. Click the **Advanced Options** button, then click the **Server Side Properties** button in the pop-up window. \n![Advanced Options](https:\/\/docs.databricks.com\/_images\/advanced-options.png) \n9. Add a server side property **Auth\\_Flow** with the value **2**. \n![Server Side Properties](https:\/\/docs.databricks.com\/_images\/server-side-properties.png) \n![Add a Server Side Property](https:\/\/docs.databricks.com\/_images\/add-server-side-properties.png) \nNow you have successfully configured an ODBC DSN. \n10. Launch Microsoft Excel and create a new blank workbook. Select **Data** > **Get Data** > **From Other Sources** > **From ODBC**. \n![ODBC Data Source](https:\/\/docs.databricks.com\/_images\/datasource-odbc.png) \n11. Select the DSN you just configured. \n![From ODBC](https:\/\/docs.databricks.com\/_images\/from-odbc.png) \nClick the **OK** button to connect. You will be prompted to authenticate yourself in a browser pop-up window. \n#### Connect using a connection URL with OAuth 2.0 \nNote \nThe steps in this section were tested using Excel for Microsoft 365 for Windows version 11. \nYou can also directly connect Excel to Databricks using a connection URL. The connection URL is in the following format: \n```\nDriver=Simba Spark ODBC Driver;Host=<hostName>;Port=443;HttpPath=<httpPath>;SSL=1;AuthMech=11;Auth_Flow=2;Catalog=samples;Schema=default\n\n``` \n1. Launch Excel and select **Data** > **Get Data** > **From Other Sources** > **From ODBC**.\n2. In **Data source name (DSN)**, select **Simba Spark**.\n3. Expand the **Advanced options** section.\n4. Enter the above connection URL in the **Connection string (non-credential properties)(optional)** text box. \n![From ODBC Connection URL](https:\/\/docs.databricks.com\/_images\/fromt-odbc-connection-url.png) \n5. Click the **OK** button. 
If you are asked for a username and password in the pop-up window, you can enter any placeholder values to proceed. \n![Random Username and Password](https:\/\/docs.databricks.com\/_images\/random-username.png) \n6. Click the **Connect** button. A browser pop-up window should prompt you to authenticate through OAuth 2.0. \n### Steps to connect using Databricks personal access token \nNote \nThe steps in this section were tested using Excel for Microsoft 365 for Mac version 16.70. \n1. Open a blank workbook in Microsoft Excel.\n2. In the **Data** ribbon, click the down caret next to **Get Data (Power Query)**, then click **From database (Microsoft Query)**.\n3. In the **iODBC Data Source Chooser**, select the DSN that you created in the prerequisites, and then click **OK**.\n4. For **Username**, enter `token`.\n5. For **Password**, enter your personal access token from the prerequisites.\n6. In the **Microsoft Query** dialog, select the Databricks table that you want to load into Excel, and then click **Return Data**.\n7. In the **Import Data** dialog, select **Table** and **Existing sheet**, and then click **Import**. 
\nAfter you load your data into your Excel workbook, you can perform analytical operations on it.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/excel.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n#### Best practices for deep learning on Databricks\n\nThis article includes tips for deep learning on Databricks and information about built-in tools and libraries designed to optimize deep learning workloads such as the following: \n* [Delta](https:\/\/docs.databricks.com\/delta\/index.html) and [Petastorm](https:\/\/docs.databricks.com\/machine-learning\/load-data\/petastorm.html) to load data\n* [Horovod](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-runner.html) and [Hyperopt](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/index.html#hyperparameter-tuning-with-hyperopt) to parallelize training\n* [Pandas UDFs](https:\/\/docs.databricks.com\/udf\/pandas.html) for inference \nDatabricks Machine Learning provides pre-built deep learning infrastructure with Databricks Runtime for Machine Learning, which includes the most common deep learning libraries like TensorFlow, PyTorch, and Keras. It also has built-in, pre-configured GPU support including drivers and supporting libraries. 
\nDatabricks Runtime ML also includes all of the capabilities of the Databricks workspace, such as cluster creation and management, library and environment management, code management with Databricks Git folders, automation support including Databricks Jobs and APIs, and integrated MLflow for model development tracking and model deployment and serving.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/dl-best-practices.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n#### Best practices for deep learning on Databricks\n##### Resource and environment management\n\nDatabricks helps you to both customize your deep learning environment and keep the environment consistent across users. \n### Customize the development environment \nWith Databricks Runtime, you can customize your development environment at the notebook, cluster, and job levels. \n* Use [notebook-scoped Python libraries](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html) or [notebook-scoped R libraries](https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html) to use a specific set or version of libraries without affecting other cluster users.\n* [Install libraries at the cluster level](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html) to standardize versions for a team or a project.\n* Set up a Databricks [job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) to ensure that a repeated task runs in a consistent, unchanging environment. \n### Use cluster policies \nYou can create [cluster policies](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html) to guide data scientists to the right choices, such as using a Single Node cluster for development and using an autoscaling cluster for large jobs. 
\n### Consider A100 GPUs for deep learning workloads \nA100 GPUs are an efficient choice for many deep learning tasks, such as training and tuning large language models, natural language processing, object detection and classification, and recommendation engines. \n* Databricks supports A100 GPUs on all clouds. For the complete list of supported GPU types, see [Supported instance types](https:\/\/docs.databricks.com\/compute\/gpu.html#gpu-list).\n* A100 GPUs usually have limited availability. Contact your cloud provider for resource allocation, or consider reserving capacity in advance.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/dl-best-practices.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n#### Best practices for deep learning on Databricks\n##### Best practices for loading data\n\nCloud data storage is typically not optimized for I\/O, which can be a challenge for deep learning models that require large datasets. Databricks Runtime ML includes [Delta Lake](https:\/\/docs.databricks.com\/delta\/index.html) and [Petastorm](https:\/\/docs.databricks.com\/machine-learning\/load-data\/petastorm.html) to optimize data throughput for deep learning applications. \nDatabricks recommends using Delta Lake tables for data storage. Delta Lake simplifies ETL and lets you access data efficiently. Especially for images, Delta Lake helps optimize ingestion for both training and inference. The [reference solution for image applications](https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/images-etl-inference.html) provides an example of optimizing ETL for images using Delta Lake. \nPetastorm provides APIs that let you prepare data in parquet format for use by TensorFlow, Keras, or PyTorch. The SparkConverter API provides Spark DataFrame integration. Petastorm also provides data sharding for distributed processing. 
See [Load data using Petastorm](https:\/\/docs.databricks.com\/machine-learning\/load-data\/petastorm.html) for details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/dl-best-practices.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n#### Best practices for deep learning on Databricks\n##### Best practices for training deep learning models\n\nDatabricks recommends using [Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html), [MLflow tracking](https:\/\/docs.databricks.com\/mlflow\/tracking.html), and [autologging](https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html) for all model training. \n### Start with a Single Node cluster \nA [Single Node](https:\/\/docs.databricks.com\/compute\/configure.html#single-node) (driver only) [GPU cluster](https:\/\/docs.databricks.com\/compute\/gpu.html) is typically fastest and most cost-effective for deep learning model development. One node with 4 GPUs is likely to be faster for deep learning training than 4 worker nodes with 1 GPU each. This is because distributed training incurs network communication overhead. \nA Single Node cluster is a good option during fast, iterative development and for training models on small- to medium-size data. If your dataset is large enough to make training slow on a single machine, consider moving to multi-GPU and even distributed compute. \n### Use TensorBoard and cluster metrics to monitor the training process \nTensorBoard is preinstalled in Databricks Runtime ML. You can use it within a notebook or in a separate tab. See [TensorBoard](https:\/\/docs.databricks.com\/machine-learning\/train-model\/tensorboard.html) for details. \nCluster metrics are available in all Databricks runtimes. You can examine network, processor, and memory usage to identify bottlenecks. See [cluster metrics](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#metrics) for details. 
\n### Optimize performance for deep learning \nYou can, and should, use deep learning performance optimization techniques on Databricks. \n#### Early stopping \nEarly stopping monitors the value of a metric calculated on the validation set and stops training when the metric stops improving. This is a better approach than guessing at a good number of epochs to complete. Each deep learning library provides a native API for early stopping; for example, see the EarlyStopping callback APIs for [TensorFlow\/Keras](https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/keras\/callbacks\/EarlyStopping) and for [PyTorch Lightning](https:\/\/pytorch-lightning.readthedocs.io\/en\/latest\/common\/early_stopping.html). For an example notebook, see [TensorFlow Keras example notebook](https:\/\/docs.databricks.com\/machine-learning\/train-model\/tensorflow.html#tensorflow-keras-example-notebook). \n#### Batch size tuning \nBatch size tuning helps optimize GPU utilization. If the batch size is too small, the calculations cannot fully use the GPU capabilities. You can use [cluster metrics](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#metrics) to view GPU metrics. \nAdjust the batch size in conjunction with the learning rate. A good rule of thumb: when you increase the batch size by a factor of n, increase the learning rate by a factor of sqrt(n). When tuning manually, try changing the batch size by a factor of 2 or 0.5. Then continue tuning to optimize performance, either manually or by testing a variety of hyperparameters using an automated tool like [Hyperopt](https:\/\/docs.databricks.com\/machine-learning\/train-model\/dl-best-practices.html#hyperopt). \n#### Transfer learning \nWith transfer learning, you start with a previously trained model and modify it as needed for your application. Transfer learning can significantly reduce the time required to train and tune a new model. 
See [Featurization for transfer learning](https:\/\/docs.databricks.com\/machine-learning\/preprocess-data\/transfer-learning-tensorflow.html) for more information and an example. \n### Move to distributed training \nDatabricks Runtime ML includes HorovodRunner, `spark-tensorflow-distributor`, TorchDistributor, and Hyperopt to facilitate the move from single-node to distributed training. \n#### HorovodRunner \nHorovod is an open-source project that scales deep learning training to multi-GPU or distributed computation. [HorovodRunner](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-runner.html), built by Databricks and included in Databricks Runtime ML, is a Horovod wrapper that provides Spark compatibility. The API lets you scale single-node code with minimal changes. HorovodRunner works with TensorFlow, Keras, and PyTorch. \n#### `spark-tensorflow-distributor` \n`spark-tensorflow-distributor` is an open-source native package in TensorFlow for distributed training with TensorFlow on Spark clusters. See the [example notebook](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/spark-tf-distributor.html). \n#### TorchDistributor \nTorchDistributor is an open-source module in PySpark that facilitates distributed training with PyTorch on Spark clusters by letting you launch PyTorch training jobs as Spark jobs. See [Distributed training with TorchDistributor](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/spark-pytorch-distributor.html). \n#### Hyperopt \n[Hyperopt](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/index.html#hyperparameter-tuning-with-hyperopt) provides adaptive hyperparameter tuning for machine learning. 
With the SparkTrials class, you can iteratively tune parameters for deep learning models in parallel across a cluster.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/dl-best-practices.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deep learning\n#### Best practices for deep learning on Databricks\n##### Best practices for inference\n\nThis section contains general tips about using models for inference with Databricks. \n* To minimize costs, consider both CPUs and inference-optimized GPUs such as the Amazon EC2 G4 and G5 instances. There is no clear recommendation, as the best choice depends on model size, data dimensions, and other variables. \n* Use [MLflow](https:\/\/docs.databricks.com\/mlflow\/index.html) to simplify deployment and model serving. MLflow can log any deep learning model, including custom preprocessing and postprocessing logic. [Models in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html) or models registered in the [Workspace Model Registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html) can be deployed for batch, streaming, or online inference. \n### Online serving \nThe best option for low-latency serving is online serving behind a REST API. Databricks provides [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) for online inference. Model Serving provides a unified interface to deploy, govern, and query AI models and supports serving the following: \n* [Custom models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/custom-models.html). These are Python models packaged in the MLflow format. Examples include scikit-learn, XGBoost, PyTorch, and Hugging Face transformer models.\n* State-of-the-art open models made available by [Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html). 
These models are curated foundation model architectures that support optimized inference. For example, base models like Llama-2-70B-chat, BGE-Large, and Mistral-7B are available for immediate use with **pay-per-token** pricing. For workloads that require performance guarantees and fine-tuned model variants, you can deploy them with **provisioned throughput**.\n* [External models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html). These are models that are hosted outside of Databricks. Examples include foundation models such as OpenAI\u2019s GPT-4 and Anthropic\u2019s Claude. Endpoints that serve these models can be centrally governed, and customers can establish rate limits and access control for them. \nAlternatively, MLflow provides [APIs](https:\/\/mlflow.org\/docs\/latest\/python_api\/index.html) for deploying to various managed services for online inference, as well as [APIs for creating Docker containers](https:\/\/mlflow.org\/docs\/latest\/cli.html#mlflow-models-build-docker) for custom serving solutions. \nYou can also use SageMaker Serving. See the [example notebook](https:\/\/docs.databricks.com\/mlflow\/scikit-learn-model-deployment-on-sagemaker.html) and the [MLflow documentation](https:\/\/mlflow.org\/docs\/latest\/models.html#deploy-a-python-function-model-on-amazon-sagemaker). \n### Batch and streaming inference \nBatch and streaming scoring support high-throughput, low-cost scoring at latencies as low as minutes. For more information, see [Use MLflow for model inference](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/index.html#offline-batch-predictions). \n* If you expect to access data for inference more than once, consider creating a preprocessing job to ETL the data into a Delta Lake table before running the inference job. This way, the cost of ingesting and preparing the data is spread across multiple reads of the data. 
Separating preprocessing from inference also allows you to select different hardware for each job to optimize cost and performance. For example, you might use CPUs for ETL and GPUs for inference.\n* Use [Spark Pandas UDFs](https:\/\/docs.databricks.com\/udf\/pandas.html) to scale batch and streaming inference across a cluster. \n+ When you log a model from Databricks, MLflow automatically provides inference code to [apply the model as a pandas UDF](https:\/\/docs.databricks.com\/mlflow\/runs.html#code-snippets-for-prediction).\n+ You can also optimize your inference pipeline further, especially for large deep learning models. See the [reference solution for image ETL](https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/images-etl-inference.html) for an example.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/dl-best-practices.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Use foreachBatch to write to arbitrary data sinks\n\nThis article discusses using `foreachBatch` with Structured Streaming to write the output of a streaming query to data sources that do not have an existing streaming sink. \nThe code pattern `streamingDF.writeStream.foreachBatch(...)` allows you to apply batch functions to the output data of every micro-batch of the streaming query. Functions used with `foreachBatch` take two parameters: \n* A DataFrame that has the output data of a micro-batch.\n* The unique ID of the micro-batch. \nYou must use `foreachBatch` for Delta Lake merge operations in Structured Streaming. 
See [Upsert from streaming queries using foreachBatch](https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html#merge-in-streaming).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/foreach.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Use foreachBatch to write to arbitrary data sinks\n##### Apply additional DataFrame operations\n\nMany DataFrame and Dataset operations are not supported in streaming DataFrames because Spark does not support generating incremental plans in those cases. Using `foreachBatch()`, you can apply some of these operations on each micro-batch output. For example, you can use `foreachBatch()` and the SQL `MERGE INTO` operation to write the output of streaming aggregations into a Delta table in update mode. See more details in [MERGE INTO](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-merge-into.html). \nImportant \n* `foreachBatch()` provides only at-least-once write guarantees. However, you can use the `batchId` provided to the function as a way to deduplicate the output and get an exactly-once guarantee. In either case, you will have to reason about the end-to-end semantics yourself.\n* `foreachBatch()` does not work with the [continuous processing mode](https:\/\/databricks.com\/blog\/2018\/03\/20\/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html) as it fundamentally relies on the micro-batch execution of a streaming query. If you write data in continuous mode, use `foreach()` instead. \n`foreachBatch()` can be invoked with an empty DataFrame, so user code must handle that case to operate correctly. 
An example is shown here: \n```\n.foreachBatch(\n  (outputDf: DataFrame, bid: Long) => {\n    \/\/ Process valid data frames only\n    if (!outputDf.isEmpty) {\n      \/\/ business logic\n    }\n  }\n).start()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/foreach.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Use foreachBatch to write to arbitrary data sinks\n##### Behavior changes for `foreachBatch` in Databricks Runtime 14.0\n\nIn Databricks Runtime 14.0 and above on compute configured with shared access mode, `foreachBatch` runs in a separate isolated Python process on Apache Spark, rather than in the REPL environment. It is serialized and pushed to Spark and does not have access to global `spark` objects for the duration of the session. \nIn all other compute configurations, `foreachBatch` runs in the same Python REPL that runs the rest of your code. As a result, the function is not serialized. \nWhen you use Databricks Runtime 14.0 and above on compute configured with shared access mode, you must use the `sparkSession` variable scoped to the local DataFrame when using `foreachBatch` in Python, as in the following code example: \n```\ndef example_function(df, batch_id):\n    # Use the session attached to the micro-batch DataFrame, not the global `spark` object\n    df.sparkSession.sql(\"<query>\")\n\n``` \nThe following behavior changes apply: \n* You cannot access any global Python variables from within your function.\n* `print()` commands write output to the driver logs.\n* Any files, modules, or objects referenced in the function must be serializable and available on Spark.\n\n#### Use foreachBatch to write to arbitrary data sinks\n##### Reuse existing batch data sources\n\nUsing `foreachBatch()`, you can use existing batch data writers for data sinks that might not have Structured Streaming support. 
Here are a few examples: \n* [Cassandra Scala example](https:\/\/docs.databricks.com\/structured-streaming\/examples.html#foreachbatch-cassandra-example)\n* [Azure Synapse Analytics Python example](https:\/\/docs.databricks.com\/structured-streaming\/examples.html#foreachbatch-sqldw-example) \nMany other batch data sources can be used from `foreachBatch()`. See [Connect to data sources](https:\/\/docs.databricks.com\/connect\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/foreach.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Use foreachBatch to write to arbitrary data sinks\n##### Write to multiple locations\n\nIf you need to write the output of a streaming query to multiple locations, Databricks recommends using multiple Structured Streaming writers for best parallelization and throughput. \nUsing `foreachBatch` to write to multiple sinks serializes the execution of streaming writes, which can increase latency for each micro-batch. \nIf you do use `foreachBatch` to write to multiple Delta tables, see [Idempotent table writes in foreachBatch](https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html#idempot-write).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/foreach.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Stream from Unity Catalog views\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nIn Databricks Runtime 14.1 and above, you can use Structured Streaming to perform streaming reads from views registered with Unity Catalog. 
Databricks only supports streaming reads from views defined against Delta tables.\n\n#### Stream from Unity Catalog views\n##### Read a view as a stream\n\nTo read a view with Structured Streaming, provide the identifier for the view to the `.table()` method, as in the following example: \n```\ndf = (spark.readStream\n.table(\"demoView\")\n)\n\n``` \nUsers must have `SELECT` privileges on the target view.\n\n#### Stream from Unity Catalog views\n##### Supported options for configuring streaming reads against views\n\nThe following options are supported when configuring streaming reads against views: \n* `maxFilesPerTrigger`\n* `maxBytesPerTrigger`\n* `ignoreDeletes`\n* `skipChangeCommits`\n* `withEventTimeOrder`\n* `startingTimestamp`\n* `startingVersion` \nThe streaming reader applies these options to the files and metadata defining the underlying Delta tables. \nImportant \nReads against views defined with `UNION ALL` do not support the options `withEventTimeOrder` and `startingVersion`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/views.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Stream from Unity Catalog views\n##### Supported operations in source views\n\nNot all views support streaming reads. Unsupported operations in source views include aggregations and sorting. \nThe following list provides descriptions and example view definitions for supported operations: \n* **Project** \n+ Description: Controls column-level permissions\n+ Operator: `SELECT... 
FROM...`\n+ Example statement: \n```\nCREATE VIEW project_view AS\nSELECT id, value\nFROM source_table\n\n```\n* **Filter** \n+ Description: Controls row-level permissions\n+ Operator: `WHERE...`\n+ Example statement: \n```\nCREATE VIEW filter_view AS\nSELECT * FROM source_table\nWHERE value > 100\n\n```\n* **Union all** \n+ Description: Results from multiple tables\n+ Operator: `UNION ALL`\n+ Example statement: \n```\nCREATE VIEW union_view AS\nSELECT id, value FROM source_table1\nUNION ALL\nSELECT * FROM source_table2\n\n``` \nNote \nYou cannot modify the view definition to add or change the tables referenced in the view and use the same streaming checkpoint.\n\n#### Stream from Unity Catalog views\n##### Limitations\n\nThe following limitations apply: \n* You can only stream from views backed by Delta tables. Views defined against other data sources are not supported.\n* You must register views with Unity Catalog.\n* The following exception displays if you stream from a view with an unsupported operator: \n```\nUnsupportedOperationException: [UNEXPECTED_OPERATOR_IN_STREAMING_VIEW] Unexpected operator <operator> in the CREATE VIEW statement as a streaming source. A streaming view query must consist only of SELECT, WHERE, and UNION ALL operations.\n\n```\n* The following exception displays if you provide unsupported options:\n```\nAnalysisException: [UNSUPPORTED\\_STREAMING\\_OPTIONS\\_FOR\\_VIEW.UNSUPPORTED\\_OPTION] Unsupported for streaming a view. 
Reason: option <option> is not supported.\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/views.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/bamboolib.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### bamboolib\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nNote \nbamboolib is supported in Databricks Runtime 11.3 LTS and above. \nbamboolib is a user interface component that allows no-code data analysis and transformations from within a Databricks [notebook](https:\/\/docs.databricks.com\/notebooks\/index.html). bamboolib helps users more easily work with their data and speeds up common data wrangling, exploration, and visualization tasks. As users complete these kinds of tasks with their data, bamboolib automatically generates [Python](https:\/\/docs.databricks.com\/languages\/python.html) code in the background. Users can share this code with others, who can run this code in their own notebooks to quickly reproduce those original tasks. They can also use bamboolib to extend those original tasks with additional data tasks, all without needing to know how to code. Those who are experienced with coding can extend this code to create even more sophisticated results. \nBehind the scenes, bamboolib uses [ipywidgets](https:\/\/ipywidgets.readthedocs.io\/en\/latest\/), which is an interactive HTML widget framework for the [IPython kernel](https:\/\/docs.databricks.com\/notebooks\/ipython-kernel.html). ipywidgets runs inside of the [IPython kernel](https:\/\/docs.databricks.com\/notebooks\/ipython-kernel.html). 
\nContents \n* [Requirements](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#requirements)\n* [Quickstart](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#quickstart)\n* [Walkthroughs](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#walkthroughs)\n* [Key tasks](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#key-tasks)\n* [Limitations](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#limitations)\n* [Additional resources](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#additional-resources)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/bamboolib.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### bamboolib\n##### [Requirements](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id5)\n\n* A Databricks [notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook), which is [attached](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) to a Databricks [cluster](https:\/\/docs.databricks.com\/compute\/configure.html) with [Databricks Runtime](https:\/\/docs.databricks.com\/compute\/configure.html#version) 11.0 or above.\n* The `bamboolib` library must be available to the notebook. \n+ To install the library from PyPI only on a specific cluster, see [Cluster libraries](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html).\n+ To use the `%pip` command to make the library available only to a specific notebook, see [Notebook-scoped Python libraries](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/bamboolib.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### bamboolib\n##### [Quickstart](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id6)\n\n1. 
[Create](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) a Python notebook.\n2. [Attach](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) the notebook to a cluster that meets the [requirements](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#requirements).\n3. In the notebook\u2019s first [cell](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#add-a-cell), enter the following code, and then [run](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html) the cell. This step can be skipped if bamboolib is [already installed in the workspace or cluster](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#requirements). \n```\n%pip install bamboolib\n\n```\n4. In the notebook\u2019s second cell, enter the following code, and then run the cell. \n```\nimport bamboolib as bam\n\n```\n5. In the notebook\u2019s third cell, enter the following code, and then run the cell. \n```\nbam\n\n``` \nNote \nAlternatively, you can [print an existing pandas DataFrame](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#existing-dataframe) to display bamboolib for use with that specific DataFrame.\n6. Continue with [key tasks](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#key-tasks).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/bamboolib.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### bamboolib\n##### [Walkthroughs](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id7)\n\nYou can use bamboolib by itself or [with an existing pandas DataFrame](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#existing-dataframe). \n### Use bamboolib by itself \nIn this walkthrough, you use bamboolib to display in your notebook the contents of an example sales data set. You then experiment with some of the related notebook code that bamboolib automatically generates for you. 
You finish by querying and sorting a copy of the sales data set\u2019s contents. \n1. [Create](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) a Python notebook.\n2. [Attach](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) the notebook to a cluster that meets the [requirements](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#requirements).\n3. In the notebook\u2019s first [cell](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#add-a-cell), enter the following code, and then [run](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html) the cell. This step can be skipped if bamboolib is [already installed in the workspace or cluster](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#requirements). \n```\n%pip install bamboolib\n\n```\n4. In the notebook\u2019s second cell, enter the following code, and then run the cell. \n```\nimport bamboolib as bam\n\n```\n5. In the notebook\u2019s third cell, enter the following code, and then run the cell. \n```\nbam\n\n```\n6. Click **Load dummy data**.\n7. In the **Load dummy data** pane, for **Load a dummy data set for testing bamboolib**, select **Sales dataset**.\n8. Click **Execute**.\n9. Display all of the rows where **item\\_type** is **Baby Food**: \n1. In the **Search actions** list, select **Filter rows**.\n2. In the **Filter rows** pane, in the **Choose** list (above **where**), select **Select rows**.\n3. In the list below **where**, select **item\\_type**.\n4. In the **Choose** list next to **item\\_type**, select **has value(s)**.\n5. In the **Choose value(s)** box next to **has value(s)**, select **Baby Food**.\n6. Click **Execute**.\n10. Copy the automatically generated Python code for this query: \n1. Click **Copy Code** below the data preview.\n11. Paste and modify the code: \n1. In the notebook\u2019s fourth cell, paste the code that you copied. 
It should look like this: \n```\nimport pandas as pd\ndf = pd.read_csv(bam.sales_csv)\n# Step: Keep rows where item_type is one of: Baby Food\ndf = df.loc[df['item_type'].isin(['Baby Food'])]\n\n```\n2. Add to this code so that it displays only those rows where **order\\_prio** is **C**, and then run the cell: \n```\nimport pandas as pd\ndf = pd.read_csv(bam.sales_csv)\n# Step: Keep rows where item_type is one of: Baby Food\ndf = df.loc[df['item_type'].isin(['Baby Food'])]\n\n# Add the following code.\n# Step: Keep rows where order_prio is one of: C\ndf = df.loc[df['order_prio'].isin(['C'])]\ndf\n\n```\nTip \nInstead of writing this code, you can also do the same thing by just using bamboolib in the third cell to display only those rows where **order\\_prio** is **C**. This step is an example of extending the code that bamboolib automatically generated earlier.\n12. Sort the rows by **region** in ascending order: \n1. In the widget within the fourth cell, in the **Search actions** list, select **Sort rows**.\n2. In the **Sort column(s)** pane, in the **Choose column** list, select **region**.\n3. In the list next to **region**, select **ascending (A-Z)**.\n4. Click **Execute**.\nNote \nThis is equivalent to writing the following code yourself: \n```\ndf = df.sort_values(by=['region'], ascending=[True])\ndf\n\n``` \nYou could have also just used bamboolib in the third cell to sort the rows by **region** in ascending order. This step demonstrates how you can use bamboolib to extend the code that you write. As you use bamboolib, it automatically generates the additional code for you in the background, so that you can further extend your already-extended code!\n13. Continue with [key tasks](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#key-tasks). 
\n### Use bamboolib with an existing DataFrame \nIn this walkthrough, you use bamboolib to display in your notebook the contents of a [pandas DataFrame](https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.html). This DataFrame contains a copy of an example sales data set. You then experiment with some of the related notebook code that bamboolib automatically generates for you. You finish by querying and sorting some of the DataFrame\u2019s contents. \n1. [Create](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) a Python notebook.\n2. [Attach](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) the notebook to a cluster that meets the [requirements](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#requirements).\n3. In the notebook\u2019s first [cell](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#add-a-cell), enter the following code, and then [run](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html) the cell. This step can be skipped if bamboolib is [already installed in the workspace or cluster](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#requirements). \n```\n%pip install bamboolib\n\n```\n4. In the notebook\u2019s second cell, enter the following code, and then run the cell. \n```\nimport bamboolib as bam\n\n```\n5. In the notebook\u2019s third cell, enter the following code, and then run the cell. \n```\nimport pandas as pd\n\ndf = pd.read_csv(bam.sales_csv)\ndf\n\n``` \nNote that bamboolib only supports [pandas DataFrames](https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.html). To convert a PySpark DataFrame to a pandas DataFrame, call [toPandas](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.toPandas.html) on the PySpark DataFrame. 
To convert a Pandas API on Spark DataFrame to a pandas DataFrame, call [to\\_pandas](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.pandas\/api\/pyspark.pandas.DataFrame.to_pandas.html) on the Pandas API on Spark DataFrame.\n6. Click **Show bamboolib UI**.\n7. Display all of the rows where **item\\_type** is **Baby Food**: \n1. In the **Search actions** list, select **Filter rows**.\n2. In the **Filter rows** pane, in the **Choose** list (above **where**), select **Select rows**.\n3. In the list below **where**, select **item\\_type**.\n4. In the **Choose** list next to **item\\_type**, select **has value(s)**.\n5. In the **Choose value(s)** box next to **has value(s)**, select **Baby Food**.\n6. Click **Execute**.\n8. Copy the automatically generated Python code for this query. To do this, click **Copy Code** below the data preview.\n9. Paste and modify the code: \n1. In the notebook\u2019s fourth cell, paste the code that you copied. It should look like this: \n```\n# Step: Keep rows where item_type is one of: Baby Food\ndf = df.loc[df['item_type'].isin(['Baby Food'])]\n\n```\n2. Add to this code so that it displays only those rows where **order\\_prio** is **C**, and then run the cell: \n```\n# Step: Keep rows where item_type is one of: Baby Food\ndf = df.loc[df['item_type'].isin(['Baby Food'])]\n\n# Add the following code.\n# Step: Keep rows where order_prio is one of: C\ndf = df.loc[df['order_prio'].isin(['C'])]\ndf\n\n```\nTip \nInstead of writing this code, you can also do the same thing by just using bamboolib in the third cell to display only those rows where **order\\_prio** is **C**. This step is an example of extending the code that bamboolib automatically generated earlier.\n10. Sort the rows by **region** in ascending order: \n1. In the widget within the fourth cell, click **Sort rows**.\n2. In the **Sort column(s)** pane, in the **Choose column** list, select **region**.\n3. 
In the list next to **region**, select **ascending (A-Z)**.\n4. Click **Execute**.\nNote \nThis is equivalent to writing the following code yourself: \n```\ndf = df.sort_values(by=['region'], ascending=[True])\ndf\n\n``` \nYou could have also just used bamboolib in the third cell to sort the rows by **region** in ascending order. This step demonstrates how you can use bamboolib to extend the code that you write. As you use bamboolib, it automatically generates the additional code for you in the background, so that you can further extend your already-extended code!\n11. Continue with [key tasks](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#key-tasks).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/bamboolib.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### bamboolib\n##### [Key tasks](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id8)\n\nIn this section: \n* [Add the widget to a cell](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#add-the-widget-to-a-cell)\n* [Clear the widget](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#clear-the-widget)\n* [Data loading tasks](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#data-loading-tasks)\n* [Data action tasks](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#data-action-tasks)\n* [Data action history tasks](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#data-action-history-tasks)\n* [Get code to programmatically recreate the widget\u2019s current state as a DataFrame](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#get-code-to-programmatically-recreate-the-widgets-current-state-as-a-dataframe) \n### [Add the widget to a cell](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id11) \n**Scenario**: You want the bamboolib widget to display in a cell. \n1. 
Make sure the notebook meets the [requirements](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#requirements) for bamboolib.\n2. If bamboolib is not [already installed in the workspace or cluster](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#requirements) run the following code in a cell in the notebook, preferably in the first cell: \n```\n%pip install bamboolib\n\n```\n3. Run the following code in the notebook, preferably in the notebook\u2019s first or second cell: \n```\nimport bamboolib as bam\n\n```\n4. **Option 1**: In the cell where you want the widget to appear, add the following code, and then run the cell: \n```\nbam\n\n``` \nThe widget appears in the cell below the code. \nOr: \n**Option 2**: In a cell that contains a reference to a [pandas DataFrame](https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.html), print the DataFrame. For example, given the following DataFrame definition, run the cell: \n```\nimport pandas as pd\nfrom datetime import datetime, date\n\ndf = pd.DataFrame({\n'a': [ 1, 2, 3 ],\n'b': [ 2., 3., 4. ],\n'c': [ 'string1', 'string2', 'string3' ],\n'd': [ date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1) ],\n'e': [ datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), datetime(2000, 1, 3, 12, 0) ]\n})\n\ndf\n\n``` \nThe widget appears in the cell below the code. \nNote that bamboolib only supports [pandas DataFrames](https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.html). To convert a PySpark DataFrame to a pandas DataFrame, call [toPandas](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.toPandas.html) on the PySpark DataFrame. To convert a Pandas API on Spark DataFrame to a pandas DataFrame, call [to\\_pandas](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.pandas\/api\/pyspark.pandas.DataFrame.to_pandas.html) on the Pandas API on Spark DataFrame. 
\n### [Clear the widget](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id12) \n**Scenario**: You want to clear the contents of a widget and then read new data into the existing widget. \n**Option 1**: Run the following code within the cell that contains the target widget: \n```\nbam\n\n``` \nThe widget clears and then redisplays the **Databricks: Read CSV file from DBFS**, **Databricks: Load database table**, and **Load dummy data** buttons. \nNote \nIf the error `name 'bam' is not defined` appears, run the following code in the notebook (preferably in the notebook\u2019s first cell), and then try again: \n```\nimport bamboolib as bam\n\n``` \n**Option 2**: In a cell that contains a reference to a [pandas DataFrame](https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.html), print the DataFrame again by running the cell again. The widget clears and then displays the new data. \n### [Data loading tasks](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id13) \nIn this section: \n* [Read an example dataset\u2019s contents into the widget](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#read-an-example-datasets-contents-into-the-widget)\n* [Read a CSV file\u2019s contents into the widget](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#read-a-csv-files-contents-into-the-widget)\n* [Read a database table\u2019s contents into the widget](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#read-a-database-tables-contents-into-the-widget) \n#### [Read an example dataset\u2019s contents into the widget](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id17) \n**Scenario**: You want to read some example data into the widget, for example some pretend sales data, so that you can test out the widget\u2019s functionality. \n1. Click **Load dummy data**. 
\nNote \nIf **Load dummy data** is not visible, [clear the widget with Option 1](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#clear-widget) and try again.\n2. In the **Load dummy data** pane, for **Load a dummy data set for testing bamboolib**, select the name of the dataset that you want to load.\n3. For **Dataframe name**, enter a name for the programmatic identifier of the table\u2019s contents as a [DataFrame](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.html), or leave **df** as the default programmatic identifier.\n4. Click **Execute**. \nThe widget displays the contents of the dataset. \nTip \nYou can switch the current widget to display the contents of a different example dataset: \n1. In the current widget, click the **Load dummy data** tab.\n2. Follow the preceding steps to read the other example dataset\u2019s contents into the widget. \n#### [Read a CSV file\u2019s contents into the widget](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id18) \n**Scenario**: You want to read the contents of a CSV file within your Databricks workspace into the widget. \n1. Click **Databricks: Read CSV file from DBFS**. \nNote \nIf **Databricks: Read CSV file from DBFS** is not visible, [clear the widget with Option 1](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#clear-widget) and try again.\n2. In the **Read CSV from DBFS** pane, browse to the location that contains the target CSV file.\n3. Select the target CSV file.\n4. For **Dataframe name**, enter a name for the programmatic identifier of the CSV file\u2019s contents as a [DataFrame](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.html), or leave **df** as the default programmatic identifier.\n5. For **CSV value separator**, enter the character that separates values in the CSV file, or leave the **,** (comma) character as the default value separator.\n6. 
For **Decimal separator**, enter the character that separates decimals in the CSV file, or leave the **.** (dot) character as the default decimal separator.\n7. For **Row limit: read the first N rows - leave empty for no limit**, enter the maximum number of rows to read into the widget, or leave **100000** as the default number of rows, or leave this box empty to specify no row limit.\n8. Click **Open CSV file**. \nThe widget displays the contents of the CSV file, based on the settings that you specified. \nTip \nYou can switch the current widget to display the contents of a different CSV file: \n1. In the current widget, click the **Read CSV from DBFS** tab.\n2. Follow the preceding steps to read the other CSV file\u2019s contents into the widget. \n#### [Read a database table\u2019s contents into the widget](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id19) \n**Scenario**: You want to read the contents of a database table within your Databricks workspace into the widget. \n1. Click **Databricks: Load database table**. \nNote \nIf **Databricks: Load database table** is not visible, [clear the widget with Option 1](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#clear-widget) and try again.\n2. In the **Databricks: Load database table** pane, for **Database - leave empty for default database**, enter the name of the database in which the target table is located, or leave this box empty to specify the **default** database.\n3. For **Table**, enter the name of the target table.\n4. For **Row limit: read the first N rows - leave empty for no limit**, enter the maximum number of rows to read into the widget, or leave **100000** as the default number of rows, or leave this box empty to specify no row limit.\n5. 
For **Dataframe name**, enter a name for the programmatic identifier of the table\u2019s contents as a [DataFrame](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.html), or leave **df** as the default programmatic identifier.\n6. Click **Execute**. \nThe widget displays the contents of the table, based on the settings that you specified. \nTip \nYou can switch the current widget to display the contents of a different table: \n1. In the current widget, click the **Databricks: Load database table** tab.\n2. Follow the preceding steps to read the other table\u2019s contents into the widget. \n### [Data action tasks](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id14) \nbamboolib offers over 50 data actions. Following are some of the more common getting-started data action tasks. \nIn this section: \n* [Select columns](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#select-columns)\n* [Drop columns](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#drop-columns)\n* [Filter rows](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#filter-rows)\n* [Sort rows](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#sort-rows)\n* [Grouping rows and columns tasks](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#grouping-rows-and-columns-tasks)\n* [Remove rows with missing values](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#remove-rows-with-missing-values)\n* [Remove duplicated rows](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#remove-duplicated-rows)\n* [Find and replace missing values](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#find-and-replace-missing-values)\n* [Create a column formula](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#create-a-column-formula) \n#### [Select columns](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id20) \n**Scenario**: You want to show only specific table columns by name, by data type, or 
that match some regular expression. For example, in the dummy **Sales dataset**, you want to show only the `item_type` and `sales_channel` columns, or you want to show only the columns that contain the string `_date` in their column names. \n1. On the **Data** tab, in the **Search actions** drop-down list, do one of the following: \n* Type **select**, and then select **Select or drop columns**.\n* Select **Select or drop columns**.\n2. In the **Select or drop columns** pane, in the **Choose** drop-down list, select **Select**.\n3. Select the target column names or inclusion criterion.\n4. For **Dataframe name**, enter a name for the programmatic identifier of the table\u2019s contents as a [DataFrame](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.html), or leave **df** as the default programmatic identifier.\n5. Click **Execute**. \n#### [Drop columns](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id21) \n**Scenario**: You want to hide specific table columns by name, by data type, or that match some regular expression. For example, in the dummy **Sales dataset**, you want to hide the `order_prio`, `order_date`, and `ship_date` columns, or you want to hide all columns that contain only date-time values. \n1. On the **Data** tab, in the **Search actions** drop-down list, do one of the following: \n* Type **drop**, and then select **Select or drop columns**.\n* Select **Select or drop columns**.\n2. In the **Select or drop columns** pane, in the **Choose** drop-down list, select **Drop**.\n3. Select the target column names or inclusion criterion.\n4. For **Dataframe name**, enter a name for the programmatic identifier of the table\u2019s contents as a [DataFrame](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.html), or leave **df** as the default programmatic identifier.\n5. Click **Execute**. 
\n#### [Filter rows](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id22) \n**Scenario**: You want to show or hide specific table rows based on criteria such as specific column values that are matching or missing. For example, in the dummy **Sales dataset**, you want to show only those rows where the `item_type` column\u2019s value is set to `Baby Food`. \n1. On the **Data** tab, in the **Search actions** drop-down list, do one of the following: \n* Type **filter**, and then select **Filter rows**.\n* Select **Filter rows**.\n2. In the **Filter rows** pane, in the **Choose** drop-down list above **where**, select **Select rows** or **Drop rows**.\n3. Specify the first filter criterion.\n4. To add another filter criterion, click **add condition**, and specify the next filter criterion. Repeat as desired.\n5. For **Dataframe name**, enter a name for the programmatic identifier of the table\u2019s contents as a [DataFrame](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.html), or leave **df** as the default programmatic identifier.\n6. Click **Execute**. \n#### [Sort rows](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id23) \n**Scenario**: You want to sort table rows based on the values within one or more columns. For example, in the dummy **Sales dataset**, you want to show the rows by the `region` column\u2019s values in alphabetical order from A to Z. \n1. On the **Data** tab, in the **Search actions** drop-down list, do one of the following: \n* Type **sort**, and then select **Sort rows**.\n* Select **Sort rows**.\n2. In the **Sort column(s)** pane, choose the first column to sort by and the sort order.\n3. To add another sort criterion, click **add column**, and specify the next sort criterion. Repeat as desired.\n4. 
For **Dataframe name**, enter a name for the programmatic identifier of the table\u2019s contents as a [DataFrame](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.html), or leave **df** as the default programmatic identifier.\n5. Click **Execute**. \n#### [Grouping rows and columns tasks](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id24) \nIn this section: \n* [Group rows and columns by a single aggregate function](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#group-rows-and-columns-by-a-single-aggregate-function)\n* [Group rows and columns by multiple aggregate functions](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#group-rows-and-columns-by-multiple-aggregate-functions) \n##### [Group rows and columns by a single aggregate function](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id29) \n**Scenario**: You want to show row and column results by calculated groupings, and you want to assign custom names to those groupings. For example, in the dummy **Sales dataset**, you want to group the rows by the `country` column\u2019s values, showing the numbers of rows containing the same `country` value, and giving the list of calculated counts the name `country_count`. \n1. On the **Data** tab, in the **Search actions** drop-down list, do one of the following: \n* Type **group**, and then select **Group by and aggregate (with renaming)**.\n* Select **Group by and aggregate (with renaming)**.\n2. In the **Group by with column rename** pane, select the columns to group by, the first calculation, and optionally specify a name for the calculated column.\n3. To add another calculation, click **add calculation**, and specify the next calculation and column name. Repeat as desired.\n4. Specify where to store the result.\n5. 
For **Dataframe name**, enter a name for the programmatic identifier of the table\u2019s contents as a [DataFrame](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.html), or leave **df** as the default programmatic identifier.\n6. Click **Execute**. \n##### [Group rows and columns by multiple aggregate functions](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id30) \n**Scenario**: You want to show row and column results by calculated groupings. For example, in the dummy **Sales dataset**, you want to group the rows by the `region`, `country`, and `sales_channel` columns\u2019 values, showing the numbers of rows containing the same `region` and `country` value by `sales_channel`, as well as the `total_revenue` by unique combination of `region`, `country`, and `sales_channel`. \n1. On the **Data** tab, in the **Search actions** drop-down list, do one of the following: \n* Type **group**, and then select **Group by and aggregate (default)**.\n* Select **Group by and aggregate (default)**.\n2. In the **Group by with column rename** pane, select the columns to group by and the first calculation.\n3. To add another calculation, click **add calculation**, and specify the next calculation. Repeat as desired.\n4. Specify where to store the result.\n5. For **Dataframe name**, enter a name for the programmatic identifier of the table\u2019s contents as a [DataFrame](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.html), or leave **df** as the default programmatic identifier.\n6. Click **Execute**. \n#### [Remove rows with missing values](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id25) \n**Scenario**: You want to remove any row that has a missing value for the specified columns. For example, in the dummy **Sales dataset**, you want to remove any rows that have a missing `item_type` value. \n1. 
On the **Data** tab, in the **Search actions** drop-down list, do one of the following: \n* Type **drop** or **remove**, and then select **Drop missing values**.\n* Select **Drop missing values**.\n2. In the **Drop missing values** pane, select the columns to remove any row that has a missing value for that column.\n3. For **Dataframe name**, enter a name for the programmatic identifier of the table\u2019s contents as a [DataFrame](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.html), or leave **df** as the default programmatic identifier.\n4. Click **Execute**. \n#### [Remove duplicated rows](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id26) \n**Scenario**: You want to remove any row that has a duplicated value for the specified columns. For example, in the dummy **Sales dataset**, you want to remove any rows that are exact duplicates of each other. \n1. On the **Data** tab, in the **Search actions** drop-down list, do one of the following: \n* Type **drop** or **remove**, and then select **Drop\/Remove duplicates**.\n* Select **Drop\/Remove duplicates**.\n2. In the **Remove Duplicates** pane, select the columns to remove any row that has a duplicated value for those columns, and then select whether to keep the first or last row that has the duplicated value.\n3. For **Dataframe name**, enter a name for the programmatic identifier of the table\u2019s contents as a [DataFrame](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.html), or leave **df** as the default programmatic identifier.\n4. Click **Execute**. \n#### [Find and replace missing values](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id27) \n**Scenario**: You want to replace missing values in the specified columns with a replacement value. 
For example, in the dummy **Sales dataset**, you want to replace any row with a missing value in the `item_type` column with the value `Unknown Item Type`. \n1. On the **Data** tab, in the **Search actions** drop-down list, do one of the following: \n* Type **find** or **replace**, and then select **Find and replace missing values**.\n* Select **Find and replace missing values**.\n2. In the **Replace missing values** pane, select the columns to replace missing values for, and then specify the replacement value.\n3. Click **Execute**. \n#### [Create a column formula](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id28) \n**Scenario**: You want to create a column that uses a unique formula. For example, in the dummy **Sales dataset**, you want to create a column named `profit_per_unit` that displays the result of dividing the `total_profit` column value by the `units_sold` column value for each row. \n1. On the **Data** tab, in the **Search actions** drop-down list, do one of the following: \n* Type **formula**, and then select **New column formula**.\n* Select **New column formula**.\n2. In the **Replace missing values** pane, select the columns to replace missing values for, and then specify the replacement value.\n3. Click **Execute**. 
\n### [Data action history tasks](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id15) \nIn this section: \n* [View the list of actions taken in the widget](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#view-the-list-of-actions-taken-in-the-widget)\n* [Undo the most recent action taken in the widget](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#undo-the-most-recent-action-taken-in-the-widget)\n* [Redo the most recent action taken in the widget](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#redo-the-most-recent-action-taken-in-the-widget)\n* [Change the most recent action taken in the widget](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#change-the-most-recent-action-taken-in-the-widget) \n#### [View the list of actions taken in the widget](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id31) \n**Scenario**: You want to see a list of all of the changes that were made in the widget, starting with the most recent change. \nClick **History**. The list of actions appears in the **Transformations history** pane. \n#### [Undo the most recent action taken in the widget](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id32) \n**Scenario**: You want to revert the most recent change that was made in the widget. \nDo one of the following: \n* Click the counterclockwise arrow icon.\n* Click **History**, and in the **Transformations history** pane, click **Undo last step**. \n#### [Redo the most recent action taken in the widget](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id33) \n**Scenario**: You want to reapply the most recent change that was undone in the widget. \nDo one of the following: \n* Click the clockwise arrow icon.\n* Click **History**, and in the **Transformations history** pane, click **Recover last step**. 
\n#### [Change the most recent action taken in the widget](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id34) \n**Scenario**: You want to modify the most recent action that was taken in the widget. \n1. Do one of the following: \n* Click the pencil icon.\n* Click **History**, and in the **Transformations history** pane, click **Edit last step**.\n2. Make the desired change, and then click **Execute**. \n### [Get code to programmatically recreate the widget\u2019s current state as a DataFrame](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id16) \n**Scenario**: You want to get Python code that programmatically recreates the current widget\u2019s state, represented as a pandas DataFrame. You want to run this code in a different cell in this notebook or in a different notebook altogether. \n1. Click **Get Code**.\n2. In the **Export code** pane, click **Copy code**. The code is copied to your system\u2019s clipboard.\n3. Paste the code into a different cell in this notebook or into a different notebook.\n4. Write additional code to work with this pandas DataFrame programmatically, and then run the cell. For example, to display the DataFrame\u2019s contents, assuming that your DataFrame is represented programmatically by `df`: \n```\n# Your pasted code here, followed by...\ndf\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/bamboolib.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n#### bamboolib\n##### [Limitations](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id9)\n\n* Using bamboolib for data wrangling is limited to approximately 10 million rows. This limit is based on pandas and your cluster\u2019s compute resources.\n* Using bamboolib for data visualizations is limited to approximately 10 thousand rows. 
This limit is based on plotly.\n\n#### bamboolib\n##### [Additional resources](https:\/\/docs.databricks.com\/notebooks\/bamboolib.html#id10)\n\n* [bamboolib plugins](https:\/\/github.com\/tkrabel\/bamboolib\/tree\/master\/plugins)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/bamboolib.html"} +{"content":"# What is Delta Lake?\n### Constraints on Databricks\n\nDatabricks supports standard SQL constraint management clauses. Constraints fall into two categories: \n* Enforced constraints ensure that the quality and integrity of data added to a table are automatically verified.\n* Informational primary key and foreign key constraints encode relationships between fields in tables and are not enforced. \nAll constraints on Databricks require Delta Lake. \nDelta Live Tables has a similar concept known as expectations. See [Manage data quality with Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html).\n\n### Constraints on Databricks\n#### Enforced constraints on Databricks\n\nWhen a constraint is violated, the transaction fails with an error. Two types of constraints are supported: \n* `NOT NULL`: indicates that values in specific columns cannot be null.\n* `CHECK`: indicates that a specified boolean expression must be true for each input row. \nImportant \nAdding a constraint automatically upgrades the table writer protocol version if the previous writer version was less than 3. See [How does Databricks manage Delta Lake feature compatibility?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html) to understand table protocol versioning and what it means to upgrade the protocol version.\n\n","doc_uri":"https:\/\/docs.databricks.com\/tables\/constraints.html"} +{"content":"# What is Delta Lake?\n### Constraints on Databricks\n#### Set a `NOT NULL` constraint in Databricks\n\nYou specify `NOT NULL` constraints in the schema when you create a table. 
You drop or add `NOT NULL` constraints using the `ALTER TABLE ALTER COLUMN` command. \n```\nCREATE TABLE people10m (\nid INT NOT NULL,\nfirstName STRING,\nmiddleName STRING NOT NULL,\nlastName STRING,\ngender STRING,\nbirthDate TIMESTAMP,\nssn STRING,\nsalary INT\n) USING DELTA;\n\nALTER TABLE people10m ALTER COLUMN middleName DROP NOT NULL;\nALTER TABLE people10m ALTER COLUMN ssn SET NOT NULL;\n\n``` \nBefore adding a `NOT NULL` constraint to a table, Databricks verifies that all existing rows satisfy the constraint. \nIf you specify a `NOT NULL` constraint on a column nested within a struct, the parent struct must also be not null. Columns nested within array or map types do not accept `NOT NULL` constraints. \nSee [CREATE TABLE [USING]](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-table-using.html) and [ALTER TABLE ALTER COLUMN](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/tables\/constraints.html"} +{"content":"# What is Delta Lake?\n### Constraints on Databricks\n#### Set a `CHECK` constraint in Databricks\n\nYou manage `CHECK` constraints using the `ALTER TABLE ADD CONSTRAINT` and `ALTER TABLE DROP CONSTRAINT` commands. `ALTER TABLE ADD CONSTRAINT` verifies that all existing rows satisfy the constraint before adding it to the table. \n```\nCREATE TABLE people10m (\nid INT,\nfirstName STRING,\nmiddleName STRING,\nlastName STRING,\ngender STRING,\nbirthDate TIMESTAMP,\nssn STRING,\nsalary INT\n) USING DELTA;\n\nALTER TABLE people10m ADD CONSTRAINT dateWithinRange CHECK (birthDate > '1900-01-01');\nALTER TABLE people10m DROP CONSTRAINT dateWithinRange;\n\n``` \nSee [ALTER TABLE ADD CONSTRAINT](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table.html#add-constraint) and [ALTER TABLE DROP CONSTRAINT](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table.html#drop-constraint). 
\n`CHECK` constraints are exposed as table properties in the output of the `DESCRIBE DETAIL` and `SHOW TBLPROPERTIES` commands. \n```\nALTER TABLE people10m ADD CONSTRAINT validIds CHECK (id > 1 and id < 99999999);\n\nDESCRIBE DETAIL people10m;\n\nSHOW TBLPROPERTIES people10m;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/tables\/constraints.html"} +{"content":"# What is Delta Lake?\n### Constraints on Databricks\n#### Declare primary key and foreign key relationships\n\nNote \n* Primary key and foreign key constraints are available in Databricks Runtime 11.3 LTS and above, and are fully GA in Databricks Runtime 15.2 and above.\n* Primary key and foreign key constraints require Unity Catalog and Delta Lake. \nYou can use primary key and foreign key relationships on fields in Unity Catalog tables. Primary and foreign keys are informational only and are not enforced. Foreign keys must reference a primary key in another table. \nYou can declare primary keys and foreign keys as part of the table specification clause during table creation. This clause is not allowed during CTAS statements. You can also add constraints to existing tables. \n```\nCREATE TABLE T(pk1 INTEGER NOT NULL, pk2 INTEGER NOT NULL,\nCONSTRAINT t_pk PRIMARY KEY(pk1, pk2));\nCREATE TABLE S(pk INTEGER NOT NULL PRIMARY KEY,\nfk1 INTEGER, fk2 INTEGER,\nCONSTRAINT s_t_fk FOREIGN KEY(fk1, fk2) REFERENCES T);\n\n``` \nYou can query the `information_schema` or use `DESCRIBE` to get details about how constraints are applied across a given catalog. 
\nSee: \n* [ALTER TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table.html)\n* [ADD CONSTRAINT](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table-add-constraint.html)\n* [DROP CONSTRAINT](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table-drop-constraint.html)\n* [CONSTRAINT clause](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-table-constraint.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/tables\/constraints.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Foundation Model APIs model maintenance policy\n\nThis article describes the model maintenance policy for the [Foundation Model APIs pay-per-token](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#token-foundation-apis) offering. \nThe Foundation Model API pay-per-token offering allows customers to experiment with the best models. In order to continue supporting the most state-of-the-art models, Databricks might deprecate older models or update supported models. \nIf you require long-term support for a specific model version, Databricks recommends using [provisioned throughput](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#throughput).\n\n##### Foundation Model APIs model maintenance policy\n###### Model deprecation\n\nThe following deprecation policy only applies to chat and completion models. \nIf a model is set for deprecation, Databricks takes the following steps to notify customers: \n* A warning message displays in the model card from the **Serving** page of your Databricks workspace that indicates that the model is deprecated.\n* The documentation contains a notice that indicates the model is deprecated. 
\nAfter customers are notified about the upcoming model deprecation, Databricks will retire the model in 3 months. During this period of time, customers can choose to migrate to a provisioned throughput endpoint to continue using the model past its end-of-life date.\n\n##### Foundation Model APIs model maintenance policy\n###### Model updates\n\nDatabricks might ship incremental updates to pay-per-token models to deliver optimizations. When a model is updated, the endpoint URL remains the same, but the model ID in the response object changes to reflect the date of the update. For example, if an update is shipped to `llama-2-70b-chat` on 3\/4\/2024, the model name in the response object updates accordingly to `llama-2-70b-chat-030424`. Databricks maintains a version history of the updates that customers can refer to.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deprecate-pay-per-token-model.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Common data loading patterns\n\nAuto Loader simplifies a number of common data ingestion tasks. This quick reference provides examples for several popular patterns.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/patterns.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Common data loading patterns\n##### Filtering directories or files using glob patterns\n\nGlob patterns can be used for filtering directories and files when provided in the path. \n| Pattern | Description |\n| --- | --- |\n| `?` | Matches any single character |\n| `*` | Matches zero or more characters |\n| `[abc]` | Matches a single character from character set {a,b,c}. |\n| `[a-z]` | Matches a single character from the character range {a\u2026z}. |\n| `[^a]` | Matches a single character that is not from character set or range {a}. Note that the `^` character must occur immediately to the right of the opening bracket. 
|\n| `{ab,cd}` | Matches a string from the string set {ab, cd}. |\n| `{ab,c{de,fh}}` | Matches a string from the string set {ab, cde, cfh}. | \nUse the `path` to provide prefix patterns, for example: \n```\ndf = spark.readStream.format(\"cloudFiles\") \\\n.option(\"cloudFiles.format\", <format>) \\\n.schema(schema) \\\n.load(\"<base-path>\/*\/files\")\n\n``` \n```\nval df = spark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", <format>)\n.schema(schema)\n.load(\"<base-path>\/*\/files\")\n\n``` \nImportant \nUse the `pathGlobFilter` option to explicitly provide suffix patterns. The `path` only provides a prefix filter. \nFor example, if you would like to parse only `png` files in a directory that contains files with different suffixes, you can do: \n```\ndf = spark.readStream.format(\"cloudFiles\") \\\n.option(\"cloudFiles.format\", \"binaryFile\") \\\n.option(\"pathGlobFilter\", \"*.png\") \\\n.load(<base-path>)\n\n``` \n```\nval df = spark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"binaryFile\")\n.option(\"pathGlobFilter\", \"*.png\")\n.load(<base-path>)\n\n``` \nNote \nThe default globbing behavior of Auto Loader differs from the default behavior of other Spark file sources. Add `.option(\"cloudFiles.useStrictGlobber\", \"true\")` to your read to use globbing that matches default Spark behavior against file sources. 
See the following table for more on globbing: \n| Pattern | File path | Default globber | Strict globber |\n| --- | --- | --- | --- |\n| \/a\/b | \/a\/b\/c\/file.txt | *Yes* | *Yes* |\n| \/a\/b | \/a\/b\\_dir\/c\/file.txt | *No* | *No* |\n| \/a\/b | \/a\/b.txt | *No* | *No* |\n| \/a\/b\/ | \/a\/b.txt | *No* | *No* |\n| \/a\/\\*\/c\/ | \/a\/b\/c\/file.txt | *Yes* | *Yes* |\n| \/a\/\\*\/c\/ | \/a\/b\/c\/d\/file.txt | *Yes* | *Yes* |\n| \/a\/\\*\/c\/ | \/a\/b\/x\/y\/c\/file.txt | *Yes* | *No* |\n| \/a\/\\*\/c | \/a\/b\/c\\_file.txt | *Yes* | *No* |\n| \/a\/\\*\/c\/ | \/a\/b\/c\\_file.txt | *Yes* | *No* |\n| \/a\/\\*\/c\/ | \/a\/\\*\/cookie\/file.txt | *Yes* | *No* |\n| \/a\/b\\* | \/a\/b.txt | *Yes* | *Yes* |\n| \/a\/b\\* | \/a\/b\/file.txt | *Yes* | *Yes* |\n| \/a\/{0.txt,1.txt} | \/a\/0.txt | *Yes* | *Yes* |\n| \/a\/\\*\/{0.txt,1.txt} | \/a\/0.txt | *No* | *No* |\n| \/a\/b\/[cde-h]\/i\/ | \/a\/b\/c\/i\/file.txt | *Yes* | *Yes* |\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/patterns.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Common data loading patterns\n##### Enable easy ETL\n\nAn easy way to get your data into Delta Lake without losing any data is to use the following pattern and enabling schema inference with Auto Loader. Databricks recommends running the following code in a Databricks job for it to automatically restart your stream when the schema of your source data changes. By default, the schema is inferred as string types, any parsing errors (there should be none if everything remains as a string) will go to `_rescued_data`, and any new columns will fail the stream and evolve the schema. 
\n```\nspark.readStream.format(\"cloudFiles\") \\\n.option(\"cloudFiles.format\", \"json\") \\\n.option(\"cloudFiles.schemaLocation\", \"<path-to-schema-location>\") \\\n.load(\"<path-to-source-data>\") \\\n.writeStream \\\n.option(\"mergeSchema\", \"true\") \\\n.option(\"checkpointLocation\", \"<path-to-checkpoint>\") \\\n.start(\"<path_to_target>\")\n\n``` \n```\nspark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"json\")\n.option(\"cloudFiles.schemaLocation\", \"<path-to-schema-location>\")\n.load(\"<path-to-source-data>\")\n.writeStream\n.option(\"mergeSchema\", \"true\")\n.option(\"checkpointLocation\", \"<path-to-checkpoint>\")\n.start(\"<path_to_target>\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/patterns.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Common data loading patterns\n##### Prevent data loss in well-structured data\n\nWhen you know your schema, but want to know whenever you receive unexpected data, Databricks recommends using the `rescuedDataColumn`. 
\n```\n# Parentheses replace backslash continuations so the comment line is valid Python\n(spark.readStream.format(\"cloudFiles\")\n.schema(expected_schema)\n.option(\"cloudFiles.format\", \"json\")\n# will collect all new fields as well as data type mismatches in _rescued_data\n.option(\"cloudFiles.schemaEvolutionMode\", \"rescue\")\n.load(\"<path-to-source-data>\")\n.writeStream\n.option(\"checkpointLocation\", \"<path-to-checkpoint>\")\n.start(\"<path_to_target>\")\n)\n\n``` \n```\nspark.readStream.format(\"cloudFiles\")\n.schema(expected_schema)\n.option(\"cloudFiles.format\", \"json\")\n\/\/ will collect all new fields as well as data type mismatches in _rescued_data\n.option(\"cloudFiles.schemaEvolutionMode\", \"rescue\")\n.load(\"<path-to-source-data>\")\n.writeStream\n.option(\"checkpointLocation\", \"<path-to-checkpoint>\")\n.start(\"<path_to_target>\")\n\n``` \nIf you want your stream to stop processing if a new field is introduced that doesn\u2019t match your schema, you can add: \n```\n.option(\"cloudFiles.schemaEvolutionMode\", \"failOnNewColumns\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/patterns.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Common data loading patterns\n##### Enable flexible semi-structured data pipelines\n\nWhen you\u2019re receiving data from a vendor that introduces new columns to the information they provide, you may not be aware of exactly when they do it, or you may not have the bandwidth to update your data pipeline. You can now leverage schema evolution to restart the stream and let Auto Loader update the inferred schema automatically. You can also leverage `schemaHints` for some of the \u201cschemaless\u201d fields that the vendor may be providing. 
\n```\n# Parentheses replace backslash continuations so the comment line is valid Python\n(spark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"json\")\n# will ensure that the headers column gets processed as a map\n.option(\"cloudFiles.schemaHints\",\n\"headers map<string,string>, statusCode SHORT\")\n.load(\"\/api\/requests\")\n.writeStream\n.option(\"mergeSchema\", \"true\")\n.option(\"checkpointLocation\", \"<path-to-checkpoint>\")\n.start(\"<path_to_target>\")\n)\n\n``` \n```\nspark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"json\")\n\/\/ will ensure that the headers column gets processed as a map\n.option(\"cloudFiles.schemaHints\",\n\"headers map<string,string>, statusCode SHORT\")\n.load(\"\/api\/requests\")\n.writeStream\n.option(\"mergeSchema\", \"true\")\n.option(\"checkpointLocation\", \"<path-to-checkpoint>\")\n.start(\"<path_to_target>\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/patterns.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Common data loading patterns\n##### Transform nested JSON data\n\nBecause Auto Loader infers the top-level JSON columns as strings, you can be left with nested JSON objects that require further transformations. You can use the [semi-structured data access APIs](https:\/\/docs.databricks.com\/optimizations\/semi-structured.html) to further transform complex JSON content. 
\n```\n# Parentheses replace backslash continuations so the comment line is valid Python\n(spark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"json\")\n# The schema location directory keeps track of your data schema over time\n.option(\"cloudFiles.schemaLocation\", \"<path-to-checkpoint>\")\n.load(\"<source-data-with-nested-json>\")\n.selectExpr(\n\"*\",\n\"tags:page.name\", # extracts {\"tags\":{\"page\":{\"name\":...}}}\n\"tags:page.id::int\", # extracts {\"tags\":{\"page\":{\"id\":...}}} and casts to int\n\"tags:eventType\" # extracts {\"tags\":{\"eventType\":...}}\n)\n)\n\n``` \n```\nspark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"json\")\n\/\/ The schema location directory keeps track of your data schema over time\n.option(\"cloudFiles.schemaLocation\", \"<path-to-checkpoint>\")\n.load(\"<source-data-with-nested-json>\")\n.selectExpr(\n\"*\",\n\"tags:page.name\", \/\/ extracts {\"tags\":{\"page\":{\"name\":...}}}\n\"tags:page.id::int\", \/\/ extracts {\"tags\":{\"page\":{\"id\":...}}} and casts to int\n\"tags:eventType\" \/\/ extracts {\"tags\":{\"eventType\":...}}\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/patterns.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Common data loading patterns\n##### Infer nested JSON data\n\nWhen you have nested data, you can use the `cloudFiles.inferColumnTypes` option to infer the nested structure of your data and other column types. 
\n```\n# Parentheses replace backslash continuations so the comment line is valid Python\n(spark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"json\")\n# The schema location directory keeps track of your data schema over time\n.option(\"cloudFiles.schemaLocation\", \"<path-to-checkpoint>\")\n.option(\"cloudFiles.inferColumnTypes\", \"true\")\n.load(\"<source-data-with-nested-json>\")\n)\n\n``` \n```\nspark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"json\")\n\/\/ The schema location directory keeps track of your data schema over time\n.option(\"cloudFiles.schemaLocation\", \"<path-to-checkpoint>\")\n.option(\"cloudFiles.inferColumnTypes\", \"true\")\n.load(\"<source-data-with-nested-json>\")\n\n```\n\n#### Common data loading patterns\n##### Load CSV files without headers\n\n```\ndf = (spark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"csv\")\n.option(\"rescuedDataColumn\", \"_rescued_data\") # makes sure that you don't lose data\n.schema(<schema>) # provide a schema here for the files\n.load(<path>)\n)\n\n``` \n```\nval df = spark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"csv\")\n.option(\"rescuedDataColumn\", \"_rescued_data\") \/\/ makes sure that you don't lose data\n.schema(<schema>) \/\/ provide a schema here for the files\n.load(<path>)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/patterns.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Common data loading patterns\n##### Enforce a schema on CSV files with headers\n\n```\ndf = (spark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"csv\")\n.option(\"header\", \"true\")\n.option(\"rescuedDataColumn\", \"_rescued_data\") # makes sure that you don't lose data\n.schema(<schema>) # provide a schema here for the files\n.load(<path>)\n)\n\n``` \n```\nval df = spark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"csv\")\n.option(\"header\", 
\"true\")\n.option(\"rescuedDataColumn\", \"_rescued_data\") \/\/ makes sure that you don't lose data\n.schema(<schema>) \/\/ provide a schema here for the files\n.load(<path>)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/patterns.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Common data loading patterns\n##### Ingest image or binary data to Delta Lake for ML\n\nOnce the data is stored in Delta Lake, you can run distributed inference on the data. See [Perform distributed inference using pandas UDF](https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/images-etl-inference.html#perform-distributed-inference-using-pandas-udf). \n```\nspark.readStream.format(\"cloudFiles\") \\\n.option(\"cloudFiles.format\", \"binaryFile\") \\\n.load(\"<path-to-source-data>\") \\\n.writeStream \\\n.option(\"checkpointLocation\", \"<path-to-checkpoint>\") \\\n.start(\"<path_to_target>\")\n\n``` \n```\nspark.readStream.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"binaryFile\")\n.load(\"<path-to-source-data>\")\n.writeStream\n.option(\"checkpointLocation\", \"<path-to-checkpoint>\")\n.start(\"<path_to_target>\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/patterns.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## What is Auto Loader?\n#### Common data loading patterns\n##### Auto Loader syntax for DLT\n\nDelta Live Tables provides slightly modified Python syntax for Auto Loader and adds SQL support for Auto Loader. 
\nThe following examples use Auto Loader to create datasets from CSV and JSON files: \n```\nimport dlt\n\n@dlt.table\ndef customers():\n    return (\n        spark.readStream.format(\"cloudFiles\")\n        .option(\"cloudFiles.format\", \"csv\")\n        .load(\"\/databricks-datasets\/retail-org\/customers\/\")\n    )\n\n@dlt.table\ndef sales_orders_raw():\n    return (\n        spark.readStream.format(\"cloudFiles\")\n        .option(\"cloudFiles.format\", \"json\")\n        .load(\"\/databricks-datasets\/retail-org\/sales_orders\/\")\n    )\n\n``` \n```\nCREATE OR REFRESH STREAMING TABLE customers\nAS SELECT * FROM cloud_files(\"\/databricks-datasets\/retail-org\/customers\/\", \"csv\")\n\nCREATE OR REFRESH STREAMING TABLE sales_orders_raw\nAS SELECT * FROM cloud_files(\"\/databricks-datasets\/retail-org\/sales_orders\/\", \"json\")\n\n``` \nYou can use supported [format options](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/options.html#format-options) with Auto Loader. Using the `map()` function, you can pass options to the `cloud_files()` method. Options are key-value pairs, where the keys and values are strings. 
The following describes the syntax for working with Auto Loader in SQL: \n```\nCREATE OR REFRESH STREAMING TABLE <table-name>\nAS SELECT *\nFROM cloud_files(\n\"<file-path>\",\n\"<file-format>\",\nmap(\n\"<option-key>\", \"<option_value>\",\n\"<option-key>\", \"<option_value>\",\n...\n)\n)\n\n``` \nThe following example reads data from tab-delimited CSV files with a header: \n```\nCREATE OR REFRESH STREAMING TABLE customers\nAS SELECT * FROM cloud_files(\"\/databricks-datasets\/retail-org\/customers\/\", \"csv\", map(\"delimiter\", \"\\t\", \"header\", \"true\"))\n\n``` \nYou can use the `schema` to specify the format manually; you must specify the `schema` for formats that do not support [schema inference](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html): \n```\n@dlt.table\ndef wiki_raw():\n    return (\n        spark.readStream.format(\"cloudFiles\")\n        .schema(\"title STRING, id INT, revisionId INT, revisionTimestamp TIMESTAMP, revisionUsername STRING, revisionUsernameId INT, text STRING\")\n        .option(\"cloudFiles.format\", \"parquet\")\n        .load(\"\/databricks-datasets\/wikipedia-datasets\/data-001\/en_wikipedia\/articles-only-parquet\")\n    )\n\n``` \n```\nCREATE OR REFRESH STREAMING TABLE wiki_raw\nAS SELECT *\nFROM cloud_files(\n\"\/databricks-datasets\/wikipedia-datasets\/data-001\/en_wikipedia\/articles-only-parquet\",\n\"parquet\",\nmap(\"schema\", \"title STRING, id INT, revisionId INT, revisionTimestamp TIMESTAMP, revisionUsername STRING, revisionUsernameId INT, text STRING\")\n)\n\n``` \nNote \nDelta Live Tables automatically configures and manages the schema and checkpoint directories when using Auto Loader to read files. However, if you manually configure either of these directories, performing a full refresh does not affect the contents of the configured directories. 
Databricks recommends using the automatically configured directories to avoid unexpected side effects during processing.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/auto-loader\/patterns.html"} +{"content":"# Data governance with Unity Catalog\n### What is Unity Catalog?\n\nThis article introduces Unity Catalog, a unified governance solution for data and AI assets on Databricks.\n\n### What is Unity Catalog?\n#### Overview of Unity Catalog\n\nUnity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces. \n![Unity Catalog diagram](https:\/\/docs.databricks.com\/_images\/with-unity-catalog.png) \nKey features of Unity Catalog include: \n* **Define once, secure everywhere**: Unity Catalog offers a single place to administer data access policies that apply across all workspaces.\n* **Standards-compliant security model**: Unity Catalog\u2019s security model is based on standard ANSI SQL and allows administrators to grant permissions in their existing data lake using familiar syntax, at the level of catalogs, databases (also called schemas), tables, and views.\n* **Built-in auditing and lineage**: Unity Catalog automatically captures user-level audit logs that record access to your data. 
Unity Catalog also captures lineage data that tracks how data assets are created and used across all languages.\n* **Data discovery**: Unity Catalog lets you tag and document data assets, and provides a search interface to help data consumers find data.\n* **System tables (Public Preview)**: Unity Catalog lets you easily access and query your account\u2019s operational data, including audit logs, billable usage, and lineage.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html"} +{"content":"# Data governance with Unity Catalog\n### What is Unity Catalog?\n#### How does Unity Catalog govern access to data and AI assets in cloud object storage?\n\nDatabricks recommends that you configure all access to cloud object storage using Unity Catalog. See [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html). \nUnity Catalog introduces the following concepts to manage relationships between data in Databricks and cloud object storage: \n* **Storage credentials** encapsulate a long-term cloud credential that provides access to cloud storage. For example, an IAM role that can access S3 buckets or a Cloudflare R2 API token. See [Create a storage credential for connecting to AWS S3](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html) and [Create a storage credential for connecting to Cloudflare R2](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials-r2.html).\n* **External locations** contain a reference to a storage credential and a cloud storage path. See [Create an external location to connect cloud storage to Databricks](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html).\n* **Managed storage locations** associate a storage location in an S3 bucket or Cloudflare R2 bucket in your own cloud storage account with a metastore, catalog, or schema. 
Managed storage locations are used as the storage location for managed tables and managed volumes. See [Specify a managed storage location in Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/managed-storage.html).\n* **Volumes** provide access to non-tabular data stored in cloud object storage. See [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html).\n* **Tables** provide access to tabular data stored in cloud object storage. \nNote \nLakehouse Federation provides integrations to data in other external systems. These objects are not backed by cloud object storage.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html"} +{"content":"# Data governance with Unity Catalog\n### What is Unity Catalog?\n#### The Unity Catalog object model\n\nIn Unity Catalog, the hierarchy of primary data objects flows from metastore to table or volume: \n* **Metastore**: The top-level container for metadata. Each metastore exposes a three-level namespace (`catalog`.`schema`.`table`) that organizes your data.\n* **Catalog**: The first layer of the object hierarchy, used to organize your data assets.\n* **Schema**: Also known as databases, schemas are the second layer of the object hierarchy and contain tables and views.\n* **Tables, views, and volumes**: At the lowest level in the data object hierarchy are tables, views, and volumes. Volumes provide governance for non-tabular data.\n* **Models**: Although they are not, strictly speaking, data assets, registered models can also be managed in Unity Catalog and reside at the lowest level in the object hierarchy. \n![Unity Catalog object model diagram](https:\/\/docs.databricks.com\/_images\/object-model.png) \nThis is a simplified view of securable Unity Catalog objects. 
For more details, see [Securable objects in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html#securable-objects). \nYou reference all data in Unity Catalog using a three-level namespace: `catalog.schema.asset`, where `asset` can be a table, view, volume, or model. \n### Metastores \nA metastore is the top-level container of objects in Unity Catalog. It registers metadata about data and AI assets and the permissions that govern access to them. Databricks account admins should create one metastore for each region in which they operate and assign them to Databricks workspaces in the same region. For a workspace to use Unity Catalog, it must have a Unity Catalog metastore attached. \nA metastore can optionally be configured with a managed storage location in an S3 bucket or Cloudflare R2 bucket in your own cloud storage account. See [Managed storage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#managed-storage). \nNote \nThis metastore is distinct from the Hive metastore included in Databricks workspaces that have not been enabled for Unity Catalog. If your workspace includes a legacy Hive metastore, the data in that metastore will still be available alongside data defined in Unity Catalog, in a catalog named `hive_metastore`. Note that the `hive_metastore` catalog is not managed by Unity Catalog and does not benefit from the same feature set as catalogs defined in Unity Catalog. \nSee [Create a Unity Catalog metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html). \n### Catalogs \nA catalog is the first layer of Unity Catalog\u2019s three-level namespace. It\u2019s used to organize your data assets. Users can see all catalogs on which they have been assigned the `USE CATALOG` [data permission](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html). 
\nDepending on how your workspace was created and enabled for Unity Catalog, your users may have default permissions on automatically provisioned catalogs, including either the `main` catalog or the *workspace catalog* (`<workspace-name>`). For more information, see [Default user privileges](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#default-privileges). \nSee [Create and manage catalogs](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html). \n### Schemas \nA schema (also called a database) is the second layer of Unity Catalog\u2019s three-level namespace. A schema organizes tables and views. Users can see all schemas on which they have been assigned the `USE SCHEMA` permission, along with the `USE CATALOG` permission on the schema\u2019s parent catalog. To access or list a table or view in a schema, users must also have `SELECT` permission on the table or view. \nIf your workspace was enabled for Unity Catalog manually, it includes a default schema named `default` in the `main` catalog that is accessible to all users in your workspace. If your workspace was enabled for Unity Catalog automatically and includes a `<workspace-name>` catalog, that catalog contains a schema named `default` that is accessible to all users in your workspace. \nSee [Create and manage schemas (databases)](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-schemas.html). \n### Tables \nA table resides in the third layer of Unity Catalog\u2019s three-level namespace. It contains rows of data. To create a table, users must have `CREATE` and `USE SCHEMA` permissions on the schema, and they must have the `USE CATALOG` permission on its parent catalog. To query a table, users must have the `SELECT` permission on the table, the `USE SCHEMA` permission on its parent schema, and the `USE CATALOG` permission on its parent catalog. \nA table can be *managed* or *external*. 
\n#### Managed tables \nManaged tables are the preferred way to create tables in Unity Catalog. Unity Catalog manages the lifecycle and file layout for these tables. Unity Catalog also [optimizes their performance automatically](https:\/\/docs.databricks.com\/optimizations\/predictive-optimization.html). You should not use tools outside of Databricks to manipulate files in these tables directly. Managed tables always use the [Delta](https:\/\/docs.databricks.com\/delta\/index.html) table format. \nFor workspaces that were enabled for Unity Catalog manually, managed tables are stored in the root storage location that you configure when you create a metastore. You can optionally specify managed table storage locations at the catalog or schema levels, overriding the root storage location. \nFor workspaces that were enabled for Unity Catalog automatically, the metastore root storage location is optional, and managed tables are typically stored at the catalog or schema levels. \nWhen a managed table is dropped, its underlying data is deleted from your cloud tenant within 30 days. \nSee [Managed tables](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#managed-table). \n#### External tables \nExternal tables are tables whose data lifecycle and file layout are not managed by Unity Catalog. Use external tables to register large amounts of existing data in Unity Catalog, or if you require direct access to the data using tools outside of Databricks clusters or Databricks SQL warehouses. \nWhen you drop an external table, Unity Catalog does not delete the underlying data. You can manage privileges on external tables and use them in queries in the same way as managed tables. \nExternal tables can use the following file formats: \n* DELTA\n* CSV\n* JSON\n* AVRO\n* PARQUET\n* ORC\n* TEXT \nSee [External tables](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#external-table). 
\n### Views \nA view is a read-only object created from one or more tables and views in a metastore. It resides in the third layer of Unity Catalog\u2019s three-level namespace. A view can be created from tables and other views in multiple schemas and catalogs. You can create [dynamic views](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) to enable row- and column-level permissions. \nSee [Create a dynamic view](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-views.html#dynamic-view). \n### Volumes \nA volume resides in the third layer of Unity Catalog\u2019s three-level namespace. Volumes are siblings to tables, views, and other objects organized under a schema in Unity Catalog. \nVolumes contain directories and files for data stored in any format. Volumes provide non-tabular access to data, meaning that files in volumes cannot be registered as tables. \n* To create a volume, users must have `CREATE VOLUME` and `USE SCHEMA` permissions on the schema, and they must have the `USE CATALOG` permission on its parent catalog.\n* To read files and directories stored inside a volume, users must have the `READ VOLUME` permission, the `USE SCHEMA` permission on its parent schema, and the `USE CATALOG` permission on its parent catalog.\n* To add, remove, or modify files and directories stored inside a volume, users must have `WRITE VOLUME` permission, the `USE SCHEMA` permission on its parent schema, and the `USE CATALOG` permission on its parent catalog. \nA volume can be *managed* or *external*. \nNote \nWhen you define a volume, cloud URI access to data under the volume path is governed by the permissions of the volume. \n#### Managed volumes \nManaged volumes are a convenient solution when you want to provision a governed location for working with non-tabular files. \nManaged volumes store files in the Unity Catalog managed storage location for the schema in which they\u2019re contained. 
For workspaces that were enabled for Unity Catalog manually, managed volumes are stored in the root storage location that you configure when you create a metastore. You can optionally specify managed volume storage locations at the catalog or schema levels, overriding the root storage location. For workspaces that were enabled for Unity Catalog automatically, the metastore root storage location is optional, and managed volumes are typically stored at the catalog or schema levels. \nThe following precedence governs which location is used for a managed volume: \n* Schema location\n* Catalog location\n* Unity Catalog metastore root storage location \nWhen you delete a managed volume, the files stored in this volume are also deleted from your cloud tenant within 30 days. \nSee [What is a managed volume?](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html#managed). \n#### External volumes \nAn external volume is registered to a Unity Catalog external location and provides access to existing files in cloud storage without requiring data migration. Users must have the `CREATE EXTERNAL VOLUME` permission on the external location to create an external volume. \nExternal volumes support scenarios where files are produced by other systems and staged for access from within Databricks using object storage or where tools outside Databricks require direct file access. \nUnity Catalog does not manage the lifecycle and layout of the files in external volumes. When you drop an external volume, Unity Catalog does not delete the underlying data. \nSee [What is an external volume?](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html#external). \n### Models \nA model resides in the third layer of Unity Catalog\u2019s three-level namespace. In this context, \u201cmodel\u201d refers to a machine learning model that is registered in the [MLflow Model Registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html). 
To create a model in Unity Catalog, users must have the `CREATE MODEL` privilege for the catalog or schema. The user must also have the `USE CATALOG` privilege on the parent catalog and `USE SCHEMA` on the parent schema.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html"} +{"content":"# Data governance with Unity Catalog\n### What is Unity Catalog?\n#### Managed storage\n\nYou can store managed tables and managed volumes at any of these levels in the Unity Catalog object hierarchy: metastore, catalog, or schema. Storage at lower levels in the hierarchy overrides storage defined at higher levels. \nWhen an account admin creates a metastore manually, they have the option to assign a storage location in an S3 bucket or Cloudflare R2 bucket in your own cloud storage account to use as metastore-level storage for managed tables and volumes. If a metastore-level managed storage location has been assigned, then managed storage locations at the catalog and schema levels are optional. That said, metastore-level storage is optional, and Databricks recommends assigning managed storage at the catalog level for logical data isolation. See [Data governance and data isolation building blocks](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html#building-blocks). \nImportant \nIf your workspace was enabled for Unity Catalog automatically, the Unity Catalog metastore was created without metastore-level managed storage. You can opt to add metastore-level storage, but Databricks recommends assigning managed storage at the catalog and schema levels. For help deciding whether you need metastore-level storage, see [(Optional) Create metastore-level storage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#metastore-storage) and [Data is physically separated in storage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html#physically-separate). 
\nManaged storage has the following properties: \n* Managed tables and managed volumes store data and metadata files in managed storage.\n* Managed storage locations cannot overlap with external tables or external volumes. \nThe following table describes how managed storage is declared and associated with Unity Catalog objects: \n| Associated Unity Catalog object | How to set | Relation to external locations |\n| --- | --- | --- |\n| Metastore | Configured by account admin during metastore creation or added after metastore creation if no storage was specified at creation. | Cannot overlap an external location. |\n| Catalog | Specified during catalog creation using the `MANAGED LOCATION` keyword. | Must be contained within an external location. |\n| Schema | Specified during schema creation using the `MANAGED LOCATION` keyword. | Must be contained within an external location. | \nThe managed storage location used to store data and metadata for managed tables and managed volumes uses the following rules: \n* If the containing schema has a managed location, the data is stored in the schema managed location.\n* If the containing schema does not have a managed location but the catalog has a managed location, the data is stored in the catalog managed location.\n* If neither the containing schema nor the containing catalog have a managed location, data is stored in the metastore managed location.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html"} +{"content":"# Data governance with Unity Catalog\n### What is Unity Catalog?\n#### Storage credentials and external locations\n\nTo manage access to the underlying cloud storage for external tables, external volumes, and managed storage, Unity Catalog uses the following object types: \n* **Storage credentials** encapsulate a long-term cloud credential that provides access to cloud storage, for example, an IAM role that can access S3 buckets or a Cloudflare R2 API token. 
See [Create a storage credential for connecting to AWS S3](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html) and [Create a storage credential for connecting to Cloudflare R2](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials-r2.html).\n* **External locations** contain a reference to a storage credential and a cloud storage path. \nSee [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html"} +{"content":"# Data governance with Unity Catalog\n### What is Unity Catalog?\n#### Identity management for Unity Catalog\n\nUnity Catalog uses the identities in the Databricks account to resolve users, service principals, and groups, and to enforce permissions. \nTo configure identities in the account, follow the instructions in [Manage users, service principals, and groups](https:\/\/docs.databricks.com\/admin\/users-groups\/index.html). Refer to those users, service principals, and groups when you create [access-control policies](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html) in Unity Catalog. \nUnity Catalog users, service principals, and groups must also be added to workspaces to access Unity Catalog data in a notebook, a Databricks SQL query, Catalog Explorer, or a REST API command. The assignment of users, service principals, and groups to workspaces is called *identity federation*. \nAll workspaces that have a Unity Catalog metastore attached to them are enabled for identity federation. \n### Special considerations for groups \nAny groups that already exist in the workspace are labeled **Workspace local** in the account console. These workspace-local groups cannot be used in Unity Catalog to define access policies. You must use account-level groups. 
If a workspace-local group is referenced in a command, that command returns an error stating that the group was not found. If you previously used workspace-local groups to manage access to notebooks and other artifacts, these permissions remain in effect. \nSee [Manage groups](https:\/\/docs.databricks.com\/admin\/users-groups\/groups.html).\n\n### What is Unity Catalog?\n#### Admin roles for Unity Catalog\n\nAccount admins, metastore admins, and workspace admins are involved in managing Unity Catalog. \nSee [Admin privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html"} +{"content":"# Data governance with Unity Catalog\n### What is Unity Catalog?\n#### Data permissions in Unity Catalog\n\nIn Unity Catalog, data is secure by default. Initially, users have no access to data in a metastore. Access can be granted by either a metastore admin, the owner of an object, or the owner of the catalog or schema that contains the object. Securable objects in Unity Catalog are hierarchical and privileges are inherited downward. \nYou can assign and revoke permissions using Catalog Explorer, SQL commands, or REST APIs. \nSee [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html).\n\n### What is Unity Catalog?\n#### Supported compute and cluster access modes for Unity Catalog\n\nUnity Catalog is supported on clusters that run Databricks Runtime 11.3 LTS or above. Unity Catalog is supported by default on all [SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html) compute versions. \nClusters running on earlier versions of Databricks Runtime do not provide support for all Unity Catalog GA features and functionality. \nTo access data in Unity Catalog, clusters must be configured with the correct *access mode*. 
Unity Catalog is secure by default. If a cluster is not configured with one of the Unity-Catalog-capable access modes (that is, shared or assigned), the cluster can\u2019t access data in Unity Catalog. See [Access modes](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode). \nFor detailed information about Unity Catalog functionality changes in each Databricks Runtime version, see the [release notes](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). \nLimitations for Unity Catalog vary by access mode and Databricks Runtime version. See [Compute access mode limitations for Unity Catalog](https:\/\/docs.databricks.com\/compute\/access-mode-limitations.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html"} +{"content":"# Data governance with Unity Catalog\n### What is Unity Catalog?\n#### Data lineage for Unity Catalog\n\nYou can use Unity Catalog to capture runtime data lineage across queries in any language executed on a Databricks cluster or SQL warehouse. Lineage is captured down to the column level, and includes notebooks, workflows and dashboards related to the query. To learn more, see [Capture and view data lineage using Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html).\n\n### What is Unity Catalog?\n#### Lakehouse Federation and Unity Catalog\n\nLakehouse Federation is the query federation platform for Databricks. The term *query federation* describes a collection of features that enable users and systems to run queries against multiple siloed data sources without needing to migrate all data to a unified system. \nDatabricks uses Unity Catalog to manage query federation. You use Unity Catalog to configure read-only *connections* to popular external database systems and create *foreign catalogs* that mirror external databases. 
Unity Catalog\u2019s data governance and data lineage tools ensure that data access is managed and audited for all federated queries made by the users in your Databricks workspaces. \nSee [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html).\n\n### What is Unity Catalog?\n#### How do I set up Unity Catalog for my organization?\n\nTo learn how to set up Unity Catalog, see [Set up and manage Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html).\n\n### What is Unity Catalog?\n#### Supported regions\n\nAll regions support Unity Catalog. For details, see [Databricks clouds and regions](https:\/\/docs.databricks.com\/resources\/supported-regions.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html"} +{"content":"# Data governance with Unity Catalog\n### What is Unity Catalog?\n#### Supported data file formats\n\nUnity Catalog supports the following table formats: \n* [Managed tables](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#managed-table) must use the `delta` table format.\n* [External tables](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#external-table) can use `delta`, `CSV`, `JSON`, `avro`, `parquet`, `ORC`, or `text`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html"} +{"content":"# Data governance with Unity Catalog\n### What is Unity Catalog?\n#### Unity Catalog limitations\n\nUnity Catalog has the following limitations. \nNote \nIf your cluster is running on a Databricks Runtime version below 11.3 LTS, there may be additional limitations, not listed here. Unity Catalog is supported on Databricks Runtime 11.3 LTS or above. \nUnity Catalog limitations vary by Databricks Runtime and access mode. Structured Streaming workloads have additional limitations based on Databricks Runtime and access mode. 
See [Compute access mode limitations for Unity Catalog](https:\/\/docs.databricks.com\/compute\/access-mode-limitations.html). \n* Workloads in R do not support the use of dynamic views for row-level or column-level security.\n* In Databricks Runtime 13.3 LTS and above, shallow clones are supported to create Unity Catalog managed tables from existing Unity Catalog managed tables. In Databricks Runtime 12.2 LTS and below, there is no support for shallow clones in Unity Catalog. See [Shallow clone for Unity Catalog tables](https:\/\/docs.databricks.com\/delta\/clone-unity-catalog.html).\n* Bucketing is not supported for Unity Catalog tables. Commands that try to create a bucketed table in Unity Catalog throw an exception.\n* Writing to the same path or Delta Lake table from workspaces in multiple regions can lead to unreliable performance if some clusters access Unity Catalog and others do not.\n* Custom partition schemes created using commands like `ALTER TABLE ADD PARTITION` are not supported for tables in Unity Catalog. Unity Catalog can access tables that use directory-style partitioning.\n* Overwrite mode for DataFrame write operations into Unity Catalog is supported only for Delta tables, not for other file formats. The user must have the `CREATE` privilege on the parent schema and must be the owner of the existing object or have the `MODIFY` privilege on the object.\n* In Databricks Runtime 13.3 LTS and above, Python scalar UDFs are supported. In Databricks Runtime 12.2 LTS and below, you cannot use Python UDFs, including UDAFs, UDTFs, and Pandas on Spark (`applyInPandas` and `mapInPandas`).\n* In Databricks Runtime 14.2 and above, Scala scalar UDFs are supported on shared clusters. In Databricks Runtime 14.1 and below, Scala UDFs are not supported on shared clusters.\n* Groups that were previously created in a workspace (that is, workspace-level groups) cannot be used in Unity Catalog GRANT statements. 
This is to ensure a consistent view of groups that can span across workspaces. To use groups in GRANT statements, create your groups at the account level and update any automation for principal or group management (such as SCIM, Okta and Microsoft Entra ID (formerly Azure Active Directory) connectors, and Terraform) to reference account endpoints instead of workspace endpoints. See [Difference between account groups and workspace-local groups](https:\/\/docs.databricks.com\/admin\/users-groups\/groups.html#account-vs-workspace-group).\n* Standard Scala thread pools are not supported. Instead, use the special thread pools in `org.apache.spark.util.ThreadUtils`, for example, `org.apache.spark.util.ThreadUtils.newDaemonFixedThreadPool`. However, the following thread pools in `ThreadUtils` are not supported: `ThreadUtils.newForkJoinPool` and any `ScheduledExecutorService` thread pool. \nThe following limitations apply for all object names in Unity Catalog: \n* Object names cannot exceed 255 characters.\n* The following special characters are not allowed: \n+ Period (`.`)\n+ Space (` `)\n+ Forward slash (`\/`)\n+ All ASCII control characters (00-1F hex)\n+ The DELETE character (7F hex)\n* Unity Catalog stores all object names as lowercase.\n* When referencing UC names in SQL, you must use backticks to escape names that contain special characters such as hyphens (`-`). \nNote \nColumn names can use special characters, but the name must be escaped with backticks in all SQL statements if special characters are used. Unity Catalog preserves column name casing, but queries against Unity Catalog tables are case-insensitive. \nDatabricks recommends that you grant write privileges on a table that is backed by an external location in S3 only if the external location is defined in a single metastore. 
You can safely use multiple metastores to read data in a single external S3 location, but concurrent writes to the same S3 location from multiple metastores might lead to consistency issues. \nAdditional limitations exist for models in Unity Catalog. See [Limitations](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#limitations).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html"} +{"content":"# Data governance with Unity Catalog\n### What is Unity Catalog?\n#### Resource quotas\n\nUnity Catalog enforces resource quotas on all securable objects. Limits respect the same hierarchical organization throughout Unity Catalog. If you expect to exceed these resource limits, contact your Databricks account team. \nQuota values below are expressed relative to the parent (or grandparent) object in Unity Catalog. \n| Object | Parent | Value |\n| --- | --- | --- |\n| table | schema | 10000 |\n| table | metastore | 100000 |\n| volume | schema | 10000 |\n| function | schema | 10000 |\n| registered model | schema | 1000 |\n| registered model | metastore | 5000 |\n| model version | registered model | 10000 |\n| model version | metastore | 100000 |\n| schema | catalog | 10000 |\n| catalog | metastore | 1000 |\n| connection | metastore | 1000 |\n| storage credential | metastore | 200 |\n| external location | metastore | 500 | \nFor Delta Sharing limits, see [Resource quotas](https:\/\/docs.databricks.com\/data-sharing\/index.html#quotas).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)\n\nThis article describes how to read data that has been shared with you using the *Databricks-to-Databricks* Delta Sharing protocol, in which Databricks manages a secure connection for data sharing. 
Unlike the Delta Sharing *open sharing* protocol, the Databricks-to-Databricks protocol does not require a credential file (token-based security). \nDatabricks-to-Databricks sharing requires that you, as a recipient, have access to a Databricks workspace that is [enabled for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html). \nIf you do not have a Databricks workspace that is enabled for Unity Catalog, then data must be shared with you using the Delta Sharing open sharing protocol, and this article doesn\u2019t apply to you. See [Read data shared using Delta Sharing open sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)\n#### How do I make shared data available to my team?\n\nTo read data and notebooks that have been shared with you using the Databricks-to-Databricks protocol, you must be a user on a Databricks workspace that is enabled for [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). A member of your team provides the data provider with a unique identifier for your Unity Catalog metastore, and the data provider uses that identifier to create a secure sharing connection with your organization. The shared data then becomes available for read access in your workspace, and any updates that the data provider makes to the shared tables, views, volumes, and partitions are reflected in your workspace in near real time. \nNote \nUpdates to shared data tables, views, and volumes appear in your workspace in near real time. However, column changes (adding, renaming, deleting) may not appear in Catalog Explorer for up to one minute. 
Likewise, new shares and updates to shares (such as adding new tables to a share) are cached for one minute before they are available for you to view and query. \nTo read data that has been shared with you: \n1. A user on your team finds the *share*\u2014the container for the tables, views, volumes, and notebooks that have been shared with you\u2014and uses that share to create a *catalog*\u2014the top-level container for all data in Databricks Unity Catalog.\n2. A user on your team grants or denies access to the catalog and the objects inside the catalog (schemas, tables, views, and volumes) to other members of your team.\n3. You read the data in the tables, views, and volumes that you have been granted access to just like any other data asset in Databricks that you have read-only (`SELECT` or `READ VOLUME`) access to.\n4. You preview and clone notebooks in the share, as long as you have the `USE CATALOG` privilege on the catalog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)\n#### Permissions required\n\nTo be able to list and view details about all providers and provider shares, you must be a metastore admin or have the `USE PROVIDER` privilege. Other users have access only to the providers and shares that they own. \nTo create a catalog from a provider share, you must be a metastore admin, a user who has both the `CREATE_CATALOG` and `USE PROVIDER` privileges for your Unity Catalog metastore, or a user who has both the `CREATE_CATALOG` privilege and ownership of the provider object. \nThe ability to grant read-only access to the schemas (databases), tables, views, and volumes in the catalog created from the share follows the typical Unity Catalog privilege hierarchy. 
The ability to view notebooks in the catalog created from the share requires the `USE CATALOG` privilege on the catalog. See [Manage permissions for the schemas, tables, and volumes in a Delta Sharing catalog](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html#schema-table-permissions).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)\n#### View providers and shares\n\nTo start reading the data that has been shared with you by a data provider, you need to know the name of the *provider* and *share* objects that are stored in your Unity Catalog metastore once the provider has shared data with you. \nThe provider object represents the Unity Catalog metastore, cloud platform, and region of the organization that shared the data with you. \nThe share object represents the tables, volumes, and views that the provider has shared with you. \n### View all providers who have shared data with you \nTo view a list of available data providers, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `SHOW PROVIDERS` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** You must be a metastore admin or have the `USE PROVIDER` privilege. Other users have access only to the providers and provider shares that they own. \nFor details, see [View providers](https:\/\/docs.databricks.com\/data-sharing\/manage-provider.html#view-providers). \n### View provider details \nTo view details about a provider, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `DESCRIBE PROVIDER` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** You must be a metastore admin, have the `USE PROVIDER` privilege, or own the provider object. 
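\nAs a minimal sketch of the SQL route (the placeholder `<provider-name>` is an assumption; substitute a name returned by `SHOW PROVIDERS`): \n```\n-- List the providers visible to you, then inspect one of them.\nSHOW PROVIDERS;\nDESCRIBE PROVIDER <provider-name>;\n\n``` 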
\nFor details, see [View provider details](https:\/\/docs.databricks.com\/data-sharing\/manage-provider.html#view-provider-details). \n### View shares \nTo view the shares that a provider has shared with you, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `SHOW SHARES IN PROVIDER` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** You must be a metastore admin, have the `USE PROVIDER` privilege, or own the provider object. \nFor details, see [View shares that a provider has shared with you](https:\/\/docs.databricks.com\/data-sharing\/manage-provider.html#view-shares).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)\n#### Access data in a shared table or volume\n\nTo read data in a shared table or volume: \n1. A privileged user must create a catalog from the share that contains the table or volume. This can be a metastore admin, a user who has both the `CREATE_CATALOG` and `USE PROVIDER` privileges for your Unity Catalog metastore, or a user who has both the `CREATE_CATALOG` privilege and ownership of the provider object.\n2. That user or a user with the same privileges must grant you access to the shared table or volume.\n3. You can access the table or volume just as you would any other data asset registered in your Unity Catalog metastore. \n### Create a catalog from a share \nTo make the data in a share accessible to your team, you must create a catalog from the share. To create a catalog from a share, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or SQL commands in a Databricks notebook or the Databricks SQL query editor. 
\n**Permissions required**: A metastore admin, a user who has both the `CREATE_CATALOG` and `USE PROVIDER` privileges for your Unity Catalog metastore, or a user who has both the `CREATE_CATALOG` privilege and ownership of the provider object. \nNote \nIf the share includes views, you must use a catalog name that is different from the name of the catalog that contains the view in the provider\u2019s metastore. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared with me**.\n3. On the **Providers** tab, select the provider.\n4. On the **Shares** tab, find the share and click **Create catalog** on the share row.\n5. Enter a name for the catalog and an optional comment.\n6. Click **Create**. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nCREATE CATALOG [IF NOT EXISTS] <catalog-name>\nUSING SHARE <provider-name>.<share-name>;\n\n``` \nAlternatively, run the following Databricks CLI command: \n```\ndatabricks catalogs create <catalog-name> \\\n--provider-name <provider-name> \\\n--share-name <share-name>\n\n``` \nThe catalog created from a share has a catalog type of Delta Sharing. You can view the type on the catalog details page in Catalog Explorer or by running the [DESCRIBE CATALOG](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-describe-catalog.html) SQL command in a notebook or Databricks SQL query. All shared catalogs are listed under **Catalog > Shared** in the Catalog Explorer left pane. \nA Delta Sharing catalog can be managed in the same way as regular catalogs on a Unity Catalog metastore. You can view, update, and delete a Delta Sharing catalog using Catalog Explorer, the Databricks CLI, and by using `SHOW CATALOGS`, `DESCRIBE CATALOG`, `ALTER CATALOG`, and `DROP CATALOG` SQL commands. 
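\nFor illustration, the SQL commands named above might be used as follows. This is a sketch only; `<catalog-name>` and `<principal>` are placeholders, and transferring ownership is shown as one possible use of `ALTER CATALOG`: \n```\n-- List catalogs, then confirm that the catalog type is Delta Sharing.\nSHOW CATALOGS;\nDESCRIBE CATALOG <catalog-name>;\n\n-- Hand ownership of the shared catalog to another user or group.\nALTER CATALOG <catalog-name> OWNER TO <principal>;\n\n``` 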
\nThe 3-level namespace structure under a Delta Sharing catalog created from a share is the same as the one under a regular catalog on Unity Catalog: `catalog.schema.table` or `catalog.schema.volume`. \nTable and volume data under a shared catalog is read-only, which means you can perform read operations like: \n* `DESCRIBE`, `SHOW`, and `SELECT` for tables.\n* `DESCRIBE VOLUME`, `LIST <volume-path>`, `SELECT * FROM <format>.'<volume_path>'`, and `COPY INTO` for volumes. \nNotebooks in a shared catalog can be previewed and cloned by any user with `USE CATALOG` on the catalog. \nModels in a shared catalog can be read and loaded for inference by any user with the following privileges: `EXECUTE` privilege on the registered model, plus `USE SCHEMA` and `USE CATALOG` privileges on the schema and catalog containing the model. \n### Manage permissions for the schemas, tables, and volumes in a Delta Sharing catalog \nBy default, the catalog creator is the owner of all data objects under a Delta Sharing catalog and can manage permissions for any of them. \nPrivileges are inherited downward, although some workspaces may still be on the legacy security model that did not provide inheritance. See [Inheritance model](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html#inheritance-model). Any user granted the `SELECT` privilege on the catalog will have the `SELECT` privilege on all of the schemas and tables in the catalog unless that privilege is revoked. Likewise, any user granted the `READ VOLUME` privilege on the catalog will have the `READ VOLUME` privilege on all of the volumes in the catalog unless that privilege is revoked. You cannot grant privileges that give write or update access to a Delta Sharing catalog or objects in a Delta Sharing catalog. \nThe catalog owner can delegate the ownership of data objects to other users or groups, thereby granting those users the ability to manage the object permissions and life cycles. 
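\nA minimal sketch of granting read-only access with SQL, assuming the workspace uses the inheritance model described above; `<catalog-name>` and `<principal>` are placeholders: \n```\n-- Let the principal use the catalog and the schemas inside it.\nGRANT USE CATALOG ON CATALOG <catalog-name> TO <principal>;\nGRANT USE SCHEMA ON CATALOG <catalog-name> TO <principal>;\n\n-- Read access to the tables and volumes in the catalog (inherited downward).\nGRANT SELECT ON CATALOG <catalog-name> TO <principal>;\nGRANT READ VOLUME ON CATALOG <catalog-name> TO <principal>;\n\n``` 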
\nFor detailed information about managing privileges on data objects using Unity Catalog, see [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html). \n### Read data in a shared table \nYou can read data in a shared table using any of the tools available to you as a Databricks user: Catalog Explorer, notebooks, SQL queries, the Databricks CLI, and Databricks REST APIs. You must have the `SELECT` privilege on the table. \n### Read data in a shared volume \nYou can read data in a shared volume using any of the tools available to you as a Databricks user: Catalog Explorer, notebooks, SQL queries, the Databricks CLI, and Databricks REST APIs. You must have the `READ VOLUME` privilege on the volume. \n### Load a shared model for inference \nFor details on loading a shared model and using it for batch inference, see [Load models for inference](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#load-models-for-inference). \n### Query a table\u2019s history data \nIf history is shared along with the table, you can query the table data as of a version or timestamp. This requires Databricks Runtime 12.2 LTS or above. \nFor example: \n```\nSELECT * FROM vaccine.vaccine_us.vaccine_us_distribution VERSION AS OF 3;\nSELECT * FROM vaccine.vaccine_us.vaccine_us_distribution TIMESTAMP AS OF \"2023-01-01 00:00:00\";\n\n``` \nIn addition, if the change data feed (CDF) is enabled on the table, you can query the CDF. Both version and timestamp ranges are supported: \n```\nSELECT * FROM table_changes('vaccine.vaccine_us.vaccine_us_distribution', 0, 3);\nSELECT * FROM table_changes('vaccine.vaccine_us.vaccine_us_distribution', \"2023-01-01 00:00:00\", \"2023-02-01 00:00:00\");\n\n``` \nFor more information about change data feed, see [Use Delta Lake change data feed on Databricks](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html). 
\n### Query a table using Apache Spark Structured Streaming \nIf a table is shared with history, you can use it as the source for Spark Structured Streaming. Requires Databricks Runtime 12.2 LTS or above. \nSupported options: \n* `ignoreDeletes`: Ignore transactions that delete data.\n* `ignoreChanges`: Re-process updates if files were rewritten in the source table due to a data changing operation such as `UPDATE`, `MERGE INTO`, `DELETE` (within partitions), or `OVERWRITE`. Unchanged rows can still be emitted. Therefore your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. `ignoreChanges` subsumes `ignoreDeletes`. Therefore, if you use `ignoreChanges`, your stream will not be disrupted by either deletions or updates to the source table.\n* `startingVersion`: The shared table version to start from. All table changes starting from this version (inclusive) will be read by the streaming source.\n* `startingTimestamp`: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. Example: `\"2023-01-01 00:00:00.0\"`\n* `maxFilesPerTrigger`: The number of new files to be considered in every micro-batch.\n* `maxBytesPerTrigger`: The amount of data that gets processed in each micro-batch. This option sets a \u201csoft max\u201d, meaning that a batch processes approximately this amount of data and might process more than the limit in order to make the streaming query move forward in cases when the smallest input unit is larger than this limit.\n* `readChangeFeed`: Stream read the change data feed of the shared table. 
\nUnsupported options: \n* `Trigger.availableNow` \n#### Sample Structured Streaming queries \n```\n\/\/ Scala\nspark.readStream.format(\"deltaSharing\")\n.option(\"startingVersion\", 0)\n.option(\"ignoreChanges\", true)\n.option(\"maxFilesPerTrigger\", 10)\n.table(\"vaccine.vaccine_us.vaccine_us_distribution\")\n\n``` \n```\n# Python\nspark.readStream.format(\"deltaSharing\")\\\n.option(\"startingVersion\", 0)\\\n.option(\"ignoreDeletes\", True)\\\n.option(\"maxBytesPerTrigger\", 10000)\\\n.table(\"vaccine.vaccine_us.vaccine_us_distribution\")\n\n``` \nIf change data feed (CDF) is enabled on the table, you can stream read the CDF. \n```\nspark.readStream.format(\"deltaSharing\")\n.option(\"readChangeFeed\", \"true\")\n.table(\"vaccine.vaccine_us.vaccine_us_distribution\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)\n#### Read tables with deletion vectors or column mapping enabled\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nDeletion vectors are a storage optimization feature that your provider can enable on shared Delta tables. See [What are deletion vectors?](https:\/\/docs.databricks.com\/delta\/deletion-vectors.html). \nDatabricks also supports column mapping for Delta tables. See [Rename and drop columns with Delta Lake column mapping](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html). \nIf your provider shared a table with deletion vectors or column mapping enabled, you can perform batch reads on the table using a SQL warehouse or a cluster running Databricks Runtime 14.1 or above. CDF and streaming queries require Databricks Runtime 14.2 or above. \nYou can perform batch queries as-is, because they can automatically resolve `responseFormat` based on the table features of the shared table. 
\nTo read a change data feed (CDF) or to perform streaming queries on shared tables with deletion vectors or column mapping enabled, you must set the additional option `responseFormat=delta`. \nThe following examples show batch, CDF, and streaming queries: \n```\nimport org.apache.spark.sql.SparkSession\n\n\/\/ Batch query\nspark.read.format(\"deltaSharing\").table(<tableName>)\n\n\/\/ CDF query\nspark.read.format(\"deltaSharing\")\n.option(\"readChangeFeed\", \"true\")\n.option(\"responseFormat\", \"delta\")\n.option(\"startingVersion\", 1)\n.table(<tableName>)\n\n\/\/ Streaming query\nspark.readStream.format(\"deltaSharing\").option(\"responseFormat\", \"delta\").table(<tableName>)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)\n#### Read shared views\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nNote \nView sharing is supported only in Databricks-to-Databricks sharing. \nReading shared views is the same as [reading shared tables](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html#access-data), with these exceptions: \n**Compute requirements:** \n* If your Databricks account is different from the provider\u2019s, you must use a [Serverless SQL warehouse](https:\/\/docs.databricks.com\/admin\/sql\/serverless.html) to query shared views.\n* If the provider is on the same Databricks account, you can use any SQL warehouse and can also use a cluster that uses shared access mode. \n**View-on-view restrictions:** \nYou cannot create views that reference shared views. \n**View sharing restrictions:** \nYou cannot share views that reference shared tables or shared views. 
\n**Naming requirements:** \nThe catalog name that you use for the shared catalog that contains the view cannot be the same as any provider catalog that contains a table referenced by the view. For example, if the shared view is contained in your `test` catalog, and one of the provider\u2019s tables referenced in that view is contained in the provider\u2019s `test` catalog, the query will result in a namespace conflict error. See [Create a catalog from a share](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html#create-catalog). \n**History and streaming:** \nYou cannot query history or use a view as a streaming source. \n**JDBC\/ODBC:** \nThe instructions in this article focus on reading shared data using Databricks user interfaces, specifically Unity Catalog syntax and interfaces. You can also query shared views using Apache Spark, Python, and BI tools like Tableau and Power BI using Databricks JDBC\/ODBC drivers. To learn how to connect using the Databricks JDBC\/ODBC drivers, see [Databricks ODBC and JDBC Drivers](https:\/\/docs.databricks.com\/integrations\/jdbc-odbc-bi.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)\n#### Read shared notebooks\n\nTo preview and clone shared notebook files, you can use Catalog Explorer. \n**Permissions required:** Catalog owner or user with the `USE CATALOG` privilege on the catalog created from the share. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Catalog** menu, find and select the catalog created from the share.\n3. On the **Other assets** tab, you\u2019ll see any shared notebook files.\n4. Click the name of a shared notebook file to preview it.\n5. 
(Optional) Click the **Clone** button to import the shared notebook file to your workspace. \n1. On the **Clone to** dialog, optionally enter a **New name**, then select the workspace folder you want to clone the notebook file to.\n2. Click **Clone**.\n3. Once the notebook is cloned, a dialog pops up to let you know that it successfully cloned. Click **reveal in the notebook editor** on the dialog to view it in the notebook editor. \nSee [Introduction to Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html"} +{"content":"# Technology partners\n## Connect to semantic layer partners using Partner Connect\n#### Connect to Stardog\n\nThe Stardog Enterprise Knowledge Graph Platform provides a foundation for a flexible semantic data layer designed to answer complex queries across data silos. \nYou can integrate your Databricks SQL warehouses (formerly Databricks SQL endpoints) and Databricks clusters with Stardog.\n\n#### Connect to Stardog\n##### Connect to Stardog using Partner Connect\n\nNote \nPartner Connect only supports integrating SQL warehouses with Stardog. To integrate a cluster with Stardog, connect to Stardog manually. \nTo connect to Stardog using Partner Connect, do the following: \n1. [Connect to semantic layer partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/semantic-layer.html).\n2. In your Stardog account, follow the prompt to update your Stardog profile. Check the boxes to agree to the Stardog terms of use and privacy policy, then click **Update**.\n3. Click **Get a Stardog Cloud Instance**.\n4. Click **Start Free**, check the box to agree to the Stardog terms of service, then click **Checkout**.\n5. Under **Action**, click the three-dot menu, then click **Create datasource**. \nThe **Manage Databricks Datasource** dialog opens.\n6. Optionally edit the data source name, then click **Create Datasource**. 
\nYou can find videos about how to use starter kits with sample data in your Stardog account, or you can click your Databricks connection to access [Stardog applications](https:\/\/docs.stardog.com\/stardog-applications\/) to start modeling your data.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/semantic-layer\/stardog.html"} +{"content":"# Technology partners\n## Connect to semantic layer partners using Partner Connect\n#### Connect to Stardog\n##### Connect to Stardog manually\n\nThis section describes how to connect to Stardog manually. \nNote \nYou can use Partner Connect to simplify the connection experience for a SQL warehouse. \n### Requirements \nBefore you connect to Stardog manually, you need the following: \n* A cluster or SQL warehouse in your Databricks workspace. \n+ [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n+ [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* The connection details for your cluster or SQL warehouse, specifically the **Server Hostname**, **Port**, and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). To create a personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. 
Copy the displayed token to a secure location, and then click **Done**.\nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. See your workspace administrator or the following: \n+ [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n+ [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). \n### Steps to connect \nTo connect to Stardog manually, do the following: \n1. [Create](https:\/\/www.stardog.com\/stardog-cloud\/) a Stardog account.\n2. In your Stardog account, follow the prompt to update your Stardog profile. 
Check the boxes to agree to the Stardog terms of use and privacy policy, then click **Update**.\n3. Follow the prompt to verify your email address.\n4. Click **Get a Stardog Cloud Instance**.\n5. Click **Start Free**, check the box to agree to the Stardog terms of service, then click **Checkout**.\n6. Click your Databricks connection.\n7. Click **Stardog Studio**. \nStardog Studio opens in a new tab.\n8. Click **Data**.\n9. Click **+ Data source**.\n10. In the **Add Data Source** dialog box, enter a name for your data source.\n11. For **Data Source Type**, select **Databricks and Spark SQL** from the drop-down list.\n12. For **JDBC Connection URL**, enter the connection details from the requirements.\n13. For **JDBC Username**, enter `token`.\n14. For **JDBC Password**, enter the personal access token from the requirements.\n15. For **JDBC Driver Class**, enter `com.simba.spark.jdbc.Driver`.\n16. Click **Add**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/semantic-layer\/stardog.html"} +{"content":"# Technology partners\n## Connect to semantic layer partners using Partner Connect\n#### Connect to Stardog\n##### Next steps\n\n1. Create Knowledge Graph models in [Stardog Designer](https:\/\/docs.stardog.com\/stardog-applications\/designer\/).\n2. 
Visualize models in [Stardog Explorer](https:\/\/docs.stardog.com\/stardog-applications\/explorer\/).\n\n#### Connect to Stardog\n##### Additional resources\n\nExplore the following Stardog resources: \n* [Website](https:\/\/www.stardog.com\/)\n* [Documentation](https:\/\/docs.stardog.com\/)\n* [Getting support](https:\/\/docs.stardog.com\/getting-support\/)\n* [Community](https:\/\/community.stardog.com\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/semantic-layer\/stardog.html"} +{"content":"# Technology partners\n### Connect to ML partners using Partner Connect\n\nTo connect your Databricks workspace to a machine learning partner solution using Partner Connect, you typically follow the steps in this article. \nImportant \nBefore you follow the steps in this article, see the appropriate partner article for important partner-specific information. There might be differences in the connection steps between partner solutions. Some partner solutions also allow you to integrate with Databricks SQL warehouses (formerly Databricks SQL endpoints) or Databricks clusters, but not both.\n\n### Connect to ML partners using Partner Connect\n#### Requirements\n\nSee the [requirements](https:\/\/docs.databricks.com\/partner-connect\/index.html#requirements) for using Partner Connect. \nImportant \nFor partner-specific requirements, see the appropriate partner article.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/ml.html"} +{"content":"# Technology partners\n### Connect to ML partners using Partner Connect\n#### Steps to connect to a machine learning partner\n\nTo connect your Databricks workspace to a machine learning partner solution, follow the steps in this section. \nTip \nIf you have an existing partner account, Databricks recommends that you follow the steps to connect to the partner solution manually in the appropriate partner article. This is because the connection experience in Partner Connect is optimized for new partner accounts. \n1. 
In the sidebar, click ![Partner Connect button](https:\/\/docs.databricks.com\/_images\/partner-connect.png) **Partner Connect**.\n2. Click the partner tile. \nNote \nIf the partner tile has a check mark icon inside it, an administrator has already used Partner Connect to connect the partner to your workspace. Skip to step 4. The partner uses the email address for your Databricks account to prompt you to sign in to your existing partner account.\n3. Click **Next**. \nPartner Connect creates the following resources in your workspace: \n* A Databricks cluster named **`<PARTNER>_CLUSTER`**.\n* A Databricks [service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) named **`<PARTNER>_USER`**.\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) that is associated with the **`<PARTNER>_USER`** service principal. \nThe **Email** box displays the email address for your Databricks account. The partner uses this email address to prompt you to either create a new partner account or sign in to your existing partner account.\n4. Click **Connect to `<Partner>`** or **Sign in**. \nA new tab opens in your web browser, which displays the partner website.\n5. Complete the on-screen instructions on the partner website to create your trial partner account or sign in to your existing partner account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/ml.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Manage model lifecycle in Unity Catalog\n\nImportant \n* This article documents Models in Unity Catalog, which Databricks recommends for governing and deploying models. 
If your workspace is not enabled for Unity Catalog, the functionality on this page is not available. Instead, see [Manage model lifecycle using the Workspace Model Registry (legacy)](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html). For guidance on how to upgrade from the Workspace Model Registry to Unity Catalog, see [Migrate workflows and models to Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#migrate-models-to-uc)\n* Models in Unity Catalog isn\u2019t available in AWS GovCloud regions. \nThis article describes how to use Models in Unity Catalog as part of your machine learning workflow to manage the full lifecycle of ML models. Databricks provides a hosted version of MLflow Model Registry in [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). Models in Unity Catalog extends the benefits of Unity Catalog to ML models, including centralized access control, auditing, lineage, and model discovery across workspaces. Models in Unity Catalog is compatible with the open-source MLflow Python client. \nKey features of models in Unity Catalog include: \n* Namespacing and governance for models, so you can group and govern models at the environment, project, or team level (\u201cGrant data scientists read-only access to production models\u201d).\n* Chronological model lineage (which MLflow experiment and run produced the model at a given time).\n* [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html).\n* Model versioning.\n* Model deployment via aliases. For example, mark the \u201cChampion\u201d version of a model within your `prod` catalog. 
\nIf your [workspace\u2019s default catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement) is configured to a catalog in Unity Catalog, models registered using MLflow APIs such as `mlflow.<model-type>.log_model(..., registered_model_name)` or `mlflow.register_model(model_uri, name)` are registered to Unity Catalog by default. \nThis article includes instructions for both the Models in Unity Catalog UI and API. \nFor an overview of Model Registry concepts, see [ML lifecycle management using MLflow](https:\/\/docs.databricks.com\/mlflow\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Manage model lifecycle in Unity Catalog\n##### Requirements\n\n1. Unity Catalog must be enabled in your workspace. See [Get started using Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html) to create a Unity Catalog Metastore, enable it in a workspace, and create a catalog. If Unity Catalog is not enabled, you can still use the classic [workspace model registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html).\n2. Your workspace must be attached to a Unity Catalog metastore that supports privilege inheritance. This is true for all metastores created after August 25, 2022. If running on an older metastore, [follow docs](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/upgrade-privilege-model.html) to upgrade.\n3. You must have access to run commands on a [cluster with access to Unity Catalog](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode).\n4. To create new registered models, you need the `CREATE_MODEL` privilege on a schema, in addition to the `USE SCHEMA` and `USE CATALOG` privileges on the schema and its enclosing catalog. 
`CREATE_MODEL` is a new schema-level privilege that you can grant using the Catalog Explorer UI or the [SQL GRANT command](https:\/\/docs.databricks.com\/sql\/language-manual\/security-grant.html), as shown below. \n```\nGRANT CREATE_MODEL ON SCHEMA <schema-name> TO <principal>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Manage model lifecycle in Unity Catalog\n##### Upgrade training workloads to Unity Catalog\n\nThis section includes instructions to upgrade existing training workloads to Unity Catalog. \n### Install MLflow Python client \nSupport for models in Unity Catalog is included in Databricks Runtime 13.2 ML and above. \nYou can also use models in Unity Catalog on Databricks Runtime 11.3 LTS and above by installing the latest version of the MLflow Python client in your notebook, using the code below. \n```\n%pip install --upgrade \"mlflow-skinny[databricks]\"\ndbutils.library.restartPython()\n\n``` \n### Configure MLflow client to access models in Unity Catalog \nBy default, the MLflow Python client creates models in the Databricks workspace model registry. To upgrade to models in Unity Catalog, configure the MLflow client: \n```\nimport mlflow\nmlflow.set_registry_uri(\"databricks-uc\")\n\n``` \nNote \nIf your workspace\u2019s [default catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#view-the-current-default-catalog) is in Unity Catalog (rather than `hive_metastore`) and you are running a cluster using Databricks Runtime 13.3 LTS or above, models are automatically created in and loaded from the default catalog, with no configuration required. There is no change in behavior for other Databricks Runtime versions. 
A small number of workspaces where both the default catalog was configured to a catalog in Unity Catalog prior to January 2024 and the workspace model registry was used prior to January 2024 are exempt from this behavior. \n### Train and register Unity Catalog-compatible models \n**Permissions required**: To create a new registered model, you need the `CREATE_MODEL` and `USE SCHEMA` privileges on the enclosing schema, and `USE CATALOG` privilege on the enclosing catalog. To create new model versions under a registered model, you must be the owner of the registered model and have `USE SCHEMA` and `USE CATALOG` privileges on the schema and catalog containing the model. \nML model versions in UC must have a [model signature](https:\/\/mlflow.org\/docs\/latest\/models.html#model-signature). If you\u2019re not already logging MLflow models with signatures in your model training workloads, you can either: \n* Use [Databricks autologging](https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html), which automatically logs models with signatures for many popular ML frameworks. See supported frameworks in the [MLflow docs](https:\/\/mlflow.org\/docs\/latest\/tracking.html#automatic-logging).\n* With MLflow 2.5.0 and above, you can specify an input example in your `mlflow.<flavor>.log_model` call, and the model signature is automatically inferred. For further information, refer to [the MLflow documentation](https:\/\/mlflow.org\/docs\/latest\/models.html#how-to-log-models-with-signatures). \nThen, pass the three-level name of the model to MLflow APIs, in the form `<catalog>.<schema>.<model>`. \nThe examples in this section create and access models in the `ml_team` schema under the `prod` catalog. \nThe model training examples in this section create a new model version and register it in the `prod` catalog. Using the `prod` catalog doesn\u2019t necessarily mean that the model version serves production traffic. 
The model version\u2019s enclosing catalog, schema, and registered model reflect its environment (`prod`) and associated governance rules (for example, privileges can be set up so that only admins can delete from the `prod` catalog), but not its deployment status. To manage the deployment status, use [model aliases](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#uc-model-aliases). \n#### Register a model to Unity Catalog using autologging \n```\nfrom sklearn import datasets\nfrom sklearn.ensemble import RandomForestClassifier\n\n# Train a sklearn model on the iris dataset\nX, y = datasets.load_iris(return_X_y=True, as_frame=True)\nclf = RandomForestClassifier(max_depth=7)\nclf.fit(X, y)\n\n# Note that the UC model name follows the pattern\n# <catalog_name>.<schema_name>.<model_name>, corresponding to\n# the catalog, schema, and registered model name\n# in Unity Catalog under which to create the version\n# The registered model will be created if it doesn't already exist\nautolog_run = mlflow.last_active_run()\nmodel_uri = \"runs:\/{}\/model\".format(autolog_run.info.run_id)\nmlflow.register_model(model_uri, \"prod.ml_team.iris_model\")\n\n``` \n#### Register a model to Unity Catalog with automatically inferred signature \nSupport for automatically inferred signatures is available in MLflow version 2.5.0 and above, and is supported in Databricks Runtime 11.3 LTS ML and above. To use automatically inferred signatures, use the following code to install the latest MLflow Python client in your notebook: \n```\n%pip install --upgrade \"mlflow-skinny[databricks]\"\ndbutils.library.restartPython()\n\n``` \nThe following code shows an example of an automatically inferred signature. 
\n```\nimport mlflow\nfrom sklearn import datasets\nfrom sklearn.ensemble import RandomForestClassifier\n\nwith mlflow.start_run():\n    # Train a sklearn model on the iris dataset\n    X, y = datasets.load_iris(return_X_y=True, as_frame=True)\n    clf = RandomForestClassifier(max_depth=7)\n    clf.fit(X, y)\n    # Take the first row of the training dataset as the model input example.\n    input_example = X.iloc[[0]]\n    # Log the model and register it as a new version in UC.\n    mlflow.sklearn.log_model(\n        sk_model=clf,\n        artifact_path=\"model\",\n        # The signature is automatically inferred from the input example and its predicted output.\n        input_example=input_example,\n        registered_model_name=\"prod.ml_team.iris_model\",\n    )\n\n``` \n### Track the data lineage of a model in Unity Catalog \nNote \nSupport for table-to-model lineage in Unity Catalog is available in MLflow 2.11.0 and above. \nWhen you train a model on a table in Unity Catalog, you can track the lineage of the model to the upstream dataset(s) it was trained and evaluated on. To do this, use [mlflow.log\\_input](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.html?highlight=log_input#mlflow.log_input). This saves the input table information with the MLflow run that generated the model. Data lineage is also automatically captured for models logged using feature store APIs. See [View feature store lineage](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/lineage.html). \nWhen you register the model to Unity Catalog, lineage information is automatically saved and is visible in the **Lineage** tab of the [model version UI in Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/explore-models.html#view-model-version-information). \nThe following code shows an example. 
\n```\nimport mlflow\nimport pandas as pd\nimport pyspark.pandas as ps\nfrom sklearn.datasets import load_iris\nfrom sklearn.ensemble import RandomForestRegressor\n\n# Write a table to Unity Catalog\niris = load_iris()\niris_df = pd.DataFrame(iris.data, columns=iris.feature_names)\niris_df.rename(\n    columns={\n        'sepal length (cm)': 'sepal_length',\n        'sepal width (cm)': 'sepal_width',\n        'petal length (cm)': 'petal_length',\n        'petal width (cm)': 'petal_width',\n    },\n    inplace=True,\n)\niris_df['species'] = iris.target\nps.from_pandas(iris_df).to_table(\"prod.ml_team.iris\", mode=\"overwrite\")\n\n# Load a Unity Catalog table, train a model, and log the input table\ndataset = mlflow.data.load_delta(table_name=\"prod.ml_team.iris\", version=\"0\")\npd_df = dataset.df.toPandas()\nX = pd_df.drop(\"species\", axis=1)\ny = pd_df[\"species\"]\nwith mlflow.start_run():\n    clf = RandomForestRegressor(n_estimators=100)\n    clf.fit(X, y)\n    mlflow.log_input(dataset, \"training\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Manage model lifecycle in Unity Catalog\n##### View models in the UI\n\n**Permissions required**: To view a registered model and its model versions in the UI, you need `EXECUTE` privilege on the registered model,\nplus `USE SCHEMA` and `USE CATALOG` privileges on the schema and catalog containing the model. \nYou can view and manage registered models and model versions in Unity Catalog using the [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html).\n\n#### Manage model lifecycle in Unity Catalog\n##### Control access to models\n\nFor information about controlling access to models registered in Unity Catalog, see [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html). 
For best practices on organizing models across catalogs and schemas, see [Organize your data](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html#organize-data). \nYou can configure model permissions programmatically using the [Grants REST API](https:\/\/docs.databricks.com\/api\/workspace\/grants). When configuring model permissions, set `securable_type` to `\"FUNCTION\"` in REST API requests. For example, use `PATCH \/api\/2.1\/unity-catalog\/permissions\/function\/{full_name}` to update registered model permissions.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Manage model lifecycle in Unity Catalog\n##### Deploy and organize models with aliases and tags\n\nModel aliases and tags help you organize and manage models in Unity Catalog. \nModel aliases allow you to assign a mutable, named reference to a particular version of a registered model. You can use aliases to indicate the deployment status of a model version. For example, you could assign a \u201cChampion\u201d alias to the model version currently in production and target this alias in workloads that use the production model. You can then update the production model by reassigning the \u201cChampion\u201d alias to a different model version. \n[Tags](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/tags.html) are key-value pairs that you associate with registered models and model versions, allowing you to label and categorize them by function or status. For example, you could apply a tag with key `\"task\"` and value `\"question-answering\"` (displayed in the UI as `task:question-answering`) to registered models intended for question-answering tasks. 
At the model version level, you could tag versions undergoing pre-deployment validation with `validation_status:pending` and those cleared for deployment with `validation_status:approved`. \nSee the following sections for how to use aliases and tags. \n### Set and delete aliases on models \n**Permissions required**: Owner of the registered model, plus `USE SCHEMA` and `USE CATALOG` privileges on the schema and catalog containing the model. \nYou can set, update, and remove aliases for models in Unity Catalog by using [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/explore-models.html). You can manage aliases across a registered model in the [model details page](https:\/\/docs.databricks.com\/catalog-explorer\/explore-models.html#view-model-information) and configure aliases for a specific model version in the [model version details page](https:\/\/docs.databricks.com\/catalog-explorer\/explore-models.html#view-model-version-information). \nTo set, update, and delete aliases using the MLflow Client API, see the examples below: \n```\nfrom mlflow import MlflowClient\nclient = MlflowClient()\n\n# create \"Champion\" alias for version 1 of model \"prod.ml_team.iris_model\"\nclient.set_registered_model_alias(\"prod.ml_team.iris_model\", \"Champion\", 1)\n\n# reassign the \"Champion\" alias to version 2\nclient.set_registered_model_alias(\"prod.ml_team.iris_model\", \"Champion\", 2)\n\n# get a model version by alias\nclient.get_model_version_by_alias(\"prod.ml_team.iris_model\", \"Champion\")\n\n# delete the alias\nclient.delete_registered_model_alias(\"prod.ml_team.iris_model\", \"Champion\")\n\n``` \n### Set and delete tags on models \n**Permissions required**: Owner of or have `APPLY_TAG` privilege on the registered model, plus `USE SCHEMA` and `USE CATALOG` privileges on the schema and catalog containing the model. 
\nSee [Manage tags in Catalog Explorer](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/tags.html#catalog-explorer) on how to set and delete tags using the UI. \nTo set and delete tags using the MLflow Client API, see the examples below: \n```\nfrom mlflow import MlflowClient\nclient = MlflowClient()\n\n# Set registered model tag\nclient.set_registered_model_tag(\"prod.ml_team.iris_model\", \"task\", \"classification\")\n\n# Delete registered model tag\nclient.delete_registered_model_tag(\"prod.ml_team.iris_model\", \"task\")\n\n# Set model version tag\nclient.set_model_version_tag(\"prod.ml_team.iris_model\", \"1\", \"validation_status\", \"approved\")\n\n# Delete model version tag\nclient.delete_model_version_tag(\"prod.ml_team.iris_model\", \"1\", \"validation_status\")\n\n``` \nBoth registered model and model version tags must meet the [platform-wide constraints](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/tags.html#constraints). \nFor more details on alias and tag client APIs, see the [MLflow API documentation](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.client.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Manage model lifecycle in Unity Catalog\n##### Load models for inference\n\n### Consume model versions by alias in inference workloads \n**Permissions required**: `EXECUTE` privilege on the registered model, plus `USE SCHEMA` and `USE CATALOG` privileges on the schema and catalog containing the model. \nYou can write batch inference workloads that reference a model version by alias. For example, the snippet below loads and applies the \u201cChampion\u201d model version for batch inference. If the \u201cChampion\u201d version is updated to reference a new model version, the batch inference workload automatically picks it up on its next execution. 
This allows you to decouple model deployments from your batch inference workloads. \n```\nimport mlflow.pyfunc\nmodel_version_uri = \"models:\/prod.ml_team.iris_model@Champion\"\nchampion_version = mlflow.pyfunc.load_model(model_version_uri)\nchampion_version.predict(test_x)\n\n``` \nYou can also write deployment workflows to get a model version by alias and update a model serving endpoint to serve that version, using the [model serving REST API](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html#endpoint-config): \n```\nimport mlflow\nimport requests\nclient = mlflow.tracking.MlflowClient()\nchampion_version = client.get_model_version_by_alias(\"prod.ml_team.iris_model\", \"Champion\")\n# Invoke the model serving REST API to update endpoint to serve the current \"Champion\" version\nmodel_name = champion_version.name\nmodel_version = champion_version.version\nrequests.request(...)\n\n``` \n### Consume model versions by version number in inference workloads \nYou can also load model versions by version number: \n```\nimport mlflow.pyfunc\n# Load version 1 of the model \"prod.ml_team.iris_model\"\nmodel_version_uri = \"models:\/prod.ml_team.iris_model\/1\"\nfirst_version = mlflow.pyfunc.load_model(model_version_uri)\nfirst_version.predict(test_x)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Manage model lifecycle in Unity Catalog\n##### Share models across workspaces\n\n### Share models with users in the same region \nAs long as you have the appropriate privileges, you can access models in Unity Catalog from any workspace that is attached to the metastore containing the model. For example, you can access models from the `prod` catalog in a dev workspace, to facilitate comparing newly-developed models to the production baseline. 
\nTo collaborate with other users (share write privileges) on a registered model you created, you must grant ownership of the model to a group containing yourself and the users you\u2019d like to collaborate with. Collaborators must also have the `USE CATALOG` and `USE SCHEMA` privileges on the catalog and schema containing the model. See [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html) for details. \n### Share models with users in another region or account \nTo share models with users in other regions or accounts, use the Delta Sharing [Databricks-to-Databricks sharing flow](https:\/\/docs.databricks.com\/data-sharing\/index.html#delta-sharing). See [Add models to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#models) (for providers) and [Get access in the Databricks-to-Databricks model](https:\/\/docs.databricks.com\/data-sharing\/recipient.html#get-access-db-to-db) (for recipients). As a recipient, after you create a catalog from a share, you access models in that shared catalog the same way as any other model in Unity Catalog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Manage model lifecycle in Unity Catalog\n##### Promote a model across environments\n\nDatabricks recommends that you deploy ML pipelines as code. This eliminates the need to promote models across environments, as all production models can be produced through automated training workflows in a production environment. \nHowever, in some cases, it may be too expensive to retrain models across environments. Instead, you can copy model versions across registered models in Unity Catalog to promote them across environments. 
\nYou need the following privileges to execute the example code below: \n* `USE CATALOG` on the `staging` and `prod` catalogs.\n* `USE SCHEMA` on the `staging.ml_team` and `prod.ml_team` schemas.\n* `EXECUTE` on `staging.ml_team.fraud_detection`. \nIn addition, you must be the owner of the registered model `prod.ml_team.fraud_detection`. \nThe following code snippet uses the `copy_model_version` [MLflow Client API](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.client.html#mlflow.client.MlflowClient.copy_model_version), available in MLflow version 2.8.0 and above. \n```\nimport mlflow\nmlflow.set_registry_uri(\"databricks-uc\")\n\nclient = mlflow.tracking.MlflowClient()\nsrc_model_name = \"staging.ml_team.fraud_detection\"\nsrc_model_version = \"1\"\nsrc_model_uri = f\"models:\/{src_model_name}\/{src_model_version}\"\ndst_model_name = \"prod.ml_team.fraud_detection\"\ncopied_model_version = client.copy_model_version(src_model_uri, dst_model_name)\n\n``` \nAfter the model version is in the production environment, you can perform any necessary pre-deployment validation. Then, you can mark the model version for deployment [using aliases](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#uc-model-aliases). \n```\nclient = mlflow.tracking.MlflowClient()\nclient.set_registered_model_alias(name=\"prod.ml_team.fraud_detection\", alias=\"Champion\", version=copied_model_version.version)\n\n``` \nIn the example above, only users who can read from the `staging.ml_team.fraud_detection` registered model and write to the `prod.ml_team.fraud_detection` registered model can promote staging models to the production environment. The same users can also use aliases to manage which model versions are deployed within the production environment. You don\u2019t need to configure any other rules or policies to govern model promotion and deployment. 
\nYou can customize this flow to promote the model version across multiple environments that match your setup, such as `dev`, `qa`, and `prod`. Access control is enforced as configured in each environment.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Manage model lifecycle in Unity Catalog\n##### Annotate a model or model version\n\n**Permissions required**: Owner of the registered model, plus `USE SCHEMA` and `USE CATALOG` privileges on the schema and catalog containing the model. \nYou can provide information about a model or model version by annotating it. For example, you may want to include an overview of the problem or information about the methodology and algorithm used. \n### Annotate a model or model version using the UI \nSee [Document data in Catalog Explorer using markdown comments](https:\/\/docs.databricks.com\/catalog-explorer\/markdown-data-comments.html). \n### Annotate a model or model version using the API \nTo update a registered model description, use the MLflow Client API `update_registered_model()` method: \n```\nclient = MlflowClient()\nclient.update_registered_model(\nname=\"<model-name>\",\ndescription=\"<description>\"\n)\n\n``` \nTo update a model version description, use the MLflow Client API `update_model_version()` method: \n```\nclient = MlflowClient()\nclient.update_model_version(\nname=\"<model-name>\",\nversion=<model-version>,\ndescription=\"<description>\"\n)\n\n```\n\n#### Manage model lifecycle in Unity Catalog\n##### Rename a model\n\n**Permissions required**: Owner of the registered model, `CREATE_MODEL` privilege on the schema containing the registered model, and `USE SCHEMA` and `USE CATALOG` privileges on the schema and catalog containing the model. 
\nTo rename a registered model, use the MLflow Client API `rename_registered_model()` method: \n```\nclient=MlflowClient()\nclient.rename_registered_model(\"<full-model-name>\", \"<new-model-name>\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Manage model lifecycle in Unity Catalog\n##### Delete a model or model version\n\n**Permissions required**: Owner of the registered model, plus `USE SCHEMA` and `USE CATALOG` privileges on the schema and catalog containing the model. \nYou can delete a registered model or a model version within a registered model using the [Catalog Explorer UI](https:\/\/docs.databricks.com\/catalog-explorer\/explore-models.html) or the API. \n### Delete a model version or model using the API \nWarning \nYou cannot undo this action. When you delete a model, all model artifacts stored by Unity Catalog and all the metadata associated with the registered model are deleted. 
\n#### Delete a model version \nTo delete a model version, use the MLflow Client API `delete_model_version()` method: \n```\nfrom mlflow import MlflowClient\n\n# Delete versions 1, 2, and 3 of the model\nclient = MlflowClient()\nversions = [1, 2, 3]\nfor version in versions:\n    client.delete_model_version(name=\"<model-name>\", version=version)\n\n``` \n#### Delete a model \nTo delete a model, use the MLflow Client API `delete_registered_model()` method: \n```\nclient = MlflowClient()\nclient.delete_registered_model(name=\"<model-name>\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Manage model lifecycle in Unity Catalog\n##### List and search models\n\nYou can list registered models in Unity Catalog with MLflow\u2019s [search\\_registered\\_models()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.client.html#mlflow.client.MlflowClient.search_registered_models) Python API: \n```\nclient = MlflowClient()\nclient.search_registered_models()\n\n``` \nYou can also search for a specific model name and list its version details using the `search_model_versions()` method: \n```\nfrom pprint import pprint\n\nclient = MlflowClient()\nfor mv in client.search_model_versions(\"name='<model-name>'\"):\n    pprint(mv)\n\n``` \nNote \nNot all search API fields and operators are supported for models in Unity Catalog. See [Limitations](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#limitations) for details.\n\n#### Manage model lifecycle in Unity Catalog\n##### Download model files (advanced use case)\n\nIn most cases, to load models, you should use MLflow APIs like `mlflow.pyfunc.load_model` or `mlflow.<flavor>.load_model` (for example, `mlflow.transformers.load_model` for Hugging Face models). \nIn some cases, you may need to download model files to debug model behavior or model loading issues. 
You can download model files using `mlflow.artifacts.download_artifacts`, as follows: \n```\nimport mlflow\nmlflow.set_registry_uri(\"databricks-uc\")\nmodel_uri = f\"models:\/{model_name}\/{version}\" # reference model by version or alias\ndestination_path = \"\/local_disk0\/model\"\nmlflow.artifacts.download_artifacts(artifact_uri=model_uri, dst_path=destination_path)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Manage model lifecycle in Unity Catalog\n##### Example\n\nThis example illustrates how to use Models in Unity Catalog to build a machine learning application. \n[Models in Unity Catalog example](https:\/\/docs.databricks.com\/mlflow\/models-in-uc-example.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Manage model lifecycle in Unity Catalog\n##### Migrate workflows and models to Unity Catalog\n\nDatabricks recommends using Models in Unity Catalog for improved governance, easy sharing across workspaces and environments, and more flexible MLOps workflows. The table compares the capabilities of the Workspace Model Registry and Unity Catalog. \n| Capability | Workspace Model Registry (legacy) | Models in Unity Catalog (recommended) |\n| --- | --- | --- |\n| Reference model versions by named aliases | Model Registry Stages: Move model versions into one of four fixed stages to reference them by that stage. Cannot rename or add stages. | Model Registry Aliases: Create up to 10 custom and reassignable named references to model versions for each registered model. 
|\n| Create access-controlled environments for models | Model Registry Stages: Use stages within one registered model to denote the environment of its model versions, with access controls for only two of the four fixed stages (`Staging` and `Production`). | Registered Models: Create a registered model for each environment in your MLOps workflow, utilizing three-level namespaces and permissions of Unity Catalog to express governance. |\n| Promote models across environments (deploy model) | Use the `transition_model_version_stage()` MLflow Client API to move a model version to a different stage, potentially breaking workflows that reference the previous stage. | Use the `copy_model_version()` MLflow Client API to copy a model version from one registered model to another. |\n| Access and share models across workspaces | Manually export and import models across workspaces, or configure connections to remote model registries using personal access tokens and workspace secret scopes. | Out-of-the-box access to models across workspaces in the same account. No configuration required. |\n| Configure permissions | Set permissions at the workspace level. | Set permissions at the account level, which applies consistent governance across workspaces. |\n| Access models in the Databricks Marketplace | Unavailable. | Load models from the Databricks Marketplace into your Unity Catalog metastore and access them across workspaces. | \nThe articles linked below describe how to migrate workflows (model training and batch inference jobs) and models from the Workspace Model Registry to Unity Catalog. 
\n* [Upgrade ML workflows to target models in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html)\n* [Upgrade models to Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-models.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Manage model lifecycle in Unity Catalog\n##### Limitations\n\n* Stages are not supported for models in Unity Catalog. Databricks recommends using the three-level namespace in Unity Catalog to express the environment a model is in, and using aliases to promote models for deployment. See [Promote a model across environments](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html#promote) for details.\n* Webhooks are not supported for models in Unity Catalog. See suggested alternatives in [the upgrade guide](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html#manual-approval).\n* Some search API fields and operators are not supported for models in Unity Catalog. This can be mitigated by calling the search APIs using supported filters and scanning the results. 
The following are some examples: \n+ The `order_by` parameter is not supported in the [search\\_model\\_versions](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.client.html#mlflow.client.MlflowClient.search_model_versions) or [search\\_registered\\_models](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.client.html#mlflow.client.MlflowClient.search_registered_models) client APIs.\n+ Tag-based filters (`tags.mykey = 'myvalue'`) are not supported for `search_model_versions` or `search_registered_models`.\n+ Operators other than exact equality (for example, `LIKE`, `ILIKE`, `!=`) are not supported for `search_model_versions` or `search_registered_models`.\n+ Searching registered models by name (for example, `MlflowClient().search_registered_models(filter_string=\"name='main.default.mymodel'\")`) is not supported. To fetch a particular registered model by name, use [get\\_registered\\_model](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.client.html#mlflow.client.MlflowClient.get_registered_model).\n* Email notifications and comment discussion threads on registered models and model versions are not supported in Unity Catalog.\n* The activity log is not supported for models in Unity Catalog. However, you can track activity on models in Unity Catalog using [audit logs](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html#uc).\n* `search_registered_models` might return stale results for models shared through Delta Sharing. 
To ensure the most recent results, use the Databricks CLI or [SDK](https:\/\/databricks-sdk-py.readthedocs.io\/en\/latest\/workspace\/catalog\/registered_models.html#databricks.sdk.service.catalog.RegisteredModelsAPI.list) to list the models in a schema.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Monitor and manage Delta Sharing egress costs (for providers)\n\nThis article describes tools that you can use to monitor and manage cloud vendor egress costs when you share data and AI assets using Delta Sharing. \nUnlike other data sharing platforms, Delta Sharing does not require data replication. This model has many advantages, but it means that your cloud vendor may charge data egress fees when you share data across clouds or regions. If you use Delta Sharing to share data and AI assets within a region, you incur no egress cost. \nTo monitor and manage egress charges, Databricks provides: \n* [Notebooks that you can use to run a DLT pipeline that monitors egress usage patterns and cost](https:\/\/docs.databricks.com\/data-sharing\/manage-egress.html#notebooks).\n* [Instructions for replicating data between regions to avoid egress fees](https:\/\/docs.databricks.com\/data-sharing\/manage-egress.html#replicate).\n* [Support for Cloudflare R2 storage to avoid egress fees](https:\/\/docs.databricks.com\/data-sharing\/manage-egress.html#r2).\n\n### Monitor and manage Delta Sharing egress costs (for providers)\n#### Delta Sharing egress pipeline notebooks\n\nIn Databricks Marketplace, the listing [Delta Sharing Egress Pipeline](https:\/\/marketplace.databricks.com\/details\/a6f2e062-3084-4976-9eb0-47b2c8244d43\/Databricks_Delta-Sharing-Egress-Pipeline) includes two notebooks that you can clone and use to monitor egress usage patterns and costs associated with Delta Sharing. 
Both of these notebooks create and execute a Delta Live Tables pipeline: \n* IP Ranges Mapping Pipeline notebook\n* Egress Cost Analysis Pipeline notebook \nWhen you run these notebooks as a Delta Live Tables template, they will automatically generate a detailed cost report. Logs are joined with cloud provider IP range tables and Delta Sharing system tables to generate egress bytes transferred, attributed by share and recipient. \nComplete requirements and instructions are available in the listing.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/manage-egress.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Monitor and manage Delta Sharing egress costs (for providers)\n#### Replicate data to avoid egress costs\n\nOne approach to avoiding egress costs is for the provider to create and sync local replicas of shared data in regions that their recipients are using. Another approach is for recipients to clone the shared data to local regions for active querying, setting up syncs between the shared table and the local clone. This section discusses a number of replication patterns. \n### Use Delta deep clone for incremental replication \nProviders can use `DEEP CLONE` to replicate Delta tables to external locations across the regions that they share to. Deep clones copy the source table data and metadata to the clone target. Deep clones also enable incremental updates by identifying new data in the source table and refreshing the target accordingly. 
\n```\nCREATE TABLE [IF NOT EXISTS] table_name DEEP CLONE source_table_name\n[TBLPROPERTIES clause] [LOCATION path];\n\n``` \nYou can schedule a Databricks Workflows job to refresh target table data incrementally with recent updates in the shared table, using the following command: \n```\nCREATE OR REPLACE TABLE table_name DEEP CLONE source_table_name;\n\n``` \nSee [Clone a table on Databricks](https:\/\/docs.databricks.com\/delta\/clone.html) and [Introduction to Databricks Workflows](https:\/\/docs.databricks.com\/workflows\/index.html). \n### Enable change data feed (CDF) on shared tables for incremental replication \nWhen a table is shared with its CDF, the recipient can access the changes and merge them into a local copy of the table, where users perform queries. In this scenario, recipient access to the data does not cross region boundaries, and egress is limited to refreshing a local copy. If the recipient is on Databricks, they can use a Databricks workflow job to propagate changes to a local replica. \nTo share a table with CDF, you must enable CDF on the table and share it `WITH HISTORY`. \nFor more information about using CDF, see [Use Delta Lake change data feed on Databricks](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html) and [Add tables to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#add-tables).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/manage-egress.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Monitor and manage Delta Sharing egress costs (for providers)\n#### Use Cloudflare R2 replicas or migrate storage to R2\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nCloudflare R2 object storage incurs no egress fees. Replicating or migrating data that you share to R2 enables you to share data using Delta Sharing without incurring egress fees. 
This section describes how to replicate data to an R2 location and enable incremental updates from source tables. \n### Requirements \n* Databricks workspace enabled for Unity Catalog.\n* Databricks Runtime 14.3 or above, or SQL warehouse 2024.15 or above.\n* Cloudflare account. See <https:\/\/dash.cloudflare.com\/sign-up>.\n* Cloudflare R2 Admin role. See the [Cloudflare roles documentation](https:\/\/developers.cloudflare.com\/fundamentals\/setup\/manage-members\/roles\/#account-scoped-roles).\n* `CREATE STORAGE CREDENTIAL` privilege on the Unity Catalog metastore attached to the workspace. Account admins and metastore admins have this privilege by default.\n* `CREATE EXTERNAL LOCATION` privilege on both the metastore and the storage credential referenced in the external location. Metastore admins have this privilege by default.\n* `CREATE MANAGED STORAGE` privilege on the external location.\n* `CREATE CATALOG` on the metastore. Metastore admins have this privilege by default. \n### Mount an R2 bucket as an external location in Databricks \n1. Create a Cloudflare R2 bucket. \nSee [Configure an R2 bucket](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials-r2.html#bucket).\n2. Create a storage credential in Unity Catalog that gives access to the R2 bucket. \nSee [Create the storage credential](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials-r2.html#credential).\n3. Use the storage credential to create an external location in Unity Catalog. \nSee [Create an external location to connect cloud storage to Databricks](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html). \n### Create a new catalog using the external location \nCreate a catalog that uses the new external location as its managed storage location. \nSee [Create and manage catalogs](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html). 
\nWhen you create the catalog, do the following: \n* Select a **Standard** catalog type.\n* Under **Storage location**, select **Select a storage location** and enter the path to the R2 bucket you defined as an external location. For example, `r2:\/\/mybucket@my-account-id.r2.cloudflarestorage.com` \nIn SQL, use the path to the R2 bucket you defined as an external location. For example: \n```\nCREATE CATALOG IF NOT EXISTS `my-r2-catalog`\nMANAGED LOCATION 'r2:\/\/mybucket@my-account-id.r2.cloudflarestorage.com'\nCOMMENT 'Location for managed tables and volumes to share using Delta Sharing';\n\n``` \n### Clone the data you want to share to a table in the new catalog \nUse `DEEP CLONE` to replicate tables in S3 to the new catalog that uses R2 for managed storage. Deep clones copy the source table data and metadata to the clone target. Deep clones also enable incremental updates by identifying new data in the source table and refreshing the target accordingly. \n```\nCREATE TABLE IF NOT EXISTS new_catalog.schema1.new_table DEEP CLONE old_catalog.schema1.source_table\nLOCATION 'r2:\/\/mybucket@my-account-id.r2.cloudflarestorage.com';\n\n``` \nYou can schedule a Databricks Workflows job to refresh target table data incrementally with recent updates in the source table, using the following command: \n```\nCREATE OR REPLACE TABLE new_catalog.schema1.new_table DEEP CLONE old_catalog.schema1.source_table;\n\n``` \nSee [Clone a table on Databricks](https:\/\/docs.databricks.com\/delta\/clone.html) and [Introduction to Databricks Workflows](https:\/\/docs.databricks.com\/workflows\/index.html). \n### Share the new table \nWhen you create the share, add the tables that are in the new catalog, stored in R2. The process is the same as adding any table to a share. 
\nSee [Create and manage shares for Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/create-share.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/manage-egress.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Serve multiple models to a Model Serving endpoint\n\nThis article describes how to serve multiple models to a serving endpoint that utilizes Databricks [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html).\n\n#### Serve multiple models to a Model Serving endpoint\n##### Requirements\n\nSee [Requirements](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html#requirement) for Model Serving endpoint creation. \nTo understand access control options for model serving endpoints and best practice guidance for endpoint management, see [Serving endpoint ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#serving-endpoints).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/serve-multiple-models-to-serving-endpoint.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Serve multiple models to a Model Serving endpoint\n##### Create an endpoint and set the initial traffic split\n\nYou can create Model Serving endpoints with the Databricks Machine Learning API. An endpoint can serve any registered Python MLflow model registered in the Model Registry. \nThe following API example creates a single endpoint with two models and sets the endpoint traffic split between those models. The served model, `current`, hosts version 1 of `model-A` and gets 90% of the endpoint traffic, while the other served model, `challenger`, hosts version 1 of `model-B` and gets 10% of the endpoint traffic. 
\n```\nPOST \/api\/2.0\/serving-endpoints\n\n{\n\"name\":\"multi-model\",\n\"config\":{\n\"served_entities\":[\n{\n\"name\":\"current\",\n\"entity_name\":\"model-A\",\n\"entity_version\":\"1\",\n\"workload_size\":\"Small\",\n\"scale_to_zero_enabled\":true\n},\n{\n\"name\":\"challenger\",\n\"entity_name\":\"model-B\",\n\"entity_version\":\"1\",\n\"workload_size\":\"Small\",\n\"scale_to_zero_enabled\":true\n}\n],\n\"traffic_config\":{\n\"routes\":[\n{\n\"served_model_name\":\"current\",\n\"traffic_percentage\":\"90\"\n},\n{\n\"served_model_name\":\"challenger\",\n\"traffic_percentage\":\"10\"\n}\n]\n}\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/serve-multiple-models-to-serving-endpoint.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Serve multiple models to a Model Serving endpoint\n##### Update the traffic split between served models\n\nYou can also update the traffic split between served models. The following API example sets the served model, `current`, to get 50% of the endpoint traffic and the other model, `challenger`, to get the remaining 50% of the traffic. \nYou can also make this update from the **Serving** tab in the Databricks Machine Learning UI using the **Edit configuration** button. 
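Before sending an update like the one in this section, it can help to build the route list programmatically and check the split. The Python sketch below is one way to do that. The helper name `build_traffic_config` is hypothetical, and the checks that the split is non-empty and that the percentages total 100 are assumptions made for illustration. The sketch also keeps percentages as plain integers rather than the quoted values used in the request bodies in this article.

```python
import json

# Hypothetical helper: builds the traffic_config portion of a serving
# endpoint config and checks the split before any request is sent.

def build_traffic_config(split):
    # split maps a served model name to its share of traffic,
    # for example {'current': 50, 'challenger': 50}.
    if not split:
        raise ValueError('at least one served model is required')
    total = sum(split.values())
    if total != 100:
        # Assumption: route percentages are expected to add up to 100.
        raise ValueError(f'traffic percentages sum to {total}, expected 100')
    routes = [
        {'served_model_name': name, 'traffic_percentage': pct}
        for name, pct in split.items()
    ]
    return {'traffic_config': {'routes': routes}}


payload = build_traffic_config({'current': 50, 'challenger': 50})
print(json.dumps(payload, indent=2))
```

The resulting dictionary covers only the routes. The REST request body that follows shows the complete configuration, including `served_entities`.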
\n```\nPUT \/api\/2.0\/serving-endpoints\/{name}\/config\n\n{\n\"served_entities\":[\n{\n\"name\":\"current\",\n\"entity_name\":\"model-A\",\n\"entity_version\":\"1\",\n\"workload_size\":\"Small\",\n\"scale_to_zero_enabled\":true\n},\n{\n\"name\":\"challenger\",\n\"entity_name\":\"model-B\",\n\"entity_version\":\"1\",\n\"workload_size\":\"Small\",\n\"scale_to_zero_enabled\":true\n}\n],\n\"traffic_config\":{\n\"routes\":[\n{\n\"served_model_name\":\"current\",\n\"traffic_percentage\":\"50\"\n},\n{\n\"served_model_name\":\"challenger\",\n\"traffic_percentage\":\"50\"\n}\n]\n}\n}\n\n```\n\n#### Serve multiple models to a Model Serving endpoint\n##### Query individual models behind an endpoint\n\nIn some scenarios, you may want to query individual models behind the endpoint. \nYou can do so by using: \n```\nPOST \/serving-endpoints\/{endpoint-name}\/served-models\/{served-model-name}\/invocations\n\n``` \nHere the specific served model is queried. The request format is the same as querying the endpoint. While querying the individual served model, the traffic settings are ignored. \nIn the context of the `multi-model` endpoint example, if all requests are sent to `\/serving-endpoints\/multi-model\/served-models\/challenger\/invocations`, then all requests are served by the `challenger` served model.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/serve-multiple-models-to-serving-endpoint.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Serve multiple models to a Model Serving endpoint\n##### Notebook: Package multiple models into one model\n\nTo save on compute costs, you can also package multiple models into one model. 
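To make the idea of packaging concrete, here is a minimal Python sketch of a wrapper that routes each request to one of several models behind a single predict entry point. It is not necessarily how the linked notebook does it. The class name `MultiModelRouter`, the request keys `model` and `inputs`, and the two stand-in models are all hypothetical. With MLflow, a wrapper like this would typically subclass `mlflow.pyfunc.PythonModel`; a plain class is used here so the sketch runs without MLflow installed.

```python
# Minimal sketch of wrapping several models behind one predict entry point.
# Assumption: each model is a callable that takes a list of inputs.

class MultiModelRouter:
    def __init__(self, models):
        # models maps a model name to a callable.
        self.models = dict(models)

    def predict(self, request):
        # request is assumed to carry a 'model' key and an 'inputs' list.
        name = request['model']
        if name not in self.models:
            raise KeyError(f'unknown model: {name}')
        return self.models[name](request['inputs'])


# Two stand-in models; real ones would be loaded MLflow models.
def double(inputs):
    return [2 * x for x in inputs]


def negate(inputs):
    return [-x for x in inputs]


router = MultiModelRouter({'model-A': double, 'model-B': negate})
result_a = router.predict({'model': 'model-A', 'inputs': [1, 2, 3]})
result_b = router.predict({'model': 'model-B', 'inputs': [1, 2, 3]})
print(result_a)
print(result_b)
```

Because both models share one wrapper, they can be served as a single served entity, which is where the compute saving comes from.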
\n### Package multiple MLflow models into one MLflow model notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/package-multiple-models-model-serving.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/serve-multiple-models-to-serving-endpoint.html"} +{"content":"# What is data warehousing on Databricks?\n### Get started with data warehousing using Databricks SQL\n\nIf you\u2019re a data analyst who works primarily with SQL queries and your favorite BI tools, Databricks SQL provides an intuitive environment for running ad-hoc queries and creating dashboards on data stored in your data lake. These articles can help you get started.\n\n### Get started with data warehousing using Databricks SQL\n#### Basic Databricks SQL concepts\n\nTo start, familiarize yourself with some basic Databricks SQL concepts. See [Databricks SQL concepts](https:\/\/docs.databricks.com\/sql\/get-started\/concepts.html).\n\n### Get started with data warehousing using Databricks SQL\n#### Interact with sample dashboards\n\nThen, learn how to import and use dashboards in the Dashboard Samples Gallery that visualize queries. See [Tutorial: Use sample dashboards](https:\/\/docs.databricks.com\/sql\/get-started\/sample-dashboards.html).\n\n### Get started with data warehousing using Databricks SQL\n#### Visualize queries and create a dashboard\n\nNext, use dashboards to explore data and create a dashboard that you can share. See [Dashboards](https:\/\/docs.databricks.com\/dashboards\/index.html).\n\n### Get started with data warehousing using Databricks SQL\n#### Use Databricks SQL in a Databricks job\n\nNext, use the SQL task type in a Databricks job, allowing you to create, schedule, operate, and monitor workflows that include Databricks SQL objects such as queries, legacy dashboards, and alerts. 
See [Tutorial: Use Databricks SQL in a Databricks job](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-dbsql-in-workflows.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/get-started\/index.html"} +{"content":"# What is data warehousing on Databricks?\n### Get started with data warehousing using Databricks SQL\n#### Use Databricks SQL with a notebook\n\nYou can also attach a notebook to a SQL warehouse. See [Notebooks and SQL warehouses](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#notebook-sql-warehouse) for more information and limitations.\n\n### Get started with data warehousing using Databricks SQL\n#### Use COPY INTO to load data\n\nNext, learn how to use COPY INTO in Databricks SQL. See [Tutorial: Use COPY INTO with Databricks SQL](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html).\n\n### Get started with data warehousing using Databricks SQL\n#### Create a SQL warehouse\n\nTo create a SQL warehouse, see [Configure SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n\n### Get started with data warehousing using Databricks SQL\n#### Work with technology partners\n\nYou can also connect your Databricks workspace to a BI and visualization partner solution using Partner Connect. See [Connect to BI partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/bi.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/get-started\/index.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create tables in Unity Catalog\n\nThis article introduces the concept of *managed* and *external* tables in Unity Catalog and describes how to create tables in Unity Catalog. \nNote \nWhen you create a table, be sure to reference a catalog that is governed by Unity Catalog or set the *default catalog* to a catalog that is governed by Unity Catalog. 
See [Manage the default catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#default). \nThe catalog `hive_metastore` appears in Catalog Explorer but is not considered governed by Unity Catalog. It is managed by your Databricks workspace\u2019s Hive metastore. All other catalogs listed are governed by Unity Catalog. \nYou can use the Unity Catalog table upgrade interface to upgrade existing tables registered in the Hive metastore to Unity Catalog. See [Upgrade Hive tables and views to Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html).\n\n#### Create tables in Unity Catalog\n##### Managed tables\n\nManaged tables are the default way to create tables in Unity Catalog. Unity Catalog manages the lifecycle and file layout for these tables. You should not use tools outside of Databricks to manipulate files in these tables directly. \nManaged tables are stored in *managed storage*, either at the metastore, catalog, or schema level, depending on how the schema and catalog are configured. See [Specify a managed storage location in Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/managed-storage.html). \nManaged tables always use the [Delta](https:\/\/docs.databricks.com\/delta\/index.html) table format. \nWhen a managed table is dropped, its underlying data is deleted from your cloud tenant within 30 days.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create tables in Unity Catalog\n##### External tables\n\nExternal tables are tables whose data is stored outside of the managed storage location specified for the metastore, catalog, or schema. Use external tables only when you require direct access to the data outside of Databricks clusters or Databricks SQL warehouses. 
\nWhen you run `DROP TABLE` on an external table, Unity Catalog does not delete the underlying data. To drop a table you must be its owner. You can manage privileges on external tables and use them in queries in the same way as managed tables. To create an external table with SQL, specify a `LOCATION` path in your `CREATE TABLE` statement. External tables can use the following file formats: \n* DELTA\n* CSV\n* JSON\n* AVRO\n* PARQUET\n* ORC\n* TEXT \nTo manage access to the underlying cloud storage for an external table, you must set up [storage credentials and external locations](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html). \nDatabricks recommends that you grant write privileges on a table that is backed by an external location in S3 only if the external location is defined in a single metastore. You can safely use multiple metastores to read data in a single external S3 location, but concurrent writes to the same S3 location from multiple metastores might lead to consistency issues. \nTo learn more, see [Create an external table](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#create-an-external-table).\n\n#### Create tables in Unity Catalog\n##### Requirements\n\nYou must have the `CREATE TABLE` privilege on the schema in which you want to create the table, as well as the `USE SCHEMA` privilege on the schema and the `USE CATALOG` privilege on the parent catalog. \nIf you are creating an external table, see [Create an external table](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#create-an-external-table) for additional requirements.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create tables in Unity Catalog\n##### Create a managed table\n\nTo create a managed table, run the following SQL command. Items in brackets are optional. 
Replace the placeholder values: \n* `<catalog-name>`: The name of the catalog that will contain the table. \nThis cannot be the `hive_metastore` catalog that is created automatically for the Hive metastore associated with your Databricks workspace. You can drop the catalog name if you are creating the table in the workspace\u2019s [default catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#default).\n* `<schema-name>`: The name of the schema that will contain the table.\n* `<table-name>`: A name for the table.\n* `<column-specification>`: The name and data type for each column. \n```\nCREATE TABLE <catalog-name>.<schema-name>.<table-name>\n(\n<column-specification>\n);\n\n``` \n```\nspark.sql(\"CREATE TABLE <catalog-name>.<schema-name>.<table-name> \"\n\"(\"\n\" <column-specification>\"\n\")\")\n\n``` \n```\nlibrary(SparkR)\n\nsql(paste(\"CREATE TABLE <catalog-name>.<schema-name>.<table-name> \",\n\"(\",\n\" <column-specification>\",\n\")\",\nsep = \"\"))\n\n``` \n```\nspark.sql(\"CREATE TABLE <catalog-name>.<schema-name>.<table-name> \" +\n\"(\" +\n\" <column-specification>\" +\n\")\")\n\n``` \nYou can also create a managed table by using the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_table](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/table). You can retrieve a list of table full names by using [databricks\\_tables](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/data-sources\/tables). 
\nFor example, to create the table `main.default.department` and insert five rows into it: \n```\nCREATE TABLE main.default.department\n(\ndeptcode INT,\ndeptname STRING,\nlocation STRING\n);\n\nINSERT INTO main.default.department VALUES\n(10, 'FINANCE', 'EDINBURGH'),\n(20, 'SOFTWARE', 'PADDINGTON'),\n(30, 'SALES', 'MAIDSTONE'),\n(40, 'MARKETING', 'DARLINGTON'),\n(50, 'ADMIN', 'BIRMINGHAM');\n\n``` \n```\n# Each spark.sql call runs a single statement, so CREATE and INSERT are separate calls.\nspark.sql(\"CREATE TABLE main.default.department \"\n\"(\"\n\" deptcode INT,\"\n\" deptname STRING,\"\n\" location STRING\"\n\")\")\n\nspark.sql(\"INSERT INTO main.default.department VALUES \"\n\" (10, 'FINANCE', 'EDINBURGH'),\"\n\" (20, 'SOFTWARE', 'PADDINGTON'),\"\n\" (30, 'SALES', 'MAIDSTONE'),\"\n\" (40, 'MARKETING', 'DARLINGTON'),\"\n\" (50, 'ADMIN', 'BIRMINGHAM')\")\n\n``` \n```\nlibrary(SparkR)\n\n# Each sql call runs a single statement, so CREATE and INSERT are separate calls.\nsql(paste(\"CREATE TABLE main.default.department \",\n\"(\",\n\" deptcode INT,\",\n\" deptname STRING,\",\n\" location STRING\",\n\")\",\nsep = \"\"))\n\nsql(paste(\"INSERT INTO main.default.department VALUES \",\n\" (10, 'FINANCE', 'EDINBURGH'),\",\n\" (20, 'SOFTWARE', 'PADDINGTON'),\",\n\" (30, 'SALES', 'MAIDSTONE'),\",\n\" (40, 'MARKETING', 'DARLINGTON'),\",\n\" (50, 'ADMIN', 'BIRMINGHAM')\",\nsep = \"\"))\n\n``` \n```\n\/\/ Each spark.sql call runs a single statement, so CREATE and INSERT are separate calls.\nspark.sql(\"CREATE TABLE main.default.department \" +\n\"(\" +\n\" deptcode INT,\" +\n\" deptname STRING,\" +\n\" location STRING\" +\n\")\")\n\nspark.sql(\"INSERT INTO main.default.department VALUES \" +\n\" (10, 'FINANCE', 'EDINBURGH'),\" +\n\" (20, 'SOFTWARE', 'PADDINGTON'),\" +\n\" (30, 'SALES', 'MAIDSTONE'),\" +\n\" (40, 'MARKETING', 'DARLINGTON'),\" +\n\" (50, 'ADMIN', 'BIRMINGHAM')\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create tables in Unity Catalog\n##### Drop a managed table\n\nYou must be the table\u2019s owner to drop a table. 
To drop a managed table, run the following SQL command: \n```\nDROP TABLE IF EXISTS catalog_name.schema_name.table_name;\n\n``` \nWhen a managed table is dropped, its underlying data is deleted from your cloud tenant within 30 days.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create tables in Unity Catalog\n##### Create an external table\n\nThe data in an external table is stored in a path on your cloud tenant. To work with external tables, Unity Catalog introduces two objects to access and work with external cloud storage: \n* A *storage credential* contains an authentication method for accessing a cloud storage location. The storage credential does not contain a mapping to the path to which it grants access. Storage credentials are access-controlled to determine which users can use the credential.\n* An *external location* maps a storage credential with a cloud storage path to which it grants access. The external location grants access only to that cloud storage path and its contents. External locations are access-controlled to determine which users can use them. An external location is used automatically when your SQL command contains a `LOCATION` clause. \nDatabricks recommends that you grant write privileges on a table that is backed by an external location in S3 only if the external location is defined in a single metastore. You can safely use multiple metastores to read data in a single external S3 location, but concurrent writes to the same S3 location from multiple metastores might lead to consistency issues. 
\n### Requirements \nTo create an external table, you must have: \n* The `CREATE EXTERNAL TABLE` privilege on an external location that grants access to the `LOCATION` accessed by the external table.\n* The `USE SCHEMA` permission on the table\u2019s parent schema.\n* The `USE CATALOG` permission on the table\u2019s parent catalog.\n* The `CREATE TABLE` permission on the table\u2019s parent schema. \nExternal locations and storage credentials are stored at the metastore level, rather than in a catalog. To create a storage credential, you must be an account admin or have the `CREATE STORAGE CREDENTIAL` privilege. To create an external location, you must be the metastore admin or have the `CREATE EXTERNAL LOCATION` privilege. See [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html). \n### Create a table \nUse one of the following command examples in a notebook or the SQL query editor to create an external table. \nYou can also use an [example notebook](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#example-notebook-external-table) to create the storage credential, external location, and external table, and also manage permissions for them. \nIn the following examples, replace the placeholder values: \n* `<catalog>`: The name of the catalog that will contain the table. \nThis cannot be the `hive_metastore` catalog that is created automatically for the Hive metastore associated with your Databricks workspace. 
You can drop the catalog name if you are creating the table in the workspace\u2019s [default catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#default).\n* `<schema>`: The name of the schema that will contain the table.\n* `<table-name>`: A name for the table.\n* `<column-specification>`: The name and data type for each column.\n* `<bucket-path>`: The path to the cloud storage bucket where the table will be created.\n* `<table-directory>`: A directory where the table will be created. Use a unique directory for each table. \nImportant \nOnce a table is created in a path, users can no longer directly access the files in that path from Databricks even if they have been given privileges on an external location or storage credential to do so. This is to ensure that users cannot circumvent access controls applied to tables by reading files from your cloud tenant directly. \n```\nCREATE TABLE <catalog>.<schema>.<table-name>\n(\n<column-specification>\n)\nLOCATION 's3:\/\/<bucket-path>\/<table-directory>';\n\n``` \n```\nspark.sql(\"CREATE TABLE <catalog>.<schema>.<table-name> \"\n\"(\"\n\" <column-specification>\"\n\") \"\n\"LOCATION 's3:\/\/<bucket-path>\/<table-directory>'\")\n\n``` \n```\nlibrary(SparkR)\n\nsql(paste(\"CREATE TABLE <catalog>.<schema>.<table-name> \",\n\"(\",\n\" <column-specification>\",\n\") \",\n\"LOCATION 's3:\/\/<bucket-path>\/<table-directory>'\",\nsep = \"\"))\n\n``` \n```\nspark.sql(\"CREATE TABLE <catalog>.<schema>.<table-name> \" +\n\"(\" +\n\" <column-specification>\" +\n\") \" +\n\"LOCATION 's3:\/\/<bucket-path>\/<table-directory>'\")\n\n``` \nUnity Catalog checks that you have the following permissions: \n* `CREATE EXTERNAL TABLE` on the external location that references the cloud storage path you specify.\n* `CREATE TABLE` on the parent schema.\n* `USE SCHEMA` on the parent schema.\n* `USE CATALOG` on the parent catalog. \nIf you do, the external table is created. 
Otherwise, an error occurs and the external table is not created. \nNote \nYou can instead migrate an existing external table in the Hive metastore to Unity Catalog without duplicating its data. See [Upgrade a single Hive table to a Unity Catalog external table using the upgrade wizard](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#migrate-external). \nYou can also create an external table by using the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_table](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/table). You can retrieve a list of table full names by using [databricks\\_tables](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/data-sources\/tables).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create tables in Unity Catalog\n##### Example notebook: Create external tables\n\nYou can use the following example notebook to create a catalog, schema, and external table, and to manage permissions on them. \n### Create and manage an external table in Unity Catalog notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/unity-catalog-external-table-example-aws.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create tables in Unity Catalog\n##### Create a table from files stored in your cloud tenant\n\nYou can populate a managed or external table with records from files stored in your cloud tenant. Unity Catalog reads the files at that location and inserts their contents into the table. 
In Unity Catalog, this is called *path-based-access*. \nYou can follow the examples in this section or [use the add data UI](https:\/\/docs.databricks.com\/ingestion\/add-data\/add-data-external-locations.html). \n### Explore the contents of the files \nTo explore data stored in an external location before you create tables from that data, you can use Catalog Explorer or the following commands. \n**Permissions required**: You must have the `READ FILES` permission on the external location associated with the cloud storage path to return a list of data files in that location. \n1. List the files in a cloud storage path: \n```\nLIST 's3:\/\/<path-to-files>';\n\n```\n2. Query the data in the files in a given path: \n```\nSELECT * FROM <format>.`s3:\/\/<path-to-files>`;\n\n``` \n1. List the files in a cloud storage path: \n```\ndisplay(spark.sql(\"LIST 's3:\/\/<path-to-files>'\"))\n\n```\n2. Query the data in the files in a given path: \n```\ndisplay(spark.read.load(\"s3:\/\/<path-to-files>\"))\n\n``` \n1. List the files in a cloud storage path: \n```\nlibrary(SparkR)\n\ndisplay(sql(\"LIST 's3:\/\/<path-to-files>'\"))\n\n```\n2. Query the data in the files in a given path: \n```\nlibrary(SparkR)\n\ndisplay(loadDF(\"s3:\/\/<path-to-files>\"))\n\n``` \n1. List the files in a cloud storage path: \n```\ndisplay(spark.sql(\"LIST 's3:\/\/<path-to-files>'\"))\n\n```\n2. Query the data in the files in a given path: \n```\ndisplay(spark.read.load(\"s3:\/\/<path-to-files>\"))\n\n``` \n### Create a table from the files \nFollow the examples in this section to create a new table and populate it with data files on your cloud tenant. \nNote \nYou can instead migrate an existing external table in the Hive metastore to Unity Catalog without duplicating its data. See [Upgrade a single Hive table to a Unity Catalog external table using the upgrade wizard](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html#migrate-external). 
\nImportant \n* When you create a table using this method, the storage path is read only once, to prevent duplication of records. If you want to re-read the contents of the directory, you must drop and re-create the table. For an existing table, you can [insert records](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#insert-records-from-a-path-into-an-existing-table) from a storage path.\n* The bucket path where you create a table cannot also be used to read or write data files.\n* Only the files in the exact directory are read; the read is not recursive.\n* You must have the following permissions: \n+ `USE CATALOG` on the parent catalog and `USE SCHEMA` on the schema.\n+ `CREATE TABLE` on the parent schema.\n+ `READ FILES` on the external location associated with the bucket path where the files are located, or directly on the storage credential if you are not using an external location.\n+ If you are creating an external table, you need `CREATE EXTERNAL TABLE` on the bucket path where the table will be created. \nTo create a new managed table and populate it with data in your cloud storage, use the following examples. 
\n```\nCREATE TABLE <catalog>.<schema>.<table-name>\n(\n<column-specification>\n)\nSELECT * from <format>.`s3:\/\/<path-to-files>`;\n\n``` \n```\nspark.sql(\"CREATE TABLE <catalog>.<schema>.<table-name> \"\n\"( \"\n\" <column-specification> \"\n\") \"\n\"SELECT * from <format>.`s3:\/\/<path-to-files>`\")\n\n``` \n```\nlibrary(SparkR)\n\nsql(paste(\"CREATE TABLE <catalog>.<schema>.<table-name> \",\n\"( \",\n\" <column-specification> \",\n\") \",\n\"SELECT * from <format>.`s3:\/\/<path-to-files>`\",\nsep = \"\"))\n\n``` \n```\nspark.sql(\"CREATE TABLE <catalog>.<schema>.<table-name> \" +\n\"( \" +\n\" <column-specification> \" +\n\") \" +\n\"SELECT * from <format>.`s3:\/\/<path-to-files>`\")\n\n``` \nTo create an external table and populate it with data in your cloud storage, add a `LOCATION` clause: \n```\nCREATE TABLE <catalog>.<schema>.<table-name>\n(\n<column-specification>\n)\nUSING <format>\nLOCATION 's3:\/\/<table-location>'\nSELECT * from <format>.`s3:\/\/<path-to-files>`;\n\n``` \n```\nspark.sql(\"CREATE TABLE <catalog>.<schema>.<table-name> \"\n\"( \"\n\" <column-specification> \"\n\") \"\n\"USING <format> \"\n\"LOCATION 's3:\/\/<table-location>' \"\n\"SELECT * from <format>.`s3:\/\/<path-to-files>`\")\n\n``` \n```\nlibrary(SparkR)\n\nsql(paste(\"CREATE TABLE <catalog>.<schema>.<table-name> \",\n\"( \",\n\" <column-specification> \",\n\") \",\n\"USING <format> \",\n\"LOCATION 's3:\/\/<table-location>' \",\n\"SELECT * from <format>.`s3:\/\/<path-to-files>`\",\nsep = \"\"))\n\n``` \n```\nspark.sql(\"CREATE TABLE <catalog>.<schema>.<table-name> \" +\n\"( \" +\n\" <column-specification> \" +\n\") \" +\n\"USING <format> \" +\n\"LOCATION 's3:\/\/<table-location>' \" +\n\"SELECT * from <format>.`s3:\/\/<path-to-files>`\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create tables in Unity Catalog\n##### Insert records 
from a path into an existing table\n\nTo insert records from a bucket path into an existing table, use the `COPY INTO` command. In the following examples, replace the placeholder values: \n* `<catalog>`: The name of the table\u2019s parent catalog.\n* `<schema>`: The name of the table\u2019s parent schema.\n* `<path-to-files>`: The bucket path that contains the data files.\n* `<format>`: The format of the files, for example `delta`.\n* `<table-location>`: The bucket path where the table will be created. \nImportant \n* When you insert records into a table using this method, the bucket path you provide is read only once, to prevent duplication of records.\n* The bucket path where you create a table cannot also be used to read or write data files.\n* Only the files in the exact directory are read; the read is not recursive.\n* You must have the following permissions: \n+ `USE CATALOG` on the parent catalog and `USE SCHEMA` on the schema.\n+ `MODIFY` on the table.\n+ `READ FILES` on the external location associated with the bucket path where the files are located, or directly on the storage credential if you are not using an external location.\n+ To insert records into an external table, you need `CREATE EXTERNAL TABLE` on the bucket path where the table is located. 
\nTo insert records from files in a bucket path into a managed table, using an external location to read from the bucket path: \n```\nCOPY INTO <catalog>.<schema>.<table>\nFROM (\nSELECT *\nFROM 's3:\/\/<path-to-files>'\n)\nFILEFORMAT = <format>;\n\n``` \n```\nspark.sql(\"COPY INTO <catalog>.<schema>.<table> \"\n\"FROM ( \"\n\" SELECT * \"\n\" FROM 's3:\/\/<path-to-files>' \"\n\") \"\n\"FILEFORMAT = <format>\")\n\n``` \n```\nlibrary(SparkR)\n\nsql(paste(\"COPY INTO <catalog>.<schema>.<table> \",\n\"FROM ( \",\n\" SELECT * \",\n\" FROM 's3:\/\/<path-to-files>' \",\n\") \",\n\"FILEFORMAT = <format>\",\nsep = \"\"))\n\n``` \n```\nspark.sql(\"COPY INTO <catalog>.<schema>.<table> \" +\n\"FROM ( \" +\n\" SELECT * \" +\n\" FROM 's3:\/\/<path-to-files>' \" +\n\") \" +\n\"FILEFORMAT = <format>\")\n\n``` \nTo insert into an external table, add a `LOCATION` clause: \n```\nCOPY INTO <catalog>.<schema>.<table>\nLOCATION 's3:\/\/<table-location>'\nFROM (\nSELECT *\nFROM 's3:\/\/<path-to-files>'\n)\nFILEFORMAT = <format>;\n\n``` \n```\nspark.sql(\"COPY INTO <catalog>.<schema>.<table> \"\n\"LOCATION 's3:\/\/<table-location>' \"\n\"FROM ( \"\n\" SELECT * \"\n\" FROM 's3:\/\/<path-to-files>' \"\n\") \"\n\"FILEFORMAT = <format>\")\n\n``` \n```\nlibrary(SparkR)\n\nsql(paste(\"COPY INTO <catalog>.<schema>.<table> \",\n\"LOCATION 's3:\/\/<table-location>' \",\n\"FROM ( \",\n\" SELECT * \",\n\" FROM 's3:\/\/<path-to-files>' \",\n\") \",\n\"FILEFORMAT = <format>\",\nsep = \"\"))\n\n``` \n```\nspark.sql(\"COPY INTO <catalog>.<schema>.<table> \" +\n\"LOCATION 's3:\/\/<table-location>' \" +\n\"FROM ( \" +\n\" SELECT * \" +\n\" FROM 's3:\/\/<path-to-files>' \" +\n\") \" +\n\"FILEFORMAT = <format>\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create tables in Unity Catalog\n##### Add comments to a table\n\nAs a table owner or a user with the 
`MODIFY` privilege on a table, you can add comments to a table and its columns. You can add comments using the following functionality: \n* The [COMMENT ON](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-comment.html) command. This option does not support column comments.\n* The `COMMENT` option when you use the [CREATE TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-table-using.html) and [ALTER TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table.html) commands. This option supports column comments.\n* The \u201cmanual\u201d comment field in Catalog Explorer. This option supports column comments. See [Document data in Catalog Explorer using markdown comments](https:\/\/docs.databricks.com\/catalog-explorer\/markdown-data-comments.html).\n* AI-generated comments (also known as AI-generated documentation) in Catalog Explorer. You can view a comment suggested by a large language model (LLM) that takes into account the table metadata, such as the table schema and column names, and edit or accept the comment as-is to add it. See [Add AI-generated comments to a table](https:\/\/docs.databricks.com\/catalog-explorer\/ai-comments.html).\n\n#### Create tables in Unity Catalog\n##### Next steps\n\n* [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Connect to Amazon Kinesis\n\nThis article describes how you can use Structured Streaming to read and write data to Amazon Kinesis. \nDatabricks recommends that you enable S3 VPC endpoints to ensure that all S3 traffic is routed on the AWS network. 
\nNote \nIf you delete and recreate a Kinesis stream, you cannot reuse any existing checkpoint directories to restart a streaming query. You must delete the checkpoint directories and start those queries from scratch. You can reshard with Structured Streaming by increasing the number of shards without interrupting or restarting the stream. \nSee [Recommendations for working with Kinesis](https:\/\/docs.databricks.com\/connect\/streaming\/kinesis.html#recommendations).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/kinesis.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Connect to Amazon Kinesis\n##### Authenticate with Amazon Kinesis\n\nDatabricks recommends managing your connection to Kinesis using an instance profile. See [Instance profiles](https:\/\/docs.databricks.com\/compute\/configure.html#instance-profiles). \nWarning \nInstance profiles are not supported in shared access mode. Use either single user access mode or an alternate authentication method with shared access mode. See [Compute access mode limitations for Unity Catalog](https:\/\/docs.databricks.com\/compute\/access-mode-limitations.html). \nIf you want to use keys for access, you can provide them using the options `awsAccessKey` and `awsSecretKey`. \nYou can also [assume an IAM role](https:\/\/docs.databricks.com\/archive\/admin-guide\/iam-kinesis.html) using the `roleArn` option. You can optionally specify the external ID with `roleExternalId` and a session name with `roleSessionName`. In order to assume a role, you can either launch your cluster with permissions to assume the role or provide access keys through `awsAccessKey` and `awsSecretKey`. For cross-account authentication, Databricks recommends using `roleArn` to hold the assumed role, which can then be assumed through your Databricks AWS account. 
For more information about cross-account authentication, see [Delegate Access Across AWS Accounts Using IAM Roles](https:\/\/docs.aws.amazon.com\/IAM\/latest\/UserGuide\/tutorial_cross-account-with-roles.html). \nNote \nThe Kinesis source requires `ListShards`, `GetRecords`, and `GetShardIterator` permissions. If you encounter `Amazon: Access Denied` exceptions, check that your user or profile has these permissions. See [Controlling Access to Amazon Kinesis Data Streams Resources Using IAM](https:\/\/docs.aws.amazon.com\/streams\/latest\/dev\/controlling-access.html) for more details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/kinesis.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Connect to Amazon Kinesis\n##### Schema\n\nKinesis returns records with the following schema: \n| Column | Type |\n| --- | --- |\n| partitionKey | string |\n| data | binary |\n| stream | string |\n| shardId | string |\n| sequenceNumber | string |\n| approximateArrivalTimestamp | timestamp | \nTo deserialize the data in the `data` column, you can cast the field to a string.\n\n#### Connect to Amazon Kinesis\n##### Quickstart\n\nThe following notebook demonstrates how to run WordCount using Structured Streaming with Kinesis. \n### Kinesis WordCount with Structured Streaming notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/structured-streaming-kinesis.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/kinesis.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Connect to Amazon Kinesis\n##### Configure Kinesis options\n\nImportant \nIn Databricks Runtime 13.3 LTS and above, you can use `Trigger.AvailableNow` with Kinesis. 
See [Ingest Kinesis records as an incremental batch](https:\/\/docs.databricks.com\/connect\/streaming\/kinesis.html#available-now). \nThe following are common configurations for Kinesis data sources: \n| Option | Value | Default | Description |\n| --- | --- | --- | --- |\n| streamName | A comma-separated list of stream names. | None (required param) | The stream names to subscribe to. |\n| region | Region for the streams to be specified. | Locally resolved region | The region the streams are defined in. |\n| endpoint | Region for the Kinesis data stream. | Locally resolved region | The regional endpoint for Kinesis Data Streams. |\n| initialPosition | `latest`, `trim_horizon`, `earliest` (alias for trim\\_horizon), `at_timestamp`. | `latest` | Where to start reading from in the stream. Specify `at_timestamp` as a JSON string using Java default format for timestamps, such as `{\"at_timestamp\": \"06\/25\/2020 10:23:45 PDT\"}`. The streaming query reads all changes at or after the given timestamp (inclusive). You can explicitly specify formats by providing an additional field in the JSON string, such as `{\"at_timestamp\": \"06\/25\/2020 10:23:45 PDT\", \"format\": \"MM\/dd\/yyyy HH:mm:ss ZZZ\"}`. |\n| maxRecordsPerFetch | A positive integer. | 10,000 | How many records to be read per API request to Kinesis. Number of records returned may actually be higher depending on whether sub-records were aggregated into a single record using the Kinesis Producer Library. |\n| maxFetchRate | A positive decimal representing data rate in MB\/s. | 1.0 (max = 2.0) | How fast to prefetch data per shard. This is to rate limit on fetches and avoid Kinesis throttling. 2.0 MB\/s is the maximum rate that Kinesis allows. |\n| minFetchPeriod | A duration string, for example, `1s` for 1 second. | 400ms (min = 200ms) | How long to wait between consecutive prefetch attempts. This is to limit frequency of fetches and avoid Kinesis throttling. 
200ms is the minimum as Kinesis allows a maximum of 5 fetches\/sec. |\n| maxFetchDuration | A duration string, for example, `1m` for 1 minute. | 10s | How long to buffer prefetched new data before making it available for processing. |\n| fetchBufferSize | A byte string, for example, `2gb` or `10mb`. | 20gb | How much data to buffer for the next trigger. This is used as a stopping condition and not a strict upper bound, so more data may be buffered than what\u2019s specified for this value. |\n| shardsPerTask | A positive integer. | 5 | How many Kinesis shards to prefetch from in parallel per Spark task. Ideally `# cores in cluster` >= `# Kinesis shards` \/ `shardsPerTask` for minimum query latency and maximum resource usage. |\n| shardFetchInterval | A duration string, for example, `2m` for 2 minutes. | 1s | How often to poll Kinesis for resharding. |\n| awsAccessKey | String | No default. | AWS access key. |\n| awsSecretKey | String | No default. | AWS secret access key corresponding to the access key. |\n| roleArn | String | No default. | The Amazon Resource Name (ARN) of the role to assume when accessing Kinesis. |\n| roleExternalId | String | No default. | An optional value that can be used when delegating access to the AWS account. See [How to Use an External ID](https:\/\/docs.aws.amazon.com\/IAM\/latest\/UserGuide\/id_roles_create_for-user_externalid.html). |\n| roleSessionName | String | No default. | An identifier for the assumed role session that uniquely identifies a session when the same role is assumed by different principals or for different reasons. |\n| coalesceThresholdBlockSize | A positive integer. | 10,000,000 | The threshold at which the automatic coalesce occurs. If the average block size is less than this value, pre-fetched blocks are coalesced toward the `coalesceBinSize`. |\n| coalesceBinSize | A positive integer. | 128,000,000 | The approximate block size after coalescing. |\n| consumerMode | `polling` or `efo`. 
| `polling` | Consumer type to run the streaming query with. See [Configure Kinesis enhanced fan-out (EFO) for streaming query reads](https:\/\/docs.databricks.com\/connect\/streaming\/kinesis.html#efo). |\n| requireConsumerDeregistration | `true` or `false`. | `false` | Whether to de-register the enhanced fan-out consumer on query termination. Requires `efo` for `consumerMode`. | \nNote \nThe default values of the options have been chosen such that two readers (Spark or otherwise) can simultaneously consume a Kinesis stream without hitting Kinesis rate limits. If you have more consumers, you have to adjust the options accordingly. For example, you may have to reduce `maxFetchRate` and increase `minFetchPeriod`. \n### Low latency monitoring and alerting \nFor alerting use cases, you typically want lower latency. To achieve that: \n* Ensure that there is only one consumer (that is, only your streaming query and no one else) of the Kinesis stream, so that the query can fetch as fast as possible without running into Kinesis rate limits.\n* Set the option `maxFetchDuration` to a small value (say, 200ms) to start processing fetched data as fast as possible. If you are using `Trigger.AvailableNow`, this increases the chances of not being able to keep up with the newest records in the Kinesis stream.\n* Set the option `minFetchPeriod` to 210ms to fetch as frequently as possible.\n* Set the option `shardsPerTask` or configure the cluster such that `# cores in cluster >= 2 * (# Kinesis shards) \/ shardsPerTask`. This ensures that the background prefetching tasks and the streaming query tasks can execute concurrently. 
\nIf you see that your query is receiving data every 5 seconds, then it is likely that you are [hitting Kinesis rate limits](https:\/\/docs.aws.amazon.com\/streams\/latest\/dev\/troubleshooting-consumers.html).\nReview your configurations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/kinesis.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Connect to Amazon Kinesis\n##### What metrics does Kinesis report?\n\nKinesis reports the number of milliseconds a consumer has fallen behind the beginning of a stream for each workspace. You can get the average, minimum, and maximum of the number of milliseconds among all the workspaces in the streaming query process as the `avgMsBehindLatest`, `maxMsBehindLatest`, and `minMsBehindLatest` metrics. See [sources metrics object (Kinesis)](https:\/\/docs.databricks.com\/structured-streaming\/stream-monitoring.html#sources-metrics-kinesis). \nIf you are running the stream in a notebook, you can see metrics under the **Raw Data** tab in the streaming query progress dashboard, such as the following example: \n```\n{\n\"sources\" : [ {\n\"description\" : \"KinesisV2[stream]\",\n\"metrics\" : {\n\"avgMsBehindLatest\" : \"32000.0\",\n\"maxMsBehindLatest\" : \"32000\",\n\"minMsBehindLatest\" : \"32000\"\n}\n} ]\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/kinesis.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Connect to Amazon Kinesis\n##### Ingest Kinesis records as an incremental batch\n\nIn Databricks Runtime 13.3 LTS and above, Databricks supports using `Trigger.AvailableNow` with Kinesis data sources for incremental batch semantics. The following describes the basic configuration: \n1. When a micro-batch read triggers in available now mode, the current time is recorded by the Databricks client.\n2. 
Databricks polls the source system for all records with timestamps between this recorded time and the previous checkpoint.\n3. Databricks loads these records using `Trigger.AvailableNow` semantics. \nDatabricks uses a best-effort mechanism to try and consume all records that exist in Kinesis stream(s) when the streaming query is executed. Because of small potential differences in timestamps and a lack of guarantee in ordering in data sources, some records might not be included in a triggered batch. Omitted records are processed as part of the next triggered micro-batch. \nNote \nIf the query continues failing to fetch records from the Kinesis stream even if there are records, try increasing the `maxFetchDuration` value. \nSee [Configuring incremental batch processing](https:\/\/docs.databricks.com\/structured-streaming\/triggers.html#available-now).\n\n#### Connect to Amazon Kinesis\n##### Write to Kinesis\n\nThe following code snippet can be used as a `ForeachSink` to write data to Kinesis. It requires a `Dataset[(String, Array[Byte])]`. \nNote \nThe following code snippet provides *at least once* semantics, not exactly once. \n### Kinesis Foreach Sink notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/structured-streaming-kinesis-sink.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/kinesis.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Connect to Amazon Kinesis\n##### Recommendations for working with Kinesis\n\nKinesis queries might experience latency for a number of reasons. This section provides recommendations for troubleshooting latency. \nThe Kinesis source runs Spark jobs in a background thread to prefetch Kinesis data periodically and cache it in the memory of the Spark executors. 
The streaming query processes the cached data after each prefetch step completes and makes the data available for processing. The prefetch step significantly affects the observed end-to-end latency and throughput. \n### Reduce prefetch latency \nTo optimize for minimal query latency and maximum resource usage, use the following calculation: \n`total number of CPU cores in the cluster (across all executors)` >= `total number of Kinesis shards` \/ `shardsPerTask`. \nImportant \n`minFetchPeriod` can create multiple GetRecords API calls to the Kinesis shard until it hits `ReadProvisionedThroughputExceeded`. If an exception occurs, it\u2019s not indicative of an issue as the connector maximizes the utilization of the Kinesis shard. \n### Avoid slowdowns caused by too many rate limit errors \nThe connector reduces the amount of data read from Kinesis by half each time it encounters a rate limiting error and records this event in the log with a message: `\"Hit rate limit. Sleeping for 5 seconds.\"` \nIt is common to see these errors as a stream is being caught up, but after it is, you should no longer see these errors. If you do, you might need to tune either from the Kinesis side (by increasing capacity) or adjust the prefetching options. \n### Avoid spilling data to disk \nIf you have a sudden spike in your Kinesis streams, the assigned buffer capacity might fill up and the buffer not be emptied fast enough for new data to be added. \nIn such cases, Spark spills blocks from the buffer to disk and slows down processing, which affects stream performance. 
This event appears in the log with a message like this: \n```\n.\/log4j.txt:879546:20\/03\/02 17:15:04 INFO BlockManagerInfo: Updated kinesis_49290928_1_ef24cc00-abda-4acd-bb73-cb135aed175c on disk on 10.0.208.13:43458 (current size: 88.4 MB, original size: 0.0 B)\n\n``` \nTo address this problem, try increasing the cluster memory capacity (either add more nodes or increase the memory per node), or adjust the configuration parameter `fetchBufferSize`. \n### Hanging S3 write tasks \nYou can enable Spark speculation to terminate hanging tasks that would prevent stream processing from proceeding. To ensure that tasks are not terminated too aggressively, tune the quantile and multiplier for this setting carefully. A good starting point is to set `spark.speculation.multiplier` to `3` and `spark.speculation.quantile` to `0.95`. \n### Reduce latency associated with checkpointing in stateful streams \nDatabricks recommends using RocksDB with changelog checkpointing for stateful streaming queries. See [Enable changelog checkpointing](https:\/\/docs.databricks.com\/structured-streaming\/rocksdb-state-store.html#changelog-checkpoint).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/kinesis.html"} +{"content":"# Connect to data sources\n## Configure streaming data sources\n#### Connect to Amazon Kinesis\n##### Configure Kinesis enhanced fan-out (EFO) for streaming query reads\n\nIn Databricks Runtime 11.3 and above, the Databricks Runtime Kinesis connector provides support for using the Amazon Kinesis enhanced fan-out (EFO) feature. \n[Kinesis enhanced fan-out](https:\/\/docs.aws.amazon.com\/streams\/latest\/dev\/enhanced-consumers.html) is a feature that provides support for enhanced fan-out stream consumers with a dedicated throughput of 2MB\/s per shard, per consumer (maximum of 20 consumers per Kinesis stream), and records delivery in push mode instead of pull mode. 
\nIf a Structured Streaming query is running in EFO mode, then it acts as a consumer with dedicated throughput and registers itself with Kinesis Data Streams. In order to register with Kinesis Data Streams, the query needs to provide a unique consumer name so that it can use the generated consumer ARN (Amazon Resource Name) for future operations. You can either provide an explicit consumer name or reuse the streaming query ID as the consumer name. All consumers registered by the Databricks source have the \u201cdatabricks\\_\u201d prefix. Structured Streaming queries that reference consumers that have previously been registered use the `consumerARN` returned by `describeStreamConsumer`. \nThe `consumerName` field allows you to provide a unique name for your streaming query. If you choose not to provide a name, the streaming query ID is used. The `consumerName` must be a string comprising letters, numbers and special characters such as `_` (underscore), `.` (dot) and `-` (hyphen). \nImportant \nA registered EFO consumer incurs [additional charges on Amazon Kinesis](https:\/\/aws.amazon.com\/kinesis\/data-streams\/pricing\/). To deregister the consumer automatically on query teardown, set the `requireConsumerDeregistration` option to `true`. Databricks cannot guarantee de-registration on events such as driver crashes or node failures. In case of job failure, Databricks recommends managing registered consumers directly to prevent excess Kinesis charges. \n### Offline consumer management using a Databricks notebook \nDatabricks provides a consumer management utility to register, list or deregister consumers associated with Kinesis data streams. The following code demonstrates using this utility in a Databricks notebook: \n1. In a new Databricks notebook attached to an active cluster, create an `AWSKinesisConsumerManager` by providing the necessary authentication information. 
\n```\nimport com.databricks.sql.kinesis.AWSKinesisConsumerManager\n\nval manager = AWSKinesisConsumerManager.newManager()\n.option(\"awsAccessKey\", awsAccessKeyId)\n.option(\"awsSecretKey\", awsSecretKey)\n.option(\"region\", kinesisRegion)\n.create()\n\n```\n2. List and display consumers. \n```\nval consumers = manager.listConsumers(\"<stream name>\")\ndisplay(consumers)\n\n```\n3. Register consumer for given stream. \n```\nval consumerARN = manager.registerConsumer(\"<stream name>\", \"<consumer name>\")\n\n```\n4. Deregister consumer for given stream. \n```\nmanager.deregisterConsumer(\"<stream name>\", \"<consumer name>\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/streaming\/kinesis.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### sparklyr\n\nDatabricks supports sparklyr in notebooks, jobs, and RStudio Desktop. This article describes how you can use sparklyr and provides example scripts that you can run. See [R interface to Apache Spark](https:\/\/spark.rstudio.com\/) for more information.\n\n#### sparklyr\n##### Requirements\n\nDatabricks distributes the latest stable version of sparklyr with every Databricks Runtime release. You can use sparklyr in Databricks R notebooks or inside RStudio Server hosted on Databricks by importing the installed version of sparklyr. \nIn RStudio Desktop, Databricks Connect allows you to connect sparklyr from your local machine to Databricks clusters and run Apache Spark code. See [Use sparklyr and RStudio Desktop with Databricks Connect](https:\/\/docs.databricks.com\/dev-tools\/databricks-connect\/index.html).\n\n#### sparklyr\n##### Connect sparklyr to Databricks clusters\n\nTo establish a sparklyr connection, you can use `\"databricks\"` as the connection method in `spark_connect()`.\nNo additional parameters to `spark_connect()` are needed, nor is calling `spark_install()` needed because Spark is already installed on a Databricks cluster. 
\n```\n# Calling spark_connect() requires the sparklyr package to be loaded first.\nlibrary(sparklyr)\n\n# Create a sparklyr connection.\nsc <- spark_connect(method = \"databricks\")\n\n```\n\n#### sparklyr\n##### Progress bars and Spark UI with sparklyr\n\nIf you assign the sparklyr connection object to a variable named `sc` as in the above example,\nyou will see Spark progress bars in the notebook after each command that triggers Spark jobs.\nIn addition, you can click the link next to the progress bar to view the Spark UI associated with\nthe given Spark job. \n![Sparklyr progress](https:\/\/docs.databricks.com\/_images\/sparklyr-progress.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/sparklyr.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### sparklyr\n##### Use sparklyr\n\nAfter you install sparklyr and establish the connection, all other\nsparklyr APIs work as they normally do.\nSee the [example notebook](https:\/\/docs.databricks.com\/sparkr\/sparklyr.html#notebook) for some examples. \nsparklyr is usually used along with other [tidyverse packages](https:\/\/www.tidyverse.org\/) such as\n[dplyr](https:\/\/cran.rstudio.com\/web\/packages\/dplyr\/vignettes\/dplyr.html).\nMost of these packages are preinstalled on Databricks for your convenience.\nYou can simply import them and start using the API.\n\n#### sparklyr\n##### Use sparklyr and SparkR together\n\nSparkR and sparklyr can be used together in a single notebook or job.\nYou can import SparkR along with sparklyr and use its functionality.\nIn Databricks notebooks, the SparkR connection is pre-configured. 
\nSome of the functions in SparkR mask a number of functions in dplyr: \n```\n> library(SparkR)\nThe following objects are masked from \u2018package:dplyr\u2019:\n\narrange, between, coalesce, collect, contains, count, cume_dist,\ndense_rank, desc, distinct, explain, filter, first, group_by,\nintersect, lag, last, lead, mutate, n, n_distinct, ntile,\npercent_rank, rename, row_number, sample_frac, select, sql,\nsummarize, union\n\n``` \nIf you import SparkR after you imported dplyr, you can reference the functions in dplyr by using\nthe fully qualified names, for example, `dplyr::arrange()`.\nSimilarly if you import dplyr after SparkR, the functions in SparkR are masked by dplyr. \nAlternatively, you can selectively detach one of the two packages while you do not need it. \n```\ndetach(\"package:dplyr\")\n\n``` \nSee also [Comparing SparkR and sparklyr](https:\/\/docs.databricks.com\/sparkr\/sparkr-vs-sparklyr.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/sparklyr.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### sparklyr\n##### Use sparklyr in spark-submit jobs\n\nYou can run scripts that use sparklyr on Databricks as spark-submit jobs, with minor code modifications. Some of the instructions above do not apply to using sparklyr in spark-submit jobs on Databricks. In particular, you must provide the Spark master URL to `spark_connect`. For example: \n```\nlibrary(sparklyr)\n\nsc <- spark_connect(method = \"databricks\", spark_home = \"<spark-home-path>\")\n...\n\n```\n\n#### sparklyr\n##### Unsupported features\n\nDatabricks does not support sparklyr methods such as `spark_web()` and `spark_log()` that require a\nlocal browser. 
However, since the Spark UI is built-in on Databricks, you can inspect Spark jobs and logs easily.\nSee [Compute driver and worker logs](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#driver-logs).\n\n#### sparklyr\n##### Example notebook: Sparklyr demonstration\n\n### Sparklyr notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/sparklyr.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nFor additional examples, see [Work with DataFrames and tables in R](https:\/\/docs.databricks.com\/sparkr\/dataframes-tables.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/sparklyr.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Pre-trained models in Unity Catalog and Marketplace\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nDatabricks includes a selection of high-quality, pre-trained foundation models in Unity Catalog. In addition, you can install and deploy pre-trained models from external providers using Databricks Marketplace. This article describes how you can use those models and incorporate them into your inference workflows. These pre-trained models allow you to access state-of-the-art AI capabilities, saving you the time and expense of building your own custom models. 
\nFor information about using your own custom models with Unity Catalog, see [Manage model lifecycle in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/pretrained-models.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Pre-trained models in Unity Catalog and Marketplace\n#### Find pre-trained foundation models in Unity Catalog\n\nIn regions that are enabled for Databricks Model Serving, Databricks has pre-installed a selection of state-of-the-art foundation models. These models have permissive licenses and have been optimized for serving with [Provisioned throughput Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html). \nThese models are available directly from Catalog Explorer, under the catalog `system` in the schema `ai` (`system.ai`). \nYou can serve these models with a single click or incorporate them directly into your batch inference workflows. To serve a pre-trained model, click the model\u2019s name in the Catalog to open the model page and click **Serve this model**. For more information about Model Serving, see [Model serving with Databricks](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html). For a list of regions supported for Model Serving, see [Region availability](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html#region-availability). \nModels in `system.ai` are available to all account users by default. Unity Catalog metastore admins can limit access to these models. See [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html). 
\n![locate and serve pre-trained model in Unity Catalog](https:\/\/docs.databricks.com\/_images\/serve-pretrained-model.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/pretrained-models.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Pre-trained models in Unity Catalog and Marketplace\n#### Find pre-trained models in Databricks Marketplace\n\nIn addition to the built-in Databricks-provided models in the `system.ai` schema, you can find and install models from external providers in Databricks Marketplace. You can install model listings from Databricks Marketplace into Unity Catalog, and then deploy models in the listing for inference tasks just as you would one of your own models. \nSee [Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html) for instructions.\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/pretrained-models.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Load data using COPY INTO with temporary credentials\n\nIf your Databricks cluster or SQL warehouse doesn\u2019t have permissions to read your source files, you can use temporary credentials to access data from external cloud object storage and load files into a Delta Lake table. \nDepending on how your organization manages your cloud security, you might need to ask a cloud administrator or power user to provide you with credentials. For more information, see [Generate temporary credentials for ingestion](https:\/\/docs.databricks.com\/ingestion\/copy-into\/generate-temporary-credentials.html).\n\n#### Load data using COPY INTO with temporary credentials\n##### Specifying temporary credentials or encryption options to access data\n\nNote \nCredential and encryption options are available in Databricks Runtime 10.4 LTS and above. 
\n`COPY INTO` supports: \n* [Azure SAS tokens](https:\/\/learn.microsoft.com\/azure\/storage\/common\/storage-sas-overview) to read data from ADLS Gen2 and Azure Blob Storage. Azure Blob Storage temporary tokens are at the container level, whereas ADLS Gen2 tokens can be at the directory level in addition to the container level. Databricks recommends using directory level SAS tokens when possible. The SAS token must have \u201cRead\u201d, \u201cList\u201d, and \u201cPermissions\u201d permissions.\n* [AWS STS tokens](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/AuthUsingTempSessionToken.html) to read data from AWS S3. Your tokens should have the \u201cs3:GetObject\\*\u201d, \u201cs3:ListBucket\u201d, and \u201cs3:GetBucketLocation\u201d permissions. \nWarning \nTo avoid misuse or exposure of temporary credentials, Databricks recommends that you set expiration horizons that are just long enough to complete the task. \n`COPY INTO` supports loading encrypted data from AWS S3. To load encrypted data, provide the type of encryption and the key to decrypt the data.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/temporary-credentials.html"} +{"content":"# Ingest data into a Databricks lakehouse\n## Get started using COPY INTO to load data\n#### Load data using COPY INTO with temporary credentials\n##### Load data using temporary credentials\n\nThe following example loads data from S3 and ADLS Gen2 using temporary credentials to provide access to the source data. 
\n```\nCOPY INTO my_json_data\nFROM 's3:\/\/my-bucket\/jsonData' WITH (\nCREDENTIAL (AWS_ACCESS_KEY = '...', AWS_SECRET_KEY = '...', AWS_SESSION_TOKEN = '...')\n)\nFILEFORMAT = JSON\n\nCOPY INTO my_json_data\nFROM 'abfss:\/\/container@storageAccount.dfs.core.windows.net\/jsonData' WITH (\nCREDENTIAL (AZURE_SAS_TOKEN = '...')\n)\nFILEFORMAT = JSON\n\n```\n\n#### Load data using COPY INTO with temporary credentials\n##### Load encrypted data\n\nUsing customer-provided encryption keys, the following example loads data from S3. \n```\nCOPY INTO my_json_data\nFROM 's3:\/\/my-bucket\/jsonData' WITH (\nENCRYPTION (TYPE = 'AWS_SSE_C', MASTER_KEY = '...')\n)\nFILEFORMAT = JSON\n\n```\n\n#### Load data using COPY INTO with temporary credentials\n##### Load JSON data using credentials for source and target\n\nThe following example loads JSON data from a file on AWS S3 into the external Delta table called `my_json_data`.\nThis table must be created before `COPY INTO` can be executed.\nThe command uses one existing credential to write to external Delta table and another to read from the S3 location. \n```\nCOPY INTO my_json_data WITH (CREDENTIAL target_credential)\nFROM 's3:\/\/my-bucket\/jsonData' WITH (CREDENTIAL source_credential)\nFILEFORMAT = JSON\nFILES = ('f.json')\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/temporary-credentials.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Updating from Jobs API 2.0 to 2.1\n\nYou can now orchestrate multiple tasks with Databricks [jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html). This article details changes to the [Jobs API](https:\/\/docs.databricks.com\/api\/workspace\/jobs) that support jobs with multiple tasks and provides guidance to help you update your existing API clients to work with this new feature. 
\nDatabricks recommends Jobs API 2.1 for your API scripts and clients, particularly when using jobs with multiple tasks. \nThis article refers to jobs defined with a single task as *single-task format* and jobs defined with multiple tasks as *multi-task format*. \nJobs API 2.0 and 2.1 now support the [update](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#update-job) request. Use the `update` request to change an existing job instead of the [reset](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#reset-job) request to minimize changes between single-task format jobs and multi-task format jobs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Updating from Jobs API 2.0 to 2.1\n##### API changes\n\nThe Jobs API now defines a `TaskSettings` object to capture settings for each task in a job. For multi-task format jobs, the `tasks` field, an array of `TaskSettings` data structures, is included in the `JobSettings` object. Some fields previously part of `JobSettings` are now part of the task settings for multi-task format jobs. `JobSettings` is also updated to include the `format` field. The `format` field indicates the format of the job and is a `STRING` value set to `SINGLE_TASK` or `MULTI_TASK`. \nYou need to update your existing API clients for these changes to JobSettings for multi-task format jobs. See the [API client guide](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#client-guidance) for more information on required changes. \nJobs API 2.1 supports the multi-task format. All API 2.1 requests must conform to the multi-task format and responses are structured in the multi-task format. New features are released for API 2.1 first. \nJobs API 2.0 is updated with an additional field to support multi-task format jobs. 
Except where noted, the examples in this document use API 2.0. However, Databricks recommends API 2.1 for new and existing API scripts and clients. \nAn example JSON document representing a multi-task format job for API 2.0 and 2.1: \n```\n{\n\"job_id\": 53,\n\"settings\": {\n\"name\": \"A job with multiple tasks\",\n\"email_notifications\": {},\n\"timeout_seconds\": 0,\n\"max_concurrent_runs\": 1,\n\"tasks\": [\n{\n\"task_key\": \"clean_data\",\n\"description\": \"Clean and prepare the data\",\n\"notebook_task\": {\n\"notebook_path\": \"\/Users\/user@databricks.com\/clean-data\"\n},\n\"existing_cluster_id\": \"1201-my-cluster\",\n\"max_retries\": 3,\n\"min_retry_interval_millis\": 0,\n\"retry_on_timeout\": true,\n\"timeout_seconds\": 3600,\n\"email_notifications\": {}\n},\n{\n\"task_key\": \"analyze_data\",\n\"description\": \"Perform an analysis of the data\",\n\"notebook_task\": {\n\"notebook_path\": \"\/Users\/user@databricks.com\/analyze-data\"\n},\n\"depends_on\": [\n{\n\"task_key\": \"clean_data\"\n}\n],\n\"existing_cluster_id\": \"1201-my-cluster\",\n\"max_retries\": 3,\n\"min_retry_interval_millis\": 0,\n\"retry_on_timeout\": true,\n\"timeout_seconds\": 3600,\n\"email_notifications\": {}\n}\n],\n\"format\": \"MULTI_TASK\"\n},\n\"created_time\": 1625841911296,\n\"creator_user_name\": \"user@databricks.com\",\n\"run_as_user_name\": \"user@databricks.com\"\n}\n\n``` \nJobs API 2.1 supports configuration of task level clusters or one or more shared job clusters: \n* A task level cluster is created and started when a task starts and terminates when the task completes.\n* A shared job cluster allows multiple tasks in the same job to use the cluster. The cluster is created and started when the first task using the cluster starts and terminates after the last task using the cluster completes. A shared job cluster is not terminated when idle but terminates only after all tasks using it are complete. 
Multiple non-dependent tasks sharing a cluster can start at the same time. If a shared job cluster fails or is terminated before all tasks have finished, a new cluster is created. \nTo configure shared job clusters, include a `JobCluster` array in the `JobSettings` object. You can specify a maximum of 100 clusters per job. The following is an example of an API 2.1 response for a job configured with two shared clusters: \nNote \nIf a task has library dependencies, you must configure the libraries in the `task` field settings; libraries cannot be configured in a shared job cluster configuration. In the following example, the `libraries` field in the configuration of the `ingest_orders` task demonstrates specification of a library dependency. \n```\n{\n\"job_id\": 53,\n\"settings\": {\n\"name\": \"A job with multiple tasks\",\n\"email_notifications\": {},\n\"timeout_seconds\": 0,\n\"max_concurrent_runs\": 1,\n\"job_clusters\": [\n{\n\"job_cluster_key\": \"default_cluster\",\n\"new_cluster\": {\n\"spark_version\": \"7.3.x-scala2.12\",\n\"node_type_id\": \"i3.xlarge\",\n\"spark_conf\": {\n\"spark.speculation\": true\n},\n\"aws_attributes\": {\n\"availability\": \"SPOT\",\n\"zone_id\": \"us-west-2a\"\n},\n\"autoscale\": {\n\"min_workers\": 2,\n\"max_workers\": 8\n}\n}\n},\n{\n\"job_cluster_key\": \"data_processing_cluster\",\n\"new_cluster\": {\n\"spark_version\": \"7.3.x-scala2.12\",\n\"node_type_id\": \"r4.2xlarge\",\n\"spark_conf\": {\n\"spark.speculation\": true\n},\n\"aws_attributes\": {\n\"availability\": \"SPOT\",\n\"zone_id\": \"us-west-2a\"\n},\n\"autoscale\": {\n\"min_workers\": 8,\n\"max_workers\": 16\n}\n}\n}\n],\n\"tasks\": [\n{\n\"task_key\": \"ingest_orders\",\n\"description\": \"Ingest order data\",\n\"depends_on\": [ ],\n\"job_cluster_key\": \"default_cluster\",\n\"spark_jar_task\": {\n\"main_class_name\": \"com.databricks.OrdersIngest\",\n\"parameters\": [\n\"--data\",\n\"dbfs:\/path\/to\/order-data.json\"\n]\n},\n\"libraries\": [\n{\n\"jar\": 
\"dbfs:\/mnt\/databricks\/OrderIngest.jar\"\n}\n],\n\"timeout_seconds\": 86400,\n\"max_retries\": 3,\n\"min_retry_interval_millis\": 2000,\n\"retry_on_timeout\": false\n},\n{\n\"task_key\": \"clean_orders\",\n\"description\": \"Clean and prepare the order data\",\n\"notebook_task\": {\n\"notebook_path\": \"\/Users\/user@databricks.com\/clean-data\"\n},\n\"job_cluster_key\": \"default_cluster\",\n\"max_retries\": 3,\n\"min_retry_interval_millis\": 0,\n\"retry_on_timeout\": true,\n\"timeout_seconds\": 3600,\n\"email_notifications\": {}\n},\n{\n\"task_key\": \"analyze_orders\",\n\"description\": \"Perform an analysis of the order data\",\n\"notebook_task\": {\n\"notebook_path\": \"\/Users\/user@databricks.com\/analyze-data\"\n},\n\"depends_on\": [\n{\n\"task_key\": \"clean_data\"\n}\n],\n\"job_cluster_key\": \"data_processing_cluster\",\n\"max_retries\": 3,\n\"min_retry_interval_millis\": 0,\n\"retry_on_timeout\": true,\n\"timeout_seconds\": 3600,\n\"email_notifications\": {}\n}\n],\n\"format\": \"MULTI_TASK\"\n},\n\"created_time\": 1625841911296,\n\"creator_user_name\": \"user@databricks.com\",\n\"run_as_user_name\": \"user@databricks.com\"\n}\n\n``` \nFor single-task format jobs, the `JobSettings` data structure remains unchanged except for the addition of the `format` field. No `TaskSettings` array is included, and the task settings remain defined at the top level of the `JobSettings` data structure. You will not need to make changes to your existing API clients to process single-task format jobs. 
\nAn example JSON document representing a single-task format job for API 2.0: \n```\n{\n\"job_id\": 27,\n\"settings\": {\n\"name\": \"Example notebook\",\n\"existing_cluster_id\": \"1201-my-cluster\",\n\"libraries\": [\n{\n\"jar\": \"dbfs:\/FileStore\/jars\/spark_examples.jar\"\n}\n],\n\"email_notifications\": {},\n\"timeout_seconds\": 0,\n\"schedule\": {\n\"quartz_cron_expression\": \"0 0 0 * * ?\",\n\"timezone_id\": \"US\/Pacific\",\n\"pause_status\": \"UNPAUSED\"\n},\n\"notebook_task\": {\n\"notebook_path\": \"\/notebooks\/example-notebook\",\n\"revision_timestamp\": 0\n},\n\"max_concurrent_runs\": 1,\n\"format\": \"SINGLE_TASK\"\n},\n\"created_time\": 1504128821443,\n\"creator_user_name\": \"user@databricks.com\"\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Updating from Jobs API 2.0 to 2.1\n##### API client guide\n\nThis section provides guidelines, examples, and required changes for API calls affected by the new multi-task format feature. 
\nIn this section: \n* [Create](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#create)\n* [Runs submit](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#runs-submit)\n* [Update](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#update)\n* [Reset](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#reset)\n* [List](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#list)\n* [Get](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#get)\n* [Runs get](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#runs-get)\n* [Runs get output](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#runs-get-output)\n* [Runs list](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#runs-list) \n### [Create](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#id1) \nTo create a single-task format job through the [Create a new job](https:\/\/docs.databricks.com\/api\/workspace\/jobs) operation (`POST \/jobs\/create`) in the Jobs API, you do not need to change existing clients. \nTo create a multi-task format job, use the `tasks` field in `JobSettings` to specify settings for each task. The following example creates a job with two notebook tasks. This example is for API 2.0 and 2.1: \nNote \nA maximum of 100 tasks can be specified per job. 
\n```\n{\n\"name\": \"Multi-task-job\",\n\"max_concurrent_runs\": 1,\n\"tasks\": [\n{\n\"task_key\": \"clean_data\",\n\"description\": \"Clean and prepare the data\",\n\"notebook_task\": {\n\"notebook_path\": \"\/Users\/user@databricks.com\/clean-data\"\n},\n\"existing_cluster_id\": \"1201-my-cluster\",\n\"timeout_seconds\": 3600,\n\"max_retries\": 3,\n\"retry_on_timeout\": true\n},\n{\n\"task_key\": \"analyze_data\",\n\"description\": \"Perform an analysis of the data\",\n\"notebook_task\": {\n\"notebook_path\": \"\/Users\/user@databricks.com\/analyze-data\"\n},\n\"depends_on\": [\n{\n\"task_key\": \"clean_data\"\n}\n],\n\"existing_cluster_id\": \"1201-my-cluster\",\n\"timeout_seconds\": 3600,\n\"max_retries\": 3,\n\"retry_on_timeout\": true\n}\n]\n}\n\n``` \n### [Runs submit](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#id2) \nTo submit a one-time run of a single-task format job with the [Create and trigger a one-time run](https:\/\/docs.databricks.com\/api\/workspace\/jobs) operation (`POST \/runs\/submit`) in the Jobs API, you do not need to change existing clients. \nTo submit a one-time run of a multi-task format job, use the `tasks` field in `JobSettings` to specify settings for each task, including clusters. Clusters must be set at the task level when submitting a multi-task format job because the `runs submit` request does not support shared job clusters. See [Create](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#create-job) for an example `JobSettings` specifying multiple tasks. \n### [Update](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#id3) \nTo update a single-task format job with the [Partially update a job](https:\/\/docs.databricks.com\/api\/workspace\/jobs) operation (`POST \/jobs\/update`) in the Jobs API, you do not need to change existing clients. \nTo update the settings of a multi-task format job, you must use the unique `task_key` field to identify new `task` settings. 
See [Create](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#create-job) for an example `JobSettings` specifying multiple tasks. \n### [Reset](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#id4) \nTo overwrite the settings of a single-task format job with the [Overwrite all settings for a job](https:\/\/docs.databricks.com\/api\/workspace\/jobs) operation (`POST \/jobs\/reset`) in the Jobs API, you do not need to change existing clients. \nTo overwrite the settings of a multi-task format job, specify a `JobSettings` data structure with an array of `TaskSettings` data structures. See [Create](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#create-job) for an example `JobSettings` specifying multiple tasks. \nUse [Update](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#update-job) to change individual fields without switching from single-task to multi-task format. \n### [List](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#id5) \nFor single-task format jobs, no client changes are required to process the response from the [List all jobs](https:\/\/docs.databricks.com\/api\/workspace\/jobs) operation (`GET \/jobs\/list`) in the Jobs API. \nFor multi-task format jobs, most settings are defined at the task level and not the job level. Cluster configuration may be set at the task or job level. To modify clients to access cluster or task settings for a multi-task format job returned in the `Job` structure: \n* Parse the `job_id` field for the multi-task format job.\n* Pass the `job_id` to the [Get a job](https:\/\/docs.databricks.com\/api\/workspace\/jobs) operation (`GET \/jobs\/get`) in the Jobs API to retrieve job details. See [Get](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#get-job) for an example response from the `Get` API call for a multi-task format job. 
\nThe following example shows a response containing single-task and multi-task format jobs. This example is for API 2.0: \n```\n{\n\"jobs\": [\n{\n\"job_id\": 36,\n\"settings\": {\n\"name\": \"A job with a single task\",\n\"existing_cluster_id\": \"1201-my-cluster\",\n\"email_notifications\": {},\n\"timeout_seconds\": 0,\n\"notebook_task\": {\n\"notebook_path\": \"\/Users\/user@databricks.com\/example-notebook\",\n\"revision_timestamp\": 0\n},\n\"max_concurrent_runs\": 1,\n\"format\": \"SINGLE_TASK\"\n},\n\"created_time\": 1505427148390,\n\"creator_user_name\": \"user@databricks.com\"\n},\n{\n\"job_id\": 53,\n\"settings\": {\n\"name\": \"A job with multiple tasks\",\n\"email_notifications\": {},\n\"timeout_seconds\": 0,\n\"max_concurrent_runs\": 1,\n\"format\": \"MULTI_TASK\"\n},\n\"created_time\": 1625841911296,\n\"creator_user_name\": \"user@databricks.com\"\n}\n]\n}\n\n``` \n### [Get](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#id6) \nFor single-task format jobs, no client changes are required to process the response from the [Get a job](https:\/\/docs.databricks.com\/api\/workspace\/jobs) operation (`GET \/jobs\/get`) in the Jobs API. \nMulti-task format jobs return an array of `task` data structures containing task settings. If you require access to task level details, you need to modify your clients to iterate through the `tasks` array and extract required fields. \nThe following shows an example response from the `Get` API call for a multi-task format job. 
This example is for API 2.0 and 2.1: \n```\n{\n\"job_id\": 53,\n\"settings\": {\n\"name\": \"A job with multiple tasks\",\n\"email_notifications\": {},\n\"timeout_seconds\": 0,\n\"max_concurrent_runs\": 1,\n\"tasks\": [\n{\n\"task_key\": \"clean_data\",\n\"description\": \"Clean and prepare the data\",\n\"notebook_task\": {\n\"notebook_path\": \"\/Users\/user@databricks.com\/clean-data\"\n},\n\"existing_cluster_id\": \"1201-my-cluster\",\n\"max_retries\": 3,\n\"min_retry_interval_millis\": 0,\n\"retry_on_timeout\": true,\n\"timeout_seconds\": 3600,\n\"email_notifications\": {}\n},\n{\n\"task_key\": \"analyze_data\",\n\"description\": \"Perform an analysis of the data\",\n\"notebook_task\": {\n\"notebook_path\": \"\/Users\/user@databricks.com\/analyze-data\"\n},\n\"depends_on\": [\n{\n\"task_key\": \"clean_data\"\n}\n],\n\"existing_cluster_id\": \"1201-my-cluster\",\n\"max_retries\": 3,\n\"min_retry_interval_millis\": 0,\n\"retry_on_timeout\": true,\n\"timeout_seconds\": 3600,\n\"email_notifications\": {}\n}\n],\n\"format\": \"MULTI_TASK\"\n},\n\"created_time\": 1625841911296,\n\"creator_user_name\": \"user@databricks.com\",\n\"run_as_user_name\": \"user@databricks.com\"\n}\n\n``` \n### [Runs get](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#id7) \nFor single-task format jobs, no client changes are required to process the response from the [Get a job run](https:\/\/docs.databricks.com\/api\/workspace\/jobs) operation (`GET \/jobs\/runs\/get`) in the Jobs API. \nThe response for a multi-task format job run contains an array of `TaskSettings`. To retrieve run results for each task: \n* Iterate through each of the tasks.\n* Parse the `run_id` for each task.\n* Call the [Get the output for a run](https:\/\/docs.databricks.com\/api\/workspace\/jobs) operation (`GET \/jobs\/runs\/get-output`) with the `run_id` to get details on the run for each task. 
The following is an example response from this request: \n```\n{\n\"job_id\": 53,\n\"run_id\": 759600,\n\"number_in_job\": 7,\n\"original_attempt_run_id\": 759600,\n\"state\": {\n\"life_cycle_state\": \"TERMINATED\",\n\"result_state\": \"SUCCESS\",\n\"state_message\": \"\"\n},\n\"cluster_spec\": {},\n\"start_time\": 1595943854860,\n\"setup_duration\": 0,\n\"execution_duration\": 0,\n\"cleanup_duration\": 0,\n\"trigger\": \"ONE_TIME\",\n\"creator_user_name\": \"user@databricks.com\",\n\"run_name\": \"Query logs\",\n\"run_type\": \"JOB_RUN\",\n\"tasks\": [\n{\n\"run_id\": 759601,\n\"task_key\": \"query-logs\",\n\"description\": \"Query session logs\",\n\"notebook_task\": {\n\"notebook_path\": \"\/Users\/user@databricks.com\/log-query\"\n},\n\"existing_cluster_id\": \"1201-my-cluster\",\n\"state\": {\n\"life_cycle_state\": \"TERMINATED\",\n\"result_state\": \"SUCCESS\",\n\"state_message\": \"\"\n}\n},\n{\n\"run_id\": 759602,\n\"task_key\": \"validate_output\",\n\"description\": \"Validate query output\",\n\"depends_on\": [\n{\n\"task_key\": \"query-logs\"\n}\n],\n\"notebook_task\": {\n\"notebook_path\": \"\/Users\/user@databricks.com\/validate-query-results\"\n},\n\"existing_cluster_id\": \"1201-my-cluster\",\n\"state\": {\n\"life_cycle_state\": \"TERMINATED\",\n\"result_state\": \"SUCCESS\",\n\"state_message\": \"\"\n}\n}\n],\n\"format\": \"MULTI_TASK\"\n}\n\n``` \n### [Runs get output](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#id8) \nFor single-task format jobs, no client changes are required to process the response from the [Get the output for a run](https:\/\/docs.databricks.com\/api\/workspace\/jobs) operation (`GET \/jobs\/runs\/get-output`) in the Jobs API. \nFor multi-task format jobs, calling `Runs get output` on a parent run results in an error since run output is available only for individual tasks. 
To get the output and metadata for a multi-task format job: \n* Call the [Get a job run](https:\/\/docs.databricks.com\/api\/workspace\/jobs) operation (`GET \/jobs\/runs\/get`) on the parent run.\n* Iterate over the child `run_id` fields in the response.\n* Use the child `run_id` values to call `Runs get output`. \n### [Runs list](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html#id9) \nFor single-task format jobs, no client changes are required to process the response from the [List runs for a job](https:\/\/docs.databricks.com\/api\/workspace\/jobs) operation (`GET \/jobs\/runs\/list`). \nFor multi-task format jobs, an empty `tasks` array is returned. Pass the `run_id` to the [Get a job run](https:\/\/docs.databricks.com\/api\/workspace\/jobs) operation (`GET \/jobs\/runs\/get`) to retrieve the tasks. The following shows an example response from the `Runs list` API call for a multi-task format job: \n```\n{\n\"runs\": [\n{\n\"job_id\": 53,\n\"run_id\": 759600,\n\"number_in_job\": 7,\n\"original_attempt_run_id\": 759600,\n\"state\": {\n\"life_cycle_state\": \"TERMINATED\",\n\"result_state\": \"SUCCESS\",\n\"state_message\": \"\"\n},\n\"cluster_spec\": {},\n\"start_time\": 1595943854860,\n\"setup_duration\": 0,\n\"execution_duration\": 0,\n\"cleanup_duration\": 0,\n\"trigger\": \"ONE_TIME\",\n\"creator_user_name\": \"user@databricks.com\",\n\"run_name\": \"Query logs\",\n\"run_type\": \"JOB_RUN\",\n\"tasks\": [],\n\"format\": \"MULTI_TASK\"\n}\n],\n\"has_more\": false\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use online tables for real-time feature serving\n\nPreview \nOnline Tables are in public preview. During the preview, ingesting data into Online Tables consumes SQL Serverless DBUs. Final pricing for Online Tables will be made available at a future date. 
\nOnline Tables Preview is available in the following regions: `us-east-1`, `us-west-2`, `eu-west-1`, `ap-southeast-2`. \nAn online table is a read-only copy of a Delta Table that is stored in row-oriented format optimized for online access. Online tables are fully serverless tables that auto-scale throughput capacity with the request load and provide low latency and high throughput access to data of any scale. Online tables are designed to work with Databricks Model Serving, Feature Serving, and retrieval-augmented generation (RAG) applications where they are used for fast data lookups. \nYou can also use online tables in queries using [Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). When using Lakehouse Federation, you must use a Serverless SQL warehouse to access online tables. Only read operations (`SELECT`) are supported. This capability is intended for interactive or debugging purposes only and should not be used for production or mission critical workloads. \nCreating an online table using the Databricks UI is a one-step process. Just select the Delta table in Catalog Explorer and select **Create online table**. You can also use the REST API or the Databricks SDK to create and manage online tables. See [Work with online tables using APIs](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html#api-sdk).\n\n#### Use online tables for real-time feature serving\n##### Requirements\n\n* The workspace must be enabled for Unity Catalog. 
Follow the [documentation](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html) to create a Unity Catalog Metastore, enable it in a workspace, and create a Catalog.\n* A model must be registered in Unity Catalog to access online tables.\n* A Databricks admin must accept the Serverless Terms of Service in the account console.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use online tables for real-time feature serving\n##### Work with online tables using the UI\n\nThis section describes how to create and delete online tables, and how to check the status and trigger updates of online tables. \n### Create an online table using the UI \nYou create an online table using Catalog Explorer. For information about required permissions, see [User permissions](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html#permissions). \n1. To create an online table, the source Delta table must have a primary key. If the Delta table you want to use does not have a primary key, create one by following these instructions: [Use an existing Delta table in Unity Catalog as a feature table](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html#use-existing-uc-table).\n2. In Catalog Explorer, navigate to the source table that you want to sync to an online table. From the **Create** menu, select **Online table**. \n![select create online table](https:\/\/docs.databricks.com\/_images\/create-online-table.png)\n3. Use the selectors in the dialog to configure the online table. \n![configure online table dialog](https:\/\/docs.databricks.com\/_images\/create-online-table-dlg.png) \n**Name**: Name to use for the online table in Unity Catalog. \n**Primary Key**: Column(s) in the source table to use as primary key(s) in the online table. 
\n**Timeseries Key**: (Optional) Column in the source table to use as the timeseries key. When specified, the online table includes only the row with the latest timeseries key value for each primary key. \n**Sync mode**: Specifies how the synchronization pipeline updates the online table. Select one of **Snapshot**, **Triggered**, or **Continuous**. \n| Sync mode | Description |\n| --- | --- |\n| Snapshot | The pipeline runs once to take a snapshot of the source table and copy it to the online table. Subsequent changes to the source table are automatically reflected in the online table by taking a new snapshot of the source and creating a new copy. The content of the online table is updated atomically. |\n| Triggered | The pipeline runs once to create an initial snapshot copy of the source table in the online table. Unlike the Snapshot sync mode, when the online table is refreshed, only changes since the last pipeline execution are retrieved and applied to the online table. The incremental refresh can be manually triggered or automatically triggered according to a schedule. |\n| Continuous | The pipeline runs continuously. Subsequent changes to the source table are incrementally applied to the online table in real-time streaming mode. No manual refresh is necessary. | \nNote \nTo support **Triggered** or **Continuous** sync mode, the source table must have [Change data feed](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html) enabled. \n4. When you are done, click **Confirm**. The online table page appears.\n5. The new online table is created under the catalog, schema, and name specified in the creation dialog. In Catalog Explorer, the online table is indicated by ![online table icon](https:\/\/docs.databricks.com\/_images\/online-table-icon.png). \n### Get status and trigger updates using the UI \nTo check the status of the online table, click the name of the table in the Catalog to open it. The online table page appears with the **Overview** tab open. 
The **Data Ingest** section shows the status of the latest update. To trigger an update, click **Sync now**. The **Data Ingest** section also includes a link to the Delta Live Tables pipeline that updates the table. \n![view of online table page in catalog](https:\/\/docs.databricks.com\/_images\/online-table-in-catalog.png) \n### Delete an online table using the UI \nFrom the online table page, select **Delete** from the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use online tables for real-time feature serving\n##### Work with online tables using APIs\n\nYou can also use the Databricks SDK or the REST API to create and manage online tables. \nFor reference information, see the reference documentation for the [Databricks SDK for Python](https:\/\/databricks-sdk-py.readthedocs.io\/en\/latest\/workspace\/catalog\/online_tables.html) or the [REST API](https:\/\/docs.databricks.com\/api\/workspace\/onlinetables). \n### Requirements \nDatabricks SDK version 0.20 or above. 
\n### Create an online table using APIs \n```\nfrom pprint import pprint\nfrom databricks.sdk import WorkspaceClient\nfrom databricks.sdk.service.catalog import *\n\nw = WorkspaceClient(host='https:\/\/xxx.databricks.com', token='xxx')\n\n# Create an online table\nspec = OnlineTableSpec(\nprimary_key_columns=[\"pk_col\"],\nsource_table_full_name=\"main.default.source_table\",\nrun_triggered=OnlineTableSpecTriggeredSchedulingPolicy.from_dict({'triggered': 'true'})\n)\n\nw.online_tables.create(name='main.default.my_online_table', spec=spec)\n\n``` \n```\ncurl --request POST \"https:\/\/xxx.databricks.com\/api\/2.0\/online-tables\" \\\n--header \"Authorization: Bearer xxx\" \\\n--data '{\n\"name\": \"main.default.my_online_table\",\n\"spec\": {\n\"run_triggered\": {},\n\"source_table_full_name\": \"main.default.source_table\",\n\"primary_key_columns\": [\"a\"]\n}\n}'\n\n``` \nThe online table automatically starts syncing after it is created. \n### Get status and trigger refresh using APIs \nYou can view the status and the spec of the online table following the example below. If your online table is not continuous\nand you would like to trigger a manual refresh of its data, you can use the pipeline API to do so. \nUse the pipeline ID associated with the online table in the online table spec and start a new update on the pipeline\nto trigger the refresh. This is equivalent to clicking **Sync now** in the online table UI in Catalog Explorer. 
\n```\npprint(w.online_tables.get('main.default.my_online_table'))\n\n# Sample response\nOnlineTable(name='main.default.my_online_table',\nspec=OnlineTableSpec(perform_full_copy=None,\npipeline_id='some-pipeline-id',\nprimary_key_columns=['pk_col'],\nrun_continuously=None,\nrun_triggered={},\nsource_table_full_name='main.default.source_table',\ntimeseries_key=None),\nstatus=OnlineTableStatus(continuous_update_status=None,\ndetailed_state=OnlineTableState.PROVISIONING,\nfailed_status=None,\nmessage='Online Table creation is '\n'pending. Check latest status in '\n'Delta Live Tables: '\n'https:\/\/xxx.databricks.com\/pipelines\/some-pipeline-id',\nprovisioning_status=None,\ntriggered_update_status=None))\n\n# Trigger an online table refresh by calling the pipeline API. To discard all existing data\n# in the online table before refreshing, set \"full_refresh\" to \"True\". This is useful if your\n# online table sync is stuck due to, for example, the source table being deleted and recreated\n# with the same name while the sync was running.\nw.pipelines.start_update(pipeline_id='some-pipeline-id', full_refresh=True)\n\n``` \n```\ncurl --request GET \\\n\"https:\/\/xxx.databricks.com\/api\/2.0\/online-tables\/main.default.my_online_table\" \\\n--header \"Authorization: Bearer xxx\"\n\n# Sample response\n{\n\"name\": \"main.default.my_online_table\",\n\"spec\": {\n\"run_triggered\": {},\n\"source_table_full_name\": \"main.default.source_table\",\n\"primary_key_columns\": [\"pk_col\"],\n\"pipeline_id\": \"some-pipeline-id\"\n},\n\"status\": {\n\"detailed_state\": \"PROVISIONING\",\n\"message\": \"Online Table creation is pending. Check latest status in Delta Live Tables: https:\/\/xxx.databricks.com#joblist\/pipelines\/some-pipeline-id\"\n}\n}\n\n# Trigger an online table refresh by calling the pipeline API. To discard all existing data\n# in the online table before refreshing, set \"full_refresh\" to \"True\". 
This is useful if your\n# online table sync is stuck due to, for example, the source table being deleted and recreated\n# with the same name while the sync was running.\ncurl --request POST \"https:\/\/xxx.databricks.com\/api\/2.0\/pipelines\/some-pipeline-id\/updates\" \\\n--header \"Authorization: Bearer xxx\" \\\n--data '{\n\"full_refresh\": true\n}'\n\n``` \n### Delete an online table using APIs \n```\nw.online_tables.delete('main.default.my_online_table')\n\n``` \n```\ncurl --request DELETE \\\n\"https:\/\/xxx.databricks.com\/api\/2.0\/online-tables\/main.default.my_online_table\" \\\n--header \"Authorization: Bearer xxx\"\n\n``` \nDeleting the online table stops any ongoing data synchronization and releases all its resources.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use online tables for real-time feature serving\n##### Serve online table data using a feature serving endpoint\n\nFor models and applications hosted outside of Databricks, you can create a feature serving endpoint to serve features from online tables. The endpoint makes features available at low latency using a REST API. \n1. Create a feature spec. \nWhen you create a feature spec, you specify the source Delta table. This allows the feature spec to be used in both offline and online scenarios. For online lookups, the serving endpoint automatically uses the online table to perform low-latency feature lookups. \nThe source Delta table and the online table must use the same primary key. \nThe feature spec can be viewed in the **Function** tab in Catalog Explorer. 
\n```\nfrom databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup\n\nfe = FeatureEngineeringClient()\nfe.create_feature_spec(\nname=\"catalog.default.user_preferences_spec\",\nfeatures=[\nFeatureLookup(\ntable_name=\"user_preferences\",\nlookup_key=\"user_id\"\n)\n]\n)\n\n```\n2. Create a feature serving endpoint. \nThis step assumes that you have created an online table named `user_preferences_online_table` that synchronizes data from the Delta table `user_preferences`. Use the feature spec to create a feature serving endpoint. The endpoint makes data available through a REST API using the associated online table. \nNote \nThe user who performs this operation must be the owner of both the offline table and online table. \n```\nfrom databricks.sdk import WorkspaceClient\nfrom databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput\n\nworkspace = WorkspaceClient()\n\n# Create endpoint\nendpoint_name = \"fse-location\"\n\nworkspace.serving_endpoints.create_and_wait(\nname=endpoint_name,\nconfig=EndpointCoreConfigInput(\nserved_entities=[\nServedEntityInput(\n# Full name of the feature spec created in step 1\nentity_name=\"catalog.default.user_preferences_spec\",\nscale_to_zero_enabled=True,\nworkload_size=\"Small\"\n)\n]\n)\n)\n\n``` \n```\nfrom databricks.feature_engineering.entities.feature_serving_endpoint import EndpointCoreConfig, ServedEntity\n\nfe.create_feature_serving_endpoint(\nname=\"user-preferences\",\nconfig=EndpointCoreConfig(\nserved_entities=ServedEntity(\nfeature_spec_name=\"catalog.default.user_preferences_spec\",\nworkload_size=\"Small\",\nscale_to_zero_enabled=True\n)\n)\n)\n\n```\n3. Get data from the feature serving endpoint. \nTo access the API endpoint, send an HTTP POST request to the endpoint URL. The example shows how to do this using Python APIs. For other languages and tools, see [Feature Serving](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html).
\n```\n# Set up credentials\nexport DATABRICKS_TOKEN=...\n\n``` \n```\nimport json\nimport os\n\nimport requests\n\n# Read the token exported in the shell step above.\nDATABRICKS_TOKEN = os.environ[\"DATABRICKS_TOKEN\"]\n\n# Replace {workspace_url} with your workspace host name.\nurl = \"https:\/\/{workspace_url}\/serving-endpoints\/user-preferences\/invocations\"\n\nheaders = {'Authorization': f'Bearer {DATABRICKS_TOKEN}', 'Content-Type': 'application\/json'}\n\n# user_id is the primary key value to look up.\ndata = {\n    \"dataframe_records\": [{\"user_id\": user_id}]\n}\ndata_json = json.dumps(data, allow_nan=True)\n\nresponse = requests.request(method='POST', headers=headers, url=url, data=data_json)\nif response.status_code != 200:\n    raise Exception(f'Request failed with status {response.status_code}, {response.text}')\n\nprint(response.json()['outputs'][0]['hotel_preference'])\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use online tables for real-time feature serving\n##### Use online tables with RAG applications\n\nRAG applications are a common use case for online tables. You create an online table for the structured data that the RAG application needs and host it on a feature serving endpoint. The RAG application uses the feature serving endpoint to look up relevant data from the online table. \nThe typical steps are as follows: \n1. Create a feature serving endpoint.\n2. Create a LangChainTool that uses the endpoint to look up relevant data.\n3. Use the tool in the LangChain agent to retrieve relevant data.\n4. Create a model serving endpoint to host the LangChain application. \nFor step-by-step instructions, see the following example notebook.\n\n#### Use online tables for real-time feature serving\n##### Notebook examples\n\nThe following notebook illustrates how to publish features to online tables for real-time serving and automated feature lookup.
\n### Online tables demo notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/online-tables.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nThe following notebook illustrates how to use Databricks online tables and feature serving endpoints for retrieval augmented generation (RAG) applications. \n### Online tables with RAG applications demo notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/structured-data-for-rag.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use online tables for real-time feature serving\n##### Use online tables with Databricks Model Serving\n\nYou can use online tables to look up features for Databricks Model Serving. When you sync a feature table to an online table, models trained using features from that feature table automatically look up feature values from the online table during inference. No additional configuration is required. \n1. Use a `FeatureLookup` to train the model. \nFor model training, use features from the offline feature table in the model training set, as shown in the following example: \n```\ntraining_set = fe.create_training_set(\ndf=id_rt_feature_labels,\nlabel='quality',\nfeature_lookups=[\nFeatureLookup(\ntable_name=\"user_preferences\",\nlookup_key=\"user_id\"\n)\n],\nexclude_columns=['user_id'],\n)\n\n```\n2. Serve the model with Databricks Model Serving. The model automatically looks up features from the online table. 
See [Automatic feature lookup with MLflow models on Databricks](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/automatic-feature-lookup.html) for details.\n\n#### Use online tables for real-time feature serving\n##### User permissions\n\nYou must have the following permissions to create an online table: \n* `SELECT` privilege on the source table.\n* `USE_CATALOG` privilege on the destination catalog.\n* `USE_SCHEMA` and `CREATE_TABLE` privilege on the destination schema. \nTo manage the data synchronization pipeline of an online table, you must either be the owner of the online table or be granted the REFRESH privilege on the online table. Users who do not have USE\\_CATALOG and USE\\_SCHEMA privileges on the catalog will not see the online table in Catalog Explorer. \nThe Unity Catalog metastore must have [Privilege Model Version 1.0](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/upgrade-privilege-model.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use online tables for real-time feature serving\n##### Endpoint permission model\n\nA unique system service principal is automatically created for a feature serving or model serving endpoint with limited permissions required to query data and execute functions. This service principal allows endpoints to access data and function resources independently of the user who created the resource and ensures that the endpoint can continue to function if the creator leaves the workspace. \nThe lifetime of this system service principal is the lifetime of the endpoint. 
Audit logs may indicate system generated records for the owner of the Unity Catalog catalog granting necessary privileges to this system service principal.\n\n#### Use online tables for real-time feature serving\n##### Limitations\n\n* Only one online table is supported per source table.\n* An online table and its source table can have at most 1000 columns.\n* Columns of data types ARRAY, MAP, or STRUCT cannot be used as primary keys in the online table.\n* If a column is used as a primary key in the online table, all rows in the source table where the column contains null values are ignored.\n* Foreign, system, and internal tables are not supported as source tables.\n* Source tables without Delta change data feed enabled support only the **Snapshot** sync mode.\n* Delta Sharing tables are only supported in the **Snapshot** sync mode.\n* Catalog, schema, and table names of the online table can only contain alphanumeric characters and underscores, and must not start with numbers. Dashes (`-`) are not allowed.\n* Columns of String type are limited to 64KB length.\n* Column names are limited to 64 characters in length.\n* The maximum size of the row is 2MB.\n* The maximum size of an online table during gated public preview is 200GB uncompressed user data.\n* The combined size of all online tables in a Unity Catalog metastore during gated public preview is 1TB uncompressed user data.\n* The maximum queries per second (QPS) is 200. This limit can be increased to 25,000 or more. Reach out to your Databricks account team to increase the limit.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Use online tables for real-time feature serving\n##### Troubleshooting\n\n### \u201cCreate online table\u201d does not appear in Catalog Explorer. 
\nThe cause is usually that the table you are trying to sync from (the source table) is not a supported type. Make sure the source table\u2019s Securable Kind (shown in the Catalog Explorer **Details** tab) is one of the supported options below: \n* `TABLE_EXTERNAL`\n* `TABLE_DELTA`\n* `TABLE_DELTA_EXTERNAL`\n* `TABLE_DELTASHARING`\n* `TABLE_DELTASHARING_MUTABLE`\n* `TABLE_STREAMING_LIVE_TABLE`\n* `TABLE_STANDARD`\n* `TABLE_FEATURE_STORE`\n* `TABLE_FEATURE_STORE_EXTERNAL`\n* `TABLE_VIEW`\n* `TABLE_VIEW_DELTASHARING`\n* `TABLE_MATERIALIZED_VIEW` \n### I cannot select either \u201cTriggered\u201d or \u201cContinuous\u201d sync modes when creating an online table. \nThis happens if the source table does not have Delta change data feed enabled or if it is a View or Materialized View. To use the **Triggered** or **Continuous** sync mode, either enable change data feed on the source table or use a non-view table. \n### Online table update fails or status shows offline \nTo begin troubleshooting this error, click the pipeline ID that appears in the **Overview** tab of the online table in Catalog Explorer. \n![online tables pipeline failure](https:\/\/docs.databricks.com\/_images\/online-tables-failed-pipeline.png) \nOn the pipeline UI page that appears, click on the entry that says \u201cFailed to resolve flow \u2018\\_\\_online\\_table\u201d. \n![online tables pipeline error message](https:\/\/docs.databricks.com\/_images\/online-tables-pipeline-error.png) \nA popup appears with details in the **Error details** section. \n![online tables details of error](https:\/\/docs.databricks.com\/_images\/online-tables-error-details.png) \nCommon causes of errors include the following: \n* The source table was deleted, or deleted and recreated with the same name, while the online table was synchronizing. This is particularly common with continuous online tables, because they are constantly synchronizing.\n* The source table cannot be accessed through Serverless Compute due to firewall settings.
In this situation, the **Error details** section might show the error message \u201cFailed to start the DLT service on cluster xxx\u2026\u201d.\n* The aggregate size of online tables exceeds the 1 TiB (uncompressed size) metastore-wide limit. The 1 TiB limit refers to the uncompressed size after expanding the Delta table in row-oriented format. The size of the table in row-oriented format can be significantly larger than the size of the Delta table shown in Catalog Explorer, which refers to the compressed size of the table in a column-oriented format. The difference can be as large as 100x, depending on the content of the table. \nTo estimate the uncompressed, row-expanded size of a Delta table, use the following query from a Serverless SQL Warehouse. The query returns the estimated expanded table size in bytes. Successfully executing this query also confirms that Serverless Compute can access the source table. \n```\nSELECT sum(length(to_csv(struct(*)))) FROM `source_table`;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html"} +{"content":"# Data governance with Unity Catalog\n## Hive metastore table access control (legacy)\n#### What is the `ANY FILE` securable?\n\nPrivileges on the `ANY FILE` securable grant the entitled principal direct access to the filesystem and data in cloud object storage, regardless of any Hive table ACLs set on database objects like schemas or tables.\n\n#### What is the `ANY FILE` securable?\n##### Privileges for `ANY FILE`\n\nYou can grant `MODIFY` or `SELECT` privilege on the `ANY FILE` securable to any service principal, user, or group using legacy Hive table access control lists (ACLs). All workspace admins have `MODIFY` privileges on `ANY FILE` by default. Any user with `MODIFY` privileges can grant or revoke privileges on `ANY FILE`. \nYou must have privileges on the `ANY FILE` securable when using custom data sources or JDBC drivers not included in Lakehouse Federation.
See [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). \nPrivileges on the `ANY FILE` securable cannot override Unity Catalog privileges and do not grant or expand privileges on data objects governed by Unity Catalog. Some drivers and custom-installed libraries might compromise user isolation by storing data of all users in one common temp directory. \nPrivileges on the `ANY FILE` securable apply only when you use SQL warehouses or clusters with shared access mode. \n`ANY FILE` respects legacy access patterns for data in cloud object storage, including mounts and storage credentials defined at the compute level. See [Configure access to cloud object storage for Databricks](https:\/\/docs.databricks.com\/connect\/storage\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/table-acls\/any-file.html"} +{"content":"# Data governance with Unity Catalog\n## Hive metastore table access control (legacy)\n#### What is the `ANY FILE` securable?\n##### How does `ANY FILE` interact with Unity Catalog?\n\nWhen using Unity Catalog-enabled shared clusters or SQL warehouses, privileges on the `ANY FILE` securable are evaluated when accessing storage paths or data sources that are *not governed* by Unity Catalog. Privileges on the `ANY FILE` securable are evaluated after all Unity Catalog-related privileges and serve as a fallback for storage paths and connector libraries not managed with Unity Catalog. \nDatabricks recommends using Lakehouse Federation for configuring read-only access to supported external data sources. Lakehouse Federation never requires privileges on the `ANY FILE` securable. See [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). \nUnity Catalog volumes and tables provide full governance for tabular and nontabular data and do not require privileges on the `ANY FILE` securable. 
\nAccess to any data governed by Unity Catalog using URIs cannot use privileges on the `ANY FILE` securable. See [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html). \nYou must have `SELECT` privileges on the `ANY FILE` securable to read using the following patterns on Unity Catalog-enabled shared clusters: \n* Cloud object storage using URIs.\n* Data stored in the DBFS root or using DBFS mounts.\n* Data sources using custom libraries or drivers.\n* JDBC drivers not configured with Lakehouse Federation.\n* External data sources that are not governed by Unity Catalog.\n* Streaming data sources, except tables and volumes governed by Unity Catalog and streams that use table names registered to the Hive metastore.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/table-acls\/any-file.html"} +{"content":"# Data governance with Unity Catalog\n## Hive metastore table access control (legacy)\n#### What is the `ANY FILE` securable?\n##### Concerns about `ANY FILE` securable privileges\n\nPrivileges on the `ANY FILE` securable essentially bypass legacy Hive table ACLs set on database objects. Use discretion when you grant privileges on the `ANY FILE` securable, if you have not fully migrated all tables to Unity Catalog and you still rely on legacy Hive table ACLs for managing access to data. \nPrivileges granted on the `ANY FILE` securable never bypass Unity Catalog data governance. 
However, users who have privileges on the `ANY FILE` securable have expanded ability to configure and access data sources not governed by Unity Catalog.\n\n#### What is the `ANY FILE` securable?\n##### Limitations for `ANY FILE`\n\n`ANY FILE` is a legacy securable that is not reported in the information schema.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/table-acls\/any-file.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Isolation levels and write conflicts on Databricks\n\nThe isolation level of a table defines the degree to which a transaction must be isolated from modifications made by concurrent operations. Write conflicts on Databricks depend on the isolation level. \nDelta Lake provides ACID transaction guarantees between reads and writes. This means that: \n* Multiple writers across multiple clusters can simultaneously modify a table partition. Writers see a consistent snapshot view of the table and writes occur in a serial order.\n* Readers continue to see a consistent snapshot view of the table that the Databricks job started with, even when a table is modified during a job. \nSee [What are ACID guarantees on Databricks?](https:\/\/docs.databricks.com\/lakehouse\/acid.html). \nNote \nDatabricks uses Delta Lake for all tables by default. This article describes behavior for Delta Lake on Databricks. \nImportant \nMetadata changes cause all concurrent write operations to fail. These operations include changes to table protocol, table properties, or data schema. \nStreaming reads fail when they encounter a commit that changes table metadata. If you want the stream to continue, you must restart it. For recommended methods, see [Production considerations for Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/production.html).
\nThe following are examples of queries that change metadata: \n```\n-- Set a table property.\nALTER TABLE table_name SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable');\n\n-- Enable a feature using a table property and update the table protocol.\nALTER TABLE table_name SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);\n\n-- Drop a table feature.\nALTER TABLE table_name DROP FEATURE deletionVectors;\n\n-- Upgrade to UniForm.\nREORG TABLE table_name APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2));\n\n-- Update the table schema.\nALTER TABLE table_name ADD COLUMNS (col_name STRING);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/isolation-level.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Isolation levels and write conflicts on Databricks\n##### Write conflicts with row-level concurrency\n\nRow-level concurrency reduces conflicts between concurrent write operations by detecting changes at the row-level and automatically resolving conflicts that occur when concurrent writes update or delete different rows in the same data file. \nRow-level concurrency is generally available on Databricks Runtime 14.2 and above. Row-level concurrency is supported by default for the following conditions: \n* Tables with deletion vectors enabled and without partitioning.\n* Tables with liquid clustering, unless you have disabled deletion vectors. \nTables with partitions do not support row-level concurrency but can still avoid conflicts between `OPTIMIZE` and all other write operations when deletion vectors are enabled. See [Limitations for row-level concurrency](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html#rlc-limitations). \nFor other Databricks Runtime versions, see [Row-level concurrency preview behavior (legacy)](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html#rlc-preview).
\n`MERGE INTO` support for row-level concurrency requires Photon in Databricks Runtime 14.2. In Databricks Runtime 14.3 LTS and above, Photon is not required. \nThe following table describes which pairs of write operations can conflict in each [isolation level](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html#isolation-levels) with row-level concurrency enabled. \nNote \nTables with identity columns do not support concurrent transactions. See [Use identity columns in Delta Lake](https:\/\/docs.databricks.com\/delta\/generated-columns.html#identity). \n| | INSERT (1) | UPDATE, DELETE, MERGE INTO | OPTIMIZE |\n| --- | --- | --- | --- |\n| **INSERT** | Cannot conflict | | |\n| **UPDATE, DELETE, MERGE INTO** | Cannot conflict in WriteSerializable. Can conflict in Serializable when modifying the same row. See [Limitations for row-level concurrency](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html#rlc-limitations). | Can conflict when modifying the same row. See [Limitations for row-level concurrency](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html#rlc-limitations). | |\n| **OPTIMIZE** | Cannot conflict | Can conflict when `ZORDER BY` is used. Cannot conflict otherwise. | Can conflict when `ZORDER BY` is used. Cannot conflict otherwise. | \nImportant \n**(1)** All `INSERT` operations in the tables above describe append operations that do not read any data from the same table before committing. 
`INSERT` operations that contain subqueries reading the same table support the same concurrency as `MERGE`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/isolation-level.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Isolation levels and write conflicts on Databricks\n##### Write conflicts without row-level concurrency\n\nThe following table describes which pairs of write operations can conflict in each [isolation level](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html#isolation-levels). \nTables do not support row-level concurrency if they have partitions defined or do not have deletion vectors enabled. Databricks Runtime 14.2 or above is required for row-level concurrency. \nNote \nTables with identity columns do not support concurrent transactions. See [Use identity columns in Delta Lake](https:\/\/docs.databricks.com\/delta\/generated-columns.html#identity). \n| | INSERT (1) | UPDATE, DELETE, MERGE INTO | OPTIMIZE |\n| --- | --- | --- | --- |\n| **INSERT** | Cannot conflict | | |\n| **UPDATE, DELETE, MERGE INTO** | Cannot conflict in WriteSerializable. Can conflict in Serializable. See [avoid conflicts with partitions](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html#conflicts). | Can conflict in Serializable and WriteSerializable. See [avoid conflicts with partitions](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html#conflicts). | |\n| **OPTIMIZE** | Cannot conflict | Cannot conflict in tables with deletion vectors enabled, unless `ZORDER BY` is used. Can conflict otherwise. | Cannot conflict in tables with deletion vectors enabled, unless `ZORDER BY` is used. Can conflict otherwise. | \nImportant \n**(1)** All `INSERT` operations in the tables above describe append operations that do not read any data from the same table before committing.
`INSERT` operations that contain subqueries reading the same table support the same concurrency as `MERGE`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/isolation-level.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Isolation levels and write conflicts on Databricks\n##### Limitations for row-level concurrency\n\nSome limitations apply for row-level concurrency. For the following operations, conflict resolution follows normal concurrency for write conflicts on Databricks. See [Write conflicts without row-level concurrency](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html#write-conflicts). \n* Commands with complex conditional clauses, including the following: \n+ Conditions on complex data types such as structs, arrays, or maps.\n+ Conditions using non-deterministic expressions and subqueries.\n+ Conditions that contain correlated subqueries.\n* For `MERGE` commands, you must use an explicit predicate on the target table to filter rows matching the source table. For merge resolution, the filter is used to only scan rows that might conflict based on filter conditions in concurrent operations. \nNote \nRow-level conflict detection can increase the total execution time. In the case of many concurrent transactions, the writer prioritizes latency over conflict resolution and conflicts may occur. \nAll limitations for deletion vectors also apply. See [Limitations](https:\/\/docs.databricks.com\/delta\/deletion-vectors.html#limitations).\n\n#### Isolation levels and write conflicts on Databricks\n##### When does Delta Lake commit without reading the table?\n\nDelta Lake `INSERT` or append operations do not read the table state before committing if the following conditions are satisfied: \n1. Logic is expressed using `INSERT` SQL logic or append mode.\n2. Logic contains no subqueries or conditionals that reference the table targeted by the write operation. 
\nAs in other commits, Delta Lake validates and resolves the table versions on commit using metadata in the transaction log, but no version of the table is actually read. \nNote \nMany common patterns use `MERGE` operations to insert data based on table conditions. Although it might be possible to rewrite this logic using `INSERT` statements, if any conditional expression references a column in the target table, these statements have the same concurrency limitations as `MERGE`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/isolation-level.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Isolation levels and write conflicts on Databricks\n##### Write serializable vs. serializable isolation levels\n\nThe isolation level of a table defines the degree to which a transaction must be isolated from modifications made by concurrent transactions. Delta Lake on Databricks supports two isolation levels: Serializable and WriteSerializable. \n* **Serializable**: The strongest isolation level. It ensures that committed write operations and all reads are [Serializable](https:\/\/en.wikipedia.org\/wiki\/Serializability). Operations are allowed as long as there exists a serial sequence of executing them one-at-a-time that generates the same outcome as that seen in the table. For the write operations, the serial sequence is exactly the same as that seen in the table\u2019s history.\n* **WriteSerializable (Default)**: A weaker isolation level than Serializable. It ensures only that the write operations (that is, not reads) are serializable. However, this is still stronger than [Snapshot](https:\/\/en.wikipedia.org\/wiki\/Snapshot_isolation) isolation. WriteSerializable is the default isolation level because it provides a good balance of data consistency and availability for most common operations.
\nIn this mode, the content of the Delta table may be different from that which is expected from the sequence of operations seen in the table history. This is because this mode allows certain pairs of concurrent writes (say, operations X and Y) to proceed such that the result would be as if Y was performed before X (that is, serializable between them) even though the history would show that Y was committed after X. To disallow this reordering, [set the table isolation level](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html#setting-isolation-level) to be Serializable to cause these transactions to fail. \nRead operations always use snapshot isolation. The write isolation level determines whether or not it is possible for a reader to see a snapshot of a table, that according to the history, \u201cnever existed\u201d. \nFor the Serializable level, a reader always sees only tables that conform to the history. For the WriteSerializable level, a reader could see a table that does not exist in the Delta log. \nFor example, consider txn1, a long running delete and txn2, which inserts data deleted by txn1. txn2 and txn1 complete and they are recorded in that order in the history. According to the history, the data inserted in txn2 should not exist in the table. For Serializable level, a reader would never see data inserted by txn2. However, for the WriteSerializable level, a reader could at some point see the data inserted by txn2. 
\nFor more information on which types of operations can conflict with each other in each isolation level and the possible errors, see [Avoid conflicts using partitioning and disjoint command conditions](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html#conflicts).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/isolation-level.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Isolation levels and write conflicts on Databricks\n##### Set the isolation level\n\nYou set the isolation level using the `ALTER TABLE` command. \n```\nALTER TABLE <table-name> SET TBLPROPERTIES ('delta.isolationLevel' = <level-name>)\n\n``` \nwhere `<level-name>` is `Serializable` or `WriteSerializable`. \nFor example, to change the isolation level from the default `WriteSerializable` to `Serializable`, run: \n```\nALTER TABLE <table-name> SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')\n\n```\n\n#### Isolation levels and write conflicts on Databricks\n##### Avoid conflicts using partitioning and disjoint command conditions\n\nIn all cases marked \u201ccan conflict\u201d, whether the two operations will conflict depends on whether they operate on the same set of files. You can make the two sets of files disjoint by partitioning the table by the same columns as those used in the conditions of the operations. For example, the two commands `UPDATE table WHERE date > '2010-01-01' ...` and `DELETE table WHERE date < '2010-01-01'` will conflict if the table is not partitioned by date, as both can attempt to modify the same set of files. Partitioning the table by `date` will avoid the conflict. Therefore, partitioning a table according to the conditions commonly used on the command can reduce conflicts significantly. 
However, partitioning a table by a column that has high cardinality can lead to other performance issues due to the large number of subdirectories.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/isolation-level.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Isolation levels and write conflicts on Databricks\n##### Conflict exceptions\n\nWhen a transaction conflict occurs, you will observe one of the following exceptions: \n### ConcurrentAppendException \nThis exception occurs when a concurrent operation adds files in the same partition (or anywhere in an unpartitioned table) that your operation reads. The file additions can be caused by `INSERT`, `DELETE`, `UPDATE`, or `MERGE` operations. \nWith the default [isolation level](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html#isolation-levels) of `WriteSerializable`, files added by *blind* `INSERT` operations (that is, operations that blindly append data without reading any data) do not conflict with any operation, even if they touch the same partition (or anywhere in an unpartitioned table). If the isolation level is set to `Serializable`, then blind appends may conflict. \nThis exception is often thrown during concurrent `DELETE`, `UPDATE`, or `MERGE` operations. While the concurrent operations may be physically updating different partition directories, one of them may read the same partition that the other one concurrently updates, thus causing a conflict. You can avoid this by making the separation explicit in the operation condition. Consider the following example. \n```\n\/\/ Target 'deltaTable' is partitioned by date and country\ndeltaTable.as(\"t\").merge(\nsource.as(\"s\"),\n\"s.user_id = t.user_id AND s.date = t.date AND s.country = t.country\")\n.whenMatched().updateAll()\n.whenNotMatched().insertAll()\n.execute()\n\n``` \nSuppose you run the above code concurrently for different dates or countries. 
Since each job is working on an independent partition on the target Delta table, you don\u2019t expect any conflicts. However, because the condition is not explicit enough, the operation can scan the entire table and conflict with concurrent operations updating any other partitions. Instead, you can rewrite your statement to add specific date and country to the merge condition, as shown in the following example. \n```\n\/\/ Target 'deltaTable' is partitioned by date and country\ndeltaTable.as(\"t\").merge(\nsource.as(\"s\"),\n\"s.user_id = t.user_id AND s.date = t.date AND s.country = t.country AND t.date = '\" + <date> + \"' AND t.country = '\" + <country> + \"'\")\n.whenMatched().updateAll()\n.whenNotMatched().insertAll()\n.execute()\n\n``` \nThis operation is now safe to run concurrently on different dates and countries. \n### ConcurrentDeleteReadException \nThis exception occurs when a concurrent operation deleted a file that your operation read. Common causes are a `DELETE`, `UPDATE`, or `MERGE` operation that rewrites files. \n### ConcurrentDeleteDeleteException \nThis exception occurs when a concurrent operation deleted a file that your operation also deletes. This could be caused by two concurrent compaction operations rewriting the same files. \n### MetadataChangedException \nThis exception occurs when a concurrent transaction updates the metadata of a Delta table. Common causes are `ALTER TABLE` operations or writes to your Delta table that update the schema of the table. \n### ConcurrentTransactionException \nThis exception occurs when a streaming query using the same checkpoint location is started multiple times concurrently and tries to write to the Delta table at the same time. You should never have two streaming queries use the same checkpoint location and run at the same time. \n### ProtocolChangedException \nThis exception can occur in the following cases: \n* When your Delta table is upgraded to a new protocol version. 
For future operations to succeed you may need to upgrade your Databricks Runtime.\n* When multiple writers are creating or replacing a table at the same time.\n* When multiple writers are writing to an empty path at the same time. \nSee [How does Databricks manage Delta Lake feature compatibility?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html) for more details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/isolation-level.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Isolation levels and write conflicts on Databricks\n##### Row-level concurrency preview behavior (legacy)\n\nThis section describes preview behaviors for row-level concurrency in Databricks Runtime 14.1 and below. Row-level concurrency always requires deletion vectors. \nIn Databricks Runtime 13.3 LTS and above, tables with liquid clustering enabled automatically enable row-level concurrency. \nIn Databricks Runtime 14.0 and 14.1, you can enable row-level concurrency for tables with deletion vectors by setting the following configuration for the cluster or SparkSession: \n```\nspark.databricks.delta.rowLevelConcurrencyPreview = true\n\n``` \nIn Databricks Runtime 14.1 and below, non-Photon compute only supports row-level concurrency for `DELETE` operations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/isolation-level.html"} +{"content":"# AI and Machine Learning on Databricks\n### Prepare data and environment for ML and DL\n\nThis section describes how to prepare your data and your Databricks environment for machine learning and deep learning.\n\n### Prepare data and environment for ML and DL\n#### Prepare data\n\nThe articles in this section cover aspects of loading and preprocessing data that are specific to ML and DL applications. 
\n* [Load data for machine learning and deep learning](https:\/\/docs.databricks.com\/machine-learning\/load-data\/index.html)\n* [Preprocess data for machine learning and deep learning](https:\/\/docs.databricks.com\/machine-learning\/preprocess-data\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/data-preparation.html"} +{"content":"# AI and Machine Learning on Databricks\n### Prepare data and environment for ML and DL\n#### Prepare environment\n\n[Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html) (Databricks Runtime ML) is a ready-to-go environment optimized for machine learning and data science. Databricks Runtime ML includes many external libraries, including TensorFlow, PyTorch, Horovod, scikit-learn and XGBoost, and provides extensions to improve performance, including GPU acceleration in [XGBoost](https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost.html), distributed deep learning using [HorovodRunner](https:\/\/docs.databricks.com\/machine-learning\/train-model\/distributed-training\/horovod-runner.html), and model checkpointing using a [Databricks File System (DBFS) FUSE mount](https:\/\/docs.databricks.com\/machine-learning\/load-data\/index.html#store-files-for-data-loading-and-model-checkpointing). \nTo use Databricks Runtime ML, select the ML version of the runtime when you [create your cluster](https:\/\/docs.databricks.com\/compute\/index.html). \nNote \nTo access data in Unity Catalog for machine learning workflows, the [access mode](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode) for the cluster must be single user (assigned). Shared clusters are not compatible with Databricks Runtime for Machine Learning. \n### Install libraries \nYou can install additional libraries to create a custom environment for your notebook or cluster. 
\n* To make a library available for all notebooks running on a cluster, [create a cluster library](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html#install-libraries). You can also use an [init script](https:\/\/docs.databricks.com\/init-scripts\/cluster-scoped.html) to install libraries on clusters upon creation.\n* To install a library that is available only to a specific notebook session, use [Notebook-scoped Python libraries](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html). \n### Use GPU clusters \nYou can create GPU clusters to accelerate deep learning tasks. For information about creating Databricks GPU clusters, see [GPU-enabled compute](https:\/\/docs.databricks.com\/compute\/gpu.html). Databricks Runtime ML includes GPU hardware drivers and NVIDIA libraries such as CUDA.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/data-preparation.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### What are ACID guarantees on Databricks?\n\nDatabricks uses Delta Lake by default for all reads and writes and builds upon the ACID guarantees provided by the [open source Delta Lake protocol](https:\/\/delta.io). ACID stands for atomicity, consistency, isolation, and durability. \n* [Atomicity](https:\/\/docs.databricks.com\/lakehouse\/acid.html#atomicity) means that all transactions either succeed or fail completely.\n* [Consistency](https:\/\/docs.databricks.com\/lakehouse\/acid.html#consistency) guarantees relate to how a given state of the data is observed by simultaneous operations.\n* [Isolation](https:\/\/docs.databricks.com\/lakehouse\/acid.html#isolation) refers to how simultaneous operations potentially conflict with one another.\n* [Durability](https:\/\/docs.databricks.com\/lakehouse\/acid.html#durability) means that committed changes are permanent. 
\nWhile many data processing and warehousing technologies describe having ACID transactions, specific guarantees vary by system, and transactions on Databricks might differ from other systems you\u2019ve worked with. \nNote \nThis page describes guarantees for tables backed by Delta Lake. Other data formats and integrated systems might not provide transactional guarantees for reads and writes. \nAll Databricks writes to cloud object storage use transactional commits, which create metadata files starting with `_started_<id>` and `_committed_<id>` alongside data files. You do not need to interact with these files, as Databricks routinely cleans up stale commit metadata files.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/acid.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### What are ACID guarantees on Databricks?\n##### How are transactions scoped on Databricks?\n\nDatabricks manages transactions at the table level. Transactions always apply to one table at a time. For managing concurrent transactions, Databricks uses optimistic concurrency control. This means that there are no locks on reading or writing against a table, and deadlock is not a possibility. \nBy default, Databricks provides snapshot isolation on reads and [write-serializable isolation](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html#isolation-levels) on writes. Write-serializable isolation provides stronger guarantees than snapshot isolation, but it applies that stronger isolation only for writes. \nRead operations referencing multiple tables return the current version of each table at the time of access, but do not interrupt concurrent transactions that might modify referenced tables. \nDatabricks does not have `BEGIN\/END` constructs that allow multiple operations to be grouped together as a single transaction. Applications that modify multiple tables commit transactions to each table in a serial fashion. 
You can combine inserts, updates, and deletes against a table into a single write transaction using `MERGE INTO`.\n\n#### What are ACID guarantees on Databricks?\n##### How does Databricks implement atomicity?\n\nThe transaction log controls commit atomicity. During a transaction, data files are written to the file directory backing the table. When the transaction completes, a new entry is committed to the transaction log that includes the paths to all files written during the transaction. Each commit increments the table version and makes new data files visible to read operations. The current state of the table comprises all data files marked valid in the transaction logs. \nData files are not tracked unless the transaction log records a new version. If a transaction fails after writing data files to a table, these data files will not corrupt the table state, but the files will not become part of the table. The `VACUUM` operation deletes all untracked data files in a table directory, including remaining uncommitted files from failed transactions.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/acid.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### What are ACID guarantees on Databricks?\n##### How does Databricks implement durability?\n\nDatabricks uses cloud object storage to store all data files and transaction logs. Cloud object storage has high availability and durability. Because transactions either succeed or fail completely and the transaction log lives alongside data files in cloud object storage, tables on Databricks inherit the durability guarantees of the cloud object storage on which they\u2019re stored.\n\n#### What are ACID guarantees on Databricks?\n##### How does Databricks implement consistency?\n\nDelta Lake uses optimistic concurrency control to provide transactional guarantees between writes. Under this mechanism, writes operate in three stages: \n1. 
**Read**: Reads (if needed) the latest available version of the table to identify which files need to be modified (that is, rewritten). \n* Writes that are append-only do not read the current table state before writing. Schema validation leverages metadata from the transaction log.\n2. **Write**: Writes data files to the directory used to define the table.\n3. **Validate and commit**: \n* Checks whether the proposed changes conflict with any other changes that may have been concurrently committed since the snapshot that was read.\n* If there are no conflicts, all the staged changes are committed as a new versioned snapshot, and the write operation succeeds.\n* If there are conflicts, the write operation fails with a concurrent modification exception. This failure prevents corruption of data. \nOptimistic concurrency assumes that most concurrent transactions on your data do not conflict with one another, but conflicts can occur. See [Isolation levels and write conflicts on Databricks](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/acid.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### What are ACID guarantees on Databricks?\n##### How does Databricks implement isolation?\n\nDatabricks uses write serializable isolation by default for all table writes and updates. Snapshot isolation is used for all table reads. \nWrite serializability and optimistic concurrency control work together to provide high throughput for writes. The current valid state of a table is always available, and a write can be started against a table at any time. Concurrent reads are only limited by throughput of the metastore and cloud resources. 
\nSee [Isolation levels and write conflicts on Databricks](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html).\n\n#### What are ACID guarantees on Databricks?\n##### Does Delta Lake support multi-table transactions?\n\nDelta Lake does not support multi-table transactions. Delta Lake supports transactions at the *table* level. \nPrimary key and foreign key relationships on Databricks are informational and not enforced. See [Declare primary key and foreign key relationships](https:\/\/docs.databricks.com\/tables\/constraints.html#pk-fk).\n\n#### What are ACID guarantees on Databricks?\n##### What does it mean that Delta Lake supports multi-cluster writes?\n\nDelta Lake prevents data corruption when multiple clusters write to the same table concurrently. Some write operations can conflict during simultaneous execution, but don\u2019t corrupt the table. See [Isolation levels and write conflicts on Databricks](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html). \nNote \nDelta Lake on S3 has several limitations not found on other storage systems. See [Delta Lake limitations on S3](https:\/\/docs.databricks.com\/delta\/s3-limitations.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/acid.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n\nThis article explains how to create and manage shares for Delta Sharing. \nA share is a securable object in Unity Catalog that you can use for sharing the following data assets with one or more recipients: \n* Tables and table partitions\n* Views, including dynamic views that restrict access at the row and column level\n* Volumes\n* Notebooks\n* AI models \nIf you share an entire schema (database), the recipient can access all of the tables, views, models, and volumes in the schema at the moment you share it, along with any data and AI assets that are added to the schema in the future. 
\nA share can contain data and AI assets from only one Unity Catalog metastore. You can add or remove data and AI assets from a share at any time. \nFor more information, see [Shares, providers, and recipients](https:\/\/docs.databricks.com\/data-sharing\/index.html#shares-recipients).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n#### Requirements\n\nTo create a share, you must: \n* Be a metastore admin or have the `CREATE SHARE` privilege for the Unity Catalog metastore where the data you want to share is registered.\n* Create the share using a Databricks workspace that has that Unity Catalog metastore attached. \nTo add tables or views to a share, you must: \n* Be the share owner.\n* Have the `USE CATALOG` and `USE SCHEMA` privilege on the catalog and schema that contain the table or view, or ownership of the catalog or schema.\n* Have the `SELECT` privilege on the table or view. You must keep that privilege in order for the table or view to continue to be shared. If you lose it, the recipient cannot access the table or view through the share. Databricks therefore recommends that you use a group as the share owner. \nTo add volumes to a share, you must: \n* Be the share owner.\n* Have the `USE CATALOG` and `USE SCHEMA` privilege on the catalog and schema that contain the volume, or ownership of the catalog or schema.\n* Have the `READ VOLUME` privilege on the volume. You must keep that privilege in order for the volume to continue to be shared. If you lose it, the recipient cannot access the volume through the share. Databricks therefore recommends that you use a group as the share owner. 
\nTo add models to a share, you must: \n* Be the share owner.\n* Have the `USE CATALOG` and `USE SCHEMA` privilege on the catalog and schema that contain the model, or ownership of the catalog or schema.\n* Have the `EXECUTE` privilege on the model. You must keep that privilege in order for the model to continue to be shared. If you lose it, the recipient cannot access the model through the share. Databricks therefore recommends that you use a group as the share owner. \nTo share an entire schema, you must: \n* Be the share owner and the schema owner, or have `USE SCHEMA`.\n* Have `SELECT` on the schema to share tables.\n* Have `READ VOLUME` on the schema to share volumes. \nTo add notebook files to a share, you must be: \n* The share owner and have CAN READ permission on the notebook. \nTo grant recipient access to a share, you must be one of these: \n* Metastore admin.\n* User with delegated permissions or ownership on both the share and the recipient objects ((`USE SHARE` + `SET SHARE PERMISSION`) or share owner) AND (`USE RECIPIENT` or recipient owner). \nTo view shares, you must be one of these: \n* A metastore admin (can view all)\n* A user with the `USE SHARE` privilege (can view all)\n* The share object owner \nCompute requirements: \n* If you use a Databricks notebook to create the share, your cluster must use Databricks Runtime 11.3 LTS or above and the shared or single-user cluster access mode.\n* If you use SQL statements to add a schema to a share (or update or remove a schema), you must use a SQL warehouse or compute running Databricks Runtime 13.3 LTS or above. 
Doing the same using Catalog Explorer has no compute requirements.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n#### Create a share object\n\nTo create a share, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `CREATE SHARE` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: Metastore admin or user with the `CREATE SHARE` privilege for the metastore. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. Click **Share data**.\n4. Enter the share **Name** and an optional comment. \nRun the following command in a notebook or the Databricks SQL query editor: \n```\nCREATE SHARE [IF NOT EXISTS] <share-name>\n[COMMENT \"<comment>\"];\n\n``` \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). \n```\ndatabricks shares create <share-name>\n\n``` \nYou can use `--comment` to add a comment or `--json` to add assets to the share. For details, see the sections that follow.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n#### Add tables to a share\n\nTo add tables to a share, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or SQL commands in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: Owner of the share object, `USE CATALOG` and `USE SCHEMA` on the catalog and schema that contain the table, and the `SELECT` privilege on the table. You must maintain the `SELECT` privilege for as long as you want to share the table. 
For more information, see [Requirements](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#requirements). \nNote \nIf you are a workspace admin and you inherited the `USE SCHEMA` and `USE CATALOG` permissions on the schema and catalog that contain the table from the workspace admin group, then you cannot add the table to a share. You must first grant yourself the `USE SCHEMA` and `USE CATALOG` permissions on the schema and catalog. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Shares** tab, find the share you want to add a table to and click its name.\n4. Click **Manage assets > Add data assets**.\n5. On the **Add tables** page, select either an entire schema (database) or individual tables and views. \n* To select a table or view, first select the catalog, then the schema that contains the table or view, then the table or view itself. \nYou can search for tables by name, column name, or comment using workspace search. See [Search for workspace objects](https:\/\/docs.databricks.com\/search\/index.html).\n* To select a schema, first select the catalog and then the schema. \nFor detailed information about sharing schemas, see [Add schemas to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#schemas).\n6. (Optional) Click **Advanced table options** to specify the following options. Alias and partitions are not available if you select an entire schema. Table history is included by default if you select an entire schema. \n* **Alias**: An alternate table name to make the table name more readable. The alias is the table name that the recipient sees and must use in queries. Recipients cannot use the actual table name if an alias is specified.\n* **Partition**: Share only part of the table. For example, `(column = 'value')`. 
See [Specify table partitions to share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#partitions) and [Use recipient properties to do partition filtering](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#properties).\n* **History Sharing**: Share the table history to allow recipients to perform time travel queries or read the table with Spark Structured Streaming. Requires Databricks Runtime 12.2 LTS or above. \nNote \nIf, in addition to doing time travel queries and streaming reads, you want your customers to be able to query a table\u2019s change data feed (CDF) using the [table\\_changes() function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/table_changes.html), you must [enable CDF on the table](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html#enable) before you share it `WITH HISTORY`.\n7. Click **Save**. \nRun the following command in a notebook or the Databricks SQL query editor to add a table: \n```\nALTER SHARE <share-name> ADD TABLE <catalog-name>.<schema-name>.<table-name> [COMMENT \"<comment>\"]\n[PARTITION(<clause>)] [AS <alias>]\n[WITH HISTORY | WITHOUT HISTORY];\n\n``` \nRun the following to add an entire schema. The `ADD SCHEMA` command requires a SQL warehouse or compute running Databricks Runtime 13.3 LTS or above. For detailed information about sharing schemas, see [Add schemas to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#schemas). \n```\nALTER SHARE <share-name> ADD SCHEMA <catalog-name>.<schema-name>\n[COMMENT \"<comment>\"];\n\n``` \nOptions include the following. `PARTITION` and `AS <alias>` are not available if you select an entire schema. `WITH HISTORY` is selected by default for all tables if you select an entire schema. \n* `PARTITION(<clause>)`: If you want to share only part of the table, you can specify a partition. 
For example, `(column = 'value')`. See [Specify table partitions to share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#partitions) and [Use recipient properties to do partition filtering](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#properties).\n* `AS <alias>`: An alternate table name, or **Alias**, to make the table name more readable. The alias is the table name that the recipient sees and must use in queries. Recipients cannot use the actual table name if an alias is specified. Use the format `<schema-name>.<table-name>`.\n* `WITH HISTORY` or `WITHOUT HISTORY`: When `WITH HISTORY` is specified, share the table with full history, allowing recipients to perform time travel queries and streaming reads. The default behavior for table sharing is `WITHOUT HISTORY` and for schema sharing is `WITH HISTORY`. Requires Databricks Runtime 12.2 LTS or above. \nNote \nIf, in addition to doing time travel queries and streaming reads, you want your customers to be able to query a table\u2019s change data feed (CDF) using the [table\\_changes() function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/table_changes.html), you must [enable CDF on the table](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html#enable) before you share it `WITH HISTORY`. \nFor more information about `ALTER SHARE` options, see [ALTER SHARE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-share.html). \nTo add a table, run the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). 
\n```\ndatabricks shares update <share-name> \\\n--json '{\n\"updates\": [\n{\n\"action\": \"ADD\",\n\"data_object\": {\n\"name\": \"<table-full-name>\",\n\"data_object_type\": \"TABLE\",\n\"shared_as\": \"<table-alias>\"\n}\n}\n]\n}'\n\n``` \nTo add a schema, run the following Databricks CLI command: \n```\ndatabricks shares update <share-name> \\\n--json '{\n\"updates\": [\n{\n\"action\": \"ADD\",\n\"data_object\": {\n\"name\": \"<schema-full-name>\",\n\"data_object_type\": \"SCHEMA\"\n}\n}\n]\n}'\n\n``` \nNote \nFor tables, and only tables, you can omit `\"data_object_type\"`. \nTo learn about the options listed in this example, view the instructions on the SQL tab. \nTo learn about additional parameters, run `databricks shares update --help` or see [PATCH \/api\/2.1\/unity-catalog\/shares\/](https:\/\/docs.databricks.com\/api\/workspace\/shares\/update) in the REST API reference. \nFor information about removing tables from a share, see [Update shares](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#update).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n#### Specify table partitions to share\n\nTo share only part of a table when you add the table to a share, you can provide a partition specification. You can specify partitions when you add a table to a share or update a share, using Catalog Explorer, the Databricks Unity Catalog CLI, or SQL commands in a Databricks notebook or the Databricks SQL query editor. See [Add tables to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#add-tables) and [Update shares](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#update). 
\n### Basic example \nThe following SQL example shares part of the data in the `inventory` table, partitioned by the `year`, `month`, and `date` columns: \n* Data for the year 2021.\n* Data for December 2020.\n* Data for December 25, 2019. \n```\nALTER SHARE share_name\nADD TABLE inventory\nPARTITION (year = \"2021\"),\n(year = \"2020\", month = \"Dec\"),\n(year = \"2019\", month = \"Dec\", date = \"2019-12-25\");\n\n``` \n### Use recipient properties to do partition filtering \nYou can share a table partition that matches [data recipient properties](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#properties), also known as parameterized partition sharing. \nDefault properties include: \n* `databricks.accountId`: The Databricks account that a data recipient belongs to (Databricks-to-Databricks sharing only).\n* `databricks.metastoreId`: The Unity Catalog metastore that a data recipient belongs to (Databricks-to-Databricks sharing only).\n* `databricks.name`: The name of the data recipient. \nYou can create any custom property you like when you create or update a recipient. \nFiltering by recipient property enables you to share the same tables, using the same share, across multiple Databricks accounts, workspaces, and users while maintaining data boundaries between them. \nFor example, if your tables include a Databricks account ID column, you can create a single share with table partitions defined by Databricks account ID. When you share, Delta Sharing dynamically delivers to each recipient only the data associated with their Databricks account. \n![Diagram of parameter-based dynamic partition sharing in Delta Sharing](https:\/\/docs.databricks.com\/_images\/parameterized-partitions.png) \nWithout the ability to dynamically partition by property, you would have to create a separate share for each recipient. 
\nTo specify a partition that filters by recipient properties when you create or update a share, you can use Catalog Explorer or the `CURRENT_RECIPIENT` SQL function in a Databricks notebook or the Databricks SQL query editor: \nNote \nRecipient properties are available on Databricks Runtime 12.2 and above. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Shares** tab, find the share you want to update and click its name.\n4. Click **Manage assets > Add data assets**.\n5. On the **Add tables** page, select the catalog and database that contain the table, then select the table. \nIf you aren\u2019t sure which catalog and database contain the table, you can search for it by name, column name, or comment using workspace search. See [Search for workspace objects](https:\/\/docs.databricks.com\/search\/index.html).\n6. (Optional) Click **Advanced table options** to add **Partition** specifications. \nOn the **Add partition to a table** dialog, add the property-based partition specification using the following syntax: \n```\n(<column-name> = CURRENT_RECIPIENT(<property-key>))\n\n``` \nFor example, \n```\n(country = CURRENT_RECIPIENT('country'))\n\n```\n7. Click **Save**. 
\nRun the following command in a notebook or the Databricks SQL query editor: \n```\nALTER SHARE <share-name> ADD TABLE <catalog-name>.<schema-name>.<table-name>\nPARTITION (<column-name> = CURRENT_RECIPIENT(<property-key>));\n\n``` \nFor example, \n```\nALTER SHARE acme ADD TABLE acme.default.some_table\nPARTITION (country = CURRENT_RECIPIENT('country'));\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n#### Add tables with deletion vectors or column mapping to a share\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nDeletion vectors are a storage optimization feature that you can enable on Delta tables. See [What are deletion vectors?](https:\/\/docs.databricks.com\/delta\/deletion-vectors.html). \nDatabricks also supports column mapping for Delta tables. See [Rename and drop columns with Delta Lake column mapping](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html). \nTo share a table with deletion vectors or column mapping, you must share it with history. See [Add tables to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#add-tables). \nWhen you share a table with deletion vectors or column mapping, recipients can query the table using a SQL warehouse, a cluster running Databricks Runtime 14.1 or above, or compute that is running open source `delta-sharing-spark` 3.1 or above. 
See [Read tables with deletion vectors or column mapping enabled](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html#deletion-vectors).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n#### Add views to a share\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nViews are read-only objects created from one or more tables or other views. A view can be created from tables and other views that are contained in multiple schemas and catalogs in a Unity Catalog metastore. See [Create views](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-views.html). \nThis section describes how to add views to a share using Catalog Explorer, the Databricks CLI, or SQL commands in a Databricks notebook or the Databricks SQL query editor. If you prefer to use the Unity Catalog REST API, see [PATCH \/api\/2.1\/unity-catalog\/shares\/](https:\/\/docs.databricks.com\/api\/workspace\/shares\/update) in the REST API reference. \n**Permissions required**: Owner of the share object, `USE CATALOG` and `USE SCHEMA` on the catalog and schema that contain the view, and `SELECT` on the view. You must maintain the `SELECT` privilege for as long as you want to share the view. For more information, see [Requirements](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#requirements). 
\n**Additional requirements**: \n* View sharing is supported only in Databricks-to-Databricks sharing.\n* Shareable views must be defined on Delta tables or other shareable views.\n* You cannot share views that reference shared tables or shared views.\n* You must use a SQL warehouse or a cluster on Databricks Runtime 13.3 LTS or above when you add a view to a share.\n* For requirements and limitations on recipient usage of views, see [Read shared views](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html#views). \nTo add views to a share: \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Shares** tab, find the share you want to add a view to and click its name.\n4. Click **Manage assets > Add data assets**.\n5. On the **Add tables** page, search or browse for the view that you want to share and select it.\n6. (Optional) Click **Advanced table options** to specify an **Alias**, or alternate view name, to make the view name more readable. The alias is the name that the recipient sees and must use in queries. Recipients cannot use the actual view name if an alias is specified.\n7. Click **Save**. \nRun the following command in a notebook or the Databricks SQL query editor: \n```\nALTER SHARE <share-name> ADD VIEW <catalog-name>.<schema-name>.<view-name>\n[COMMENT \"<comment>\"]\n[AS <alias>];\n\n``` \nOptions include: \n* `AS <alias>`: An alternate view name, or alias, to make the view name more readable. The alias is the view name that the recipient sees and must use in queries. Recipients cannot use the actual view name if an alias is specified. Use the format `<schema-name>.<view-name>`.\n* `COMMENT \"<comment>\"`: Comments appear in the Catalog Explorer UI and when you list and display view details using SQL statements. 
\nFor more information about `ALTER SHARE` options, see [ALTER SHARE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-share.html). \nRun the following Databricks CLI command: \n```\ndatabricks shares update <share-name> \\\n--json '{\n\"updates\": [\n{\n\"action\": \"ADD\",\n\"data_object\": {\n\"name\": \"<view-full-name>\",\n\"data_object_type\": \"VIEW\",\n\"shared_as\": \"<view-alias>\"\n}\n}\n]\n}'\n\n``` \n`\"shared_as\": \"<view-alias>\"` is optional and provides an alternate view name, or alias, to make the view name more readable. The alias is the view name that the recipient sees and must use in queries. Recipients cannot use the actual view name if an alias is specified. Use the format `<schema-name>.<view-name>`. \nTo learn about additional parameters, run `databricks shares update --help` or see [PATCH \/api\/2.1\/unity-catalog\/shares\/](https:\/\/docs.databricks.com\/api\/workspace\/shares\/update) in the REST API reference. \nFor information about removing views from a share, see [Update shares](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#update).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n#### Add dynamic views to a share to filter rows and columns\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nYou can use dynamic views to configure fine-grained access control to table data, including: \n* Security at the level of columns or rows.\n* Data masking. \nWhen you create a dynamic view that uses the [CURRENT\\_RECIPIENT() function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/current_recipient.html), you can limit recipient access according to properties that you specify in the recipient definition. 
\nThis section provides examples of restricting recipient access to table data at both the row and column level using a dynamic view. \n### Requirements \n* **Databricks Runtime version**: The `CURRENT_RECIPIENT` function is supported in Databricks Runtime 14.2 and above.\n* **Permissions**: \n+ To create a view, you must be the owner of the share object, have `USE CATALOG` and `USE SCHEMA` on the catalog and schema that contain the view, along with `SELECT` on the view. You must maintain the `SELECT` privilege for as long as you want to share the view.\n+ To set properties on a recipient, you must be the owner of the recipient object.\n* **Limitations**: All limitations for [view sharing](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#views), including restriction to Databricks-to-Databricks sharing, plus the following: \n+ When a provider shares a view that uses the `CURRENT_RECIPIENT` function, the provider can\u2019t query the view directly because of the sharing context. To test such a dynamic view, the provider must share the view with themselves and query the view as a recipient.\n+ Providers cannot create a view that references a dynamic view. \n### Set a recipient property \nIn these examples, the table to be shared has a column named `country`, and only recipients with a matching `country` property can view certain rows or columns. \nYou can set recipient properties using Catalog Explorer or SQL commands in a Databricks notebook or the SQL query editor. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Recipients** tab, find the recipient you want to add the properties to and click its name.\n4. Click **Edit properties**.\n5. 
On the **Edit recipient properties** dialog, enter the column name as a key (in this case `country`) and the value you want to filter by as the value (for example, `CA`).\n6. Click **Save**. \nTo set the property on the recipient, use `ALTER RECIPIENT`. In this example, the `country` property is set to `CA`. \n```\nALTER RECIPIENT recipient1 SET PROPERTIES ('country' = 'CA');\n\n``` \n### Create a dynamic view with row-level permission for recipients \nIn this example, only recipients with a matching `country` property can view certain rows. \n```\nCREATE VIEW my_catalog.default.view1 AS\nSELECT * FROM my_catalog.default.my_table\nWHERE country = CURRENT_RECIPIENT('country');\n\n``` \nAnother option is for the data provider to maintain a separate mapping table that maps fact table fields to recipient properties, allowing recipient properties and fact table fields to be decoupled for greater flexibility. \n### Create a dynamic view with column-level permission for recipients \nIn this example, only recipients that match the `country` property can view certain columns. Others see the data returned as `REDACTED`: \n```\nCREATE VIEW my_catalog.default.view2 AS\nSELECT\nCASE\nWHEN CURRENT_RECIPIENT('country') = 'US' THEN pii\nELSE 'REDACTED'\nEND AS pii\nFROM my_catalog.default.my_table;\n\n``` \n### Share the dynamic view with a recipient \nTo share the dynamic view with a recipient, use the same SQL commands or UI procedure as you would for a standard view. See [Add views to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#views).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n#### Add volumes to a share\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
\nVolumes are Unity Catalog objects that represent a logical volume of storage in a cloud object storage location. They are intended primarily to provide governance over non-tabular data assets. See [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html). \nThis section describes how to add volumes to a share using Catalog Explorer, the Databricks CLI, or SQL commands in a Databricks notebook or SQL query editor. If you prefer to use the Unity Catalog REST API, see [PATCH \/api\/2.1\/unity-catalog\/shares\/](https:\/\/docs.databricks.com\/api\/workspace\/shares\/update) in the REST API reference. \n**Permissions required**: Owner of the share object, `USE CATALOG` and `USE SCHEMA` on the catalog and schema that contain the volume, and `READ VOLUME` on the volume. You must maintain the `READ VOLUME` privilege for as long as you want to share the volume. For more information, see [Requirements](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#requirements). \n**Additional requirements**: \n* Volume sharing is supported only in Databricks-to-Databricks sharing.\n* You must use a SQL warehouse on version 2023.50 or above or a cluster on Databricks Runtime 14.1 or above when you add a volume to a share. \nTo add volumes to a share: \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Shares** tab, find the share you want to add a volume to and click its name.\n4. Click **Manage assets > Edit assets**.\n5. On the **Edit assets** page, search or browse for the volume that you want to share and select it. \nAlternatively, you can select the entire schema that contains the volume. See [Add schemas to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#schemas).\n6. 
(Optional) Click **Advanced options** to specify an alternate volume name, or **Alias**, to make the volume name more readable. \nAliases are not available if you select an entire schema. \nThe alias is the name that the recipient sees and must use in queries. Recipients cannot use the actual volume name if an alias is specified.\n7. Click **Save**. \nRun the following command in a notebook or the Databricks SQL query editor: \n```\nALTER SHARE <share-name> ADD VOLUME <catalog-name>.<schema-name>.<volume-name>\n[COMMENT \"<comment>\"]\n[AS <alias>];\n\n``` \nOptions include: \n* `AS <alias>`: An alternate volume name, or alias, to make the volume name more readable. The alias is the volume name that the recipient sees and must use in queries. Recipients cannot use the actual volume name if an alias is specified. Use the format `<schema-name>.<volume-name>`.\n* `COMMENT \"<comment>\"`: Comments appear in the Catalog Explorer UI and when you list and display volume details using SQL statements. \nFor more information about `ALTER SHARE` options, see [ALTER SHARE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-share.html). \nRun the following command using Databricks CLI 0.210 or above: \n```\ndatabricks shares update <share-name> \\\n--json '{\n\"updates\": [\n{\n\"action\": \"ADD\",\n\"data_object\": {\n\"name\": \"<volume-full-name>\",\n\"data_object_type\": \"VOLUME\",\n\"string_shared_as\": \"<volume-alias>\"\n}\n}\n]\n}'\n\n``` \n`\"string_shared_as\": \"<volume-alias>\"` is optional and provides an alternate volume name, or alias, to make the volume name more readable. The alias is the volume name that the recipient sees and must use in queries. Recipients cannot use the actual volume name if an alias is specified. Use the format `<schema-name>.<volume-name>`. 
\nTo learn about additional parameters, run `databricks shares update --help` or see [PATCH \/api\/2.1\/unity-catalog\/shares\/](https:\/\/docs.databricks.com\/api\/workspace\/shares\/update) in the REST API reference. \nFor information about removing volumes from a share, see [Update shares](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#update).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n#### Add models to a share\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis section describes how to add models to a share using Catalog Explorer, the Databricks CLI, or SQL commands in a Databricks notebook or SQL query editor. If you prefer to use the Unity Catalog REST API, see [PATCH \/api\/2.1\/unity-catalog\/shares\/](https:\/\/docs.databricks.com\/api\/workspace\/shares\/update) in the REST API reference. \n**Permissions required**: Owner of the share object, `USE CATALOG` and `USE SCHEMA` on the catalog and schema that contain the model, and `EXECUTE` on the model. You must maintain the `EXECUTE` privilege for as long as you want to share the model. For more information, see [Requirements](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#requirements). \n**Additional requirements**: \n* Model sharing is supported only in Databricks-to-Databricks sharing.\n* You must use a SQL warehouse on version 2023.50 or above or a cluster on Databricks Runtime 14.0 or above when you add a model to a share. \nTo add models to a share: \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. 
On the **Shares** tab, find the share you want to add a model to and click its name.\n4. Click **Manage assets > Edit assets**.\n5. On the **Edit assets** page, search or browse for the model that you want to share and select it. \nAlternatively, you can select the entire schema that contains the model. See [Add schemas to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#schemas).\n6. (Optional) Click **Advanced options** to specify an alternate model name, or **Alias**, to make the model name more readable. \nAliases are not available if you select an entire schema. \nThe alias is the name that the recipient sees and must use in queries. Recipients cannot use the actual model name if an alias is specified.\n7. Click **Save**. \nRun the following command in a notebook or the Databricks SQL query editor: \n```\nALTER SHARE <share-name> ADD MODEL <catalog-name>.<schema-name>.<model-name>\n[COMMENT \"<comment>\"]\n[AS <alias>];\n\n``` \nOptions include: \n* `AS <alias>`: An alternate model name, or alias, to make the model name more readable. The alias is the model name that the recipient sees and must use in queries. Recipients cannot use the actual model name if an alias is specified. Use the format `<schema-name>.<model-name>`.\n* `COMMENT \"<comment>\"`: Comments appear in the Catalog Explorer UI and when you list and display model details using SQL statements. \nFor more information about `ALTER SHARE` options, see [ALTER SHARE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-share.html). 
\nRun the following command using Databricks CLI 0.210 or above: \n```\ndatabricks shares update <share-name> \\\n--json '{\n\"updates\": [\n{\n\"action\": \"ADD\",\n\"data_object\": {\n\"name\": \"<model-full-name>\",\n\"data_object_type\": \"MODEL\",\n\"string_shared_as\": \"<model-alias>\"\n}\n}\n]\n}'\n\n``` \n`\"string_shared_as\": \"<model-alias>\"` is optional and provides an alternate model name, or alias, to make the model name more readable. The alias is the model name that the recipient sees and must use in queries. Recipients cannot use the actual model name if an alias is specified. Use the format `<schema-name>.<model-name>`. \nTo learn about additional parameters, run `databricks shares update --help` or see [PATCH \/api\/2.1\/unity-catalog\/shares\/](https:\/\/docs.databricks.com\/api\/workspace\/shares\/update) in the REST API reference. \nFor information about removing models from a share, see [Update shares](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#update).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n#### Add schemas to a share\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nWhen you add an entire schema to a share, your recipients will have access not only to all of the data assets in the schema at the time that you create the share, but any assets that are added to the schema over time. This includes all tables, views, and volumes in the schema. Tables shared this way always include full history. \nAdding, updating, or removing a schema using SQL requires a SQL warehouse or compute running Databricks Runtime 13.3 LTS or above. Doing the same using Catalog Explorer has no compute requirements. 
\n**Permissions required**: Owner of the share object and owner of the schema (or a user with `USE SCHEMA` and `SELECT` privileges on the schema). \nTo add a schema to a share, follow the instructions in [Add tables to a share](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#add-tables), paying attention to the content that specifies how to add a schema. \nTable aliases, partitions, and volume aliases are not available if you select an entire schema. If you have created aliases or partitions for any assets in the schema, these are removed when you add the entire schema to the share. \nIf you want to specify advanced options for a table or volume that you are sharing using schema sharing, you must share the table or volume using SQL and give the table or volume an alias with a different schema name.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n#### Add notebook files to a share\n\nUse Catalog Explorer to add a notebook file to a share. \nNote \nTo share notebooks, your metastore must have metastore-level storage. \n**Permissions required**: Owner of the share object and CAN READ permission on the notebook you want to share. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Shares** tab, find the share you want to add a notebook to and click its name.\n4. Click **Manage assets** and select **Add notebook file**.\n5. On the **Add notebook file** page, click the file icon to browse for the notebook you want to share. \nClick the file you want to share and click **Select**. \n(Optionally) specify a user-friendly alias for the file in the **Share as** field. This is the identifier that recipients will see.\n6. Click **Save**. 
\nThe shared notebook file now appears in the **Notebook files** list on the **Assets** tab. \n### Remove notebook files from shares \nTo remove a notebook file from a share: \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Shares** tab, find the share that includes the notebook, and click the share name.\n4. On the **Assets** tab, find the notebook file you want to remove from the share.\n5. Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu (also known as the three-dot menu) to the right of the row, and select **Delete notebook file**.\n6. On the confirmation dialog, click **Delete**. \n### Update notebook files in shares \nTo update a notebook that you have already shared, you must re-add it, giving it a new alias in the **Share as** field. Databricks recommends that you use a name that indicates the notebook\u2019s revised status, such as `<old-name>-update-1`. You may need to notify the recipient of the change. The recipient must select and clone the new notebook to take advantage of your update.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n#### Grant recipients access to a share\n\nTo grant share access to recipients, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `GRANT ON SHARE` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: One of the following: \n* Metastore admin.\n* Delegated permissions or ownership on both the share and the recipient objects ((`USE SHARE` + `SET SHARE PERMISSION`) or share owner) AND (`USE RECIPIENT` or recipient owner). 
\nFor instructions, see [Manage access to Delta Sharing data shares (for providers)](https:\/\/docs.databricks.com\/data-sharing\/grant-access.html). This article also explains how to revoke a recipient\u2019s access to a share.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n#### View shares and share details\n\nTo view a list of shares or details about a share, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or SQL commands in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: The list of shares returned depends on your role and permissions. Metastore admins and users with the `USE SHARE` privilege see all shares. Otherwise, you can view only the shares for which you are the share object owner. \nDetails include: \n* The share\u2019s owner, creator, creation timestamp, updater, updated timestamp, comments.\n* Data assets in the share.\n* Recipients with access to the share. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. Open the **Shares** tab to view a list of shares.\n4. View share details on the **Details** tab. \nTo view a list of shares, run the following command in a notebook or the Databricks SQL query editor. Optionally, replace `<pattern>` with a [`LIKE` predicate](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/like.html). \n```\nSHOW SHARES [LIKE <pattern>];\n\n``` \nTo view details about a specific share, run the following command. \n```\nDESCRIBE SHARE <share-name>;\n\n``` \nTo view details about all tables, views, and volumes in a share, run the following command. 
\n```\nSHOW ALL IN SHARE <share-name>;\n\n``` \nTo view a list of shares, run the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). \n```\ndatabricks shares list\n\n``` \nTo view details about a specific share, run the following command. \n```\ndatabricks shares get <share-name>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n#### View the recipients who have permissions on a share\n\nTo view the list of recipients who have been granted access to a share, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `SHOW GRANTS ON SHARE` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required**: Metastore admin, `USE SHARE` privilege, or share object owner. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Shares** tab, find and select the share.\n4. Go to the **Recipients** tab to view the list of recipients who can access the share. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nSHOW GRANTS ON SHARE <share-name>;\n\n``` \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). 
\n```\ndatabricks shares share-permissions <share-name>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n#### Update shares\n\nIn addition to [adding tables](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#add-tables), [views](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#views), [volumes](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#volumes), and [notebooks](https:\/\/docs.databricks.com\/data-sharing\/create-share.html#add-remove-notebook-files) to a share, you can: \n* Rename a share.\n* Remove tables, views, volumes, and schemas from a share.\n* Add or update a comment on a share.\n* Enable or disable access to a table\u2019s history data, allowing recipients to perform time travel queries or streaming reads of the table.\n* Add, update, or remove partition definitions.\n* Change the share owner. \nTo make these updates to shares, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or SQL commands in a Databricks notebook or the Databricks SQL query editor. You cannot use Catalog Explorer to rename the share. \n**Permissions required**: To update the share owner, you must be one of the following: a metastore admin, the owner of the share object, or a user with both the `USE SHARE` and `SET SHARE PERMISSION` privileges. To update the share name, you must be a metastore admin (or user with the `CREATE_SHARE` privilege) *and* share owner. To update any other share properties, you must be the owner. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Shares** tab, find the share you want to update and click its name. 
\nOn the share details page, you can do the following: \n* Click the ![Edit icon](https:\/\/docs.databricks.com\/_images\/pencil-edit-icon.png) edit icon next to the Owner or Comment field to update these values.\n* Click the vertical ellipsis ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) button in an asset row to remove it.\n* Click **Manage assets > Edit assets** to update all other properties: \n+ To remove an asset, clear the checkbox next to the asset.\n+ To add, update, or remove partition definitions, click **Advanced options**. \nRun the following commands in a notebook or the Databricks SQL query editor. \nRename a share: \n```\nALTER SHARE <share-name> RENAME TO <new-share-name>;\n\n``` \nRemove tables from a share: \n```\nALTER SHARE <share-name> REMOVE TABLE <table-name>;\n\n``` \nRemove volumes from a share: \n```\nALTER SHARE <share-name> REMOVE VOLUME <volume-name>;\n\n``` \nAdd or update a comment on a share: \n```\nCOMMENT ON SHARE <share-name> IS '<comment>';\n\n``` \nAdd or modify partitions for a table in a share: \n```\nALTER SHARE <share-name> ADD TABLE <table-name> PARTITION(<clause>);\n\n``` \nChange share owner: \n```\nALTER SHARE <share-name> OWNER TO '<principal>';\n\n-- Principal must be an account-level user email address or group name.\n\n``` \nEnable history sharing for a table: \n```\nALTER SHARE <share-name> ADD TABLE <table-name> WITH HISTORY;\n\n``` \nFor details about `ALTER SHARE` parameters, see [ALTER SHARE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-share.html). \nRun the following commands using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). 
\nRename a share: \n```\ndatabricks shares update <share-name> --name <new-share-name>\n\n``` \nRemove tables from a share: \n```\ndatabricks shares update <share-name> \\\n--json '{\n\"updates\": [\n{\n\"action\": \"REMOVE\",\n\"data_object\": {\n\"name\": \"<table-full-name>\",\n\"data_object_type\": \"TABLE\",\n\"shared_as\": \"<table-alias>\"\n}\n}\n]\n}'\n\n``` \nRemove volumes from a share (using Databricks CLI 0.210 or above): \n```\ndatabricks shares update <share-name> \\\n--json '{\n\"updates\": [\n{\n\"action\": \"REMOVE\",\n\"data_object\": {\n\"name\": \"<volume-full-name>\",\n\"data_object_type\": \"VOLUME\",\n\"string_shared_as\": \"<volume-alias>\"\n}\n}\n]\n}'\n\n``` \nNote \nUse the `name` property if there is no alias for the volume. Use `string_shared_as` if there is an alias. \nAdd or update a comment on a share: \n```\ndatabricks shares update <share-name> --comment '<comment>'\n\n``` \nChange share owner: \n```\ndatabricks shares update <share-name> --owner '<principal>'\n\n``` \nPrincipal must be an account-level user email address or group name.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Create and manage shares for Delta Sharing\n#### Delete a share\n\nTo delete a share, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the `DROP SHARE` SQL command in a Databricks notebook or the Databricks SQL query editor. You must be the owner of the share. \nWhen you delete a share, recipients can no longer access the shared data. \n**Permissions required**: Share object owner. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **Delta Sharing** menu and select **Shared by me**.\n3. On the **Shares** tab, find the share you want to delete and click its name.\n4. 
Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu (also known as the three-dot menu) and select **Delete**.\n5. On the confirmation dialog, click **Delete**. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nDROP SHARE [IF EXISTS] <share-name>;\n\n``` \nRun the following command using the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). \n```\ndatabricks shares delete <share-name>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/create-share.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Access storage using a service principal & Microsoft Entra ID(Azure Active Directory)\n\nRegistering an application with Microsoft Entra ID (formerly Azure Active Directory) creates a service principal you can use to provide access to Azure storage accounts. \nYou can then configure access to these service principals using credentials stored with [secrets](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html). Databricks recommends using Microsoft Entra ID service principals scoped to clusters or SQL warehouses to configure data access. 
See [Connect to Azure Data Lake Storage Gen2 and Blob Storage](https:\/\/docs.databricks.com\/connect\/storage\/azure-storage.html) and [Enable data access configuration](https:\/\/docs.databricks.com\/admin\/sql\/data-access-configuration.html).\n\n#### Access storage using a service principal & Microsoft Entra ID(Azure Active Directory)\n##### Register a Microsoft Entra ID application\n\n[Registering a Microsoft Entra ID (formerly Azure Active Directory) application](https:\/\/learn.microsoft.com\/azure\/active-directory\/develop\/howto-create-service-principal-portal#register-an-application-with-azure-ad-and-create-a-service-principal) and assigning appropriate permissions will create a service principal that can access Azure Data Lake Storage Gen2 or Blob Storage resources. \nTo register a Microsoft Entra ID application, you must have the `Application Administrator` role or the `Application.ReadWrite.All` permission in Microsoft Entra ID. \n1. In the Azure portal, go to the **Microsoft Entra ID** service.\n2. Under **Manage**, click **App Registrations**.\n3. Click **+ New registration**. Enter a name for the application and click **Register**.\n4. Click **Certificates & Secrets**.\n5. Click **+ New client secret**.\n6. Add a description for the secret and click **Add**.\n7. Copy and save the value for the new secret.\n8. In the application registration overview, copy and save the **Application (client) ID** and **Directory (tenant) ID**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/aad-storage-service-principal.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Access storage using a service principal & Microsoft Entra ID(Azure Active Directory)\n##### Assign roles\n\nYou control access to storage resources by assigning roles to a Microsoft Entra ID application registration associated with the storage account. 
You might need to assign other roles depending on specific requirements. \nTo assign roles on a storage account you must have the Owner or User Access Administrator Azure RBAC role on the storage account. \n1. In the Azure portal, go to the **Storage accounts** service.\n2. Select an Azure storage account to use with this application registration.\n3. Click **Access Control (IAM)**.\n4. Click **+ Add** and select **Add role assignment** from the dropdown menu.\n5. Set the **Select** field to the Microsoft Entra ID application name and set **Role** to **Storage Blob Data Contributor**.\n6. Click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/aad-storage-service-principal.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### File management operations for Unity Catalog volumes\n\nCatalog Explorer provides options for common file management tasks for files stored with Unity Catalog volumes. \nSee [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html).\n\n#### File management operations for Unity Catalog volumes\n##### Rename or delete volume\n\nClick the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) next to **Upload to this volume** to see options to **Rename** or **Delete** the volume.\n\n#### File management operations for Unity Catalog volumes\n##### Set permissions on a volume\n\nYou can use Catalog Explorer to manage permissions on a volume or assign a new principal as the owner of a volume. 
See [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html) and [Manage Unity Catalog object ownership](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/ownership.html).\n\n#### File management operations for Unity Catalog volumes\n##### Upload files to volume\n\nThe **Upload to this volume** button opens a dialog to upload files. See [Upload files to a Unity Catalog volume](https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-to-volume.html). \nUploaded files cannot exceed 5 GB.\n\n#### File management operations for Unity Catalog volumes\n##### Manage files in volumes\n\nClick the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) next to a file name to perform the following actions: \n* Copy path\n* Download file\n* Delete file\n* Create table\n\n","doc_uri":"https:\/\/docs.databricks.com\/catalog-explorer\/manage-volumes.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### File management operations for Unity Catalog volumes\n##### Create table from volumes\n\nDatabricks provides a UI to create a Unity Catalog managed table from a file stored in a Unity Catalog volume. \nYou must have `CREATE TABLE` permissions in the target schema and have access to a running SQL warehouse. \nYou can use the provided UI to make the following selections: \n* Choose to **Create new table** or **Overwrite existing table**\n* Select the target **Catalog** and **Schema**.\n* Specify the **Table name**.\n* Override default column names and types, or choose to exclude columns. \nNote \nClick **Advanced attributes** to view additional options. \nClick **Create table** to create the table in the specified location. 
Upon completion, Catalog Explorer displays the table details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/catalog-explorer\/manage-volumes.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n#### Performance efficiency for the data lakehouse\n\nThis article covers architectural principles of the **performance efficiency** pillar, referring to the ability of a system to adapt to load changes. \n![Performance efficiency lakehouse architecture diagram for Databricks.](https:\/\/docs.databricks.com\/_images\/performance-efficiency.png)\n\n#### Performance efficiency for the data lakehouse\n##### Principles of performance efficiency\n\n1. **Use serverless services** \nServerless services do not require customers to operate and maintain computing infrastructure in the cloud. This eliminates the operational overhead of managing cloud infrastructure and reduces transaction costs because managed services operate at cloud scale. They also provide immediate availability, out-of-the-box security, and require minimal configuration or administration.\n2. **Design workloads for performance** \nFor repeated workloads, such as data engineering pipelines, performance should never be an afterthought. Data must be: \n* Efficiently read from object memory.\n* Efficiently transformed.\n* Efficiently published for consumption. \nIn addition, most pipelines or consumption patterns use a chain of systems. To achieve the best possible performance, the entire chain must be considered and selected for the best performance.\n3. **Run performance testing in the scope of development** \nEvery development workload must undergo continuous performance testing. The tests ensure that any change to the code base does not adversely affect the performance of the workload. Establish a regular schedule for running tests. 
Run the test as part of a scheduled event or as part of a continuous integration build pipeline. \nEstablish performance baselines and determine the current efficiency of the workloads and supporting infrastructure. Measuring performance against baselines can provide strategies for improvement and determine if the application meets business objectives. \nIdentify bottlenecks that may be affecting performance. These bottlenecks can be caused by code errors or misconfiguration of a service. Typically, bottlenecks get worse as load increases.\n4. **Monitor performance** \nEnsure that resources and services remain accessible, and that performance meets user expectations or workload requirements. Monitoring can help you identify bottlenecks or insufficient resources, optimize configurations, and detect pipeline\/workload errors.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/performance-efficiency\/index.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n#### Performance efficiency for the data lakehouse\n##### Next: Best practices for performance efficiency\n\nSee [Best practices for performance efficiency](https:\/\/docs.databricks.com\/lakehouse-architecture\/performance-efficiency\/best-practices.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/performance-efficiency\/index.html"} +{"content":"# Model serving with Databricks\n### Deploy custom models\n\nThis article describes support for deploying a **custom model** using [Databricks Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html). 
It also provides details about supported model logging options and compute types, how to package model dependencies for serving, and endpoint creation and scaling.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/custom-models.html"} +{"content":"# Model serving with Databricks\n### Deploy custom models\n#### What are custom models?\n\nModel Serving can deploy any Python model as a production-grade API. Databricks refers to such models as **custom models**. These ML models can be trained using standard ML libraries like scikit-learn, XGBoost, PyTorch, and HuggingFace transformers and can include any Python code. \nTo deploy a **custom model**, \n1. Log the model or code in the MLflow format, using either native [MLflow built-in flavors](https:\/\/mlflow.org\/docs\/latest\/models.html#built-in-model-flavors) or [pyfunc](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.pyfunc.html#module-mlflow.pyfunc).\n2. After the model is logged, register it in the Unity Catalog (recommended) or the workspace registry.\n3. From here, you can create a model serving endpoint to deploy and query your model. \n1. See [Create custom model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html)\n2. See [Query serving endpoints for custom models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-custom-model-endpoints.html). \nFor a complete tutorial on how to serve custom models on Databricks, see [Model serving tutorial](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-intro.html). \nDatabricks also supports serving foundation models for generative AI applications, see [Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html) and [External models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html) for supported models and compute offerings. 
\nImportant \nIf you rely on Anaconda, review the [terms of service](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/custom-models.html#anaconda-notice) notice for additional information.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/custom-models.html"} +{"content":"# Model serving with Databricks\n### Deploy custom models\n#### Log ML models\n\nThere are different methods to log your ML model for model serving. The following list summarizes the supported methods and examples. \n* **Autologging**. This method is automatically enabled when using Databricks Runtime for ML. \n```\nimport mlflow\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.datasets import load_iris\n\niris = load_iris()\nmodel = RandomForestRegressor()\nmodel.fit(iris.data, iris.target)\n\n```\n* **Log using MLflow\u2019s built-in flavors**. You can use this method if you want to manually log the model for more detailed control. \n```\nimport mlflow\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.datasets import load_iris\n\niris = load_iris()\nmodel = RandomForestClassifier()\nmodel.fit(iris.data, iris.target)\n\nwith mlflow.start_run():\n    mlflow.sklearn.log_model(model, \"random_forest_classifier\")\n\n```\n* **Custom logging with `pyfunc`**. You can use this method for deploying arbitrary Python code models or deploying additional code alongside your model. \n```\nimport mlflow\nimport mlflow.pyfunc\n\nclass Model(mlflow.pyfunc.PythonModel):\n    def predict(self, context, model_input):\n        return model_input * 2\n\nwith mlflow.start_run():\n    mlflow.pyfunc.log_model(\"custom_model\", python_model=Model())\n\n```\n* **Download from HuggingFace**. You can download a model directly from Hugging Face and log that model for serving. For examples, see [Notebook examples](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html#notebooks). 
\n### Signature and input examples \nAdding a signature and input example to MLflow is recommended. Signatures are necessary for logging models to the Unity Catalog. \nThe following is a signature example: \n```\nfrom mlflow.models.signature import infer_signature\n\nsignature = infer_signature(training_data, model.predict(training_data))\nmlflow.sklearn.log_model(model, \"model\", signature=signature)\n\n``` \nThe following is an input example: \n```\n\ninput_example = {\"feature1\": 0.5, \"feature2\": 3}\nmlflow.sklearn.log_model(model, \"model\", input_example=input_example)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/custom-models.html"} +{"content":"# Model serving with Databricks\n### Deploy custom models\n#### Compute type\n\nNote \nGPU model serving is in Public Preview. \nDatabricks Model Serving provides a variety of CPU and GPU options for deploying your model. When deploying with a GPU, it is essential to make sure that your code is set up so that predictions are run on the GPU, using the methods provided by your framework. MLflow does this automatically for models logged with the PyTorch or Transformers flavors. \n| workload type | GPU instance | memory |\n| --- | --- | --- |\n| `CPU` | | 4GB per concurrency |\n| `GPU_SMALL` | 1xT4 | 16GB |\n| `GPU_MEDIUM` | 1xA10G | 24GB |\n| `MULTIGPU_MEDIUM` | 4xA10G | 96GB |\n| `GPU_MEDIUM_8` | 8xA10G | 192GB |\n| `GPU_LARGE_8` | 8xA100-80GB | 320GB |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/custom-models.html"} +{"content":"# Model serving with Databricks\n### Deploy custom models\n#### Deployment container and dependencies\n\nDuring deployment, a production-grade container is built and deployed as the endpoint. This container includes libraries automatically captured or specified in the MLflow model. 
\nThe model serving container doesn\u2019t contain pre-installed dependencies, which might lead to dependency errors if not all required dependencies are included in the model. When running into model deployment issues, Databricks recommends you test the model locally. \n### Package and code dependencies \nCustom or private libraries can be added to your deployment. See [Use custom Python libraries with Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/private-libraries-model-serving.html). \nFor MLflow native flavor models, the necessary package dependencies are automatically captured. \nFor custom `pyfunc` models, dependencies can be explicitly added. \nYou can add package dependencies using: \n* The `pip_requirements` parameter: \n```\nmlflow.sklearn.log_model(model, \"sklearn-model\", pip_requirements = [\"scikit-learn\", \"numpy\"])\n\n```\n* The `conda_env` parameter: \n```\n\nconda_env = {\n'channels': ['defaults'],\n'dependencies': [\n'python=3.7.0',\n'scikit-learn=0.21.3'\n],\n'name': 'mlflow-env'\n}\n\nmlflow.sklearn.log_model(model, \"sklearn-model\", conda_env = conda_env)\n\n```\n* To include additional requirements beyond what is automatically captured, use `extra_pip_requirements`. \n```\nmlflow.sklearn.log_model(model, \"sklearn-model\", extra_pip_requirements = [\"sklearn_req\"])\n\n``` \nIf you have code dependencies, these can be specified using `code_path`. \n```\nmlflow.sklearn.log_model(model, \"sklearn-model\", code_path=[\"path\/to\/helper_functions.py\"],)\n\n``` \n### Dependency validation \nPrior to deploying a custom MLflow model, it is beneficial to verify that the model is capable of being served. MLflow provides an API that allows for validation of the model artifact that both simulates the deployment environment and allows for testing of modified dependencies. 
\nThere are two pre-deployment validation APIs: the [MLflow Python API](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.models.html#mlflow.models.predict) and the [MLflow CLI](https:\/\/mlflow.org\/docs\/latest\/cli.html#mlflow-models-predict). \nYou can specify the following using either of these APIs. \n* The `model_uri` of the model that is deployed to model serving.\n* One of the following: \n+ The `input_data` in the expected format for the `mlflow.pyfunc.PyFuncModel.predict()` call of the model.\n+ The `input_path` that defines a file containing input data that will be loaded and used for the call to `predict`.\n* The `content_type` in `csv` or `json` format.\n* An optional `output_path` to write the predictions to a file. If you omit this parameter, the predictions are printed to `stdout`.\n* An environment manager, `env_manager`, that is used to build the environment for serving: \n+ The default is `virtualenv`. Recommended for serving validation.\n+ `local` is available, but potentially error prone for serving validation. Generally used only for rapid debugging.\n* Whether to install the current version of MLflow that is in your environment with the virtual environment using `install_mlflow`. This setting defaults to `False`.\n* Whether to update and test different versions of package dependencies for troubleshooting or debugging. You can specify this as a list of string dependency overrides or additions using the override argument, `pip_requirements_override`. 
\nFor example: \n```\nimport mlflow\n\nrun_id = \"...\"\nmodel_uri = f\"runs:\/{run_id}\/model\"\n\nmlflow.models.predict(\nmodel_uri=model_uri,\ninput_data={\"col1\": 34.2, \"col2\": 11.2, \"col3\": \"green\"},\ncontent_type=\"json\",\nenv_manager=\"virtualenv\",\ninstall_mlflow=False,\npip_requirements_override=[\"pillow==10.3.0\", \"scipy==1.13.0\"],\n)\n\n``` \n### Dependency updates \nIf there are any issues with the dependencies specified with a logged model, you can update the requirements by using the [MLflow CLI](https:\/\/mlflow.org\/docs\/latest\/cli.html#mlflow-models-update-pip-requirements) or `mlflow.models.model.update_model_requirements()` in the MLflow Python API without having to log another model. \nThe following example shows how to update the `pip_requirements.txt` of a logged model in-place. \nYou can update existing definitions with specified package versions or add non-existent requirements to the `pip_requirements.txt` file. This file is within the MLflow model artifact at the specified `model_uri` location. \n```\nfrom mlflow.models.model import update_model_requirements\n\nupdate_model_requirements(\nmodel_uri=model_uri,\noperation=\"add\",\nrequirement_list=[\"pillow==10.2.0\", \"scipy==1.12.0\"],\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/custom-models.html"} +{"content":"# Model serving with Databricks\n### Deploy custom models\n#### Expectations and limitations\n\nThe following sections describe known expectations and limitations for serving custom models using Model Serving. \n### Endpoint creation and update expectations \nNote \nThe information in this section does not apply to endpoints that serve foundation models. \nDeploying a newly registered model version involves packaging the model and its model environment and provisioning the model endpoint itself. This process can take approximately 10 minutes. 
\nDatabricks performs a zero-downtime update of endpoints by keeping the existing endpoint configuration up until the new one becomes ready. Doing so reduces risk of interruption for endpoints that are in use. \nIf model computation takes longer than 120 seconds, requests will time out. If you believe your model computation will take longer than 120 seconds, reach out to your Databricks account team. \nDatabricks performs occasional zero-downtime system updates and maintenance on existing Model Serving endpoints. During maintenance, Databricks reloads models and marks an endpoint as Failed if a model fails to reload. Make sure your customized models are robust and are able to reload at any time. \n### Endpoint scaling expectations \nNote \nThe information in this section does not apply to endpoints that serve foundation models. \nServing endpoints automatically scale based on traffic and the capacity of provisioned concurrency units. \n* **Provisioned concurrency:** The maximum number of parallel requests the system can handle. Estimate the required concurrency using the formula: provisioned concurrency = queries per second (QPS) \\* model execution time (s).\n* **Scaling behavior:** Endpoints scale up almost immediately with increased traffic and scale down every five minutes to match reduced traffic.\n* **Scale to zero:** Endpoints can scale down to zero after 30 minutes of inactivity. The first request after scaling to zero experiences a \u201ccold start,\u201d leading to higher latency. For latency-sensitive applications, consider strategies to manage this feature effectively. 
\n### GPU workload limitations \nThe following are limitations for serving endpoints with GPU workloads: \n* Container image creation for GPU serving takes longer than image creation for CPU serving due to model size and increased installation requirements for models served on GPU.\n* When deploying very large models, the deployment process might time out if the container build and model deployment exceed a 60-minute duration. If this occurs, retrying the process should successfully deploy the model.\n* Autoscaling for GPU serving takes longer than for CPU serving.\n* GPU capacity is not guaranteed when scaling to zero. GPU endpoints might experience extra high latency for the first request after scaling to zero.\n* This functionality is not available in `ap-southeast-1`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/custom-models.html"} +{"content":"# Model serving with Databricks\n### Deploy custom models\n#### Anaconda licensing update\n\nThe following notice is for customers relying on Anaconda. \nImportant \nAnaconda Inc. updated their [terms of service](https:\/\/www.anaconda.com\/terms-of-service) for anaconda.org channels. Based on the new terms of service you may require a commercial license if you rely on Anaconda\u2019s packaging and distribution. See [Anaconda Commercial Edition FAQ](https:\/\/www.anaconda.com\/blog\/anaconda-commercial-edition-faq) for more information. Your use of any Anaconda channels is governed by their terms of service. \nMLflow models logged before [v1.18](https:\/\/mlflow.org\/news\/2021\/06\/18\/1.18.0-release\/index.html) (Databricks Runtime 8.3 ML or earlier) were by default logged with the conda `defaults` channel (<https:\/\/repo.anaconda.com\/pkgs\/>) as a dependency. Because of this license change, Databricks has stopped the use of the `defaults` channel for models logged using MLflow v1.18 and above. 
The default channel logged is now `conda-forge`, which points at the community managed <https:\/\/conda-forge.org\/>. \nIf you logged a model before MLflow v1.18 without excluding the `defaults` channel from the conda environment for the model, that model may have a dependency on the `defaults` channel that you may not have intended.\nTo manually confirm whether a model has this dependency, you can examine the `channels` value in the `conda.yaml` file that is packaged with the logged model. For example, a model\u2019s `conda.yaml` with a `defaults` channel dependency may look like this: \n```\nchannels:\n- defaults\ndependencies:\n- python=3.8.8\n- pip\n- pip:\n- mlflow\n- scikit-learn==0.23.2\n- cloudpickle==1.6.0\nname: mlflow-env\n\n``` \nBecause Databricks cannot determine whether your use of the Anaconda repository to interact with your models is permitted under your relationship with Anaconda, Databricks is not forcing its customers to make any changes. If your use of the Anaconda.com repo through the use of Databricks is permitted under Anaconda\u2019s terms, you do not need to take any action. \nIf you would like to change the channel used in a model\u2019s environment, you can re-register the model to the model registry with a new `conda.yaml`. You can do this by specifying the channel in the `conda_env` parameter of `log_model()`. \nFor more information on the `log_model()` API, see the MLflow documentation for the model flavor you are working with, for example, [log\\_model for scikit-learn](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.sklearn.html#mlflow.sklearn.log_model). 
\nFor more information on `conda.yaml` files, see the [MLflow documentation](https:\/\/www.mlflow.org\/docs\/latest\/models.html#additional-logged-files).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/custom-models.html"} +{"content":"# Model serving with Databricks\n### Deploy custom models\n#### Additional resources\n\n* [Tutorial: Deploy and query a custom model](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-intro.html)\n* [Create custom model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html)\n* [Query serving endpoints for custom models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-custom-model-endpoints.html)\n* [Use custom Python libraries with Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/private-libraries-model-serving.html)\n* [Package custom artifacts for Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-custom-artifacts.html)\n* [Deploy Python code with Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/deploy-custom-models.html)\n* [Serve multiple models to a Model Serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/serve-multiple-models-to-serving-endpoint.html)\n* [Configure access to resources from model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/store-env-variable-model-serving.html)\n* [Add an instance profile to a model serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/add-model-serving-instance-profile.html)\n* [Configure route optimization on serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/route-optimization.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/custom-models.html"} +{"content":"# Databricks data 
engineering\n### Libraries\n\nTo make third-party or custom code available to notebooks and jobs running on your clusters, you can install a library. Libraries can be written in Python, Java, Scala, and R. You can upload Python, Java, and Scala libraries and point to external packages in PyPI, Maven, and CRAN repositories. \nDatabricks includes many common libraries in Databricks Runtime. To see which libraries are included in Databricks Runtime, look at the **System Environment** subsection of the [Databricks Runtime release notes](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html) for your Databricks Runtime version.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/index.html"} +{"content":"# Databricks data engineering\n### Libraries\n#### Cluster-scoped libraries\n\nYou can install libraries on clusters so that they can be used by all notebooks and jobs running on the cluster. Databricks supports Python, JAR, and R libraries. See [Cluster libraries](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html). \nYou can install a cluster library directly from the following sources: \n* A [package repository](https:\/\/docs.databricks.com\/libraries\/package-repositories.html) such as PyPI, Maven, or CRAN\n* [Workspace files](https:\/\/docs.databricks.com\/libraries\/workspace-files-libraries.html)\n* Unity Catalog [volumes](https:\/\/docs.databricks.com\/libraries\/volume-libraries.html)\n* A [cloud object storage](https:\/\/docs.databricks.com\/libraries\/object-storage-libraries.html) location\n* A path on your local machine \nNot all locations are supported for all types of libraries or all compute configurations. See [Recommendations for uploading libraries](https:\/\/docs.databricks.com\/libraries\/index.html#recommendations) for configuration recommendations. \nImportant \nLibraries can be installed from DBFS when using Databricks Runtime 14.3 LTS and below. However, any workspace user can modify library files stored in DBFS. 
To improve the security of libraries in a Databricks workspace, storing library files in the DBFS root is deprecated and disabled by default in Databricks Runtime 15.0 and above. See [Storing libraries in DBFS root is deprecated and disabled by default](https:\/\/docs.databricks.com\/release-notes\/runtime\/15.0.html#libraries-dbfs-deprecation). \nInstead, Databricks [recommends](https:\/\/docs.databricks.com\/libraries\/index.html#recommendations) uploading all libraries, including Python libraries, JAR files, and Spark connectors, to workspace files or Unity Catalog volumes, or using library package repositories. If your workload does not support these patterns, you can also use libraries stored in cloud object storage. \nFor complete library support information, see [Python library support](https:\/\/docs.databricks.com\/libraries\/index.html#python-library-support), [Java and Scala library support](https:\/\/docs.databricks.com\/libraries\/index.html#jar-library-support), and [R library support](https:\/\/docs.databricks.com\/libraries\/index.html#r-library-support). \n### Recommendations for uploading libraries \nDatabricks supports most configuration installations of Python, JAR, and R libraries, but there are some unsupported scenarios. It is recommended that you upload libraries to source locations that support installation onto compute with shared access mode, as this is the recommended mode for all workloads. See [Access modes](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode). When scheduling workflows with shared access mode, run the workflow with a [service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/best-practices.html). \nImportant \nOnly use compute with single user access mode if required functionality is not supported by shared access mode. No isolation shared access mode is a legacy configuration on Databricks that is not recommended. 
\nThe following table provides recommendations organized by Databricks Runtime version and Unity Catalog enablement. \n| Configuration | Recommendation |\n| --- | --- |\n| Databricks Runtime 13.3 LTS and above with Unity Catalog | Install libraries on compute with [shared access mode](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode) from Unity Catalog [volumes](https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-to-volume.html) with GRANT READ for all account users. If applicable, Maven coordinates and JAR library paths need to be added to the [allowlist](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/allowlist.html). |\n| Databricks Runtime 11.3 LTS and above without Unity Catalog | Install libraries from [workspace files](https:\/\/docs.databricks.com\/libraries\/workspace-files-libraries.html). (File size limit is 500 MB.) |\n| Databricks Runtime 10.4 LTS and below | Install libraries from [cloud object storage](https:\/\/docs.databricks.com\/connect\/storage\/index.html). | \n### Python library support \nThe following table indicates Databricks Runtime version compatibility for Python wheel files for different cluster access modes based on the library source location. See [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html) and [Access modes](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode). \nWith Databricks Runtime 15.0 and above, you can use [requirements.txt files](https:\/\/pip.pypa.io\/en\/stable\/reference\/requirements-file-format\/) to manage your Python dependencies. These files can be uploaded to any supported source location. \nNote \nInstalling Python egg files is not supported with Databricks Runtime 14.0 and above. Use Python wheel files or install packages from PyPI instead. 
\n| | Shared access mode | Single user access mode | No isolation shared access mode *(Legacy)* |\n| --- | --- | --- | --- |\n| **PyPI** | 13.3 LTS and above | All supported Databricks Runtime versions | All supported Databricks Runtime versions |\n| **Workspace files** | 13.3 LTS and above | 13.3 LTS and above | 14.1 and above |\n| **Volumes** | 13.3 LTS and above | 13.3 LTS and above | Not supported |\n| **Cloud storage** | 13.3 LTS and above | All supported Databricks Runtime versions | All supported Databricks Runtime versions |\n| **DBFS *(Not recommended)*** | Not supported | 14.3 and below | 14.3 and below | \n### Java and Scala library support \nThe following table indicates Databricks Runtime version compatibility for JAR files for different cluster access modes based on the library source location. See [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html) and [Access modes](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode). \nNote \nShared access mode requires an admin to add Maven coordinates and paths for JAR libraries to an `allowlist`. See [Allowlist libraries and init scripts on shared compute](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/allowlist.html). 
\n| | Shared access mode | Single user access mode | No isolation shared access mode *(Legacy)* |\n| --- | --- | --- | --- |\n| **Maven** | 13.3 LTS and above | All supported Databricks Runtime versions | All supported Databricks Runtime versions |\n| **Workspace files** | Not supported | Not supported | 14.1 and above |\n| **Volumes** | 13.3 LTS and above | 13.3 LTS and above | Not supported |\n| **Cloud storage** | 13.3 LTS and above | All supported Databricks Runtime versions | All supported Databricks Runtime versions |\n| **DBFS *(Not recommended)*** | Not supported | 14.3 and below | 14.3 and below | \n### R library support \nThe following table indicates Databricks Runtime version compatibility for CRAN packages for different cluster access modes. See [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html) and [Access modes](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode). \n| | Shared access mode | Single user access mode | No isolation shared access mode *(Legacy)* |\n| --- | --- | --- | --- |\n| **CRAN** | Not supported | All supported Databricks Runtime versions | All supported Databricks Runtime versions |\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/index.html"} +{"content":"# Databricks data engineering\n### Libraries\n#### Notebook-scoped libraries\n\nNotebook-scoped libraries, available for Python and R, allow you to install libraries and create an environment scoped to a notebook session. These libraries do not affect other notebooks running on the same cluster. Notebook-scoped libraries do not persist and must be re-installed for each session. Use notebook-scoped libraries when you need a custom environment for a specific notebook. 
\n* [Notebook-scoped Python libraries](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html)\n* [Notebook-scoped R libraries](https:\/\/docs.databricks.com\/libraries\/notebooks-r-libraries.html) \nNote \nJARs cannot be installed at the notebook level. \nImportant \nWorkspace libraries have been deprecated and should not be used. See [Workspace libraries (legacy)](https:\/\/docs.databricks.com\/archive\/legacy\/workspace-libraries.html). However, storing libraries as workspace files is distinct from workspace libraries and is still fully supported. You can install libraries stored as workspace files directly to compute or job tasks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/index.html"} +{"content":"# Databricks data engineering\n### Libraries\n#### Python environment management\n\nThe following table provides an overview of options you can use to install Python libraries in Databricks. \n| Python package source | [Notebook-scoped libraries with %pip](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html) | [Cluster libraries](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html) | [Job libraries](https:\/\/docs.databricks.com\/api\/workspace\/libraries) with [Jobs API](https:\/\/docs.databricks.com\/api\/workspace\/jobs) |\n| --- | --- | --- | --- |\n| PyPI | Use `%pip install`. See [example](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html#pip-install). | Select [PyPI as the source](https:\/\/docs.databricks.com\/libraries\/package-repositories.html#pypi-libraries). | Add a new `pypi` object to the job libraries and specify the `package` field. |\n| Private PyPI mirror, such as Nexus or Artifactory | Use `%pip install` with the `--index-url` option. [Secret management](https:\/\/docs.databricks.com\/security\/secrets\/index.html) is available. See [example](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html#pip-install-private). | Not supported. 
| Not supported. |\n| VCS, such as GitHub, with raw source | Use `%pip install` and specify the repository URL as the package name. See [example](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html#pip-install-vcs). | Select [PyPI as the source](https:\/\/docs.databricks.com\/libraries\/package-repositories.html#pypi-libraries) and specify the repository URL as the package name. | Add a new `pypi` object to the job libraries and specify the repository URL as the `package` field. |\n| Private VCS with raw source | Use `%pip install` and specify the repository URL with basic authentication as the package name. [Secret management](https:\/\/docs.databricks.com\/security\/secrets\/index.html) is available. See [example](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html#pip-install-private). | Not supported. | Not supported. |\n| File path | Use `%pip install`. See [example](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html#workspace-files). | Select **File path\/S3** as the source. | Add a new `egg` or `whl` object to the job libraries and specify the file path as the `package` field. |\n| S3 | Use `%pip install` together with a pre-signed URL. Paths with the S3 protocol `s3:\/\/` are not supported. | Select **File path\/S3** as the source. | Add a new `egg` or `whl` object to the job libraries and specify the S3 path as the `package` field. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/index.html"} +{"content":"# Databricks data engineering\n### Libraries\n#### Python library precedence\n\nYou might encounter a situation where you need to override the version for a built-in library, or have a custom library that conflicts in name with another library installed on the cluster. When you run `import <library>`, the library with the highest precedence is imported. 
\nImportant \nLibraries stored in workspace files have different precedence depending on how they are added to the Python `sys.path`. A Databricks Git folder adds the current working directory to the path before all other libraries, while notebooks outside Git folders add the current working directory after other libraries are installed. If you manually append workspace directories to your path, these always have the lowest precedence. \nThe following list orders precedence from highest to lowest. In this list, a lower number means higher precedence. \n1. Libraries in the current working directory (Git folders only).\n2. Libraries in the Git folder root directory (Git folders only).\n3. Notebook-scoped libraries (`%pip install` in notebooks).\n4. Cluster libraries (using the UI, CLI, or API).\n5. Libraries included in Databricks Runtime. \n* Libraries installed with init scripts might resolve before or after built-in libraries, depending on how they are installed. Databricks does not recommend installing libraries with init scripts.\n6. Libraries in the current working directory (not in Git folders).\n7. Workspace files appended to the `sys.path`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n#### Use scikit-learn on Databricks\n\nThis page provides examples of how you can use the `scikit-learn` package to train machine learning models in Databricks. [scikit-learn](https:\/\/scikit-learn.org\/stable\/index.html) is one of the most popular Python libraries for single-node machine learning and is included in Databricks Runtime and Databricks Runtime ML. See [Databricks Runtime release notes](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html) for the scikit-learn library version included with your cluster\u2019s runtime. 
\nYou can [import these notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html#import-notebook) and run them in your Databricks workspace. \nFor additional example notebooks to get started quickly on Databricks, see [Tutorials: Get started with ML](https:\/\/docs.databricks.com\/machine-learning\/ml-tutorials.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/scikit-learn.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n#### Use scikit-learn on Databricks\n##### Basic example using scikit-learn\n\nThis notebook provides a quick overview of machine learning model training on Databricks. It uses the `scikit-learn` package to train a simple classification model. It also illustrates the use of [MLflow](https:\/\/docs.databricks.com\/mlflow\/index.html) to track the model development process, and [Hyperopt](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/index.html#hyperparameter-tuning-with-hyperopt) to automate hyperparameter tuning. 
\nIf your workspace is enabled for Unity Catalog, use this version of the notebook: \n### scikit-learn classification notebook (Unity Catalog) \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning-with-unity-catalog.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nIf your workspace is not enabled for Unity Catalog, use this version of the notebook: \n### scikit-learn classification notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/ml-quickstart-training.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/scikit-learn.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n#### Use scikit-learn on Databricks\n##### End-to-end example using scikit-learn on Databricks\n\nThis notebook uses scikit-learn to illustrate a complete end-to-end example of loading data, model training, distributed hyperparameter tuning, and model inference. It also illustrates model lifecycle management using MLflow Model Registry to log and register your model. 
\nIf your workspace is enabled for Unity Catalog, use this version of the notebook: \n### Use scikit-learn with MLflow integration on Databricks (Unity Catalog) \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-end-to-end-example-uc.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nIf your workspace is not enabled for Unity Catalog, use this version of the notebook: \n### Use scikit-learn with MLflow integration on Databricks \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-end-to-end-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n* [Track scikit-learn model training with MLflow](https:\/\/docs.databricks.com\/mlflow\/tracking-ex-scikit.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/scikit-learn.html"} +{"content":"# \n### Security and compliance guide\n\nThis guide provides an overview of security features and capabilities that an enterprise data team can use to harden their Databricks environment according to their risk profile and governance policy. \nThis guide does not cover information about securing your data. For that information, see [Data governance with Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/index.html). \nNote \nThis article focuses on [the most recent (E2) version](https:\/\/docs.databricks.com\/archive\/aws\/end-of-life-legacy-workspaces.html#e2-architecture) of the Databricks platform. 
Some of the features described here may not be supported on legacy deployments that have not migrated to the E2 platform.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/index.html"} +{"content":"# \n### Security and compliance guide\n#### Authentication and access control\n\nIn Databricks, a *workspace* is a Databricks deployment in the cloud that functions as the unified environment that a specified set of users use for accessing all of their Databricks [assets](https:\/\/docs.databricks.com\/workspace\/workspace-assets.html). Your organization can choose to have multiple workspaces or just one, depending on your needs. A Databricks *account* represents a single entity for purposes of billing, user management, and support. An account can include multiple workspaces and Unity Catalog metastores. \nAccount admins handle general account management, and workspace admins manage the settings and features of individual workspaces in the account. Both account and workspace admins manage Databricks users, service principals, and groups, as well as authentication settings and access control. \nDatabricks provides security features, such as single sign-on, to configure strong authentication. Admins can configure these settings to help prevent account takeovers, in which credentials belonging to a user are compromised using methods like phishing or brute force, giving an attacker access to all of the data accessible from the environment. \nAccess control lists determine who can view and perform operations on objects in Databricks workspaces, such as notebooks and SQL warehouses. \nTo learn more about authentication and access control in Databricks, see [Authentication and access control](https:\/\/docs.databricks.com\/security\/auth-authz\/index.html).\n\n### Security and compliance guide\n#### Networking\n\nDatabricks provides network protections that enable you to secure Databricks workspaces and help prevent users from exfiltrating sensitive data. 
You can use IP access lists to enforce the network location of Databricks users. Using a customer-managed VPC, you can lock down outbound network access. To learn more, see [Networking](https:\/\/docs.databricks.com\/security\/network\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/index.html"} +{"content":"# \n### Security and compliance guide\n#### Data security and encryption\n\nSecurity-minded customers sometimes voice a concern that Databricks itself might be compromised, which could result in the compromise of their environment. Databricks has an extremely strong security program which manages the risk of such an incident. See the [Security and Trust Center](https:\/\/databricks.com\/trust) for an overview of the program. That said, no company can completely eliminate all risk, and Databricks provides encryption features for additional control of your data. See [Data security and encryption](https:\/\/docs.databricks.com\/security\/keys\/index.html).\n\n### Security and compliance guide\n#### Secret management\n\nSometimes accessing data requires that you authenticate to external data sources. Databricks recommends that you use Databricks secrets to store your credentials instead of directly entering your credentials into a notebook. For more information, see [Secret management](https:\/\/docs.databricks.com\/security\/secrets\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/index.html"} +{"content":"# \n### Security and compliance guide\n#### Auditing, privacy, and compliance\n\nDatabricks provides auditing features to enable admins to monitor user activities to detect security anomalies. For example, you can monitor for account takeovers by alerting on unusual login times or simultaneous remote logins. \nDatabricks also provides controls that help meet security requirements for many compliance standards, such as HIPAA and PCI. 
\nFor more information, see [Auditing, privacy, and compliance](https:\/\/docs.databricks.com\/security\/privacy\/index.html). \n### Security Analysis Tool \nExperimental \nThe Security Analysis Tool (SAT) is a productivity tool in an [Experimental](https:\/\/docs.databricks.com\/release-notes\/release-types.html) state. It\u2019s not meant to be used as a certification of your deployments. The SAT project is regularly updated to improve correctness of checks, add new checks, and fix bugs. \nYou can use the Security Analysis Tool (SAT) to analyze your Databricks account and workspace security configurations. SAT provides recommendations that help you follow Databricks security best practices. SAT is typically run daily as an automated workflow. The details of these check results are persisted in Delta tables in your storage so that trends can be analyzed over time. These results are displayed in a centralized Databricks dashboard. \nFor more information, see [the Security Analysis Tool GitHub repo](https:\/\/github.com\/databricks-industry-solutions\/security-analysis-tool). 
\n![Security Analysis Tool diagram](https:\/\/docs.databricks.com\/_images\/sat_diagram.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/index.html"} +{"content":"# \n### Security and compliance guide\n#### Learn more\n\nHere are some resources to help you build a comprehensive security solution that meets your organization\u2019s needs: \n* The [Databricks Security and Trust Center](https:\/\/databricks.com\/trust), which provides information about the ways in which security is built into every layer of the Databricks platform.\n* [Security Best Practices](https:\/\/www.databricks.com\/wp-content\/uploads\/2022\/09\/security-best-practices-databricks-on-aws.pdf), which provides a checklist of security practices, considerations, and patterns that you can apply to your deployment, learned from our enterprise engagements.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/index.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n### Manage privileges in Unity Catalog\n##### Unity Catalog privileges and securable objects\n\nThis article describes the Unity Catalog securable objects and the privileges that apply to them. To learn how to grant privileges in Unity Catalog, see [Show, grant, and revoke privileges](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html#grant). \nNote \nThis article refers to the Unity Catalog privileges and inheritance model in Privilege Model version 1.0. If you created your Unity Catalog metastore during the public preview (before August 25, 2022), you might be on an earlier privilege model that doesn\u2019t support the current inheritance model. You can upgrade to Privilege Model version 1.0 to get privilege inheritance. 
See [Upgrade to privilege inheritance](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/upgrade-privilege-model.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n### Manage privileges in Unity Catalog\n##### Unity Catalog privileges and securable objects\n###### Securable objects in Unity Catalog\n\nA securable object is an object defined in the Unity Catalog metastore on which privileges can be granted to a principal (user, service principal, or group). Securable objects in Unity Catalog are hierarchical. \n![Unity Catalog object hierarchy](https:\/\/docs.databricks.com\/_images\/object-hierarchy.png) \nThe securable objects are: \n* **METASTORE**: The top-level container for metadata. Each Unity Catalog metastore exposes a three-level namespace (`catalog`.`schema`.`table`) that organizes your data. \nWhen you manage privileges on a metastore, you do not include the metastore name in a SQL command. Unity Catalog grants or revokes the privilege on the metastore attached to your workspace. For example, the following command grants a group named *engineering* the ability to create a catalog in the metastore attached to the workspace: \n```\nGRANT CREATE CATALOG ON METASTORE TO engineering\n\n```\n* **CATALOG**: The first layer of the object hierarchy, used to organize your data assets. 
A *foreign catalog* is a special catalog type that mirrors a database in an external data system in a Lakehouse Federation scenario.\n* **SCHEMA**: Also known as databases, schemas are the second layer of the object hierarchy and contain tables and views.\n* **TABLE**: The lowest level in the object hierarchy, tables can be *external* (stored in external locations in your cloud storage of choice) or *managed* tables (stored in a storage container in your cloud storage that you create expressly for Databricks).\n* **VIEW**: A read-only object created from a query on one or more tables that is contained within a schema. \n* **MATERIALIZED VIEW**: An object created from a query on one or more tables that is contained within a schema. Its results reflect the state of data when it was last refreshed. \n* **VOLUME**: The lowest level in the object hierarchy, volumes can be *external* (stored in external locations in your cloud storage of choice) or *managed* (stored in a storage container in your cloud storage that you create expressly for Databricks).\n* **REGISTERED MODEL**: An [MLflow registered model](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html) that is contained within a schema.\n* **FUNCTION**: A user-defined function that is contained within a schema. See [User-defined functions (UDFs) in Unity Catalog](https:\/\/docs.databricks.com\/udf\/unity-catalog.html).\n* **EXTERNAL LOCATION**: An object that contains a reference to a storage credential and a cloud storage path that is contained within a Unity Catalog metastore.\n* **STORAGE CREDENTIAL**: An object that encapsulates a long-term cloud credential that provides access to cloud storage that is contained within a Unity Catalog metastore.\n* **CONNECTION**: An object that specifies a path and credentials for accessing an external database system in a Lakehouse Federation scenario.\n* **SHARE**: A logical grouping for the tables you intend to share using Delta Sharing. 
A share is contained within a Unity Catalog metastore.\n* **RECIPIENT**: An object that identifies an organization or group of users that can have data shared with them using Delta Sharing. These objects are contained within a Unity Catalog metastore.\n* **PROVIDER**: An object that represents an organization that has made data available for sharing using Delta Sharing. These objects are contained within a Unity Catalog metastore.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n### Manage privileges in Unity Catalog\n##### Unity Catalog privileges and securable objects\n###### Privilege types by securable object in Unity Catalog\n\nThe following table lists the privilege types that apply to each securable object in Unity Catalog. To learn how to grant privileges in Unity Catalog, see [Show, grant, and revoke privileges](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html#grant). \n| Securable | Privileges |\n| --- | --- |\n| Metastore | `CREATE CATALOG`, `CREATE CONNECTION`, `CREATE EXTERNAL LOCATION`, `CREATE PROVIDER`, `CREATE RECIPIENT`, `CREATE SHARE`, `CREATE STORAGE CREDENTIAL`, `SET SHARE PERMISSION`, `USE MARKETPLACE ASSETS`, `USE PROVIDER`, `USE RECIPIENT`, `USE SHARE` |\n| Catalog | `ALL PRIVILEGES`, `APPLY TAG`, `BROWSE`, `CREATE SCHEMA`, `USE CATALOG` All users have `USE CATALOG` on the `main` catalog by default. The following privilege types apply to securable objects in a catalog. You can grant these privileges at the catalog level to apply them to the pertinent current and future objects in the catalog. 
`CREATE FUNCTION`, `CREATE TABLE`, `CREATE MODEL`, `CREATE VOLUME`, `CREATE FOREIGN CATALOG`, `READ VOLUME`, `REFRESH`, `WRITE VOLUME`, `EXECUTE`, `MODIFY`, `SELECT`, `USE SCHEMA` |\n| Schema | `ALL PRIVILEGES`, `APPLY TAG`, `CREATE FUNCTION`, `CREATE TABLE`, `CREATE MODEL`, `CREATE VOLUME`, `CREATE MATERIALIZED VIEW`, `USE SCHEMA` The following privilege types apply to securable objects within a schema. You can grant these privileges at the schema level to apply them to the pertinent current and future objects within the schema. `EXECUTE`, `MODIFY`, `SELECT`, `READ VOLUME`, `REFRESH`, `WRITE VOLUME` |\n| Table | `ALL PRIVILEGES`, `APPLY TAG`, `MODIFY`, `SELECT` |\n| Materialized view | `ALL PRIVILEGES`, `APPLY TAG`, `REFRESH`, `SELECT` |\n| View | `ALL PRIVILEGES`, `APPLY TAG`, `SELECT` |\n| Volume | `ALL PRIVILEGES`, `READ VOLUME`, `WRITE VOLUME` |\n| External location | `ALL PRIVILEGES`, `BROWSE`, `CREATE EXTERNAL TABLE`, `CREATE EXTERNAL VOLUME`, `READ FILES`, `WRITE FILES`, `CREATE MANAGED STORAGE` |\n| Storage credential | `ALL PRIVILEGES`, `CREATE EXTERNAL LOCATION`, `CREATE EXTERNAL TABLE`, `READ FILES`, `WRITE FILES` |\n| Connection | `ALL PRIVILEGES`, `CREATE FOREIGN CATALOG`, `USE CONNECTION` |\n| Function | `ALL PRIVILEGES`, `EXECUTE` |\n| Registered Model | `ALL PRIVILEGES`, `APPLY TAG`, `EXECUTE` |\n| Share | `SELECT` (Can be granted to `RECIPIENT`) |\n| Recipient | None |\n| Provider | None |\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n### Manage privileges in Unity Catalog\n##### Unity Catalog privileges and securable objects\n###### General Unity Catalog privilege types\n\nThis section provides details about the privilege types that apply generally to Unity Catalog. 
To learn how to grant privileges in Unity Catalog, see [Show, grant, and revoke privileges](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html#grant). \n### ALL PRIVILEGES \n**Applicable object types: `CATALOG`, `EXTERNAL LOCATION`, `STORAGE CREDENTIAL`, `SCHEMA`, `FUNCTION`, `REGISTERED MODEL`, `TABLE`, `MATERIALIZED VIEW`, `VIEW`, `VOLUME`** \nUsed to grant or revoke all privileges applicable to the securable object and its child objects without explicitly specifying them. \nWhen `ALL PRIVILEGES` is granted on an object, it does not individually grant the user each applicable privilege at the time of the grant. Instead, it expands to all available privileges at the time permissions checks are made. \nWhen `ALL PRIVILEGES` is revoked, the `ALL PRIVILEGES` privilege is revoked and any explicit privileges granted to the user on the object are also revoked. \nNote \nThis privilege is powerful when applied at higher levels in the hierarchy. For example, GRANT ALL PRIVILEGES ON CATALOG main TO `analysts` would give the analyst team all privileges on every object (schemas, tables, views, functions) in the catalog. \n### APPLY TAG \n**Applicable object types: `CATALOG`, `SCHEMA`, `REGISTERED MODEL`, `TABLE`, `MATERIALIZED VIEW`, `VIEW`** \nAllows a user to add and edit tags on an object. Granting `APPLY TAG` to a table or view also enables column tagging. \nThe user must also have the `USE CATALOG` privilege on the parent catalog and `USE SCHEMA` on the parent schema. \n### BROWSE \n**Applicable object types: `CATALOG`, `EXTERNAL LOCATION`** \nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nAllows a user to view an object\u2019s metadata using Catalog Explorer, the schema browser, search results, the lineage graph, `information_schema`, and the REST API. 
\nThe user does not require the `USE CATALOG` privilege on the parent catalog or `USE SCHEMA` on the parent schema. \n### CREATE CATALOG \n**Applicable object types: Unity Catalog metastore** \nAllows a user to create a catalog in a Unity Catalog metastore. To create a foreign catalog, you must also have the [CREATE FOREIGN CATALOG](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html#foreign-catalog) privilege on the connection that contains the foreign catalog or on the metastore. \n### CREATE CONNECTION \n**Applicable object types: Unity Catalog metastore** \nAllows a user to create a connection to an external database in a Lakehouse Federation scenario. \n### CREATE EXTERNAL LOCATION \n**Applicable object types: Unity Catalog metastore, `STORAGE CREDENTIAL`** \nTo create an external location, the user must have this privilege on both the metastore and the storage credential that is being referenced in the external location. \n### CREATE EXTERNAL TABLE \n**Applicable object types: `EXTERNAL LOCATION`, `STORAGE CREDENTIAL`** \nAllows a user to create external tables directly in your cloud tenant using an external location or storage credential. Databricks recommends granting this privilege on an external location rather than storage credential (since it\u2019s scoped to a path, it allows more control over where users can create external tables in your cloud tenant). \n### CREATE EXTERNAL VOLUME \n**Applicable object types: `EXTERNAL LOCATION`** \nAllows a user to create external volumes using an external location. \n### CREATE FOREIGN CATALOG \n**Applicable object types: `CONNECTION`** \nAllows a user to create foreign catalogs using a connection to an external database in a Lakehouse Federation scenario. \n### CREATE FUNCTION \n**Applicable object types: `SCHEMA`** \nAllows a user to create a function in the schema. 
Since privileges are inherited, `CREATE FUNCTION` can also be granted on a catalog, which allows a user to create a function in any existing or future schema in the catalog. \nThe user must also have the `USE CATALOG` privilege on the parent catalog and `USE SCHEMA` on the parent schema. \n### CREATE MODEL \n**Applicable object types: `SCHEMA`** \nAllows a user to create an MLflow registered model in the schema. Since privileges are inherited, `CREATE MODEL` can also be granted on a catalog, which allows a user to create a registered model in any existing or future schema in the catalog. \nThe user must also have the `USE CATALOG` privilege on the parent catalog and `USE SCHEMA` on the parent schema. \n### CREATE MANAGED STORAGE \n**Applicable object types: `EXTERNAL LOCATION`** \nAllows a user to specify a location for storing managed tables at the catalog or schema level, overriding the default root storage for the metastore. \n### CREATE SCHEMA \n**Applicable object types: `CATALOG`** \nAllows a user to create a schema. The user must also have the `USE CATALOG` privilege on the catalog. \n### CREATE STORAGE CREDENTIAL \n**Applicable object types: Unity Catalog metastore** \nAllows a user to create a storage credential in a Unity Catalog metastore. \n### CREATE TABLE \n**Applicable object types: `SCHEMA`** \nAllows a user to create a table or view in the schema. Since privileges are inherited, `CREATE TABLE` can also be granted on a catalog, which allows a user to create a table or view in any existing or future schema in the catalog. \nThe user must also have the `USE CATALOG` privilege on its parent catalog and the `USE SCHEMA` privilege on its parent schema. \n### CREATE MATERIALIZED VIEW \nPreview \nThis feature is in Public Preview. \n**Applicable object types: `SCHEMA`** \nAllows a user to create a materialized view in the schema. 
Since privileges are inherited, `CREATE MATERIALIZED VIEW` can also be granted on a catalog, which allows a user to create a materialized view in any existing or future schema in the catalog. \nThe user must also have the `USE CATALOG` privilege on its parent catalog and the `USE SCHEMA` privilege on its parent schema. \n### CREATE VOLUME \n**Applicable object types: `SCHEMA`** \nAllows a user to create a volume in the schema. Since privileges are inherited, `CREATE VOLUME` can also be granted on a catalog, which allows a user to create a volume in any existing or future schema in the catalog. \nThe user must also have the `USE CATALOG` privilege on the volume\u2019s parent catalog and the `USE SCHEMA` privilege on its parent schema. \n### EXECUTE \n**Applicable object types: `FUNCTION`, `REGISTERED MODEL`** \nAllows a user to invoke a user-defined function or load a model for inference, if the user also has `USE CATALOG` on its parent catalog and `USE SCHEMA` on its parent schema. For functions, `EXECUTE` grants the ability to view the function definition and metadata. For registered models, `EXECUTE` grants the ability to view metadata for all versions of the registered model, and to download model files. \nSince privileges are inherited, you can grant a user the `EXECUTE` privilege on a catalog or schema, which automatically grants the user the `EXECUTE` privilege on all current and future functions in the catalog or schema. \n### MANAGE ALLOWLIST \n**Applicable object types: Unity Catalog metastore** \nAllows a user to add or modify paths for init scripts, JARs, and Maven coordinates in the allowlist that governs Unity Catalog-enabled clusters with shared access mode. See [Allowlist libraries and init scripts on shared compute](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/allowlist.html). 
\n### MODIFY \n**Applicable object types: `TABLE`** \nAllows a user to add, update, and delete data to or from the table if the user also has `SELECT` on the table as well as `USE CATALOG` on its parent catalog and `USE SCHEMA` on its parent schema. \nSince privileges are inherited, you can grant a user the `MODIFY` privilege on a catalog or schema, which automatically grants the user the `MODIFY` privilege on all current and future tables in the catalog or schema. \n### READ FILES \n**Applicable object types: `VOLUME`, `EXTERNAL LOCATION`** \nAllows a user to read files directly from your cloud object storage. Databricks recommends granting this privilege on volumes and granting on external locations for limited use cases. For more guidance, see [Manage external locations, external tables, and external volumes](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html#manage-external). \n### READ VOLUME \n**Applicable object types: `VOLUME`** \nAllows a user to read files and directories stored inside a volume if the user also has `USE CATALOG` on its parent catalog and `USE SCHEMA` on its parent schema. \nPrivileges are inherited. When you grant a user the `READ VOLUME` privilege on a catalog or schema, you automatically grant the user the `READ VOLUME` privilege on all current and future volumes in the catalog or schema. \n### SELECT \n**Applicable object types: `TABLE`, `VIEW`, `MATERIALIZED VIEW`, `SHARE`** \nIf applied to a table or view, allows a user to select from the table or view, if the user also has `USE CATALOG` on its parent catalog and `USE SCHEMA` on its parent schema. If applied to a share, allows a recipient to select from the share. \nSince privileges are inherited, you can grant a user the `SELECT` privilege on a catalog or schema, which automatically grants the user the `SELECT` privilege on all current and future tables and views in the catalog or schema. 
\n### USE CATALOG \n**Applicable object types: `CATALOG`** \nThis privilege does not grant access to the catalog itself, but is needed for a user to interact with any object within the catalog. For example, to select data from a table, users need to have the `SELECT` privilege on that table and `USE CATALOG` privileges on its parent catalog as well as `USE SCHEMA` privileges on its parent schema. \nThis is useful for allowing catalog owners to be able to limit how far individual schema and table owners can share data they produce. For example, a table owner granting `SELECT` to another user does not allow that user read access to the table unless they also have been granted `USE CATALOG` privileges on its parent catalog as well as `USE SCHEMA` privileges on its parent schema. \nThe `USE CATALOG` privilege on the parent catalog is not required to read an object\u2019s metadata if the user has the `BROWSE` privilege on that catalog. \n### USE CONNECTION \n**Applicable object types: `CONNECTION`** \nAllows a user to list and view details about connections to an external database in a Lakehouse Federation scenario. To create foreign catalogs for a connection, you must have `CREATE FOREIGN CATALOG` on the connection or ownership of the connection. \n### USE SCHEMA \n**Applicable object types: `SCHEMA`** \nThis privilege does not grant access to the schema itself, but is needed for a user to interact with any object within the schema. For example, to select data from a table, users need to have the `SELECT` privilege on that table and `USE SCHEMA` on its parent schema as well as `USE CATALOG` on its parent catalog. \nSince privileges are inherited, you can grant a user the `USE SCHEMA` privilege on a catalog, which automatically grants the user the `USE SCHEMA` privilege on all current and future schemas in the catalog. 
\nThe `USE SCHEMA` privilege on the parent schema is not required to read an object\u2019s metadata if the user has the `BROWSE` privilege on that schema or its parent catalog. \n### WRITE FILES \n**Applicable object types: `VOLUME`, `EXTERNAL LOCATION`** \nAllows a user to write files directly into your cloud object storage. Databricks recommends granting this privilege on volumes. Grant this privilege sparingly on external locations. For more guidance, see [Manage external locations, external tables, and external volumes](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/best-practices.html#manage-external). \n### WRITE VOLUME \n**Applicable object types: `VOLUME`** \nAllows a user to add, remove, or modify files and directories stored inside a volume if the user also has `USE CATALOG` on its parent catalog and `USE SCHEMA` on its parent schema. \nPrivileges are inherited. When you grant a user the `WRITE VOLUME` privilege on a catalog or schema, you automatically grant the user the `WRITE VOLUME` privilege on all current and future volumes in the catalog or schema. \n### REFRESH \n**Applicable object types: `MATERIALIZED VIEW`** \nAllows a user to refresh a materialized view if the user also has `USE CATALOG` on its parent catalog and `USE SCHEMA` on its parent schema. \nPrivileges are inherited. 
When you grant the `REFRESH` privilege on a catalog or schema to a user, you automatically grant the user the `REFRESH` privilege on all current and future materialized views in the catalog or schema.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n### Manage privileges in Unity Catalog\n##### Unity Catalog privileges and securable objects\n###### Privilege types that apply only to Delta Sharing or Databricks Marketplace\n\nThis section provides details about the privilege types that apply only to Delta Sharing. \n### CREATE PROVIDER \n**Applicable object types: Unity Catalog metastore** \nAllows a user to create a Delta Sharing provider object in the metastore. A provider identifies an organization or group of users that have shared data using Delta Sharing. Provider creation is performed by a user in the recipient\u2019s Databricks account. See [Share data and AI assets securely using Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html). \n### CREATE RECIPIENT \n**Applicable object types: Unity Catalog metastore** \nAllows a user to create a Delta Sharing recipient object in the metastore. A recipient identifies an organization or group of users that can have data shared with them using Delta Sharing. Recipient creation is performed by a user in the provider\u2019s Databricks account. See [Share data and AI assets securely using Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html). \n### CREATE SHARE \n**Applicable object types: Unity Catalog metastore** \nAllows a user to create a share in the metastore. 
A share is a logical grouping for the tables you intend to share using Delta Sharing. \n### SET SHARE PERMISSION \n**Applicable object types: Unity Catalog metastore** \nIn Delta Sharing, this privilege, combined with `USE SHARE` and `USE RECIPIENT` (or recipient ownership), gives a provider user the ability to grant a recipient access to a share. Combined with `USE SHARE`, it gives the ability to transfer ownership of a share to another user, group, or service principal. \n### USE MARKETPLACE ASSETS \n**Applicable object types: Unity Catalog metastore** \n*Enabled by default for all Unity Catalog metastores.* In Databricks Marketplace, this privilege gives a user the ability to get instant access or request access for data products shared in a Marketplace listing. It also allows a user to access the read-only catalog that is created when a provider shares a data product. Without this privilege, the user would require the `CREATE CATALOG` and `USE PROVIDER` privileges or the metastore admin role. This enables you to limit the number of users with those powerful permissions. \n### USE PROVIDER \n**Applicable object types: Unity Catalog metastore** \nIn Delta Sharing, gives a recipient user read-only access to all providers in a recipient metastore and their shares. Combined with the `CREATE CATALOG` privilege, this privilege allows a recipient user who is not a metastore admin to mount a share as a catalog. This enables you to limit the number of users with the powerful metastore admin role. \n### USE RECIPIENT \n**Applicable object types: Unity Catalog metastore** \nIn Delta Sharing, gives a provider user read-only access to all recipients in a provider metastore and their shares. This allows a provider user who is not a metastore admin to view recipient details, recipient authentication status, and the list of shares that the provider has shared with the recipient. 
\nIn [Databricks Marketplace](https:\/\/docs.databricks.com\/marketplace\/index.html), this gives provider users the ability to view listings and consumer requests in the Provider console. \n### USE SHARE \n**Applicable object types: Unity Catalog metastore** \nIn Delta Sharing, gives a provider user read-only access to all shares defined in a provider metastore. This allows a provider user who is not a metastore admin to list shares and list the assets (tables and notebooks) in a share, along with the share\u2019s recipients. \nIn [Databricks Marketplace](https:\/\/docs.databricks.com\/marketplace\/index.html), this gives provider users the ability to view details about the data shared in a listing.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Transform data with Delta Live Tables\n\nThis article describes how you can use Delta Live Tables to declare transformations on datasets and specify how records are processed through query logic. It also contains some examples of common transformation patterns that can be useful when building out Delta Live Tables pipelines. \nYou can define a dataset against any query that returns a DataFrame. You can use Apache Spark built-in operations, UDFs, custom logic, and MLflow models as transformations in your Delta Live Tables pipeline. Once data has been ingested into your Delta Live Tables pipeline, you can define new datasets against upstream sources to create new streaming tables, materialized views, and views. 
\nTo learn how to effectively perform stateful processing with Delta Live Tables, see [Optimize stateful processing in Delta Live Tables with watermarks](https:\/\/docs.databricks.com\/delta-live-tables\/stateful-processing.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/transform.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Transform data with Delta Live Tables\n###### When to use views, materialized views, and streaming tables\n\nTo ensure your pipelines are efficient and maintainable, choose the best dataset type when you implement your pipeline queries. \nConsider using a view when: \n* You have a large or complex query that you want to break into easier-to-manage queries.\n* You want to validate intermediate results using expectations.\n* You want to reduce storage and compute costs and do not require the materialization of query results. Because tables are materialized, they require additional computation and storage resources. \nConsider using a materialized view when: \n* Multiple downstream queries consume the table. Because views are computed on demand, the view is re-computed every time the view is queried.\n* Other pipelines, jobs, or queries consume the table. Because views are not materialized, you can only use them in the same pipeline.\n* You want to view the results of a query during development. Because tables are materialized and can be viewed and queried outside of the pipeline, using tables during development can help validate the correctness of computations. After validating, convert queries that do not require materialization into views. \nConsider using a streaming table when: \n* A query is defined against a data source that is continuously or incrementally growing.\n* Query results should be computed incrementally.\n* High throughput and low latency are desired for the pipeline. 
\nNote \nStreaming tables are always defined against streaming sources. You can also use streaming sources with `APPLY CHANGES INTO` to apply updates from CDC feeds. See [APPLY CHANGES API: Simplify change data capture in Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/transform.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Transform data with Delta Live Tables\n###### Combine streaming tables and materialized views in a single pipeline\n\nStreaming tables inherit the processing guarantees of Apache Spark Structured Streaming and are configured to process queries from append-only data sources, where new rows are always inserted into the source table rather than modified. \nNote \nAlthough, by default, streaming tables require append-only data sources, when a streaming source is another streaming table that requires updates or deletes, you can override this behavior with the [skipChangeCommits flag](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html#ignore-changes). \nA common streaming pattern includes ingesting source data to create the initial datasets in a pipeline. These initial datasets are commonly called *bronze* tables and often perform simple transformations. \nBy contrast, the final tables in a pipeline, commonly referred to as *gold* tables, often require complicated aggregations or reading from sources that are the targets of an `APPLY CHANGES INTO` operation. Because these operations inherently create updates rather than appends, they are not supported as inputs to streaming tables. These transformations are better suited for materialized views. 
\nBy mixing streaming tables and materialized views into a single pipeline, you can simplify your pipeline, avoid costly re-ingestion or re-processing of raw data, and have the full power of SQL to compute complex aggregations over an efficiently encoded and filtered dataset. The following example illustrates this type of mixed processing: \nNote \nThese examples use Auto Loader to load files from cloud storage. To load files with Auto Loader in a Unity Catalog enabled pipeline, you must use [external locations](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html). To learn more about using Unity Catalog with Delta Live Tables, see [Use Unity Catalog with your Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html). \n```\n@dlt.table\ndef streaming_bronze():\n    return (\n        # Since this is a streaming source, this table is incremental.\n        spark.readStream.format(\"cloudFiles\")\n        .option(\"cloudFiles.format\", \"json\")\n        .load(\"s3:\/\/path\/to\/raw\/data\")\n    )\n\n@dlt.table\ndef streaming_silver():\n    # Since we read the bronze table as a stream, this silver table is also\n    # updated incrementally.\n    return dlt.read_stream(\"streaming_bronze\").where(...)\n\n@dlt.table\ndef live_gold():\n    # This table will be recomputed completely by reading the whole silver table\n    # when it is updated.\n    return dlt.read(\"streaming_silver\").groupBy(\"user_id\").count()\n\n``` \n```\nCREATE OR REFRESH STREAMING TABLE streaming_bronze\nAS SELECT * FROM cloud_files(\n\"s3:\/\/path\/to\/raw\/data\", \"json\"\n)\n\nCREATE OR REFRESH STREAMING TABLE streaming_silver\nAS SELECT * FROM STREAM(LIVE.streaming_bronze) WHERE...\n\nCREATE OR REFRESH LIVE TABLE live_gold\nAS SELECT count(*) FROM LIVE.streaming_silver GROUP BY user_id\n\n``` \nLearn more about using [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html) to efficiently read JSON files from S3 for incremental 
processing.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/transform.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Transform data with Delta Live Tables\n###### Stream-static joins\n\nStream-static joins are a good choice when denormalizing a continuous stream of append-only data with a primarily static dimension table. \nWith each pipeline update, new records from the stream are joined with the most current snapshot of the static table. If records are added or updated in the static table after corresponding data from the streaming table has been processed, the resultant records are not recalculated unless a full refresh is performed. \nIn pipelines configured for triggered execution, the static table returns results as of the time the update started. In pipelines configured for continuous execution, each time the table processes an update, the most recent version of the static table is queried. \nThe following is an example of a stream-static join: \n```\n@dlt.table\ndef customer_sales():\n    return dlt.read_stream(\"sales\").join(dlt.read(\"customers\"), [\"customer_id\"], \"left\")\n\n``` \n```\nCREATE OR REFRESH STREAMING TABLE customer_sales\nAS SELECT * FROM STREAM(LIVE.sales)\nLEFT JOIN LIVE.customers USING (customer_id)\n\n```\n\n##### Transform data with Delta Live Tables\n###### Calculate aggregates efficiently\n\nYou can use streaming tables to incrementally calculate simple distributive aggregates like count, min, max, or sum, and algebraic aggregates like average or standard deviation. Databricks recommends incremental aggregation for queries with a limited number of groups, for example, a query with a `GROUP BY country` clause. Only new input data is read with each update. 
\nTo learn more about writing Delta Live Tables queries that perform incremental aggregations, see [Perform windowed aggregations with watermarks](https:\/\/docs.databricks.com\/delta-live-tables\/stateful-processing.html#stateful-aggregations).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/transform.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### Transform data with Delta Live Tables\n###### Use MLflow models in a Delta Live Tables pipeline\n\nNote \nTo use MLflow models in a Unity Catalog-enabled pipeline, your pipeline must be configured to use the `preview` channel. To use the `current` channel, you must configure your pipeline to publish to the Hive metastore. \nYou can use MLflow-trained models in Delta Live Tables pipelines. MLflow models are treated as transformations in Databricks, meaning they act upon a Spark DataFrame input and return results as a Spark DataFrame. Because Delta Live Tables defines datasets against DataFrames, you can convert Apache Spark workloads that leverage MLflow to Delta Live Tables with just a few lines of code. For more on MLflow, see [ML lifecycle management using MLflow](https:\/\/docs.databricks.com\/mlflow\/index.html). \nIf you already have a Python notebook calling an MLflow model, you can adapt this code to Delta Live Tables by using the `@dlt.table` decorator and ensuring functions are defined to return transformation results. Delta Live Tables does not install MLflow by default, so make sure you `%pip install mlflow` and import `mlflow` and `dlt` at the top of your notebook. For an introduction to Delta Live Tables syntax, see [Example: Ingest and process New York baby names data](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#python-example). \nTo use MLflow models in Delta Live Tables, complete the following steps: \n1. Obtain the run ID and model name of the MLflow model. 
The run ID and model name are used to construct the URI of the MLflow model.\n2. Use the URI to define a Spark UDF to load the MLflow model.\n3. Call the UDF in your table definitions to use the MLflow model. \nThe following example shows the basic syntax for this pattern: \n```\n%pip install mlflow\n\nimport dlt\nimport mlflow\n\nrun_id = \"<mlflow-run-id>\"\nmodel_name = \"<the-model-name-in-run>\"\nmodel_uri = f\"runs:\/{run_id}\/{model_name}\"\nloaded_model_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)\n\n@dlt.table\ndef model_predictions():\n    # Parentheses let the chained call continue onto the next line.\n    return (\n        dlt.read(<input-data>)\n        .withColumn(\"prediction\", loaded_model_udf(<model-features>))\n    )\n\n``` \nAs a complete example, the following code defines a Spark UDF named `loaded_model_udf` that loads an MLflow model trained on loan risk data. The data columns used to make the prediction are passed as an argument to the UDF. The table `loan_risk_predictions` calculates predictions for each row in `loan_risk_input_data`. \n```\n%pip install mlflow\n\nimport dlt\nimport mlflow\nfrom pyspark.sql.functions import struct\n\nrun_id = \"mlflow_run_id\"\nmodel_name = \"the_model_name_in_run\"\nmodel_uri = f\"runs:\/{run_id}\/{model_name}\"\nloaded_model_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)\n\ncategoricals = [\"term\", \"home_ownership\", \"purpose\",\n    \"addr_state\",\"verification_status\",\"application_type\"]\n\nnumerics = [\"loan_amnt\", \"emp_length\", \"annual_inc\", \"dti\", \"delinq_2yrs\",\n    \"revol_util\", \"total_acc\", \"credit_length_in_years\"]\n\nfeatures = categoricals + numerics\n\n@dlt.table(\n    comment=\"GBT ML predictions of loan risk\",\n    table_properties={\n        \"quality\": \"gold\"\n    }\n)\ndef loan_risk_predictions():\n    # Parentheses let the chained call continue onto the next line.\n    return (\n        dlt.read(\"loan_risk_input_data\")\n        .withColumn('predictions', loaded_model_udf(struct(features)))\n    )\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/transform.html"} +{"content":"# Databricks data engineering\n## What is Delta Live 
Tables?\n### Load and transform data with Delta Live Tables\n##### Transform data with Delta Live Tables\n###### Retain manual deletes or updates\n\nDelta Live Tables allows you to manually delete or update records from a table and do a refresh operation to recompute downstream tables. \nBy default, Delta Live Tables recomputes table results based on input data each time a pipeline is updated, so you must ensure the deleted record isn\u2019t reloaded from the source data. Setting the `pipelines.reset.allowed` table property to `false` prevents refreshes to a table but does not prevent incremental writes to the tables or prevent new data from flowing into the table. \nThe following diagram illustrates an example using two streaming tables: \n* `raw_user_table` ingests raw user data from a source.\n* `bmi_table` incrementally computes BMI scores using weight and height from `raw_user_table`. \nYou want to manually delete or update user records from the `raw_user_table` and recompute the `bmi_table`. 
\n![Retain data diagram](https:\/\/docs.databricks.com\/_images\/dlt-cookbook-disable-refresh.png) \nThe following code demonstrates setting the `pipelines.reset.allowed` table property to `false` to disable full refresh for `raw_user_table` so that intended changes are retained over time, but downstream tables are recomputed when a pipeline update is run: \n```\nCREATE OR REFRESH STREAMING TABLE raw_user_table\nTBLPROPERTIES(pipelines.reset.allowed = false)\nAS SELECT * FROM cloud_files(\"\/databricks-datasets\/iot-stream\/data-user\", \"csv\");\n\nCREATE OR REFRESH STREAMING TABLE bmi_table\nAS SELECT userid, (weight\/2.2) \/ pow(height*0.0254,2) AS bmi FROM STREAM(LIVE.raw_user_table);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/transform.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Tutorial: Configure S3 access with an instance profile\n\nNote \nThis article describes legacy patterns for configuring access to S3. Databricks recommends using Unity Catalog. See [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html). \nThis tutorial walks you through how to create an instance profile with read, write, update, and delete permissions on a single S3 bucket. You can grant privileges for multiple buckets using a single IAM role and instance profile. It is also possible to use instance profiles to grant only read and list permissions on S3. \nAdministrators configure IAM roles in AWS, link them to a Databricks workspace, and grant access to privileged users to associate instance profiles with compute. All users that have access to compute resources with an instance profile attached to it gain the privileges granted by the instance profile.\n\n#### Tutorial: Configure S3 access with an instance profile\n##### Before you begin\n\nThis tutorial is designed for workspace administrators. 
You must have sufficient privileges in the AWS account containing your Databricks workspace, and be a Databricks workspace administrator. \nThis tutorial assumes the following existing permissions and assets: \n* Privileges to edit the IAM role used to deploy the Databricks workspace.\n* Privileges to create new IAM roles in AWS.\n* Privileges to edit permissions on an S3 bucket.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Tutorial: Configure S3 access with an instance profile\n##### Step 1: Create an instance profile using the AWS console\n\n1. In the AWS console, go to the **IAM** service.\n2. Click the **Roles** tab in the sidebar.\n3. Click **Create role**. \n1. Under **Trusted entity type**, select **AWS service**.\n2. Under **Use case**, select **EC2**.\n3. Click **Next**.\n4. At the bottom of the page, click **Next**.\n5. In the **Role name** field, type a role name.\n6. Click **Create role**.\n4. In the role list, click the role.\n5. Add an inline policy to the role. This policy grants access to the S3 bucket. \n1. In the Permissions tab, click **Add permissions > Create inline policy**.\n2. Click the **JSON** tab.\n3. Copy this policy and set `<s3-bucket-name>` to the name of your bucket. \n```\n{\n\"Version\": \"2012-10-17\",\n\"Statement\": [\n{\n\"Effect\": \"Allow\",\n\"Action\": [\n\"s3:ListBucket\"\n],\n\"Resource\": [\n\"arn:aws:s3:::<s3-bucket-name>\"\n]\n},\n{\n\"Effect\": \"Allow\",\n\"Action\": [\n\"s3:PutObject\",\n\"s3:GetObject\",\n\"s3:DeleteObject\",\n\"s3:PutObjectAcl\"\n],\n\"Resource\": [\n\"arn:aws:s3:::<s3-bucket-name>\/*\"\n]\n}\n]\n}\n\n```\n4. Click **Review policy**.\n5. In the **Name** field, type a policy name.\n6. Click **Create policy**.\n6. In the role summary, copy the **Role ARN**. 
\n![Instance profile ARN](https:\/\/docs.databricks.com\/_images\/copy-instanceprofile-arn.png) \nNote \nIf you intend to enable [encryption](https:\/\/docs.databricks.com\/dbfs\/mounts.html#s3-encryption) for the S3 bucket, you must add the IAM role as a **Key User** for the KMS key provided in the configuration. See [Configure encryption for S3 with KMS](https:\/\/docs.databricks.com\/security\/keys\/kms-s3.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Tutorial: Configure S3 access with an instance profile\n##### Step 2: Enable the policy to work with serverless resources\n\nThis step ensures that your instance profile also works for configuring SQL warehouses with instance profiles. See [Enable data access configuration](https:\/\/docs.databricks.com\/admin\/sql\/data-access-configuration.html). \n1. In the role list, click your instance profile.\n2. Select the **Trust Relationships** tab.\n3. Click **Edit Trust Policy**.\n4. Within the existing `Statement` array, append the following JSON block to the end of the existing trust policy. Ensure that you don\u2019t overwrite the existing policy. \n```\n{\n\"Effect\": \"Allow\",\n\"Principal\": {\n\"AWS\": [\n\"arn:aws:iam::790110701330:role\/serverless-customer-resource-role\"\n]\n},\n\"Action\": \"sts:AssumeRole\",\n\"Condition\": {\n\"StringEquals\": {\n\"sts:ExternalId\": [\n\"databricks-serverless-<YOUR-WORKSPACE-ID1>\",\n\"databricks-serverless-<YOUR-WORKSPACE-ID2>\"\n]\n}\n}\n}\n\n``` \nThe only values you need to change in the statement are the workspace IDs. Replace the `<YOUR-WORKSPACE-ID1>` and `<YOUR-WORKSPACE-ID2>` placeholders with one or more Databricks [workspace IDs](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url) for the workspaces that will use this role. \nNote \nTo get your workspace ID, check the URL when you\u2019re using your workspace. 
For example, in `https:\/\/<databricks-instance>\/?o=6280049833385130`, the number after `o=` is the workspace ID. \nDo **not** edit the principal of the policy. The `Principal.AWS` field must continue to have the value `arn:aws:iam::790110701330:role\/serverless-customer-resource-role`. This references a serverless compute role managed by Databricks.\n5. Click **Review policy**.\n6. Click **Save changes**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Tutorial: Configure S3 access with an instance profile\n##### Step 3: Create the bucket policy\n\nAt a minimum, the S3 policy must include the `ListBucket` and `GetObject` actions, which provide read-only access to a bucket. Delta Lake uses `DeleteObject` and `PutObject` permissions during regular operations. The permissions in the example policy below are the recommended defaults for clusters that read and write data. \nNote \nS3 buckets have universally unique names and do not require an account ID for universal identification. If you choose to link an S3 bucket to an IAM role and Databricks workspace in a different AWS account, you must specify the account ID when configuring your S3 bucket policy. \n1. Go to your S3 console. From the **Buckets** list, select the bucket for which you want to create a policy.\n2. Click **Permissions**.\n3. Under **Bucket policy**, click **Edit**.\n4. Paste in a policy. A sample cross-account bucket IAM policy could be the following, replacing `<aws-account-id-databricks>` with the AWS account ID where the Databricks environment is deployed, `<iam-role-for-s3-access>` with the instance profile role, and `<s3-bucket-name>` with the bucket name. 
\n```\n{\n\"Version\": \"2012-10-17\",\n\"Statement\": [\n{\n\"Sid\": \"Example permissions\",\n\"Effect\": \"Allow\",\n\"Principal\": {\n\"AWS\": \"arn:aws:iam::<aws-account-id-databricks>:role\/<iam-role-for-s3-access>\"\n},\n\"Action\": [\n\"s3:GetBucketLocation\",\n\"s3:ListBucket\"\n],\n\"Resource\": \"arn:aws:s3:::<s3-bucket-name>\"\n},\n{\n\"Effect\": \"Allow\",\n\"Principal\": {\n\"AWS\": \"arn:aws:iam::<aws-account-id-databricks>:role\/<iam-role-for-s3-access>\"\n},\n\"Action\": [\n\"s3:PutObject\",\n\"s3:GetObject\",\n\"s3:DeleteObject\",\n\"s3:PutObjectAcl\"\n],\n\"Resource\": \"arn:aws:s3:::<s3-bucket-name>\/*\"\n}\n]\n}\n\n```\n5. Click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Tutorial: Configure S3 access with an instance profile\n##### Step 4: Locate the IAM role that created the Databricks deployment\n\nIf you don\u2019t know which IAM role created the Databricks deployment, do the following: \n1. As an account admin, log in to the [account console](https:\/\/accounts.cloud.databricks.com).\n2. Go to **Workspaces** and click your workspace name.\n3. In the **Credentials** box, note the role name at the end of the Role ARN. \nFor example, in the Role ARN `arn:aws:iam::123456789123:role\/finance-prod`, the role name is `finance-prod`.\n\n#### Tutorial: Configure S3 access with an instance profile\n##### Step 5: Add the S3 IAM role to the EC2 policy\n\n1. In the AWS console, go to the **IAM** service.\n2. Click the **Roles** tab in the sidebar.\n3. Click the role that created the Databricks deployment.\n4. On the **Permissions** tab, click the policy.\n5. Click **Edit Policy**.\n6. Append the following block to the end of the `Statement` array. Ensure that you don\u2019t overwrite any of the existing policy. 
Replace `<iam-role-for-s3-access>` with the role you created in [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html): \n```\n{\n\"Effect\": \"Allow\",\n\"Action\": \"iam:PassRole\",\n\"Resource\": \"arn:aws:iam::<aws-account-id-databricks>:role\/<iam-role-for-s3-access>\"\n}\n\n```\n7. Click **Review policy**.\n8. Click **Save changes**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Tutorial: Configure S3 access with an instance profile\n##### Step 6: Add the instance profile to Databricks\n\n1. As a workspace admin, go to the [settings page](https:\/\/docs.databricks.com\/admin\/index.html#admin-settings).\n2. Click the **Security** tab.\n3. Click **Manage** next to **Instance profiles**.\n4. Click **Add Instance Profile**.\n5. Paste your instance profile ARN into the **Instance profile ARN** field. If you don\u2019t have the ARN, see [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html).\n6. For [serverless SQL](https:\/\/docs.databricks.com\/admin\/sql\/serverless.html) to work with your instance profile, you might need to explicitly specify the role ARN associated with your instance profile in the **IAM role ARN** field. \nThis is only a required step if your instance profile\u2019s associated role name (the text after the last slash in the role ARN) and the instance profile name (the text after the last slash in the instance profile ARN) do not match. To confirm whether this applies to you: \n1. In the AWS console, go to the [IAM service\u2019s Roles tab](https:\/\/console.aws.amazon.com\/iam\/home#\/roles). It lists the IAM roles in your account.\n2. 
Click the role with the name that matches the instance profile name in the Databricks SQL admin settings in the **Data Security** section for the **Instance Profile** field that you found earlier in this section.\n3. In the summary area, find the **Role ARN** and **Instance Profile ARNs** fields and see if they match. \n![Does instance profile name and role arn name match](https:\/\/docs.databricks.com\/_images\/serverless-compute-aws-console-instance-profile-names.png)\n4. If they do not match, paste the role ARN into the **IAM role ARN** field. If the names match, you do not need to set the **IAM role ARN** field.\n5. Select the **Meta Instance Profile** property only if you are setting up [IAM credential passthrough](https:\/\/docs.databricks.com\/archive\/credential-passthrough\/iam-passthrough.html).\n7. Databricks validates that the instance profile ARN is both syntactically and semantically correct. To validate semantic correctness, Databricks does a dry run by launching a cluster with this instance profile. Any failure in this dry run produces a validation error in the UI. Validation of the instance profile can fail if the instance profile contains the `tag-enforcement` policy, preventing you from adding a legitimate instance profile. If the validation fails and you still want to add the instance profile, select the **Skip Validation** checkbox.\n8. Click **Add**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Tutorial: Configure S3 access with an instance profile\n##### Manage instance profiles\n\nWorkspace admins can manage access to instance profiles and update them. 
See [Manage instance profiles in Databricks](https:\/\/docs.databricks.com\/admin\/workspace-settings\/manage-instance-profiles.html).\n\n#### Tutorial: Configure S3 access with an instance profile\n##### Next steps\n\n* [Enable data access configuration](https:\/\/docs.databricks.com\/admin\/sql\/data-access-configuration.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html"} +{"content":"# Query data\n## Data format options\n#### Read Parquet files using Databricks\n\nThis article shows you how to read data from Apache Parquet files using Databricks.\n\n#### Read Parquet files using Databricks\n##### What is Parquet?\n\n[Apache Parquet](https:\/\/parquet.apache.org\/) is a columnar file format with optimizations that speed up queries. It\u2019s a more efficient file format than [CSV](https:\/\/docs.databricks.com\/query\/formats\/csv.html) or [JSON](https:\/\/docs.databricks.com\/query\/formats\/json.html). \nFor more information, see [Parquet Files](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-parquet.html).\n\n#### Read Parquet files using Databricks\n##### Options\n\nSee the following Apache Spark reference articles for supported read and write options. 
\n* Read \n+ [Python](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrameReader.parquet.html?highlight=parquet#pyspark.sql.DataFrameReader.parquet)\n+ [Scala](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/sql\/DataFrameReader.html#parquet(paths:String*):org.apache.spark.sql.DataFrame)\n* Write \n+ [Python](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrameWriter.parquet.html?highlight=parquet#pyspark.sql.DataFrameWriter.parquet)\n+ [Scala](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/sql\/DataFrameWriter.html#parquet(path:String):Unit)\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/parquet.html"} +{"content":"# Query data\n## Data format options\n#### Read Parquet files using Databricks\n##### Notebook example: Read and write to Parquet files\n\nThe following notebook shows how to read and write data to Parquet files. \n### Reading Parquet files notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/read-parquet-files.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/parquet.html"} +{"content":"# Databricks data engineering\n### Introduction to Databricks Workflows\n\nDatabricks Workflows orchestrates data processing, machine learning, and analytics pipelines on the Databricks Data Intelligence Platform. Workflows has fully managed orchestration services integrated with the Databricks platform, including Databricks Jobs to run non-interactive code in your Databricks workspace and Delta Live Tables to build reliable and maintainable ETL pipelines. 
\nTo learn more about the benefits of orchestrating your workflows with the Databricks platform, see [Databricks Workflows](https:\/\/www.databricks.com\/product\/workflows).\n\n### Introduction to Databricks Workflows\n#### An example Databricks workflow\n\nThe following diagram illustrates a workflow that is orchestrated by a Databricks job to: \n1. Run a Delta Live Tables pipeline that ingests raw clickstream data from cloud storage, cleans and prepares the data, sessionizes the data, and persists the final sessionized data set to Delta Lake.\n2. Run a Delta Live Tables pipeline that ingests order data from cloud storage, cleans and transforms the data for processing, and persists the final data set to Delta Lake.\n3. Join the order and sessionized clickstream data to create a new data set for analysis.\n4. Extract features from the prepared data.\n5. Perform tasks in parallel to persist the features and train a machine learning model. \n![Diagram illustrating an example workflow](https:\/\/docs.databricks.com\/_images\/example-workflow-diagram.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/index.html"} +{"content":"# Databricks data engineering\n### Introduction to Databricks Workflows\n#### What is Databricks Jobs?\n\nA Databricks job is a way to run your data processing and analysis applications in a Databricks workspace. Your job can consist of a single task or can be a large, multi-task workflow with complex dependencies. Databricks manages the task orchestration, cluster management, monitoring, and error reporting for all of your jobs. You can run your jobs immediately, periodically through an easy-to-use scheduling system, whenever new files arrive in an external location, or continuously to ensure an instance of the job is always running. You can also run jobs interactively in the [notebook UI](https:\/\/docs.databricks.com\/notebooks\/index.html). 
\nYou can create and run a job using the Jobs UI, the Databricks CLI, or the Jobs API. You can repair and re-run a failed or canceled job using the UI or API. You can monitor job run results using the UI, CLI, API, and notifications (for example, email, webhook destination, or Slack notifications). \nTo learn about using the Databricks CLI, see [What is the Databricks CLI?](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). To learn about using the Jobs API, see the [Jobs API](https:\/\/docs.databricks.com\/api\/workspace\/jobs). \nThe following sections cover important features of Databricks Jobs. \nImportant \n* A workspace is limited to 1000 concurrent task runs. A `429 Too Many Requests` response is returned when you request a run that cannot start immediately.\n* The number of jobs a workspace can create in an hour is limited to 10000 (includes \u201cruns submit\u201d). This limit also affects jobs created by the REST API and notebook workflows. \n### Implement data processing and analysis with job tasks \nYou implement your data processing and analysis workflow using *tasks*. A job is composed of one or more tasks. You can create job tasks that run notebooks, JARs, Delta Live Tables pipelines, or Python, Scala, Spark Submit, and Java applications. Your job tasks can also orchestrate Databricks SQL queries, alerts, and dashboards to create analyses and visualizations, or you can use the dbt task to run dbt transformations in your workflow. Legacy Spark Submit applications are also supported. \nYou can also add a task to a job that runs a different job. This feature allows you to break a large process into multiple smaller jobs, or create generalized modules that can be reused by multiple jobs. \nYou control the execution order of tasks by specifying dependencies between the tasks. You can configure tasks to run in sequence or parallel. 
\n### Run jobs interactively, continuously, or using job triggers \nYou can run your jobs interactively from the Jobs UI, API, or CLI or you can run a [continuous job](https:\/\/docs.databricks.com\/workflows\/jobs\/schedule-jobs.html#continuous-jobs). You can [create a schedule](https:\/\/docs.databricks.com\/workflows\/jobs\/schedule-jobs.html#job-schedule) to run your job periodically or run your job when [new files arrive](https:\/\/docs.databricks.com\/workflows\/jobs\/file-arrival-triggers.html) in an external location such as Amazon S3, Azure storage or Google Cloud storage. \n### Monitor job progress with notifications \nYou can receive notifications when a job or task starts, completes, or fails. You can send notifications to one or more email addresses or system destinations (for example, webhook destinations or Slack). See [Add email and system notifications for job events](https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html). \n### Run your jobs with Databricks compute resources \nDatabricks clusters and SQL warehouses provide the computation resources for your jobs. You can run your jobs with a job cluster, an all-purpose cluster, or a SQL warehouse: \n* A job cluster is a dedicated cluster for your job or individual job tasks. Your job can use a job cluster that\u2019s shared by all tasks or you can configure a cluster for individual tasks when you create or edit a task. A job cluster is created when the job or task starts and terminated when the job or task ends.\n* An all-purpose cluster is a shared cluster that is manually started and terminated and can be shared by multiple users and jobs. \nTo optimize resource usage, Databricks recommends using a job cluster for your jobs. To reduce the time spent waiting for cluster startup, consider using an all-purpose cluster. See [Use Databricks compute with your jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/use-compute.html). 
\nYou use a SQL warehouse to run Databricks SQL tasks such as queries, dashboards, or alerts. You can also use a SQL warehouse to run dbt transformations with the dbt task. \n### Next steps \nTo get started with Databricks Jobs: \n* Create your first Databricks job with the [quickstart](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-quickstart.html).\n* Learn how to create and run workflows with the Databricks Jobs [user interface](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html). \n* Learn how to run a job without having to configure Databricks compute resources with [serverless workflows](https:\/\/docs.databricks.com\/workflows\/jobs\/run-serverless-jobs.html). \n* Learn about [monitoring job runs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html) in the Databricks Jobs user interface.\n* Learn about [configuration options](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html) for jobs. \nLearn more about building, managing, and troubleshooting workflows with Databricks Jobs: \n* Learn how to communicate information between tasks in a Databricks job with [task values](https:\/\/docs.databricks.com\/workflows\/jobs\/share-task-context.html).\n* Learn how to pass context about job runs into job tasks with [task parameter variables](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html).\n* Learn how to configure your job tasks to run [conditionally](https:\/\/docs.databricks.com\/workflows\/jobs\/conditional-tasks.html) based on the status of the task\u2019s dependencies.\n* Learn how to [troubleshoot and fix failed](https:\/\/docs.databricks.com\/workflows\/jobs\/repair-job-failures.html) jobs.\n* Get notified when your job runs start, complete or fail with [job run notifications](https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html).\n* Trigger your jobs on a [custom schedule or run a continuous 
job](https:\/\/docs.databricks.com\/workflows\/jobs\/schedule-jobs.html).\n* Learn how to run your Databricks job when new data arrives with [file arrival triggers](https:\/\/docs.databricks.com\/workflows\/jobs\/file-arrival-triggers.html).\n* Learn how to use [Databricks compute resources](https:\/\/docs.databricks.com\/workflows\/jobs\/use-compute.html) to run your jobs.\n* Learn about [updates to the Jobs API](https:\/\/docs.databricks.com\/workflows\/jobs\/jobs-api-updates.html) to support creating and managing workflows with Databricks jobs.\n* Use [how-to guides and tutorials](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/index.html) to learn more about implementing data workflows with Databricks Jobs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/index.html"} +{"content":"# Databricks data engineering\n### Introduction to Databricks Workflows\n#### What is Delta Live Tables?\n\nDelta Live Tables is a framework that simplifies ETL and streaming data processing. Delta Live Tables provides efficient ingestion of data with built-in support for [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html), SQL and Python interfaces that support declarative implementation of data transformations, and support for writing transformed data to Delta Lake. You define the transformations to perform on your data, and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling. \nTo get started, see [What is Delta Live Tables?](https:\/\/docs.databricks.com\/delta-live-tables\/index.html).\n\n### Introduction to Databricks Workflows\n#### Databricks Jobs and Delta Live Tables\n\nDatabricks Jobs and Delta Live Tables provide a comprehensive framework for building and deploying end-to-end data processing and analysis workflows. \nUse Delta Live Tables for all ingestion and transformation of data. 
Use Databricks Jobs to orchestrate workloads composed of a single task or multiple data processing and analysis tasks on the Databricks platform, including Delta Live Tables ingestion and transformation. \nAs a workflow orchestration system, Databricks Jobs also supports: \n* Running jobs on a triggered basis, for example, running a workflow on a schedule.\n* Data analysis through SQL queries, machine learning and data analysis with notebooks, scripts, or external libraries, and so forth.\n* Running a job composed of a single task, for example, running an Apache Spark job packaged in a JAR.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/index.html"} +{"content":"# Databricks data engineering\n### Introduction to Databricks Workflows\n#### Workflow orchestration with Apache Airflow\n\nAlthough Databricks recommends using Databricks Jobs to orchestrate your data workflows, you can also use [Apache Airflow](https:\/\/airflow.apache.org\/) to manage and schedule your data workflows. With Airflow, you define your workflow in a Python file, and Airflow manages scheduling and running the workflow. See [Orchestrate Databricks jobs with Apache Airflow](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-airflow-with-jobs.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Manage model lifecycle in Unity Catalog\n##### Upgrade models to Unity Catalog\n\nDatabricks recommends deploying ML pipelines as code, rather than deploying individual ML models. The recommended approach for migration is to [upgrade ML pipelines](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html) to use models in Unity Catalog. \nIn some cases, you might want to migrate individual models to Unity Catalog for initial testing, or for particularly critical models. 
The notebook demonstrates how to upgrade existing models to Unity Catalog.\n\n##### Upgrade models to Unity Catalog\n###### Upgrade Models to Unity Catalog\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/upgrade-models-to-unity-catalog.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-models.html"} +{"content":"# What is Delta Lake?\n### Configure Delta Lake to control data file size\n\nDelta Lake provides options for manually or automatically configuring the target file size for writes and for `OPTIMIZE` operations. Databricks automatically tunes many of these settings, and enables features that automatically improve table performance by seeking to right-size files. \nNote \nIn Databricks Runtime 13.3 and above, Databricks recommends using clustering for Delta table layout. See [Use liquid clustering for Delta tables](https:\/\/docs.databricks.com\/delta\/clustering.html). \nDatabricks recommends using predictive optimization to automatically run `OPTIMIZE` and `VACUUM` for Delta tables. See [Predictive optimization for Delta Lake](https:\/\/docs.databricks.com\/optimizations\/predictive-optimization.html). \nIn Databricks Runtime 10.4 LTS and above, auto compaction and optimized writes are always enabled for `MERGE`, `UPDATE`, and `DELETE` operations. You cannot disable this functionality. \nUnless otherwise specified, all recommendations in this article do not apply to Unity Catalog managed tables running the latest runtimes. \nFor Unity Catalog managed tables, Databricks tunes most of these configurations automatically if you\u2019re using a SQL warehouse or Databricks Runtime 11.3 LTS or above. 
\nIf you\u2019re upgrading a workload from Databricks Runtime 10.4 LTS or below, see [Upgrade to background auto compaction](https:\/\/docs.databricks.com\/delta\/tune-file-size.html#upgrade).\n\n### Configure Delta Lake to control data file size\n#### When to run `OPTIMIZE`\n\nAuto compaction and optimized writes each reduce small file problems, but are not a full replacement for `OPTIMIZE`. Especially for tables larger than 1 TB, Databricks recommends running `OPTIMIZE` on a schedule to further consolidate files. Databricks does not automatically run `ZORDER` on tables, so you must run `OPTIMIZE` with `ZORDER` to enable enhanced data skipping. See [Data skipping for Delta Lake](https:\/\/docs.databricks.com\/delta\/data-skipping.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tune-file-size.html"} +{"content":"# What is Delta Lake?\n### Configure Delta Lake to control data file size\n#### What is auto optimize on Databricks?\n\nThe term *auto optimize* is sometimes used to describe functionality controlled by the settings `delta.autoOptimize.autoCompact` and `delta.autoOptimize.optimizeWrite`. This term has been retired in favor of describing each setting individually. See [Auto compaction for Delta Lake on Databricks](https:\/\/docs.databricks.com\/delta\/tune-file-size.html#auto-compact) and [Optimized writes for Delta Lake on Databricks](https:\/\/docs.databricks.com\/delta\/tune-file-size.html#optimized-writes).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tune-file-size.html"} +{"content":"# What is Delta Lake?\n### Configure Delta Lake to control data file size\n#### Auto compaction for Delta Lake on Databricks\n\nAuto compaction combines small files within Delta table partitions to automatically reduce small file problems. Auto compaction occurs after a write to a table has succeeded and runs synchronously on the cluster that has performed the write. Auto compaction only compacts files that haven\u2019t been compacted previously. 
\nYou can control the output file size by setting the [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration) `spark.databricks.delta.autoCompact.maxFileSize`. Databricks recommends using autotuning based on workload or table size. See [Autotune file size based on workload](https:\/\/docs.databricks.com\/delta\/tune-file-size.html#autotune-workload) and [Autotune file size based on table size](https:\/\/docs.databricks.com\/delta\/tune-file-size.html#autotune-table). \nAuto compaction is only triggered for partitions or tables that have at least a certain number of small files. You can optionally change the minimum number of files required to trigger auto compaction by setting `spark.databricks.delta.autoCompact.minNumFiles`. \nAuto compaction can be enabled at the table or session level using the following settings: \n* Table property: `delta.autoOptimize.autoCompact`\n* SparkSession setting: `spark.databricks.delta.autoCompact.enabled` \nThese settings accept the following options: \n| Options | Behavior |\n| --- | --- |\n| `auto` (recommended) | Tunes target file size while respecting other autotuning functionality. Requires Databricks Runtime 10.4 LTS or above. |\n| `legacy` | Alias for `true`. Requires Databricks Runtime 10.4 LTS or above. |\n| `true` | Use 128 MB as the target file size. No dynamic sizing. |\n| `false` | Turns off auto compaction. Can be set at the session level to override auto compaction for all Delta tables modified in the workload. | \nImportant \nIn Databricks Runtime 9.1 LTS, when other writers perform operations like `DELETE`, `MERGE`, `UPDATE`, or `OPTIMIZE` concurrently, auto compaction can cause those other jobs to fail with a transaction conflict. 
This is not an issue in Databricks Runtime 10.4 LTS and above.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tune-file-size.html"} +{"content":"# What is Delta Lake?\n### Configure Delta Lake to control data file size\n#### Optimized writes for Delta Lake on Databricks\n\nOptimized writes improve file size as data is written and benefit subsequent reads on the table. \nOptimized writes are most effective for partitioned tables, as they reduce the number of small files written to each partition. Writing fewer large files is more efficient than writing many small files, but you might still see an increase in write latency because data is shuffled before being written. \nThe following image demonstrates how optimized writes works: \n![Optimized writes](https:\/\/docs.databricks.com\/_images\/optimized-writes.png) \nNote \nYou might have code that runs `coalesce(n)` or `repartition(n)` just before you write out your data to control the number of files written. Optimized writes eliminates the need to use this pattern. \nOptimized writes are enabled by default for the following operations in Databricks Runtime 9.1 LTS and above: \n* `MERGE`\n* `UPDATE` with subqueries\n* `DELETE` with subqueries \nOptimized writes are also enabled for `CTAS` statements and `INSERT` operations when using SQL warehouses. In Databricks Runtime 13.3 LTS and above, all Delta tables registered in Unity Catalog have optimized writes enabled for `CTAS` statements and `INSERT` operations for partitioned tables. \nOptimized writes can be enabled at the table or session level using the following settings: \n* Table setting: `delta.autoOptimize.optimizeWrite`\n* SparkSession setting: `spark.databricks.delta.optimizeWrite.enabled` \nThese settings accept the following options: \n| Options | Behavior |\n| --- | --- |\n| `true` | Use 128 MB as the target file size. |\n| `false` | Turns off optimized writes. 
Can be set at the session level to override optimized writes for all Delta tables modified in the workload. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tune-file-size.html"} +{"content":"# What is Delta Lake?\n### Configure Delta Lake to control data file size\n#### Set a target file size\n\nIf you want to tune the size of files in your Delta table, set the [table property](https:\/\/docs.databricks.com\/delta\/table-properties.html) `delta.targetFileSize` to the desired size. If this property is set, all data layout optimization operations will make a best-effort attempt to generate files of the specified size. Examples include [optimize](https:\/\/docs.databricks.com\/delta\/optimize.html), [Z-order](https:\/\/docs.databricks.com\/delta\/data-skipping.html), [auto compaction](https:\/\/docs.databricks.com\/delta\/tune-file-size.html#auto-compact), and [optimized writes](https:\/\/docs.databricks.com\/delta\/tune-file-size.html#optimized-writes). \nNote \nWhen using Unity Catalog managed tables and SQL warehouses or Databricks Runtime 11.3 LTS and above, only `OPTIMIZE` commands respect the `targetFileSize` setting. \n| Table property |\n| --- |\n| **delta.targetFileSize** Type: Size in bytes or higher units. The target file size. For example, `104857600` (bytes) or `100mb`. Default value: None | \nFor existing tables, you can set and unset properties using the SQL command [ALTER TABLE SET TBL PROPERTIES](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table.html). You can also set these properties automatically when creating new tables using Spark session configurations. 
See [Delta table properties reference](https:\/\/docs.databricks.com\/delta\/table-properties.html) for details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tune-file-size.html"} +{"content":"# What is Delta Lake?\n### Configure Delta Lake to control data file size\n#### Autotune file size based on workload\n\nDatabricks recommends setting the table property `delta.tuneFileSizesForRewrites` to `true` for all tables that are targeted by many `MERGE` or DML operations, regardless of Databricks Runtime, Unity Catalog, or other optimizations. When set to `true`, the target file size for the table is set to a much lower threshold, which accelerates write-intensive operations. \nIf not explicitly set, Databricks automatically detects if 9 out of last 10 previous operations on a Delta table were `MERGE` operations and sets this table property to `true`. You must explicitly set this property to `false` to avoid this behavior. \n| Table property |\n| --- |\n| **delta.tuneFileSizesForRewrites** Type: `Boolean` Whether to tune file sizes for data layout optimization. Default value: None | \nFor existing tables, you can set and unset properties using the SQL command [ALTER TABLE SET TBL PROPERTIES](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table.html). You can also set these properties automatically when creating new tables using Spark session configurations. See [Delta table properties reference](https:\/\/docs.databricks.com\/delta\/table-properties.html) for details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tune-file-size.html"} +{"content":"# What is Delta Lake?\n### Configure Delta Lake to control data file size\n#### Autotune file size based on table size\n\nTo minimize the need for manual tuning, Databricks automatically tunes the file size of Delta tables based on the size of the table. 
Databricks will use smaller file sizes for smaller tables and larger file sizes for larger tables so that the number of files in the table does not grow too large. Databricks does not autotune tables that you have tuned with a [specific target size](https:\/\/docs.databricks.com\/delta\/tune-file-size.html#set-target-size) or based on a workload with frequent rewrites. \nThe target file size is based on the current size of the Delta table. For tables smaller than 2.56 TB, the autotuned target file size is 256 MB. For tables with a size between 2.56 TB and 10 TB, the target size will grow linearly from 256 MB to 1 GB. For tables larger than 10 TB, the target file size is 1 GB. \nNote \nWhen the target file size for a table grows, existing files are not re-optimized into larger files by the `OPTIMIZE` command. A large table can therefore always have some files that are smaller than the target size. If it is required to optimize those smaller files into larger files as well, you can configure a fixed target file size for the table using the `delta.targetFileSize` table property. \nWhen a table is written incrementally, the target file sizes and file counts will be close to the following numbers, based on table size. The file counts in this table are only an example. The actual results will be different depending on many factors. 
\n| Table size | Target file size | Approximate number of files in table |\n| --- | --- | --- |\n| 10 GB | 256 MB | 40 |\n| 1 TB | 256 MB | 4096 |\n| 2.56 TB | 256 MB | 10240 |\n| 3 TB | 307 MB | 12108 |\n| 5 TB | 512 MB | 17339 |\n| 7 TB | 716 MB | 20784 |\n| 10 TB | 1 GB | 24437 |\n| 20 TB | 1 GB | 34437 |\n| 50 TB | 1 GB | 64437 |\n| 100 TB | 1 GB | 114437 |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tune-file-size.html"} +{"content":"# What is Delta Lake?\n### Configure Delta Lake to control data file size\n#### Limit rows written in a data file\n\nOccasionally, tables with narrow data might encounter an error where the number of rows in a given data file exceeds the support limits of the Parquet format. To avoid this error, you can use the SQL session configuration `spark.sql.files.maxRecordsPerFile` to specify the maximum number of records to write to a single file for a Delta Lake table. Specifying a value of zero or a negative value represents no limit. \nIn Databricks Runtime 11.3 LTS and above, you can also use the DataFrameWriter option `maxRecordsPerFile` when using the DataFrame APIs to write to a Delta Lake table. When `maxRecordsPerFile` is specified, the value of the SQL session configuration `spark.sql.files.maxRecordsPerFile` is ignored. \nNote \nDatabricks does not recommend using this option unless it is necessary to avoid the aforementioned error. This setting might still be necessary for some Unity Catalog managed tables with very narrow data.\n\n### Configure Delta Lake to control data file size\n#### Upgrade to background auto compaction\n\nBackground auto compaction is available for Unity Catalog managed tables in Databricks Runtime 11.3 LTS and above. 
When migrating a legacy workload or table, do the following: \n* Remove the Spark config `spark.databricks.delta.autoCompact.enabled` from cluster or notebook configuration settings.\n* For each table, run `ALTER TABLE <table_name> UNSET TBLPROPERTIES (delta.autoOptimize.autoCompact)` to remove any legacy auto compaction settings. \nAfter removing these legacy configurations, you should see background auto compaction triggered automatically for all Unity Catalog managed tables.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/tune-file-size.html"} +{"content":"# Compute\n## Use compute\n### Troubleshoot compute issues\n##### Debugging with the Apache Spark UI\n\nThis article outlines different debugging options available to peek at the internals of your Apache Spark application. The three important places to look are: \n* Spark UI\n* Driver logs\n* Executor logs \nSee [Diagnose cost and performance issues using the Spark UI](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/index.html) to walk through diagnosing cost and performance issues using the Spark UI.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/troubleshooting\/debugging-spark-ui.html"} +{"content":"# Compute\n## Use compute\n### Troubleshoot compute issues\n##### Debugging with the Apache Spark UI\n###### Spark UI\n\nOnce you start the job, the Spark UI shows information about what\u2019s happening in your application. To get to the Spark UI, click the attached compute: \n### Streaming tab \nOnce you get to the Spark UI, you will see a Streaming tab if a streaming job is running in this compute. If there is no streaming job running in this compute, this tab will not be visible. You can skip to [Driver logs](https:\/\/docs.databricks.com\/compute\/troubleshooting\/debugging-spark-ui.html#driver-logs) to learn how to check for exceptions that might have happened while starting the streaming job. 
\nThe first thing to check on this page is whether your streaming application is receiving any input events from your source. In this case, you can see the job receives 1000 events\/second. \nIf you have an application that receives multiple input streams, you can click the **Input Rate** link, which shows the number of events received for each receiver. \n### Processing time \nAs you scroll down, find the graph for **Processing Time**. This is one of the key graphs to understand the performance of your streaming job. As a general rule of thumb, it is good if you can process each batch within 80% of your batch interval. \nFor this application, the batch interval was 2 seconds. The average processing time is 450 ms, which is well under the batch interval. If the average processing time is close to or greater than your batch interval, batches start to queue up, and the resulting backlog can eventually bring down your streaming job. \n### Completed batches \nTowards the end of the page, you will see a list of all the completed batches. The page displays details about the last 1000 batches that completed. From the table, you can get the number of events processed for each batch and their processing time. If you want to know more about what happened on one of the batches, you can click the batch link to get to the Batch Details Page. \n### Batch details page \nThis page has all the details you want to know about a batch. Two key things are: \n* Input: Has details about the input to the batch. In this case, it has details about the Apache Kafka topic, partition and offsets read by Spark Structured Streaming for this batch. In the case of TextFileStream, you see a list of the file names that were read for this batch. This is the best way to start debugging a Streaming application reading from text files.\n* Processing: You can click the link to the Job ID which has all the details about the processing done during this batch. 
\n### Job details page \nThe job details page shows a DAG visualization. This is very useful for understanding the order of operations and dependencies for every batch. In this case, you can see that the batch read input from a Kafka direct stream, followed by a flat map operation and then a map operation. The resulting stream was then used to update a global state using `updateStateByKey`. (The grayed boxes represent skipped stages. Spark is smart enough to skip some stages if they don\u2019t need to be recomputed. If the data is checkpointed or cached, then Spark skips recomputing those stages. In this case, those stages correspond to the dependency on previous batches because of `updateStateByKey`. Because Spark Structured Streaming internally checkpoints the stream and reads from the checkpoint instead of depending on the previous batches, they are shown as grayed stages.) \nAt the bottom of the page, you will also find the list of jobs that were executed for this batch. You can click the links in the description to drill further into the task level execution. \n### Task details page \nThis is the most granular level of debugging you can get into from the Spark UI for a Spark application. This page has all the tasks that were executed for this batch. If you are investigating performance issues of your streaming application, this page provides information such as the number of tasks that were executed, where they were executed (on which executors), and shuffle information. \nTip \nEnsure that the tasks are executed on multiple executors (nodes) in your compute to have enough parallelism while processing. If you have a single receiver, sometimes only one executor might be doing all the work even though you have more than one executor in your compute. \n### Thread dump \nA thread dump shows a snapshot of a JVM\u2019s thread states. \nThread dumps are useful in debugging a specific hanging or slow-running task. 
To view a specific task\u2019s thread dump in the Spark UI: \n1. Click the **Jobs** tab.\n2. In the **Jobs** table, find the target job that corresponds to the thread dump you want to see, and click the link in the **Description** column.\n3. In the job\u2019s **Stages** table, find the target stage that corresponds to the thread dump you want to see, and click the link in the **Description** column.\n4. In the stage\u2019s **Tasks** list, find the target task that corresponds to the thread dump you want to see, and note its **Task ID** and **Executor ID** values.\n5. Click the **Executors** tab.\n6. In the **Executors** table, find the row that contains the **Executor ID** value that corresponds to the **Executor ID** value that you noted earlier. In that row, click the link in the **Thread Dump** column.\n7. In the **Thread dump for executor** table, click the row where the **Thread Name** column contains **TID** followed by the **Task ID** value that you noted earlier. (If the task has finished running, you will not find a matching thread). The task\u2019s thread dump is shown. \nThread dumps are also useful for debugging issues where the driver appears to be hanging (for example, no Spark progress bars are showing) or making no progress on queries (for example, Spark progress bars are stuck at 100%). To view the driver\u2019s thread dump in the Spark UI: \n1. Click the **Executors** tab.\n2. In the **Executors** table, in the **driver** row, click the link in the **Thread Dump** column. The driver\u2019s thread dump is shown.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/troubleshooting\/debugging-spark-ui.html"} +{"content":"# Compute\n## Use compute\n### Troubleshoot compute issues\n##### Debugging with the Apache Spark UI\n###### Driver logs\n\nDriver logs are helpful for 2 purposes: \n* Exceptions: Sometimes, you may not see the Streaming tab in the Spark UI. This is because the Streaming job was not started because of some exception. 
You can drill into the Driver logs to look at the stack trace of the exception. In some cases, the streaming job starts properly, but the batches never reach the Completed batches section. They might all be in a processing or failed state. In such cases too, driver logs can help you understand the nature of the underlying issues.\n* Prints: Any print statements that are part of the DAG show up in the logs too.\n\n##### Debugging with the Apache Spark UI\n###### Executor logs\n\nExecutor logs are sometimes helpful if you see certain tasks are misbehaving and would like to see the logs for specific tasks. From the task details page shown above, you can get the executor where the task was run. Once you have that, you can go to the compute UI page, click the # nodes, and then the master. The master page lists all the workers. You can choose the worker where the suspicious task was run and then get to the log4j output.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/troubleshooting\/debugging-spark-ui.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n### Customer-managed keys for encryption\n##### Configure customer-managed keys for encryption\n\nAccount admins can use the Databricks account console to configure customer-managed keys for encryption. You can also configure customer-managed keys using the [Account Key Configurations API](https:\/\/docs.databricks.com\/api\/account\/encryptionkeys). \nThere are two Databricks use cases for adding a customer-managed key: \n* Managed services data in the [Databricks control plane](https:\/\/docs.databricks.com\/getting-started\/overview.html) (notebooks, secrets, and Databricks SQL query data). \n* Workspace storage (your workspace storage bucket and the EBS volumes of compute resources in the classic compute plane). \nNote \nCustomer-managed keys for EBS volumes do *not* apply to serverless compute resources. 
Disks for serverless compute resources are short-lived and tied to the lifecycle of the serverless workload. When compute resources are stopped or scaled down, the VMs and their storage are destroyed. \nTo compare the customer-managed key use cases, see [Compare customer-managed keys use cases](https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html#compare). \nFor a list of regions that support customer-managed keys, see [Databricks clouds and regions](https:\/\/docs.databricks.com\/resources\/supported-regions.html). This feature requires the [Enterprise pricing tier](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n\n##### Configure customer-managed keys for encryption\n###### What is an encryption keys configuration?\n\nCustomer-managed keys are managed with encryption keys configurations. Encryption keys configurations are account-level objects that reference your cloud\u2019s key. \nAccount admins create encryption keys configurations in the account console and an encryption keys configuration can be attached to one or more workspaces. \nYou can share a Databricks key configuration object between the two different encryption use cases (managed services and workspace storage). \nYou can add an encryption keys configuration to your Databricks workspace during workspace creation or you can update an existing workspace with an encryption key configuration.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/configure-customer-managed-keys.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n### Customer-managed keys for encryption\n##### Configure customer-managed keys for encryption\n###### Step 1: Create or select a key in AWS KMS\n\nYou can use the same AWS KMS key between the workspace storage and managed services use cases. \n1. 
Create or select a symmetric key in AWS KMS, following the instructions in [Creating symmetric CMKs](https:\/\/docs.aws.amazon.com\/kms\/latest\/developerguide\/create-keys.html#create-symmetric-cmk) or [Viewing keys](https:\/\/docs.aws.amazon.com\/kms\/latest\/developerguide\/viewing-keys.html). \nThe KMS key must be in the same AWS region as your workspace. \n2. Copy these values, which you need in a later step: \n* **Key ARN**: Get the ARN from the console or the API (the `Arn` field in the JSON response).\n* **Key alias**: An alias specifies a display name for the CMK in AWS KMS.\n3. On the **Key policy** tab, switch to the policy view. Edit the key policy to add the text below so that Databricks can use the key to perform encryption and decryption operations. \nUse the policy below that matches your encryption use case. \nAdd the JSON to your key policy in the `\"Statement\"` section. Do not delete the existing key policies. \nThe policy uses the Databricks AWS account ID `414351767826`. If you are using [Databricks on AWS GovCloud](https:\/\/docs.databricks.com\/security\/privacy\/gov-cloud.html), use the Databricks account ID `044793339203`. \nTo allow Databricks to encrypt cluster EBS volumes, replace the `<cross-account-iam-role-arn>` in the policy with the ARN for the cross-account IAM role that you created to allow Databricks to access your account. This is the same Role ARN that you use to register a Databricks credential configuration for a Databricks workspace. 
\n```\n{\n\"Sid\": \"Allow Databricks to use KMS key for DBFS\",\n\"Effect\": \"Allow\",\n\"Principal\":{\n\"AWS\":\"arn:aws:iam::414351767826:root\"\n},\n\"Action\": [\n\"kms:Encrypt\",\n\"kms:Decrypt\",\n\"kms:ReEncrypt*\",\n\"kms:GenerateDataKey*\",\n\"kms:DescribeKey\"\n],\n\"Resource\": \"*\",\n\"Condition\": {\n\"StringEquals\": {\n\"aws:PrincipalTag\/DatabricksAccountId\": [\"<databricks-account-id>(s)\"]\n}\n}\n},\n{\n\"Sid\": \"Allow Databricks to use KMS key for DBFS (Grants)\",\n\"Effect\": \"Allow\",\n\"Principal\":{\n\"AWS\":\"arn:aws:iam::414351767826:root\"\n},\n\"Action\": [\n\"kms:CreateGrant\",\n\"kms:ListGrants\",\n\"kms:RevokeGrant\"\n],\n\"Resource\": \"*\",\n\"Condition\": {\n\"Bool\": {\n\"kms:GrantIsForAWSResource\": \"true\"\n},\n\"StringEquals\": {\n\"aws:PrincipalTag\/DatabricksAccountId\": [\"<databricks-account-id>(s)\"]\n}\n}\n},\n{\n\"Sid\": \"Allow Databricks to use KMS key for managed services in the control plane\",\n\"Effect\": \"Allow\",\n\"Principal\": {\n\"AWS\": \"arn:aws:iam::414351767826:root\"\n},\n\"Action\": [\n\"kms:Encrypt\",\n\"kms:Decrypt\"\n],\n\"Resource\": \"*\",\n\"Condition\": {\n\"StringEquals\": {\n\"aws:PrincipalTag\/DatabricksAccountId\": [\"<databricks-account-id>(s)\"]\n}\n}\n},\n{\n\"Sid\": \"Allow Databricks to use KMS key for EBS\",\n\"Effect\": \"Allow\",\n\"Principal\": {\n\"AWS\": \"<cross-account-iam-role-arn>\"\n},\n\"Action\": [\n\"kms:Decrypt\",\n\"kms:GenerateDataKey*\",\n\"kms:CreateGrant\",\n\"kms:DescribeKey\"\n],\n\"Resource\": \"*\",\n\"Condition\": {\n\"ForAnyValue:StringLike\": {\n\"kms:ViaService\": \"ec2.*.amazonaws.com\"\n}\n}\n}\n\n``` \n```\n{\n\"Sid\": \"Allow Databricks to use KMS key for managed services in the control plane\",\n\"Effect\": \"Allow\",\n\"Principal\": {\n\"AWS\": \"arn:aws:iam::414351767826:root\"\n},\n\"Action\": [\n\"kms:Encrypt\",\n\"kms:Decrypt\"\n],\n\"Resource\": \"*\",\n\"Condition\": {\n\"StringEquals\": {\n\"aws:PrincipalTag\/DatabricksAccountId\": 
[\"<databricks-account-id>(s)\"]\n}\n}\n}\n\n``` \nTo allow Databricks to encrypt cluster EBS volumes, replace the `<cross-account-iam-role-arn>` in the policy with the ARN for the cross-cloud IAM role that you created to allow Databricks to access your account. This is the same Role ARN that you use to register a Databricks credential configuration for a Databricks workspace. \n```\n{\n\"Sid\": \"Allow Databricks to use KMS key for DBFS\",\n\"Effect\": \"Allow\",\n\"Principal\":{\n\"AWS\":\"arn:aws:iam::414351767826:root\"\n},\n\"Action\": [\n\"kms:Encrypt\",\n\"kms:Decrypt\",\n\"kms:ReEncrypt*\",\n\"kms:GenerateDataKey*\",\n\"kms:DescribeKey\"\n],\n\"Resource\": \"*\",\n\"Condition\": {\n\"StringEquals\": {\n\"aws:PrincipalTag\/DatabricksAccountId\": [\"<databricks-account-id>(s)\"]\n}\n}\n},\n{\n\"Sid\": \"Allow Databricks to use KMS key for DBFS (Grants)\",\n\"Effect\": \"Allow\",\n\"Principal\":{\n\"AWS\":\"arn:aws:iam::414351767826:root\"\n},\n\"Action\": [\n\"kms:CreateGrant\",\n\"kms:ListGrants\",\n\"kms:RevokeGrant\"\n],\n\"Resource\": \"*\",\n\"Condition\": {\n\"Bool\": {\n\"kms:GrantIsForAWSResource\": \"true\"\n},\n\"StringEquals\": {\n\"aws:PrincipalTag\/DatabricksAccountId\": [\"<databricks-account-id>(s)\"]\n}\n}\n},\n{\n\"Sid\": \"Allow Databricks to use KMS key for EBS\",\n\"Effect\": \"Allow\",\n\"Principal\": {\n\"AWS\": \"<cross-account-iam-role-arn>\"\n},\n\"Action\": [\n\"kms:Decrypt\",\n\"kms:GenerateDataKey*\",\n\"kms:CreateGrant\",\n\"kms:DescribeKey\"\n],\n\"Resource\": \"*\",\n\"Condition\": {\n\"ForAnyValue:StringLike\": {\n\"kms:ViaService\": \"ec2.*.amazonaws.com\"\n}\n}\n}\n\n``` \nNote \nTo retrieve your Databricks account ID, follow [Locate your account ID](https:\/\/docs.databricks.com\/admin\/account-settings\/index.html#account-id).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/configure-customer-managed-keys.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n### 
Customer-managed keys for encryption\n##### Configure customer-managed keys for encryption\n###### Step 2: Add an access policy to your cross-account IAM role (Optional)\n\nIf your KMS key is in a different AWS account than the [cross-account IAM role](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/credentials.html) used to deploy your workspace, then you must add a policy to that cross-account IAM role. This policy enables Databricks to access your key. If your KMS key is in the same AWS account as the cross-account IAM role used to deploy your workspace, then you do not need to do this step. \n1. Log into the AWS Management Console as a user with administrator privileges and go to the **IAM** console.\n2. In the left navigation pane, click **Roles**.\n3. In the list of roles, click the [cross-account IAM role](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/credentials.html) that you created for Databricks.\n4. Add an inline policy. \n1. On the **Permissions** tab, click **Add inline policy**. \n![Inline policy](https:\/\/docs.databricks.com\/_images\/inline-policy.png)\n2. In the policy editor, click the **JSON** tab. \n![JSON editor](https:\/\/docs.databricks.com\/_images\/policy-editor.png)\n3. Copy the access policy below \n```\n{\n\"Sid\": \"AllowUseOfCMKInAccount <AccountIdOfCrossAccountIAMRole>\",\n\"Effect\": \"Allow\",\n\"Action\": [\n\"kms:Decrypt\",\n\"kms:GenerateDataKey*\",\n\"kms:CreateGrant\",\n\"kms:DescribeKey\"\n],\n\"Resource\": \"arn:aws:kms:<region>:<AccountIdOfKMSKey>:key\/<KMSKeyId>\",\n\"Condition\": {\n\"ForAnyValue:StringLike\": {\n\"kms:ViaService\": \"ec2.*.amazonaws.com\"\n}\n}\n}\n\n```\n4. Click **Review policy**.\n5. In the **Name** field, enter a policy name.\n6. 
Click **Create policy**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/configure-customer-managed-keys.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n### Customer-managed keys for encryption\n##### Configure customer-managed keys for encryption\n###### Step 3: Create a new key configuration\n\nCreate a Databricks encryption key configuration object using the Databricks account console. You can use an encryption key configuration across multiple workspaces. \n1. As an account admin, log in to the account console.\n2. In the sidebar, click **Cloud resources**.\n3. Click the **Encryption keys configuration** tab.\n4. Click **Add encryption key**.\n5. Select the use cases for this encryption key: \n* **Both managed services and workspace storage**\n* **Managed services**\n* **Workspace storage**\n6. In the **AWS key ARN** field, enter the key ARN that you copied above.\n7. In the **AWS key alias** field, enter the key alias that you copied above.\n8. Click **Add**.\n9. Copy the **Name**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/configure-customer-managed-keys.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n### Customer-managed keys for encryption\n##### Configure customer-managed keys for encryption\n###### Step 4: Add the key configuration to a workspace\n\nAdd the encryption key configuration that you created to a workspace. You cannot add the encryption key to a workspace using the account console. This section uses the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) to add an encryption key to a workspace. You can also use the [Account API](https:\/\/docs.databricks.com\/api\/account\/introduction). \nTo create a new workspace using the encryption key configuration, follow the instructions in [Create a workspace using the Account API](https:\/\/docs.databricks.com\/admin\/workspace\/create-workspace-api.html). \n1. 
Terminate all running compute in your workspace.\n2. Update a workspace with your key configuration. \nTo add the key for managed services, set `managed_services_customer_managed_key_id` to the key name that you copied above. \nTo add the key for workspace storage, set `storage_customer_managed_key_id` to the key name that you copied above. \nReplace `<workspace-id>` with your Databricks workspace ID. \nFor example: \n```\ndatabricks account workspaces update <workspace-id> --json '{\n\"managed_services_customer_managed_key_id\": \"<databricks-key-name>\",\n\"storage_customer_managed_key_id\": \"<databricks-key-name>\"\n}'\n\n```\n3. If you are adding keys for workspace storage, wait at least 20 minutes before you start any compute or use the DBFS API.\n4. Restart the compute that you terminated in a previous step.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/configure-customer-managed-keys.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n### Customer-managed keys for encryption\n##### Configure customer-managed keys for encryption\n###### Rotate an existing key\n\nYou can rotate (update) an existing key only for the customer-managed key for managed services. You cannot rotate an existing key for the customer-managed key for storage. However, AWS provides automatic CMK master key rotation, which rotates the underlying key without changing the key ARN. Automatic CMK master key rotation is compatible with Databricks customer-managed keys for storage. For more information, see [Rotating AWS KMS keys](https:\/\/docs.aws.amazon.com\/kms\/latest\/developerguide\/rotate-keys.html). \nTo rotate an existing key for managed services, follow the instructions in [Step 4: Add the key configuration to a workspace](https:\/\/docs.databricks.com\/security\/keys\/configure-customer-managed-keys.html#workspace). 
You must keep your old KMS key available to Databricks for 24 hours.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/configure-customer-managed-keys.html"} +{"content":"# \n### HTML, D3, and SVG in notebooks\n\nThis article contains Python and Scala notebooks that show how to view HTML, SVG, and D3 visualizations in notebooks. \nIf you want to use a custom Javascript library to render D3, see [Notebook example: Use a JavaScript library](https:\/\/docs.databricks.com\/archive\/legacy\/filestore.html#use-a-javascript-library).\n\n### HTML, D3, and SVG in notebooks\n#### HTML, D3, and SVG Python notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/html-d3-svg-python.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n### HTML, D3, and SVG in notebooks\n#### HTML, D3, and SVG Scala notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/html-d3-svg-scala.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/html-d3-and-svg.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Query serving endpoints for custom models\n\nIn this article, learn how to format scoring requests for your served model, and how to send those requests to the model serving endpoint. The guidance is relevant to serving **custom models**, which Databricks defines as traditional ML models or customized Python models packaged in the MLflow format. They can be registered either in Unity Catalog or in the workspace model registry. Examples include scikit-learn, XGBoost, PyTorch, and Hugging Face transformer models. See [Model serving with Databricks](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) for more information about this functionality and supported model categories. 
\nFor query requests for generative AI and LLM workloads, see [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-custom-model-endpoints.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Query serving endpoints for custom models\n##### Requirements\n\n* A [model serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html).\n* For the MLflow Deployment SDK, MLflow 2.9 or above is required.\n* [Scoring request in an accepted format](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-custom-model-endpoints.html#formats).\n* To send a scoring request through the REST API or MLflow Deployment SDK, you must have a Databricks API token. \nImportant \nAs a security best practice for production scenarios, Databricks recommends that you use [machine-to-machine OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html) for authentication during production. \nFor testing and development, Databricks recommends using a personal access token belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. 
To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-custom-model-endpoints.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Query serving endpoints for custom models\n##### Querying methods and examples\n\nDatabricks Model Serving provides the following options for sending scoring requests to served models: \n| Method | Details |\n| --- | --- |\n| Serving UI | Select **Query endpoint** from the **Serving endpoint** page in your Databricks workspace. Insert JSON format model input data and click **Send Request**. If the model has an input example logged, use **Show Example** to load it. |\n| REST API | Call and query the model using the REST API. See [POST \/serving-endpoints\/{name}\/invocations](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/query) for details. For scoring requests to endpoints serving multiple models, see [Query individual models behind an endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/serve-multiple-models-to-serving-endpoint.html#query). |\n| MLflow Deployments SDK | Use MLflow Deployments SDK\u2019s [predict()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.deployments.html#mlflow.deployments.DatabricksDeploymentClient.predict) function to query the model. |\n| SQL function | Invoke model inference directly from SQL using the `ai_query` SQL function. See [Query a served model with ai\\_query()](https:\/\/docs.databricks.com\/large-language-models\/how-to-ai-query.html). 
| \n### Pandas DataFrame scoring example \nThe following example assumes a `MODEL_VERSION_URI` like `https:\/\/<databricks-instance>\/model\/iris-classifier\/Production\/invocations`, where `<databricks-instance>` is the [name of your Databricks instance](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-instance-names-urls-and-ids), and a [Databricks REST API token](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-custom-model-endpoints.html#required) called `DATABRICKS_API_TOKEN`. \nSee [Supported scoring formats](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-custom-model-endpoints.html#formats). \nScore a model that accepts the dataframe split input format. \n```\ncurl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \\\n-H 'Content-Type: application\/json' \\\n-d '{\"dataframe_split\": {\n\"columns\": [\"sepal length (cm)\", \"sepal width (cm)\", \"petal length (cm)\", \"petal width (cm)\"],\n\"data\": [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]\n}\n}'\n\n``` \nScore a model accepting tensor inputs. Tensor inputs should be formatted as described in [TensorFlow Serving\u2019s API documentation](https:\/\/www.tensorflow.org\/tfx\/serving\/api_rest#request_format_2). \n```\ncurl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \\\n-H 'Content-Type: application\/json' \\\n-d '{\"inputs\": [[5.1, 3.5, 1.4, 0.2]]}'\n\n``` \nImportant \nThe following example uses the `predict()` API from the [MLflow Deployments SDK](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.deployments.html#mlflow.deployments.DatabricksDeploymentClient.predict). 
\n```\n\nimport os\nimport mlflow.deployments\n\n# Provide workspace credentials through environment variables (placeholder values).\nos.environ[\"DATABRICKS_HOST\"] = \"https:\/\/<workspace_host>.databricks.com\"\nos.environ[\"DATABRICKS_TOKEN\"] = \"dapi-your-databricks-token\"\n\nclient = mlflow.deployments.get_deploy_client(\"databricks\")\n\nresponse = client.predict(\nendpoint=\"test-model-endpoint\",\ninputs={\"dataframe_split\": {\n\"index\": [0, 1],\n\"columns\": [\"sepal length (cm)\", \"sepal width (cm)\", \"petal length (cm)\", \"petal width (cm)\"],\n\"data\": [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]\n}\n}\n)\n\n``` \nImportant \nThe following example uses the built-in SQL function, [ai\\_query](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_query.html). This function is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html) and the definition might change. See [Query a served model with ai\\_query()](https:\/\/docs.databricks.com\/large-language-models\/how-to-ai-query.html). \nThe following example queries the model behind the `sentiment-analysis` endpoint with the `text` dataset and specifies the return type of the request. \n```\nSELECT text, ai_query(\n\"sentiment-analysis\",\ntext,\nreturnType => \"STRUCT<label:STRING, score:DOUBLE>\"\n) AS predict\nFROM\ncatalog.schema.customer_reviews\n\n``` \nYou can score a dataset in Power BI Desktop using the following steps: \n1. Open the dataset you want to score.\n2. Go to Transform Data.\n3. Right-click in the left panel and select **Create New Query**.\n4. Go to **View > Advanced Editor**.\n5. Replace the query body with the code snippet below, after filling in an appropriate `DATABRICKS_API_TOKEN` and `MODEL_VERSION_URI`. 
\n```\n(dataset as table ) as table =>\nlet\ncall_predict = (dataset as table ) as list =>\nlet\napiToken = DATABRICKS_API_TOKEN,\nmodelUri = MODEL_VERSION_URI,\nresponseList = Json.Document(Web.Contents(modelUri,\n[\nHeaders = [\n#\"Content-Type\" = \"application\/json\",\n#\"Authorization\" = Text.Format(\"Bearer #{0}\", {apiToken})\n],\nContent = {\"dataframe_records\": Json.FromValue(dataset)}\n]\n))\nin\nresponseList,\npredictionList = List.Combine(List.Transform(Table.Split(dataset, 256), (x) => call_predict(x))),\npredictionsTable = Table.FromList(predictionList, (x) => {x}, {\"Prediction\"}),\ndatasetWithPrediction = Table.Join(\nTable.AddIndexColumn(predictionsTable, \"index\"), \"index\",\nTable.AddIndexColumn(dataset, \"index\"), \"index\")\nin\ndatasetWithPrediction\n\n```\n6. Name the query with your desired model name.\n7. Open the advanced query editor for your dataset and apply the model function. \n### Tensor input example \nThe following example scores a model accepting tensor inputs. Tensor inputs should be formatted as described in [TensorFlow Serving\u2019s API docs](https:\/\/www.tensorflow.org\/tfx\/serving\/api_rest#request_format_2). This example assumes a `MODEL_VERSION_URI` like `https:\/\/<databricks-instance>\/model\/iris-classifier\/Production\/invocations`, where `<databricks-instance>` is the [name of your Databricks instance](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-instance-names-urls-and-ids), and a [Databricks REST API token](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-custom-model-endpoints.html#required) called `DATABRICKS_API_TOKEN`. 
\n```\ncurl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \\\n-H 'Content-Type: application\/json' \\\n-d '{\"inputs\": [[5.1, 3.5, 1.4, 0.2]]}'\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-custom-model-endpoints.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Query serving endpoints for custom models\n##### Supported scoring formats\n\nFor custom models, Model Serving supports scoring requests in Pandas DataFrame or Tensor input. \n### Pandas DataFrame \nRequests should be sent by constructing a JSON-serialized Pandas DataFrame with one of the supported keys and a JSON object corresponding to the input format. \n* (Recommended)`dataframe_split` format is a JSON-serialized Pandas DataFrame in the `split` orientation. \n```\n{\n\"dataframe_split\": {\n\"index\": [0, 1],\n\"columns\": [\"sepal length (cm)\", \"sepal width (cm)\", \"petal length (cm)\", \"petal width (cm)\"],\n\"data\": [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]\n}\n}\n\n```\n* `dataframe_records` is JSON-serialized Pandas DataFrame in the `records` orientation. \nNote \nThis format does not guarantee the preservation of column ordering, and the `split` format is preferred over the `records` format. \n```\n{\n\"dataframe_records\": [\n{\n\"sepal length (cm)\": 5.1,\n\"sepal width (cm)\": 3.5,\n\"petal length (cm)\": 1.4,\n\"petal width (cm)\": 0.2\n},\n{\n\"sepal length (cm)\": 4.9,\n\"sepal width (cm)\": 3,\n\"petal length (cm)\": 1.4,\n\"petal width (cm)\": 0.2\n},\n{\n\"sepal length (cm)\": 4.7,\n\"sepal width (cm)\": 3.2,\n\"petal length (cm)\": 1.3,\n\"petal width (cm)\": 0.2\n}\n]\n}\n\n``` \nThe response from the endpoint contains the output from your model, serialized with JSON, wrapped in a `predictions` key. 
\n```\n{\n\"predictions\": [0,1,1,1,0]\n}\n\n``` \n### Tensor input \nWhen your model expects tensors, like a TensorFlow or PyTorch model, there are two supported format options for sending requests: `instances` and `inputs`. \nIf you have multiple named tensors per row, you must provide one of each tensor for every row. \n* `instances` is a tensors-based format that accepts tensors in row format. Use this format if all the input tensors have the same 0-th dimension. Conceptually, each tensor in the instances list could be joined with the other tensors of the same name in the rest of the list to construct the full input tensor for the model, which would only be possible if all of the tensors have the same 0-th dimension. \n```\n{\"instances\": [ 1, 2, 3 ]}\n\n``` \nThe following example shows how to specify multiple named tensors. \n```\n{\n\"instances\": [\n{\n\"t1\": \"a\",\n\"t2\": [1, 2, 3, 4, 5],\n\"t3\": [[1, 2], [3, 4], [5, 6]]\n},\n{\n\"t1\": \"b\",\n\"t2\": [6, 7, 8, 9, 10],\n\"t3\": [[7, 8], [9, 10], [11, 12]]\n}\n]\n}\n\n```\n* `inputs` sends queries with tensors in columnar format. This request is different because there are actually a different number of tensor instances of `t2` (3) than `t1` and `t3`, so it is not possible to represent this input in the `instances` format. \n```\n{\n\"inputs\": {\n\"t1\": [\"a\", \"b\"],\n\"t2\": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],\n\"t3\": [[[1, 2], [3, 4], [5, 6]], [[7, 8], [9, 10], [11, 12]]]\n}\n}\n\n``` \nThe response from the endpoint is in the following format. 
\n```\n{\n\"predictions\": [0,1,1,1,0]\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-custom-model-endpoints.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Query serving endpoints for custom models\n##### Notebook example\n\nSee the following notebook for an example of how to test your Model Serving endpoint with a Python model: \n### Test Model Serving endpoint notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/test-serverless-endpoint-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n#### Query serving endpoints for custom models\n##### Additional resources\n\n* [Inference tables for monitoring and debugging models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html).\n* [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-custom-model-endpoints.html"} +{"content":"# Compute\n## What is a SQL warehouse?\n#### Create a SQL warehouse\n\nWorkspace admins and sufficiently privileged users can configure and manage SQL warehouses. This article outlines how to create, edit, and monitor existing SQL warehouses. \nYou can also create SQL warehouses using the [SQL warehouse API](https:\/\/docs.databricks.com\/api\/workspace\/warehouses\/create), or [Terraform](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html). \nDatabricks recommends using serverless SQL warehouses when available. \nNote \nMost users cannot create SQL warehouses, but can restart any SQL warehouse they can connect to. 
See [What is a SQL warehouse?](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html).\n\n#### Create a SQL warehouse\n##### Requirements\n\nSQL warehouses have the following requirements: \n* To create a SQL warehouse you must be a workspace admin or a user with unrestricted cluster creation permissions.\n* Before you can create a serverless SQL warehouse in a [region that supports the feature](https:\/\/docs.databricks.com\/resources\/supported-regions.html), there might be required steps. See [Enable serverless SQL warehouses](https:\/\/docs.databricks.com\/admin\/sql\/serverless.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html"} +{"content":"# Compute\n## What is a SQL warehouse?\n#### Create a SQL warehouse\n##### Create a SQL warehouse\n\nTo create a SQL warehouse using the web UI: \n1. Click **SQL Warehouses** in the sidebar.\n2. Click **Create SQL Warehouse**.\n3. Enter a **Name** for the warehouse.\n4. (Optional) Configure warehouse settings. See [Configure SQL warehouse settings](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html#settings).\n5. (Optional) Configure advanced options. See [Advanced options](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html#advanced).\n6. Click **Create**.\n7. (Optional) Configure access to the SQL warehouse. See [Manage a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html#manage). \nYour created warehouse starts automatically. 
\n![Default SQL warehouse config](https:\/\/docs.databricks.com\/_images\/sql-warehouse-config.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html"} +{"content":"# Compute\n## What is a SQL warehouse?\n#### Create a SQL warehouse\n##### Configure SQL warehouse settings\n\nYou can modify the following settings while creating or editing a SQL warehouse: \n* **Cluster Size** represents the size of the driver node and number of worker nodes associated with the cluster. The default is **X-Large**. To reduce query latency, increase the size.\n* **Auto Stop** determines whether the warehouse stops if it\u2019s idle for the specified number of minutes. Idle SQL warehouses continue to accumulate DBU and cloud instance charges until they are stopped. \n+ **Pro and classic SQL warehouses**: The default is 45 minutes, which is recommended for typical use. The minimum is 10 minutes.\n+ **Serverless SQL warehouses**: The default is 10 minutes, which is recommended for typical use. The minimum is 5 minutes when you use the UI. Note that you can create a serverless SQL warehouse using the [SQL warehouses API](https:\/\/docs.databricks.com\/api\/workspace\/warehouses), in which case you can set the Auto Stop value as low as 1 minute.\n* **Scaling** sets the minimum and maximum number of clusters that will be used for a query. The default is a minimum and a maximum of one cluster. You can increase the maximum clusters if you want to handle more concurrent users for a given query. Databricks recommends a cluster for every 10 concurrent queries. \nTo maintain optimal performance, Databricks periodically recycles clusters. During a recycle period, you may temporarily see a cluster count that exceeds the maximum as Databricks transitions new workloads to the new cluster and waits to recycle the old cluster until all open workloads have completed.\n* **Type** determines the type of warehouse. 
If serverless is enabled in your account, serverless is the default. See [SQL warehouse types](https:\/\/docs.databricks.com\/admin\/sql\/warehouse-types.html) for the list. \n### Advanced options \nConfigure the following advanced options by expanding the **Advanced options** area when you create a new SQL warehouse or edit an existing SQL warehouse. You can also configure these options using the [SQL Warehouse API](https:\/\/docs.databricks.com\/api\/workspace\/warehouses\/create). \n* **Tags**: Tags allow you to monitor the cost of cloud resources used by users and groups in your organization. You specify tags as key-value pairs.\n* **Unity Catalog**: If Unity Catalog is enabled for the workspace, it is the default for all new warehouses in the workspace. If Unity Catalog is not enabled for your workspace, you do not see this option. See [What is Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html).\n* **Channel**: Use the Preview channel to test new functionality, including your queries and dashboards, before it becomes the Databricks SQL standard. \nThe [release notes](https:\/\/docs.databricks.com\/sql\/release-notes\/index.html#channels) list what\u2019s in the latest preview version. \nImportant \nDatabricks recommends against using a preview version for production workloads. 
Because only workspace admins can view a warehouse\u2019s properties, including its channel, consider indicating that a Databricks SQL warehouse uses a preview version in that warehouse\u2019s name to prevent users from using it for production workloads.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html"} +{"content":"# Compute\n## What is a SQL warehouse?\n#### Create a SQL warehouse\n##### Manage a SQL warehouse\n\nWorkspace admins and users with CAN MANAGE privileges on a SQL warehouse can complete the following tasks on an existing SQL warehouse: \n* To stop a running warehouse, click the stop icon next to the warehouse.\n* To start a stopped warehouse, click the start icon next to the warehouse.\n* To edit a warehouse, click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) then click **Edit**.\n* To add and edit permissions, click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) then click **Permissions**. To learn about permission levels, see [SQL warehouse ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#sql-warehouses).\n* To upgrade a SQL warehouse to serverless, click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png), then click **Upgrade to Serverless**.\n* To delete a warehouse, click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png), then click **Delete**. 
\nNote \nContact your Databricks representative to restore a deleted warehouse within 14 days.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html"} +{"content":"# Technology partners\n## What is Databricks Partner Connect?\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/walkthrough-fivetran.html"} +{"content":"# Technology partners\n## What is Databricks Partner Connect?\n#### Walkthrough: Connect to Fivetran using Partner Connect\n\nIn this walkthrough, you use Partner Connect to connect a Databricks SQL warehouse in your workspace to Fivetran and then use Fivetran to ingest sample data from Google Sheets into your workspace. \n1. Make sure your Databricks account, workspace, and the signed-in user all meet the [requirements](https:\/\/docs.databricks.com\/partner-connect\/index.html#requirements) for Partner Connect.\n2. In the sidebar, click ![Partner Connect button](https:\/\/docs.databricks.com\/_images\/partner-connect.png) **Partner Connect**.\n3. Click the **Fivetran** tile. \nNote \nIf the **Fivetran** tile has a check mark icon inside of it, this means one of your administrators has already used Partner Connect to connect Fivetran to your workspace. Contact that admin, who can add you to the Fivetran account that they created by using Partner Connect. After they add you, click the **Fivetran** tile.\n4. If the **Connect to partner** dialog displays a **Next** button, click it.\n5. For **Email**, enter the email address that you want Fivetran to use to create a 14-day trial Fivetran account for you, or enter the email address for your existing Fivetran account.\n6. Click the button with the label **Connect to Fivetran** or **Sign in**. \nImportant \nIf an error displays stating that someone from your organization has already created an account with Fivetran, contact one of your organization\u2019s administrators and have them add you to your organization\u2019s Fivetran account. 
After they add you, click **Connect to Fivetran** or **Sign in** again.\n7. A new tab opens in your web browser, which displays the Fivetran website.\n8. Complete the on-screen instructions in Fivetran to create your 14-day trial Fivetran account, or to sign in to your existing Fivetran account.\n9. Do one of the following: \n* If Fivetran just created your 14-day trial Fivetran account, continue with [Use a new 14-day trial Fivetran account](https:\/\/docs.databricks.com\/partner-connect\/walkthrough-fivetran.html#new-fivetran-account).\n* If you signed in to your existing Fivetran account, skip ahead to [Use an existing Fivetran account](https:\/\/docs.databricks.com\/partner-connect\/walkthrough-fivetran.html#existing-fivetran-account).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/walkthrough-fivetran.html"} +{"content":"# Technology partners\n## What is Databricks Partner Connect?\n#### Walkthrough: Connect to Fivetran using Partner Connect\n##### Use a new 14-day trial Fivetran account\n\n1. On the **Fivetran is modern ELT** page, click **Set up a connector**.\n2. On the **Select your data source** page, click **Google Sheets**, and then click **Continue Setup**.\n3. Follow the on-screen instructions in the **Setup Guide** in Fivetran to finish setting up the connector.\n4. Click **Save & Test**.\n5. After the test succeeds, click **Continue**.\n6. 
Do one of the following: \n* If a **Google Sheets** connector page displays, skip ahead to [Ingest sample data](https:\/\/docs.databricks.com\/partner-connect\/walkthrough-fivetran.html#ingest-sample-data).\n* If a **Select your data\u2019s destination** page displays, skip ahead to step 9 in [Use an existing Fivetran account](https:\/\/docs.databricks.com\/partner-connect\/walkthrough-fivetran.html#existing-fivetran-account).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/walkthrough-fivetran.html"} +{"content":"# Technology partners\n## What is Databricks Partner Connect?\n#### Walkthrough: Connect to Fivetran using Partner Connect\n##### Use an existing Fivetran account\n\nTo complete this series of steps, you get the connection details for an existing SQL warehouse in your workspace and then add those details to your Fivetran account. \n* To get the connection details for an existing SQL warehouse, see [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html). Specifically, you need the SQL warehouse\u2019s **Server Hostname** and **HTTP Path** field values. \nNote \nBy default, the name of the SQL warehouse is **FIVETRAN\\_ENDPOINT**.\n* To create a SQL warehouse in your workspace, see [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html). \nYou must also generate a Databricks [personal access token](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement). \nTip \nIf the **Fivetran** tile in Partner Connect has a check mark icon inside of it, you can get the connection details for the connected SQL warehouse by clicking the tile and then expanding **Connection details**. Note however that the **Personal access token** here is hidden; you must [create a replacement personal access token](https:\/\/docs.databricks.com\/partner-connect\/index.html#how-to-create-token) and enter that new token instead when Fivetran asks you for it. 
\n1. In your **Dashboard** page in Fivetran, click the **Destinations** tab. \nImportant \nIf you sign in to your organization\u2019s Fivetran account, a **Choose Destination** page may display, listing one or more existing destination entries with the Databricks logo. *These entries might contain connection details for SQL warehouses in workspaces that are separate from yours.* If you still want to reuse one of these connections, and you trust the SQL warehouse and have access to it, choose that destination, click **Add Connector**, and then skip ahead to step 5. Otherwise, choose any available destination to get past this page.\n2. Click **Add Destination**.\n3. Enter a **Destination name** and click **Add**.\n4. On the **Fivetran is modern ELT** page, click **Set up a connector**.\n5. Click **Google Sheets**, and then click **Next**.\n6. Follow the on-screen instructions in the **Setup Guide** in Fivetran to finish setting up the connector.\n7. Click **Save & Test**.\n8. After the test succeeds, click **Continue**.\n9. On the **Select your data\u2019s destination** page, click **Databricks on AWS**.\n10. Click **Continue Setup**.\n11. Complete the on-screen instructions in Fivetran to enter the connection details for your SQL warehouse and your personal access token. \nNote \nBy default, the name of the SQL warehouse is **FIVETRAN\\_ENDPOINT**.\n12. Click **Save & Test**.\n13. After the test succeeds, click **Continue**.\n14. Continue to ingest sample data.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/walkthrough-fivetran.html"} +{"content":"# Technology partners\n## What is Databricks Partner Connect?\n#### Walkthrough: Connect to Fivetran using Partner Connect\n##### Ingest sample data\n\n1. Click **Start Initial Sync**.\n2. View the sample data in your workspace: after the sync succeeds, go to your Databricks workspace.\n3. In Databricks SQL, click **Queries**.\n4. Click **Create Query**.\n5. Choose the name of the SQL warehouse.\n6. 
Enter a query, for example `SELECT * FROM google_sheets.my_sheet`. \nNote \nYour database and table name here are different. For the correct database and table name, see the details for the connector in Fivetran that you just created.\n7. Click **Run**. The ingested data displays.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/walkthrough-fivetran.html"} +{"content":"# Technology partners\n## What is Databricks Partner Connect?\n#### Walkthrough: Connect to Fivetran using Partner Connect\n##### Clean up\n\nAfter you complete this walkthrough, you should clean up any related resources that you no longer plan to use. \n### Delete the table \n1. In Databricks SQL, click **Queries** on the sidebar.\n2. Click **Create Query**.\n3. Choose the name of the related SQL warehouse.\n4. Enter a query, for example `DROP TABLE google_sheets.my_sheet`. \nNote \nYour database and table name here are different. For the correct database and table name, see the details for the related connector in Fivetran. \nThis query only deletes the table. It does not delete your Google Sheet. You can manually delete your Google Sheet if you no longer plan to use it.\n5. Click **Run**. \n### Delete the SQL warehouse \nDo one of the following: \n* If you used Partner Connect to create the SQL warehouse: \n1. In Partner Connect, click the **Fivetran** tile with the check mark icon inside of it.\n2. Click **Delete connection**.\n3. Click **Delete**. The SQL warehouse and the related Databricks service principal are deleted.\n* If you used Databricks SQL to create the SQL warehouse: \n1. In Databricks SQL, click **SQL Warehouses** in the sidebar.\n2. Next to the warehouse, for **Actions**, click the ellipsis button.\n3. Click **Delete**.\n4. Confirm the deletion by clicking **Delete**. \n### Delete the connection details \n1. In your **Dashboard** page in Fivetran, click the **Destinations** tab. (If the **Dashboard** page is not displayed, go to <https:\/\/fivetran.com\/login>.)\n2. 
Next to the related destination entry with the Databricks logo, click the **X** icon.\n3. Click **Remove Destination**.\n\n#### Walkthrough: Connect to Fivetran using Partner Connect\n##### Additional resources\n\n[Fivetran website](https:\/\/fivetran.com)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/walkthrough-fivetran.html"} +{"content":"# What is Delta Lake?\n### Selectively overwrite data with Delta Lake\n\nDatabricks leverages Delta Lake functionality to support two distinct options for selective overwrites: \n* The `replaceWhere` option atomically replaces all records that match a given predicate.\n* You can replace directories of data based on how tables are partitioned using dynamic partition overwrites. \nFor most operations, Databricks recommends using `replaceWhere` to specify which data to overwrite. \nImportant \nIf data has been accidentally overwritten, you can use [restore](https:\/\/docs.databricks.com\/delta\/history.html#restore) to undo the change.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/selective-overwrite.html"} +{"content":"# What is Delta Lake?\n### Selectively overwrite data with Delta Lake\n#### Arbitrary selective overwrite with `replaceWhere`\n\nYou can selectively overwrite only the data that matches an arbitrary expression. \nNote \nSQL requires Databricks Runtime 12.2 LTS or above. 
\nThe following command atomically replaces events in January in the target table, which is partitioned by `start_date`, with the data in `replace_data`: \n```\n(replace_data.write\n.mode(\"overwrite\")\n.option(\"replaceWhere\", \"start_date >= '2017-01-01' AND end_date <= '2017-01-31'\")\n.save(\"\/tmp\/delta\/events\")\n)\n\n``` \n```\nreplace_data.write\n.mode(\"overwrite\")\n.option(\"replaceWhere\", \"start_date >= '2017-01-01' AND end_date <= '2017-01-31'\")\n.save(\"\/tmp\/delta\/events\")\n\n``` \n```\nINSERT INTO TABLE events REPLACE WHERE start_date >= '2017-01-01' AND end_date <= '2017-01-31' SELECT * FROM replace_data\n\n``` \nThis sample code writes out the data in `replace_data`, validates that all rows match the predicate, and performs an atomic replacement using `overwrite` semantics. If any values in the operation fall outside the constraint, this operation fails with an error by default. \nYou can change this behavior to `overwrite` values within the predicate range and `insert` records that fall outside the specified range. To do so, disable the constraint check by setting `spark.databricks.delta.replaceWhere.constraintCheck.enabled` to false using one of the following settings: \n```\nspark.conf.set(\"spark.databricks.delta.replaceWhere.constraintCheck.enabled\", False)\n\n``` \n```\nspark.conf.set(\"spark.databricks.delta.replaceWhere.constraintCheck.enabled\", false)\n\n``` \n```\nSET spark.databricks.delta.replaceWhere.constraintCheck.enabled=false\n\n``` \n### Legacy behavior \nLegacy default behavior had `replaceWhere` overwrite data matching a predicate over partition columns only. 
With this legacy model, the following command would atomically replace the month January in the target table, which is partitioned by `date`, with the data in `df`: \n```\n(df.write\n.mode(\"overwrite\")\n.option(\"replaceWhere\", \"birthDate >= '2017-01-01' AND birthDate <= '2017-01-31'\")\n.save(\"\/tmp\/delta\/people10m\")\n)\n\n``` \n```\ndf.write\n.mode(\"overwrite\")\n.option(\"replaceWhere\", \"birthDate >= '2017-01-01' AND birthDate <= '2017-01-31'\")\n.save(\"\/tmp\/delta\/people10m\")\n\n``` \nIf you want to fall back to the old behavior, you can disable the `spark.databricks.delta.replaceWhere.dataColumns.enabled` flag: \n```\nspark.conf.set(\"spark.databricks.delta.replaceWhere.dataColumns.enabled\", False)\n\n``` \n```\nspark.conf.set(\"spark.databricks.delta.replaceWhere.dataColumns.enabled\", false)\n\n``` \n```\nSET spark.databricks.delta.replaceWhere.dataColumns.enabled=false\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/selective-overwrite.html"} +{"content":"# What is Delta Lake?\n### Selectively overwrite data with Delta Lake\n#### Dynamic partition overwrites\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nDatabricks Runtime 11.3 LTS and above supports *dynamic* partition overwrite mode for partitioned tables. For tables with multiple partitions, Databricks Runtime 11.3 LTS and below only support dynamic partition overwrites if all partition columns are of the same data type. \nWhen in dynamic partition overwrite mode, operations overwrite all existing data in each logical partition for which the write commits new data. Any existing logical partitions for which the write does not contain data remain unchanged. This mode is only applicable when data is being written in overwrite mode: either `INSERT OVERWRITE` in SQL, or a DataFrame write with `df.write.mode(\"overwrite\")`. 
\nConfigure dynamic partition overwrite mode by setting the Spark session configuration `spark.sql.sources.partitionOverwriteMode` to `dynamic`. You can also enable this by setting the `DataFrameWriter` option `partitionOverwriteMode` to `dynamic`. If present, the query-specific option overrides the mode defined in the session configuration. The default for `partitionOverwriteMode` is `static`. \nImportant \nValidate that the data written with dynamic partition overwrite touches only the expected partitions. A single row in the incorrect partition can lead to unintentionally overwriting an entire partition. \nThe following example demonstrates using dynamic partition overwrites: \n```\nSET spark.sql.sources.partitionOverwriteMode=dynamic;\nINSERT OVERWRITE TABLE default.people10m SELECT * FROM morePeople;\n\n``` \n```\n(df.write\n.mode(\"overwrite\")\n.option(\"partitionOverwriteMode\", \"dynamic\")\n.saveAsTable(\"default.people10m\")\n)\n\n``` \n```\ndf.write\n.mode(\"overwrite\")\n.option(\"partitionOverwriteMode\", \"dynamic\")\n.saveAsTable(\"default.people10m\")\n\n``` \nNote \n* Dynamic partition overwrite conflicts with the option `replaceWhere` for partitioned tables. 
\n+ If dynamic partition overwrite is enabled in the Spark session configuration, and `replaceWhere` is provided as a `DataFrameWriter` option, then Delta Lake overwrites the data according to the `replaceWhere` expression (query-specific options override session configurations).\n+ You receive an error if the `DataFrameWriter` options have both dynamic partition overwrite and `replaceWhere` enabled.\n* You cannot specify `overwriteSchema` as `true` when using dynamic partition overwrite.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/selective-overwrite.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Databricks Vector Search\n\nThis article gives an overview of Databricks\u2019 vector database solution, Databricks Vector Search, including what it is and how it works.\n\n### Databricks Vector Search\n#### What is Databricks Vector Search?\n\nDatabricks Vector Search is a vector database that is built into the Databricks Data Intelligence Platform and integrated with its governance and productivity tools. A vector database is a database that is optimized to store and retrieve embeddings. Embeddings are mathematical representations of the semantic content of data, typically text or image data. Embeddings are generated by a large language model and are a key component of many GenAI applications that depend on finding documents or images that are similar to each other. Examples are RAG systems, recommender systems, and image and video recognition. \nWith Vector Search, you create a vector search index from a Delta table. The index includes embedded data with metadata. You can then query the index using a REST API to identify the most similar vectors and return the associated documents. You can structure the index to automatically sync when the underlying Delta table is updated. 
\nDatabricks Vector Search uses the Hierarchical Navigable Small World (HNSW) algorithm for its approximate nearest neighbor searches and the L2 distance metric to measure embedding vector similarity. If you want to use cosine similarity, normalize your data point embeddings before feeding them into Vector Search. When the data points are normalized, the ranking produced by L2 distance is the same as the ranking produced by cosine similarity.\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/vector-search.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Databricks Vector Search\n#### How does Vector Search work?\n\nTo create a vector database in Databricks, you must first decide how to provide vector embeddings. Databricks supports three options: \n* **Option 1** You provide a source Delta table that contains data in text format. Databricks calculates the embeddings, using a model that you specify, and optionally saves the embeddings to a table in Unity Catalog. As the Delta table is updated, the index stays synced with the Delta table. \nThe following diagram illustrates the process: \n1. Calculate query embeddings. Query can include metadata filters.\n2. Perform similarity search to identify most relevant documents.\n3. Return the most relevant documents and append them to the query.\n![vector database, Databricks calculates embeddings](https:\/\/docs.databricks.com\/_images\/calculate-embeddings.png)\n* **Option 2** You provide a source Delta table that contains pre-calculated embeddings. As the Delta table is updated, the index stays synced with the Delta table. \nThe following diagram illustrates the process: \n1. Query consists of embeddings and can include metadata filters.\n2. Perform similarity search to identify most relevant documents.
\n3. Return the most relevant documents and append them to the query.\n![vector database, precalculated embeddings](https:\/\/docs.databricks.com\/_images\/precalculated-embeddings.png)\n* **Option 3** You provide a source Delta table that contains pre-calculated embeddings. There is no automatic syncing when the Delta table is updated. You must manually update the index using the REST API when the embeddings table changes. \nThe following diagram illustrates the process, which is the same as Option 2 except that the vector index is not automatically updated when the Delta table changes: \n![vector database, precalculated embeddings with no automatic sync](https:\/\/docs.databricks.com\/_images\/precalculated-embeddings-no-sync.png) \n### Similarity search calculation \nThe similarity search calculation uses the following formula: \n![reciprocal of 1 plus the squared distance](https:\/\/docs.databricks.com\/_images\/similarity-score.png) \nwhere `dist` is the Euclidean distance between the query `q` and the index entry `x`: \n![Euclidean distance, square root of the sum of squared differences](https:\/\/docs.databricks.com\/_images\/euclidean-distance.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/vector-search.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Databricks Vector Search\n#### How to set up Vector Search\n\nTo use Databricks Vector Search, you must create the following: \n* A vector search endpoint. This endpoint serves the vector search index. You can query and update the endpoint using the REST API or the SDK. Endpoints scale automatically to support the size of the index or the number of concurrent requests. See [Create a vector search endpoint](https:\/\/docs.databricks.com\/generative-ai\/create-query-vector-search.html#create-a-vector-search-endpoint) for instructions.\n* A vector search index.
The vector search index is created from a Delta table and is optimized to provide real-time approximate nearest neighbor searches. The goal of the search is to identify documents that are similar to the query. Vector search indexes appear in and are governed by Unity Catalog. See [Create a vector search index](https:\/\/docs.databricks.com\/generative-ai\/create-query-vector-search.html#create-a-vector-search-index) for instructions. \nIn addition, if you choose to have Databricks compute the embeddings, you must also create a model serving endpoint for the embedding model. See [Create foundation model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-foundation-model-endpoints.html) for instructions. \nTo query the model serving endpoint, you use either the REST API or the Python SDK. Your query can define filters based on any column in the Delta table. For details, see [Use filters on queries](https:\/\/docs.databricks.com\/generative-ai\/create-query-vector-search.html#filters), the [API reference](https:\/\/docs.databricks.com\/api\/workspace\/vectorsearchendpoints), or the [Python SDK reference](https:\/\/api-docs.databricks.com\/python\/vector-search\/databricks.vector_search.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/vector-search.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Databricks Vector Search\n#### Requirements\n\n* Unity Catalog enabled workspace.\n* Serverless compute enabled.\n* Source table must have Change Data Feed enabled.\n* CREATE TABLE privileges on catalog schema(s) to create indexes.\n* [Personal access tokens enabled](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html).\n\n### Databricks Vector Search\n#### Data protection and authentication\n\nDatabricks implements the following security controls to protect your data: \n* Every customer request to Vector Search is logically isolated, authenticated, and 
authorized.\n* Databricks Vector Search encrypts all data at rest (AES-256) and in transit (TLS 1.2+). \nDatabricks Vector Search supports two modes of authentication: \n* Personal Access Token - You can use a personal access token to authenticate with Vector Search. See [personal access authentication token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). If you use the SDK in a notebook environment, it automatically generates a PAT for authentication.\n* Service Principal Token - An admin can generate a service principal token and pass it to the SDK or API. See [use service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html). For production use cases, Databricks recommends using a service principal token. \n[Customer Managed Keys (CMK)](https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html) are supported on endpoints created on or after May 8, 2024. Vector Search support for CMK is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/vector-search.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Databricks Vector Search\n#### Monitor Vector Search usage and costs\n\nThe billable usage system table lets you monitor usage and costs associated with vector search indexes and endpoints. Here is an example query: \n```\nSELECT *\nFROM system.billing.usage\nWHERE billing_origin_product = 'VECTOR_SEARCH'\nAND usage_metadata.endpoint_name IS NOT NULL\n\n``` \nFor details about the contents of the billing usage table, see [Billable usage system table reference](https:\/\/docs.databricks.com\/admin\/system-tables\/billing.html). Additional queries are in the following example notebook.
\n### Vector Search system tables queries notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/generative-ai\/vector-search-system-tables-queries.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/vector-search.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Databricks Vector Search\n#### Resource and data size limits\n\nThe following table summarizes resource and data size limits for vector search endpoints and indexes: \n| Resource | Granularity | Limit |\n| --- | --- | --- |\n| Vector search endpoints | Per workspace | 100 |\n| Embeddings | Per endpoint | 100,000,000 |\n| Embedding dimension | Per index | 4096 |\n| Indexes | Per endpoint | 20 |\n| Columns | Per index | 20 |\n| Columns | | Supported types: Bytes, short, integer, long, float, double, boolean, string, timestamp, date |\n| Metadata fields | Per index | 20 |\n| Index name | Per index | 128 characters | \nThe following limits apply to the creation and update of vector search indexes: \n| Resource | Granularity | Limit |\n| --- | --- | --- |\n| Row size for Delta Sync Index | Per index | 100KB |\n| Embedding source column size for Delta Sync index | Per Index | 32764 bytes |\n| Bulk upsert request size limit for Direct Vector index | Per Index | 10MB |\n| Bulk delete request size limit for Direct Vector index | Per Index | 10MB | \nThe following limits apply to the query API for vector search. \n| Resource | Granularity | Limit |\n| --- | --- | --- |\n| Query text length | Per query | 32764 |\n| Maximum number of results returned | Per query | 10,000 |\n\n### Databricks Vector Search\n#### Limitations\n\n* Regulated workspaces are not supported, therefore this functionality is not HIPAA compliant.\n* Row and column level permissions are not supported. 
However, you can implement your own application level ACLs using the filter API.\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/vector-search.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Databricks Vector Search\n#### Additional resources\n\n* [Deploy Your LLM Chatbot With Retrieval Augmented Generation (RAG), Foundation Models and Vector Search](https:\/\/www.databricks.com\/resources\/demos\/tutorials\/data-science-and-ai\/lakehouse-ai-deploy-your-llm-chatbot).\n* [How to create and query a Vector Search index](https:\/\/docs.databricks.com\/generative-ai\/create-query-vector-search.html).\n* [Example notebooks](https:\/\/docs.databricks.com\/generative-ai\/create-query-vector-search.html#example-notebooks)\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/vector-search.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Add an instance profile to a model serving endpoint\n\nThis article demonstrates how to attach an instance profile to a model serving endpoint. Doing so allows customers to access any AWS resources from the model permissible by the instance profile. Learn more about [instance profiles](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html).\n\n#### Add an instance profile to a model serving endpoint\n##### Requirements\n\n* [Create an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html).\n* [Add an instance profile to Databricks](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html#add-instance-profile). 
\n+ If you have an instance profile already configured for serverless SQL, be sure to change the access policies so that your models have the right access policy to your resources.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/add-model-serving-instance-profile.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Add an instance profile to a model serving endpoint\n##### Add an instance profile during endpoint creation\n\nWhen you [create a model serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html) you can add an instance profile to the endpoint configuration. \nNote \nThe endpoint creator\u2019s permission to an instance profile is validated at endpoint creation time. \n* From the Serving UI, you can add an instance profile in **Advanced configurations**: \n![Create a model serving endpoint](https:\/\/docs.databricks.com\/_images\/add-instance-profile1.png)\n* For programmatic workflows, use the `instance_profile_arn` field when you create an endpoint to add an instance profile. \n```\nPOST \/api\/2.0\/serving-endpoints\n\n{\n\"name\": \"feed-ads\",\n\"config\":{\n\"served_entities\": [{\n\"entity_name\": \"ads1\",\n\"entity_version\": \"1\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": true,\n\"instance_profile_arn\": \"arn:aws:iam::<aws-account-id>:instance-profile\/<instance-profile-name-1>\"\n}]\n}\n}\n\n```\n\n#### Add an instance profile to a model serving endpoint\n##### Update an existing endpoint with an instance profile\n\nYou can also update an existing model serving endpoint configuration with an instance profile with the `instance_profile_arn` field. 
\n```\nPUT \/api\/2.0\/serving-endpoints\/{name}\/config\n\n{\n\"served_entities\": [{\n\"entity_name\": \"ads1\",\n\"entity_version\": \"2\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": true,\n\"instance_profile_arn\": \"arn:aws:iam::<aws-account-id>:instance-profile\/<instance-profile-name-2>\"\n}]\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/add-model-serving-instance-profile.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Add an instance profile to a model serving endpoint\n##### Limitations\n\nThe following limitations apply: \n* STS temporary security credentials are used to authenticate data access. It can\u2019t bypass any network restriction.\n* If customers edit the instance profile IAM role from the **Settings** of the Databricks UI, endpoints running with the instance profile continue to use the old IAM role until the endpoint updates.\n* If customers delete an instance profile from the **Settings** of the Databricks UI and that profile is used in running endpoints, the running endpoint is not impacted. 
\nFor general model serving endpoint limitations, see [Model Serving limits and regions](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html).\n\n#### Add an instance profile to a model serving endpoint\n##### Additional resources\n\n* [Look up features](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/fs-authentication.html) using the same instance profile that you added to the serving endpoint.\n* [Configure access to resources from model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/store-env-variable-model-serving.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/add-model-serving-instance-profile.html"} +{"content":"# Databricks data engineering\n### Introduction to Databricks notebooks\n\nNotebooks are a common tool in data science and machine learning for developing code and presenting results. In Databricks, notebooks are the primary tool for creating data science and machine learning workflows and collaborating with colleagues. Databricks notebooks provide real-time coauthoring in multiple languages, automatic versioning, and built-in data visualizations. 
\nWith Databricks notebooks, you can: \n* [Develop code using Python, SQL, Scala, and R](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html).\n* [Customize your environment with the libraries of your choice](https:\/\/docs.databricks.com\/libraries\/index.html).\n* [Create regularly scheduled jobs to automatically run tasks, including multi-notebook workflows](https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html).\n* [Browse and access tables and volumes](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html#browse-data).\n* [Export results and notebooks](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html#export-notebook) in `.html` or `.ipynb` format.\n* [Use a Git-based repository to store your notebooks with associated files and dependencies](https:\/\/docs.databricks.com\/repos\/index.html).\n* [Build and share dashboards](https:\/\/docs.databricks.com\/notebooks\/dashboards.html).\n* [Open or run a Delta Live Tables pipeline](https:\/\/docs.databricks.com\/notebooks\/notebooks-dlt-pipeline.html).\n* (Experimental) [Use advanced editing capabilities](https:\/\/docs.databricks.com\/notebooks\/notebook-editor.html). \nNotebooks are also useful for [exploratory data analysis (EDA)](https:\/\/docs.databricks.com\/exploratory-data-analysis\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/index.html"} +{"content":"# Databricks data engineering\n### Introduction to Databricks notebooks\n#### How to import and run example notebooks\n\nThe Databricks documentation includes many example notebooks that are intended to illustrate how to use Databricks capabilities. To import one of these notebooks into a Databricks workspace: \n1. Click **Copy link for import** at the upper right of the notebook preview that appears on the page. 
\n### MLflow autologging quickstart Python notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-quick-start-python.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n2. In the workspace browser, [navigate to the location where you want to import the notebook](https:\/\/docs.databricks.com\/workspace\/workspace-objects.html).\n3. Right-click the folder and select **Import** from the menu.\n4. Click the **URL** radio button and paste the link you just copied in the field. \n![Import notebook from URL](https:\/\/docs.databricks.com\/_images\/import-nb-from-url.png)\n5. Click **Import**. The notebook is imported and opens automatically in the workspace. Changes you make to the notebook are saved automatically. For information about editing notebooks in the workspace, see [Develop code in Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html).\n6. To run the notebook, click ![Run all button](https:\/\/docs.databricks.com\/_images\/nb-run-all.png) at the top of the notebook. For more information about running notebooks and individual notebook cells, see [Run Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html). 
\nTo create a new, blank notebook in your workspace, see [Create a notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-notebook).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/index.html"} +{"content":"# Databricks data engineering\n### Introduction to Databricks notebooks\n#### Notebook orientation\n\n[Learn about the notebook interface and controls](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html)\n\n### Introduction to Databricks notebooks\n#### Start using Databricks notebooks\n\n* [Manage notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html): create, rename, delete, get the notebook path, configure editor settings.\n* [Develop and edit code in notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-code.html).\n* [Get AI-assisted coding help](https:\/\/docs.databricks.com\/notebooks\/use-databricks-assistant.html).\n* [Use the interactive debugger](https:\/\/docs.databricks.com\/notebooks\/debugger.html).\n* [Work with cell outputs](https:\/\/docs.databricks.com\/notebooks\/notebook-outputs.html): download results and visualizations, control display of results in the notebook.\n* [Run notebooks and schedule regular jobs](https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html).\n* [Collaborate using notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-collaborate.html): share a notebook, use comments in notebooks.\n* [Import and export notebooks](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html).\n* [Test notebooks](https:\/\/docs.databricks.com\/notebooks\/test-notebooks.html).\n* [Customize the libraries for your notebook](https:\/\/docs.databricks.com\/libraries\/index.html#notebook-scoped-libraries).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/index.html"} +{"content":"# Databricks data engineering\n### Introduction to Databricks notebooks\n#### Advanced material\n\n* [Notebook 
isolation](https:\/\/docs.databricks.com\/notebooks\/notebook-isolation.html).\n* [Open or run a Delta Live Tables pipeline](https:\/\/docs.databricks.com\/notebooks\/notebooks-dlt-pipeline.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/index.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Get started querying LLMs on Databricks\n\nThis article describes how to get started using Foundation Model APIs to serve and query LLMs on Databricks. \nThe easiest way to get started with serving and querying LLM models on Databricks is using [Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html) on a [pay-per-token](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#token-foundation-apis) basis. The APIs provide access to popular foundation models from pay-per-token endpoints that are automatically available in the Serving UI of your Databricks workspace. See [Supported models for pay-per-token](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html). \nYou can also test out and chat with pay-per-token models using the AI Playground. See [Chat with supported LLMs using AI Playground](https:\/\/docs.databricks.com\/large-language-models\/ai-playground.html). 
\nFor production workloads, particularly if you have a fine-tuned model or a workload that requires performance guarantees, Databricks recommends you upgrade to using Foundation Model APIs on a [provisioned throughput](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#throughput) endpoint.\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/llm-serving-intro.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Get started querying LLMs on Databricks\n#### Requirements\n\n* [Databricks workspace](https:\/\/docs.databricks.com\/workspace\/index.html) in a [supported region](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html#regions) for Foundation Model APIs pay-per-token.\n* Databricks personal access token to query and access Databricks model serving endpoints using the OpenAI client. \nImportant \nAs a security best practice for production scenarios, Databricks recommends that you use [machine-to-machine OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html) for authentication during production. \nFor testing and development, Databricks recommends using a personal access token belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/llm-serving-intro.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Get started querying LLMs on Databricks\n#### Get started using Foundation Model APIs\n\nThe following example queries the `databricks-dbrx-instruct` model that\u2019s served on the pay-per-token endpoint,`databricks-dbrx-instruct`. 
Learn more about the [DBRX Instruct model](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html#dbrx). \nIn this example, you use the OpenAI client to query the model by populating the `model` field with the name of the model serving endpoint that hosts the model you want to query. Set the `DATABRICKS_TOKEN` environment variable to your personal access token, and use your [Databricks workspace instance](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url) as the base URL to connect the OpenAI client to Databricks. \n```\nfrom openai import OpenAI\nimport os\n\nDATABRICKS_TOKEN = os.environ.get(\"DATABRICKS_TOKEN\")\n\nclient = OpenAI(\napi_key=DATABRICKS_TOKEN, # your personal access token\nbase_url='https:\/\/<workspace_id>.databricks.com\/serving-endpoints', # your Databricks workspace instance\n)\n\nchat_completion = client.chat.completions.create(\nmessages=[\n{\n\"role\": \"system\",\n\"content\": \"You are an AI assistant\",\n},\n{\n\"role\": \"user\",\n\"content\": \"What is a mixture of experts model?\",\n}\n],\nmodel=\"databricks-dbrx-instruct\",\nmax_tokens=256\n)\n\nprint(chat_completion.choices[0].message.content)\n\n``` \nThe `print` call outputs only the `content` field of the response message. The full response object has the following shape: \n```\n{\n\"id\": \"xxxxxxxxxxxxx\",\n\"object\": \"chat.completion\",\n\"created\": \"xxxxxxxxx\",\n\"model\": \"databricks-dbrx-instruct\",\n\"choices\": [\n{\n\"index\": 0,\n\"message\":\n{\n\"role\": \"assistant\",\n\"content\": \"A Mixture of Experts (MoE) model is a machine learning technique that combines the predictions of multiple expert models to improve overall performance.
Each expert model specializes in a specific subset of the data, and the MoE model uses a gating network to determine which expert to use for a given input.\"\n},\n\"finish_reason\": \"stop\"\n}\n],\n\"usage\":\n{\n\"prompt_tokens\": 123,\n\"completion_tokens\": 23,\n\"total_tokens\": 146\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/llm-serving-intro.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Get started querying LLMs on Databricks\n#### Next steps\n\n* Use the [AI playground](https:\/\/docs.databricks.com\/large-language-models\/ai-playground.html) to try out different models in a familiar chat interface.\n* [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html).\n* Access models hosted outside of Databricks using [external models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html).\n* Learn how to [deploy fine-tuned models using provisioned throughput endpoints](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html).\n* [Explore methods to monitor model quality and endpoint health](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/monitor-diagnose-endpoints.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/llm-serving-intro.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Prepare data for fine tuning Hugging Face models\n\nThis article demonstrates how to prepare your data for fine-tuning open source large language models with [Hugging Face Transformers](https:\/\/huggingface.co\/docs\/transformers\/index) and [Hugging Face Datasets](https:\/\/huggingface.co\/docs\/datasets\/index).\n\n##### Prepare data for fine tuning Hugging Face models\n###### Requirements\n\n* 
[Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html) 13.0 or above. The examples in this guide use Hugging Face [datasets](https:\/\/huggingface.co\/docs\/datasets\/index) which is included in Databricks Runtime 13.0 ML and above.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/load-data.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Prepare data for fine tuning Hugging Face models\n###### Load data from Hugging Face\n\nHugging Face Datasets is a Hugging Face library for accessing and sharing datasets for audio, computer vision, and natural language processing (NLP) tasks. With Hugging Face `datasets` you can load data from various places. The `datasets` library has utilities for reading datasets from the Hugging Face Hub. There are many datasets downloadable and readable from the Hugging Face Hub by using the `load_dataset` function. Learn more about [loading data with Hugging Face Datasets](https:\/\/huggingface.co\/docs\/datasets\/loading) in the Hugging Face documentation. \n```\nfrom datasets import load_dataset\ndataset = load_dataset(\"imdb\")\n\n``` \nSome datasets in the Hugging Face Hub provide the sizes of data that is downloaded and generated when `load_dataset` is called. You can use `load_dataset_builder` to know the sizes before downloading the dataset with `load_dataset`. 
\n```\nfrom datasets import load_dataset_builder\nfrom psutil._common import bytes2human\n\ndef print_dataset_size_if_provided(*args, **kwargs):\n    # Load only the dataset builder (metadata), not the dataset itself\n    dataset_builder = load_dataset_builder(*args, **kwargs)\n\n    if dataset_builder.info.download_size and dataset_builder.info.dataset_size:\n        print(f'download_size={bytes2human(dataset_builder.info.download_size)}, dataset_size={bytes2human(dataset_builder.info.dataset_size)}')\n    else:\n        print('Dataset size is not provided by uploader')\n\nprint_dataset_size_if_provided(\"imdb\")\n\n``` \nSee the [Download datasets from Hugging Face best practices notebook](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/load-data.html#notebook) for guidance on how to download and prepare datasets on Databricks for different sizes of data.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/load-data.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Prepare data for fine tuning Hugging Face models\n###### Format your training and evaluation data\n\nTo use your own data for model fine-tuning, you must first format your training and evaluation data into Spark DataFrames. Then, load the DataFrames using the Hugging Face `datasets` library. \nStart by formatting your training data into a table meeting the expectations of the trainer. For text classification, this is a table with two columns: a text column and a column of labels. \nTo perform fine-tuning, you need to provide a model. The Hugging Face Transformer [AutoClasses](https:\/\/huggingface.co\/docs\/transformers\/model_doc\/auto) library makes it easy to load models and configuration settings, including a wide range of `Auto Models` for [natural language processing](https:\/\/huggingface.co\/docs\/transformers\/model_doc\/auto#natural-language-processing).
\nFor example, Hugging Face `transformers` provides `AutoModelForSequenceClassification` as a model loader for text classification, which expects integer IDs as the category labels. However, if you have a DataFrame with string labels, you must also specify mappings between the integer labels and string labels when creating the model. You can collect this information as follows: \n```\nlabels = df.select(df.label).groupBy(df.label).count().collect()\nid2label = {index: row.label for (index, row) in enumerate(labels)}\nlabel2id = {row.label: index for (index, row) in enumerate(labels)}\n\n``` \nThen, create the integer IDs as a label column with a pandas UDF: \n```\nfrom pyspark.sql.functions import pandas_udf\nimport pandas as pd\n@pandas_udf('integer')\ndef replace_labels_with_ids(labels: pd.Series) -> pd.Series:\n    return labels.apply(lambda x: label2id[x])\n\ndf_id_labels = df.select(replace_labels_with_ids(df.label).alias('label'), df.text)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/load-data.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Prepare data for fine tuning Hugging Face models\n###### Load a Hugging Face dataset from a Spark DataFrame\n\nHugging Face `datasets` supports loading from Spark DataFrames using `datasets.Dataset.from_spark`. See the Hugging Face documentation to learn more about the [from\\_spark()](https:\/\/huggingface.co\/docs\/datasets\/use_with_spark) method. \nFor example, if you have `train_df` and `test_df` DataFrames, you can create datasets for each with the following code: \n```\nimport datasets\ntrain_dataset = datasets.Dataset.from_spark(train_df, cache_dir=\"\/dbfs\/cache\/train\")\ntest_dataset = datasets.Dataset.from_spark(test_df, cache_dir=\"\/dbfs\/cache\/test\")\n\n``` \n`Dataset.from_spark` caches the dataset. 
This example describes model training on the driver, so data must be made available to it. Additionally, since cache materialization is parallelized using Spark, the provided `cache_dir` must be accessible to all workers. To satisfy these constraints, `cache_dir` should be a [Databricks File System (DBFS) root volume](https:\/\/docs.databricks.com\/dbfs\/index.html) or [mount point](https:\/\/docs.databricks.com\/dbfs\/mounts.html). \nThe DBFS root volume is accessible to all users of the workspace and should only be used for data without access restrictions. If your data requires access controls, use a [mount point](https:\/\/docs.databricks.com\/dbfs\/mounts.html) instead of DBFS root. \nIf your dataset is large, writing it to DBFS can take a long time. To speed up the process, you can use the `working_dir` parameter to have Hugging Face `datasets` write the dataset to a temporary location on disk, then move it to DBFS. For example, to use the SSD as a temporary location: \n```\nimport datasets\ndataset = datasets.Dataset.from_spark(\ntrain_df,\ncache_dir=\"\/dbfs\/cache\/train\",\nworking_dir=\"\/local_disk0\/tmp\/train\",\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/load-data.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### What are Hugging Face Transformers?\n##### Prepare data for fine tuning Hugging Face models\n###### Caching for datasets\n\nThe cache is one of the ways `datasets` improves efficiency. It stores all downloaded and processed datasets so when the user needs to use the intermediate datasets, they are reloaded directly from the cache. \nThe default cache directory of datasets is `~\/.cache\/huggingface\/datasets`. When a cluster is terminated, the cache data is lost too. 
To persist the cache file on cluster termination, Databricks recommends changing the cache location to DBFS by setting the environment variable `HF_DATASETS_CACHE`: \n```\nimport os\nos.environ[\"HF_DATASETS_CACHE\"] = \"\/dbfs\/place\/you\/want\/to\/save\"\n\n```\n\n##### Prepare data for fine tuning Hugging Face models\n###### Fine-tune a model\n\nWhen your data is ready, you can use it to [fine-tune a Hugging Face model](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/fine-tune-model.html).\n\n##### Prepare data for fine tuning Hugging Face models\n###### Notebook: Download datasets from Hugging Face\n\nThis example notebook provides recommended best practices of using the Hugging Face `load_dataset` function to download and prepare datasets on Databricks for different sizes of data. \n### Download datasets from Hugging Face best practices notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/hugging-face-dataset-download.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/load-data.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n### What are workspace files?\n##### Store init scripts in workspace files\n\nDatabricks recommends storing init scripts in workspace files in Databricks Runtime 11.3 LTS and above if you are not using Unity Catalog. \nNote \nThere is limited support for init scripts in workspace files in Databricks Runtime 9.1 LTS and 10.4 LTS, but this support does not cover all common use patterns for init scripts, such as referencing other files from init scripts. Databricks recommends using init scripts in cloud object storage for Databricks Runtime 9.1 LTS and 10.4 LTS. 
\nFor more on workspace files, see [What are workspace files?](https:\/\/docs.databricks.com\/files\/workspace.html).\n\n##### Store init scripts in workspace files\n###### Where are init scripts stored in workspace files?\n\nInit scripts can be stored in any location the user uploading the init script has proper permissions. See [Folder ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#folders). \nLike all workspace files, init scripts use access control lists (ACLs) to control permissions. By default, only the user uploading an init script and workspace admins have permissions on these files. See [File ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#files). \nNote \nSome ACLs are inherited from directories to all files within a directory.\n\n##### Store init scripts in workspace files\n###### Use init scripts stored in workspace files\n\nInit scripts in workspace files are intended for use as cluster-scoped init scripts. See [Use cluster-scoped init scripts](https:\/\/docs.databricks.com\/init-scripts\/cluster-scoped.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/workspace-init-scripts.html"} +{"content":"# \n### Introduction to the well-architected data lakehouse\n\nAs a cloud architect, when you evaluate a data lakehouse implementation on the Databricks Data Intelligence Platform, you might want to know \u201cWhat is a *good* lakehouse?\u201d The *Well-architected lakehouse* articles provide guidance for lakehouse implementation. 
\nAt the outset, you might also want to know: \n* What is the scope of the lakehouse - in terms of capabilities and personas?\n* What is the vision for the lakehouse?\n* How does the lakehouse integrate with the customer\u2019s cloud architecture?\n\n### Introduction to the well-architected data lakehouse\n#### Articles about lakehouse architecture\n\n### The scope of the lakehouse \nThe first step to designing your data architecture with the Databricks Data Intelligence Platform is understanding its building blocks and how they would integrate with your systems. See [The scope of the lakehouse platform](https:\/\/docs.databricks.com\/lakehouse-architecture\/scope.html). \n### Guiding principles for the lakehouse \nGround rules that define and influence your architecture. They explain the vision behind a lakehouse implementation and form the basis for future decisions on your data, analytics, and AI architecture. See [Guiding principles for the lakehouse](https:\/\/docs.databricks.com\/lakehouse-architecture\/guiding-principles.html). \n### Downloadable lakehouse reference architectures \nDownloadable architecture blueprints outline the recommended setup of the Databricks Data Intelligence Platform and its integration with cloud providers\u2019 services. For reference architecture PDFs in 11 x 17 (A3) format, see [Download lakehouse reference architectures](https:\/\/docs.databricks.com\/lakehouse-architecture\/reference.html). \n### The seven pillars of the well-architected lakehouse, their principles, and best practices \nUnderstand the pros and cons of decisions you make when building the lakehouse. This framework provides architectural best practices for developing and operating a safe, reliable, efficient, and cost-effective lakehouse. 
See [Data lakehouse architecture: Databricks well-architected framework](https:\/\/docs.databricks.com\/lakehouse-architecture\/well-architected.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/index.html"} +{"content":"# \n### Query data\n\nQuerying data is the foundational step for performing nearly all data-driven tasks in Databricks. Regardless of the language or tool used, workloads start by defining a query against a table or other data source and then performing actions to gain insights from the data. This article outlines the core concepts and procedures for running queries across various Databricks product offerings, and includes code examples you can adapt for your use case. \nYou can query data interactively using: \n* Notebooks\n* SQL editor\n* File editor\n* Dashboards \nYou can also run queries as part of Delta Live Tables pipelines or workflows. \nFor an overview of streaming queries on Databricks, see [Query streaming data](https:\/\/docs.databricks.com\/query\/streaming.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/index.html"} +{"content":"# \n### Query data\n#### What data can you query with Databricks?\n\nDatabricks supports querying data in multiple formats and enterprise systems. The data you query using Databricks falls into one of two broad categories: data in a Databricks lakehouse and external data. \n### What data is in a Databricks lakehouse? \nThe Databricks Data Intelligence Platform stores all of your data in a Databricks lakehouse by default. \nThis means that when you run a basic `CREATE TABLE` statement to make a new table, you have created a lakehouse table. Lakehouse data has the following properties: \n* Stored in the Delta Lake format.\n* Stored in cloud object storage.\n* Governed by Unity Catalog. \nMost lakehouse data on Databricks is registered in Unity Catalog as managed tables. 
Managed tables provide the easiest syntax and behave like other tables in most relational database management systems. Managed tables are recommended for most use cases and are suitable for all users who don\u2019t want to worry about the implementation details of data storage. \nAn *unmanaged table*, or *external table*, is a table registered with a `LOCATION` specified. The term *external* can be misleading, as external Delta tables are still lakehouse data. Unmanaged tables might be preferred by users who directly access tables from other Delta reader clients. For an overview of differences in table semantics, see [What is a table?](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#table). \nSome legacy workloads might exclusively interact with Delta Lake data through file paths and not register tables at all. This data is still lakehouse data, but can be more difficult to discover because it\u2019s not registered to Unity Catalog. \nNote \nYour workspace administrator might not have upgraded your data governance to use Unity Catalog. You can still get many of the benefits of a Databricks lakehouse without Unity Catalog, but not all functionality listed in this article or throughout the Databricks documentation is supported. \n### What data is considered external? \nAny data that isn\u2019t in a Databricks lakehouse can be considered external data. Some examples of external data include the following: \n* Foreign tables registered with Lakehouse Federation.\n* Tables in the Hive metastore backed by Parquet.\n* External tables in Unity Catalog backed by JSON.\n* CSV data stored in cloud object storage.\n* Streaming data read from Kafka. \nDatabricks supports configuring connections to many data sources. See [Connect to data sources](https:\/\/docs.databricks.com\/connect\/index.html). 
\nWhile you can use Unity Catalog to govern access to and define tables against data stored in multiple formats and external systems, Delta Lake is a requirement for data to be considered in the lakehouse. \nDelta Lake provides all of the transactional guarantees in Databricks, which are crucial for maintaining data integrity and consistency. If you want to learn more about transactional guarantees on Databricks data and why they\u2019re important, see [What are ACID guarantees on Databricks?](https:\/\/docs.databricks.com\/lakehouse\/acid.html). \nMost Databricks users query a combination of lakehouse data and external data. Connecting with external data is always the first step for data ingestion and ETL pipelines that bring data into the lakehouse. For information about ingesting data, see [Ingest data into a Databricks lakehouse](https:\/\/docs.databricks.com\/ingestion\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/index.html"} +{"content":"# \n### Query data\n#### Query tables by name\n\nFor all data registered as a table, Databricks recommends querying using the table name. \nIf you\u2019re using Unity Catalog, tables use a three-tier namespace with the following format: `<catalog-name>.<schema-name>.<table-name>`. \nWithout Unity Catalog, table identifiers use the format `<schema-name>.<table-name>`. \nNote \nDatabricks inherits much of its SQL syntax from Apache Spark, which does not differentiate between `SCHEMA` and `DATABASE`. \nQuerying by table name is supported in all Databricks execution contexts and supported languages. \n```\nSELECT * FROM catalog_name.schema_name.table_name\n\n``` \n```\nspark.read.table(\"catalog_name.schema_name.table_name\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/index.html"} +{"content":"# \n### Query data\n#### Query data by path\n\nYou can query structured, semi-structured, and unstructured data using file paths. Most files on Databricks are backed by cloud object storage. 
See [Work with files on Databricks](https:\/\/docs.databricks.com\/files\/index.html). \nDatabricks recommends configuring all access to cloud object storage using Unity Catalog and defining volumes for object storage locations that are directly queried. Volumes provide human-readable aliases to locations and files in cloud object storage using catalog and schema names for the file path. See [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html). \nThe following examples demonstrate how to use Unity Catalog volume paths to read JSON data: \n```\nSELECT * FROM json.`\/Volumes\/catalog_name\/schema_name\/volume_name\/path\/to\/data`\n\n``` \n```\nspark.read.format(\"json\").load(\"\/Volumes\/catalog_name\/schema_name\/volume_name\/path\/to\/data\")\n\n``` \nFor cloud locations that aren\u2019t configured as Unity Catalog volumes, you can query data directly using URIs. You must configure access to cloud object storage to query data with URIs. See [Configure access to cloud object storage for Databricks](https:\/\/docs.databricks.com\/connect\/storage\/index.html). 
\nThe following examples demonstrate how to use URIs to query JSON data in Azure Data Lake Storage Gen2, GCS, and S3: \n```\nSELECT * FROM json.`abfss:\/\/container-name@storage-account-name.dfs.core.windows.net\/path\/to\/data`;\n\nSELECT * FROM json.`gs:\/\/bucket_name\/path\/to\/data`;\n\nSELECT * FROM json.`s3:\/\/bucket_name\/path\/to\/data`;\n\n``` \n```\nspark.read.format(\"json\").load(\"abfss:\/\/container-name@storage-account-name.dfs.core.windows.net\/path\/to\/data\")\n\nspark.read.format(\"json\").load(\"gs:\/\/bucket_name\/path\/to\/data\")\n\nspark.read.format(\"json\").load(\"s3:\/\/bucket_name\/path\/to\/data\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/index.html"} +{"content":"# \n### Query data\n#### Query data using SQL warehouses\n\nDatabricks uses SQL warehouses for compute in the following interfaces: \n* SQL editor\n* Databricks SQL queries\n* Dashboards\n* Legacy dashboards\n* SQL alerts \nYou can optionally use SQL warehouses with the following products: \n* Databricks notebooks\n* Databricks file editor\n* Databricks workflows \nWhen you query data with SQL warehouses, you can use only SQL syntax. Other programming languages and APIs are not supported. \nFor workspaces that are enabled for Unity Catalog, SQL warehouses always use Unity Catalog to manage access to data sources. \nMost queries that are run on SQL warehouses target tables. Queries that target data files should leverage Unity Catalog volumes to manage access to storage locations. \nUsing URIs directly in queries run on SQL warehouses can lead to unexpected errors.\n\n### Query data\n#### Query data using all purpose compute or jobs compute\n\nMost queries that you run from Databricks notebooks, workflows, and the file editor run against compute clusters configured with Databricks Runtime. You can configure these clusters to run interactively or deploy them as *jobs compute* that power workflows. 
Databricks recommends that you always use jobs compute for non-interactive workloads. \n### Interactive versus non-interactive workloads \nMany users find it helpful to view query results while transformations are processed during development. When you move an interactive workload from all-purpose compute to jobs compute, you can save time and processing costs by removing queries that display results. \nApache Spark uses lazy code execution, meaning that results are calculated only as necessary, and multiple transformations or queries against a data source can be optimized as a single query if you don\u2019t force results. This contrasts with the eager execution mode used in pandas, which requires calculations to be processed in order before passing results to the next method. \nIf your goal is to save cleaned, transformed, aggregated data as a new dataset, you should remove queries that display results from your code before scheduling it to run. \nFor small operations and small datasets, the time and cost savings might be marginal. Still, with large operations, substantial time can be wasted calculating and printing results to a notebook that might not be manually inspected. The same results could likely be queried from the saved output at almost no cost after storing them.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/index.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Set up Databricks Git folders (Repos)\n##### Enable or disable the Databricks Git folder feature\n\nThe Databricks Git folders feature is enabled by default for new workspaces, but can be disabled by admins using the Databricks REST API [\/api\/2.0\/workspace-conf](https:\/\/docs.databricks.com\/api\/workspace\/workspaceconf\/setstatus) or a Databricks SDK. Admins can also use the REST API or SDK to turn on Databricks Git folders for older workspaces where the feature has been disabled in the past. 
This topic provides an example notebook you can use to perform this operation.\n\n##### Enable or disable the Databricks Git folder feature\n###### Run a notebook to enable (or disable) the Databricks Git folder feature\n\nImport this notebook into the Databricks UI and run it to enable the Databricks Git folder feature. To disable the Databricks Git folder feature, call [\/api\/2.0\/workspace-conf](https:\/\/docs.databricks.com\/api\/workspace\/workspaceconf\/setstatus) and set `enableProjectTypeInWorkspace` to `false`. \n### Enable a Databricks Git folder \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/repos\/turn-on-repos.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/enable-disable-repos-with-api.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query history\n\nThe query history shows SQL queries performed using [SQL warehouses](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html). The query history holds query data for the past 30 days, after which it is automatically deleted. \nNote \nIf your workspace is enabled for the serverless compute public preview, your query history will also contain all SQL and Python queries run on serverless compute for notebooks and jobs. See [Serverless compute for notebooks](https:\/\/docs.databricks.com\/compute\/serverless.html). \nYou can use the information available through this screen to help you debug issues with queries. \nThis section describes how to work with query history using the UI. To work with query history using the API, see [Query History API](https:\/\/docs.databricks.com\/api\/workspace\/queryhistory). \nImportant \nThe time recorded in query history for a SQL query is only the time the SQL warehouse spends actually executing the query. 
It does not record any additional overhead associated with getting ready to execute the query, such as internal queuing, or additional time related to the data upload and download process.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query history\n##### View query history\n\nTo view the history of all executions of a query: \n1. Click ![History Icon](https:\/\/docs.databricks.com\/_images\/history-icon.png) **Query History** in the sidebar.\n2. Optionally, click **Duration** to sort the list by duration. By default, the list is sorted by start time.\n3. Click the name of a query to see more details, such as the SQL command and the [execution details](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-profile.html). \nYou can filter the list by user, date range, SQL warehouse, and query status. \nIf you\u2019re a non-admin user without `CAN_MANAGE` permissions, you can only view your own queries in **Query History**. \nNote \nQueries shared by a user with **Run as Owner** permissions to another user with CAN RUN permissions appear in the query history of the user executing the query and not the user that shared the query.\n\n#### Query history\n##### View query details\n\nTo view details about a query, such as its duration, SQL command, number of rows returned, and I\/O performance: \n1. View [query history](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html#view-query-history).\n2. Click the name of a query. \n![Query history details](https:\/\/docs.databricks.com\/_images\/query-details.png) \nBrief information about a query\u2019s performance appears, such as time spent in each task, rows returned, and I\/O performance.\n3. For more detailed information about the query\u2019s performance, including its execution plan, click **View Query Profile** at the bottom of the page. 
For more details, see [Query profile](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-profile.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query history\n##### Terminate an executing query\n\nTo terminate a long-running query started by you or another user: \n1. View [query history](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html#view-query-history).\n2. Click the name of a query.\n3. Next to **Status**, click **Cancel**. \nNote \n**Cancel** only appears when a query is running. \nThe query is terminated and its status changes to **Canceled**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n### What are workspace files?\n","doc_uri":"https:\/\/docs.databricks.com\/files\/cwd-dbr-14.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n### What are workspace files?\n##### What is the default current working directory?\n\nThis article describes how the default current working directory (CWD) works for notebook and file execution. \nNote \nUse Databricks Runtime 14.0+ and default workspace configs for more consistent CWD behavior throughout the workspace. \nThere are two default CWD behaviors for code executed locally in [notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html) and [files](https:\/\/docs.databricks.com\/files\/index.html): \n1. CWD returns the directory containing the notebook or script being run.\n2. CWD returns a directory representing the ephemeral storage volume attached to the driver. \nThis CWD behavior affects all code, including `%sh` and Python or R code that doesn\u2019t use Apache Spark. 
The behavior is determined by code language, Databricks Runtime version, workspace path, and workspace admin configuration. \nFor Scala code, the CWD is the **ephemeral storage attached to the driver**. \nFor code in all other languages: \n* In Databricks Runtime 14.0 and above, the CWD is the **directory containing the notebook or script being run**. This is true regardless of whether the code is in `\/Workspace\/Repos`.\n* For notebooks running Databricks Runtime 13.3 LTS and below, the CWD depends on whether the code is in `\/Workspace\/Repos`:\n* For code executed in a path outside of `\/Workspace\/Repos`, the CWD is the ephemeral storage volume attached to the driver\n* For code executed in a path in `\/Workspace\/Repos`, the CWD depends on your admin config setting and cluster DBR version: \n+ For workspaces with `enableWorkspaceFilesystem` set to `dbr8.4+` or `true`, on DBR versions 8.4 and above, the CWD is the directory containing the notebook or script being run. On DBR versions below 8.4, it is the ephemeral storage volume attached to the driver\n+ For workspaces with `enableWorkspaceFilesystem` set to `dbr11.0+`, on DBR versions 11.0 and above, the CWD is the directory containing the notebook or script being run. 
On DBR versions below 11.0, it is the ephemeral storage volume attached to the driver\n+ For workspaces with `enableWorkspaceFilesystem` set to `false`, the CWD is the ephemeral storage volume attached to the driver\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/cwd-dbr-14.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n### What are workspace files?\n##### What is the default current working directory?\n###### How does this impact workloads?\n\nThe biggest impacts to workloads have to do with file persistence and location: \n* In Databricks Runtime 13.3 LTS and below, for code executed in a path outside of `\/Workspace\/Repos`, many code snippets store data to a default location on an ephemeral storage volume that is permanently deleted when the cluster is terminated.\n* In Databricks Runtime 14.0 and above, the default behavior for these operations creates workspace files stored alongside the running notebook that persist until explicitly deleted. \nFor notes on performance differences and other limitations inherent in workspace files, see [Work with workspace files](https:\/\/docs.databricks.com\/files\/index.html#work-with-workspace-files).\n\n##### What is the default current working directory?\n###### Revert to legacy behavior\n\nYou can change the current working directory for any notebook using the Python method `os.chdir()`. If you want to ensure that each notebook uses a CWD on the ephemeral storage volumes attached to the driver, you can add the following command to the first cell of each notebook and run it before any other code: \n```\nimport os\n\nos.chdir(\"\/tmp\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/cwd-dbr-14.html"} +{"content":"# \n### What is a Genie space?\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
\nA Genie space is a no-code interface powered by DatabricksIQ where business users can interact with the Databricks Assistant to analyze data using natural language. Domain experts, like data analysts, configure Genie spaces with datasets, sample queries, and text guidelines to help the Assistant translate business questions into analytical queries. After setup, business users can ask questions and generate visualizations to understand operational data. \nSee [DatabricksIQ-powered features](https:\/\/docs.databricks.com\/databricksiq\/index.html). \nData analysts can prepare a domain-specific Genie space experience for business users by doing the following: \n* Selecting relevant tables from Unity Catalog and exposing their metadata (table and column descriptions) in the Genie space.\n* Adding instructions that transfer organization-specific information (business logic and metadata) into the Genie space.\n\n### What is a Genie space?\n#### Example use cases\n\nYou can create different Genie spaces to serve a variety of non-technical audiences. The following scenarios describe two possible use cases. \n### Get status with visualization \nA sales manager wants to get the current status of open and closed opportunities by stage in their sales pipeline. They can interact with the Genie space using natural language, and automatically generate a visualization. \nThe following gif shows this interaction: \n![Gif with sample question, response, and auto-generated visualization](https:\/\/docs.databricks.com\/_images\/sample-q-a.gif) \n### Tracking logistics \nA logistics company wants to use Genie spaces to help business users from different departments track operational and financial details. 
They set up a Genie space for their shipment facility managers to track shipments and another for their financial executives to understand their financial health.\n\n","doc_uri":"https:\/\/docs.databricks.com\/prpr-67656E69652D737061636573.html"} +{"content":"# \n### What is a Genie space?\n#### Technical requirements\n\n* Genie spaces use data registered to Unity Catalog.\n* Genie spaces require a Pro or Serverless warehouse. \n* Creating Genie spaces with the Databricks Assistant requires enabling **Partner-powered AI assistive features**. For details on enabling Databricks Assistant, see [What is Databricks Assistant?](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html). For questions about privacy and security, see [Privacy and security](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html#privacy-security).\n\n### What is a Genie space?\n#### How are Genie space responses generated?\n\nGenie spaces generate responses to natural language questions using table and column names and descriptions. The actual data in the tables remains hidden from the Assistant. \nThe Assistant uses the names and descriptions to convert natural language questions to an equivalent SQL query. Then, it provides a response that includes the results of that query as a table. Genie space authors and end users can inspect the generated SQL query that produces each response. \nWhen creating visualizations, the first row of query results is shared with the Assistant. This preserves data privacy while leveraging database annotations to inform responses.\n\n### What is a Genie space?\n#### Required permissions\n\nYou must have at least CAN USE privileges on a SQL warehouse to set up a Genie space. When you save your Genie space, you are prompted to select a default SQL warehouse that will be used to generate responses to user questions. \nAccess to data for Genie space authors and end-users is governed by Unity Catalog permissions. 
See [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/prpr-67656E69652D737061636573.html"} +{"content":"# \n### What is a Genie space?\n#### How do I add data?\n\nGenie spaces work exclusively with data objects registered to Unity Catalog. They use the metadata attached to Unity Catalog objects to generate responses. Well-annotated datasets, paired with specific instructions that you provide, are key to creating a positive experience for end users. \nDatabricks recommends the following: \n* **Curate data for analytical consumption**: Layer views to reduce the number of columns and add use-case-specific information to increase response quality.\n* **Minimize the number of columns in a Genie space**: Use up to five closely related tables. Each table should hold fewer than 25 columns. \nGenie Spaces fully respects and enforces UC permissions, including row-level security and column-based masking. Users must have SELECT privileges on the data and CAN USE privileges on the catalog and schema. \nYou can create new Genie spaces based on one or more Unity Catalog managed tables. Closely related, well-annotated datasets, paired with specific instructions that you provide, are critical to creating a positive experience for end users.\n\n### What is a Genie space?\n#### Create a new Genie space\n\nWhen you create a new Genie space, a **New Genie space** dialog shows the following options. \n* **Title**: The title appears in the workspace browser with other workspace objects. Choose a title that will help end users discover your Genie space.\n* **Description**: Users see the description when they open the Genie space. Use this text area to describe the room\u2019s purpose.\n* **Default warehouse**: This compute resource powers the SQL statements generated in the Genie spaces. A Genie space can use a pro or serverless SQL warehouse. 
Serverless SQL warehouses offer optimal performance.\n* **Tables**: Genie spaces can be based on one or more tables. The dialog prompts you to add a table by choosing from each drop-down selector: **Catalog**, **Schema**, and **Table**. \nWhen you have selected a table, it is automatically added to the room. To add another table, use the drop-down selectors to choose another table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/prpr-67656E69652D737061636573.html"} +{"content":"# \n### What is a Genie space?\n#### Chat in the Genie space\n\nAfter it is created, most Genie space interactions take place in the chat window. \nA new chat window includes a set of **Quick actions** tiles that can help users get started with the Genie space. The text field, where users input questions, is near the bottom of the screen. \n![New chat window with help tiles at the top of the screen and a text input field at the bottom.](https:\/\/docs.databricks.com\/_images\/new-chat-window.png) \nResponses appear above the text field. After a user enters a question, it is saved to a chat history thread in the left pane.\n\n### What is a Genie space?\n#### Chat history\n\nChat history threads are saved for each user so that they can refer to past questions and answers. Users can also resubmit or revise questions from a chat thread. The **New chat** button in the left pane starts a new thread. \nEach chat thread maintains its context, so the Assistant considers previous questions it has been asked. This allows users to ask follow-up questions to further explore or refocus a result set.\n\n### What is a Genie space?\n#### Response structure\n\nThe precise response structure varies based on the question. Often, it includes a natural language explanation and a table that shows the relevant result set. All responses include the SQL query that was generated to answer the question. Click **Show generated code** to view the generated query. 
\nThe bottom-right side of the response includes optional actions. You can copy the response CSV to your clipboard, download it as a CSV file, add it as an instruction for the Genie space, and upvote or downvote the answer. \nA set of **Quick actions** tiles follow responses that include tabular data. You can use them to generate visualizations. \n![Quick action tiles that suggest different visualization options.](https:\/\/docs.databricks.com\/_images\/quick-viz-actions.png) \nYou can also generate a visualization by describing it in words.\n\n","doc_uri":"https:\/\/docs.databricks.com\/prpr-67656E69652D737061636573.html"} +{"content":"# \n### What is a Genie space?\n#### Provide instructions\n\nInstructions help to guide the Assistant\u2019s responses so that it can process the unique jargon, logic, and concepts in a given domain. You can write instructions as example queries or snippets of plain text that help the Assistant answer questions that room users are likely to ask. Comprehensive instructions are critical to a seamless, intuitive Genie space experience. \nThe following examples illustrate various types of instructions: \n* **Company-specific business information**: \n+ \u201cOur fiscal year starts in February\u201d\n* **Values, aliases, or common filters**: \n+ \u201cAlways convert to lowercase and use a like operator when applying filters.\u201d\n+ \u201cUse abbreviations for states in filter values.\u201d\n* **User-defined functions available through Unity Catalog**: \n+ \u201cFor quarters use the `adventureworks.oneb.get_quarter(date)` UDF. The output of get\\_quarter is the quarter and is either 1,2,3, or 4. 
Use this to filter the data as needed. For example, for quarter 3, use where `adventureworks.oneb.get_quarter(posted_date) = 3`\u201d\n* **Sample SQL instructions**: \n+ You can provide samples of queries that you expect the Assistant to generate.\n+ Focus on providing samples that highlight logic that is unique to your organization and data, as in the following example:\n```\n-- Return our current total open pipeline by region.\n-- Opportunities are only considered pipeline if they are tagged as such.\nSELECT\na.region__c AS `Region`,\nsum(o.amount) AS `Open Pipeline`\nFROM\nsales.crm.opportunity o\nJOIN sales.crm.accounts a ON o.accountid = a.id\nWHERE\no.forecastcategory = 'Pipeline' AND\no.stagename NOT ILIKE '%closed%'\nGROUP BY ALL;\n\n``` \nYou can organize Genie space instructions as one long note or group them by related topics for better structure. You can also add certified answers to provide validated responses for users. See [Use certified answers in Genie spaces](https:\/\/docs.databricks.com\/prpr-ans-67656E69652D737061636573.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/prpr-67656E69652D737061636573.html"} +{"content":"# \n### What is a Genie space?\n#### Best practices for room preparation\n\n* Include a set of well-defined questions that you want room users to be able to answer.\n* Test your Genie space to check response quality. Try the following to see if the model provides the expected response: \n+ Rephrase the provided questions.\n+ Ask other questions related to the datasets.\n* Add and refine Genie space instructions until questions provide the expected response.\n\n### What is a Genie space?\n#### Share a Genie space\n\nImportant \nGenie space users must interact with data using their own credentials. Questions about data they cannot access generate empty responses. 
\nGenie space users must have CAN\\_USE permissions on the warehouse attached to a Genie space, and access permissions on the Unity Catalog objects surfaced in the space. See [How do I add data?](https:\/\/docs.databricks.com\/prpr-67656E69652D737061636573.html#add-uc-data). \nNew Genie spaces are saved to your user folder by default. Like other workspace objects, they inherit permissions from their enclosing folder. You can use your workspace folder structure to share them with other users. See [Organize workspace objects into folders](https:\/\/docs.databricks.com\/workspace\/workspace-objects.html). \nYou can also specify certain users or groups to share with at a given permission level: **Can Manage**, **Can Edit**, **Can Run**, and **Can View**. \nTo share with specific users or groups: \n1. Click **Share**.\n2. In the **Share** dialog, click **Open in Workspace**.\n3. In the Workspace browser window, enter users or groups that you want to share with, and then set permission levels as appropriate.\n\n","doc_uri":"https:\/\/docs.databricks.com\/prpr-67656E69652D737061636573.html"} +{"content":"# What is Databricks Marketplace?\n### Create and manage private exchanges in Databricks Marketplace\n\nThis article is intended for data providers and describes how to create and manage private exchanges in Databricks Marketplace. Private exchanges enable you to share data products with a defined group of invited consumers. Private exchange listings do not appear in the public marketplace.\n\n### Create and manage private exchanges in Databricks Marketplace\n#### What is a private exchange?\n\nA private exchange allows you to make certain data products discoverable only to a specified group of consumers in Databricks Marketplace. Like Delta Sharing, upon which Databricks Marketplace is built, private exchanges give you the ability to share data securely and privately with select recipients. 
Unlike Delta Sharing, private exchanges have the additional advantage of making data products discoverable by members *before* the data product is shared, so consumers are aware that those data products are available to request. Databricks Marketplace private exchanges also provide consumers with a storefront interface for requesting and accessing data products that might be easier to use than Delta Sharing on its own. \n### What is a private listing? \nListings are defined as public or private. To share a listing in a private exchange, it must be defined as private. You can add existing private listings to your exchange, and you can create new private listings, assigning them to the exchange when you create the listing. See [Create listings](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#create-listing). \nYou can share free listings that are available instantly or listings that require your approval before the member can access them. You can create new private listings and add them to your exchange at any time. You can also edit a public listing to make it private. See [Create listings](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#create-listing).\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/private-exchange.html"} +{"content":"# What is Databricks Marketplace?\n### Create and manage private exchanges in Databricks Marketplace\n#### Before you begin\n\n* To create and manage private exchanges, you need the [Marketplace admin role](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#marketplace-admin).\n* Private exchange members must be Databricks customers and must have access to a workspace that is attached to a Unity Catalog metastore.\n* When you add a member to your private exchange, you must enter their metastore\u2019s sharing identifier. 
\nThe sharing identifier is a string consisting of the metastore\u2019s cloud, region, and UUID (the unique identifier for the metastore), in the format `<cloud>:<region>:<uuid>`. For example, `aws:eu-west-1:b0c978c8-3e68-4cdf-94af-d05c120ed1ef`. \nTo get the sharing identifier, reach out to your contact at the member organization. You might need to tell them how to get the sharing identifier. For instructions, see [Step 1: Request the recipient\u2019s sharing identifier](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#request-uuid). You can also point them to step 1 of [Get access in the Databricks-to-Databricks model](https:\/\/docs.databricks.com\/data-sharing\/recipient.html#get-access-db-to-db) for instructions. \n### Set up a private exchange \nYou create the private exchange using the provider console in Databricks Marketplace. You can add members and listings when you create the exchange or after. \n**Permission required:** [Marketplace admin role](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#marketplace-admin). \nTo create a private exchange: \n1. Log into your Databricks workspace.\n2. In the sidebar, click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**.\n3. On the upper-right corner of the Marketplace page, click **Provider console**.\n4. On the **Exchanges** tab in the provider console, click **Create exchange**.\n5. On the **Create exchange** dialog, enter the name of your exchange and click **Create**. \nUse a human-friendly name that helps you and other users who are managing exchanges recognize the purpose of the exchange.\n6. On the **Exchanges** tab, find the exchange and click the name.\n7. Add the members you want to share with. \n1. On the **Members** tab, click **Add member**.\n2. On the dialog, add a human-readable name for the member and enter the **Sharing identifier**. 
\nTo learn how to get the sharing identifier from the member organization, see [Before you begin](https:\/\/docs.databricks.com\/marketplace\/private-exchange.html#before).\n3. Click **Add member**.\n8. Add the listings you want to share. \n1. On the **Listings** tab, click **Add listing**.\n2. In the dialog, select one of the existing private listings from the drop-down list and click **Add listing**. You can also create a new private listing or edit a public listing to make it private. \n1. Click **create a new listing** to go to the **New listing** page.\n2. When you create the listing, select the **Private exchange** option and select the exchange from the drop-down list. See [Create listings](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#create-listing).\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/private-exchange.html"} +{"content":"# What is Databricks Marketplace?\n### Create and manage private exchanges in Databricks Marketplace\n#### Edit members or remove them from a private exchange\n\n**Permission required:** [Marketplace admin role](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#marketplace-admin). \nTo edit or remove a member: \n1. Log in to your Databricks workspace.\n2. In the sidebar, click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**.\n3. On the upper-right corner of the Marketplace page, click **Provider console**.\n4. On the **Exchanges** tab in the provider console, find and click the exchange name.\n5. On the **Members** tab, find the member, and click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu (also known as the three-dot menu) at the end of the member row. \nTo edit the member, select **Edit**. You can update both the member name and sharing identifier. 
\nTo remove the member, select **Remove** on the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/private-exchange.html"} +{"content":"# What is Databricks Marketplace?\n### Create and manage private exchanges in Databricks Marketplace\n#### Remove listings from a private exchange\n\n**Permission required:** [Marketplace admin role](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#marketplace-admin). \nTo remove a listing from an exchange: \n1. Log into your Databricks workspace.\n2. In the sidebar, click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**.\n3. On the upper-right corner of the Marketplace page, click **Provider console**.\n4. On the **Exchanges** tab in the provider console, find and click the exchange name.\n5. On the **Listings** tab, find the listing, click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu at the far right, and select **Remove**.\n6. On the confirmation dialog, click **Remove**. \nYou can also remove the link between a listing and an exchange by editing the listing and removing the exchange from the field under **Private exchange**.\n\n### Create and manage private exchanges in Databricks Marketplace\n#### Manage member requests for data products\n\nYou manage member requests for data products in private exchanges the same way that you do for data products in the public marketplace. See [Manage requests for your data product in Databricks Marketplace](https:\/\/docs.databricks.com\/marketplace\/manage-requests-provider.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/private-exchange.html"} +{"content":"# \n### Run offline evaluation with a `\ud83d\udcd6 Evaluation Set`\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html) \nThis tutorial walks you through the process of using a `\ud83d\udcd6 Evaluation Set` to evaluate a RAG Application\u2019s quality\/cost\/latency. This step is performed after you create a new version of your RAG Application to evaluate if you improved the application\u2019s performance and didn\u2019t cause any performance regressions. \nIn this tutorial, we use a sample evaluation set provided by Databricks. In the next tutorials, we walk you through the process of [collecting user feedback](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/4-collect-feedback.html) in order to [create your own evaluation set](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/5-create-eval-set.html).\n\n### Run offline evaluation with a `\ud83d\udcd6 Evaluation Set`\n#### Data flow\n\n![legend](https:\/\/docs.databricks.com\/_images\/offline_eval.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/3-run-offline-eval.html"} +{"content":"# \n### Run offline evaluation with a `\ud83d\udcd6 Evaluation Set`\n#### Step 1: Load the sample evaluation set\n\n1. Open a Databricks Notebook and run the following code to save the sample `\ud83d\udcd6 Evaluation Set` to a Unity Catalog schema. \nNote \nRAG Studio supports using `\ud83d\udcd6 Evaluation Set`s that are stored in any Unity Catalog schema, but for organizational purposes, Databricks suggests keeping your evaluation sets in the same Unity Catalog schema as your RAG Application. 
\n```\nimport requests\nimport pandas as pd\nfrom pyspark.sql.types import StructType, StructField, StringType, TimestampType, ArrayType\n\nschema = StructType([StructField('request', StructType([StructField('request_id', StringType(), True), StructField('conversation_id', StringType(), True), StructField('timestamp', TimestampType(), True), StructField('messages', ArrayType(StructType([StructField('role', StringType(), True), StructField('content', StringType(), True)]), True), True), StructField('last_input', StringType(), True)]), True), StructField('ground_truth', StructType([StructField('text_output', StructType([StructField('content', StringType(), True)]), True), StructField('retrieval_output', StructType([StructField('name', StringType(), True), StructField('chunks', ArrayType(StructType([StructField('doc_uri', StringType(), True), StructField('content', StringType(), True)]), True), True)]), True)]), True)])\n\ndf = spark.createDataFrame(\npd.read_json(\"http:\/\/docs.databricks.com\/_static\/notebooks\/rag-studio\/example_evaluation_set.json\"),\nschema,\n)\ndf.write.format(\"delta\").mode(\"overwrite\").saveAsTable(\n'catalog.schema.example_evaluation_set'\n)\n\n```\n2. Inspect the loaded `\ud83d\udcd6 Evaluation Set` to understand the schema. \nNote \nYou will notice that the `request` schema is *identical* to the `request` schema in the <rag-response-log>. This is intentional to allow you to easily translate request logs into evaluation sets. \n* `request`: the user\u2019s input to the RAG Application\n* `ground_truth`: the ground truth label for the response and retrieval steps\n![logs](https:\/\/docs.databricks.com\/_images\/eval-set-sample.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/3-run-offline-eval.html"} +{"content":"# \n### Run offline evaluation with a `\ud83d\udcd6 Evaluation Set`\n#### Step 2: Run offline evaluation to compute metrics\n\n1. 
Run the evaluation set through version `1` of the application that you created in the [first tutorial](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1-create-sample-app.html) by running the following command in your console. This job takes about 10 minutes to complete. \n```\n.\/rag run-offline-eval --eval-table-name catalog.schema.example_evaluation_set -v 1 -e dev\n\n``` \nNote \n**What happens behind the scenes?** \nIn the background, the Chain version `1` is run through each row of the `catalog.schema.example_evaluation_set` using an identical compute environment to how your Chain is served. For each row of `catalog.schema.example_evaluation_set`: \n* A row is written to a `\ud83d\uddc2\ufe0f Request Log` called `catalog.schema.example_evaluation_set_request_log` inside the same Unity Catalog schema as the evaluation set\n* A row is written to the `\ud83d\udc4d Assessment & Evaluation Results Log` called `catalog.schema.example_evaluation_set_assessment_log` with <llm-judge> assessments and metric computations\nThe name of the tables are based on the name of the input evaluation set table. \nNote that the schema of the `\ud83d\uddc2\ufe0f Request Log` and `\ud83d\udc4d Assessment & Evaluation Results Log` are intentionally identical to the logs you viewed in the [view logs tutorial](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/2-view-logs.html). \n![results](https:\/\/docs.databricks.com\/_images\/eval-set-results.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/3-run-offline-eval.html"} +{"content":"# \n### Run offline evaluation with a `\ud83d\udcd6 Evaluation Set`\n#### Step 3: Open the metrics UI\n\n1. Run the following command to open the metrics Notebook. This job takes about 10 minutes to complete. \nNote \nIf you have multiple versions of your application, you can run step 2 for each version, and then pass `--versions 2,3,4` or `--versions *` to compare the different versions within the notebook. 
\n```\n.\/rag explore-eval --eval-table-name catalog.schema.example_evaluation_set -e dev --versions 1\n\n```\n2. Click on the URL that is provided in the console output.\n3. Click to open the Notebook associated with the Databricks Job. \n![results](https:\/\/docs.databricks.com\/_images\/explore-eval.png)\n4. Run the first 2 cells to populate the widgets and then fill in the names of the tables from step 2. \n* `assessment_log_table_name`: `catalog.schema.example_evaluation_set_assessment_log`\n* `request_log_table_name`: `catalog.schema.example_evaluation_set_request_log`\n5. Run all cells in the notebook to display the metrics computed from the evaluation set. \n![results](https:\/\/docs.databricks.com\/_images\/eval-notebook.png)\n\n### Run offline evaluation with a `\ud83d\udcd6 Evaluation Set`\n#### Follow the next tutorial!\n\n[Collect feedback from \ud83e\udde0 Expert Users](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/4-collect-feedback.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/3-run-offline-eval.html"} +{"content":"# What is Delta Lake?\n### When to partition tables on Databricks\n\nThis article provides an overview of how you can partition tables on Databricks and specific recommendations around when you should use partitioning for tables backed by Delta Lake. Because of built-in features and optimizations, most tables with less than 1 TB of data do not require partitions. \nDatabricks uses Delta Lake for all tables by default. The following recommendations assume you are working with Delta Lake for all tables. \nIn Databricks Runtime 11.3 LTS and above, Databricks automatically clusters data in unpartitioned tables by ingestion time. 
See [Use ingestion time clustering](https:\/\/docs.databricks.com\/tables\/partitions.html#ingestion-time-clustering).\n\n### When to partition tables on Databricks\n#### Do small tables need to be partitioned?\n\nDatabricks recommends you do not partition tables that contain less than a terabyte of data.\n\n### When to partition tables on Databricks\n#### What is the minimum size for each partition in a table?\n\nDatabricks recommends all partitions contain at least a gigabyte of data. Tables with fewer, larger partitions tend to outperform tables with many smaller partitions.\n\n### When to partition tables on Databricks\n#### Use ingestion time clustering\n\nBy using Delta Lake and Databricks Runtime 11.3 LTS or above, unpartitioned tables you create benefit automatically from [ingestion time clustering](https:\/\/www.databricks.com\/blog\/2022\/11\/18\/introducing-ingestion-time-clustering-dbr-112.html). Ingestion time clustering provides similar query benefits to partitioning strategies based on datetime fields without any need to optimize or tune your data. \nNote \nTo maintain ingestion time clustering when you perform a large number of modifications using `UPDATE` or `MERGE` statements on a table, Databricks recommends running `OPTIMIZE` with `ZORDER BY` using a column that matches the ingestion order. For instance, this could be a column containing an event timestamp or a creation date.\n\n","doc_uri":"https:\/\/docs.databricks.com\/tables\/partitions.html"} +{"content":"# What is Delta Lake?\n### When to partition tables on Databricks\n#### Do Delta Lake and Parquet share partitioning strategies?\n\nDelta Lake uses Parquet as the primary format for storing data, and some Delta tables with partitions specified demonstrate organization similar to Parquet tables stored with Apache Spark. Apache Spark uses Hive-style partitioning when saving data in Parquet format. 
Hive-style partitioning is **not** part of the Delta Lake protocol, and workloads should not rely on this partitioning strategy to interact with Delta tables. \nMany Delta Lake features break assumptions about data layout that might have been transferred from Parquet, Hive, or even earlier Delta Lake protocol versions. You should always interact with data stored in Delta Lake using officially supported clients and APIs.\n\n### When to partition tables on Databricks\n#### How are Delta Lake partitions different from partitions in other data lakes?\n\nWhile Databricks and Delta Lake build upon open source technologies like Apache Spark, Parquet, Hive, and Hadoop, partitioning motivations and strategies useful in these technologies do not generally hold true for Databricks. If you do choose to partition your table, consider the following facts before choosing a strategy: \n* Transactions are not defined by partition boundaries. Delta Lake ensures [ACID](https:\/\/docs.databricks.com\/lakehouse\/acid.html) through transaction logs, so you do not need to separate a batch of data by a partition to ensure atomic discovery.\n* Databricks compute clusters do not have data locality tied to physical media. Data ingested into the lakehouse is stored in cloud object storage. While data is cached to local disk storage during data processing, Databricks uses file-based statistics to identify the minimal amount of data for parallel loading.\n\n","doc_uri":"https:\/\/docs.databricks.com\/tables\/partitions.html"} +{"content":"# What is Delta Lake?\n### When to partition tables on Databricks\n#### How do Z-order and partitions work together?\n\nYou can use [Z-order](https:\/\/docs.databricks.com\/delta\/data-skipping.html#delta-zorder) indexes alongside partitions to speed up queries on large datasets. 
\nNote \nMost tables can leverage [ingestion time clustering](https:\/\/docs.databricks.com\/tables\/partitions.html#ingestion-time-clustering) to avoid needing to worry about Z-order and partition tuning. \nThe following rules are important to keep in mind while planning a query optimization strategy based on partition boundaries and Z-order: \n* Z-order works in tandem with the `OPTIMIZE` command. You cannot combine files across partition boundaries, and so Z-order clustering can only occur within a partition. For unpartitioned tables, files can be combined across the entire table.\n* Partitioning works well only for low or known cardinality fields (for example, date fields or physical locations), but not for fields with high cardinality such as timestamps. Z-order works for all fields, including high cardinality fields and fields that may grow infinitely (for example, timestamps or the customer ID in a transactions or orders table).\n* You cannot Z-order on fields used for partitioning.\n\n### When to partition tables on Databricks\n#### If partitions are so bad, why do some Databricks features use them?\n\nPartitions can be beneficial, especially for very large tables. Many performance enhancements around partitioning focus on very large tables (hundreds of terabytes or greater). \nMany customers migrate to Delta Lake from Parquet-based data lakes. The `CONVERT TO DELTA` statement allows you to convert an existing Parquet-based table to a Delta table without rewriting existing data. As such, many customers have large tables that inherit previous partitioning strategies. Some optimizations developed by Databricks seek to leverage these partitions when possible, mitigating some potential downsides for partitioning strategies not optimized for Delta Lake. \nDelta Lake and Apache Spark are open-source technologies. 
While Databricks continues to introduce features that reduce reliance on partitioning, the open source community might continue to build new features that add complexity.\n\n","doc_uri":"https:\/\/docs.databricks.com\/tables\/partitions.html"} +{"content":"# What is Delta Lake?\n### When to partition tables on Databricks\n#### Is it possible to outperform Databricks built-in optimizations with custom partitioning?\n\nSome experienced users of Apache Spark and Delta Lake might be able to design and implement a pattern that provides better performance than [ingestion time clustering](https:\/\/docs.databricks.com\/tables\/partitions.html#ingestion-time-clustering). Implementing a bad partitioning strategy can have very negative repercussions on downstream performance and might require a full rewrite of data to fix. Databricks recommends that most users use default settings to avoid introducing expensive inefficiencies.\n\n","doc_uri":"https:\/\/docs.databricks.com\/tables\/partitions.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n##### Use scheduler pools for multiple streaming workloads\n\nTo enable multiple streaming queries to execute jobs concurrently on a shared cluster, you can configure queries to execute in separate scheduler pools.\n\n##### Use scheduler pools for multiple streaming workloads\n###### How do scheduler pools work?\n\nBy default, all queries started in a notebook run in the same [fair scheduling pool](https:\/\/spark.apache.org\/docs\/latest\/job-scheduling.html#scheduling-within-an-application). Jobs generated by triggers from all of the streaming queries in a notebook run one after another in first in, first out (FIFO) order. This can cause unnecessary delays in the queries, because they are not efficiently sharing the cluster resources. \nScheduler pools allow you to declare which Structured Streaming queries share compute resources. 
\nThe following example assigns `query1` to a dedicated pool, while `query2` and `query3` share a scheduler pool. \n```\n# Run streaming query1 in scheduler pool1\nspark.sparkContext.setLocalProperty(\"spark.scheduler.pool\", \"pool1\")\ndf.writeStream.queryName(\"query1\").format(\"delta\").start(path1)\n\n# Run streaming query2 in scheduler pool2\nspark.sparkContext.setLocalProperty(\"spark.scheduler.pool\", \"pool2\")\ndf.writeStream.queryName(\"query2\").format(\"delta\").start(path2)\n\n# Run streaming query3 in scheduler pool2\nspark.sparkContext.setLocalProperty(\"spark.scheduler.pool\", \"pool2\")\ndf.writeStream.queryName(\"query3\").format(\"delta\").start(path3)\n\n``` \nNote \nThe local property configuration must be in the same notebook cell where you start your streaming query. \nSee [Apache fair scheduler documentation](https:\/\/spark.apache.org\/docs\/latest\/job-scheduling.html#fair-scheduler-pools) for more details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/scheduler-pools.html"} +{"content":"# \n### Matplotlib\n\nDatabricks Runtime displays Matplotlib figures inline.\n\n### Matplotlib\n#### Notebook example: Matplotlib\n\nThe following notebook shows how to display [Matplotlib](https:\/\/matplotlib.org\/) figures in Python notebooks. \n### Matplotlib Python notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/matplotlib.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n### Matplotlib\n#### Render images at higher resolution\n\nYou can render matplotlib images in Python notebooks at double the standard resolution, providing users of high-resolution screens with a better visualization experience. 
Set one of the following in a notebook cell: \n`retina` option: \n```\n%config InlineBackend.figure_format = 'retina'\n\nfrom IPython.display import set_matplotlib_formats\nset_matplotlib_formats('retina')\n\n``` \n`png2x` option: \n```\n%config InlineBackend.figure_format = 'png2x'\n\nfrom IPython.display import set_matplotlib_formats\nset_matplotlib_formats('png2x')\n\n``` \nTo switch back to standard resolution, add the following to a notebook cell: \n```\nset_matplotlib_formats('png')\n\n%config InlineBackend.figure_format = 'png'\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/matplotlib.html"} +{"content":"# What is Delta Lake?\n## Best practices: Delta Lake\n#### Delta Lake limitations on S3\n\nThis article details some of the limitations you might encounter while working with data stored in S3 with Delta Lake on Databricks. The eventually consistent model used in Amazon S3 can lead to potential problems when multiple systems or clusters modify data in the same table simultaneously. \nDatabricks and Delta Lake support multi-cluster writes by default, meaning that queries writing to a table from multiple clusters at the same time won\u2019t corrupt the table. For Delta tables stored on S3, this guarantee is limited to a single Databricks workspace. \nWarning \nTo avoid potential data corruption and data loss issues, Databricks recommends you do not modify the same Delta table stored in S3 from different workspaces.\n\n#### Delta Lake limitations on S3\n##### Bucket versioning and Delta Lake\n\nYou can use S3 bucket versioning to provide additional redundancy for data stored with Delta Lake. Databricks recommends implementing a lifecycle management policy for all S3 buckets with versioning enabled. Databricks recommends retaining three versions. 
\nImportant \nIf you encounter performance slowdowns on tables stored in buckets with versioning enabled, indicate that bucket versioning is enabled when contacting Databricks support.\n\n#### Delta Lake limitations on S3\n##### What are the limitations of multi-cluster writes on S3?\n\nThe following features are not supported when multi-cluster writes are enabled: \n* [Server-Side Encryption with Customer-Provided Encryption Keys](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/dev\/ServerSideEncryptionCustomerKeys.html)\n* S3 paths with credentials in a cluster that cannot access [AWS Security Token Service](https:\/\/docs.aws.amazon.com\/STS\/latest\/APIReference\/Welcome.html) \nYou can disable multi-cluster writes by setting `spark.databricks.delta.multiClusterWrites.enabled` to `false`. If they are disabled, writes to a single table *must* originate from a single cluster. \nWarning \nDisabling `spark.databricks.delta.multiClusterWrites.enabled` and modifying the same Delta table from *multiple* clusters concurrently can lead to data loss or data corruption.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/s3-limitations.html"} +{"content":"# What is Delta Lake?\n## Best practices: Delta Lake\n#### Delta Lake limitations on S3\n##### Why is Delta Lake data I deleted still stored in S3?\n\nIf you are using Delta Lake and you have enabled bucket versioning on the S3 bucket, you have two entities managing table files. Databricks recommends disabling bucket versioning so that the `VACUUM` command can effectively remove unused data files.\n\n#### Delta Lake limitations on S3\n##### Why does a table show old data after I delete Delta Lake files with `rm -rf` and create a new table in the same location?\n\nDeletes on S3 are only eventually consistent. As a result, after you delete a table, old versions of the transaction log might still be visible for a while. To avoid this, do not reuse a table path after deleting it. 
Instead we recommend that you use transactional mechanisms like `DELETE FROM`, `overwrite`, and `overwriteSchema` to delete and update tables. See [Best practice to replace a table](https:\/\/docs.databricks.com\/delta\/best-practices.html#delta-replace-table).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/s3-limitations.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deploy models for batch inference and prediction\n### Deep learning model inference workflow\n##### Model inference using TensorFlow Keras API\n\nThe following notebook demonstrates the Databricks recommended [deep learning inference workflow](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/dl-model-inference.html). This example illustrates model inference using a ResNet-50 model trained with TensorFlow Keras API and Parquet files as input data. \nTo understand the example, be familiar with [Spark data sources](https:\/\/docs.databricks.com\/query\/formats\/index.html).\n\n##### Model inference using TensorFlow Keras API\n###### Model inference TensorFlow Keras API notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/keras-metadata.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-inference\/resnet-model-inference-keras.html"} +{"content":"# Databricks data engineering\n## What is DBFS?\n#### Recommendations for working with DBFS root\n\nDatabricks uses the DBFS root directory as a [default location](https:\/\/docs.databricks.com\/dbfs\/root-locations.html) for some workspace actions. Databricks recommends against storing any production data or sensitive information in the DBFS root. This article focuses on recommendations to avoid accidental exposure of sensitive data on the DBFS root. 
\nNote \nDatabricks configures a separate private storage location for persisting data and configurations in customer-owned cloud storage, known as the internal DBFS. This location is not exposed to users.\n\n#### Recommendations for working with DBFS root\n##### Educate users not to store data on DBFS root\n\nBecause the DBFS root is accessible to all users in a workspace, all users can access any data stored here. It is important to instruct users to avoid using this location for storing sensitive data. The default location for managed tables in the Hive metastore on Databricks is the DBFS root; to prevent end users who create managed tables from writing to the DBFS root, declare a location on external storage when creating databases in the Hive metastore. \nUnity Catalog managed tables use a secure storage location by default. Databricks recommends using Unity Catalog for managed tables.\n\n#### Recommendations for working with DBFS root\n##### Use audit logging to monitor activity\n\nNote \nFor details about DBFS audit events, see [DBFS events](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html#dbfs). \nDatabricks recommends that you [enable S3 object-level logging](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/enable-cloudtrail-logging-for-s3.html) for your DBFS root bucket to allow faster investigation of issues. Be aware that enabling S3 object-level logging can increase your AWS usage cost.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dbfs\/dbfs-root.html"} +{"content":"# Databricks data engineering\n## What is DBFS?\n#### Recommendations for working with DBFS root\n##### Encrypt DBFS root data with a customer-managed key\n\nYou can encrypt DBFS root data with a customer-managed key. 
See [Customer-managed keys for encryption](https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dbfs\/dbfs-root.html"} +{"content":"# Technology partners\n### Databricks sign-on from partner solutions\n\nDatabricks supports integrations with your favorite BI tools, including Power BI and Tableau. Some of these partner applications are enabled in your account by default as **published OAuth application integrations**. OAuth application integrations allow users to access Databricks from partner applications using single sign-on (SSO). Account admins can disable published partner OAuth applications for their account. See [Disable dbt Core, Power BI, Tableau Desktop, or Tableau Cloud OAuth application using the CLI](https:\/\/docs.databricks.com\/integrations\/enable-disable-oauth.html#disable-oauth-published-app). \nPublished partner OAuth applications include dbt Core, Power BI, Tableau Desktop, and Tableau Cloud. Some other partner OAuth applications (for example, Tableau Server) require additional configuration and must be manually enabled as **custom OAuth application integrations** by an account admin. \n* For generic custom OAuth application creation steps (CLI), see [Enable custom OAuth applications using the CLI](https:\/\/docs.databricks.com\/integrations\/enable-disable-oauth.html#enable-custom-app-cli).\n* For generic custom OAuth application creation steps (UI), see [Enable custom OAuth applications using the Databricks UI](https:\/\/docs.databricks.com\/integrations\/enable-disable-oauth.html#enable-custom-app-ui).\n* For custom OAuth application creation steps (CLI) that are specific to Tableau Server, see [Configure Databricks sign-on from Tableau Server](https:\/\/docs.databricks.com\/integrations\/configure-oauth-tableau.html). 
\nNote \nEnabling, disabling, or updating an OAuth application can take 30 minutes to process.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/configuration.html"} +{"content":"# Technology partners\n### Databricks sign-on from partner solutions\n#### In this section\n\n* [Configure Databricks sign-on from Tableau Server](https:\/\/docs.databricks.com\/integrations\/configure-oauth-tableau.html)\n* [Enable or disable partner OAuth applications](https:\/\/docs.databricks.com\/integrations\/enable-disable-oauth.html)\n* [Override partner OAuth token lifetime policy](https:\/\/docs.databricks.com\/integrations\/manage-oauth.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/configuration.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### Where does Databricks write data?\n\nThis article details the locations where Databricks writes data for common operations and configurations. Because Databricks provides a suite of tools that span many technologies and interact with cloud resources in a shared-responsibility model, the default locations used to store data vary based on the execution environment, configurations, and libraries. \nThe information in this article is meant to help you understand default paths for various operations and how configurations might alter these defaults. Data stewards and administrators looking for guidance on configuring and controlling access to data should see [Data governance with Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/index.html). \nTo learn about configuring object storage and other data sources, see [Connect to data sources](https:\/\/docs.databricks.com\/connect\/index.html).\n\n#### Where does Databricks write data?\n##### What is object storage?\n\nIn cloud computing, object storage or blob storage refers to storage containers that maintain data as objects, with each object consisting of data, metadata, and a globally unique resource identifier (URI). 
Data manipulation operations in object storage are often limited to create, read, update, and delete (CRUD) through a REST API. Some object storage offerings include features like versioning and lifecycle management. Object storage has the following benefits: \n* High availability, durability, and reliability.\n* Lower cost for storage compared to most other storage options.\n* Virtually unlimited scalability (limited only by the total amount of storage available in a given region of the cloud). \nMost cloud-based data lakes are built on top of open source data formats in cloud object storage.\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/write-data.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### Where does Databricks write data?\n##### How does Databricks use object storage?\n\nObject storage is the main form of storage used by Databricks for most operations. The Databricks Filesystem ([DBFS](https:\/\/docs.databricks.com\/dbfs\/index.html)) allows Databricks users to interact with files in object storage similar to how they would in any other file system. Unless you specifically configure a table against an external data system, all tables created in Databricks store data in cloud object storage. \nDelta Lake files stored in cloud object storage provide the data foundation for the Databricks lakehouse.\n\n#### Where does Databricks write data?\n##### What is block storage?\n\nIn cloud computing, block storage or disk storage refers to storage volumes that correspond to traditional hard disk drives (HDDs) or solid state drives (SSDs), also known simply as \u201chard drives\u201d. When deploying block storage in a cloud computing environment, typically a logical partition of one or more physical drives is deployed. 
Implementations vary slightly between product offerings and cloud vendors, but the following characteristics are typically found across implementations: \n* All virtual machines (VMs) require an attached block storage volume.\n* Files and programs installed to a block storage volume persist as long as the block storage volume persists.\n* Block storage volumes are often used for temporary data storage.\n* Block storage volumes attached to VMs are usually deleted alongside VMs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/write-data.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### Where does Databricks write data?\n##### How does Databricks use block storage?\n\nWhen you turn on compute resources, Databricks configures and deploys VMs and attaches block storage volumes. This block storage is used for storing ephemeral data files for the lifetime of the compute. These files include the operating system and installed libraries, in addition to data used by the [disk cache](https:\/\/docs.databricks.com\/optimizations\/disk-cache.html). While Apache Spark uses block storage in the background for efficient parallelization and data loading, most code run on Databricks does not directly save or load data to block storage. \nYou can run arbitrary code such as Python or Bash commands that use the block storage attached to your driver node. See [Work with files in ephemeral storage attached to the driver node](https:\/\/docs.databricks.com\/files\/index.html#driver).\n\n#### Where does Databricks write data?\n##### Where does Unity Catalog store data files?\n\nUnity Catalog relies on administrators to configure relationships between cloud storage and relational objects. The exact location where data resides depends on how administrators have configured relations. 
\nData written or uploaded to objects governed by Unity Catalog is stored in one of the following locations: \n* A managed storage location associated with a metastore, catalog, or schema. Data written or uploaded to managed tables and managed volumes use managed storage. See [Managed storage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#managed-storage).\n* An external location configured with storage credentials. Data written or uploaded to external tables and external volumes use external storage. See [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/write-data.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### Where does Databricks write data?\n##### Where does Databricks SQL store data backing tables?\n\nWhen you run a `CREATE TABLE` statement with Databricks SQL configured with Unity Catalog, the default behavior is to store data files in a managed storage location configured with Unity Catalog. See [Where does Unity Catalog store data files?](https:\/\/docs.databricks.com\/files\/write-data.html#uc). \nThe legacy `hive_metastore` catalog follows different rules. See [Work with Unity Catalog and the legacy Hive metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/hive-metastore.html).\n\n#### Where does Databricks write data?\n##### Where does Delta Live Tables store data files?\n\nDatabricks recommends using Unity Catalog when creating DLT pipelines. Data is stored in directories within the managed storage location associated with the target schema. \nYou can optionally configure DLT pipelines using Hive metastore. When configured with Hive metastore, you can specify a storage location on DBFS or cloud object storage. 
If you do not specify a location, a location on the DBFS root is assigned to your pipeline.\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/write-data.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### Where does Databricks write data?\n##### Where does Apache Spark write data files?\n\nDatabricks recommends using object names with Unity Catalog for reading and writing data. You can also write files to Unity Catalog volumes using the following pattern: `\/Volumes\/<catalog>\/<schema>\/<volume>\/<path>\/<file-name>`. You must have sufficient privileges to upload, create, update, or insert data into Unity Catalog-governed objects. \nYou can optionally use uniform resource identifiers (URIs) to specify paths to data files. URIs vary depending on the cloud provider. You must also have write permissions configured for your current compute to write to cloud object storage. \nDatabricks uses the Databricks Filesystem to map Apache Spark read and write commands back to cloud object storage. Each Databricks workspace comes with a DBFS root storage location configured in the cloud account allocated for the workspace, which all users can access for reading and writing data. Databricks does not recommend using the DBFS root for storing any production data. See [What is DBFS?](https:\/\/docs.databricks.com\/dbfs\/index.html) and [Recommendations for working with DBFS root](https:\/\/docs.databricks.com\/dbfs\/dbfs-root.html).\n\n#### Where does Databricks write data?\n##### Where does pandas write data files on Databricks?\n\nIn Databricks Runtime 14.0 and above, the default current working directory (CWD) for all local Python read and write operations is the directory containing the notebook. If you provide only a filename when saving a data file, pandas saves that data file as a workspace file parallel to your currently running notebook. 
\nNot all Databricks Runtime versions support workspace files, and some Databricks Runtime versions have differing behavior depending on whether you use notebooks or Git folders. See [What is the default current working directory?](https:\/\/docs.databricks.com\/files\/cwd-dbr-14.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/write-data.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### Where does Databricks write data?\n##### Where should I write temporary files on Databricks?\n\nIf you must write temporary files that you do not want to keep after the cluster is shut down, writing the temporary files to `$TEMPDIR` yields better performance than writing to the current working directory (CWD) if the CWD is in the workspace filesystem. Writing to `$TEMPDIR` also avoids exceeding branch size limits if the code runs in a Repo. For more information, see [File and repo size limits](https:\/\/docs.databricks.com\/repos\/limits.html#file-and-repo-size-limits). \nWrite to `\/local_disk0` if the amount of data to be written is very large and you want the storage to autoscale.\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/write-data.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n##### Configure RocksDB state store on Databricks\n\nYou can enable RocksDB-based state management by setting the following configuration in the SparkSession before starting the streaming query. \n```\nspark.conf.set(\n\"spark.sql.streaming.stateStore.providerClass\",\n\"com.databricks.sql.streaming.state.RocksDBStateStoreProvider\")\n\n``` \nYou can enable RocksDB on Delta Live Tables pipelines. 
See [Enable RocksDB state store for Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#rocksdb).\n\n##### Configure RocksDB state store on Databricks\n###### Enable changelog checkpointing\n\nIn Databricks Runtime 13.3 LTS and above, you can enable changelog checkpointing to lower checkpoint duration and end-to-end latency for Structured Streaming workloads. Databricks recommends enabling changelog checkpointing for all Structured Streaming stateful queries. \nTraditionally, the RocksDB state store snapshots and uploads data files during checkpointing. To avoid this cost, changelog checkpointing only writes records that have changed since the last checkpoint to durable storage. \nChangelog checkpointing is disabled by default. You can enable changelog checkpointing at the SparkSession level using the following syntax: \n```\nspark.conf.set(\n\"spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled\", \"true\")\n\n``` \nYou can enable changelog checkpointing on an existing stream and maintain the state information stored in the checkpoint. \nImportant \nQueries that have enabled changelog checkpointing can only be run on Databricks Runtime 13.3 LTS and above. You can disable changelog checkpointing to revert to legacy checkpointing behavior, but you must continue to run these queries on Databricks Runtime 13.3 LTS or above. You must restart the job for these changes to take effect.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/rocksdb-state-store.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n### Production considerations for Structured Streaming\n##### Configure RocksDB state store on Databricks\n###### RocksDB state store metrics\n\nEach state operator collects metrics related to the state management operations performed on its RocksDB instance to observe the state store and potentially help in debugging job slowness. 
These metrics are aggregated (sum) per state operator in the job across all tasks where the state operator is running. These metrics are part of the `customMetrics` map inside the `stateOperators` fields in `StreamingQueryProgress`. The following is an example of `StreamingQueryProgress` in JSON form (obtained using `StreamingQueryProgress.json()`). \n```\n{\n\"id\" : \"6774075e-8869-454b-ad51-513be86cfd43\",\n\"runId\" : \"3d08104d-d1d4-4d1a-b21e-0b2e1fb871c5\",\n\"batchId\" : 7,\n\"stateOperators\" : [ {\n\"numRowsTotal\" : 20000000,\n\"numRowsUpdated\" : 20000000,\n\"memoryUsedBytes\" : 31005397,\n\"numRowsDroppedByWatermark\" : 0,\n\"customMetrics\" : {\n\"rocksdbBytesCopied\" : 141037747,\n\"rocksdbCommitCheckpointLatency\" : 2,\n\"rocksdbCommitCompactLatency\" : 22061,\n\"rocksdbCommitFileSyncLatencyMs\" : 1710,\n\"rocksdbCommitFlushLatency\" : 19032,\n\"rocksdbCommitPauseLatency\" : 0,\n\"rocksdbCommitWriteBatchLatency\" : 56155,\n\"rocksdbFilesCopied\" : 2,\n\"rocksdbFilesReused\" : 0,\n\"rocksdbGetCount\" : 40000000,\n\"rocksdbGetLatency\" : 21834,\n\"rocksdbPutCount\" : 1,\n\"rocksdbPutLatency\" : 56155599000,\n\"rocksdbReadBlockCacheHitCount\" : 1988,\n\"rocksdbReadBlockCacheMissCount\" : 40341617,\n\"rocksdbSstFileSize\" : 141037747,\n\"rocksdbTotalBytesReadByCompaction\" : 336853375,\n\"rocksdbTotalBytesReadByGet\" : 680000000,\n\"rocksdbTotalBytesReadThroughIterator\" : 0,\n\"rocksdbTotalBytesWrittenByCompaction\" : 141037747,\n\"rocksdbTotalBytesWrittenByPut\" : 740000012,\n\"rocksdbTotalCompactionLatencyMs\" : 21949695000,\n\"rocksdbWriterStallLatencyMs\" : 0,\n\"rocksdbZipFileBytesUncompressed\" : 7038\n}\n} ],\n\"sources\" : [ {\n} ],\n\"sink\" : {\n}\n}\n\n``` \nDetailed descriptions of the metrics are as follows: \n| Metric name | Description |\n| --- | --- |\n| rocksdbCommitWriteBatchLatency | Time (in millis) taken to apply the staged writes in the in-memory structure (WriteBatch) to native RocksDB. 
|\n| rocksdbCommitFlushLatency | Time (in millis) taken to flush the RocksDB in-memory changes to local disk. |\n| rocksdbCommitCompactLatency | Time (in millis) taken for compaction (optional) during the checkpoint commit. |\n| rocksdbCommitPauseLatency | Time (in millis) taken to stop the background worker threads (for compaction etc.) as part of the checkpoint commit. |\n| rocksdbCommitCheckpointLatency | Time (in millis) taken to take a snapshot of native RocksDB and write it to a local directory. |\n| rocksdbCommitFileSyncLatencyMs | Time (in millis) taken to sync the native RocksDB snapshot related files to external storage (checkpoint location). |\n| rocksdbGetLatency | Average time (in nanos) taken per underlying native `RocksDB::Get` call. |\n| rocksdbPutLatency | Average time (in nanos) taken per underlying native `RocksDB::Put` call. |\n| rocksdbGetCount | Number of native `RocksDB::Get` calls (doesn\u2019t include `Gets` from WriteBatch - in memory batch used for staging writes). |\n| rocksdbPutCount | Number of native `RocksDB::Put` calls (doesn\u2019t include `Puts` to WriteBatch - in memory batch used for staging writes). |\n| rocksdbTotalBytesReadByGet | Number of uncompressed bytes read through native `RocksDB::Get` calls. |\n| rocksdbTotalBytesWrittenByPut | Number of uncompressed bytes written through native `RocksDB::Put` calls. |\n| rocksdbReadBlockCacheHitCount | Number of times the native RocksDB block cache is used to avoid reading data from local disk. |\n| rocksdbReadBlockCacheMissCount | Number of times the native RocksDB block cache missed and required reading data from local disk. |\n| rocksdbTotalBytesReadByCompaction | Number of bytes read from the local disk by the native RocksDB compaction process. |\n| rocksdbTotalBytesWrittenByCompaction | Number of bytes written to the local disk by the native RocksDB compaction process. 
|\n| rocksdbTotalCompactionLatencyMs | Time (in millis) taken for RocksDB compactions (both background and the optional compaction initiated during the commit). |\n| rocksdbWriterStallLatencyMs | Time (in millis) the writer has stalled due to a background compaction or flushing of the memtables to disk. |\n| rocksdbTotalBytesReadThroughIterator | Total size of uncompressed data read using an iterator. Some stateful operations (such as timeout processing in `flatMapGroupsWithState` or watermarking in windowed aggregations) require reading all of the data in the DB through an iterator. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/rocksdb-state-store.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Configure route optimization on serving endpoints\n\nThis article describes how to configure route optimization on your [model serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) or [feature serving](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html) endpoints and how to query them. Route optimized serving endpoints dramatically lower overhead latency and allow for substantial improvements in the throughput supported by your endpoint. 
\nRoute optimization is recommended for high-throughput or latency-sensitive workloads.\n\n#### Configure route optimization on serving endpoints\n##### Requirements\n\n* For route optimization on a model serving endpoint, see [Requirements](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html#requirement).\n* For route optimization on a feature serving endpoint, see [Requirements](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html#requirement).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/route-optimization.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Configure route optimization on serving endpoints\n##### Enable route optimization on a model serving endpoint\n\nSpecify the `route_optimized` parameter during model serving endpoint creation to configure your endpoint for route optimization. You can only specify this parameter during endpoint creation; you cannot update existing endpoints to be route optimized. \n```\nPOST \/api\/2.0\/serving-endpoints\n\n{\n\"name\": \"my-endpoint\",\n\"config\":{\n\"served_entities\": [{\n\"entity_name\": \"ads1\",\n\"entity_version\": \"1\",\n\"workload_type\": \"CPU\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": true\n}]\n},\n\"route_optimized\": true\n}\n\n``` \nIf you prefer to use Python, you can create a route optimized serving endpoint using the following notebook. 
\n### Create a route optimized serving endpoint using Python notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/create-route-optimized-serving-endpoint.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/route-optimization.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Configure route optimization on serving endpoints\n##### Enable route optimization on a feature serving endpoint\n\nTo use route optimization for Feature and Function Serving, specify the full name of the feature specification in the `entity_name` field for serving endpoint creation requests. The `entity_version` is not needed for `FeatureSpecs`. \n```\nPOST \/api\/2.0\/serving-endpoints\n\n{\n\"name\": \"my-endpoint\",\n\"config\":{\n\"served_entities\": [{\n\"entity_name\": \"catalog_name.schema_name.feature_spec_name\",\n\"workload_type\": \"CPU\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": true\n}]\n},\n\"route_optimized\": true\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/route-optimization.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Configure route optimization on serving endpoints\n##### Query route optimized model serving endpoints\n\nThe following steps show how to run a test query against a route optimized model serving endpoint. \nFor production use, like using your route optimized endpoint in an application, you must create an OAuth token. To fetch an OAuth token programmatically, you can follow the guidance in [OAuth machine-to-machine (M2M) authentication](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \n1. Fetch an OAuth token from the **Serving** UI of your workspace. \n1. Click **Serving** in the sidebar to display the Serving UI.\n2. 
On the Serving endpoints page, select your route optimized endpoint to see endpoint details.\n3. On the endpoint details page, click the **Query endpoint** button.\n4. Select the **Fetch Token** tab.\n5. Click the **Fetch OAuth Token** button. This token is valid for 1 hour. Fetch a new token if your current token expires.\n2. Get your model serving endpoint URL from the endpoint details page in the **Serving** UI.\n3. Use the OAuth token from step 1 and the endpoint URL from step 2 to populate the following example code that queries the route optimized endpoint. \n```\nurl=\"your-endpoint-url\"\nOAUTH_TOKEN=xxxxxxx\n\ncurl -X POST -H 'Content-Type: application\/json' -H \"Authorization: Bearer $OAUTH_TOKEN\" -d @data.json \"$url\"\n\n``` \nFor a Python SDK to query a route optimized endpoint, reach out to your Databricks account team.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/route-optimization.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Configure route optimization on serving endpoints\n##### Limitations\n\n* OAuth tokens are the only supported authentication method for route optimization. Personal access tokens are not supported.\n* Route optimization does not enforce any network restrictions you might have configured in your Databricks workspace such as IP access control lists or [PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html). Do not enable route optimization if you require that model serving traffic be bound by those controls. 
If you have such network requirements and still want to try route-optimized model serving, reach out to your Databricks account team.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/route-optimization.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on MySQL\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes how to set up Lakehouse Federation to run federated queries on MySQL data that is not managed by Databricks. To learn more about Lakehouse Federation, see [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). \nTo connect to your MySQL database using Lakehouse Federation, you must create the following in your Databricks Unity Catalog metastore: \n* A *connection* to your MySQL database.\n* A *foreign catalog* that mirrors your MySQL database in Unity Catalog so that you can use Unity Catalog query syntax and data governance tools to manage Databricks user access to the database.\n\n#### Run federated queries on MySQL\n##### Before you begin\n\nWorkspace requirements: \n* Workspace enabled for Unity Catalog. \nCompute requirements: \n* Network connectivity from your Databricks Runtime cluster or SQL warehouse to the target database systems. See [Networking recommendations for Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/networking.html).\n* Databricks clusters must use Databricks Runtime 13.3 LTS or above and shared or single-user access mode.\n* SQL warehouses must be Pro or Serverless. 
\nPermissions required: \n* To create a connection, you must be a metastore admin or a user with the `CREATE CONNECTION` privilege on the Unity Catalog metastore attached to the workspace.\n* To create a foreign catalog, you must have the `CREATE CATALOG` permission on the metastore and be either the owner of the connection or have the `CREATE FOREIGN CATALOG` privilege on the connection. \nAdditional permission requirements are specified in each task-based section that follows.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/mysql.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on MySQL\n##### Create a connection\n\nA connection specifies a path and credentials for accessing an external database system. To create a connection, you can use Catalog Explorer or the `CREATE CONNECTION` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** Metastore admin or user with the `CREATE CONNECTION` privilege. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **External Data** menu and select **Connections**.\n3. Click **Create connection**.\n4. Enter a user-friendly **Connection name**.\n5. Select a **Connection type** of MySQL.\n6. Enter the following connection properties for your MySQL instance. \n* **Host**: For example, `mysql-demo.lb123.us-west-2.rds.amazonaws.com`\n* **Port**: For example, `3306`\n* **User**: For example, `mysql_user`\n* **Password**: For example, `password123`\n7. (Optional) Click **Test connection** to confirm that it works.\n8. (Optional) Add a comment.\n9. Click **Create**. \nRun the following command in a notebook or the Databricks SQL query editor. 
\n```\nCREATE CONNECTION <connection-name> TYPE mysql\nOPTIONS (\nhost '<hostname>',\nport '<port>',\nuser '<user>',\npassword '<password>'\n);\n\n``` \nWe recommend that you use Databricks [secrets](https:\/\/docs.databricks.com\/security\/secrets\/index.html) instead of plaintext strings for sensitive values like credentials. For example: \n```\nCREATE CONNECTION <connection-name> TYPE mysql\nOPTIONS (\nhost '<hostname>',\nport '<port>',\nuser secret ('<secret-scope>','<secret-key-user>'),\npassword secret ('<secret-scope>','<secret-key-password>')\n)\n\n``` \nIf you must use plaintext strings in notebook SQL commands, avoid truncating the string by escaping special characters like `$` with `\\`. For example: `\\$`. \nFor information about setting up secrets, see [Secret management](https:\/\/docs.databricks.com\/security\/secrets\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/mysql.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on MySQL\n##### Create a foreign catalog\n\nA foreign catalog mirrors a database in an external data system so that you can query and manage access to data in that database using Databricks and Unity Catalog. To create a foreign catalog, you use a connection to the data source that has already been defined. \nTo create a foreign catalog, you can use Catalog Explorer or the `CREATE FOREIGN CATALOG` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** `CREATE CATALOG` permission on the metastore and either ownership of the connection or the `CREATE FOREIGN CATALOG` privilege on the connection. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Click the **Create Catalog** button.\n3. On the **Create a new catalog** dialog, enter a name for the catalog and select a **Type** of **Foreign**.\n4. 
Select the **Connection** that provides access to the database that you want to mirror as a Unity Catalog catalog.\n5. Click **Create**. \nRun the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional. Replace the placeholder values: \n* `<catalog-name>`: Name for the catalog in Databricks.\n* `<connection-name>`: The [connection object](https:\/\/docs.databricks.com\/query-federation\/mysql.html#connection) that specifies the data source, path, and access credentials. \n```\nCREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/mysql.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on MySQL\n##### Supported pushdowns\n\nThe following pushdowns are supported on all compute: \n* Filters\n* Projections\n* Limit\n* Functions: partial, only for filter expressions. (String functions, Mathematical functions, Date, Time, and Timestamp functions, and other miscellaneous functions, such as Alias, Cast, SortOrder) \nThe following pushdowns are supported on Databricks Runtime 13.3 LTS and above, and on SQL warehouses: \n* Aggregates\n* Boolean operators\n* The following mathematical functions (not supported if ANSI is disabled): +, -, \\*, %, \/\n* Sorting, when used with limit \nThe following pushdowns are not supported: \n* Joins\n* Window functions\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/mysql.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on MySQL\n##### Data type mappings\n\nWhen you read from MySQL to Spark, data types map as follows: \n| MySQL type | Spark type |\n| --- | --- |\n| bigint (if not signed), decimal | DecimalType |\n| tinyint\\*, int, integer, mediumint, smallint | IntegerType |\n| bigint (if signed) | LongType |\n| float | FloatType |\n| double | DoubleType |\n| char, 
enum, set | CharType |\n| varchar | VarcharType |\n| json, longtext, mediumtext, text, tinytext | StringType |\n| binary, blob, varbinary, varchar binary | BinaryType |\n| bit, boolean | BooleanType |\n| date, year | DateType |\n| datetime, time, timestamp\\*\\* | TimestampType\/TimestampNTZType | \n\\* `tinyint(1) signed` is treated as a boolean and converted into `BooleanType`. See [Connector\/J Reference](https:\/\/dev.mysql.com\/doc\/connector-j\/en\/connector-j-reference.html).\n\\** When you read from MySQL, MySQL `Timestamp` is mapped to Spark `TimestampType` if `preferTimestampNTZ = false` (default). MySQL `Timestamp` is mapped to `TimestampNTZType` if `preferTimestampNTZ = true`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/mysql.html"} +{"content":"# Query data\n## Data format options\n#### Avro file\n\n[Apache Avro](https:\/\/avro.apache.org\/) is a data serialization system. Avro provides: \n* Rich data structures.\n* A compact, fast, binary data format.\n* A container file, to store persistent data.\n* Remote procedure call (RPC).\n* Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages. \nThe [Avro data source](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-avro.html) supports: \n* Schema conversion: Automatic conversion between Apache Spark SQL and Avro records.\n* Partitioning: Easily reading and writing partitioned data without any extra configuration.\n* Compression: Compression to use when writing Avro out to disk. The supported types are `uncompressed`, `snappy`, and `deflate`. You can also specify the deflate level.\n* Record names: Record name and namespace by passing a map of parameters with `recordName` and `recordNamespace`. 
\nAlso see [Read and write streaming Avro data](https:\/\/docs.databricks.com\/structured-streaming\/avro-dataframe.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/avro.html"} +{"content":"# Query data\n## Data format options\n#### Avro file\n##### Configuration\n\nYou can change the behavior of an Avro data source using various configuration parameters. \nTo ignore files without the `.avro` extension when reading, you can set the parameter `avro.mapred.ignore.inputs.without.extension` in the Hadoop configuration. The default is `false`. \n```\nspark\n.sparkContext\n.hadoopConfiguration\n.set(\"avro.mapred.ignore.inputs.without.extension\", \"true\")\n\n``` \nTo configure compression when writing, set the following Spark properties: \n* Compression codec: `spark.sql.avro.compression.codec`. Supported codecs are `snappy` and `deflate`. The default codec is `snappy`.\n* If the compression codec is `deflate`, you can set the compression level with: `spark.sql.avro.deflate.level`. The default level is `-1`. \nYou can set these properties in the cluster [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration) or at runtime using `spark.conf.set()`. For example: \n```\nspark.conf.set(\"spark.sql.avro.compression.codec\", \"deflate\")\nspark.conf.set(\"spark.sql.avro.deflate.level\", \"5\")\n\n``` \nFor [Databricks Runtime 9.1 LTS](https:\/\/docs.databricks.com\/release-notes\/runtime\/9.1lts.html) and above, you can change the default schema inference behavior in Avro by providing the `mergeSchema` option when reading files. 
Setting `mergeSchema` to `true` will infer a schema from a set of Avro files in the target directory and merge them rather than infer the read schema from a single file.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/avro.html"} +{"content":"# Query data\n## Data format options\n#### Avro file\n##### Supported types for Avro -> Spark SQL conversion\n\nThis library supports reading all Avro types. It uses the following\nmapping from Avro types to Spark SQL types: \n| Avro type | Spark SQL type |\n| --- | --- |\n| boolean | BooleanType |\n| int | IntegerType |\n| long | LongType |\n| float | FloatType |\n| double | DoubleType |\n| bytes | BinaryType |\n| string | StringType |\n| record | StructType |\n| enum | StringType |\n| array | ArrayType |\n| map | MapType |\n| fixed | BinaryType |\n| union | See [Union types](https:\/\/docs.databricks.com\/query\/formats\/avro.html#union-types). | \n### Union types \nThe Avro data source supports reading `union` types. Avro considers the following three types to be `union` types: \n* `union(int, long)` maps to `LongType`.\n* `union(float, double)` maps to `DoubleType`.\n* `union(something, null)`, where `something` is any supported Avro type. This maps to the same Spark SQL type as that of\n`something`, with `nullable` set to `true`. \nAll other `union` types are complex types. They map to\n`StructType` where field names are `member0`, `member1`, and so on, in\naccordance with members of the `union`. This is consistent with the\nbehavior when converting between Avro and Parquet. 
\n### Logical types \nThe Avro data source supports reading the following [Avro logical types](https:\/\/avro.apache.org\/docs\/1.8.2\/spec.html#Logical+Types): \n| Avro logical type | Avro type | Spark SQL type |\n| --- | --- | --- |\n| date | int | DateType |\n| timestamp-millis | long | TimestampType |\n| timestamp-micros | long | TimestampType |\n| decimal | fixed | DecimalType |\n| decimal | bytes | DecimalType | \nNote \nThe Avro data source ignores docs, aliases, and other properties present in the Avro file.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/avro.html"} +{"content":"# Query data\n## Data format options\n#### Avro file\n##### Supported types for Spark SQL -> Avro conversion\n\nThis library supports writing of all Spark SQL types into Avro. For most\ntypes, the mapping from Spark types to Avro types is straightforward\n(for example `IntegerType` gets converted to `int`); the following is a list of the few special cases: \n| Spark SQL type | Avro type | Avro logical type |\n| --- | --- | --- |\n| ByteType | int | |\n| ShortType | int | |\n| BinaryType | bytes | |\n| DecimalType | fixed | decimal |\n| TimestampType | long | timestamp-micros |\n| DateType | int | date | \nYou can also specify the whole output Avro schema with the option `avroSchema`, so that Spark SQL types can be converted into other Avro types.\nThe following conversions are not applied by default and require user specified Avro schema: \n| Spark SQL type | Avro type | Avro logical type |\n| --- | --- | --- |\n| ByteType | fixed | |\n| StringType | enum | |\n| DecimalType | bytes | decimal |\n| TimestampType | long | timestamp-millis |\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/avro.html"} +{"content":"# Query data\n## Data format options\n#### Avro file\n##### Examples\n\nThese examples use the [episodes.avro](https:\/\/docs.databricks.com\/_static\/examples\/episodes.avro) file. 
\n```\n\/\/ The Avro records are converted to Spark types, filtered, and\n\/\/ then written back out as Avro records\n\nval df = spark.read.format(\"avro\").load(\"\/tmp\/episodes.avro\")\ndf.filter(\"doctor > 5\").write.format(\"avro\").save(\"\/tmp\/output\")\n\n``` \nThis example demonstrates a custom Avro schema: \n```\nimport java.io.File\n\nimport org.apache.avro.Schema\n\nval schema = new Schema.Parser().parse(new File(\"episode.avsc\"))\n\nspark\n.read\n.format(\"avro\")\n.option(\"avroSchema\", schema.toString)\n.load(\"\/tmp\/episodes.avro\")\n.show()\n\n``` \nThis example demonstrates Avro compression options: \n```\n\/\/ configuration to use deflate compression\nspark.conf.set(\"spark.sql.avro.compression.codec\", \"deflate\")\nspark.conf.set(\"spark.sql.avro.deflate.level\", \"5\")\n\nval df = spark.read.format(\"avro\").load(\"\/tmp\/episodes.avro\")\n\n\/\/ writes out compressed Avro records\ndf.write.format(\"avro\").save(\"\/tmp\/output\")\n\n``` \nThis example demonstrates partitioned Avro records: \n```\nimport org.apache.spark.sql.SparkSession\n\nval spark = SparkSession.builder().master(\"local\").getOrCreate()\n\nval df = spark.createDataFrame(\nSeq(\n(2012, 8, \"Batman\", 9.8),\n(2012, 8, \"Hero\", 8.7),\n(2012, 7, \"Robot\", 5.5),\n(2011, 7, \"Git\", 2.0))\n).toDF(\"year\", \"month\", \"title\", \"rating\")\n\ndf.toDF.write.format(\"avro\").partitionBy(\"year\", \"month\").save(\"\/tmp\/output\")\n\n``` \nThis example demonstrates the record name and namespace: \n```\nval df = spark.read.format(\"avro\").load(\"\/tmp\/episodes.avro\")\n\nval name = \"AvroTest\"\nval namespace = \"org.foo\"\nval parameters = Map(\"recordName\" -> name, \"recordNamespace\" -> namespace)\n\ndf.write.options(parameters).format(\"avro\").save(\"\/tmp\/output\")\n\n``` \n```\n# Create a DataFrame from a specified directory\ndf = spark.read.format(\"avro\").load(\"\/tmp\/episodes.avro\")\n\n# Saves the subset of the Avro records read in\nsubset = df.where(\"doctor > 
5\")\nsubset.write.format(\"avro\").save(\"\/tmp\/output\")\n\n``` \nTo query Avro data in SQL, register the data file as a table or temporary view: \n```\nCREATE TEMPORARY VIEW episodes\nUSING avro\nOPTIONS (path \"\/tmp\/episodes.avro\");\n\nSELECT * FROM episodes;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/avro.html"} +{"content":"# Query data\n## Data format options\n#### Avro file\n##### Notebook example: Read and write Avro files\n\nThe following notebook demonstrates how to read and write Avro files. \n### Read and write Avro files notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/read-avro-files.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/avro.html"} +{"content":"# Databricks data engineering\n### Streaming on Databricks\n\nYou can use Databricks for near real-time data ingestion, processing, machine learning, and AI for streaming data. \nDatabricks offers numerous optimizations for streaming and incremental processing. For most streaming or incremental data processing or ETL tasks, Databricks recommends Delta Live Tables. See [What is Delta Live Tables?](https:\/\/docs.databricks.com\/delta-live-tables\/index.html). \nMost incremental and streaming workloads on Databricks are powered by Structured Streaming, including Delta Live Tables and Auto Loader. See [What is Auto Loader?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html). \nDelta Lake and Structured Streaming have tight integration to power incremental processing in the Databricks lakehouse. See [Delta table streaming reads and writes](https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html). \nFor real-time model serving, see [Model serving with Databricks](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html). 
\nTo learn more about building streaming solutions on the Databricks platform, see the [data streaming product page](https:\/\/www.databricks.com\/product\/data-streaming). \nDatabricks has specific features for working with semi-structured data fields contained in Avro, protocol buffers, and JSON data payloads. To learn more, see: \n* [Transform Avro payload](https:\/\/docs.databricks.com\/structured-streaming\/avro-dataframe.html)\n* [Protocol buffers](https:\/\/docs.databricks.com\/structured-streaming\/protocol-buffers.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/index.html"} +{"content":"# Databricks data engineering\n### Streaming on Databricks\n#### What is Structured Streaming?\n\nApache Spark Structured Streaming is a near-real time processing engine that offers end-to-end fault tolerance with exactly-once processing guarantees using familiar Spark APIs. Structured Streaming lets you express computation on streaming data in the same way you express a batch computation on static data. The Structured Streaming engine performs the computation incrementally and continuously updates the result as streaming data arrives. \nIf you\u2019re new to Structured Streaming, see [Run your first Structured Streaming workload](https:\/\/docs.databricks.com\/structured-streaming\/tutorial.html). \nFor information about using Structured Streaming with Unity Catalog, see [Using Unity Catalog with Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/unity-catalog.html).\n\n### Streaming on Databricks\n#### What streaming sources and sinks does Databricks support?\n\nDatabricks recommends using Auto Loader to ingest supported file types from cloud object storage into Delta Lake. For ETL pipelines, Databricks recommends using Delta Live Tables (which uses Delta tables and Structured Streaming). You can also configure incremental ETL workloads by streaming to and from Delta Lake tables. 
\nIn addition to Delta Lake and Auto Loader, Structured Streaming can connect to [messaging services](https:\/\/docs.databricks.com\/connect\/streaming\/index.html) such as Apache Kafka. \nYou can also [Use foreachBatch to write to arbitrary data sinks](https:\/\/docs.databricks.com\/structured-streaming\/foreach.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/index.html"} +{"content":"# Databricks data engineering\n### Streaming on Databricks\n#### Additional resources\n\nApache Spark provides a [Structured Streaming Programming Guide](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html) that has more information about Structured Streaming. \nFor reference information about Structured Streaming, Databricks recommends the following Apache Spark API references: \n* [Python](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.ss\/index.html)\n* [Scala](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/streaming\/index.html)\n* [Java](https:\/\/spark.apache.org\/docs\/latest\/api\/java\/org\/apache\/spark\/sql\/streaming\/package-summary.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/index.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Share data using the Delta Sharing open sharing protocol (for providers)\n\nThis article gives an overview of how providers can use the Delta Sharing open sharing protocol to share data from your Unity Catalog-enabled Databricks workspace with any user on any computing platform, anywhere. 
\nNote \nIf you are a data recipient (a user or group of users with whom data is being shared), see instead [Access data shared with you using Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/recipient.html).\n\n### Share data using the Delta Sharing open sharing protocol (for providers)\n#### Who should use the Delta Sharing open sharing protocol?\n\nThere are three ways to share data using Delta Sharing: \n1. **The Databricks open sharing protocol**, covered in this article, lets you share data that you manage in a Unity Catalog-enabled Databricks workspace with users on any computing platform. \nThis approach uses the Delta Sharing server that is built into Databricks and is useful when you manage data using Unity Catalog and want to share it with users who don\u2019t use Databricks or don\u2019t have access to a Unity Catalog-enabled Databricks workspace. The integration with Unity Catalog on the provider side simplifies setup and governance for providers.\n2. **A customer-managed implementation of the open-source Delta Sharing server** lets you share from any platform to any platform, whether Databricks or not. \nSee [github.com\/delta-io\/delta-sharing](https:\/\/github.com\/delta-io\/delta-sharing).\n3. **The Databricks-to-Databricks sharing protocol** lets you share data from your Unity Catalog-enabled workspace with users who also have access to a Unity Catalog-enabled Databricks workspace. \nSee [Share data using the Delta Sharing Databricks-to-Databricks protocol (for providers)](https:\/\/docs.databricks.com\/data-sharing\/share-data-databricks.html). 
\nFor an introduction to Delta Sharing and more information about these three approaches, see [Share data and AI assets securely using Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/share-data-open.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Share data using the Delta Sharing open sharing protocol (for providers)\n#### Delta Sharing open sharing workflow\n\nThis section provides a high-level overview of the open sharing workflow, with links to detailed documentation for each step. \nIn the Delta Sharing open sharing model: \n1. The data provider creates a *recipient*, which is a named object that represents a user or group of users that the data provider wants to share data with. \nWhen the data provider creates the recipient, Databricks generates a token, a credential file that includes the token, and an activation link that the data provider can send to the recipient to access the credential file. \nFor details, see [Step 1: Create the recipient](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#create-recipient-open).\n2. The data provider creates a *share*, which is a named object that contains a collection of tables registered in a Unity Catalog metastore in the provider\u2019s account. \nFor details, see [Create and manage shares for Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/create-share.html).\n3. The data provider grants the recipient access to the share. \nFor details, see [Manage access to Delta Sharing data shares (for providers)](https:\/\/docs.databricks.com\/data-sharing\/grant-access.html).\n4. The data provider sends the activation link to the recipient over a secure channel, along with instructions for using the activation link to download the credential file that the recipient will use to establish a secure connection with the data provider to receive the shared data. 
\nFor details, see [Step 2: Get the activation link](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#get-activation-link).\n5. The data recipient follows the activation link to download the credential file, and then uses the credential file to access the shared data. \nShared data is available to read only. Users can access data using their platform or tools of choice. \nFor details, see [Read data shared using Delta Sharing open sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/share-data-open.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Share data using the Delta Sharing open sharing protocol (for providers)\n#### Setup and security considerations for open sharing\n\nGood token management is key to sharing data securely when you use the open sharing model: \n* Data providers who intend to use open sharing must configure the default recipient token lifetime when they enable Delta Sharing for their Unity Catalog metastore. Databricks recommends that you configure tokens to expire. See [Enable Delta Sharing on a metastore](https:\/\/docs.databricks.com\/data-sharing\/set-up.html#enable).\n* If you need to modify the default token lifetime, see [Modify the recipient token lifetime](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#modify-recipient-token-lifetime).\n* Encourage recipients to manage their downloaded credential file securely.\n* For more information about token management and open sharing security, see [Manage recipient tokens (open sharing)](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#rotate-credential). \nData providers can provide additional security by assigning IP access lists to restrict recipient access to specific network locations. 
See [Restrict Delta Sharing recipient access using IP access lists (open sharing)](https:\/\/docs.databricks.com\/data-sharing\/access-list.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/share-data-open.html"} +{"content":"# Databricks data engineering\n### What is Delta Live Tables?\n\nDelta Live Tables is a declarative framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations to perform on your data and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling. \nInstead of defining your data pipelines using a series of separate Apache Spark tasks, you define streaming tables and materialized views that the system should create and keep up to date. Delta Live Tables manages how your data is transformed based on queries you define for each processing step. You can also enforce data quality with Delta Live Tables *expectations*, which allow you to define expected data quality and specify how to handle records that fail those expectations. \nTo learn more about the benefits of building and running your ETL pipelines with Delta Live Tables, see the [Delta Live Tables product page](https:\/\/www.databricks.com\/product\/delta-live-tables).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/index.html"} +{"content":"# Databricks data engineering\n### What is Delta Live Tables?\n#### What are Delta Live Tables datasets?\n\nDelta Live Tables datasets are the streaming tables, materialized views, and views maintained as the results of declarative queries. The following table describes how each dataset is processed: \n| Dataset type | How are records processed through defined queries? |\n| --- | --- |\n| Streaming table | Each record is processed exactly once. This assumes an append-only source. |\n| Materialized views | Records are processed as required to return accurate results for the current data state. 
Materialized views should be used for data processing tasks such as transformations, aggregations, or pre-computing slow queries and frequently used computations. |\n| Views | Records are processed each time the view is queried. Use views for intermediate transformations and data quality checks that should not be published to public datasets. | \nThe following sections provide more detailed descriptions of each dataset type. To learn more about selecting dataset types to implement your data processing requirements, see [When to use views, materialized views, and streaming tables](https:\/\/docs.databricks.com\/delta-live-tables\/transform.html#tables-vs-views). \n### Streaming table \nA *streaming table* is a Delta table with extra support for streaming or incremental data processing. Streaming tables allow you to process a growing dataset, handling each row only once. Because most datasets grow continuously over time, streaming tables are good for most ingestion workloads. Streaming tables are optimal for pipelines that require data freshness and low latency. Streaming tables can also be useful for massive scale transformations, as results can be incrementally calculated as new data arrives, keeping results up to date without needing to fully recompute all source data with each update. Streaming tables are designed for data sources that are append-only. \nNote \nAlthough, by default, streaming tables require append-only data sources, when a streaming source is another streaming table that requires updates or deletes, you can override this behavior with the [skipChangeCommits flag](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html#ignore-changes). \n### Materialized view \nA *materialized view* (or *live table*) is a view where the results have been precomputed. Materialized views are refreshed according to the update schedule of the pipeline in which they\u2019re contained. 
Materialized views are powerful because they can handle any changes in the input. Each time the pipeline updates, query results are recalculated to reflect changes in upstream datasets that might have occurred because of compliance, corrections, aggregations, or general CDC. Delta Live Tables implements materialized views as Delta tables, but abstracts away complexities associated with efficient application of updates, allowing users to focus on writing queries. \n### Views \nAll *views* in Databricks compute results from source datasets as they are queried, leveraging caching optimizations when available. Delta Live Tables does not publish views to the catalog, so views can be referenced only within the pipeline in which they are defined. Views are useful as intermediate queries that should not be exposed to end users or systems. Databricks recommends using views to enforce data quality constraints or transform and enrich datasets that drive multiple downstream queries.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/index.html"} +{"content":"# Databricks data engineering\n### What is Delta Live Tables?\n#### Declare your first datasets in Delta Live Tables\n\nDelta Live Tables introduces new syntax for Python and SQL. To get started with Delta Live Tables syntax, see the Python and SQL examples in [Tutorial: Run your first Delta Live Tables pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html). \nNote \nDelta Live Tables separates dataset definitions from update processing, and Delta Live Tables notebooks are not intended for interactive execution. 
See [What is a Delta Live Tables pipeline?](https:\/\/docs.databricks.com\/delta-live-tables\/index.html#pipeline).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/index.html"} +{"content":"# Databricks data engineering\n### What is Delta Live Tables?\n#### What is a Delta Live Tables pipeline?\n\nA *pipeline* is the main unit used to configure and run data processing workflows with Delta Live Tables. \nA pipeline contains materialized views and streaming tables declared in Python or SQL source files. Delta Live Tables infers the dependencies between these tables, ensuring updates occur in the correct order. For each dataset, Delta Live Tables compares the current state with the desired state and proceeds to create or update datasets using efficient processing methods. \nThe settings of Delta Live Tables pipelines fall into two broad categories: \n1. Configurations that define a collection of notebooks or files (known as *source code* or *libraries*) that use Delta Live Tables syntax to declare datasets.\n2. Configurations that control pipeline infrastructure, dependency management, how updates are processed, and how tables are saved in the workspace. \nMost configurations are optional, but some require careful attention, especially when configuring production pipelines. These include the following: \n* To make data available outside the pipeline, you must declare a **target schema** to publish to the Hive metastore or a **target catalog** and **target schema** to publish to Unity Catalog.\n* Data access permissions are configured through the cluster used for execution. Make sure your cluster has appropriate permissions configured for data sources and the target **storage location**, if specified. 
\nFor details on using Python and SQL to write source code for pipelines, see [Delta Live Tables SQL language reference](https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html) and [Delta Live Tables Python language reference](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html). \nFor more on pipeline settings and configurations, see [Manage configuration of Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/manage-pipeline-configurations.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/index.html"} +{"content":"# Databricks data engineering\n### What is Delta Live Tables?\n#### Deploy your first pipeline and trigger updates\n\nBefore processing data with Delta Live Tables, you must configure a pipeline. Once a pipeline is configured, you can trigger an update to calculate results for each dataset in your pipeline. To get started using Delta Live Tables pipelines, see [Tutorial: Run your first Delta Live Tables pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html).\n\n### What is Delta Live Tables?\n#### What is a pipeline update?\n\nPipelines deploy infrastructure and recompute data state when you start an *update*. An update does the following: \n* Starts a cluster with the correct configuration.\n* Discovers all the tables and views defined, and checks for any analysis errors such as invalid column names, missing dependencies, and syntax errors.\n* Creates or updates tables and views with the most recent data available. \nPipelines can be run continuously or on a schedule depending on your use case\u2019s cost and latency requirements. See [Run an update on a Delta Live Tables pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/updates.html).\n\n### What is Delta Live Tables?\n#### Ingest data with Delta Live Tables\n\nDelta Live Tables supports all data sources available in Databricks. 
\nDatabricks recommends using streaming tables for most ingestion use cases. For files arriving in cloud object storage, Databricks recommends Auto Loader. You can directly ingest data with Delta Live Tables from most message buses. \nFor more information about configuring access to cloud storage, see [Cloud storage configuration](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#configure-cloud-storage). \nFor formats not supported by Auto Loader, you can use Python or SQL to query any format supported by Apache Spark. See [Load data with Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/load.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/index.html"} +{"content":"# Databricks data engineering\n### What is Delta Live Tables?\n#### Monitor and enforce data quality\n\nYou can use *expectations* to specify data quality controls on the contents of a dataset. Unlike a `CHECK` constraint in a traditional database which prevents adding any records that fail the constraint, expectations provide flexibility when processing data that fails data quality requirements. This flexibility allows you to process and store data that you expect to be messy and data that must meet strict quality requirements. See [Manage data quality with Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html).\n\n### What is Delta Live Tables?\n#### How are Delta Live Tables and Delta Lake related?\n\nDelta Live Tables extends the functionality of Delta Lake. Because tables created and managed by Delta Live Tables are Delta tables, they have the same guarantees and features provided by Delta Lake. See [What is Delta Lake?](https:\/\/docs.databricks.com\/delta\/index.html). \nDelta Live Tables adds several table properties in addition to the many table properties that can be set in Delta Lake. 
See [Delta Live Tables properties reference](https:\/\/docs.databricks.com\/delta-live-tables\/properties.html) and [Delta table properties reference](https:\/\/docs.databricks.com\/delta\/table-properties.html).\n\n### What is Delta Live Tables?\n#### How tables are created and managed by Delta Live Tables\n\nDatabricks automatically manages tables created with Delta Live Tables, determining how updates need to be processed to correctly compute the current state of a table and performing a number of maintenance and optimization tasks. \nFor most operations, you should allow Delta Live Tables to process all updates, inserts, and deletes to a target table. For details and limitations, see [Retain manual deletes or updates](https:\/\/docs.databricks.com\/delta-live-tables\/transform.html#manual-ddl).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/index.html"} +{"content":"# Databricks data engineering\n### What is Delta Live Tables?\n#### Maintenance tasks performed by Delta Live Tables\n\nDelta Live Tables performs maintenance tasks within 24 hours of a table being updated. Maintenance can improve query performance and reduce cost by removing old versions of tables. By default, the system performs a full [OPTIMIZE](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-optimize.html) operation followed by [VACUUM](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-vacuum.html). You can disable OPTIMIZE for a table by setting `pipelines.autoOptimize.managed = false` in the [table properties](https:\/\/docs.databricks.com\/delta-live-tables\/properties.html#table-properties) for the table. 
Maintenance tasks are performed only if a pipeline update has run in the 24 hours before the maintenance tasks are scheduled.\n\n### What is Delta Live Tables?\n#### Limitations\n\nThe following limitations apply: \n* All tables created and updated by Delta Live Tables are Delta tables.\n* Delta Live Tables tables can only be defined once, meaning they can only be the target of a single operation in all Delta Live Tables pipelines.\n* Identity columns are not supported with tables that are the target of `APPLY CHANGES INTO` and might be recomputed during updates for materialized views. For this reason, Databricks recommends only using identity columns with streaming tables in Delta Live Tables. See [Use identity columns in Delta Lake](https:\/\/docs.databricks.com\/delta\/generated-columns.html#identity).\n* A Databricks workspace is limited to 100 concurrent pipeline updates.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/index.html"} +{"content":"# Databricks data engineering\n### What is Delta Live Tables?\n#### Additional resources\n\n* Delta Live Tables has full support in the Databricks REST API. 
See [Delta Live Tables API guide](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html).\n* For pipeline and table settings, see [Delta Live Tables properties reference](https:\/\/docs.databricks.com\/delta-live-tables\/properties.html).\n* [Delta Live Tables SQL language reference](https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html).\n* [Delta Live Tables Python language reference](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/index.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Driver capability settings for the Databricks JDBC Driver\n\nThis article describes how to configure special and advanced driver capability settings for the [Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/index.html). \nThe Databricks JDBC Driver provides the following special and advanced driver capability settings. 
\n* [ANSI SQL-92 query support in JDBC](https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html#jdbc-native)\n* [Default catalog and schema](https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html#default-catalog-and-schema)\n* [Extract large query results in JDBC](https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html#jdbc-extract)\n* [Arrow serialization in JDBC](https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html#jdbc-arrow)\n* [Cloud Fetch in JDBC](https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html#cloud-fetch-in-jdbc)\n* [Advanced configurations](https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html#advanced-configurations)\n* [Enable logging](https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html#enable-logging)\n\n######### Driver capability settings for the Databricks JDBC Driver\n########## ANSI SQL-92 query support in JDBC\n\nLegacy Spark JDBC drivers accept SQL queries in ANSI SQL-92 dialect and translate the queries to the Databricks SQL dialect before sending them to the server. However, if your application generates Databricks SQL directly or your application uses any non-ANSI SQL-92 standard SQL syntax specific to Databricks, Databricks recommends that you set `UseNativeQuery=1` as a connection configuration. 
With that setting, the driver passes the SQL queries verbatim to Databricks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Driver capability settings for the Databricks JDBC Driver\n########## Default catalog and schema\n\nTo specify the default catalog and schema, add `ConnCatalog=<catalog-name>;ConnSchema=<schema-name>` to the JDBC connection URL.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Driver capability settings for the Databricks JDBC Driver\n########## Extract large query results in JDBC\n\nTo achieve the best performance when you extract large query results, use the latest version of the JDBC driver, which includes the following optimizations. \n### Arrow serialization in JDBC \nJDBC driver version 2.6.16 and above supports an optimized query results serialization format that uses [Apache Arrow](https:\/\/arrow.apache.org\/docs\/index.html). \n### Cloud Fetch in JDBC \nThe JDBC driver version 2.6.19 and above supports Cloud Fetch, a capability that fetches query results through the cloud storage that is set up in your Databricks deployment. \nQuery results are uploaded to an internal [DBFS storage location](https:\/\/docs.databricks.com\/dbfs\/index.html) as Arrow-serialized files of up to 20 MB. When the driver sends fetch requests after query completion, Databricks generates and returns [presigned URLs](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/using-presigned-url.html) to the uploaded files. 
The JDBC driver then uses the URLs to download the results directly from DBFS. \nCloud Fetch is only used for query results larger than 1 MB. Smaller results are retrieved directly from Databricks. \nDatabricks automatically garbage collects the accumulated files, which are marked for deletion after 24 hours. These marked files are completely deleted after an additional 24 hours. \nCloud Fetch is only available in E2 workspaces. Also, your corresponding Amazon S3 buckets must not have versioning enabled. If you have versioning enabled, you can still enable Cloud Fetch by following the instructions in [Advanced configurations](https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html#advanced-configurations). \nTo learn more about the Cloud Fetch architecture, see [How We Achieved High-bandwidth Connectivity With BI Tools](https:\/\/databricks.com\/blog\/2021\/08\/11\/how-we-achieved-high-bandwidth-connectivity-with-bi-tools.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Driver capability settings for the Databricks JDBC Driver\n########## Advanced configurations\n\nIf you have enabled [S3 bucket versioning](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/Versioning.html) on your [DBFS root](https:\/\/docs.databricks.com\/dbfs\/index.html), then Databricks cannot garbage collect older versions of uploaded query results. We recommend first setting an S3 lifecycle policy that purges older versions of uploaded query results. \nTo set a lifecycle policy, follow the steps below: \n1. In the AWS console, go to the **S3** service.\n2. Click on the [S3 bucket](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/storage.html) that you use for your workspace\u2019s root storage.\n3. 
Open the **Management** tab and choose **Create lifecycle rule**.\n4. Choose any name for the **Lifecycle rule name**.\n5. Keep the prefix field empty.\n6. Under **Lifecycle rule actions** select **Permanently delete noncurrent versions of objects**.\n7. Set a value under **Days after objects become noncurrent**. We recommend using the value 1 here.\n8. Click **Create rule**. \n![Lifecycle policy](https:\/\/docs.databricks.com\/_images\/lifecycle-policy-with-tags.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Driver capability settings for the Databricks JDBC Driver\n########## Enable logging\n\nTo enable logging in the JDBC driver, set the `LogLevel` property from `1` to log only severe events through `6` to log all driver activity. Set the `LogPath` property to the full path to the folder where you want to save log files. \nFor more information, see the `Configuring Logging` section in the [Databricks JDBC Driver Guide](https:\/\/docs.databricks.com\/_extras\/documents\/Databricks-JDBC-Driver-Install-and-Configuration-Guide.pdf).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html"} +{"content":"# Compute\n## Use compute\n#### Compute configuration best practices\n\nThis article describes recommendations for setting optional compute configurations. To reduce configuration decisions, Databricks recommends taking advantage of both serverless compute and compute policies. \n* Serverless compute does not require configuring compute settings. Serverless compute is always available and scales according to your workload. See [Types of compute](https:\/\/docs.databricks.com\/compute\/index.html#types-of-compute). 
\n* Compute policies let you create preconfigured compute designed for specific use cases like personal compute, shared compute, power users, and jobs. If you don\u2019t have access to the policies, contact your workspace admin. See [Default policies and policy families](https:\/\/docs.databricks.com\/admin\/clusters\/policy-families.html). \nIf you choose to create compute with your own configurations, the sections below provide recommendations for typical use cases. \nNote \nThis article assumes that you have unrestricted cluster creation. Workspace admins should only grant this privilege to advanced users.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/cluster-config-best-practices.html"} +{"content":"# Compute\n## Use compute\n#### Compute configuration best practices\n##### Compute sizing considerations\n\nPeople often think of compute size in terms of the number of workers, but there are other important factors to consider: \n* Total executor cores (compute): The total number of cores across all executors. This determines the maximum parallelism of a compute.\n* Total executor memory: The total amount of RAM across all executors. This determines how much data can be stored in memory before spilling it to disk.\n* Executor local storage: The type and amount of local disk storage. Local disk is primarily used in the case of spills during shuffles and caching. \nAdditional considerations include worker instance type and size, which also influence the factors above. When sizing your compute, consider: \n* How much data will your workload consume?\n* What\u2019s the computational complexity of your workload?\n* Where are you reading data from?\n* How is the data partitioned in external storage?\n* How much parallelism do you need? \nAnswering these questions will help you determine optimal compute configurations based on workloads. \nThere\u2019s a balancing act between the number of workers and the size of worker instance types. 
Configuring compute with two workers, each with 16 cores and 128 GB of RAM, has the same compute and memory as configuring compute with 8 workers, each with 4 cores and 32 GB of RAM.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/cluster-config-best-practices.html"} +{"content":"# Compute\n## Use compute\n#### Compute configuration best practices\n##### Compute sizing examples\n\nThe following examples show compute recommendations based on specific types of workloads. These examples also include configurations to avoid and why those configurations are not suitable for the workload types. \n### Data analysis \nData analysts typically perform processing requiring data from multiple partitions, leading to many shuffle operations. Compute with a smaller number of nodes can reduce the network and disk I\/O needed to perform these shuffles. \nIf you are writing only SQL, the best option for data analysis will be a serverless SQL warehouse. \nNote \nIf your workspace is enabled for the serverless compute public preview, you can use serverless compute to run analysis in Python or SQL. See [Serverless compute for notebooks](https:\/\/docs.databricks.com\/compute\/serverless.html). \nIf you must configure a new compute, a single-node compute with a large VM type is likely the best choice, particularly for a single analyst. \nAnalytical workloads will likely require reading the same data repeatedly, so recommended node types are storage optimized with disk cache enabled. \nAdditional features recommended for analytical workloads include: \n* Enable auto termination to ensure compute is terminated after a period of inactivity.\n* Consider enabling autoscaling based on the analyst\u2019s typical workload.\n* Consider using pools, which will allow restricting compute to pre-approved instance types and ensure consistent compute configurations. 
\n### Basic batch ETL \nNote \nIf your workspace is enabled for serverless compute for workflows (Public Preview), you can use serverless compute to run your jobs. See [Serverless compute for notebooks](https:\/\/docs.databricks.com\/compute\/serverless.html). \nSimple batch ETL jobs that don\u2019t require wide transformations, such as joins or aggregations, typically benefit from compute-optimized worker types. \nCompute-optimized workers have lower requirements for memory and storage and might result in cost savings over other worker types. \n### Complex batch ETL \nNote \nIf your workspace is enabled for serverless compute for workflows (Public Preview), you can use serverless compute to run your jobs. See [Serverless compute for notebooks](https:\/\/docs.databricks.com\/compute\/serverless.html). \nFor a complex ETL job, such as one that requires unions and joins across multiple tables, Databricks recommends reducing the number of workers to reduce the amount of data shuffled. \nComplex transformations can be compute-intensive. If you observe significant spill to disk or OOM errors, you should add additional nodes. \nDatabricks recommends compute-optimized worker types. Compute-optimized workers have lower requirements for memory and storage and might result in cost savings over other worker types. Optionally, use pools to decrease compute launch times and reduce total runtime when running job pipelines. \n### Training machine learning models \nDatabricks recommends single node compute with a large node type for initial experimentation with training machine learning models. Having fewer nodes reduces the impact of shuffles. \nAdding more workers can help with stability, but you should avoid adding too many workers because of the overhead of shuffling data. \nRecommended worker types are storage optimized with disk caching enabled to account for repeated reads of the same data and to enable caching of training data. 
If the compute and storage options provided by storage optimized nodes are not sufficient, consider GPU optimized nodes. A possible downside is the lack of disk caching support with these nodes. \nAdditional features recommended for machine learning workloads include: \n* Enable auto termination to ensure compute is terminated after a period of inactivity.\n* Use pools, which will allow restricting compute to pre-approved instance types and ensure consistent compute configurations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/cluster-config-best-practices.html"} +{"content":"# Technology partners\n## Connect to security partners using Partner Connect\n#### Connect to Privacera\n\nPrivacera is a unified data security governance platform that delivers universal data discovery, access policy management, masking, encryption, and audit capabilities. \nYou can connect Databricks SQL warehouses (formerly Databricks SQL endpoints) and Databricks clusters to Privacera.\n\n#### Connect to Privacera\n##### Connect to Privacera using Partner Connect\n\nTo connect your Databricks workspace to Privacera using Partner Connect, follow the steps in [Connect to security partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/data-security.html).\n\n#### Connect to Privacera\n##### Connect to Privacera manually\n\nTo connect to Privacera manually, see the Privacera documentation: \n* (Recommended) [Connect Databricks Unity Catalog to PrivaceraCloud](https:\/\/docs.privacera.com\/cloud\/en\/connect-databricks-unity-catalog-to-privaceracloud.html)\n* [Connect Databricks to PrivaceraCloud](https:\/\/docs.privacera.com\/cloud\/en\/connect-databricks-to-privaceracloud.html)\n* [Databricks SQL Overview and Configuration](https:\/\/docs.privacera.com\/cloud\/en\/databricks-sql-overview-and-configuration.html)\n\n#### Connect to Privacera\n##### Additional resources\n\n[How to Get Support](https:\/\/docs.privacera.com\/cloud\/en\/how-to-get-support.html) in 
the Privacera documentation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/data-security\/privacera.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query profile\n\nYou can use a query profile to visualize the details of a query execution. The query profile helps you troubleshoot performance bottlenecks during the query\u2019s execution. For example: \n* You can visualize each query task and its related metrics, such as the time spent, the number of rows processed, and memory consumption.\n* You can identify the slowest part of a query execution at a glance and assess the impacts of modifications to the query.\n* You can discover and fix common mistakes in SQL statements, such as exploding joins or full table scans. \nImportant \nThe time recorded in query history for a SQL query is only the time the SQL warehouse spends actually executing the query. It does not record any additional overhead associated with getting ready to execute the query, such as internal queuing, or additional time related to the data upload and download process.\n\n#### Query profile\n##### Requirements\n\nTo view a query profile, you must either be the owner of the query or you must have the CAN MANAGE [permission](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#sql-warehouses) on the SQL warehouse that executed the query.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-profile.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query profile\n##### View a query profile\n\nAfter running a query in the SQL editor or in a notebook, you can open the query profile by clicking the elapsed time at the bottom of the output. \n![Open query history from editor or notebook output](https:\/\/docs.databricks.com\/_images\/elapsed-time.png) \nYou can also view the query profile from the query history as follows: \n1. 
View [query history](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html#view-query-history).\n2. Click the name of a query. An overview of query metrics appears. \n![Query execution summary simple view](https:\/\/docs.databricks.com\/_images\/query-profile.png)\n3. Click **See query profile**. \nNote \nIf **Query profile is not available** is displayed, no profile is available for this query. A query profile is not available for queries that run from the [query cache](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-caching.html). To circumvent the query cache, make a trivial change to the query, such as changing or removing the `LIMIT`.\n4. To view the query profile in graph view (the default), click **Graph view**. To view the query profile as a tree, click **Tree view**. \n* Graph view is optimized for visualizing how data flows from one node to another.\n* Tree view is optimized for quickly finding issues with the query\u2019s performance, such as identifying the longest-running operator.\n5. In graph view or tree view, you can click one of the tabs at the top of the page to view details about each of the query\u2019s tasks. \n* **Time spent**: The sum of execution time spent by all tasks for each operation.\n* **Rows**: The number and size of the rows affected by each of the query\u2019s tasks.\n* **Peak memory**: The peak memory each of the query\u2019s tasks consumed.\nNote \nSome non-Photon operations are executed as a group and share common metrics. In this case, all subtasks have the same value as the parent task for a given metric.\n6. In graph view, if a task has sub-tasks, click a node to show its details. In tree view, you can click **>** to expand it.\n7. Each task\u2019s operation is shown. By default, tasks and metrics for some operations are hidden. These operations are unlikely to be the cause of performance bottlenecks. 
To see information for all operations, and to see additional metrics, click ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) at the top of the page, then click **Enable verbose mode**. The most common operations are: \n* **Scan**: Data was read from a data source and output as rows.\n* **Join**: Rows from multiple relations were combined (interleaved) into a single set of rows.\n* **Union**: Rows from multiple relations that use the same schema were concatenated into a single set of rows.\n* **Shuffle**: Data was redistributed or repartitioned. Shuffle operations are expensive with regard to resources because they move data between executors on the cluster.\n* **Hash \/ Sort**: Rows were grouped by a key and evaluated using an aggregate function such as `SUM`, `COUNT`, or `MAX` within each group.\n* **Filter**: Input is filtered according to a criterion, such as by a `WHERE` clause, and a subset of rows is returned.\n* **(Reused) Exchange**: A Shuffle or Broadcast Exchange is used to redistribute the data among the cluster nodes based on the desired partitioning.\n* **Collect Limit**: The number of rows returned was truncated by using a `LIMIT` statement.\n* **Take Ordered And Project**: The top N rows of the query result were returned.\n8. To view the query profile in the Apache Spark UI, click ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) at the top of the page, then click **Open in Spark UI**.\n9. To close the query profile, click **X** at the top of the page. 
\nFor more details about the information available in the query profile, see [View details about the query profile](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-profile.html#query-profile-details).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-profile.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query profile\n##### View details about the query profile\n\nThe query profile lists the query\u2019s top-level tasks in reverse order, with the last task listed first. On the left, three columns show the task sequence, the name of the operation, and a graph of the selected metric for that task. Follow these steps to familiarize yourself with the different parts of the query profile. \n1. Click **Time** to see the duration of each subtask.\n2. Click **Rows** to see the number and size of rows returned by the query.\n3. Click **Memory** to see the memory consumed by each query task. If the task has subtasks, you can click **>** to see details about each subtask.\n4. On the right, click **Overview** to see the query\u2019s SQL statement, status, start and end times, duration, the user who ran the query, and the warehouse where the query was executed.\n5. Click a task to view details about the task, such as the task\u2019s description and metrics about the task\u2019s duration, memory consumed, number and size of rows returned, and lineage.\n6. To close subtask details, click **X**.\n7. Click the name of the SQL warehouse to go to that warehouse\u2019s properties.\n8. To view the query profile in the Apache Spark UI, click ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) at the top of the page, then click **Open in Spark UI**.\n9. 
To close the query profile, click **X** at the top of the page.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-profile.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query profile\n##### Share a query profile\n\nTo share a query profile with another user: \n1. View [query history](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html#view-query-history).\n2. Click the name of the query.\n3. To share the query, you have two choices: \n* If the other user has the CAN MANAGE permission on the query, you can share the URL for the query profile with them. Click **Share**. The URL is copied to your clipboard.\n* Otherwise, if the other user does not have the CAN MANAGE permission or is not a member of the workspace, you can download the query profile as a JSON object. Click **Download**. The JSON file is downloaded to your local system.\n\n#### Query profile\n##### Import a query profile\n\nTo import the JSON for a query profile: \n1. View [query history](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html#view-query-history).\n2. Click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) on the upper right, and select **Import query profile (JSON)**.\n3. In the file browser, select the JSON file that was shared with you and click **Open**. The JSON file is uploaded and the query profile is displayed. \nWhen you import a query profile, it is dynamically loaded into your browser session and does not persist in your workspace. You need to re-import it each time you want to view it.\n4. 
To close the imported query profile, click **X** at the top of the page.\n\n#### Query profile\n##### Next steps\n\n* Learn about accessing query metrics using the [query history API](https:\/\/docs.databricks.com\/api\/workspace\/queryhistory)\n* Learn more about [query history](https:\/\/docs.databricks.com\/sql\/user\/queries\/query-history.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-profile.html"} +{"content":"# Compute\n## Use compute\n#### Compute configuration reference\n\nThis article explains all the configuration settings available in the Create Compute UI. Most users create compute using their assigned policies, which limits the configurable settings. If you don\u2019t see a particular setting in your UI, it\u2019s because the policy you\u2019ve selected does not allow you to configure that setting. \n![AWS unrestricted compute creation page](https:\/\/docs.databricks.com\/_images\/compute-settings-aws.png) \nThe configurations and management tools described in this article apply to both all-purpose and job compute. For more considerations on configuring job compute, see [Use Databricks compute with your jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/use-compute.html).\n\n#### Compute configuration reference\n##### Policies\n\nPolicies are a set of rules used to limit the configuration options available to users when they create compute. If a user doesn\u2019t have the **Unrestricted cluster creation** entitlement, then they can only create compute using their granted policies. \nTo create compute according to a policy, select a policy from the **Policy** drop-down menu. \nBy default, all users have access to the **Personal Compute** policy, allowing them to create single-machine compute resources. 
If you need access to Personal Compute or any additional policies, reach out to your workspace admin.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/configure.html"} +{"content":"# Compute\n## Use compute\n#### Compute configuration reference\n##### Single-node or multi-node compute\n\nDepending on the policy, you can select between creating a **Single node** compute or a **Multi node** compute. \nSingle node compute is intended for jobs that use small amounts of data or non-distributed workloads such as single-node machine learning libraries. Multi-node compute should be used for larger jobs with distributed workloads. \n### Single node properties \nA single node compute has the following properties: \n* Runs Spark locally.\n* Driver acts as both master and worker, with no worker nodes.\n* Spawns one executor thread per logical core in the compute, minus 1 core for the driver.\n* Saves all `stderr`, `stdout`, and `log4j` log outputs in the driver log.\n* Can\u2019t be converted to a multi-node compute. \n### Selecting single or multi node \nConsider your use case when deciding between a single or multi-node compute: \n* Large-scale data processing will exhaust the resources on a single node compute. For these workloads, Databricks recommends using a multi-node compute.\n* Single-node compute is not designed to be shared. To avoid resource conflicts, Databricks recommends using a multi-node compute when the compute must be shared.\n* A multi-node compute can\u2019t be scaled to 0 workers. Use a single node compute instead.\n* Single-node compute is not compatible with process isolation.\n* GPU scheduling is not enabled on single node compute.\n* On single-node compute, Spark cannot read Parquet files with a UDT column. The following error message results: \n```\nThe Spark driver has stopped unexpectedly and is restarting. 
Your notebook will be automatically reattached.\n\n``` \nTo work around this problem, disable the native Parquet reader: \n```\nspark.conf.set(\"spark.databricks.io.parquet.nativeReader.enabled\", False)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/configure.html"} +{"content":"# Compute\n## Use compute\n#### Compute configuration reference\n##### Access modes\n\nAccess mode is a security feature that determines who can use the compute and what data they can access via the compute. Every compute in Databricks has an access mode. \nDatabricks recommends that you use shared access mode for all workloads. Only use the single user access mode if your required functionality is not supported by shared access mode. \n| Access Mode | Visible to user | UC Support | Supported Languages | Notes |\n| --- | --- | --- | --- | --- |\n| Single user | Always | Yes | Python, SQL, Scala, R | Can be assigned to and used by a single user. Referred to as **Assigned** access mode in some workspaces. |\n| Shared | Always (**Premium plan or above required**) | Yes | Python (on Databricks Runtime 11.3 LTS and above), SQL, Scala (on Unity Catalog-enabled compute using Databricks Runtime 13.3 LTS and above) | Can be used by multiple users with data isolation among users. |\n| No Isolation Shared | Admins can hide this access mode by [enforcing user isolation](https:\/\/docs.databricks.com\/admin\/workspace-settings\/enforce-user-isolation.html) in the admin settings page. | No | Python, SQL, Scala, R | There is a [related account-level setting for No Isolation Shared compute](https:\/\/docs.databricks.com\/admin\/account-settings\/no-isolation-shared.html). |\n| Custom | Hidden (For all new compute) | No | Python, SQL, Scala, R | This option is shown only if you have existing compute without a specified access mode. | \nYou can upgrade an existing compute to meet the requirements of Unity Catalog by setting its access mode to **Single User** or **Shared**. 
\nNote \nIn Databricks Runtime 13.3 LTS and above, init scripts and libraries are supported on all access modes. Requirements and support vary. See [Where can init scripts be installed?](https:\/\/docs.databricks.com\/init-scripts\/index.html#compatibility) and [Cluster-scoped libraries](https:\/\/docs.databricks.com\/libraries\/index.html#compatibility).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/configure.html"} +{"content":"# Compute\n## Use compute\n#### Compute configuration reference\n##### Databricks Runtime versions\n\nDatabricks Runtime is the set of core components that run on your compute. Select the runtime using the **Databricks Runtime Version** drop-down menu. For details on specific Databricks Runtime versions, see [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). All versions include Apache Spark. Databricks recommends the following: \n* For all-purpose compute, use the most current version to ensure you have the latest optimizations and the most up-to-date compatibility between your code and preloaded packages.\n* For job compute running operational workloads, consider using the Long Term Support (LTS) Databricks Runtime version. Using the LTS version will ensure you don\u2019t run into compatibility issues and can thoroughly test your workload before upgrading.\n* For data science and machine learning use cases, consider Databricks Runtime ML version. \n### Use Photon acceleration \nPhoton is enabled by default on compute running Databricks Runtime 9.1 LTS and above. \nTo enable or disable Photon acceleration, select the **Use Photon Acceleration** checkbox. 
To learn more about Photon, see [What is Photon?](https:\/\/docs.databricks.com\/compute\/photon.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/configure.html"} +{"content":"# Compute\n## Use compute\n#### Compute configuration reference\n##### Worker and driver node types\n\nCompute consists of one driver node and zero or more worker nodes. You can pick separate cloud provider instance types for the driver and worker nodes, although by default the driver node uses the same instance type as the worker node. Different families of instance types fit different use cases, such as memory-intensive or compute-intensive workloads. \nYou can also select a pool to use as the worker or driver node. See [What are Databricks pools?](https:\/\/docs.databricks.com\/compute\/pool-index.html). \n### Worker type \nIn multi-node compute, worker nodes run the Spark executors and other services required for proper functioning compute. When you distribute your workload with Spark, all the distributed processing happens on worker nodes. Databricks runs one executor per worker node. Therefore, the terms executor and worker are used interchangeably in the context of the Databricks architecture. \nTip \nTo run a Spark job, you need at least one worker node. If the compute has zero workers, you can run non-Spark commands on the driver node, but Spark commands will fail. \n#### Worker node IP addresses \nDatabricks launches worker nodes with two private IP addresses each. The node\u2019s primary private IP address hosts Databricks internal traffic. The secondary private IP address is used by the Spark container for intra-cluster communication. This model allows Databricks to provide isolation between multiple compute in the same workspace. \n### Driver type \nThe driver node maintains state information of all notebooks attached to the compute. 
The driver node also maintains the SparkContext, interprets all the commands you run from a notebook or a library on the compute, and runs the Apache Spark master that coordinates with the Spark executors. \nThe default value of the driver node type is the same as the worker node type. You can choose a larger driver node type with more memory if you are planning to `collect()` a lot of data from Spark workers and analyze them in the notebook. \nTip \nSince the driver node maintains all of the state information of the notebooks attached, make sure to detach unused notebooks from the driver node. \n### GPU instance types \nFor computationally challenging tasks that demand high performance, like those associated with deep learning, Databricks supports compute accelerated with graphics processing units (GPUs). For more information, see [GPU-enabled compute](https:\/\/docs.databricks.com\/compute\/gpu.html). \nDatabricks no longer supports spinning up compute using Amazon EC2 P2 instances. \n### AWS Graviton instance types \nDatabricks compute supports [AWS Graviton](https:\/\/aws.amazon.com\/ec2\/graviton\/) instances. These instances use AWS-designed Graviton processors that are built on top of the Arm64 instruction set architecture. AWS claims that instance types with these processors have the best price-to-performance ratio of any instance type on Amazon EC2. To use Graviton instance types, select one of the available AWS Graviton instance types for the **Worker type**, **Driver type**, or both. \nDatabricks supports AWS Graviton-enabled compute: \n* On [Databricks Runtime 9.1 LTS](https:\/\/docs.databricks.com\/release-notes\/runtime\/9.1lts.html) and above for non-[Photon](https:\/\/docs.databricks.com\/compute\/photon.html), and [Databricks Runtime 10.2 (unsupported)](https:\/\/docs.databricks.com\/archive\/runtime-release-notes\/10.2.html) and above for Photon.\n* In all AWS Regions. Note, however, that not all instance types are available in all Regions. 
If you select an instance type that is not available in the Region for a workspace, you get compute creation failure.\n* For AWS Graviton2 and Graviton3 processors. \nNote \nDelta Live Tables is not supported on Graviton-enabled compute. \n#### ARM64 ISA limitations \n* Floating point precision changes: typical operations like adding, subtracting, multiplying, and dividing have no change in precision. For single triangle functions such as `sin` and `cos`, the upper bound on the precision difference to Intel instances is `1.11e-16`.\n* Third party support: the change in ISA may have some impact on support for third-party tools and libraries.\n* Mixed-instance compute: Databricks does not support mixing AWS Graviton and non-AWS Graviton instance types, as each type requires a different Databricks Runtime. \n#### Graviton limitations \nThe following features do not support AWS Graviton instance types: \n* Python UDFs in Unity Catalog\n* Databricks Runtime for Machine Learning\n* Databricks Container Services\n* Delta Live Tables\n* Databricks SQL\n* Databricks on AWS GovCloud \n### AWS Fleet instance types \nNote \nIf your workspace was created before May 2023, an account admin might need to update your workspace\u2019s IAM role access policy permissions to allow fleet instance types. For required permissions, see [Create an access policy](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/credentials.html#default-policy). \nA fleet instance type is a variable instance type that automatically resolves to the best available instance type of the same size. \nFor example, if you select the fleet instance type `m-fleet.xlarge`, your node will resolve to whichever `.xlarge`, general purpose instance type has the best spot capacity and price at that moment. The instance type your compute resolves to will always have the same memory and number of cores as the fleet instance type you chose. 
\nFleet instance types use AWS\u2019s [Spot Placement Score API](https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/spot-placement-score.html) to choose the best and most likely to succeed availability zone for your compute at startup time. \n#### Fleet limitations \n* The **Max spot price** setting under **Advanced options** has no effect when the worker node type is set to a fleet instance type. This is because there is no single on-demand instance to use as a reference point for the spot price.\n* Fleet instances do not support GPU instances.\n* A small percentage of older workspaces do not yet support fleet instance types. If this is the case for your workspace, you\u2019ll see an error indicating this when attempting to create compute or an instance pool using a fleet instance type. We\u2019re working to bring support to these remaining workspaces.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/configure.html"} +{"content":"# Compute\n## Use compute\n#### Compute configuration reference\n##### Enable autoscaling\n\nWhen **Enable autoscaling** is checked, you can provide a minimum and maximum number of workers for the compute. Databricks then chooses the appropriate number of workers required to run your job. \nTo set the minimum and the maximum number of workers your compute will autoscale between, use the **Min workers** and **Max workers** fields next to the **Worker type** dropdown. \nIf you don\u2019t enable autoscaling, you will enter a fixed number of workers in the **Workers** field next to the **Worker type** dropdown. \nNote \nWhen the compute is running, the compute details page displays the number of allocated workers. You can compare number of allocated workers with the worker configuration and make adjustments as needed. \n### Benefits of autoscaling \nWith autoscaling, Databricks dynamically reallocates workers to account for the characteristics of your job. 
Certain parts of your pipeline may be more computationally demanding than others, and Databricks automatically adds workers during these phases of your job (and removes them when they\u2019re no longer needed). \nAutoscaling makes it easier to achieve high utilization because you don\u2019t need to provision the compute to match a workload. This applies especially to workloads whose requirements change over time (like exploring a dataset during the course of a day), but it can also apply to a one-time shorter workload whose provisioning requirements are unknown. Autoscaling thus offers two advantages: \n* Workloads can run faster compared to a constant-sized under-provisioned compute.\n* Autoscaling can reduce overall costs compared to a statically-sized compute. \nDepending on the constant size of the compute and the workload, autoscaling gives you one or both of these benefits at the same time. The compute size can go below the minimum number of workers selected when the cloud provider terminates instances. In this case, Databricks continuously retries to re-provision instances in order to maintain the minimum number of workers. \nNote \nAutoscaling is not available for `spark-submit` jobs. \nNote \nCompute auto-scaling has limitations scaling down cluster size for Structured Streaming workloads. Databricks recommends using Delta Live Tables with Enhanced Autoscaling for streaming workloads. See [Optimize the cluster utilization of Delta Live Tables pipelines with Enhanced Autoscaling](https:\/\/docs.databricks.com\/delta-live-tables\/auto-scaling.html). \n### How autoscaling behaves \nWorkspaces in the Premium and Enterprise pricing plans use optimized autoscaling. Workspaces on the standard pricing plan use standard autoscaling. 
\nOptimized autoscaling has the following characteristics: \n* Scales up from min to max in 2 steps.\n* Can scale down, even if the compute is not idle, by looking at the shuffle file state.\n* Scales down based on a percentage of current nodes.\n* On job compute, scales down if the compute is underutilized over the last 40 seconds.\n* On all-purpose compute, scales down if the compute is underutilized over the last 150 seconds.\n* The `spark.databricks.aggressiveWindowDownS` Spark configuration property specifies in seconds how often the compute makes down-scaling decisions. Increasing the value causes the compute to scale down more slowly. The maximum value is 600. \nStandard autoscaling is used in standard plan workspaces. Standard autoscaling has the following characteristics: \n* Starts by adding 8 nodes. Then scales up exponentially, taking as many steps as required to reach the max.\n* Scales down when 90% of the nodes are not busy for 10 minutes and the compute has been idle for at least 30 seconds.\n* Scales down exponentially, starting with 1 node. \n### Autoscaling with pools \nIf you are attaching your compute to a pool, consider the following: \n* Make sure the compute size requested is less than or equal to the [minimum number of idle instances](https:\/\/docs.databricks.com\/compute\/pools.html#pool-min) in the pool. If it is larger, compute startup time will be equivalent to compute that doesn\u2019t use a pool.\n* Make sure the maximum compute size is less than or equal to the [maximum capacity](https:\/\/docs.databricks.com\/compute\/pools.html#pool-max) of the pool. If it is larger, the compute creation will fail. \n### Autoscaling example \nIf you reconfigure a static compute to autoscale, Databricks immediately resizes the compute within the minimum and maximum bounds and then starts autoscaling. 
As an example, the following table demonstrates what happens to compute with a certain initial size if you reconfigure the compute to autoscale between 5 and 10 nodes. \n| Initial size | Size after reconfiguration |\n| --- | --- |\n| 6 | 6 |\n| 12 | 10 |\n| 3 | 5 |\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/configure.html"} +{"content":"# Compute\n## Use compute\n#### Compute configuration reference\n##### Enable autoscaling local storage\n\nIf you don\u2019t want to allocate a fixed number of EBS volumes at compute creation time, use autoscaling local storage. With autoscaling local storage, Databricks monitors the amount of free disk space available on your compute\u2019s Spark workers. If a worker begins to run too low on disk, Databricks automatically attaches a new EBS volume to the worker before it runs out of disk space. EBS volumes are attached up to a limit of 5 TB of total disk space per instance (including the instance\u2019s local storage). \nTo configure autoscaling storage, select **Enable autoscaling local storage**. \nThe EBS volumes attached to an instance are detached only when the instance is returned to AWS. That is, EBS volumes are never detached from an instance as long as it is part of a running compute. To scale down EBS usage, Databricks recommends using this feature in compute configured with [autoscaling compute](https:\/\/docs.databricks.com\/compute\/configure.html#autoscaling) or [automatic termination](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#automatic-termination). \nNote \nDatabricks uses Throughput Optimized HDD (st1) to extend the local storage of an instance. The [default AWS capacity limit](https:\/\/docs.aws.amazon.com\/general\/latest\/gr\/aws_service_limits.html#limits_ebs) for these volumes is 20 TiB. To avoid hitting this limit, administrators should request an increase in this limit based on their usage requirements. 
\n### Local disk encryption \nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nSome instance types you use to run compute may have locally attached disks. Databricks may store shuffle data or ephemeral data on these locally attached disks. To ensure that all data at rest is encrypted for all storage types, including shuffle data that is stored temporarily on your compute\u2019s local disks, you can enable local disk encryption. \nImportant \nYour workloads may run more slowly because of the performance impact of reading and writing encrypted data to and from local volumes. \nWhen local disk encryption is enabled, Databricks generates an encryption key locally that is unique to each compute node and is used to encrypt all data stored on local disks. The scope of the key is local to each compute node and is destroyed along with the compute node itself. During its lifetime, the key resides in memory for encryption and decryption and is stored encrypted on the disk. \nTo enable local disk encryption, you must use the [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters). During compute creation or edit, set `enable_local_disk_encryption` to `true`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/configure.html"} +{"content":"# Compute\n## Use compute\n#### Compute configuration reference\n##### Automatic termination\n\nYou can set auto termination for compute. During compute creation, specify an inactivity period in minutes after which you want the compute to terminate. \nIf the difference between the current time and the last command run on the compute is more than the inactivity period specified, Databricks automatically terminates that compute. 
For more information on compute termination, see [Terminate a compute](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-terminate).\n\n#### Compute configuration reference\n##### Instance profiles\n\nNote \nDatabricks recommends using Unity Catalog external locations to connect to S3 instead of instance profiles. Unity Catalog simplifies the security and governance of your data by providing a central place to administer and audit data access across multiple workspaces in your account. See [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html). \nTo securely access AWS resources without using AWS keys, you can launch Databricks compute with instance profiles. See [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html) for information about how to create and configure instance profiles. Once you have created an instance profile, you select it in the **Instance Profile** drop-down list. \nAfter you launch your compute, verify that you can access the S3 bucket using the following command. If the command succeeds, that compute resource can access the S3 bucket. \n```\ndbutils.fs.ls(\"s3a:\/\/<s3-bucket-name>\/\")\n\n``` \nWarning \nOnce a compute launches with an instance profile, anyone who has attach permissions to this compute can access the underlying resources controlled by this role. To guard against unwanted access, use [Compute permissions](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions) to restrict permissions to the compute.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/configure.html"} +{"content":"# Compute\n## Use compute\n#### Compute configuration reference\n##### Tags\n\nTags allow you to easily monitor the cost of cloud resources used by various groups in your organization. 
Specify tags as key-value pairs when you create compute, and Databricks applies these tags to cloud resources like VMs and disk volumes, as well as DBU usage reports. \nFor compute launched from pools, the custom tags are only applied to DBU usage reports and do not propagate to cloud resources. \nFor detailed information about how pool and compute tag types work together, see [Monitor usage using tags](https:\/\/docs.databricks.com\/admin\/account-settings\/usage-detail-tags.html). \nTo add tags to your compute: \n1. In the **Tags** section, add a key-value pair for each custom tag.\n2. Click **Add**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/configure.html"} +{"content":"# Compute\n## Use compute\n#### Compute configuration reference\n##### AWS configurations\n\nWhen you create compute, you can choose the availability zone, the max spot price, and EBS volume type. These settings are under the **Advanced Options** toggle in the **Instances** tab. \n### Availability zones \nThis setting lets you specify which availability zone (AZ) you want the compute to use. By default, this setting is set to **auto**, where the AZ is automatically selected based on available IPs in the workspace subnets. Auto-AZ retries in other availability zones if AWS returns insufficient capacity errors. \nNote \nAuto-AZ works only at compute startup. After the compute launches, all the nodes stay in the original availability zone until the compute is terminated or restarted. \nChoosing a specific AZ for the compute is useful primarily if your organization has purchased reserved instances in specific availability zones. Read more about [AWS availability zones](https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/using-regions-availability-zones.html). \n### Spot instances \nYou can specify whether to use spot instances and the max spot price to use when launching spot instances as a percentage of the corresponding on-demand price. 
By default, the max price is 100% of the on-demand price. See [AWS spot pricing](https:\/\/aws.amazon.com\/ec2\/spot\/). \n### EBS volumes \nThis section describes the default EBS volume settings for worker nodes, how to add shuffle volumes, and how to configure compute so that Databricks automatically allocates EBS volumes. \nTo configure EBS volumes, your compute must not be enabled for autoscaling local storage. Click the **Instances** tab in the compute configuration and select an option in the **EBS Volume Type** dropdown list. \n#### Default EBS volumes \nDatabricks provisions EBS volumes for every worker node as follows: \n* A 30 GB encrypted EBS instance root volume used by the host operating system and Databricks internal services.\n* A 150 GB encrypted EBS container root volume used by the Spark worker. This hosts Spark services and logs.\n* (HIPAA only) a 75 GB encrypted EBS worker log volume that stores logs for Databricks internal services. \n#### Add EBS shuffle volumes \nTo add shuffle volumes, select **General Purpose SSD** in the **EBS Volume Type** dropdown list. \nBy default, Spark shuffle outputs go to the instance local disk. For instance types that do not have a local disk, or if you want to increase your Spark shuffle storage space, you can specify additional EBS volumes.\nThis is particularly useful to prevent out-of-disk space errors when you run Spark jobs that produce large shuffle outputs. \nDatabricks encrypts these EBS volumes for both on-demand and spot instances. Read more about [AWS EBS volumes](https:\/\/aws.amazon.com\/ebs\/features\/). \n#### Optionally encrypt Databricks EBS volumes with a customer-managed key \nOptionally, you can encrypt compute EBS volumes with a customer-managed key. \nSee [Customer-managed keys for encryption](https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html). 
\n#### AWS EBS limits \nEnsure that your AWS EBS limits are high enough to satisfy the runtime requirements for all workers in all your deployed compute.\nFor information on the default EBS limits and how to change them, see [Amazon Elastic Block Store (EBS) Limits](https:\/\/docs.aws.amazon.com\/general\/latest\/gr\/aws_service_limits.html#limits_ebs). \n#### AWS EBS SSD volume type \nSelect either gp2 or gp3 for your AWS EBS SSD volume type. To do this, see [Manage SSD storage](https:\/\/docs.databricks.com\/admin\/clusters\/manage-ssd.html). Databricks recommends you switch to gp3 for its cost savings compared to gp2. \nNote \nBy default, the Databricks configuration sets the gp3 volume\u2019s IOPS and throughput IOPS to match the maximum performance of a gp2 volume with the same volume size. \nFor technical information about gp2 and gp3, see [Amazon EBS volume types](https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/ebs-volume-types.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/configure.html"} +{"content":"# Compute\n## Use compute\n#### Compute configuration reference\n##### Spark configuration\n\nTo fine-tune Spark jobs, you can provide custom [Spark configuration properties](https:\/\/spark.apache.org\/docs\/latest\/configuration.html). \n1. On the compute configuration page, click the **Advanced Options** toggle.\n2. Click the **Spark** tab. \n![Spark configuration](https:\/\/docs.databricks.com\/_images\/spark-config-aws.png) \nIn **Spark config**, enter the configuration properties as one key-value pair per line. \nWhen you configure compute using the [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters), set Spark properties in the `spark_conf` field in the [create cluster API](https:\/\/docs.databricks.com\/api\/workspace\/clusters\/create) or [Update cluster API](https:\/\/docs.databricks.com\/api\/workspace\/clusters\/edit). 
\nTo enforce Spark configurations on compute, workspace admins can use [compute policies](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html). \n### Retrieve a Spark configuration property from a secret \nDatabricks recommends storing sensitive information, such as passwords, in a [secret](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html) instead of plaintext. To reference a secret in the Spark configuration, use the following syntax: \n```\nspark.<property-name> {{secrets\/<scope-name>\/<secret-name>}}\n\n``` \nFor example, to set a Spark configuration property called `password` to the value of the secret stored in `secrets\/acme-app\/password`: \n```\nspark.password {{secrets\/acme-app\/password}}\n\n``` \nFor more information, see [Syntax for referencing secrets in a Spark configuration property or environment variable](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html#path-value). \n### Environment variables \nConfigure custom environment variables that you can access from [init scripts](https:\/\/docs.databricks.com\/init-scripts\/index.html) running on the compute. Databricks also provides predefined [environment variables](https:\/\/docs.databricks.com\/init-scripts\/environment-variables.html) that you can use in init scripts. You cannot override these predefined environment variables. \n1. On the compute configuration page, click the **Advanced Options** toggle.\n2. Click the **Spark** tab.\n3. Set the environment variables in the **Environment Variables** field. 
\n![Environment Variables field](https:\/\/docs.databricks.com\/_images\/environment-variables.png) \nYou can also set environment variables using the `spark_env_vars` field in the [Create cluster API](https:\/\/docs.databricks.com\/api\/workspace\/clusters\/create) or [Update cluster API](https:\/\/docs.databricks.com\/api\/workspace\/clusters\/edit).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/configure.html"} +{"content":"# Compute\n## Use compute\n#### Compute configuration reference\n##### Compute log delivery\n\nWhen you create compute, you can specify a location to deliver the logs for the Spark driver node, worker nodes, and events. Logs are delivered every five minutes and archived hourly in your chosen destination. When a compute is terminated, Databricks guarantees to deliver all logs generated up until the compute was terminated. \nThe destination of the logs depends on the compute\u2019s `cluster_id`. If the specified destination is\n`dbfs:\/cluster-log-delivery`, compute logs for `0630-191345-leap375` are delivered to\n`dbfs:\/cluster-log-delivery\/0630-191345-leap375`. \nTo configure the log delivery location: \n1. On the compute page, click the **Advanced Options** toggle.\n2. Click the **Logging** tab.\n3. Select a destination type.\n4. Enter the compute log path. \n### S3 bucket destinations \nIf you choose an S3 destination, you must configure the compute with an instance profile that can access the bucket.\nThis instance profile must have both the `PutObject` and `PutObjectAcl` permissions. An example instance profile\nhas been included for your convenience. See [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html) for instructions on how to set up an instance profile. 
\n```\n{\n\"Version\": \"2012-10-17\",\n\"Statement\": [\n{\n\"Effect\": \"Allow\",\n\"Action\": [\n\"s3:ListBucket\"\n],\n\"Resource\": [\n\"arn:aws:s3:::<my-s3-bucket>\"\n]\n},\n{\n\"Effect\": \"Allow\",\n\"Action\": [\n\"s3:PutObject\",\n\"s3:PutObjectAcl\",\n\"s3:GetObject\",\n\"s3:DeleteObject\"\n],\n\"Resource\": [\n\"arn:aws:s3:::<my-s3-bucket>\/*\"\n]\n}\n]\n}\n\n``` \nNote \nThis feature is also available in the REST API. See the [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/configure.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Install libraries from a volume\n\nThis article walks you through the steps required to upload libraries or requirements.txt files to volumes and install them onto clusters in Databricks. You can install libraries onto all-purpose compute or job compute. \nFor more information about volumes, see [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html). For information about working with Unity Catalog, including controlling access and creating objects, see [What is Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). \nFor full library compatibility details, see [Cluster-scoped libraries](https:\/\/docs.databricks.com\/libraries\/index.html#compatibility).\n\n#### Install libraries from a volume\n##### Load libraries to a volume\n\nTo load a library to a volume: \n1. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** in the left sidebar.\n2. In the Catalog Explorer tree, navigate to the volume.\n3. Click **+Add**, then select **Upload to this volume**.\n4. The **Upload files to volume** dialog appears. 
Drag and drop or browse to the file(s) you want to upload, and click **Upload**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/volume-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Install libraries from a volume\n##### Install libraries from a volume onto a cluster\n\nWhen you install a library onto a cluster, all notebooks running on that cluster have access to the library. \nTo install a library from a volume onto a cluster: \n1. Click ![compute icon](https:\/\/docs.databricks.com\/_images\/clusters-icon.png) **Compute** in the left sidebar.\n2. Click the name of the cluster in the cluster list.\n3. Click the **Libraries** tab.\n4. Click **Install new**. The **Install library** dialog appears.\n5. For **Library Source**, select **Volumes**.\n6. Upload the library or requirements.txt file, browse to the library or requirements.txt file in the Volumes browser, or enter its location in the **Volumes File Path** field, such as the following: `\/Volumes\/<catalog>\/<schema>\/<volume>\/<path-to-library>\/<file-name>.<ext>`.\n7. Click **Install**.\n\n#### Install libraries from a volume\n##### Add dependent libraries to workflow tasks from a volume\n\nTo configure a workflow task with a dependent library from a volume: \n1. Select an existing task in a workflow or create a new task.\n2. Next to **Dependent libraries**, click **+ Add**.\n3. In the **Add dependent library** dialog, select Volumes for **Library Source**.\n4. Upload the library or requirements.txt file, browse to the library or requirements.txt file in the Volumes browser, or enter its location in the **Volumes File Path** field, such as the following: `\/Volumes\/<catalog>\/<schema>\/<volume>\/<path-to-library>\/<file-name>.<ext>`.\n5. 
Click **Install**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/volume-libraries.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Install libraries from a volume\n##### Install libraries from a volume to a notebook\n\nYou can install Python libraries directly to a notebook to create custom Python environments that are specific to the notebook. For example, you can use a specific version of a library in a notebook, without affecting other users on the cluster who may need a different version of the same library. For more information, see [notebook-scoped libraries](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html). \nWhen you install a library to a notebook, only the current notebook and any jobs associated with that notebook have access to that library. Other notebooks attached to the same cluster are not affected. \nThe following code shows how to install a Python wheel file from a volume to a notebook as a notebook-scoped library. \n```\n%pip install \/Volumes\/<catalog>\/<schema>\/<volume>\/<path-to-library>\/mypackage-0.0.1-py3-none-any.whl\n\n``` \nor \n```\n%pip install \/Volumes\/<catalog>\/<schema>\/<volume>\/<path-to-project>\/requirements.txt\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/volume-libraries.html"} +{"content":"# \n### Introduction to Databricks Lakehouse Monitoring\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes Databricks Lakehouse Monitoring. It covers the benefits of monitoring your data and gives an overview of the components and usage of Databricks Lakehouse Monitoring. \nDatabricks Lakehouse Monitoring lets you monitor the statistical properties and quality of the data in all of the tables in your account. 
You can also use it to track the performance of machine learning models and model-serving endpoints by monitoring inference tables that contain model inputs and predictions. The diagram shows the flow of data through data and ML pipelines in Databricks, and how you can use monitoring to continuously track data quality and model performance. \n![Databricks Lakehouse Monitoring overview](https:\/\/docs.databricks.com\/_images\/lakehouse-monitoring-overview.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/index.html"} +{"content":"# \n### Introduction to Databricks Lakehouse Monitoring\n#### Why use Databricks Lakehouse Monitoring?\n\nTo draw useful insights from your data, you must have confidence in the quality of your data. Monitoring your data provides quantitative measures that help you track and confirm the quality and consistency of your data over time. When you detect changes in your table\u2019s data distribution or corresponding model\u2019s performance, the tables created by Databricks Lakehouse Monitoring can capture and alert you to the change and can help you identify the cause. \nDatabricks Lakehouse Monitoring helps you answer questions like the following: \n* What does data integrity look like, and how does it change over time? For example, what is the fraction of null or zero values in the current data, and has it increased?\n* What does the statistical distribution of the data look like, and how does it change over time? For example, what is the 90th percentile of a numerical column? Or, what is the distribution of values in a categorical column, and how does it differ from yesterday?\n* Is there drift between the current data and a known baseline, or between successive time windows of the data?\n* What does the statistical distribution or drift of a subset or slice of the data look like?\n* How are ML model inputs and predictions shifting over time?\n* How is model performance trending over time? 
Is model version A performing better than version B? \nIn addition, Databricks Lakehouse Monitoring lets you control the time granularity of observations and set up custom metrics.\n\n### Introduction to Databricks Lakehouse Monitoring\n#### Requirements\n\nThe following are required to use Databricks Lakehouse Monitoring: \n* Your workspace must be enabled for Unity Catalog and you must have access to Databricks SQL.\n* Only Delta tables, including managed tables, external tables, views, materialized views, and streaming tables are supported for monitoring. Monitors created over materialized views and streaming tables do not support incremental processing.\n* Not all regions are supported. For regional support, see [Databricks clouds and regions](https:\/\/docs.databricks.com\/resources\/supported-regions.html). \nNote \nDatabricks Lakehouse Monitoring uses serverless compute for workflows. For information about tracking Lakehouse Monitoring expenses, see [View Lakehouse Monitoring expenses](https:\/\/docs.databricks.com\/lakehouse-monitoring\/expense.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/index.html"} +{"content":"# \n### Introduction to Databricks Lakehouse Monitoring\n#### How Lakehouse Monitoring works on Databricks\n\nTo monitor a table in Databricks, you create a monitor attached to the table. To monitor the performance of a machine learning model, you attach the monitor to an inference table that holds the model\u2019s inputs and corresponding predictions. \nDatabricks Lakehouse Monitoring provides the following types of analysis: time series, snapshot, and inference. \n| Profile type | Description |\n| --- | --- |\n| Time series | Use for tables that contain a time series dataset based on a timestamp column. Monitoring computes data quality metrics across time-based windows of the time series. |\n| Inference | Use for tables that contain the request log for a model. 
Each row is a request, with columns for the timestamp, the model inputs, the corresponding prediction, and an optional ground-truth label. Monitoring compares model performance and data quality metrics across time-based windows of the request log. |\n| Snapshot | Use for all other types of tables. Monitoring calculates data quality metrics over all data in the table. The complete table is processed with every refresh. | \nThis section briefly describes the input tables used by Databricks Lakehouse Monitoring and the metric tables it produces. The diagram shows the relationship between the input tables, the metric tables, the monitor, and the dashboard. \n![Databricks Lakehouse Monitoring diagram](https:\/\/docs.databricks.com\/_images\/lakehouse-monitoring.png) \n### Primary table and baseline table \nIn addition to the table to be monitored, called the \u201cprimary table\u201d, you can optionally specify a baseline table to use as a reference for measuring drift, or the change in values over time. A baseline table is useful when you have a sample of what you expect your data to look like. The idea is that drift is then computed relative to expected data values and distributions. \nThe baseline table should contain a dataset that reflects the expected quality of the input data, in terms of statistical distributions, individual column distributions, missing values, and other characteristics. It should match the schema of the monitored table. The exception is the timestamp column for tables used with time series or inference profiles. If columns are missing in either the primary table or the baseline table, monitoring uses best-effort heuristics to compute the output metrics. \nFor monitors that use a snapshot profile, the baseline table should contain a snapshot of the data where the distribution represents an acceptable quality standard. For example, on grade distribution data, one might set the baseline to a previous class where grades were distributed evenly. 
\nFor monitors that use a time series profile, the baseline table should contain data that represents time window(s) where data distributions represent an acceptable quality standard. For example, on weather data, you might set the baseline to a week, month, or year where the temperature was close to expected normal temperatures. \nFor monitors that use an inference profile, a good choice for a baseline is the data that was used to train or validate the model being monitored. In this way, users can be alerted when the data has drifted relative to what the model was trained and validated on. This table should contain the same feature columns as the primary table, and additionally should have the same `model_id_col` that was specified for the primary table\u2019s InferenceLog so that the data is aggregated consistently. Ideally, the test or validation set used to evaluate the model should be used to ensure comparable model quality metrics. \n### Metric tables and dashboard \nA table monitor creates two metric tables and a dashboard. Metric values are computed for the entire table, and for the time windows and data subsets (or \u201cslices\u201d) that you specify when you create the monitor. In addition, for inference analysis, metrics are computed for each model ID. For more details about the metric tables, see [Monitor metric tables](https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-output.html). \n* The profile metric table contains summary statistics. See the [profile metrics table schema](https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-output.html#profile-metrics-table).\n* The drift metrics table contains statistics related to the data\u2019s drift over time. If a baseline table is provided, drift is also monitored relative to the baseline values. See the [drift metrics table schema](https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-output.html#drift-metrics-table). 
\nThe metric tables are Delta tables and are stored in a Unity Catalog schema that you specify. You can view these tables using the Databricks UI, query them using Databricks SQL, and create dashboards and alerts based on them. \nFor each monitor, Databricks automatically creates a dashboard to help you visualize and present the monitor results. The dashboard is fully customizable like any other [legacy dashboard](https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/index.html"} +{"content":"# \n### Introduction to Databricks Lakehouse Monitoring\n#### Start using Lakehouse Monitoring on Databricks\n\nSee the following articles to get started: \n* [Create a monitor using the Databricks UI](https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-ui.html).\n* [Create a monitor using the API](https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-api.html).\n* [Understand monitor metric tables](https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-output.html).\n* [Work with the monitor dashboard](https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-dashboard.html).\n* [Create SQL alerts based on a monitor](https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-alerts.html).\n* [Create custom metrics](https:\/\/docs.databricks.com\/lakehouse-monitoring\/custom-metrics.html).\n* [Monitor model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html).\n* [Monitor fairness and bias for classification models](https:\/\/docs.databricks.com\/lakehouse-monitoring\/fairness-bias.html).\n* See the reference material for the [Databricks Lakehouse Monitoring API](https:\/\/api-docs.databricks.com\/python\/lakehouse-monitoring\/latest\/index.html).\n* [Example 
notebooks](https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-api.html#example-notebooks).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/index.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Diagnose cost and performance issues using the Spark UI\n\nThis guide walks you through how to use the Spark UI to diagnose cost and performance issues. It\u2019s a step-by-step guide, and it\u2019s a practical how-to. Rather than just providing you an explanation of what each page in the Spark UI does, it tells you what to look for and what it means. If you aren\u2019t familiar with the concepts of driver, workers, executors, stages, and tasks, you might want to review the Spark architecture. \nIf you are looking for a comprehensive list of various optimization tools, use the [Databricks Optimization guide](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide). Sections of the optimization guide are referenced in this Spark UI guide.\n\n#### Diagnose cost and performance issues using the Spark UI\n##### Using this guide\n\nTo navigate through the guide, use the links embedded in each page to be taken to the next step. The guide contains the following steps in order: \n1. [Use the Jobs Timeline to identify major issues](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/jobs-timeline.html)\n2. [Look at longest stage](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/long-spark-stage.html)\n3. [Look for skew or spill](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/long-spark-stage-page.html)\n4. [Determine if longest stage is I\/O bound](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/long-spark-stage-io.html)\n5. 
[Look for other causes of slow stage runtime](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/slow-spark-stage-low-io.html) \nLet\u2019s get started!\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/index.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Diagnose cost and performance issues using the Spark UI\n##### How to open the Spark UI\n\n1. Navigate to your cluster\u2019s page: \n![Navigate to Compute](https:\/\/docs.databricks.com\/_images\/open-spark-ui-1.png)\n2. Click **Spark UI**: \n![Navigate to SparkUI](https:\/\/docs.databricks.com\/_images\/open-spark-ui-2.png)\n\n#### Diagnose cost and performance issues using the Spark UI\n##### Next step\n\nNow that you\u2019ve opened the Spark UI, next review the event timeline to find out more about your pipeline or query. See [Jobs timeline](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/jobs-timeline.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Get started with MLflow experiments\n##### Quickstart R\n###### MLflow quickstart R notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-quick-start-r.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nSee [View notebook experiment](https:\/\/docs.databricks.com\/mlflow\/experiments.html#view-notebook-experiment) for instructions on how to view the experiment, run, and notebook revision used in the quickstart.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/quick-start-r.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deploy models for batch inference and prediction\n### Deep learning model inference workflow\n##### Model inference using PyTorch\n\nThe following notebook 
demonstrates the Databricks recommended [deep learning inference workflow](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/dl-model-inference.html). \nThis example illustrates model inference using PyTorch with a trained ResNet-50 model and image files as input data.\n\n##### Model inference using PyTorch\n###### Model inference with PyTorch notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/pytorch-images.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-inference\/resnet-model-inference-pytorch.html"} +{"content":"# Model serving with Databricks\n### Manage model serving endpoints\n\nThis article describes how to manage model serving endpoints using the **Serving** UI and REST API. See [Serving endpoints](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints) in the REST API reference. \nTo create model serving endpoints use one of the following: \n* [Create custom model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html).\n* [Create foundation model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-foundation-model-endpoints.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/manage-serving-endpoints.html"} +{"content":"# Model serving with Databricks\n### Manage model serving endpoints\n#### Get the status of the model endpoint\n\nIn the **Serving** UI, you can check the status of an endpoint from the **Serving endpoint state** indicator at the top of your endpoint\u2019s details page. 
\nYou can check the status and details of an endpoint programmatically using the REST API or the MLflow Deployments SDK. \n```\nGET \/api\/2.0\/serving-endpoints\/{name}\n\n``` \nThe following example gets the details of an endpoint that serves the first version of the `ads1` model that is registered in the model registry. To specify a model from Unity Catalog, provide the full model name, including the parent catalog and schema, such as `catalog.schema.example-model`. \nIn the following example response, the `state.ready` field is `READY`, which means the endpoint is ready to receive traffic. The `state.update_state` field is `NOT_UPDATING` and `pending_config` is no longer returned because the update finished successfully. \n```\n{\n\"name\": \"workspace-model-endpoint\",\n\"creator\": \"customer@example.com\",\n\"creation_timestamp\": 1666829055000,\n\"last_updated_timestamp\": 1666829055000,\n\"state\": {\n\"ready\": \"READY\",\n\"update_state\": \"NOT_UPDATING\"\n},\n\"config\": {\n\"served_entities\": [\n{\n\"name\": \"ads1-1\",\n\"entity_name\": \"ads1\",\n\"entity_version\": \"1\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": false,\n\"state\": {\n\"deployment\": \"DEPLOYMENT_READY\",\n\"deployment_state_message\": \"\"\n},\n\"creator\": \"customer@example.com\",\n\"creation_timestamp\": 1666829055000\n}\n],\n\"traffic_config\": {\n\"routes\": [\n{\n\"served_model_name\": \"ads1-1\",\n\"traffic_percentage\": 100\n}\n]\n},\n\"config_version\": 1\n},\n\"id\": \"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\",\n\"permission_level\": \"CAN_MANAGE\"\n}\n\n``` \n```\nfrom mlflow.deployments import get_deploy_client\n\nclient = get_deploy_client(\"databricks\")\nendpoint = client.get_endpoint(endpoint=\"chat\")\nassert endpoint == {\n\"name\": \"chat\",\n\"creator\": \"alice@company.com\",\n\"creation_timestamp\": 0,\n\"last_updated_timestamp\": 0,\n\"state\": {...},\n\"config\": {...},\n\"tags\": [...],\n\"id\": 
\"88fd3f75a0d24b0380ddc40484d7a31b\",\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/manage-serving-endpoints.html"} +{"content":"# Model serving with Databricks\n### Manage model serving endpoints\n#### Delete a model serving endpoint\n\nTo disable serving for a model, you can delete the endpoint it\u2019s served on. \nYou can delete an endpoint from the endpoint\u2019s details page in the **Serving** UI. \n1. Click **Serving** on the sidebar.\n2. Click the endpoint you want to delete.\n3. Click the kebab menu at the top and select **Delete**. \nAlternatively, you can delete a serving endpoint programmatically using the REST API or the MLflow Deployments SDK \n```\nDELETE \/api\/2.0\/serving-endpoints\/{name}\n\n``` \n```\nfrom mlflow.deployments import get_deploy_client\n\nclient = get_deploy_client(\"databricks\")\nclient.delete_endpoint(endpoint=\"chat\")\n\n```\n\n### Manage model serving endpoints\n#### Debug your model serving endpoint\n\nTo debug any issues with the endpoint, you can fetch: \n* Model server container build logs\n* Model server logs \nThese logs are also accessible from the **Endpoints** UI in the **Logs** tab. 
\nFor the **build logs** for a served model, you can use the following request: \n```\nGET \/api\/2.0\/serving-endpoints\/{name}\/served-models\/{served-model-name}\/build-logs\n{\n\"config_version\": 1 \/\/ optional\n}\n\n``` \nFor the **model server** logs for a served model, you can use the following request: \n```\nGET \/api\/2.0\/serving-endpoints\/{name}\/served-models\/{served-model-name}\/logs\n\n{\n\"config_version\": 1 \/\/ optional\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/manage-serving-endpoints.html"} +{"content":"# Model serving with Databricks\n### Manage model serving endpoints\n#### Manage permissions on your model serving endpoint\n\nYou must have at least the CAN MANAGE permission on a serving endpoint to modify permissions. For more information on the permission levels, see [Serving endpoint ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#serving-endpoints). \nGet the list of permissions on the serving endpoint. \n```\ndatabricks permissions get servingendpoints <endpoint-id>\n\n``` \nGrant user `jsmith@example.com` the CAN QUERY permission on the serving endpoint. \n```\ndatabricks permissions update servingendpoints <endpoint-id> --json '{\n\"access_control_list\": [\n{\n\"user_name\": \"jsmith@example.com\",\n\"permission_level\": \"CAN_QUERY\"\n}\n]\n}'\n\n``` \nYou can also modify serving endpoint permissions using the [Permissions API](https:\/\/docs.databricks.com\/api\/workspace\/permissions).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/manage-serving-endpoints.html"} +{"content":"# Model serving with Databricks\n### Manage model serving endpoints\n#### Get a model serving endpoint schema\n\nPreview \nSupport for serving endpoint query schemas is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
This functionality is available in [Model Serving regions](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html#regions). \nA serving endpoint query schema is a formal description of the serving endpoint using the standard OpenAPI specification in JSON format. It contains information about the endpoint including the endpoint path, details for querying the endpoint like the request and response body format, and data type for each field. This information can be helpful for reproducibility scenarios or when you need information about the endpoint, but are not the original endpoint creator or owner. \nTo get the model serving endpoint schema, the served model must have a model signature logged and the endpoint must be in a `READY` state. \nThe following examples demonstrate how to programmatically get the model serving endpoint schema using the REST API. For feature serving endpoint schemas, see [What is Databricks Feature Serving?](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html). \nThe schema returned by the API is in the format of a JSON object that follows the OpenAPI specification. \n```\nACCESS_TOKEN=\"<endpoint-token>\"\nENDPOINT_NAME=\"<endpoint name>\"\n\ncurl \"https:\/\/example.databricks.com\/api\/2.0\/serving-endpoints\/$ENDPOINT_NAME\/openapi\" -H \"Authorization: Bearer $ACCESS_TOKEN\" -H \"Content-Type: application\/json\"\n\n``` \n### Schema response details \nThe response is an OpenAPI specification in JSON format, typically including fields like `openapi`, `info`, `servers` and `paths`. Since the schema response is a JSON object, you can parse it using common programming languages, and generate client code from the specification using third-party tools.\nYou can also visualize the OpenAPI specification using third-party tools like Swagger Editor. 
\nThe main fields of the response include: \n* The `info.title` field shows the name of the serving endpoint.\n* The `servers` field always contains one object, typically the `url` field which is the base url of the endpoint.\n* The `paths` object in the response contains all supported paths for an endpoint. The keys in the object are the path URL. Each `path` can support multiple formats of inputs. These inputs are listed in the `oneOf` field. \nThe following is an example endpoint schema response: \n```\n{\n\"openapi\": \"3.1.0\",\n\"info\": {\n\"title\": \"example-endpoint\",\n\"version\": \"2\"\n},\n\"servers\": [{ \"url\": \"https:\/\/example.databricks.com\/serving-endpoints\/example-endpoint\"}],\n\"paths\": {\n\"\/served-models\/vanilla_simple_model-2\/invocations\": {\n\"post\": {\n\"requestBody\": {\n\"content\": {\n\"application\/json\": {\n\"schema\": {\n\"oneOf\": [\n{\n\"type\": \"object\",\n\"properties\": {\n\"dataframe_split\": {\n\"type\": \"object\",\n\"properties\": {\n\"columns\": {\n\"description\": \"required fields: int_col\",\n\"type\": \"array\",\n\"items\": {\n\"type\": \"string\",\n\"enum\": [\n\"int_col\",\n\"float_col\",\n\"string_col\"\n]\n}\n},\n\"data\": {\n\"type\": \"array\",\n\"items\": {\n\"type\": \"array\",\n\"prefixItems\": [\n{\n\"type\": \"integer\",\n\"format\": \"int64\"\n},\n{\n\"type\": \"number\",\n\"format\": \"double\"\n},\n{\n\"type\": \"string\"\n}\n]\n}\n}\n}\n},\n\"params\": {\n\"type\": \"object\",\n\"properties\": {\n\"sentiment\": {\n\"type\": \"number\",\n\"format\": \"double\",\n\"default\": \"0.5\"\n}\n}\n}\n},\n\"examples\": [\n{\n\"columns\": [\n\"int_col\",\n\"float_col\",\n\"string_col\"\n],\n\"data\": [\n[\n3,\n10.4,\n\"abc\"\n],\n[\n2,\n20.4,\n\"xyz\"\n]\n]\n}\n]\n},\n{\n\"type\": \"object\",\n\"properties\": {\n\"dataframe_records\": {\n\"type\": \"array\",\n\"items\": {\n\"required\": [\n\"int_col\",\n\"float_col\",\n\"string_col\"\n],\n\"type\": \"object\",\n\"properties\": {\n\"int_col\": 
{\n\"type\": \"integer\",\n\"format\": \"int64\"\n},\n\"float_col\": {\n\"type\": \"number\",\n\"format\": \"double\"\n},\n\"string_col\": {\n\"type\": \"string\"\n},\n\"becx_col\": {\n\"type\": \"object\",\n\"format\": \"unknown\"\n}\n}\n}\n},\n\"params\": {\n\"type\": \"object\",\n\"properties\": {\n\"sentiment\": {\n\"type\": \"number\",\n\"format\": \"double\",\n\"default\": \"0.5\"\n}\n}\n}\n}\n}\n]\n}\n}\n}\n},\n\"responses\": {\n\"200\": {\n\"description\": \"Successful operation\",\n\"content\": {\n\"application\/json\": {\n\"schema\": {\n\"type\": \"object\",\n\"properties\": {\n\"predictions\": {\n\"type\": \"array\",\n\"items\": {\n\"type\": \"number\",\n\"format\": \"double\"\n}\n}\n}\n}\n}\n}\n}\n}\n}\n}\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/manage-serving-endpoints.html"} +{"content":"# Data governance with Unity Catalog\n## Hive metastore table access control (legacy)\n#### Hive metastore privileges and securable objects (legacy)\n\nThis article describes the privilege model for the legacy Databricks Hive metastore, which is built in to each Databricks workspace. It also describes how to grant, deny, and revoke privileges for objects in the built-in Hive metastore. Unity Catalog uses a different model for granting privileges. See [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html). \nNote \nTable access control for data managed by the Hive metastore is a legacy data governance model. Databricks recommends that you [upgrade the tables managed by the Hive metastore to the Unity Catalog metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html). Unity Catalog simplifies security and governance of your data by providing a central place to administer and audit data access across multiple workspaces in your account. 
To learn more about how the legacy privilege model differs from the Unity Catalog privilege model, see [Work with Unity Catalog and the legacy Hive metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/hive-metastore.html).\n\n#### Hive metastore privileges and securable objects (legacy)\n##### Requirements\n\n* An administrator must [enable and enforce table access control](https:\/\/docs.databricks.com\/data-governance\/table-acls\/table-acl.html#enable-table-acl-workspace) for the workspace.\n* The cluster must be enabled for [table access control](https:\/\/docs.databricks.com\/data-governance\/table-acls\/table-acl.html#table-access-control). \nNote \n* Data access control is *always enabled* in Databricks SQL even if table access control is *not enabled* for the workspace.\n* If table access control is enabled for the workspace and you have already specified ACLs (granted and denied privileges) in the workspace, those ACLs are respected in Databricks SQL.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/table-acls\/object-privileges.html"} +{"content":"# Data governance with Unity Catalog\n## Hive metastore table access control (legacy)\n#### Hive metastore privileges and securable objects (legacy)\n##### Manage privileges on objects in the Hive metastore\n\nPrivileges on data objects managed by the Hive metastore can be granted by either a workspace admin or the owner of an object. You can manage privileges for Hive metastore objects by using SQL commands. 
\nTo manage privileges in SQL, you use [GRANT](https:\/\/docs.databricks.com\/sql\/language-manual\/security-grant.html), [REVOKE](https:\/\/docs.databricks.com\/sql\/language-manual\/security-revoke.html), [DENY](https:\/\/docs.databricks.com\/sql\/language-manual\/security-deny.html), [MSCK](https:\/\/docs.databricks.com\/sql\/language-manual\/security-msck.html), and [SHOW GRANTS](https:\/\/docs.databricks.com\/sql\/language-manual\/security-show-grant.html) statements in a notebook or the Databricks SQL query editor, using the syntax: \n```\nGRANT privilege_type ON securable_object TO principal\n\n``` \nWhere: \n* `privilege_type` is a [Hive metastore privilege type](https:\/\/docs.databricks.com\/data-governance\/table-acls\/object-privileges.html#privilege-types)\n* `securable_object` is a [securable object in the Hive metastore](https:\/\/docs.databricks.com\/data-governance\/table-acls\/object-privileges.html#securable-objects)\n* `principal` is a user, service principal (represented by its applicationId value), or group. You must enclose users, service principals, and group names with [special characters](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-identifiers.html#delimited-identifiers) in backticks ( `` `` ). See [Principal](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-principal.html). \nTo grant a privilege to all users in your workspace, grant the privilege to the `users` group. For example: \n```\nGRANT SELECT ON TABLE <schema-name>.<table-name> TO users\n\n``` \nFor more information about managing privileges for objects in the Hive metastore using SQL commands, see [Privileges and securable objects in the Hive metastore](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-privileges-hms.html). 
\nYou can also manage table access control in a fully automated setup using the [Databricks Terraform](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) provider and [databricks\\_sql\\_permissions](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/sql_permissions#example-usage).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/table-acls\/object-privileges.html"} +{"content":"# Data governance with Unity Catalog\n## Hive metastore table access control (legacy)\n#### Hive metastore privileges and securable objects (legacy)\n##### Object ownership\n\nWhen table access control is enabled on a cluster or SQL warehouse, a user who creates a schema, table, view, or function\nbecomes its owner. The owner is granted all privileges and can grant privileges to other users. \nGroups may own objects, in which case all members of that group are considered owners. \nEither the owner of an object or a workspace admin can transfer ownership of an object using the following command: \n```\nALTER <object> OWNER TO `<user-name>@<user-domain>.com`\n\n``` \nNote \nWhen table access control is disabled on a cluster or SQL warehouse, owners are not registered when a schema, table, or view is created. A workspace admin must assign an owner to the object using the `ALTER <object> OWNER TO` command.\n\n#### Hive metastore privileges and securable objects (legacy)\n##### Securable objects in the Hive metastore\n\nThe securable objects are: \n* `CATALOG`: controls access to the entire data catalog. \n+ `SCHEMA`: controls access to a schema. \n- `TABLE`: controls access to a managed or external table.\n- `VIEW`: controls access to SQL views.\n- `FUNCTION`: controls access to a named function.\n* `ANONYMOUS FUNCTION`: controls access to [anonymous or temporary functions](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-function.html). 
\nNote \n`ANONYMOUS FUNCTION` objects are not supported in Databricks SQL.\n* `ANY FILE`: controls access to the underlying filesystem. \nWarning \nUsers granted access to `ANY FILE` can bypass the restrictions put on the catalog, schemas, tables, and views by reading from the filesystem directly. \nNote \nPrivileges on global and local temporary views are not supported. Local temporary views are visible only within the same session, and views created in the `global_temp` schema are visible to all users sharing a cluster or SQL warehouse. However, privileges on the underlying tables and views referenced by any temporary views are enforced.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/table-acls\/object-privileges.html"} +{"content":"# Data governance with Unity Catalog\n## Hive metastore table access control (legacy)\n#### Hive metastore privileges and securable objects (legacy)\n##### Privileges you can grant on Hive metastore objects\n\n* `SELECT`: gives read access to an object.\n* `CREATE`: gives ability to create an object (for example, a table in a schema).\n* `MODIFY`: gives ability to add, delete, and modify data to or from an object.\n* `USAGE`: does not give any abilities, but is an additional requirement to perform any action on a schema object.\n* `READ_METADATA`: gives ability to view an object and its metadata.\n* `CREATE_NAMED_FUNCTION`: gives ability to create a named UDF in an existing catalog or schema.\n* `MODIFY_CLASSPATH`: gives ability to add files to the Spark class path.\n* `ALL PRIVILEGES`: gives all privileges (is translated into all the above privileges). \nNote \nThe `MODIFY_CLASSPATH` privilege is not supported in Databricks SQL. \n### `USAGE` privilege \nTo perform an action on a schema object in the Hive metastore, a user must have the `USAGE` privilege on that schema in addition to the privilege to perform that action. 
Any one of the following satisfies the `USAGE` requirement: \n* Be a workspace admin\n* Have the `USAGE` privilege on the schema or be in a group that has the `USAGE` privilege on the schema\n* Have the `USAGE` privilege on the `CATALOG` or be in a group that has the `USAGE` privilege\n* Be the owner of the schema or be in a group that owns the schema \nEven the owner of an object inside a schema must have the `USAGE` privilege in order to use it.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/table-acls\/object-privileges.html"} +{"content":"# Data governance with Unity Catalog\n## Hive metastore table access control (legacy)\n#### Hive metastore privileges and securable objects (legacy)\n##### Privilege hierarchy\n\nWhen table access control is enabled on the workspace and on all clusters, SQL objects in Databricks are hierarchical and privileges are inherited downward. This means that granting or denying a privilege on the `CATALOG` automatically grants or denies the privilege to all schemas in the catalog. Similarly, privileges granted on a schema object are inherited by all objects in that schema. This pattern is true for all securable objects. \nIf you deny a user privileges on a table, the user can\u2019t see the table by attempting to list all tables in the schema. If you deny a user privileges on a schema, the user can\u2019t see that the schema exists by attempting to list all schemas in the catalog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/table-acls\/object-privileges.html"} +{"content":"# Data governance with Unity Catalog\n## Hive metastore table access control (legacy)\n#### Hive metastore privileges and securable objects (legacy)\n##### Dynamic view functions\n\nDatabricks includes two user functions that allow you to express column- and row-level permissions dynamically in the body of a view definition that is managed by the Hive metastore. 
\n* `current_user()`: return the current user name.\n* `is_member()`: determine if the current user is a member of a specific Databricks [group](https:\/\/docs.databricks.com\/admin\/users-groups\/groups.html) at the workspace level. \nThe following example combines both functions to determine if a user has the appropriate group membership: \n```\n-- Return: true if the user is a member and false if they are not\nSELECT\ncurrent_user as user,\n-- Check to see if the current user is a member of the \"Managers\" group.\nis_member(\"Managers\") as admin\n\n``` \n### Column-level permissions \nYou can use dynamic views to limit the columns a specific group or user can see. Consider the following example where only users who belong to the `auditors` group are able to see email addresses from the `sales_raw` table. At analysis time Spark replaces the `CASE` statement with either the literal `'REDACTED'` or the column `email`. This behavior allows for all the usual performance optimizations provided by Spark. \n```\n-- Alias the field 'email' to itself (as 'email') to prevent the\n-- permission logic from showing up directly in the column name results.\nCREATE VIEW sales_redacted AS\nSELECT\nuser_id,\nCASE WHEN\nis_group_member('auditors') THEN email\nELSE 'REDACTED'\nEND AS email,\ncountry,\nproduct,\ntotal\nFROM sales_raw\n\n``` \n### Row-level permissions \nUsing dynamic views you can specify permissions down to the row or field level. Consider the following example, where only users who belong to the `managers` group are able to see transaction amounts (`total` column) greater than $1,000,000.00: \n```\nCREATE VIEW sales_redacted AS\nSELECT\nuser_id,\ncountry,\nproduct,\ntotal\nFROM sales_raw\nWHERE\nCASE\nWHEN is_group_member('managers') THEN TRUE\nELSE total <= 1000000\nEND;\n\n``` \n### Data masking \nAs shown in the preceding examples, you can implement column-level masking to prevent users from seeing specific column data unless they are in the correct group. 
Because these views are standard Spark SQL, you can do more advanced types of masking with more complex SQL expressions. The following example lets all users perform analysis on email domains, but lets only members of the `auditors` group see users\u2019 full email addresses. \n```\n-- The regexp_extract function takes an email address such as\n-- user.x.lastname@example.com and extracts 'example.com', allowing\n-- analysts to query the domain name\n\nCREATE VIEW sales_redacted AS\nSELECT\nuser_id,\nregion,\nCASE\nWHEN is_group_member('auditors') THEN email\nELSE regexp_extract(email, '^.*@(.*)$', 1)\nEND\nFROM sales_raw\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/table-acls\/object-privileges.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n### What are workspace files?\n##### Workspace files basic usage\n\nYou can use the workspace UI to perform basic tasks like creating, importing, and editing workspace files. \nNote \nAll files present in a repository are synced as workspace files automatically when you [clone a Git repository](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html).\n\n##### Workspace files basic usage\n###### Create a new file\n\nYou can create a new file in any Databricks directory. Click the down arrow next to the directory name, and select **Create > File** from the menu.\n\n##### Workspace files basic usage\n###### Import a file\n\nTo import a file, click the down arrow next to the directory name, and select **Import**. \nThe import dialog appears. You can drag files into the dialog or click **browse** to select files. 
\nNote \n* Only notebooks can be imported from a URL.\n* When you import a .zip file, Databricks automatically unzips the file and imports each file and notebook that is included in the .zip file.\n* You can import .whl files to use as libraries.\n\n##### Workspace files basic usage\n###### Edit a file\n\nTo edit a file, click the filename in the workspace browser. The file opens and you can edit it. Changes are saved automatically. \nWhen you open a Markdown (`.md`) file, the rendered preview is displayed by default. To edit a cell, double-click in the cell. To return to preview mode, click anywhere outside the cell. \nThe editor includes additional functionality such as autocomplete, multicursor support, and the ability to run code. For more information, see [Use the Databricks notebook and file editor](https:\/\/docs.databricks.com\/notebooks\/notebook-editor.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/workspace-basics.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Install libraries from a package repository\n\nDatabricks provides tools to install libraries from PyPI, Maven, and CRAN package repositories. See [Cluster-scoped libraries](https:\/\/docs.databricks.com\/libraries\/index.html#compatibility) for full library compatibility details. \nImportant \nLibraries can be installed from DBFS when using Databricks Runtime 14.3 LTS and below. However, any workspace user can modify library files stored in DBFS. To improve the security of libraries in a Databricks workspace, storing library files in the DBFS root is deprecated and disabled by default in Databricks Runtime 15.0 and above. See [Storing libraries in DBFS root is deprecated and disabled by default](https:\/\/docs.databricks.com\/release-notes\/runtime\/15.0.html#libraries-dbfs-deprecation). 
\nInstead, Databricks [recommends](https:\/\/docs.databricks.com\/libraries\/index.html#recommendations) uploading all libraries, including Python libraries, JAR files, and Spark connectors, to workspace files or Unity Catalog volumes, or using library package repositories. If your workload does not support these patterns, you can also use libraries stored in cloud object storage.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/package-repositories.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Install libraries from a package repository\n##### PyPI package\n\n1. In the **Library Source** button list, select **PyPI**.\n2. Enter a PyPI package name. To install a specific version of a library, use this format for the library: `<library>==<version>`. For example, `scikit-learn==0.19.1`. \nNote \nFor [jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html), Databricks recommends that you specify a library version to ensure a reproducible environment. If the library version is not fully specified, Databricks uses the latest matching version. This means that different runs of the same job might use different library versions as new versions are published. Specifying the library version prevents new, breaking changes in libraries from breaking your jobs.\n3. (Optional) In the Index URL field enter a PyPI index URL.\n4. Click **Install**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/package-repositories.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Install libraries from a package repository\n##### Maven or Spark package\n\nImportant \nTo install Maven libraries on compute configured with shared access mode, you must add the coordinates to the allowlist. See [Allowlist libraries and init scripts on shared compute](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/allowlist.html). 
\nImportant \nFor DBR 14.3 LTS and below, Databricks uses Apache Ivy 2.4.0 to resolve Maven packages. For DBR 15.0 and above, Databricks uses Ivy 2.5.1 or greater and the specific Ivy version is listed in [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). \nThe installation order of Maven packages may affect the final dependency tree, which can impact the order in which libraries are loaded. \n1. In the **Library Source** button list, select **Maven**.\n2. Specify a Maven coordinate. Do one of the following: \n* In the Coordinate field, enter the Maven coordinate of the library to install. Maven coordinates are in the form `groupId:artifactId:version`; for example, `com.databricks:spark-avro_2.10:1.0.0`.\n* If you don\u2019t know the exact coordinate, enter the library name and click **Search Packages**. A list of matching packages displays. To display details about a package, click its name. You can sort packages by name, organization, and rating. You can also filter the results by writing a query in the search bar. The results refresh automatically. \n1. Select **Maven Central** or **Spark Packages** in the drop-down list at the top left.\n2. Optionally select the package version in the Releases column.\n3. Click **+ Select** next to a package. The Coordinate field is filled in with the selected package and version.\n3. (Optional) In the Repository field, you can enter a Maven repository URL. \nNote \nInternal Maven repositories are not supported.\n4. In the **Exclusions** field, optionally provide the `groupId` and the `artifactId` of the dependencies that you want to exclude (for example, `log4j:log4j`). \nNote \nMaven uses the closest-to-root version, and in the case of two packages vying for versions with different dependencies, the order matters, so it may fail when the package with an older dependency gets loaded first. 
\nTo work around this, use the **Exclusions** field to exclude the conflicting library.\n5. Click **Install**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/package-repositories.html"} +{"content":"# Databricks data engineering\n## Libraries\n#### Install libraries from a package repository\n##### CRAN package\n\n1. In the **Library Source** button list, select **CRAN**.\n2. In the Package field, enter the name of the package.\n3. (Optional) In the Repository field, you can enter the CRAN repository URL.\n4. Click **Install**. \nNote \nCRAN mirrors serve the latest version of a library. As a result, you may end up with different versions of an R package if you attach the library to different clusters at different times. To learn how to manage and fix R package versions on Databricks, see the [Knowledge Base](https:\/\/kb.databricks.com\/r\/pin-r-packages.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/libraries\/package-repositories.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Set up Databricks Git folders (Repos)\n##### Configure Git credentials & connect a remote repo to Databricks\n\nThis article describes how to configure your Git credentials in Databricks so that you can connect a remote repo using Databricks Git folders (formerly Repos). \nFor a list of supported Git providers (cloud and on-premises), read [Supported Git providers](https:\/\/docs.databricks.com\/repos\/index.html#supported-git-providers). 
\n* [Authenticate a GitHub account using OAuth 2.0](https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html#github-link-account)\n* [Authenticate a GitHub account using a PAT](https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html#github-personal-access-token)\n* [Authenticate a GitHub account using a fine-grained PAT](https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html#github-personal-access-token-fg)\n* [Authenticate a GitLab account using a PAT](https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html#gitlab)\n* [Authenticate access to a Microsoft Azure DevOps repo](https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html#devops)\n* [Authenticate access to an Atlassian Bitbucket repo](https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html#bitbucket)\n* [Authenticate access to an AWS CodeCommit repo](https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html#codecommit)\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Set up Databricks Git folders (Repos)\n##### Configure Git credentials & connect a remote repo to Databricks\n###### GitHub and GitHub AE\n\nThe following information applies to GitHub and GitHub AE users. \n### Why use the Databricks GitHub App instead of a PAT? \nDatabricks Git folders allows you to choose the Databricks GitHub App for user authentication instead of PATs if you are using a hosted GitHub account. Using the GitHub App provides the following benefits over PATs: \n* It uses OAuth 2.0 for user authentication. 
OAuth 2.0 repo traffic is encrypted for strong security.\n* It is easier to integrate ([see the steps below](https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html#github-link-account)) and does not require individual tracking of tokens.\n* Token renewal is handled automatically.\n* The integration can be scoped to specific attached Git repos, allowing you more granular control over access. \nImportant \nAs per standard OAuth 2.0 integration, Databricks stores a user\u2019s access and refresh tokens\u2013all other access control is handled by GitHub. Access and refresh tokens follow GitHub\u2019s default expiry rules with access tokens expiring after 8 hours (which minimizes risk in the event of credential leak). Refresh tokens have a 6-month lifetime if unused. Linked credentials expire after 6 months of inactivity, requiring the user to reconfigure them. \nYou can optionally encrypt Databricks tokens using [customer-managed keys](https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html) (CMK). \n### Link your GitHub account using Databricks GitHub App \nNote \n* This feature is not supported in GitHub Enterprise Server. Use a personal access token instead. \nIn Databricks, link your GitHub account on the User Settings page: \n1. In the upper-right corner of any page, click your username, then select **Settings**.\n2. Click the **Linked accounts** tab.\n3. Change your provider to GitHub, select **Link Git account**, and click **Link**. \n![Link GitHub account in Databricks](https:\/\/docs.databricks.com\/_images\/link-github-account-in-databricks.png)\n4. The Databricks GitHub App authorization page appears. Authorize the GitHub App to complete the setup, which allows Databricks to act on your behalf when you perform Git operations in Git folders (such as cloning a repository). 
See the [GitHub documentation](https:\/\/docs.github.com\/en\/apps\/using-github-apps\/authorizing-github-apps) for more details on app authorization. \n![Databricks GitHub app authorization page](https:\/\/docs.databricks.com\/_images\/databricks-github-app-authorization-page.png)\n5. To allow access to GitHub repositories, follow the steps below to install and configure the Databricks GitHub app. \n#### Install and configure the [Databricks GitHub App](https:\/\/github.com\/apps\/databricks) to allow access to repositories \nYou can install and configure the Databricks GitHub App on GitHub repositories that you want to access from Databricks Git folders. See the [GitHub documentation](https:\/\/docs.github.com\/en\/apps\/using-github-apps\/installing-a-github-app-from-github-marketplace-for-your-organizations#about-installing-github-apps) for more details on app installation. \n1. Open the [Databricks GitHub App installation page](https:\/\/github.com\/apps\/databricks\/installations\/new).\n2. Select the account that owns the repositories you want to access. \n![Databricks GitHub app installation page](https:\/\/docs.databricks.com\/_images\/databricks-github-app-installation-page.png)\n3. If you are not an owner of the account, you must have the account owner install and configure the app for you.\n4. If you are the account owner, install the GitHub App. Installing it gives read and write access to code. Code is only accessed on behalf of users (for example, when a user clones a repository in Databricks Git folders).\n5. Optionally, you can give access to only a subset of repositories by selecting the **Only select repositories** option. \n### Connect to a GitHub repo using a personal access token \nIn GitHub, follow these steps to create a personal access token that allows access to your repositories: \n1. In the upper-right corner of any page, click your profile photo, then click **Settings**.\n2. Click **Developer settings**.\n3. 
Click the **Personal access tokens** tab in the left-hand pane, and then click **Tokens (classic)**.\n4. Click the **Generate new token** button.\n5. Enter a token description.\n6. Select the **repo** scope and **workflow** scope, and click the **Generate token** button. The **workflow** scope is required if your repository has GitHub Actions workflows.\n7. Copy the token to your clipboard. You enter this token in Databricks under **User Settings > Linked accounts**. \nTo use single sign-on, see [Authorizing a personal access token for use with SAML single sign-on](https:\/\/docs.github.com\/en\/enterprise-cloud@latest\/authentication\/authenticating-with-saml-single-sign-on\/authorizing-a-personal-access-token-for-use-with-saml-single-sign-on). \nNote \nHaving trouble installing the Databricks GitHub App on your account or organization? See the [GitHub App installation documentation](https:\/\/docs.github.com\/en\/enterprise-cloud@latest\/apps\/using-github-apps\/installing-a-github-app-from-a-third-party#installing-a-github-app) for troubleshooting guidance. \n### Connect to a GitHub repo using a fine-grained personal access token \nAs a best practice, use a fine-grained PAT that only grants access to the resources you will access in your project. In GitHub, follow these steps to create a fine-grained PAT that allows access to your repositories: \n1. In the upper-right corner of any page, click your profile photo, then click **Settings**.\n2. Click **Developer settings**.\n3. Click the **Personal access tokens** tab in the left-hand pane, and then click **Fine-grained tokens**.\n4. Click the **Generate new token** button in the upper-right of the page to open the **New fine-grained personal access token** page. \n![Generate GitHub token](https:\/\/docs.databricks.com\/_images\/github-newtoken.png)\n5. Configure your new fine-grained token from the following settings: \n* **Token name**: Provide a unique token name. 
Write it down somewhere so you don\u2019t forget or lose it!\n* **Expiration**: Select the time period for token expiry. The default is \u201c30 days\u201d.\n* **Description**: Add some short text describing the purpose of the token.\n* **Resource owner**: The default is your current GitHub ID. Set this to the GitHub organization that owns the repo(s) you will access.\n* Under **Repository access**, choose the access scope for your token. As a best practice, select only those repositories that you will be using for Git folder version control.\n* Under **Permissions**, configure the specific access levels granted by this token for the repositories and account you will work with. For more details on the permission groups, read [Permissions required for fine-grained personal access tokens](https:\/\/docs.github.com\/en\/rest\/authentication\/permissions-required-for-fine-grained-personal-access-tokens) in the GitHub documentation. \nSet the access permissions for **Contents** to **Read and write**. (You find the Contents scope under **Repository permissions**.) For details on this scope, see [the GitHub documentation on the Contents scope](https:\/\/docs.github.com\/en\/rest\/authentication\/permissions-required-for-fine-grained-personal-access-tokens?apiVersion=2022-11-28#repository-permissions-for-contents). \n![Setting the permissions for a fine-grained Git PAT to read-and-write through the GitHub UI](https:\/\/docs.databricks.com\/_images\/git-pat-fine-grained-contents.png)\n6. Click the **Generate token** button.\n7. Copy the token to your clipboard. 
You enter this token in Databricks under **User Settings > Linked accounts**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Set up Databricks Git folders (Repos)\n##### Configure Git credentials & connect a remote repo to Databricks\n###### GitLab\n\nIn GitLab, follow these steps to create a personal access token that allows access to your repositories: \n1. From GitLab, click your user icon in the upper-left corner of the screen and select **Preferences**.\n2. Click **Access Tokens** in the sidebar.\n3. Click **Add new token** in the Personal Access Tokens section of the page. \n![Generate GitLab token](https:\/\/docs.databricks.com\/_images\/gitlab-newtoken.png)\n4. Enter a name for the token.\n5. Select the specific scopes to provide access by checking the boxes for your desired permission levels. For more details on the scope options, read the [GitLab documentation on PAT scopes](https:\/\/gitlab.com\/help\/user\/profile\/personal_access_tokens#personal-access-token-scopes).\n6. Click **Create personal access token**.\n7. Copy the token to your clipboard. Enter this token in Databricks under **User Settings > Linked accounts**. \nSee the [GitLab documentation](https:\/\/docs.gitlab.com\/ee\/user\/profile\/personal_access_tokens.html) to learn more about how to create and manage personal access tokens. \nGitLab also provides support for fine-grained access using \u201cProject Access Tokens\u201d. You can use Project Access Tokens to scope access to a GitLab project. 
For more details, read [GitLab\u2019s documentation on Project Access Tokens](https:\/\/docs.gitlab.com\/ee\/user\/project\/settings\/project_access_tokens.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Set up Databricks Git folders (Repos)\n##### Configure Git credentials & connect a remote repo to Databricks\n###### AWS CodeCommit\n\nIn AWS CodeCommit, follow these steps to create an **HTTPS Git credential** that allows access to your repositories: \n1. In AWS CodeCommit, create HTTPS Git credentials that allow access to your repositories. See the [AWS CodeCommit](https:\/\/docs.aws.amazon.com\/codecommit\/latest\/userguide\/setting-up-gc.html) documentation. The associated IAM user must have \u201cread\u201d and \u201cwrite\u201d permissions for the repository.\n2. Record the password. You enter this password in Databricks under **User Settings > Linked accounts**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Set up Databricks Git folders (Repos)\n##### Configure Git credentials & connect a remote repo to Databricks\n###### Azure DevOps Services\n\n### Connect to an Azure DevOps repo using a token \nThe following steps show you how to connect a Databricks repo to an Azure DevOps repo when they aren\u2019t in the same Microsoft Entra ID tenancy. \nThe service endpoint for Microsoft Entra ID must be accessible from both the private and public subnets of the Databricks workspace. For more information, see [VPC peering](https:\/\/docs.databricks.com\/security\/network\/classic\/vpc-peering.html). \nGet an access token for the repository in Azure DevOps: \n1. 
Go to dev.azure.com, and then sign in to the DevOps organization containing the repository you want to connect Databricks to.\n2. In the upper-right corner, click the User Settings icon and select **Personal Access Tokens**.\n3. Click **+ New Token**.\n4. Enter information into the form: \n1. Name the token.\n2. Select the organization name, which is the repo name.\n3. Set an expiration date.\n4. Choose the scope required, such as **Full access**.\n5. Copy the access token displayed.\n6. Enter this token in Databricks under **User Settings > Linked accounts**.\n7. 
In **Git provider username or email**, enter the email address you use to log in to the DevOps organization. \nThe [Azure DevOps documentation](https:\/\/learn.microsoft.com\/azure\/devops\/organizations\/accounts\/use-personal-access-tokens-to-authenticate?view=azure-devops&tabs=Windows) contains more information about Azure DevOps personal access tokens.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n### Set up Databricks Git folders (Repos)\n##### Configure Git credentials & connect a remote repo to Databricks\n###### Bitbucket\n\nNote \nDatabricks does not support Bitbucket Repository Access Tokens or Project Access Tokens. \nIn Bitbucket, follow these steps to create an app password that allows access to your repositories: \n1. Go to Bitbucket Cloud and create an app password that allows access to your repositories. See the [Bitbucket Cloud documentation](https:\/\/confluence.atlassian.com\/bitbucket\/app-passwords-828781300.html).\n2. Record the password in a secure manner.\n3. In Databricks, enter this password under **User Settings > Linked accounts**.\n\n##### Configure Git credentials & connect a remote repo to Databricks\n###### Other Git providers\n\nIf your Git provider is not listed, selecting \u201cGitHub\u201d and providing it the PAT you obtained from your Git provider often works, but is not guaranteed to work.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/get-access-tokens-from-git-provider.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Get started with MLflow experiments\n##### Quickstart Python\n\n[MLflow](https:\/\/www.mlflow.org\/) is an open source platform for managing the end-to-end machine\nlearning lifecycle. MLflow provides simple APIs for logging metrics (for example, model loss),\nparameters (for example, learning rate), and fitted models, making it easy to analyze training\nresults or deploy models later on. 
\nIn this section: \n* [Install MLflow](https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html#install-mlflow)\n* [Automatically log training runs to MLflow](https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html#automatically-log-training-runs-to-mlflow)\n* [View results](https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html#view-results)\n* [Track additional metrics, parameters, and models](https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html#track-additional-metrics-parameters-and-models)\n* [Example notebooks](https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html#example-notebooks)\n* [Learn more](https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html#learn-more)\n\n##### Quickstart Python\n###### [Install MLflow](https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html#id1)\n\nIf you\u2019re using [Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html), MLflow is already installed.\nOtherwise, [install the MLflow package from PyPI](https:\/\/docs.databricks.com\/libraries\/cluster-libraries.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Get started with MLflow experiments\n##### Quickstart Python\n###### [Automatically log training runs to MLflow](https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html#id2)\n\nWith Databricks Runtime 10.4 LTS ML and above, [Databricks Autologging](https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html) is enabled by default and automatically captures model parameters, metrics, files, and lineage information when you train models from a variety of popular machine learning libraries. \nWith Databricks Runtime 9.1 LTS ML, MLflow provides `mlflow.<framework>.autolog()` APIs to automatically log training code written in many ML frameworks. 
You can call this API before running training code to log model-specific metrics, parameters, and model artifacts. \nNote \nKeras models are also supported in `mlflow.tensorflow.autolog()`. \n```\n# Also autoinstruments tf.keras\nimport mlflow.tensorflow\nmlflow.tensorflow.autolog()\n\n``` \n```\nimport mlflow.xgboost\nmlflow.xgboost.autolog()\n\n``` \n```\nimport mlflow.lightgbm\nmlflow.lightgbm.autolog()\n\n``` \n```\nimport mlflow.sklearn\nmlflow.sklearn.autolog()\n\n``` \nIf performing tuning with `pyspark.ml`, metrics and models are automatically logged to MLflow.\nSee [Apache Spark MLlib and automated MLflow tracking](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/mllib-mlflow-integration.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Get started with MLflow experiments\n##### Quickstart Python\n###### [View results](https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html#id3)\n\nAfter executing your machine learning code, you can view results using the Experiment Runs sidebar. See [View notebook experiment](https:\/\/docs.databricks.com\/mlflow\/experiments.html#view-notebook-experiment) for instructions on how to view the experiment, run, and notebook revision used in the quickstart.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Get started with MLflow experiments\n##### Quickstart Python\n###### [Track additional metrics, parameters, and models](https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html#id4)\n\nYou can log additional information by directly invoking the\n[MLflow Tracking logging APIs](https:\/\/www.mlflow.org\/docs\/latest\/tracking.html#logging-functions). 
\n### Numerical metrics \n```\nimport mlflow\nmlflow.log_metric(\"accuracy\", 0.9)\n\n``` \n### Training parameters \n```\nimport mlflow\nmlflow.log_param(\"learning_rate\", 0.001)\n\n``` \n### Models \n```\nimport mlflow.sklearn\nmlflow.sklearn.log_model(model, \"myModel\")\n\n``` \n```\nimport mlflow.spark\nmlflow.spark.log_model(model, \"myModel\")\n\n``` \n```\nimport mlflow.xgboost\nmlflow.xgboost.log_model(model, \"myModel\")\n\n``` \n```\nimport mlflow.tensorflow\nmlflow.tensorflow.log_model(model, \"myModel\")\n\n``` \n```\nimport mlflow.keras\nmlflow.keras.log_model(model, \"myModel\")\n\n``` \n```\nimport mlflow.pytorch\nmlflow.pytorch.log_model(model, \"myModel\")\n\n``` \n```\nimport mlflow.spacy\nmlflow.spacy.log_model(model, \"myModel\")\n\n``` \n### Other artifacts (files) \n```\nimport mlflow\nmlflow.log_artifact(\"\/tmp\/my-file\", \"myArtifactPath\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Get started with MLflow experiments\n##### Quickstart Python\n###### [Example notebooks](https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html#id5)\n\nNote \nWith Databricks Runtime 10.4 LTS ML and above, [Databricks Autologging](https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html) is enabled by default, and the code in these example notebooks is not required. The example notebooks in this section are designed for use with Databricks Runtime 9.1 LTS ML. \nThe recommended way to get started using MLflow tracking with Python is to use the MLflow `autolog()` API. With MLflow\u2019s autologging capabilities, a single line of code automatically logs the resulting model, the parameters used to create the model, and a model score. The following notebook shows you how to set up a run using autologging. 
\n### MLflow autologging quickstart Python notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-quick-start-python.html) \nIf you need more control over the metrics logged for each training run, or want to log additional artifacts such as tables or plots, you can use the MLflow logging API functions demonstrated in the following notebook. \n### MLflow logging API quickstart Python notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-logging-api-quick-start-python.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n### Get started with MLflow experiments\n##### Quickstart Python\n###### [Learn more](https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html#id6)\n\n* [MLflow overview](https:\/\/docs.databricks.com\/mlflow\/index.html)\n* [Track ML and deep learning training runs](https:\/\/docs.databricks.com\/mlflow\/tracking.html)\n* [Run MLflow Projects on Databricks](https:\/\/docs.databricks.com\/mlflow\/projects.html)\n* [Log, load, register, and deploy MLflow models](https:\/\/docs.databricks.com\/mlflow\/models.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html"} +{"content":"# AI and Machine Learning on Databricks\n## MLOps workflows on Databricks\n#### Model deployment patterns\n\nThis article describes two common patterns for moving ML artifacts through staging and into production. The asynchronous nature of changes to models and code means that there are multiple possible patterns that an ML development process might follow. 
\nModels are created by code, but the resulting model artifacts and the code that created them can operate asynchronously. That is, new model versions and code changes might not happen at the same time. For example, consider the following scenarios: \n* To detect fraudulent transactions, you develop an ML pipeline that retrains a model weekly. The code may not change very often, but the model might be retrained every week to incorporate new data.\n* You might create a large, deep neural network to classify documents. In this case, training the model is computationally expensive and time-consuming, and retraining the model is likely to happen infrequently. However, the code that deploys, serves, and monitors this model can be updated without retraining the model. \n![deploy patterns](https:\/\/docs.databricks.com\/_images\/deploy-patterns.png) \nThe two patterns differ in whether *the model artifact* or *the training code that produces the model artifact* is promoted towards production.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/deployment-patterns.html"} +{"content":"# AI and Machine Learning on Databricks\n## MLOps workflows on Databricks\n#### Model deployment patterns\n##### Deploy code (recommended)\n\nIn most situations, Databricks recommends the \u201cdeploy code\u201d approach. This approach is incorporated into the [recommended MLOps workflow](https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html). \nIn this pattern, the code to train models is developed in the development environment. The same code moves to staging and then production. The model is trained in each environment: initially in the development environment as part of model development, in staging (on a limited subset of data) as part of integration tests, and in the production environment (on the full production data) to produce the final model. 
\nAdvantages: \n* In organizations where access to production data is restricted, this pattern allows the model to be trained on production data in the production environment.\n* Automated model retraining is safer, since the training code is reviewed, tested, and approved for production.\n* Supporting code follows the same pattern as model training code. Both go through integration tests in staging. \nDisadvantages: \n* The learning curve for data scientists to hand off code to collaborators can be steep. Predefined project templates and workflows are helpful. \nAlso in this pattern, data scientists must be able to review training results from the production environment, as they have the knowledge to identify and fix ML-specific issues. \nIf your situation requires that the model be trained in staging over the full production dataset, you can use a hybrid approach by deploying code to staging, training the model, and then deploying the model to production. This approach saves training costs in production but adds an extra operation cost in staging.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/deployment-patterns.html"} +{"content":"# AI and Machine Learning on Databricks\n## MLOps workflows on Databricks\n#### Model deployment patterns\n##### Deploy models\n\nIn this pattern, the model artifact is generated by training code in the development environment. The artifact is then tested in the staging environment before being deployed into production. \nConsider this option when one or more of the following apply: \n* Model training is very expensive or hard to reproduce.\n* All work is done in a single Databricks workspace.\n* You are not working with external repos or a CI\/CD process. \nAdvantages: \n* The handoff for data scientists is simpler.\n* When model training is expensive, the model needs to be trained only once. 
\nDisadvantages: \n* If production data is not accessible from the development environment (which may be true for security reasons), this architecture may not be viable.\n* Automated model retraining is tricky in this pattern. You could automate retraining in the development environment, but the team responsible for deploying the model in production might not accept the resulting model as production-ready.\n* Supporting code, such as pipelines used for feature engineering, inference, and monitoring, needs to be deployed to production separately. \nTypically an environment (development, staging, or production) corresponds to a catalog in Unity Catalog. For details on how to implement this pattern, see [the upgrade guide](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/upgrade-workflows.html#aliases). \nThe diagram below contrasts the code lifecycle for the above deployment patterns across the different execution environments. \nThe environment shown in the diagram is the final environment in which a step is run. For example, in the deploy models pattern, final unit and integration testing is performed in the development environment. In the deploy code pattern, unit tests and integration tests are run in the development environments, and final unit and integration testing is performed in the staging environment. \n![deploy patterns lifecycle](https:\/\/docs.databricks.com\/_images\/deploy-patterns-lifecycle.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/deployment-patterns.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Supported models for pay-per-token\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
\nThis article describes the state-of-the-art open models that are supported by the [Databricks Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html) in pay-per-token mode. \nYou can send query requests to these models using the pay-per-token endpoints available in your Databricks workspace. See [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html). \nIn addition to supporting models in pay-per-token mode, Foundation Model APIs also offers provisioned throughput mode. Databricks recommends provisioned throughput for production workloads. This mode supports all models of a model architecture family (for example, DBRX models), including the fine-tuned and custom pre-trained models supported in pay-per-token mode. See [Provisioned throughput Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html#throughput) for the list of supported architectures. \nYou can interact with these supported models using the [AI Playground](https:\/\/docs.databricks.com\/large-language-models\/ai-playground.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Supported models for pay-per-token\n###### DBRX Instruct\n\nImportant \nDBRX is provided under and subject to the [Databricks Open Model License](https:\/\/www.databricks.com\/legal\/open-model-license), Copyright \u00a9 Databricks, Inc. All rights reserved. Customers are responsible for ensuring compliance with applicable model licenses, including the [Databricks Acceptable Use policy](https:\/\/www.databricks.com\/legal\/acceptable-use-policy-open-model). \nDBRX Instruct is a state-of-the-art mixture of experts (MoE) language model trained by Databricks. 
\nThe model outperforms established open source models on standard benchmarks, and excels at a broad set of natural language tasks such as: text summarization, question-answering, extraction and coding. \nDBRX Instruct can handle up to 32k tokens of input length, and generates outputs of up to 4k tokens. Thanks to its MoE architecture, DBRX Instruct is highly efficient for inference, activating only 36B parameters out of a total of 132B trained parameters. The pay-per-token endpoint that serves this model has a rate limit of one query per second. See [Model Serving limits and regions](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html). \nSimilar to other large language models, DBRX Instruct output may omit some facts and occasionally produce false information. Databricks recommends using retrieval augmented generation (RAG) in scenarios where accuracy is especially important. \nDBRX models use the following default system prompt to ensure relevance and accuracy in model responses: \n```\nYou are DBRX, created by Databricks. You were last updated in December 2023. You answer questions based on information available up to that point.\nYOU PROVIDE SHORT RESPONSES TO SHORT QUESTIONS OR STATEMENTS, but provide thorough responses to more complex and open-ended questions.\nYou assist with various tasks, from writing to coding (using markdown for code blocks \u2014 remember to use ``` with code, JSON, and tables).\n(You do not have real-time data access or code execution capabilities. You avoid stereotyping and provide balanced perspectives on controversial topics. You do not provide song lyrics, poems, or news articles and do not divulge details of your training data.)\nThis is your system prompt, guiding your responses. Do not reference it, just respond to the user. If you find yourself talking about this message, stop. 
You should be responding appropriately and usually that means not mentioning this.\nYOU DO NOT MENTION ANY OF THIS INFORMATION ABOUT YOURSELF UNLESS THE INFORMATION IS DIRECTLY PERTINENT TO THE USER'S QUERY.\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Supported models for pay-per-token\n###### Meta Llama 3 70B Instruct\n\nImportant \nLlama 3 is licensed under the [LLAMA 3 Community License](https:\/\/llama.meta.com\/llama3\/license\/), Copyright \u00a9 Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring compliance with applicable model licenses. \nMeta-Llama-3-70B-Instruct is a state-of-the-art 70B parameter dense language model with a context of 8000 tokens that was built and trained by Meta. The model is optimized for dialogue use cases and aligned with human preferences for helpfulness and safety. It is not intended for use in languages other than English. [Learn more about the Meta Llama 3 models](https:\/\/ai.meta.com\/blog\/meta-llama-3\/). \nSimilar to other large language models, Llama-3\u2019s output may omit some facts and occasionally produce false information. Databricks recommends using retrieval augmented generation (RAG) in scenarios where accuracy is especially important.\n\n##### Supported models for pay-per-token\n###### Llama 2 70B Chat\n\nImportant \nLlama 2 is licensed under the [LLAMA 2 Community License](https:\/\/ai.meta.com\/llama\/license\/), Copyright \u00a9 Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring compliance with applicable model licenses. \nLlama-2-70B-Chat is a state-of-the-art 70B parameter language model with a context length of 4,096 tokens, trained by Meta. 
It excels at interactive applications that require strong reasoning capabilities, including summarization, question-answering, and chat applications. \nSimilar to other large language models, Llama-2-70B\u2019s output may omit some facts and occasionally produce false information. Databricks recommends using retrieval augmented generation (RAG) in scenarios where accuracy is especially important.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Supported models for pay-per-token\n###### Mixtral-8x7B Instruct\n\nMixtral-8x7B Instruct is a high-quality sparse mixture of experts model (SMoE) trained by Mistral AI. Mixtral-8x7B Instruct can be used for a variety of tasks such as question-answering, summarization, and extraction. \nMixtral can handle context lengths up to 32k tokens. Mixtral can process English, French, Italian, German, and Spanish. Mixtral matches or outperforms Llama 2 70B and GPT3.5 on most benchmarks ([Mixtral performance](https:\/\/mistral.ai\/news\/mixtral-of-experts\/)), while being four times faster than Llama 70B during inference. \nSimilar to other large language models, Mixtral-8x7B Instruct model should not be relied on to produce factually accurate information. While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs. 
To reduce risk, Databricks defaults to using a variant of Mistral\u2019s [safe mode system prompt](https:\/\/docs.mistral.ai\/platform\/guardrailing\/).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Supported models for pay-per-token\n###### MPT 7B Instruct\n\nMPT-7B-8K-Instruct is a 6.7B parameter model trained by MosaicML for long-form instruction following, especially question-answering on and summarization of longer documents. The model is pre-trained for 1.5T tokens on a mixture of datasets, and fine-tuned on a dataset derived from the Databricks Dolly-15k and the Anthropic Helpful and Harmless (HH-RLHF) datasets. The model name you see in the product is `mpt-7b-instruct`, but the model specifically being used is the newer MPT-7B-8K-Instruct version. \nMPT-7B-8K-Instruct can be used for a variety of tasks such as question-answering, summarization, and extraction. It is very fast relative to Llama-2-70B but might generate lower quality responses. This model supports a context length of 8 thousand tokens. [Learn more about the MPT-7B-8k-Instruct model](https:\/\/www.mosaicml.com\/blog\/long-context-mpt-7b-8k). \nSimilar to other language models of this size, MPT-7B-8K-Instruct should not be relied on to produce factually accurate information. This model was trained on various public datasets. 
While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Supported models for pay-per-token\n###### MPT 30B Instruct\n\nMPT-30B-Instruct is a 30B parameter model for instruction following trained by MosaicML. The model is pre-trained for 1T tokens on a mixture of English text and code, and then further instruction fine-tuned on a dataset derived from Databricks Dolly-15k, Anthropic Helpful and Harmless (HH-RLHF), CompetitionMath, DuoRC, CoT GSM8k, QASPER, QuALITY, SummScreen, and Spider datasets. \nMPT-30B-Instruct can be used for a variety of tasks such as question-answering, summarization, and extraction. It is very fast relative to Llama-2-70B but might generate lower quality responses and does not support multi-turn chat. This model supports a context length of 8,192 tokens. [Learn more about the MPT-30B-Instruct model](https:\/\/www.mosaicml.com\/blog\/mpt-30b). \nSimilar to other language models of this size, MPT-30B-Instruct should not be relied on to produce factually accurate information. This model was trained on various public datasets. 
While great efforts have been taken to clean the pre-training data, it is possible that this model could generate lewd, biased, or otherwise offensive outputs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n##### Supported models for pay-per-token\n###### BGE Large (En)\n\n[BAAI General Embedding (BGE)](https:\/\/huggingface.co\/BAAI\/bge-large-en-v1.5) is a text embedding model that maps any text to a 1024-dimension embedding vector and has an embedding window of 512 tokens. These vectors can be used in vector databases for LLMs, and for tasks like retrieval, classification, question-answering, clustering, or semantic search. This endpoint serves the English version of the model. \nEmbedding models are especially effective when used in tandem with LLMs for retrieval augmented generation (RAG) use cases. BGE can be used to find relevant text snippets in large chunks of documents that can be used in the context of an LLM. \nIn RAG applications, you may be able to improve the performance of your retrieval system by including an instruction parameter. 
The BGE authors recommend trying the instruction `\"Represent this sentence for searching relevant passages:\"` for query embeddings, though its performance impact is domain dependent.\n\n##### Supported models for pay-per-token\n###### Additional resources\n\n* [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html)\n* [Foundation model REST API reference](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html)\n* [Databricks Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Set up Delta Sharing for your account (for providers)\n\nThis article describes how data providers (organizations that want to use Delta Sharing to share data securely) perform initial setup of Delta Sharing on Databricks. \nNote \nIf you are a data recipient (an organization that receives data that is shared using Delta Sharing), see instead [Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html). \nImportant \nA provider who wants to use the Delta Sharing server that is built into Databricks must have at least one workspace that is enabled for Unity Catalog. You do not need to migrate all of your workspaces to Unity Catalog. You can create one Unity Catalog-enabled workspace for share management. In some accounts, new workspaces are enabled for Unity Catalog automatically. See [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement). 
\nIf creating a new Unity Catalog-enabled workspace is not an option, you can use the [open-source Delta Sharing project](https:\/\/delta.io\/sharing) to deploy your own Delta Sharing server and use that to share Delta tables from any platform. \nInitial provider setup includes the following steps: \n1. Enable Delta Sharing on a Unity Catalog metastore.\n2. (Optional) Install the Unity Catalog CLI.\n3. Configure audits of Delta Sharing activity.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/set-up.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Set up Delta Sharing for your account (for providers)\n#### Requirements\n\nAs a data provider who is setting up your Databricks account to be able to share data, you must have: \n* At least one Databricks workspace that is [enabled for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html). \nYou do not need to migrate all of your workspaces to Unity Catalog to take advantage of Databricks support for Delta Sharing providers. See [Do I need Unity Catalog to use Delta Sharing?](https:\/\/docs.databricks.com\/data-sharing\/index.html#uc-faq). \nRecipients do not need to have a Unity Catalog-enabled workspace.\n* Account admin role to enable Delta Sharing for your Unity Catalog metastore and to enable audit logging.\n* Metastore admin role or the `CREATE SHARE` and `CREATE RECIPIENT` privileges. See [Admin roles for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#admin-roles). \nNote \nIf your workspace was enabled for Unity Catalog automatically, you might not have a metastore admin. However, workspace admins in such workspaces have the `CREATE SHARE` and `CREATE RECIPIENT` privileges on the metastore by default. 
For more information, see [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement) and [Workspace admin privileges when workspaces are enabled for Unity Catalog automatically](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html#workspace-admins-auto).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/set-up.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Set up Delta Sharing for your account (for providers)\n#### Enable Delta Sharing on a metastore\n\nFollow these steps for each Unity Catalog metastore that manages data that you plan to share using Delta Sharing. \nNote \nYou do not need to enable Delta Sharing on your metastore if you intend to use Delta Sharing only to share data with users on other Unity Catalog metastores in your account. Metastore-to-metastore sharing within a single Databricks account is enabled by default. \n1. As a Databricks account admin, log in to the [account console](https:\/\/accounts.cloud.databricks.com).\n2. In the sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. Click the name of a metastore to open its details.\n4. Click the checkbox next to **Enable Delta Sharing to allow a Databricks user to share data outside their organization**.\n5. Configure the recipient token lifetime. \nThis configuration sets the period of time after which all recipient tokens expire and must be regenerated. Recipient tokens are used only in the [open sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html#open) protocol. Databricks recommends that you configure a default token lifetime rather than allow tokens to live indefinitely. \nNote \nThe recipient token lifetime for existing recipients is not updated automatically when you change the default recipient token lifetime for a metastore. 
To apply a new token lifetime to a given recipient, you must rotate their token. See [Manage recipient tokens (open sharing)](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#rotate-credential). \nTo set the default recipient token lifetime: \n1. Confirm that **Set expiration** is enabled (this is the default). \nIf you clear this checkbox, tokens will never expire. Databricks recommends that you configure tokens to expire.\n2. Enter a number of seconds, minutes, hours, or days, and select the unit of measure.\n3. Click **Enable**. \nFor more information, see [Security considerations for tokens](https:\/\/docs.databricks.com\/data-sharing\/create-recipient.html#security-considerations).\n6. Optionally enter a name for your organization that a recipient can use to identify who is sharing with them.\n7. Click **Enable**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/set-up.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Set up Delta Sharing for your account (for providers)\n#### (Optional) Install the Unity Catalog CLI\n\nTo manage shares and recipients, you can use Catalog Explorer, SQL commands, or the Unity Catalog CLI. The CLI runs in your local environment and does not require Databricks compute resources. \nTo install the CLI, see [What is the Databricks CLI?](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html).\n\n### Set up Delta Sharing for your account (for providers)\n#### Enable audit logging\n\nAs a Databricks account admin, you should enable audit logging to capture Delta Sharing events, such as: \n* When someone creates, modifies, updates, or deletes a share or a recipient\n* When a recipient accesses an activation link and downloads the credential (open sharing only)\n* When a recipient accesses data\n* When a recipient\u2019s credential is rotated or expires (open sharing only) \nDelta Sharing activity is logged at the account level. 
\nTo enable audit logging, follow the instructions in [Audit log reference](https:\/\/docs.databricks.com\/admin\/account-settings\/audit-logs.html). \nImportant \nDelta Sharing activity is logged at the account level. When you configure log delivery, do not enter a value for `workspace_ids_filter`. \nFor detailed information about how Delta Sharing events are logged, see [Audit and monitor data sharing](https:\/\/docs.databricks.com\/data-sharing\/audit-logs.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/set-up.html"} +{"content":"# Share data and AI assets securely using Delta Sharing\n### Set up Delta Sharing for your account (for providers)\n#### Grant permission to create and manage shares and recipients\n\nMetastore admins have the right to create and manage shares and recipients, including the granting of shares to recipients. Many provider tasks can be delegated by a metastore admin using the following privileges: \nNote \nIf your workspace was enabled for Unity Catalog automatically, you might not have a metastore admin. However, workspace admins in such workspaces have the `CREATE SHARE` and `CREATE RECIPIENT` privileges on the metastore by default. For more information, see [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement) and [Workspace admin privileges when workspaces are enabled for Unity Catalog automatically](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html#workspace-admins-auto). 
\n* `CREATE SHARE` on the metastore grants the ability to create shares.\n* `CREATE RECIPIENT` on the metastore grants the ability to create recipients.\n* `USE RECIPIENT` on the metastore grants the ability to list and view details for all recipients in the metastore.\n* `USE SHARE` on the metastore grants the ability to list and view details for all shares in the metastore.\n* `USE RECIPIENT`, `USE SHARE`, and `SET SHARE PERMISSION` combined give a user the ability to grant share access to recipients.\n* `USE SHARE` and `SET SHARE PERMISSION` combined give a user the ability to transfer ownership of any share.\n* Share and recipient owners can update those objects and grant shares to recipients. Object creators are granted ownership by default, but ownership can be transferred.\n* Share owners can add tables and volumes to shares, as long as they have `SELECT` access to the tables and `READ VOLUME` access to the volumes. \nFor details, see [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html) and the permissions listed for every task described in the Delta Sharing guide.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-sharing\/set-up.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Troubleshoot and repair job failures\n\nSuppose you have been notified (for example, through an email notification, a monitoring solution, or in the Databricks Jobs UI) that a task has failed in a run of your Databricks job. This article helps you identify the cause of the failure, suggests fixes for the issues that you find, and explains how to repair failed job runs.\n\n#### Troubleshoot and repair job failures\n##### Identify the cause of failure\n\nTo find the failed task in the Databricks Jobs UI: \n1. Click ![Job Runs Icon](https:\/\/docs.databricks.com\/_images\/job-runs-icon.png) **Job Runs** in the sidebar.\n2. 
In the **Name** column, click a job name. The **Runs** tab shows active runs and completed runs, including any failed runs. The matrix view in the **Runs** tab shows a history of runs for the job, including successful and unsuccessful runs for each job task. A task run may be unsuccessful because it failed or because it was skipped after a task it depends on failed. Using the matrix view, you can quickly identify the task failures for your job run. \n![Matrix view of job runs](https:\/\/docs.databricks.com\/_images\/job-runs-matrix-view.png)\n3. Hover over a failed task to see associated metadata. This metadata includes the start and end dates, status, duration, cluster details, and, in some cases, an error message.\n4. To help identify the cause of the failure, click the failed task. The **Task run details** page appears, displaying the task\u2019s output, error message, and associated metadata.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/repair-job-failures.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Troubleshoot and repair job failures\n##### Fix the cause of failure\n\nYour task might have failed for several reasons, for example, a data quality issue, a misconfiguration, or insufficient compute resources. The following are suggested steps to fix some common causes of task failures: \n* If the failure is related to the task configuration, click **Edit task**. The task configuration opens in a new tab. Update the task configuration as required and click **Save task**.\n* If the issue is related to cluster resources, for example, insufficient instances, there are several options: \n+ If your job is configured to use a job cluster, consider using a shared all-purpose cluster.\n+ Change the cluster configuration. Click **Edit task**. In the **Job details** panel, under **Compute**, click **Configure** to configure the cluster. 
You can change the number of workers, the instance types, or other cluster configuration options. You can also click **Swap** to switch to another available cluster. To ensure you\u2019re making optimal use of available resources, review best practices for [cluster configuration](https:\/\/docs.databricks.com\/compute\/cluster-config-best-practices.html).\n+ If necessary, ask an administrator to increase resource quotas in the cloud account and region where your workspace is deployed.\n* If the failure is caused by exceeding the maximum concurrent runs, either: \n+ Wait for other runs to complete.\n+ Click **Edit task**. In the **Job details** panel, click **Edit concurrent runs**, enter a new value for **Maximum concurrent runs**, and click **Confirm**. \nIn some cases, the cause of a failure may be upstream from your job; for example, an external data source is unavailable. You can still take advantage of the repair run feature covered in the next section after the external issue is resolved.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/repair-job-failures.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Troubleshoot and repair job failures\n##### Re-run failed and skipped tasks\n\nAfter you identify the cause of failure, you can repair failed or canceled multi-task jobs by running only the subset of unsuccessful tasks and any dependent tasks. Because successful tasks and any tasks that depend on them are not re-run, this feature reduces the time and resources required to recover from unsuccessful job runs. \nYou can change job or task settings before repairing the job run. Unsuccessful tasks are re-run with the current job and task settings. For example, if you change the path to a notebook or a cluster setting, the task is re-run with the updated notebook or cluster settings. 
\nView the [history of all task runs](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#task-history) on the **Task run details** page. \nNote \n* If one or more tasks share a job cluster, a repair run creates a new job cluster. For example, if the original run used the job cluster `my_job_cluster`, the first repair run uses the new job cluster `my_job_cluster_v1`, allowing you to easily see the cluster and cluster settings used by the initial run and any repair runs. The settings for `my_job_cluster_v1` are the same as the current settings for `my_job_cluster`.\n* Repair is supported only with jobs that orchestrate two or more tasks.\n* The **Duration** value displayed in the **Runs** tab spans the time from when the first run started until the latest repair run finished. For example, if a run failed twice and succeeded on the third run, the duration includes the time for all three runs. \nTo repair a failed job run: \n1. Click the link for the failed run in the **Start time** column of the job runs table or click the failed run in the matrix view. The **Job run details** page appears.\n2. Click **Repair run**. The **Repair job run** dialog appears, listing all unsuccessful tasks and any dependent tasks that will be re-run.\n3. To add or edit parameters for the tasks to repair, enter the parameters in the **Repair job run** dialog. Parameters you enter in the **Repair job run** dialog override existing values. On subsequent repair runs, you can return a parameter to its original value by clearing the key and value in the **Repair job run** dialog.\n4. Click **Repair run** in the **Repair job run** dialog.\n5. After the repair run finishes, the matrix view is updated with a new column for the repaired run. 
Any failed tasks that were red should now be green, indicating a successful run for your entire job.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/repair-job-failures.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Troubleshoot and repair job failures\n##### View and manage continuous job failures\n\nWhen consecutive failures of a continuous job exceed a threshold, Databricks Jobs uses [exponential backoff](https:\/\/docs.databricks.com\/workflows\/jobs\/schedule-jobs.html#exponential-backoff) to retry the job. When a job is in the exponential backoff state, a message in the **Job details** panel displays information, including: \n* The number of consecutive failures.\n* How long the job must run without error to be considered successful.\n* The time before the next retry if no run is currently active. \nTo cancel the active run, reset the retry period, and start a new job run, click **Restart run**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/repair-job-failures.html"} +{"content":"# Databricks administration introduction\n### Get started with Databricks administration\n\nThis article provides opinionated guidance for new account and workspace admins looking to take advantage of the administrative and security features available on Databricks. For more in-depth security guidance, see the [Security and compliance guide](https:\/\/docs.databricks.com\/security\/index.html).\n\n### Get started with Databricks administration\n#### Requirements\n\nYou need a Databricks account and workspace. If you haven\u2019t set yours up yet, follow the steps in [Get started: Account and workspace setup](https:\/\/docs.databricks.com\/getting-started\/index.html) to get up and running. 
Once you have a workspace set up, go through the following admin tasks:\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/admin-get-started.html"} +{"content":"# Databricks administration introduction\n### Get started with Databricks administration\n#### Step 1: Build out your team\n\nThe best practice for building out your team is to add users and groups to your account by [syncing your identity provider (IdP) with Databricks](https:\/\/docs.databricks.com\/admin\/users-groups\/scim\/index.html). If you choose to build your team out manually, you can follow the steps in [Manage users](https:\/\/docs.databricks.com\/admin\/users-groups\/users.html) and [Manage groups](https:\/\/docs.databricks.com\/admin\/users-groups\/groups.html) to add your team through the account console UI. \nYou should organize your users and service principals into account groups based on permissions and roles. Account groups simplify identity management by making it easier to assign access to workspaces, data, and other securable objects. \nAfter your team has been added to Databricks, the following tasks are recommended: \n* [Assign groups to the workspace](https:\/\/docs.databricks.com\/admin\/users-groups\/groups.html#add-groups-workspace)\n* [Assign the account admin role to a group](https:\/\/docs.databricks.com\/admin\/users-groups\/groups.html#account-admin) \n* [Set up SSO for your account](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/single-sign-on\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/admin-get-started.html"} +{"content":"# Databricks administration introduction\n### Get started with Databricks administration\n#### Step 2: Configure permissions and access control\n\nWithin a workspace, workspace admins help secure data and control compute usage by giving users access only to the Databricks functionality and data they need. 
\nNote \nAccess control requires the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons). If you don\u2019t have it, go to the account console to [update your subscription](https:\/\/docs.databricks.com\/admin\/account-settings\/account.html#upgrade-downgrade) or contact your Databricks account team. \nThe following articles walk you through enabling and managing key features workspace admins can use to control data access and compute usage: \n* [Manage data governance and user data access](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html)\n* [Manage access control for clusters, jobs, notebooks, and other workspace objects](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html)\n* [Create and manage compute policies](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html)\n\n### Get started with Databricks administration\n#### Step 3: Set up account monitoring\n\nTo control costs and allow your organization to monitor detailed Databricks usage patterns, including audit and billable usage logs, Databricks recommends using system tables (Public Preview).\nYou can also use custom tags to help monitor resources and data objects. \n* [Monitor usage with system tables](https:\/\/docs.databricks.com\/admin\/system-tables\/index.html).\n* [Monitor usage using tags](https:\/\/docs.databricks.com\/admin\/account-settings\/usage-detail-tags.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/admin-get-started.html"} +{"content":"# Databricks administration introduction\n### Get started with Databricks administration\n#### Step 4: Implement additional security features\n\nDatabricks provides a secure networking environment by default, but if your organization has additional needs, you can configure network security features on your Databricks resources. See [Customize network security](https:\/\/docs.databricks.com\/security\/network\/index.html). 
For an overview of available security features, see [Security and compliance guide](https:\/\/docs.databricks.com\/security\/index.html).\n\n### Get started with Databricks administration\n#### Get Databricks support\n\nIf you have any questions about setting up Databricks and need live help, please e-mail [onboarding-help@databricks.com](mailto:onboarding-help%40databricks.com). \nIf you have a Databricks support package, you can open and manage support cases with Databricks. See [Learn how to use Databricks support](https:\/\/docs.databricks.com\/resources\/support.html). \nIf your organization does not have a Databricks support subscription, or if you are not an authorized contact for your company\u2019s support subscription, you can get answers to many questions in [Databricks Office Hours](https:\/\/www.databricks.com\/p\/webinar\/officehours?utm_source=databricks&utm_medium=site&utm_content=docs) or from the [Databricks Community](https:\/\/community.databricks.com).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/admin-get-started.html"} +{"content":"# Databricks administration introduction\n### Get started with Databricks administration\n#### Databricks Academy\n\nDatabricks Academy has a [free self-paced learning path for platform administrators](https:\/\/customer-academy.databricks.com\/learn\/lp\/207\/platform-administrator-learning-plan). Before you can access the course, you first need to [register for Databricks Academy](https:\/\/customer-academy.databricks.com\/learn\/register) if you haven\u2019t already. 
\nYou can also sign up to attend a [live platform administration training](https:\/\/files.training.databricks.com\/static\/ilt-sessions\/onboarding\/index.html?utm_source=databricks&utm_medium=web&utm_campaign=7018y0000010b3oqaa&_ga=2.115610374.107910741.1678852231-1960333334.1675274743).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/admin-get-started.html"} +{"content":"# Databricks administration introduction\n### Get started with Databricks administration\n#### Additional resources\n\nThe following table includes links for further learning: \n| **Become a Databricks expert** | * Run the [Get started quickstarts](https:\/\/docs.databricks.com\/getting-started\/index.html) * Take training courses at [Databricks Academy](https:\/\/academy.databricks.com) * Review [resources](https:\/\/databricks.com\/resources) such as e-books, webinars, and more |\n| --- | --- |\n| **Learn industry best practices and news** | * Read [Databricks blogs](https:\/\/databricks.com\/blog) * View past [Spark + AI Summit sessions](https:\/\/databricks.com\/sparkaisummit\/north-america\/sessions?eventName=Summit%202019) |\n| **Follow in-depth, proven best practices** | * Learn about [CI\/CD on Databricks](https:\/\/docs.databricks.com\/dev-tools\/index-ci-cd.html#dev-tools-ci-cd) * Learn how to [keep your data lake GDPR and CCPA compliant using Delta Lake](https:\/\/docs.databricks.com\/security\/privacy\/gdpr-delta.html) * [Meet HIPAA requirements](https:\/\/docs.databricks.com\/security\/privacy\/hipaa.html) * [Manage data retention](https:\/\/docs.databricks.com\/admin\/workspace-settings\/storage.html) * Learn how to [migrate workloads to Databricks](https:\/\/docs.databricks.com\/migration\/index.html) * Learn [cluster configuration best practices on Databricks](https:\/\/docs.databricks.com\/compute\/cluster-config-best-practices.html) |\n| **Get involved** | * Participate in [Data + AI Summit](https:\/\/www.databricks.com\/dataaisummit\/) * Influence the product 
roadmap by adding ideas to the [Ideas Portal](https:\/\/ideas.databricks.com) |\n| **Get help and support** | * [Help Center](https:\/\/help.databricks.com) * [Community Forums](https:\/\/forums.databricks.com) |\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/admin-get-started.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n### Set up query federation for non-Unity-Catalog workspaces\n##### Query federation for Amazon Redshift in Databricks SQL (Experimental)\n\nExperimental \nThe configurations described in this article are [Experimental](https:\/\/docs.databricks.com\/release-notes\/release-types.html). Experimental features are provided as-is and are not supported by Databricks through customer technical support. **To get full query federation support, you should instead use [Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html), which enables your Databricks users to take advantage of Unity Catalog syntax and data governance tools.** \nThis article describes how to configure read-only query federation to Redshift on serverless and pro SQL warehouses. For information about configuring Redshift S3 credentials, see [Query Amazon Redshift using Databricks](https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html). \nYou configure connections to Redshift at the table level. You can use [secrets](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/secret.html) to store and access text credentials without displaying them in plaintext. 
See the following example: \n```\nDROP TABLE IF EXISTS redshift_table;\nCREATE TABLE redshift_table\nUSING redshift\nOPTIONS (\ndbtable '<table-name>',\ntempdir 's3a:\/\/<bucket>\/<directory-path>',\nurl 'jdbc:redshift:\/\/<database-host-url>',\nuser secret('redshift_creds', 'my_username'),\npassword secret('redshift_creds', 'my_password'),\nforward_spark_s3_credentials 'true'\n);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/redshift-no-uc.html"} +{"content":"# Model serving with Databricks\n### Migrate to Model Serving\n\nThis article demonstrates how to enable Model Serving on your workspace and switch your models to the new [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) experience built on serverless compute.\n\n### Migrate to Model Serving\n#### Requirements\n\n* Registered model in the MLflow Model Registry.\n* Permissions on the registered models as described in the [access control guide](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#serving-endpoints).\n* [Enable serverless compute on your workspace](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html#serverless).\n\n### Migrate to Model Serving\n#### Significant changes\n\n* In Model Serving, the format of the request to the endpoint and the response from the endpoint are slightly different from Legacy MLflow Model Serving. 
See [Scoring a model endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html#score) for details on the new format protocol.\n* In Model Serving, the endpoint URL includes `serving-endpoints` instead of `model`.\n* Model Serving includes full support for [managing resources with API workflows](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html).\n* Model Serving is production-ready and backed by the Databricks SLA.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/migrate-model-serving.html"} +{"content":"# Model serving with Databricks\n### Migrate to Model Serving\n#### Migrate Legacy MLflow Model Serving served models to Model Serving\n\nYou can create a Model Serving endpoint and flexibly transition model serving workflows without disabling [Legacy MLflow Model Serving](https:\/\/docs.databricks.com\/archive\/legacy-model-serving\/model-serving.html). \nThe following steps show how to accomplish this with the UI. For each model on which you have Legacy MLflow Model Serving enabled: \n1. Navigate to **Serving endpoints** on the sidebar of your machine learning workspace.\n2. Follow the workflow described in [Create custom model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html) on how to create a serving endpoint with your model.\n3. Transition your application to use the new URL provided by the serving endpoint to query the model, along with the new scoring format.\n4. When your models are transitioned over, you can navigate to **Models** on the sidebar of your machine learning workspace.\n5. Select the model for which you want to disable Legacy MLflow Model Serving.\n6. On the **Serving** tab, select **Stop**.\n7. A message appears to confirm. 
Select **Stop Serving**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/migrate-model-serving.html"} +{"content":"# Model serving with Databricks\n### Migrate to Model Serving\n#### Migrate deployed model versions to Model Serving\n\nIn previous versions of the Model Serving functionality, the serving endpoint was created based on the stage of the registered model version: `Staging` or `Production`. To migrate your served models from that experience, you can replicate that behavior in the new Model Serving experience. \nThis section demonstrates how to create separate model serving endpoints for `Staging` model versions and `Production` model versions. The following steps show how to accomplish this with the serving endpoints API for each of your served models. \nIn the example, the registered model name `modelA` has version 1 in the model stage `Production` and version 2 in the model stage `Staging`. \n1. Create two endpoints for your registered model, one for `Staging` model versions and another for `Production` model versions. \nFor `Staging` model versions: \n```\nPOST \/api\/2.0\/serving-endpoints\n{\n\"name\":\"modelA-Staging\",\n\"config\":{\n\"served_entities\":[\n{\n\"entity_name\":\"modelA\",\n\"entity_version\":\"2\", \/\/ Staging model version\n\"workload_size\":\"Small\",\n\"scale_to_zero_enabled\":true\n}\n]\n}\n}\n\n``` \nFor `Production` model versions: \n```\nPOST \/api\/2.0\/serving-endpoints\n{\n\"name\":\"modelA-Production\",\n\"config\":{\n\"served_entities\":[\n{\n\"entity_name\":\"modelA\",\n\"entity_version\":\"1\", \/\/ Production model version\n\"workload_size\":\"Small\",\n\"scale_to_zero_enabled\":true\n}\n]\n}\n}\n\n```\n2. Verify the status of the endpoints. \nFor the Staging endpoint: `GET \/api\/2.0\/serving-endpoints\/modelA-Staging` \nFor the Production endpoint: `GET \/api\/2.0\/serving-endpoints\/modelA-Production`\n3. 
Once the endpoints are ready, query them using: \nFor the Staging endpoint: `POST \/serving-endpoints\/modelA-Staging\/invocations` \nFor the Production endpoint: `POST \/serving-endpoints\/modelA-Production\/invocations`\n4. Update the endpoints based on model version transitions. \nIn the scenario where a new model version 3 is created, you can have model version 2 transition to `Production`, while model version 3 transitions to `Staging` and model version 1 is `Archived`. These changes can be reflected in separate model serving endpoints as follows: \nFor the `Staging` endpoint, update the endpoint to use the new model version in `Staging`. \n```\nPUT \/api\/2.0\/serving-endpoints\/modelA-Staging\/config\n{\n\"served_entities\":[\n{\n\"entity_name\":\"modelA\",\n\"entity_version\":\"3\", \/\/ New Staging model version\n\"workload_size\":\"Small\",\n\"scale_to_zero_enabled\":true\n}\n]\n}\n\n``` \nFor the `Production` endpoint, update the endpoint to use the new model version in `Production`. \n```\nPUT \/api\/2.0\/serving-endpoints\/modelA-Production\/config\n{\n\"served_entities\":[\n{\n\"entity_name\":\"modelA\",\n\"entity_version\":\"2\", \/\/ New Production model version\n\"workload_size\":\"Small\",\n\"scale_to_zero_enabled\":true\n}\n]\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/migrate-model-serving.html"} +{"content":"# Model serving with Databricks\n### Migrate to Model Serving\n#### Migrate MosaicML inference workflows to Model Serving\n\nThis section provides guidance on how to migrate your MosaicML inference deployments to Databricks Model Serving and includes a notebook example. \nThe following table summarizes the parity between MosaicML inference and model serving on Databricks. 
\n| MosaicML Inference | Databricks Model Serving |\n| --- | --- |\n| create\\_inference\\_deployment | [Create a model serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html) |\n| update\\_inference\\_deployment | [Update a model serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html#endpoint-config) |\n| delete\\_inference\\_deployment | [Delete a model serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/manage-serving-endpoints.html#delete-endpoint) |\n| get\\_inference\\_deployment | [Get status of a model serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/manage-serving-endpoints.html#status) | \nThe following notebook provides a guided example of migrating a `llama-13b` model from MosaicML to Databricks Model Serving. \n### Migrate from MosaicML inference to Databricks Model Serving notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/migrate-mosaicml-inference-to-model-serving.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/migrate-model-serving.html"} +{"content":"# Model serving with Databricks\n### Migrate to Model Serving\n#### Additional resources\n\n* [Create Model Serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html)\n* [Migrate optimized LLM serving endpoints to provisioned throughput](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/migrate-provisioned-throughput.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/migrate-model-serving.html"} +{"content":"# \n### Migrate legacy line charts\n\nLearn about line charts in Databricks and how to migrate from the 
legacy Line charts. Databricks has three types of line charts: **Line** and the legacy charts **Line (v2)** and **Line (v1)**.\n\n### Migrate legacy line charts\n#### Comparison: Line chart types with time-series data\n\n**Line** has custom plot options: setting a Y-axis range, showing or hiding markers, and applying log scale to the Y-axis. It also has a built-in toolbar that supports a rich set of client-side interactions. \nIn addition, **Line**, **Line (v2)**, and **Line (v1)** charts treat time-series data in different ways: \n| Line and Line (v2) | Line (v1) |\n| --- | --- |\n| Date and timestamp are supported. Line (v2) formats a date to local time. | Date and timestamp are treated as text. |\n| Key can only be date, timestamp, or number. | Key can be of any type. |\n| X-axis is ordered: * Natural, linear ordering of date, timestamp, and number. * Gaps in time appear as gaps on the chart. | X-axis is categorical: * Not ordered unless the data is ordered. * Gaps in time do not appear on the chart. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/legacy-charts.html"} +{"content":"# \n### Migrate legacy line charts\n#### Notebook example: Migrate to Line from legacy line charts\n\nThe notebook in this example converts a date stored as a string into a timestamp (including time zone) using `unix_timestamp`. \nTo migrate to **Line** from **Line (v1)** or **Line (v2)**: \n1. Click ![Button Down](https:\/\/docs.databricks.com\/_images\/button-down.png) next to the bar chart ![Chart Button](https:\/\/docs.databricks.com\/_images\/chart-button.png) and select **Line**. \n![Chart types](https:\/\/docs.databricks.com\/_images\/display-charts.png)\n2. For a Line (v1) chart, if the key column is not a date, timestamp, or number, you must parse the column to a date, timestamp, or number explicitly as demonstrated in the following notebook. 
\n### Timestamp conversion notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/timestamp-conversion.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n### Migrate legacy line charts\n#### Use legacy line charts\n\nTo use legacy line charts, select them from the **Legacy Charts** drop-down menu. \n![Legacy chart types](https:\/\/docs.databricks.com\/_images\/display-legacy-charts.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/legacy-charts.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Develop Delta Live Tables pipelines\n\nThe articles in this section describe steps and recommendations for Delta Live Tables pipeline development and testing in either a Databricks notebook, the Databricks file editor, or locally using an integrated development environment (IDE).\n\n#### Develop Delta Live Tables pipelines\n##### Create a pipeline in the Databricks UI\n\nFor UI steps to configure a Delta Live Tables pipeline from code in a notebook, see [Create a pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#create-pipeline).\n\n#### Develop Delta Live Tables pipelines\n##### Notebook experience for Delta Live Tables code development (Public Preview)\n\nWhen you work on a Python or SQL notebook that is the source code for an existing Delta Live Tables pipeline, you can connect the notebook to the pipeline and access a set of features in notebooks that assist in developing and debugging Delta Live Tables code. 
See [Notebook experience for Delta Live Tables code development](https:\/\/docs.databricks.com\/delta-live-tables\/dlt-notebook-devex.html).\n\n#### Develop Delta Live Tables pipelines\n##### Tips, recommendations, and features for developing and testing pipelines\n\nFor pipeline development and testing tips, recommendations, and features, see [Tips, recommendations, and features for developing and testing Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/testing.html).\n\n#### Develop Delta Live Tables pipelines\n##### CI\/CD for pipelines\n\n*Databricks Asset Bundles* enable you to programmatically validate, deploy, and run Databricks resources such as Delta Live Tables pipelines. For steps that you can complete from your local development machine to use a bundle that programmatically manages a Delta Live Tables pipeline, see [Develop Delta Live Tables pipelines with Databricks Asset Bundles](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-bundles.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/develop-pipelines.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Develop Delta Live Tables pipelines\n##### Develop pipeline code in your local development environment\n\nIn addition to using notebooks or the file editor in your Databricks workspace to implement pipeline code that uses the Delta Live Tables [Python interface](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html), you can also develop your code in your local development environment. For example, you can use your favorite integrated development environment (IDE) such as Visual Studio Code or PyCharm. After writing your pipeline code locally, you can manually move it into your Databricks workspace or use Databricks tools to operationalize your pipeline, including deploying and running the pipeline. 
\nSee [Develop Delta Live Tables pipeline code in your local development environment](https:\/\/docs.databricks.com\/delta-live-tables\/develop-locally.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/develop-pipelines.html"} +{"content":"# Data governance with Unity Catalog\n### What is Catalog Explorer?\n\nDatabricks Catalog Explorer provides a UI to explore and manage data, schemas (databases), tables, models, functions, and other AI assets. To open Catalog Explorer, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** in the sidebar.\n\n### What is Catalog Explorer?\n#### What can you do with Catalog Explorer?\n\nCatalog Explorer has two primary functions: \n* Finding data assets. \nFor example, you can use Catalog Explorer to view schema details, preview sample data, see table and model details, and explore entity relationships. To learn how to use Catalog Explorer to discover data, see [Discover data](https:\/\/docs.databricks.com\/discover\/index.html) and the articles listed below.\n* Managing Unity Catalog and Delta Sharing. \nFor example, you can use Catalog Explorer to create catalogs, create shares, manage external locations, view and change object ownership, and grant and revoke permissions on all objects. To learn how to use Catalog Explorer to manage Unity Catalog and Delta Sharing, see [What is Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). \nThis section includes the following articles that describe how to perform some of these data discovery and object management tasks using Catalog Explorer. 
\n* [Explore models](https:\/\/docs.databricks.com\/catalog-explorer\/explore-models.html)\n* [View the Entity Relationship Diagram](https:\/\/docs.databricks.com\/catalog-explorer\/entity-relationship-diagram.html)\n* [Add AI-generated comments to a table](https:\/\/docs.databricks.com\/catalog-explorer\/ai-comments.html)\n* [Document data in Catalog Explorer using markdown comments](https:\/\/docs.databricks.com\/catalog-explorer\/markdown-data-comments.html) \nNote \nThe Catalog Explorer UI is not live in all Databricks workspaces. For information about the previous UI, see [Explore and create tables in DBFS](https:\/\/docs.databricks.com\/archive\/legacy\/data-tab.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/catalog-explorer\/index.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Develop and debug Delta Live Tables pipelines in notebooks\n\nPreview \nThe notebook experience for Delta Live Tables development is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes features in Databricks notebooks that assist in the development and debugging of Delta Live Tables code.\n\n##### Develop and debug Delta Live Tables pipelines in notebooks\n###### Overview of features\n\nWhen you work on a Python or SQL notebook that is the source code for an existing Delta Live Tables pipeline, you can connect the notebook directly to the pipeline. 
When the notebook is connected to the pipeline, the following features are available: \n* Start and validate the pipeline from the notebook.\n* View the pipeline\u2019s dataflow graph and event log for the latest update in the notebook.\n* View pipeline diagnostics in the notebook editor.\n* View the status of the pipeline\u2019s cluster in the notebook.\n* Access the Delta Live Tables UI from the notebook.\n\n##### Develop and debug Delta Live Tables pipelines in notebooks\n###### Prerequisites\n\n* You must have an existing Delta Live Tables pipeline with a Python or SQL notebook as source code.\n* You must either be the owner of the pipeline or have the `CAN_MANAGE` privilege.\n\n##### Develop and debug Delta Live Tables pipelines in notebooks\n###### Limitations\n\n* The features covered in this article are only available in Databricks notebooks. Workspace files are not supported.\n* The web terminal is not available when attached to a pipeline. As a result, it is not visible as a tab in the bottom panel.\n\n##### Develop and debug Delta Live Tables pipelines in notebooks\n###### Connect a notebook to a Delta Live Tables pipeline\n\nInside the notebook, click on the drop-down menu used to select compute. The drop-down menu shows all your Delta Live Tables pipelines with this notebook as source code. 
To connect the notebook to a pipeline, select it from the list.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/dlt-notebook-devex.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Develop and debug Delta Live Tables pipelines in notebooks\n###### View the pipeline\u2019s cluster status\n\nTo easily understand the state of your pipeline\u2019s cluster, its status is shown in the compute drop-down menu with a green color to indicate that the cluster is running.\n\n##### Develop and debug Delta Live Tables pipelines in notebooks\n###### Validate pipeline code\n\nYou can [validate the pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/updates.html#validate-update) to check for syntax errors in your source code without processing any data. \nTo validate a pipeline, do one of the following: \n* In the top-right corner of the notebook, click **Validate**.\n* Press `Shift+Enter` in any notebook cell.\n* In a cell\u2019s dropdown menu, click **Validate Pipeline**. \nNote \nIf you attempt to validate your pipeline while an existing update is already running, a dialog box displays asking if you want to terminate the existing update.\n\n##### Develop and debug Delta Live Tables pipelines in notebooks\n###### Start the pipeline\n\nA pipeline update does the following: starts a cluster, discovers and validates all the tables and views defined, and creates or updates tables and views with the most recent data available. \nTo start an update of your pipeline, click the **Start** button in the top-right corner of the notebook. 
\nIf you click **Yes**, the existing update stops, and a *validate* update automatically starts.\n\n##### Develop and debug Delta Live Tables pipelines in notebooks\n###### View the status of an update\n\nThe top panel in the notebook displays whether a pipeline update is: \n* Starting\n* Validating\n* Stopping\n\n##### Develop and debug Delta Live Tables pipelines in notebooks\n###### View errors and diagnostics\n\nAfter a pipeline has been started or validated, any errors are shown inline with a red underline. Hover over an error to see more information.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/dlt-notebook-devex.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Develop and debug Delta Live Tables pipelines in notebooks\n###### View pipeline events\n\nWhen attached to a pipeline, there is a Delta Live Tables event log tab at the bottom of the notebook. \n![Event log](https:\/\/docs.databricks.com\/_images\/dlt-event-log-tab.png)\n\n##### Develop and debug Delta Live Tables pipelines in notebooks\n###### View the pipeline Dataflow Graph\n\nTo view a pipeline\u2019s dataflow graph, use the Delta Live Tables graph tab at the bottom of the notebook. Selecting a node in the graph displays its schema in the right panel. \n![Dataflow Graph](https:\/\/docs.databricks.com\/_images\/dataflow-graph.png)\n\n##### Develop and debug Delta Live Tables pipelines in notebooks\n###### How to access the Delta Live Tables UI from the notebook\n\nTo easily jump to the Delta Live Tables UI, use the menu in the top-right corner of the notebook. 
\n![Open in DLT UI from notebook](https:\/\/docs.databricks.com\/_images\/open-in-dlt-ui.png)\n\n##### Develop and debug Delta Live Tables pipelines in notebooks\n###### Access driver logs and the Spark UI from the notebook\n\nThe driver logs and Spark UI associated with the pipeline being developed can be easily accessed from the notebook\u2019s **View** menu. \n![Access driver logs and Spark UI](https:\/\/docs.databricks.com\/_images\/driver-logs-spark-ui.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/dlt-notebook-devex.html"} +{"content":"# Security and compliance guide\n## Data security and encryption\n#### Encrypt queries, query history, and query results\n\nNote \nThis feature is available with the [Enterprise pricing tier](https:\/\/databricks.com\/product\/aws-pricing). \nYou can encrypt the data at rest for queries and query history. The details vary by the type of object.\n\n#### Encrypt queries, query history, and query results\n##### Use your key to encrypt queries and query history\n\nYou can use your own key from AWS KMS to encrypt the Databricks SQL queries and your query history stored in the Databricks [control plane](https:\/\/docs.databricks.com\/getting-started\/overview.html). \nIf you\u2019ve already configured your own key for a workspace to encrypt data for managed services, then no further action is required. The same customer-managed key for managed services also encrypts the Databricks SQL queries and query history. This key encrypts data stored at rest. It does not affect data in transit or in memory. To learn about this feature and to configure encryption, see [Customer-managed keys for managed services](https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html#managed-services). 
\nDatabricks SQL queries and query history that were stored before you added the key or before May 20, 2021 are not guaranteed to use this key to help protect and control access to the data.\n\n#### Encrypt queries, query history, and query results\n##### Use your key to encrypt query results\n\nYou can use your own key from AWS KMS to encrypt your Databricks SQL query results, which are stored in your workspace storage bucket that you provided during workspace setup. This key encrypts data stored at rest. It does not affect data in transit or in memory. See [customer-managed keys for storage](https:\/\/docs.databricks.com\/security\/keys\/customer-managed-keys.html#workspace-storage).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/keys\/sql-encryption.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Introduction to workspace objects\n\nThis article provides a high-level introduction to Databricks workspace objects. You can create, view, and organize workspace objects in the workspace browser across personas.\n\n#### Introduction to workspace objects\n##### Clusters\n\nDatabricks Data Science & Engineering and Databricks Machine Learning clusters provide a unified platform for various use cases such as running production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning. A cluster is a type of Databricks *compute resource*. Other compute resource types include Databricks [SQL warehouses](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html). 
\nFor detailed information on managing and using clusters, see [Compute](https:\/\/docs.databricks.com\/compute\/index.html).\n\n#### Introduction to workspace objects\n##### Notebooks\n\nA notebook is a web-based interface to documents containing a series of runnable cells (commands) that operate on files and [tables](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#table), [visualizations](https:\/\/docs.databricks.com\/visualizations\/index.html), and narrative text. Commands can be run in sequence, referring to the output of one or more previously run commands. \nNotebooks are one mechanism for running code in Databricks. The other mechanism is [jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html). \nFor detailed information on managing and using notebooks, see [Introduction to Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html).\n\n#### Introduction to workspace objects\n##### Jobs\n\nJobs are one mechanism for running code in Databricks. The other mechanism is notebooks. \nFor detailed information on managing and using jobs, see [Create and run Databricks Jobs](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/workspace-assets.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Introduction to workspace objects\n##### Libraries\n\nA library makes third-party or locally-built code available to notebooks and jobs running on your clusters. \nFor detailed information on managing and using libraries, see [Libraries](https:\/\/docs.databricks.com\/libraries\/index.html).\n\n#### Introduction to workspace objects\n##### Data\n\nYou can import data into a distributed file system mounted into a Databricks workspace and work with it in Databricks notebooks and clusters. You can also use a wide variety of Apache Spark data sources to access data. 
\nFor detailed information on loading data, see [Ingest data into a Databricks lakehouse](https:\/\/docs.databricks.com\/ingestion\/index.html).\n\n#### Introduction to workspace objects\n##### Files\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nIn Databricks Runtime 11.3 LTS and above, you can create and use arbitrary files in the Databricks workspace. Files can be any file type. Common examples include: \n* `.py` files used in custom modules.\n* `.md` files, such as `README.md`.\n* `.csv` or other small data files.\n* `.txt` files.\n* Log files. \nFor detailed information on using files, see [Work with files on Databricks](https:\/\/docs.databricks.com\/files\/index.html). For information about how to use files to modularize your code as you develop with Databricks notebooks, see [Share code between Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/share-code.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/workspace-assets.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Introduction to workspace objects\n##### Git folders\n\nGit folders are Databricks folders whose contents are co-versioned together by syncing them to a remote Git repository. Using Databricks Git folders, you can develop notebooks in Databricks and use a remote Git repository for collaboration and version control. \nFor detailed information on using repos, see [Git integration with Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html).\n\n#### Introduction to workspace objects\n##### Models\n\n*Model* refers to a model registered in MLflow Model Registry. Model Registry is a centralized model store that enables you to manage the full lifecycle of MLflow models. It provides chronological model lineage, model versioning, stage transitions, and model and model version annotations and descriptions. 
\nFor detailed information on managing and using models, see [Manage model lifecycle in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html).\n\n#### Introduction to workspace objects\n##### Experiments\n\nAn MLflow experiment is the primary unit of organization and access control for MLflow machine learning model training runs; all MLflow runs belong to an experiment. Each experiment lets you visualize, search, and compare runs, as well as download run artifacts or metadata for analysis in other tools. \nFor detailed information on managing and using experiments, see [Organize training runs with MLflow experiments](https:\/\/docs.databricks.com\/mlflow\/experiments.html).\n\n#### Introduction to workspace objects\n##### Queries\n\nQueries are SQL statements that allow you to interact with your data. For more information, see [Access and manage saved queries](https:\/\/docs.databricks.com\/sql\/user\/queries\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/workspace-assets.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Introduction to workspace objects\n##### Dashboards\n\nDashboards are presentations of query visualizations and commentary. See [Dashboards](https:\/\/docs.databricks.com\/dashboards\/index.html) or [Legacy dashboards](https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html).\n\n#### Introduction to workspace objects\n##### Alerts\n\nAlerts are notifications that a field returned by a query has reached a threshold. For more information, see [What are Databricks SQL alerts?](https:\/\/docs.databricks.com\/sql\/user\/alerts\/index.html).\n\n#### Introduction to workspace objects\n##### References to workspace objects\n\nHistorically, users were required to include the `\/Workspace` path prefix for some Databricks APIs (`%sh`) but not for others (`%run`, REST API inputs). 
\nUsers can use workspace paths with the `\/Workspace` prefix everywhere. Old references to paths without the `\/Workspace` prefix are redirected and continue to work. We recommend that all workspace paths carry the `\/Workspace` prefix to differentiate them from Volume and DBFS paths. \nThe prerequisite for consistent `\/Workspace` path prefix behavior is this: There cannot be a `\/Workspace` folder at the workspace root level. If you have a `\/Workspace` folder on the root level and want to enable this UX improvement, delete or rename the `\/Workspace` folder you created and contact your Databricks account team.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/workspace-assets.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Introduction to workspace objects\n##### Share a file, folder, or notebook URL\n\nIn your Databricks workspace, URLs to workspace files, notebooks, and folders are in the formats: \n**Workspace file URLs** \n```\nhttps:\/\/<databricks-instance>\/?o=<16-digit-workspace-ID>#files\/<16-digit-object-ID>\n\n``` \n**Notebook URLs** \n```\nhttps:\/\/<databricks-instance>\/?o=<16-digit-workspace-ID>#notebook\/<16-digit-object-ID>\/command\/<16-digit-command-ID>\n\n``` \n**Folder (workspace and Git) URLs** \n```\nhttps:\/\/<databricks-instance>\/browse\/folders\/<16-digit-ID>?o=<16-digit-workspace-ID>\n\n``` \nThese links can break if any folder, file, or notebook in the current path is updated with a Git pull command, or is deleted and recreated with the same name. 
However, you can construct a link based on the workspace path to share with other Databricks users with appropriate access levels by changing it to a link in this format: \n```\nhttps:\/\/<databricks-instance>\/?o=<16-digit-workspace-ID>#workspace\/<full-workspace-path-to-file-or-folder>\n\n``` \nLinks to folders, notebooks, and files can be shared by replacing everything in the URL after `?o=<16-digit-workspace-ID>` with the path to the file, folder, or notebook from the workspace root. If you are sharing a URL to a folder, remove `\/browse\/folders\/<16-digit-ID>` from the original URL as well. \nTo get the file path, open the context menu by right-clicking on the folder, notebook, or file in your workspace that you want to share and select **Copy URL\/path** > **Full path**. Prepend `#workspace` to the file path you just copied, and append the resulting string after the `?o=<16-digit-workspace-ID>` so it matches the URL format above. \n![Selecting the Copy URL path followed by Full path from a workspace folder's context menu.](https:\/\/docs.databricks.com\/_images\/repos-copy-path1.png) \n### URL formulation example #1: Folder URLs \nTo share the workspace folder URL `https:\/\/<databricks-instance>\/browse\/folders\/1111111111111111?o=2222222222222222`, remove the `browse\/folders\/1111111111111111` substring from the URL. Add `#workspace` followed by the path to the folder or workspace object you want to share. \nIn this case, the workspace path is to a folder, `\/Workspace\/Users\/user@example.com\/team-git\/notebooks`. 
After copying the full path from your workspace, you can now construct the shareable link: \n```\nhttps:\/\/<databricks-instance>\/?o=2222222222222222#workspace\/Workspace\/Users\/user@example.com\/team-git\/notebooks\n\n``` \n### URL formulation example #2: Notebook URLs \nTo share the notebook URL `https:\/\/<databricks-instance>\/?o=1111111111111111#notebook\/2222222222222222\/command\/3333333333333333`, remove `#notebook\/2222222222222222\/command\/3333333333333333`. Add `#workspace` followed by the path to the folder or workspace object. \nIn this case, the workspace path points to a notebook, `\/Workspace\/Users\/user@example.com\/team-git\/notebooks\/v1.0\/test-notebook`. After copying the full path from your workspace, you can now construct the shareable link: \n```\nhttps:\/\/<databricks-instance>\/?o=1111111111111111#workspace\/Workspace\/Users\/user@example.com\/team-git\/notebooks\/v1.0\/test-notebook\n\n``` \nNow you have a stable URL for a file, folder, or notebook path to share! For more information about URLs and identifiers, see [Get identifiers for workspace objects](https:\/\/docs.databricks.com\/workspace\/workspace-details.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/workspace-assets.html"} +{"content":"# \n### Visualization deep dive in Scala\n#### Charts and graphs Scala notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/charts-and-graphs-scala.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/charts-and-graphs-scala.html"} +{"content":"# \n### Databricks documentation\n\nDatabricks documentation provides how-to guidance and reference information for data analysts, data scientists, and data engineers solving problems in analytics and AI. The Databricks Data Intelligence Platform enables data teams to collaborate on data stored in the lakehouse. 
See [What is a data lakehouse?](https:\/\/docs.databricks.com\/lakehouse\/index.html).\n\n### Databricks documentation\n#### Try Databricks\n\n* [Get a free trial & set up](https:\/\/docs.databricks.com\/getting-started\/index.html)\n* [Query and visualize data from a notebook](https:\/\/docs.databricks.com\/getting-started\/quick-start.html)\n* [Import and visualize CSV data from a notebook](https:\/\/docs.databricks.com\/getting-started\/import-visualize-data.html)\n* [Build a basic ETL pipeline](https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html)\n* [Build a simple lakehouse analytics pipeline](https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html)\n* [Free training](https:\/\/docs.databricks.com\/getting-started\/free-training.html)\n\n### Databricks documentation\n#### What do you want to do?\n\n* [Data science & engineering](https:\/\/docs.databricks.com\/workspace-index.html)\n* [Machine learning](https:\/\/docs.databricks.com\/machine-learning\/index.html)\n* [SQL queries & visualizations](https:\/\/docs.databricks.com\/sql\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/index.html"} +{"content":"# \n### Databricks documentation\n#### Manage Databricks\n\n* [Account & workspace administration](https:\/\/docs.databricks.com\/admin\/index.html)\n* [Security & compliance](https:\/\/docs.databricks.com\/security\/index.html)\n* [Data governance](https:\/\/docs.databricks.com\/data-governance\/index.html)\n\n### Databricks documentation\n#### Reference Guides\n\n* [API reference](https:\/\/docs.databricks.com\/reference\/api.html)\n* [SQL language reference](https:\/\/docs.databricks.com\/sql\/language-manual\/index.html)\n* [Error handling and error messages](https:\/\/docs.databricks.com\/error-messages\/index.html)\n\n### Databricks documentation\n#### Resources\n\n* [Release notes](https:\/\/docs.databricks.com\/release-notes\/index.html)\n* [Other 
resources](https:\/\/docs.databricks.com\/resources\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/index.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Tutorials: Implement ETL workflows with Delta Live Tables\n##### Tutorial: Run your first Delta Live Tables pipeline\n\nThis tutorial shows you how to configure a Delta Live Tables pipeline from code in a Databricks notebook and run the pipeline by triggering a pipeline update. This tutorial includes an example pipeline to ingest and process a sample dataset with example code using the [Python](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html) and [SQL](https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html) interfaces. You can also use the instructions in this tutorial to create a pipeline with any notebooks with properly-defined Delta Live Tables syntax. \nYou can configure Delta Live Tables pipelines and trigger updates using the Databricks workspace UI or automated tooling options such as the API, CLI, Databricks Asset Bundles, or as a task in a Databricks workflow. To familiarize yourself with the functionality and features of Delta Live Tables, Databricks recommends first using the UI to create and run pipelines. Additionally, when you configure a pipeline in the UI, Delta Live Tables generates a JSON configuration for your pipeline that can be used to implement your programmatic workflows. \nTo demonstrate Delta Live Tables functionality, the examples in this tutorial download a publicly available dataset. However, Databricks has several ways to connect to data sources and ingest data that pipelines implementing real-world use cases will use. 
See [Ingest data with Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html#ingestion).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Tutorials: Implement ETL workflows with Delta Live Tables\n##### Tutorial: Run your first Delta Live Tables pipeline\n###### Requirements\n\n* To start a pipeline, you must have [cluster creation permission](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions) or access to a cluster policy defining a Delta Live Tables cluster. The Delta Live Tables runtime creates a cluster before it runs your pipeline and fails if you don\u2019t have the correct permission.\n* To use the examples in this tutorial, your workspace must have [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) enabled.\n* You must have the following permissions in Unity Catalog: \n+ `READ VOLUME` and `WRITE VOLUME`, or `ALL PRIVILEGES`, for the `my-volume` volume.\n+ `USE SCHEMA` or `ALL PRIVILEGES` for the `default` schema.\n+ `USE CATALOG` or `ALL PRIVILEGES` for the `main` catalog.To set these permissions, see your Databricks administrator or [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html).\n* The examples in this tutorial use a Unity Catalog [volume](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html) to store sample data. To use these examples, create a volume and use that volume\u2019s catalog, schema, and volume names to set the volume path used by the examples. \nNote \nIf your workspace does not have Unity Catalog enabled, [notebooks](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#non-uc-notebooks) with examples that do not require Unity Catalog are attached to this article. 
To use these examples, select `Hive metastore` as the storage option when you create the pipeline.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Tutorials: Implement ETL workflows with Delta Live Tables\n##### Tutorial: Run your first Delta Live Tables pipeline\n###### Where do you run Delta Live Tables queries?\n\nDelta Live Tables queries are primarily implemented in Databricks notebooks, but Delta Live Tables is not designed to be run interactively in notebook cells. Executing a cell that contains Delta Live Tables syntax in a Databricks notebook results in an error message. To run your queries, you must configure your notebooks as part of a pipeline. \nImportant \n* You cannot rely on the cell-by-cell execution ordering of notebooks when writing queries for Delta Live Tables. Delta Live Tables evaluates and runs all code defined in notebooks but has a different execution model than a notebook **Run all** command.\n* You cannot mix languages in a single Delta Live Tables source code file. For example, a notebook can contain only Python queries or SQL queries. If you must use multiple languages in a pipeline, use multiple language-specific notebooks or files in the pipeline. \nYou can also use Python code stored in files. For example, you can create a Python module that can be imported into your Python pipelines or define Python user-defined functions (UDFs) to use in SQL queries. To learn about importing Python modules, see [Import Python modules from Git folders or workspace files](https:\/\/docs.databricks.com\/delta-live-tables\/import-workspace-files.html). 
To learn about using Python UDFs, see [User-defined scalar functions - Python](https:\/\/docs.databricks.com\/udf\/python.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Tutorials: Implement ETL workflows with Delta Live Tables\n##### Tutorial: Run your first Delta Live Tables pipeline\n###### Example: Ingest and process New York baby names data\n\nThe example in this article uses a publicly available dataset that contains records of [New York State baby names](https:\/\/health.data.ny.gov\/Health\/Baby-Names-Beginning-2007\/jxy9-yhdk\/about_data). These examples demonstrate using a Delta Live Tables pipeline to: \n* Read raw CSV data from a publicly available dataset into a table.\n* Read the records from the raw data table and use Delta Live Tables [expectations](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html) to create a new table that contains cleansed data.\n* Use the cleansed records as input to Delta Live Tables queries that create derived datasets. \nThis code demonstrates a simplified example of the medallion architecture. See [What is the medallion lakehouse architecture?](https:\/\/docs.databricks.com\/lakehouse\/medallion.html). \nImplementations of this example are provided for the [Python](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#python-example) and [SQL](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#sql-example) interfaces. 
You can follow the steps to create new notebooks that contain the example code, or you can skip ahead to [Create a pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#create-pipeline) and use one of the [notebooks](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#notebooks) provided on this page.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Tutorials: Implement ETL workflows with Delta Live Tables\n##### Tutorial: Run your first Delta Live Tables pipeline\n###### Implement a Delta Live Tables pipeline with Python\n\nPython code that creates Delta Live Tables datasets must return DataFrames, familiar to users with PySpark or Pandas for Spark experience. For users unfamiliar with DataFrames, Databricks recommends using the SQL interface. See [Implement a Delta Live Tables pipeline with SQL](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#sql-example). \nAll Delta Live Tables Python APIs are implemented in the `dlt` module. Your Delta Live Tables pipeline code implemented with Python must explicitly import the `dlt` module at the top of Python notebooks and files. Delta Live Tables differs from many Python scripts in a key way: you do not call the functions that perform data ingestion and transformation to create Delta Live Tables datasets. Instead, Delta Live Tables interprets the decorator functions from the `dlt` module in all files loaded into a pipeline and builds a dataflow graph. \nTo implement the example in this tutorial, copy and paste the following Python code into a new Python notebook. You should add each example code snippet to its own cell in the notebook in the order described. To review options for creating notebooks, see [Create a notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-notebook). 
\nNote \nWhen you create a pipeline with the Python interface, by default, table names are defined by function names. For example, the following Python example creates three tables named `baby_names_raw`, `baby_names_prepared`, and `top_baby_names_2021`. You can override the table name using the `name` parameter. See [Create a Delta Live Tables materialized view or streaming table](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html#create-table-python). \n### Import the Delta Live Tables module \nAll Delta Live Tables Python APIs are implemented in the `dlt` module. Explicitly import the `dlt` module at the top of Python notebooks and files. \nThe following example shows this import, alongside import statements for `pyspark.sql.functions`. \n```\nimport dlt\nfrom pyspark.sql.functions import *\n\n``` \n### Download the data \nTo get the data for this example, you download a CSV file and store it in the volume as follows: \n```\nimport os\n\nos.environ[\"UNITY_CATALOG_VOLUME_PATH\"] = \"\/Volumes\/<catalog-name>\/<schema-name>\/<volume-name>\/\"\nos.environ[\"DATASET_DOWNLOAD_URL\"] = \"https:\/\/health.data.ny.gov\/api\/views\/jxy9-yhdk\/rows.csv\"\nos.environ[\"DATASET_DOWNLOAD_FILENAME\"] = \"rows.csv\"\n\ndbutils.fs.cp(f\"{os.environ.get('DATASET_DOWNLOAD_URL')}\", f\"{os.environ.get('UNITY_CATALOG_VOLUME_PATH')}{os.environ.get('DATASET_DOWNLOAD_FILENAME')}\")\n\n``` \nReplace `<catalog-name>`, `<schema-name>`, and `<volume-name>` with the catalog, schema, and volume names for a Unity Catalog volume. \n### Create a table from files in object storage \nDelta Live Tables supports loading data from all formats supported by Databricks. See [Data format options](https:\/\/docs.databricks.com\/query\/formats\/index.html). \nThe `@dlt.table` decorator tells Delta Live Tables to create a table that contains the result of a `DataFrame` returned by a function. 
Add the `@dlt.table` decorator before any Python function definition that returns a Spark DataFrame to register a new table in Delta Live Tables. The following example demonstrates using the function name as the table name and adding a descriptive comment to the table: \n```\n@dlt.table(\n    comment=\"Popular baby first names in New York. This data was ingested from the New York State Department of Health.\"\n)\ndef baby_names_raw():\n    df = spark.read.csv(f\"{os.environ.get('UNITY_CATALOG_VOLUME_PATH')}{os.environ.get('DATASET_DOWNLOAD_FILENAME')}\", header=True, inferSchema=True)\n    df_renamed_column = df.withColumnRenamed(\"First Name\", \"First_Name\")\n    return df_renamed_column\n\n``` \n### Add a table from an upstream dataset in the pipeline \nYou can use `dlt.read()` to read data from other datasets declared in your current Delta Live Tables pipeline. Declaring new tables in this way creates a dependency that Delta Live Tables automatically resolves before executing updates. The following code also includes examples of monitoring and enforcing data quality with expectations. See [Manage data quality with Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html). \n```\n@dlt.table(\n    comment=\"New York popular baby first name data cleaned and prepared for analysis.\"\n)\n@dlt.expect(\"valid_first_name\", \"First_Name IS NOT NULL\")\n@dlt.expect_or_fail(\"valid_count\", \"Count > 0\")\ndef baby_names_prepared():\n    return (\n        dlt.read(\"baby_names_raw\")\n        .withColumnRenamed(\"Year\", \"Year_Of_Birth\")\n        .select(\"Year_Of_Birth\", \"First_Name\", \"Count\")\n    )\n\n``` \n### Create a table with enriched data views \nBecause Delta Live Tables processes updates to pipelines as a series of dependency graphs, you can declare highly enriched views that power dashboards, BI, and analytics by declaring tables with specific business logic. \nTables in Delta Live Tables are equivalent conceptually to materialized views. 
Whereas traditional views on Spark run logic each time the view is queried, a Delta Live Tables table stores the most recent version of query results in data files. Because Delta Live Tables manages updates for all datasets in a pipeline, you can schedule pipeline updates to match latency requirements for materialized views and know that queries against these tables contain the most recent version of data available. \nThe table defined by the following code demonstrates the conceptual similarity to a materialized view derived from upstream data in your pipeline: \n```\n@dlt.table(\n    comment=\"A table summarizing counts of the top baby names for New York for 2021.\"\n)\ndef top_baby_names_2021():\n    return (\n        dlt.read(\"baby_names_prepared\")\n        .filter(expr(\"Year_Of_Birth == 2021\"))\n        .groupBy(\"First_Name\")\n        .agg(sum(\"Count\").alias(\"Total_Count\"))\n        .sort(desc(\"Total_Count\"))\n        .limit(10)\n    )\n\n``` \nTo configure a pipeline that uses the notebook, see [Create a pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#create-pipeline).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Tutorials: Implement ETL workflows with Delta Live Tables\n##### Tutorial: Run your first Delta Live Tables pipeline\n###### Implement a Delta Live Tables pipeline with SQL\n\nDatabricks recommends Delta Live Tables with SQL as the preferred way for SQL users to build new ETL, ingestion, and transformation pipelines on Databricks. The SQL interface for Delta Live Tables extends standard Spark SQL with many new keywords, constructs, and table-valued functions. These additions to standard SQL allow users to declare dependencies between datasets and deploy production-grade infrastructure without learning new tooling or additional concepts. 
\nFor users familiar with Spark DataFrames and who need support for more extensive testing and operations that are difficult to implement with SQL, such as metaprogramming operations, Databricks recommends using the Python interface. See [Example: Ingest and process New York baby names data](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#python-example). \n### Download the data \nTo get the data for this example, copy the following code, paste it into a new notebook, and then run the notebook. To review options for creating notebooks, see [Create a notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-notebook). \n```\n%sh\nwget -O \"\/Volumes\/<catalog-name>\/<schema-name>\/<volume-name>\/babynames.csv\" \"https:\/\/health.data.ny.gov\/api\/views\/jxy9-yhdk\/rows.csv\"\n\n``` \nReplace `<catalog-name>`, `<schema-name>`, and `<volume-name>` with the catalog, schema, and volume names for a Unity Catalog volume. \n### Create a table from files in Unity Catalog \nFor the rest of this example, copy the following SQL snippets and paste them into a new SQL notebook, separate from the notebook in the previous section. You should add each example SQL snippet to its own cell in the notebook in the order described. \nDelta Live Tables supports loading data from all formats supported by Databricks. See [Data format options](https:\/\/docs.databricks.com\/query\/formats\/index.html). \nAll Delta Live Tables SQL statements use `CREATE OR REFRESH` syntax and semantics. When you update a pipeline, Delta Live Tables determines whether the logically correct result for the table can be accomplished through incremental processing or if full recomputation is required. \nThe following example creates a table by loading data from the CSV file stored in the Unity Catalog volume: \n```\nCREATE OR REFRESH LIVE TABLE baby_names_sql_raw\nCOMMENT \"Popular baby first names in New York. 
This data was ingested from the New York State Department of Health.\"\nAS SELECT Year, `First Name` AS First_Name, County, Sex, Count FROM read_files(\n'\/Volumes\/<catalog-name>\/<schema-name>\/<volume-name>\/babynames.csv',\nformat => 'csv',\nheader => true,\nmode => 'FAILFAST')\n\n``` \nReplace `<catalog-name>`, `<schema-name>`, and `<volume-name>` with the catalog, schema, and volume names for a Unity Catalog volume. \n### Add a table from an upstream dataset to the pipeline \nYou can use the `live` virtual schema to query data from other datasets declared in your current Delta Live Tables pipeline. Declaring new tables in this way creates a dependency that Delta Live Tables automatically resolves before executing updates. The `live` schema is a custom keyword implemented in Delta Live Tables that can be substituted for a target schema if you want to publish your datasets. See [Use Unity Catalog with your Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html) and [Publish data from Delta Live Tables to the Hive metastore](https:\/\/docs.databricks.com\/delta-live-tables\/publish.html). \nThe following code also includes examples of monitoring and enforcing data quality with expectations. See [Manage data quality with Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/expectations.html). 
\n```\nCREATE OR REFRESH LIVE TABLE baby_names_sql_prepared(\nCONSTRAINT valid_first_name EXPECT (First_Name IS NOT NULL),\nCONSTRAINT valid_count EXPECT (Count > 0) ON VIOLATION FAIL UPDATE\n)\nCOMMENT \"New York popular baby first name data cleaned and prepared for analysis.\"\nAS SELECT\nYear AS Year_Of_Birth,\nFirst_Name,\nCount\nFROM live.baby_names_sql_raw;\n\n``` \n### Create an enriched data view \nBecause Delta Live Tables processes updates to pipelines as a series of dependency graphs, you can declare highly enriched views that power dashboards, BI, and analytics by declaring tables with specific business logic. \nLive tables are equivalent conceptually to materialized views. Whereas traditional views on Spark run logic each time the view is queried, live tables store the most recent version of query results in data files. Because Delta Live Tables manages updates for all datasets in a pipeline, you can schedule pipeline updates to match latency requirements for materialized views and know that queries against these tables contain the most recent version of data available. 
\nThe following code creates an enriched materialized view of upstream data: \n```\nCREATE OR REFRESH LIVE TABLE top_baby_names_sql_2021\nCOMMENT \"A table summarizing counts of the top baby names for New York for 2021.\"\nAS SELECT\nFirst_Name,\nSUM(Count) AS Total_Count\nFROM live.baby_names_sql_prepared\nWHERE Year_Of_Birth = 2021\nGROUP BY First_Name\nORDER BY Total_Count DESC\nLIMIT 10;\n\n``` \nTo configure a pipeline that uses the notebook, continue to [Create a pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#create-pipeline).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Tutorials: Implement ETL workflows with Delta Live Tables\n##### Tutorial: Run your first Delta Live Tables pipeline\n###### Create a pipeline\n\nDelta Live Tables creates pipelines by resolving dependencies defined in notebooks or files (called *source code* or *libraries*) using Delta Live Tables syntax. Each source code file can only contain one language, but you can mix libraries of different languages in your pipeline. \n1. Click **Delta Live Tables** in the sidebar and click **Create Pipeline**.\n2. Give the pipeline a name.\n3. (Optional) Select a [product edition](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#editions).\n4. Select **Triggered** for **Pipeline Mode**.\n5. Configure one or more notebooks containing the source code for the pipeline. In the **Paths** textbox, enter the path to a notebook or click ![File Picker Icon](https:\/\/docs.databricks.com\/_images\/file-picker.png) to select a notebook.\n6. Select a destination for datasets published by the pipeline, either the Hive metastore or Unity Catalog. See [Publish datasets](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#publish). 
\n* **Hive metastore**: \n+ (Optional) Enter a **Storage location** for output data from the pipeline. The system uses a default location if you leave **Storage location** empty.\n+ (Optional) Specify a **Target schema** to publish your dataset to the Hive metastore.\n* **Unity Catalog**: Specify a **Catalog** and a **Target schema** to publish your dataset to Unity Catalog.\n7. (Optional) Configure compute settings for the pipeline. To learn about options for compute settings, see [Configure pipeline settings for Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html).\n8. (Optional) Click **Add notification** to configure one or more email addresses to receive notifications for pipeline events. See [Add email notifications for pipeline events](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#email-notifications).\n9. (Optional) Configure advanced settings for the pipeline. To learn about options for advanced settings, see [Configure pipeline settings for Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html).\n10. Click **Create**. \nThe system displays the **Pipeline Details** page after you click **Create**. You can also access your pipeline by clicking the pipeline name in the **Delta Live Tables** tab.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Tutorials: Implement ETL workflows with Delta Live Tables\n##### Tutorial: Run your first Delta Live Tables pipeline\n###### Start a pipeline update\n\nTo start an update for a pipeline, click the ![Delta Live Tables Start Icon](https:\/\/docs.databricks.com\/_images\/dlt-start-button.png) button in the top panel. The system returns a message confirming that your pipeline is starting. \nAfter successfully starting the update, the Delta Live Tables system: \n1. 
Starts a cluster using a cluster configuration created by the Delta Live Tables system. You can also specify a custom [cluster configuration](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#cluster-config).\n2. Creates any tables that don\u2019t exist and ensures that the schema is correct for any existing tables.\n3. Updates tables with the latest data available.\n4. Shuts down the cluster when the update is complete. \nNote \nExecution mode is set to **Production** by default, which deploys ephemeral compute resources for each update. You can use **Development** mode to change this behavior, allowing the same compute resources to be used for multiple pipeline updates during development and testing. See [Development and production modes](https:\/\/docs.databricks.com\/delta-live-tables\/updates.html#optimize-execution).\n\n##### Tutorial: Run your first Delta Live Tables pipeline\n###### Publish datasets\n\nYou can make Delta Live Tables datasets available for querying by publishing tables to the Hive metastore or Unity Catalog. If you do not specify a target for publishing data, tables created in Delta Live Tables pipelines can only be accessed by other operations in that same pipeline. See [Publish data from Delta Live Tables to the Hive metastore](https:\/\/docs.databricks.com\/delta-live-tables\/publish.html) and [Use Unity Catalog with your Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Tutorials: Implement ETL workflows with Delta Live Tables\n##### Tutorial: Run your first Delta Live Tables pipeline\n###### Example source code notebooks\n\nYou can import these notebooks into a Databricks workspace and use them to deploy a Delta Live Tables pipeline. 
See [Create a pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#create-pipeline). \n### Get started with Delta Live Tables Python notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/dlt-babynames-python.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n### Get started with Delta Live Tables SQL notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/dlt-babynames-sql.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Tutorials: Implement ETL workflows with Delta Live Tables\n##### Tutorial: Run your first Delta Live Tables pipeline\n###### Example source code notebooks for workspaces without Unity Catalog\n\nYou can import these notebooks into a Databricks workspace without Unity Catalog enabled and use them to deploy a Delta Live Tables pipeline. See [Create a pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html#create-pipeline). 
\n### Get started with Delta Live Tables Python notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/dlt-wikipedia-python.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n### Get started with Delta Live Tables SQL notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/dlt-wikipedia-sql.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/compute.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Compute settings for the Databricks JDBC Driver\n\nThis article describes how to configure Databricks compute resource settings for the [Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/index.html). \nThe driver requires the following compute resource configuration settings: \n| Setting | Description |\n| --- | --- |\n| `Host` | The Databricks compute resource\u2019s **Server Hostname** value. |\n| `Port` | 443 |\n| `HTTPPath` | The Databricks compute resource\u2019s **HTTP Path** value. |\n| `SSL` | 1 |\n| `ThriftTransport` | 2 |\n| `Schema` (optional) | The name of the default schema to use. |\n| `Catalog` (optional) | The name of the default catalog to use. 
| \nA JDBC connection URL that uses the preceding settings has the following format: \n```\njdbc:databricks:\/\/<server-hostname>:443;httpPath=<http-path>[;<setting1>=<value1>;<setting2>=<value2>;<settingN>=<valueN>]\n\n``` \nJava code that uses the preceding settings has the following format: \n```\n\/\/ ...\nString url = \"jdbc:databricks:\/\/<server-hostname>:443\";\nProperties p = new java.util.Properties();\np.put(\"httpPath\", \"<http-path>\");\np.put(\"<setting1>\", \"<value1>\");\np.put(\"<setting2>\", \"<value2>\");\np.put(\"<settingN>\", \"<valueN>\");\n\/\/ ...\nConnection conn = DriverManager.getConnection(url, p);\n\/\/ ...\n\n``` \n* For a complete Java code example that you can adapt as needed, see the beginning of [Authentication settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html).\n* Replace `<setting>` and `<value>` as needed for each of the target Databricks [authentication settings](https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html) and any special or advanced [driver capability settings](https:\/\/docs.databricks.com\/integrations\/jdbc\/capability.html).\n* To get the values for `<server-hostname>` and `<http-path>`, see the following procedures. \nTo get the connection details for a Databricks [cluster](https:\/\/docs.databricks.com\/compute\/configure.html), do the following: \n1. Log in to your Databricks workspace.\n2. In the sidebar, click **Compute**.\n3. In the list of available clusters, click the target cluster\u2019s name.\n4. On the **Configuration** tab, expand **Advanced options**.\n5. Click the **JDBC\/ODBC** tab.\n6. Copy the connection details that you need, such as **Server Hostname**, **Port**, and **HTTP Path**. \nTo get the connection details for a Databricks SQL [warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html), do the following: \n1. Log in to your Databricks workspace.\n2. 
In the sidebar, click **SQL > SQL Warehouses**.\n3. In the list of available warehouses, click the target warehouse\u2019s name.\n4. On the **Connection Details** tab, copy the connection details that you need, such as **Server hostname**, **Port**, and **HTTP path**. \nTo use the driver with a Databricks [cluster](https:\/\/docs.databricks.com\/compute\/configure.html), there are two [permissions](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions) that the calling user or service principal needs when connecting to or restarting the cluster: \n* CAN ATTACH TO permission to connect to the running cluster.\n* CAN RESTART permission to automatically trigger the cluster to start if its state is terminated when connecting. \nTo use the driver with a Databricks SQL [warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html), the calling user or service principal needs CAN USE [permission](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#sql-warehouses). The Databricks SQL warehouse automatically starts if it was stopped. \nNote \nDatabricks SQL warehouses are recommended when using Microsoft Power BI in **DirectQuery** mode.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/compute.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write streaming Avro data\n\n[Apache Avro](https:\/\/avro.apache.org\/) is a commonly used data serialization system in the streaming world. A typical solution is to put data in Avro format in Apache Kafka, metadata in [Confluent Schema Registry](https:\/\/docs.confluent.io\/current\/schema-registry\/docs\/index.html), and then run queries with a streaming framework that connects to both Kafka and Schema Registry. 
\nDatabricks supports the `from_avro` and `to_avro` [functions](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-avro.html#to_avro-and-from_avro) to build streaming pipelines with Avro data in Kafka and metadata in Schema Registry. The function `to_avro` encodes a column as binary in Avro format and `from_avro` decodes Avro binary data into a column. Both functions transform one column to another column, and the input\/output SQL data type can be a complex type or a primitive type. \nNote \nThe `from_avro` and `to_avro` functions: \n* Are available in [Python](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-avro.html#to_avro-and-from_avro), Scala, and Java.\n* Can be passed to SQL functions in both batch and streaming queries. \nAlso see [Avro file data source](https:\/\/docs.databricks.com\/query\/formats\/avro.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/avro-dataframe.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write streaming Avro data\n##### Manually specified schema example\n\nSimilar to [from\\_json](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/from_json.html) and [to\\_json](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/to_json.html), you can use `from_avro` and `to_avro` with any binary column. 
You can specify the Avro schema manually, as in the following example: \n```\nimport org.apache.spark.sql.avro.functions._\nimport org.apache.avro.SchemaBuilder\n\n\/\/ When reading the key and value of a Kafka topic, decode the\n\/\/ binary (Avro) data into structured data.\n\/\/ The schema of the resulting DataFrame is: <key: string, value: int>\nval df = spark\n.readStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", servers)\n.option(\"subscribe\", \"t\")\n.load()\n.select(\nfrom_avro($\"key\", SchemaBuilder.builder().stringType()).as(\"key\"),\nfrom_avro($\"value\", SchemaBuilder.builder().intType()).as(\"value\"))\n\n\/\/ Convert structured data to binary from string (key column) and\n\/\/ int (value column) and save to a Kafka topic.\ndataDF\n.select(\nto_avro($\"key\").as(\"key\"),\nto_avro($\"value\").as(\"value\"))\n.writeStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", servers)\n.option(\"topic\", \"t\")\n.start()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/avro-dataframe.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write streaming Avro data\n##### jsonFormatSchema example\n\nYou can also specify a schema as a JSON string. For example, if `\/tmp\/user.avsc` is: \n```\n{\n\"namespace\": \"example.avro\",\n\"type\": \"record\",\n\"name\": \"User\",\n\"fields\": [\n{\"name\": \"name\", \"type\": \"string\"},\n{\"name\": \"favorite_color\", \"type\": [\"string\", \"null\"]}\n]\n}\n\n``` \nYou can create a JSON string: \n```\nfrom pyspark.sql.avro.functions import from_avro, to_avro\n\njsonFormatSchema = open(\"\/tmp\/user.avsc\", \"r\").read()\n\n``` \nThen use the schema in `from_avro`: \n```\n# 1. Decode the Avro data into a struct.\n# 2. Filter by column \"favorite_color\".\n# 3. 
Encode the column \"name\" in Avro format.\n\noutput = df\\\n.select(from_avro(\"value\", jsonFormatSchema).alias(\"user\"))\\\n.where('user.favorite_color == \"red\"')\\\n.select(to_avro(\"user.name\").alias(\"value\"))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/avro-dataframe.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write streaming Avro data\n##### Example with Schema Registry\n\nIf your cluster has a Schema Registry service, `from_avro` can work with it so that you don\u2019t need to specify the Avro schema manually. \nThe following example demonstrates reading a Kafka topic \u201ct\u201d, assuming the key and value are already registered in Schema Registry as subjects \u201ct-key\u201d and \u201ct-value\u201d of types `STRING` and `INT`: \n```\nimport org.apache.spark.sql.avro.functions._\n\nval schemaRegistryAddr = \"https:\/\/myhost:8081\"\nval df = spark\n.readStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", servers)\n.option(\"subscribe\", \"t\")\n.load()\n.select(\nfrom_avro($\"key\", \"t-key\", schemaRegistryAddr).as(\"key\"),\nfrom_avro($\"value\", \"t-value\", schemaRegistryAddr).as(\"value\"))\n\n``` \nFor `to_avro`, the default output Avro schema might not match the schema of the target subject in the Schema Registry service for the following reasons: \n* The mapping from Spark SQL type to Avro schema is not one-to-one. See [Supported types for Spark SQL -> Avro conversion](https:\/\/docs.databricks.com\/query\/formats\/avro.html#supported-types-for-spark-sql---avro-conversion).\n* If the converted output Avro schema is of record type, the record name is `topLevelRecord` and there is no namespace by default. 
\nIf the default output schema of `to_avro` matches the schema of the target subject, you can do the following: \n```\n\/\/ The converted data is saved to Kafka as a Kafka topic \"t\".\ndataDF\n.select(\nto_avro($\"key\", lit(\"t-key\"), schemaRegistryAddr).as(\"key\"),\nto_avro($\"value\", lit(\"t-value\"), schemaRegistryAddr).as(\"value\"))\n.writeStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", servers)\n.option(\"topic\", \"t\")\n.start()\n\n``` \nOtherwise, you must provide the schema of the target subject in the `to_avro` function: \n```\n\/\/ The Avro schema of subject \"t-value\" in JSON string format.\nval avroSchema = ...\n\/\/ The converted data is saved to Kafka as a Kafka topic \"t\".\ndataDF\n.select(\nto_avro($\"key\", lit(\"t-key\"), schemaRegistryAddr).as(\"key\"),\nto_avro($\"value\", lit(\"t-value\"), schemaRegistryAddr, avroSchema).as(\"value\"))\n.writeStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", servers)\n.option(\"topic\", \"t\")\n.start()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/avro-dataframe.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write streaming Avro data\n##### Authenticate to an external Confluent Schema Registry\n\nIn Databricks Runtime 12.2 LTS and above, you can authenticate to an external Confluent Schema Registry. The following examples demonstrate how to configure your schema registry options to include auth credentials and API keys. 
\n```\nimport org.apache.spark.sql.avro.functions._\nimport org.apache.spark.sql.functions.lit\nimport scala.collection.JavaConverters._\n\nval schemaRegistryAddr = \"https:\/\/confluent-schema-registry-endpoint\"\nval schemaRegistryOptions = Map(\n\"confluent.schema.registry.basic.auth.credentials.source\" -> \"USER_INFO\",\n\"confluent.schema.registry.basic.auth.user.info\" -> \"confluentApiKey:confluentApiSecret\")\n\nval df = spark\n.readStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", servers)\n.option(\"subscribe\", \"t\")\n.load()\n.select(\nfrom_avro($\"key\", \"t-key\", schemaRegistryAddr, schemaRegistryOptions.asJava).as(\"key\"),\nfrom_avro($\"value\", \"t-value\", schemaRegistryAddr, schemaRegistryOptions.asJava).as(\"value\"))\n\n\/\/ The converted data is saved to Kafka as a Kafka topic \"t\".\ndataDF\n.select(\nto_avro($\"key\", lit(\"t-key\"), schemaRegistryAddr, schemaRegistryOptions.asJava).as(\"key\"),\nto_avro($\"value\", lit(\"t-value\"), schemaRegistryAddr, schemaRegistryOptions.asJava).as(\"value\"))\n.writeStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", servers)\n.option(\"topic\", \"t\")\n.start()\n\n\/\/ The Avro schema of subject \"t-value\" in JSON string format.\nval avroSchema = ...\n\n\/\/ The converted data is saved to Kafka as a Kafka topic \"t\".\ndataDF\n.select(\nto_avro($\"key\", lit(\"t-key\"), schemaRegistryAddr, schemaRegistryOptions.asJava).as(\"key\"),\nto_avro($\"value\", lit(\"t-value\"), schemaRegistryAddr, schemaRegistryOptions.asJava, avroSchema).as(\"value\"))\n.writeStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", servers)\n.option(\"topic\", \"t\")\n.start()\n\n``` \n```\nfrom pyspark.sql.functions import col, lit\nfrom pyspark.sql.avro.functions import from_avro, to_avro\n\nschema_registry_address = \"https:\/\/confluent-schema-registry-endpoint\"\nschema_registry_options = {\n\"confluent.schema.registry.basic.auth.credentials.source\": 'USER_INFO',\n\"confluent.schema.registry.basic.auth.user.info\": 
f\"{key}:{secret}\"\n}\n\ndf = (spark\n.readStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", servers)\n.option(\"subscribe\", \"t\")\n.load()\n.select(\nfrom_avro(\ndata = col(\"key\"),\noptions = schema_registry_options,\nsubject = \"t-key\",\nschemaRegistryAddress = schema_registry_address\n).alias(\"key\"),\nfrom_avro(\ndata = col(\"value\"),\noptions = schema_registry_options,\nsubject = \"t-value\",\nschemaRegistryAddress = schema_registry_address\n).alias(\"value\")\n)\n)\n\n# The converted data is saved to Kafka as a Kafka topic \"t\".\n(data_df\n.select(\nto_avro(\ndata = col(\"key\"),\nsubject = lit(\"t-key\"),\nschemaRegistryAddress = schema_registry_address,\noptions = schema_registry_options\n).alias(\"key\"),\nto_avro(\ndata = col(\"value\"),\nsubject = lit(\"t-value\"),\nschemaRegistryAddress = schema_registry_address,\noptions = schema_registry_options\n).alias(\"value\")\n)\n.writeStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", servers)\n.option(\"topic\", \"t\")\n.start()\n)\n\n# The Avro schema of subject \"t-value\" in JSON string format.\navro_schema = ...\n\n# The converted data is saved to Kafka as a Kafka topic \"t\".\n(data_df\n.select(\nto_avro(\ndata = col(\"key\"),\nsubject = lit(\"t-key\"),\nschemaRegistryAddress = schema_registry_address,\noptions = schema_registry_options\n).alias(\"key\"),\nto_avro(\ndata = col(\"value\"),\nsubject = lit(\"t-value\"),\nschemaRegistryAddress = schema_registry_address,\noptions = schema_registry_options,\njsonFormatSchema = avro_schema).alias(\"value\"))\n.writeStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", servers)\n.option(\"topic\", \"t\")\n.start()\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/avro-dataframe.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write streaming Avro data\n##### Use truststore and keystore files in Unity Catalog volumes\n\nIn Databricks Runtime 14.3 LTS and 
above, you can use truststore and keystore files in Unity Catalog volumes to authenticate to a Confluent Schema Registry. Update the configuration in the [previous example](https:\/\/docs.databricks.com\/structured-streaming\/avro-dataframe.html#auth-registry) using the following syntax: \n```\nval schemaRegistryAddr = \"https:\/\/confluent-schema-registry-endpoint\"\nval schemaRegistryOptions = Map(\n\"confluent.schema.registry.ssl.truststore.location\" -> \"\/Volumes\/<catalog_name>\/<schema_name>\/<volume_name>\/truststore.jks\",\n\"confluent.schema.registry.ssl.truststore.password\" -> \"truststorePassword\",\n\"confluent.schema.registry.ssl.keystore.location\" -> \"\/Volumes\/<catalog_name>\/<schema_name>\/<volume_name>\/keystore.jks\",\n\"confluent.schema.registry.ssl.keystore.password\" -> \"keystorePassword\",\n\"confluent.schema.registry.ssl.key.password\" -> \"keyPassword\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/avro-dataframe.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write streaming Avro data\n##### Use schema evolution mode with `from_avro`\n\nIn Databricks Runtime 14.2 and above, you can use schema evolution mode with `from_avro`. Enabling schema evolution mode causes the job to throw an `UnknownFieldException` after detecting schema evolution. Databricks recommends configuring jobs with schema evolution mode to automatically restart on task failure. See [Configure Structured Streaming jobs to restart streaming queries on failure](https:\/\/docs.databricks.com\/structured-streaming\/query-recovery.html#restart-job). \nSchema evolution is useful if you expect the schema of your source data to evolve over time and want to ingest all fields from your data source. If your queries already explicitly specify which fields to query in your data source, added fields are ignored regardless of schema evolution. \nUse the `avroSchemaEvolutionMode` option to enable schema evolution. 
The following table describes the options for schema evolution mode: \n| Option | Behavior |\n| --- | --- |\n| `none` | **Default**. Ignores schema evolution and the job continues. |\n| `restart` | Throws an `UnknownFieldException` when detecting schema evolution. Requires a job restart. | \nNote \nYou can change this configuration between streaming jobs and reuse the same checkpoint. Disabling schema evolution can result in dropped columns.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/avro-dataframe.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write streaming Avro data\n##### Configure the parse mode\n\nYou can configure the parse mode to determine whether you want to fail or emit null records when schema evolution mode is disabled and the schema evolves in a non-backward compatible way. With default settings, `from_avro` fails when it observes incompatible schema changes. \nUse the `mode` option to specify parse mode. The following table describes the option for parse mode: \n| Option | Behavior |\n| --- | --- |\n| `FAILFAST` | **Default**. A parsing error throws a `SparkException` with an `errorClass` of `MALFORMED_AVRO_MESSAGE`. |\n| `PERMISSIVE` | A parsing error is ignored and a null record is emitted. 
| \nNote \nWith schema evolution enabled, `FAILFAST` only throws exceptions if a record is corrupted.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/avro-dataframe.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Read and write streaming Avro data\n##### Example using schema evolution and setting parse mode\n\nThe following example demonstrates enabling schema evolution and specifying `FAILFAST` parse mode with a Confluent Schema Registry: \n```\nimport org.apache.spark.sql.avro.functions._\nimport scala.collection.JavaConverters._\n\nval schemaRegistryAddr = \"https:\/\/confluent-schema-registry-endpoint\"\nval schemaRegistryOptions = Map(\n\"confluent.schema.registry.basic.auth.credentials.source\" -> \"USER_INFO\",\n\"confluent.schema.registry.basic.auth.user.info\" -> \"confluentApiKey:confluentApiSecret\",\n\"avroSchemaEvolutionMode\" -> \"restart\",\n\"mode\" -> \"FAILFAST\")\n\nval df = spark\n.readStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", servers)\n.option(\"subscribe\", \"t\")\n.load()\n.select(\n\/\/ We read the \"key\" binary column from the subject \"t-key\" in the schema\n\/\/ registry at schemaRegistryAddr. We provide schemaRegistryOptions,\n\/\/ which has avroSchemaEvolutionMode -> \"restart\". 
This instructs from_avro\n\/\/ to fail the query if the schema for the subject t-key evolves.\nfrom_avro(\n$\"key\",\n\"t-key\",\nschemaRegistryAddr,\nschemaRegistryOptions.asJava).as(\"key\"))\n\n``` \n```\nfrom pyspark.sql.functions import col, lit\nfrom pyspark.sql.avro.functions import from_avro, to_avro\n\nschema_registry_address = \"https:\/\/confluent-schema-registry-endpoint\"\nschema_registry_options = {\n\"confluent.schema.registry.basic.auth.credentials.source\": 'USER_INFO',\n\"confluent.schema.registry.basic.auth.user.info\": f\"{key}:{secret}\",\n\"avroSchemaEvolutionMode\": \"restart\",\n\"mode\": \"FAILFAST\",\n}\n\ndf = (spark\n.readStream\n.format(\"kafka\")\n.option(\"kafka.bootstrap.servers\", servers)\n.option(\"subscribe\", \"t\")\n.load()\n.select(\nfrom_avro(\ndata = col(\"key\"),\noptions = schema_registry_options,\nsubject = \"t-key\",\nschemaRegistryAddress = schema_registry_address\n).alias(\"key\")\n)\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/avro-dataframe.html"} +{"content":"# Model serving with Databricks\n## Migrate to Model Serving\n#### Migrate optimized LLM serving endpoints to provisioned throughput\n\nThis article describes how to migrate your existing LLM serving endpoints to the [provisioned throughput](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html) experience available using [Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html).\n\n#### Migrate optimized LLM serving endpoints to provisioned throughput\n##### What\u2019s changing?\n\nProvisioned throughput provides a simpler experience for launching optimized LLM serving endpoints. Databricks has modified their LLM model serving system so that: \n* Scale-out ranges can be configured in LLM-native terms, like tokens per second instead of concurrency.\n* Customers no longer need to select GPU workload types themselves. 
\nNew LLM serving endpoints are created with provisioned throughput by default. If you want to continue selecting the GPU workload type, this experience is only supported using the API.\n\n#### Migrate optimized LLM serving endpoints to provisioned throughput\n##### Migrate LLM serving endpoints to provisioned throughput\n\nThe simplest way to migrate your existing endpoint to provisioned throughput is to update your endpoint with a new model version. After you select a new model version, the UI displays the experience for provisioned throughput. The UI shows tokens per second ranges based on Databricks benchmarking for typical use cases. \n![Provisioned throughput LLM serving](https:\/\/docs.databricks.com\/_images\/serving-provisioned-throughput.png) \nPerformance with this updated offering is strictly better due to optimization improvements, and the price for your endpoint remains unchanged. Please reach out to `model-serving-feedback@databricks.com` for product feedback or concerns.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/migrate-provisioned-throughput.html"} +{"content":"# \n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-infra.html"} +{"content":"# \n### Infrastructure setup\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html) \nThis document walks you through configuring the required infrastructure to create a RAG Studio application: \n1. [Databricks Workspace](https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-infra.html#databricks-workspace)\n2. [Unity Catalog schema](https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-infra.html#unity-catalog-schema)\n3. 
[Vector Search endpoint](https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-infra.html#vector-search-endpoint)\n4. [Personal Access Token saved in Secrets manager](https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-infra.html#personal-access-token-saved-in-secrets-manager)\n5. [Generative AI models](https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-infra.html#generative-ai-models)\n6. [Cluster configurations](https:\/\/docs.databricks.com\/rag-studio\/setup\/clusters.html) \nYou will need these values when [creating](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/1-create-sample-app.html) your RAG Application, so we suggest using a scratch pad, such as the one below, to write down these values as you walk through the steps below. These values will be requested from you when you initialize your application. \n```\nvector_search_endpoint_name:\nunity_catalog_catalog_name:\nunity_catalog_schema_name:\nsecret_scope:\nsecret_name:\nmodel_serving_endpoint_chat: databricks-llama-2-70b-chat\nmodel_serving_endpoint_embeddings: databricks-bge-large-en\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-infra.html"} +{"content":"# \n### Infrastructure setup\n#### Databricks workspace\n\nSelect a Databricks workspace with Unity Catalog and serverless enabled in a [supported region](https:\/\/docs.databricks.com\/rag-studio\/regions.html). Note the URL of the workspace to use when configuring the application e.g., `https:\/\/workspace-name.cloud.databricks.com`.\n\n### Infrastructure setup\n#### Unity Catalog schema\n\nRAG Studio creates all assets within a Unity Catalog schema. \n1. Create a [new catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html) and\/or [new schema](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-schemas.html) *or* select an existing catalog \/ schema.\n2. 
Assign `Data Editor` permissions for your Databricks account to the catalog \/ schema using SQL or the [Catalog Explorer](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html) \nNote \nIf you created a new catalog\/schema, you already have the necessary permissions. \n```\nGRANT\nUSE SCHEMA,\nAPPLY TAG,\nMODIFY,\nREAD VOLUME,\nREFRESH,\nSELECT,\nWRITE VOLUME,\nCREATE FUNCTION,\nCREATE MATERIALIZED VIEW,\nCREATE MODEL,\nCREATE TABLE,\nCREATE VOLUME\nON SCHEMA my_schema\nTO `user@domain.com`;\n\n``` \n![data_editor](https:\/\/docs.databricks.com\/_images\/uc_permissions.png)\n\n### Infrastructure setup\n#### Vector Search Endpoint\n\nCreate a new endpoint [using the UI or Python SDK](https:\/\/docs.databricks.com\/generative-ai\/create-query-vector-search.html#create-a-vector-search-endpoint) or select an existing endpoint.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-infra.html"} +{"content":"# \n### Infrastructure setup\n#### Personal Access Token saved in Secrets manager\n\nWarning \nThis approach is a temporary workaround that enables your app\u2019s chain, which is hosted on Model Serving, to access the vector search indexes created by RAG Studio. In the future, this will not be needed. \n1. Create a personal access token (PAT) that has access to the Unity Catalog schema you created above. \n* **Option 1:** Create a PAT for your user account by following [these steps](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html#databricks-personal-access-tokens-for-workspace-users). Note: Using a PAT is suggested only for development. Using a service principal is strongly recommended for production.\n* **Option 2:** If you need to use a [service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html), reach out to the RAG Studio team at [rag-feedback@databricks.com](mailto:rag-feedback%40databricks.com). \n2. 
Save the PAT to a [secret scope](https:\/\/docs.databricks.com\/security\/secrets\/index.html) \nNote \nThese steps assume you have followed the [Development environment](https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-dev.html) to install the Databricks CLI. For detailed instructions, refer to the [Secret management documentation](https:\/\/docs.databricks.com\/security\/secrets\/index.html). \n```\ndatabricks secrets create-scope <scope-name>\ndatabricks secrets put-secret <scope-name> <secret-name>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-infra.html"} +{"content":"# \n### Infrastructure setup\n#### Generative AI models\n\nRAG Studio natively integrates with Databricks Model Serving for access to Foundational Models. This integration is used for RAG Studio\u2019s `\ud83e\udd16 LLM Judge` and within your `\ud83d\udd17 Chain` and `\ud83d\uddc3\ufe0f Data Processor`. \nYou need access to 2 types of models: \n1. [Chat model](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#chat-task) following the `llm\/v1\/chat` schema\n2. [Embeddings model](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#embedding-task) following the `llm\/v1\/embeddings` schema \nNote \nNo additional setup is required to use [LLaMa2-70B-Chat](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html#llama2-70b) and [BGE-Large-EN](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html#bge-large), which are Open Source models hosted by [Databricks Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html) with pay-per-token. \nOptionally, you can also configure: \n* Open Source models hosted with Databricks Foundation Model provisioned throughput. 
\n+ Follow the steps for [deploying a provisioned throughput model](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html) to load any supported open source model with a `llm\/v1\/embeddings` or `llm\/v1\/chat` type\n* External Models such as (Azure) OpenAI. \n+ Follow the steps to [configure external models in Databricks Model Serving](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html) to configure any supported external model with a `llm\/v1\/embeddings` or `llm\/v1\/chat` type\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-infra.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Manage private access settings\n\nThis article discusses how to create private access settings objects, which are required to enable [AWS PrivateLink](https:\/\/aws.amazon.com\/privatelink). This article does not contain all the information necessary to configure PrivateLink for your workspace. For all requirements and steps, including the requirements for registering VPC endpoints and creating network configuration objects, see [Enable AWS PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html). \nThe following related sections discuss updating existing network and configuration objects: \n* [Update a running or failed workspace](https:\/\/docs.databricks.com\/admin\/workspace\/update-workspace.html).\n* [Updates of existing PrivateLink configuration objects](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html#update-related).\n\n##### Manage private access settings\n###### What is a private access settings object?\n\nA private access settings object is a Databricks object that describes a workspace\u2019s PrivateLink connectivity. 
Create a new private access settings object just for this workspace, or reuse an existing private access settings object across multiple workspaces, provided that all of them are in the same AWS region. \nThis object serves several purposes: \n* It expresses your intent to use AWS PrivateLink with your workspace.\n* It controls your settings for the front-end use case of AWS PrivateLink for public network access.\n* It controls which VPC endpoints are permitted to access your workspace. \nCreate a private access settings object using the account console or the [Account API](https:\/\/docs.databricks.com\/api\/account\/introduction). You will reference it in the set of fields when you create a workspace. You can update a workspace to point to a different private access settings object, but to use PrivateLink, you *must* attach a private access settings object to the workspace during workspace creation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/private-access-settings.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Manage private access settings\n###### Create a private access settings object\n\nNote \nThese instructions show you how to create the private access object from the **Cloud resources** page in the account console before you create a new workspace. You can also create the private access settings in a similar way as part of the flow of creating a new workspace and choosing **Add a new private access object** from the picker instead of choosing an existing object. See [Manually create a workspace (existing Databricks accounts)](https:\/\/docs.databricks.com\/admin\/workspace\/create-workspace.html). \n1. In the [account console](https:\/\/docs.databricks.com\/admin\/account-settings\/index.html#account-console), click **Cloud resources**.\n2. In the horizontal tabs, click **Network**.\n3. In the vertical tabs, click **Private access settings**.\n4. 
Click **Add private access settings**. \n![private access settings object](https:\/\/docs.databricks.com\/_images\/privatelink-vpc-pas.png)\n5. Enter a name for your new private access settings object.\n6. For the region, be sure to match the region of your workspace as this is not validated immediately and workspace deployment will fail if it does not match. It is validated only during the actual creation of the workspace.\n7. Set the **Public access enabled** field, which configures public access to the front-end connection (the web application and REST APIs) for your workspace. \n* If set to **False** (the default), the front-end connection can be accessed only using PrivateLink connectivity and not from the public internet. When public access is disabled, the [Configure IP access lists for workspaces](https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-workspace.html) feature is unsupported.\n* If set to **True**, the front-end connection can be accessed either from PrivateLink connectivity or from the public internet. Any IP access lists only limit connections from the public internet but not traffic through the PrivateLink connection.\n8. Set the **Private Access Level** field to the value that best represents which VPC endpoints to allow for your workspace. \n* Set to **Account** to limit connections to those VPC endpoints that are registered in your Databricks account.\n* Set to **Endpoint** to limit connections to an explicit set of VPC endpoints, which you can enter in a field that appears. It lets you select VPC endpoint registrations that you\u2019ve already created. Be sure to include your *front-end* VPC endpoint registration if you created one.\n9. 
Click **Add**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/private-access-settings.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Manage private access settings\n###### Update a private access settings object\n\nTo update fields on a private access object: \n1. In the [account console](https:\/\/docs.databricks.com\/admin\/account-settings\/index.html#account-console), click **Cloud resources**.\n2. In the horizontal tabs, click **Network**.\n3. In the vertical tabs, click **Private access settings**.\n4. On the row for the configuration, click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) on the right, and select **Update**.\n5. Change any fields. For guidance on specific fields, see [Create a private access settings object](https:\/\/docs.databricks.com\/security\/network\/classic\/private-access-settings.html#create). \nNote \nThe private access level `ANY` is deprecated. If the object previously had this value and you use the account console to update the private access settings for any fields, you must change the private access level to a non-deprecated value. To make changes to other fields without changing the `ANY` private access level at this time, use the [Account API](https:\/\/docs.databricks.com\/api\/account\/introduction). See [AWS PrivateLink private access level ANY is deprecated](https:\/\/docs.databricks.com\/release-notes\/product\/2022\/august.html#privatelink-private-access-level-any-deprecated).\n6. Click **Update private access setting**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/private-access-settings.html"} +{"content":"# Security and compliance guide\n## Networking\n### Classic compute plane networking\n##### Manage private access settings\n###### Delete a private access settings object\n\nPrivate access settings object cannot be edited after creation. 
If the configuration has incorrect data or if you no longer need it for any workspaces, delete it: \n1. In the [account console](https:\/\/docs.databricks.com\/admin\/account-settings\/index.html#account-console), click **Cloud resources**.\n2. Click **Network**.\n3. In the vertical tabs, click **Private access settings**.\n4. On the row for the configuration, click the kebab menu ![Vertical Ellipsis](https:\/\/docs.databricks.com\/_images\/vertical-ellipsis.png) on the right, and select **Delete**.\n5. In the confirmation dialog, click **Confirm Delete**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/classic\/private-access-settings.html"} +{"content":"# AI and Machine Learning on Databricks\n## Prepare data and environment for ML and DL\n#### Load data for machine learning and deep learning\n\nThis section covers information about loading data specifically for ML and DL applications. For general information about loading data, see [Ingest data into a Databricks lakehouse](https:\/\/docs.databricks.com\/ingestion\/index.html).\n\n#### Load data for machine learning and deep learning\n##### Store files for data loading and model checkpointing\n\nMachine learning applications may need to use shared storage for data loading and model checkpointing. This is particularly important for distributed deep learning. \nDatabricks provides [the Databricks File System (DBFS)](https:\/\/docs.databricks.com\/dbfs\/index.html) for accessing data on a cluster using both Spark and local file APIs.\n\n#### Load data for machine learning and deep learning\n##### Load tabular data\n\nYou can load tabular machine learning data from [tables](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#table) or files (for example, see [Read and write to CSV files](https:\/\/docs.databricks.com\/query\/formats\/csv.html)). 
You can convert Apache Spark DataFrames into pandas DataFrames using the [PySpark method](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.DataFrame.toPandas.html?highlight=topandas#pyspark-sql-dataframe-topandas) `toPandas()`, and then optionally convert to NumPy format using the [PySpark method](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.pandas\/api\/pyspark.pandas.DataFrame.to_numpy.html) `to_numpy()`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/load-data\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## Prepare data and environment for ML and DL\n#### Load data for machine learning and deep learning\n##### Prepare data to fine tune large language models\n\nYou can prepare your data for fine-tuning open source large language models with [Hugging Face Transformers](https:\/\/huggingface.co\/docs\/transformers\/index) and [Hugging Face Datasets](https:\/\/huggingface.co\/docs\/datasets\/index). \n[Prepare data for fine tuning Hugging Face models](https:\/\/docs.databricks.com\/machine-learning\/train-model\/huggingface\/load-data.html)\n\n#### Load data for machine learning and deep learning\n##### Prepare data for distributed training\n\nThis section covers two methods for preparing data for distributed training: Petastorm and TFRecords. 
\n* [Prepare data for distributed training](https:\/\/docs.databricks.com\/machine-learning\/load-data\/ddl-data.html)\n+ [Petastorm (Recommended)](https:\/\/docs.databricks.com\/machine-learning\/load-data\/ddl-data.html#petastorm-recommended)\n+ [TFRecord](https:\/\/docs.databricks.com\/machine-learning\/load-data\/ddl-data.html#tfrecord)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/load-data\/index.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### ElasticSearch\n\n[ElasticSearch](https:\/\/www.elastic.co\/elasticsearch) is a distributed, RESTful search and analytics engine. \nThe following notebook shows how to read and write data to ElasticSearch.\n\n#### ElasticSearch\n##### ElasticSearch notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/elasticsearch.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/elasticsearch.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n### Chat with supported LLMs using AI Playground\n\nYou can interact with supported large language models using the AI Playground. The AI Playground is a chat-like environment where you can test, prompt, and compare LLMs. This functionality is available in your Databricks workspace.\n\n### Chat with supported LLMs using AI Playground\n#### Requirements\n\n* [Databricks workspace](https:\/\/docs.databricks.com\/workspace\/index.html) in a [supported region](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html#regions) for Foundation Model APIs pay-per-token or provisioned throughput.\n\n### Chat with supported LLMs using AI Playground\n#### Use AI Playground\n\nTo use the AI Playground: \n1. Select Playground from the left navigation pane under Machine Learning.\n2. 
Select the model you want to interact with using the dropdown list on the top left.\n3. You can do either of the following: \n1. Type in your question or prompt.\n2. Select a sample AI instruction from those listed in the window.\n4. You can select the **+** to add an endpoint. Doing so allows you to compare multiple model responses side-by-side. \n![AI playground](https:\/\/docs.databricks.com\/_images\/ai-playground.gif)\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/ai-playground.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### External models in Databricks Model Serving\n\nImportant \nThe code examples in this article demonstrate usage of the [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html) MLflow Deployments CRUD API. \nThis article describes external models in Databricks Model Serving including its supported model providers and limitations.\n\n#### External models in Databricks Model Serving\n##### What are external models?\n\nExternal models are third-party models hosted outside of Databricks. Supported by Model Serving, external models allow you to streamline the usage and management of various large language model (LLM) providers, such as OpenAI and Anthropic, within an organization. You can also use Databricks Model Serving as a provider to serve custom models, which offers rate limits for those endpoints. As part of this support, Model Serving offers a high-level interface that simplifies the interaction with these services by providing a unified endpoint to handle specific LLM-related requests. \nIn addition, Databricks support for external models provides centralized credential management. By storing API keys in one secure location, organizations can enhance their security posture by minimizing the exposure of sensitive API keys throughout the system. 
It also helps to prevent exposing these keys within code or requiring end users to manage keys safely. \nSee [Tutorial: Create external model endpoints to query OpenAI models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/external-models-tutorial.html) for step-by-step guidance on external model endpoint creation and querying supported models served by those endpoints using the MLflow Deployments SDK. See the following guides for instructions on how to use the Serving UI and the REST API: \n* [Create custom model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html)\n* [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### External models in Databricks Model Serving\n##### Requirements\n\n* API key for the model provider.\n* Databricks workspace in [External models supported regions](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html#regions).\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### External models in Databricks Model Serving\n##### Model providers\n\nExternal models in Model Serving is designed to support a variety of model providers. A provider represents the source of the machine learning models, such as OpenAI, Anthropic, and so on. Each provider has its specific characteristics and configurations that are encapsulated within the `external_model` field of the [external model endpoint configuration](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html#endpoint). 
\nThe following providers are supported: \n* **openai**: For models offered by [OpenAI](https:\/\/platform.openai.com\/) and the [Azure](https:\/\/learn.microsoft.com\/azure\/cognitive-services\/openai\/) integrations for Azure OpenAI and Azure OpenAI with AAD.\n* **anthropic**: For models offered by [Anthropic](https:\/\/docs.anthropic.com\/claude\/docs).\n* **cohere**: For models offered by [Cohere](https:\/\/docs.cohere.com\/docs).\n* **amazon-bedrock**: For models offered by [Amazon Bedrock](https:\/\/aws.amazon.com\/bedrock\/).\n* **ai21labs**: For models offered by [AI21Labs](https:\/\/www.ai21.com\/studio\/foundation-models).\n* **google-cloud-vertex-ai**: For models offered by [Google Cloud Vertex AI](https:\/\/cloud.google.com\/vertex-ai\/generative-ai\/docs\/learn\/models).\n* **databricks-model-serving**: For Databricks Model Serving endpoints with compatible schemas. See [Endpoint configuration](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html#endpoint). \nTo request support for a provider not listed here, reach out to your Databricks account team.\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### External models in Databricks Model Serving\n##### Supported models\n\nThe model you choose directly affects the results of the responses you get from the API calls. Therefore, choose a model that fits your use-case requirements. For instance, for generating conversational responses, you can choose a chat model. Conversely, for generating embeddings of text, you can choose an embedding model. \nThe table below presents a non-exhaustive list of supported models and corresponding [endpoint types](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html#endpoint). 
Model associations listed below can be used as a helpful guide when configuring an endpoint for any newly released model types as they become available with a given provider. Customers are responsible for ensuring compliance with applicable model licenses. \nNote \nWith the rapid development of LLMs, there is no guarantee that this list is up to date at all times. \n| Model provider | llm\/v1\/completions | llm\/v1\/chat | llm\/v1\/embeddings |\n| --- | --- | --- | --- |\n| OpenAI\\*\\* | * gpt-3.5-turbo-instruct * babbage-002 * davinci-002 | * gpt-3.5-turbo * gpt-4 * gpt-3.5-turbo-0125 * gpt-3.5-turbo-1106 * gpt-4-0125-preview * gpt-4-turbo-preview * gpt-4-1106-preview * gpt-4-vision-preview * gpt-4-1106-vision-preview | * text-embedding-ada-002 * text-embedding-3-large * text-embedding-3-small |\n| Azure OpenAI\\*\\* | * text-davinci-003 * gpt-35-turbo-instruct | * gpt-35-turbo * gpt-35-turbo-16k * gpt-4 * gpt-4-32k | * text-embedding-ada-002 * text-embedding-3-large * text-embedding-3-small |\n| Anthropic | * claude-1 * claude-1.3-100k * claude-2 * claude-2.1 * claude-2.0 * claude-instant-1.2 | * claude-3-opus-20240229 * claude-3-sonnet-20240229 * claude-2.1 * claude-2.0 * claude-instant-1.2 | |\n| Cohere\\*\\* | * command * command-light-nightly * command-light * command-nightly | | * embed-english-v2.0 * embed-multilingual-v2.0 * embed-english-light-v2.0 * embed-english-v3.0 * embed-english-light-v3.0 * embed-multilingual-v3.0 * embed-multilingual-light-v3.0 |\n| Databricks Model Serving | Databricks serving endpoint | Databricks serving endpoint | Databricks serving endpoint |\n| Amazon Bedrock | Anthropic:* claude-instant-v1 * claude-v1 * claude-v2 Cohere:* command-text-v14 * command-text-v14:7:4k * command-light-text-v14 * command-light-text-v14:7:4k AI21 Labs:* j2-grande-instruct * j2-jumbo-instruct * j2-mid * j2-mid-v1 * j2-ultra j2-ultra-v1 | Anthropic:* claude-instant-v1:2:100k * claude-v2 * claude-v2:0:18k * claude-v2:0:100k * claude-v2:1 * 
claude-v2:1:18k * claude-v2:1:200k * claude-3-sonnet-20240229-v1:0 | Amazon:* titan-embed-text-v1 * titan-embed-g1-text-02 * titan-embed-text-v1:2:8k |\n| AI21 Labs\u2020 | * j2-mid * j2-light * j2-ultra | | |\n| Google Cloud Vertex AI | text-bison | * chat-bison * gemini-pro | textembedding-gecko | \n`**` Model provider supports fine-tuned completion and chat models. To query a fine-tuned model, populate the `name` field of the `external_model` configuration with the name of your fine-tuned model. \n\u2020 Model provider supports custom completion models. \n### Use models served on Databricks Model Serving endpoints \nUsing [Databricks Model Serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html) as a provider is supported for the `llm\/v1\/completions`, `llm\/v1\/chat`, and `llm\/v1\/embeddings` endpoint types. These endpoints must accept the standard query parameters marked as required, while other parameters might be ignored depending on whether or not the Databricks Model Serving endpoint supports them. \nSee [POST \/serving-endpoints\/{name}\/invocations](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/query) in the API reference for standard query parameters. \nThese endpoints must produce responses in the following OpenAI format. 
\nFor completions tasks: \n```\n{\n\"id\": \"123\", # Not Required\n\"model\": \"test_databricks_model\",\n\"choices\": [\n{\n\"text\": \"Hello World!\",\n\"index\": 0,\n\"logprobs\": null, # Not Required\n\"finish_reason\": \"length\" # Not Required\n}\n],\n\"usage\": {\n\"prompt_tokens\": 8,\n\"total_tokens\": 8\n}\n}\n\n``` \nFor chat tasks: \n```\n{\n\"id\": \"123\", # Not Required\n\"model\": \"test_chat_model\",\n\"choices\": [{\n\"index\": 0,\n\"message\": {\n\"role\": \"assistant\",\n\"content\": \"\\n\\nHello there, how may I assist you today?\"\n},\n\"finish_reason\": \"stop\"\n},\n{\n\"index\": 1,\n\"message\": {\n\"role\": \"human\",\n\"content\": \"\\n\\nWhat is the weather in San Francisco?\"\n},\n\"finish_reason\": \"stop\"\n}],\n\"usage\": {\n\"prompt_tokens\": 8,\n\"total_tokens\": 8\n}\n}\n\n``` \nFor embeddings tasks: \n```\n{\n\"data\": [\n{\n\"embedding\": [\n0.0023064255,\n-0.009327292,\n.... # (1536 floats total for ada-002)\n-0.0028842222\n],\n\"index\": 0\n},\n{\n\"embedding\": [\n0.0023064255,\n-0.009327292,\n.... # (1536 floats total for ada-002)\n-0.0028842222\n],\n\"index\": 1\n}\n],\n\"model\": \"test_embedding_model\",\n\"usage\": {\n\"prompt_tokens\": 8,\n\"total_tokens\": 8\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### External models in Databricks Model Serving\n##### Endpoint configuration\n\nTo serve and query external models you need to configure a serving endpoint. See [Create custom model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html). \nFor an external model serving endpoint you must include the `external_model` field and its parameters in the `served_entities` section of the endpoint configuration. \nThe `external_model` field defines the model to which this endpoint forwards requests. 
When specifying a model, it is critical that the provider supports the model you are requesting. For instance, `openai` as a provider supports models like `text-embedding-ada-002`, but other providers might not. If the model is not supported by the provider, Databricks returns an HTTP 4xx error when trying to route requests to that model. \nThe below table summarizes the `external_model` field parameters. See [POST \/api\/2.0\/serving-endpoints](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/create) for endpoint configuration parameters. \n| Parameter | Descriptions |\n| --- | --- |\n| `name` | The name of the model to use. For example, `gpt-3.5-turbo` for OpenAI\u2019s `GPT-3.5-Turbo` model. |\n| `provider` | Specifies the name of the provider for this model. This string value must correspond to a supported external model [provider](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html#provider). For example, `openai` for OpenAI\u2019s `GPT-3.5` models. |\n| `task` | The task corresponds to the type of language model interaction you desire. Supported tasks are \u201cllm\/v1\/completions\u201d, \u201cllm\/v1\/chat\u201d, \u201cllm\/v1\/embeddings\u201d. |\n| `<provider>_config` | Contains any additional configuration details required for the model. This includes specifying the API base URL and the API key. See [Configure the provider for an endpoint](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html#configure-provider). | \nThe following is an example of creating an external model endpoint using the `create_endpoint()` API. In this example, a request sent to the completion endpoint is forwarded to the `claude-2` model provided by `anthropic`. 
\n```\nimport mlflow.deployments\n\nclient = mlflow.deployments.get_deploy_client(\"databricks\")\n\nclient.create_endpoint(\nname=\"anthropic-completions-endpoint\",\nconfig={\n\"served_entities\": [\n{\n\"name\": \"test\",\n\"external_model\": {\n\"name\": \"claude-2\",\n\"provider\": \"anthropic\",\n\"task\": \"llm\/v1\/completions\",\n\"anthropic_config\": {\n\"anthropic_api_key\": \"{{secrets\/my_anthropic_secret_scope\/anthropic_api_key}}\"\n}\n}\n}\n]\n}\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### External models in Databricks Model Serving\n##### Configure the provider for an endpoint\n\nWhen you create an endpoint, you must supply the required configurations for the specified model provider. The following sections summarize the available endpoint configuration parameters for each model provider. \n### OpenAI \n| Configuration Parameter | Description | Required | Default |\n| --- | --- | --- | --- |\n| `openai_api_key` | The API key for the OpenAI service. | Yes | |\n| `openai_api_type` | An optional field to specify the type of OpenAI API to use. | No | |\n| `openai_api_base` | The base URL for the OpenAI API. | No | `https:\/\/api.openai.com\/v1` |\n| `openai_api_version` | An optional field to specify the OpenAI API version. | No | |\n| `openai_organization` | An optional field to specify the organization in OpenAI. | No | | \n### Cohere \n| Configuration Parameter | Description | Required | Default |\n| --- | --- | --- | --- |\n| `cohere_api_key` | The API key for the Cohere service. | Yes | | \n### Anthropic \n| Configuration Parameter | Description | Required | Default |\n| --- | --- | --- | --- |\n| `anthropic_api_key` | The API key for the Anthropic service. | Yes | | \n### Azure OpenAI \nAzure OpenAI has distinct features as compared with the direct OpenAI service. 
For an overview, please see [the comparison documentation](https:\/\/learn.microsoft.com\/azure\/cognitive-services\/openai\/how-to\/switching-endpoints). \n| Configuration Parameter | Description | Required | Default |\n| --- | --- | --- | --- |\n| `openai_api_key` | The API key for the Azure OpenAI service. | Yes | |\n| `openai_api_type` | Adjust this parameter to represent the preferred security access validation protocol. For access token validation, use `azure`. For authentication using Azure Active Directory (Azure AD) use, `azuread`. | Yes | |\n| `openai_api_base` | The base URL for the Azure OpenAI API service provided by Azure. | Yes | |\n| `openai_api_version` | The version of the Azure OpenAI service to utilize, specified by a date. | Yes | |\n| `openai_deployment_name` | The name of the deployment resource for the Azure OpenAI service. | Yes | |\n| `openai_organization` | An optional field to specify the organization in OpenAI. | No | | \nThe following example demonstrates how to create an endpoint with Azure OpenAI: \n```\nclient.create_endpoint(\nname=\"openai-chat-endpoint\",\nconfig={\n\"served_entities\": [{\n\"external_model\": {\n\"name\": \"gpt-3.5-turbo\",\n\"provider\": \"openai\",\n\"task\": \"llm\/v1\/chat\",\n\"openai_config\": {\n\"openai_api_type\": \"azure\",\n\"openai_api_key\": \"{{secrets\/my_openai_secret_scope\/openai_api_key}}\",\n\"openai_api_base\": \"https:\/\/my-azure-openai-endpoint.openai.azure.com\",\n\"openai_deployment_name\": \"my-gpt-35-turbo-deployment\",\n\"openai_api_version\": \"2023-05-15\"\n}\n}\n}]\n}\n)\n\n``` \n### Google Cloud Vertex AI \n| Configuration Parameter | Description | Required | Default |\n| --- | --- | --- | --- |\n| `private_key` | This is the private key for the service account which has access to the Google Cloud Vertex AI Service. See [Best practices for managing service account keys](https:\/\/cloud.google.com\/iam\/docs\/best-practices-for-managing-service-account-keys). 
| Yes | |\n| `region` | This is the region for the Google Cloud Vertex AI Service. See [supported regions](https:\/\/cloud.google.com\/vertex-ai\/docs\/general\/locations) for more details. Some models are only available in specific regions. | Yes | |\n| `project_id` | This is the Google Cloud project ID that the service account is associated with. | Yes | | \n### Amazon Bedrock \nTo use Amazon Bedrock as an external model provider, customers need to make sure Bedrock is enabled in the specified AWS region, and that the specified AWS key pair has the appropriate permissions to interact with Bedrock services. For more information, see [AWS Identity and Access Management](https:\/\/docs.aws.amazon.com\/IAM\/latest\/UserGuide\/id_credentials_access-keys.html). \n| Configuration Parameter | Description | Required | Default |\n| --- | --- | --- | --- |\n| `aws_region` | The AWS region to use. Bedrock has to be enabled there. | Yes | |\n| `aws_access_key_id` | An AWS access key ID with permissions to interact with Bedrock services. | Yes | |\n| `aws_secret_access_key` | An AWS secret access key paired with the access key ID, with permissions to interact with Bedrock services. | Yes | |\n| `bedrock_provider` | The underlying provider in Amazon Bedrock. Supported values (case insensitive) include: Anthropic, Cohere, AI21Labs, Amazon | Yes | | \nThe following example demonstrates how to create an endpoint with Amazon Bedrock. 
\n```\nclient.create_endpoint(\nname=\"bedrock-anthropic-completions-endpoint\",\nconfig={\n\"served_entities\": [\n{\n\"external_model\": {\n\"name\": \"claude-v2\",\n\"provider\": \"amazon-bedrock\",\n\"task\": \"llm\/v1\/completions\",\n\"amazon_bedrock_config\": {\n\"aws_region\": \"<YOUR_AWS_REGION>\",\n\"aws_access_key_id\": \"{{secrets\/my_amazon_bedrock_secret_scope\/aws_access_key_id}}\",\n\"aws_secret_access_key\": \"{{secrets\/my_amazon_bedrock_secret_scope\/aws_secret_access_key}}\",\n\"bedrock_provider\": \"anthropic\",\n},\n}\n}\n]\n},\n)\n\n``` \nIf there are AWS permission issues, Databricks recommends that you verify the credentials directly with the [Amazon Bedrock API](https:\/\/docs.aws.amazon.com\/bedrock\/latest\/APIReference\/welcome.html). \n### AI21 Labs \n| Configuration Parameter | Description | Required | Default |\n| --- | --- | --- | --- |\n| `ai21labs_api_key` | This is the API key for the AI21 Labs service. | Yes | |\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### External models in Databricks Model Serving\n##### Query an external model endpoint\n\nAfter you create an external model endpoint, it is ready to receive traffic from users. \nYou can send scoring requests to the endpoint using the OpenAI client, the REST API or the MLflow Deployments SDK. \n* See the standard query parameters for a scoring request in [POST \/serving-endpoints\/{name}\/invocations](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/query).\n* [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html) \nThe following example queries the `claude-2` completions model hosted by Anthropic using the OpenAI client. To use the OpenAI client, populate the `model` field with the name of the model serving endpoint that hosts the model you want to query. 
\nThis example uses a previously created endpoint, `anthropic-completions-endpoint`, configured for accessing external models from the Anthropic model provider. See how to [create external model endpoints](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html#endpoint). \nSee [Supported models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html#supported) for additional models you can query and their providers. \n```\nimport os\nimport openai\nfrom openai import OpenAI\n\nclient = OpenAI(\napi_key=\"dapi-your-databricks-token\",\nbase_url=\"https:\/\/example.staging.cloud.databricks.com\/serving-endpoints\"\n)\n\ncompletion = client.completions.create(\nmodel=\"anthropic-completions-endpoint\",\nprompt=\"what is databricks\",\ntemperature=1.0\n)\nprint(completion)\n\n``` \nExpected output response format: \n```\n{\n\"id\": \"123\", # Not Required\n\"model\": \"anthropic-completions-endpoint\",\n\"choices\": [\n{\n\"text\": \"Hello World!\",\n\"index\": 0,\n\"logprobs\": null, # Not Required\n\"finish_reason\": \"length\" # Not Required\n}\n],\n\"usage\": {\n\"prompt_tokens\": 8,\n\"total_tokens\": 8\n}\n}\n\n``` \n### Additional query parameters \nYou can pass any additional parameters supported by the endpoint\u2019s provider as part of your query. \nFor example: \n* `logit_bias` (supported by OpenAI, Cohere).\n* `top_k` (supported by Anthropic, Cohere).\n* `frequency_penalty` (supported by OpenAI, Cohere).\n* `presence_penalty` (supported by OpenAI, Cohere).\n* `stream` (supported by OpenAI, Anthropic, Cohere, Amazon Bedrock for Anthropic). 
This is only available for chat and completions requests.\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### External models in Databricks Model Serving\n##### Limitations\n\nDepending on the external model you choose, your configuration might cause your data to be processed outside of the region where your data originated. \nSee [Model Serving limits and regions](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html).\n\n#### External models in Databricks Model Serving\n##### Additional resources\n\n* [Tutorial: Create external model endpoints to query OpenAI models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/external-models-tutorial.html).\n* [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Create foundation model serving endpoints\n\nIn this article, you learn how to create model serving endpoints that serve [foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/foundation-models.html). \n[Databricks Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) supports the following foundation models: \n* State-of-the-art open models made available by [Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html). These models are curated foundation model architectures that support optimized inference. Base models, like Llama-2-70B-chat, BGE-Large, and Mistral-7B are available for immediate use with **pay-per-token** pricing. 
Production workloads, using base or fine-tuned models, can be deployed with performance guarantees using **provisioned throughput**.\n* [External models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html). These are models that are hosted outside of Databricks. Endpoints that serve external models can be centrally governed, and customers can establish rate limits and access control for them. Examples include foundation models like OpenAI\u2019s GPT-4, Anthropic\u2019s Claude, and others. \nModel Serving provides the following options for model serving endpoint creation: \n* The Serving UI\n* REST API\n* MLflow Deployments SDK \nFor creating endpoints that serve traditional ML or Python models, see [Create custom model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-foundation-model-endpoints.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Create foundation model serving endpoints\n##### Requirements\n\n* A Databricks workspace in a supported region. \n+ [Foundation Model APIs regions](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html#regions)\n+ [External models regions](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html#regions)\n* For creating endpoints using the MLflow Deployments SDK, you must install the MLflow Deployment client. 
To create the client, run: \n```\nimport mlflow.deployments\n\nclient = mlflow.deployments.get_deploy_client(\"databricks\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-foundation-model-endpoints.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Create foundation model serving endpoints\n##### Create a foundation model serving endpoint\n\nThe following describes how to create an endpoint that serves a foundation model made available using Databricks external models. For endpoints that serve fine-tuned variants of the models made available using Foundation Model APIs, see [Create your provisioned throughput endpoint using the REST API](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/deploy-prov-throughput-foundation-model-apis.html#provisioned-throughput-api). \n1. In the **Name** field, provide a name for your endpoint.\n2. In the **Served entities** section: \n1. Click into the **Entity** field to open the **Select served entity** form.\n2. Select **External model**.\n3. Select the model provider you want to use.\n4. Click **Confirm**.\n5. Provide the name of the external model you want to use. The form dynamically updates based on your selection. See the [available external models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html#supported-models).\n6. Select the task type. Available tasks are chat, completions, and embeddings.\n7. Provide the configuration details for accessing the selected model provider. This is typically the secret that references the [personal access token](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens) you want the endpoint to use for accessing this model.\n3. Click **Create**. The **Serving endpoints** page appears with **Serving endpoint state** shown as Not Ready. 
\n![Create a model serving endpoint](https:\/\/docs.databricks.com\/_images\/create-endpoint1.png) \nImportant \nThe REST API parameters for creating serving endpoints that serve foundation models are in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThe following example creates an endpoint that serves the first version of the `text-embedding-ada-002` model provided by OpenAI. \nSee [POST \/api\/2.0\/serving-endpoints](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/create) for endpoint configuration parameters. \n```\n{\n\"name\": \"openai_endpoint\",\n\"config\":{\n\"served_entities\": [\n{\n\"name\": \"openai_embeddings\",\n\"external_model\":{\n\"name\": \"text-embedding-ada-002\",\n\"provider\": \"openai\",\n\"task\": \"llm\/v1\/embeddings\",\n\"openai_config\":{\n\"openai_api_key\": \"{{secrets\/my_scope\/my_openai_api_key}}\"\n}\n}\n}\n]\n},\n\"rate_limits\": [\n{\n\"calls\": 100,\n\"key\": \"user\",\n\"renewal_period\": \"minute\"\n}\n],\n\"tags\": [\n{\n\"key\": \"team\",\n\"value\": \"gen-ai\"\n}\n]\n}\n\n``` \nThe following is an example response. 
\n```\n{\n\"name\": \"openai_endpoint\",\n\"creator\": \"user@email.com\",\n\"creation_timestamp\": 1699617587000,\n\"last_updated_timestamp\": 1699617587000,\n\"state\": {\n\"ready\": \"READY\"\n},\n\"config\": {\n\"served_entities\": [\n{\n\"name\": \"openai_embeddings\",\n\"external_model\": {\n\"provider\": \"openai\",\n\"name\": \"text-embedding-ada-002\",\n\"task\": \"llm\/v1\/embeddings\",\n\"openai_config\": {\n\"openai_api_key\": \"{{secrets\/my_scope\/my_openai_api_key}}\"\n}\n},\n\"state\": {\n\"deployment\": \"DEPLOYMENT_READY\",\n\"deployment_state_message\": \"\"\n},\n\"creator\": \"user@email.com\",\n\"creation_timestamp\": 1699617587000\n}\n],\n\"traffic_config\": {\n\"routes\": [\n{\n\"served_model_name\": \"openai_embeddings\",\n\"traffic_percentage\": 100\n}\n]\n},\n\"config_version\": 1\n},\n\"tags\": [\n{\n\"key\": \"team\",\n\"value\": \"gen-ai\"\n}\n],\n\"id\": \"69962db6b9db47c4a8a222d2ac79d7f8\",\n\"permission_level\": \"CAN_MANAGE\",\n\"route_optimized\": false\n}\n\n``` \nThe following creates an endpoint for embeddings with OpenAI `text-embedding-ada-002`. \nFor foundation model endpoints, you must provide API keys for the model provider you want to use. See [POST \/api\/2.0\/serving-endpoints](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/create) in the REST API for request and response schema details. \nYou can also create endpoints for completions and chat tasks, as specified by the `task` field in the `external_model` section of the configuration. See [External models in Databricks Model Serving](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html) for supported models and providers for each task. 
\n```\n\nfrom mlflow.deployments import get_deploy_client\n\nclient = get_deploy_client(\"databricks\")\nendpoint = client.create_endpoint(\nname=\"chat\",\nconfig={\n\"served_entities\": [\n{\n\"name\": \"completions\",\n\"external_model\": {\n\"name\": \"gpt-4\",\n\"provider\": \"openai\",\n\"task\": \"llm\/v1\/chat\",\n\"openai_config\": {\n\"openai_api_key\": \"{{secrets\/scope\/key}}\",\n},\n},\n}\n],\n},\n)\nassert endpoint == {\n\"name\": \"chat\",\n\"creator\": \"alice@company.com\",\n\"creation_timestamp\": 0,\n\"last_updated_timestamp\": 0,\n\"state\": {...},\n\"config\": {...},\n\"tags\": [...],\n\"id\": \"88fd3f75a0d24b0380ddc40484d7a31b\",\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-foundation-model-endpoints.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Create foundation model serving endpoints\n##### Update a foundation model endpoint\n\nAfter enabling a model endpoint, you can set the compute configuration as desired. This configuration is particularly helpful if you need additional resources for your model. Workload size and compute configuration play a key role in what resources are allocated for serving your model. \nUntil the new configuration is ready, the old configuration keeps serving prediction traffic. While there is an update in progress, another update cannot be made. In the Serving UI, you can cancel an in progress configuration update by selecting **Cancel update** on the top right of the endpoint\u2019s details page. This functionality is only available in the Serving UI. \nWhen an `external_model` is present in an endpoint configuration, the served entities list can only have one served\\_entity object. Existing endpoints with an `external_model` can not be updated to no longer have an `external_model`. If the endpoint is created without an `external_model`, you cannot update it to add an `external_model`. 
\nTo update your foundation model endpoint see the REST API [update configuration documentation](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/updateconfig) for request and response schema details. \n```\n{\n\"name\": \"openai_endpoint\",\n\"served_entities\":[\n{\n\"name\": \"openai_chat\",\n\"external_model\":{\n\"name\": \"gpt-4\",\n\"provider\": \"openai\",\n\"task\": \"llm\/v1\/chat\",\n\"openai_config\":{\n\"openai_api_key\": \"{{secrets\/my_scope\/my_openai_api_key}}\"\n}\n}\n}\n]\n}\n\n``` \nTo update your foundation model endpoint see the REST API [update configuration documentation](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/updateconfig) for request and response schema details. \n```\nfrom mlflow.deployments import get_deploy_client\n\nclient = get_deploy_client(\"databricks\")\nendpoint = client.update_endpoint(\nendpoint=\"chat\",\nconfig={\n\"served_entities\": [\n{\n\"name\": \"chats\",\n\"external_model\": {\n\"name\": \"gpt-4\",\n\"provider\": \"openai\",\n\"task\": \"llm\/v1\/chat\",\n\"openai_config\": {\n\"openai_api_key\": \"{{secrets\/scope\/key}}\",\n},\n},\n}\n],\n},\n)\nassert endpoint == {\n\"name\": \"chats\",\n\"creator\": \"alice@company.com\",\n\"creation_timestamp\": 0,\n\"last_updated_timestamp\": 0,\n\"state\": {...},\n\"config\": {...},\n\"tags\": [...],\n\"id\": \"88fd3f75a0d24b0380ddc40484d7a31b\",\n}\n\nrate_limits = client.update_endpoint(\nendpoint=\"chat\",\nconfig={\n\"rate_limits\": [\n{\n\"key\": \"user\",\n\"renewal_period\": \"minute\",\n\"calls\": 10,\n}\n],\n},\n)\nassert rate_limits == {\n\"rate_limits\": [\n{\n\"key\": \"user\",\n\"renewal_period\": \"minute\",\n\"calls\": 10,\n}\n],\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-foundation-model-endpoints.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Create foundation model serving endpoints\n##### Additional 
resources\n\n* [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html).\n* [External models in Databricks Model Serving](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html).\n* [Inference tables for monitoring and debugging models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-foundation-model-endpoints.html"} +{"content":"# What is Delta Lake?\n### Delta table properties reference\n\nDelta Lake reserves Delta table properties starting with `delta.`. These properties may have specific meanings, and affect behaviors when these properties are set. \nNote \nAll operations that set or update table properties conflict with other concurrent write operations, causing them to fail. Databricks recommends you modify a table property only when there are no concurrent write operations on the table.\n\n### Delta table properties reference\n#### How do table properties and SparkSession properties interact?\n\nDelta table properties are set per table. If a property is set on a table, then this is the setting that is followed by default. \nSome table properties have associated SparkSession configurations which always take precedence over table properties. Some examples include the `spark.databricks.delta.autoCompact.enabled` and `spark.databricks.delta.optimizeWrite.enabled` configurations, which turn on auto compaction and optimized writes at the SparkSession level rather than the table level. Databricks recommends using table-scoped configurations for most workloads. \nFor every Delta table property you can set a default value for new tables using a SparkSession configuration, overriding the built-in default. This setting only affects new tables and does not override or replace properties set on existing tables. 
The prefix used in the SparkSession is different from the configurations used in the table properties, as shown in the following table: \n| Delta Lake conf | SparkSession conf |\n| --- | --- |\n| `delta.<conf>` | `spark.databricks.delta.properties.defaults.<conf>` | \nFor example, to set the `delta.appendOnly = true` property for all new Delta Lake tables created in a session, set the following: \n```\nSET spark.databricks.delta.properties.defaults.appendOnly = true\n\n``` \nTo modify table properties of existing tables, use [SET TBLPROPERTIES](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-tblproperties.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/table-properties.html"} +{"content":"# What is Delta Lake?\n### Delta table properties reference\n#### Delta table properties\n\nAvailable Delta table properties include the following: \n| Property |\n| --- |\n| `delta.appendOnly` `true` for this Delta table to be append-only. If append-only, existing records cannot be deleted, and existing values cannot be updated. See [Delta table properties reference](https:\/\/docs.databricks.com\/delta\/table-properties.html). Data type: `Boolean` Default: `false` |\n| `delta.autoOptimize.autoCompact` `auto` for Delta Lake to automatically optimize the layout of the files for this Delta table. See [Auto compaction for Delta Lake on Databricks](https:\/\/docs.databricks.com\/delta\/tune-file-size.html#auto-compact). Data type: `Boolean` Default: (none) |\n| `delta.autoOptimize.optimizeWrite` `true` for Delta Lake to automatically optimize the layout of the files for this Delta table during writes. See [Optimized writes for Delta Lake on Databricks](https:\/\/docs.databricks.com\/delta\/tune-file-size.html#optimized-writes). Data type: `Boolean` Default: (none) |\n| `delta.checkpoint.writeStatsAsJson` `true` for Delta Lake to write file statistics in checkpoints in JSON format for the `stats` column. 
See [Manage column-level statistics in checkpoints](https:\/\/docs.databricks.com\/delta\/best-practices.html#column-stats). Data type: `Boolean` Default: `true` |\n| `delta.checkpoint.writeStatsAsStruct` `true` for Delta Lake to write file statistics to checkpoints in struct format for the `stats_parsed` column and to write partition values as a struct for `partitionValues_parsed`. See [Manage column-level statistics in checkpoints](https:\/\/docs.databricks.com\/delta\/best-practices.html#column-stats). Data type: `Boolean` Default: (none) |\n| `delta.checkpointPolicy` `classic` for classic Delta Lake checkpoints. `v2` for v2 checkpoints. See [Compatibility for tables with liquid clustering](https:\/\/docs.databricks.com\/delta\/clustering.html#compatibility). Data type: `String` Default: `classic` |\n| `delta.columnMapping.mode` Whether column mapping is enabled for Delta table columns and the corresponding Parquet columns that use different names. See [Rename and drop columns with Delta Lake column mapping](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html). Note: Enabling `delta.columnMapping.mode` automatically enables `delta.randomizeFilePrefixes`. Data type: `DeltaColumnMappingMode` Default: `none` |\n| `delta.compatibility.symlinkFormatManifest.enabled` `true` for Delta Lake to configure the Delta table so that all write operations on the table automatically update the manifests. Data type: `Boolean` Default: `false` |\n| `delta.dataSkippingNumIndexedCols` The number of columns for Delta Lake to collect statistics about for data skipping. A value of `-1` means to collect statistics for all columns. See [Data skipping for Delta Lake](https:\/\/docs.databricks.com\/delta\/data-skipping.html). Data type: `Int` Default: `32` |\n| `delta.dataSkippingStatsColumns` A comma-separated list of column names on which Delta Lake collects statistics to enhance data skipping functionality. 
This property takes precedence over `delta.dataSkippingNumIndexedCols`. See [Data skipping for Delta Lake](https:\/\/docs.databricks.com\/delta\/data-skipping.html). Data type: `String` Default: (none) |\n| `delta.deletedFileRetentionDuration` The shortest duration for Delta Lake to keep logically deleted data files before deleting them physically. This is to prevent failures in stale readers after compactions or partition overwrites. This value should be large enough to ensure that:* It is larger than the longest possible duration of a job if you run `VACUUM` when there are concurrent readers or writers accessing the Delta table. * If you run a streaming query that reads from the table, that query does not stop for longer than this value. Otherwise, the query may not be able to restart, as it must still read old files. See [Configure data retention for time travel queries](https:\/\/docs.databricks.com\/delta\/history.html#data-retention). Data type: `CalendarInterval` Default: `interval 1 week` |\n| `delta.enableChangeDataFeed` `true` to enable change data feed. See [Enable change data feed](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html#enable-change-data-feed). Data type: `Boolean` Default: `false` |\n| `delta.enableDeletionVectors` `true` to enable deletion vectors and predictive I\/O for updates. See [What are deletion vectors?](https:\/\/docs.databricks.com\/delta\/deletion-vectors.html). Data type: `Boolean` Default: Depends on workspace admin settings and Databricks Runtime version. See [Auto-enable deletion vectors](https:\/\/docs.databricks.com\/admin\/workspace-settings\/deletion-vectors.html) |\n| `delta.isolationLevel` The degree to which a transaction must be isolated from modifications made by concurrent transactions. Valid values are `Serializable` and `WriteSerializable`. See [Isolation levels and write conflicts on Databricks](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html). 
Data type: `String` Default: `WriteSerializable` |\n| `delta.logRetentionDuration` How long the history for a Delta table is kept. `VACUUM` operations override this retention threshold. Each time a checkpoint is written, Delta Lake automatically cleans up log entries older than the retention interval. If you set this property to a large enough value, many log entries are retained. This should not impact performance as operations against the log are constant time. Operations on history are parallel but will become more expensive as the log size increases. See [Configure data retention for time travel queries](https:\/\/docs.databricks.com\/delta\/history.html#data-retention). Data type: `CalendarInterval` Default: `interval 30 days` |\n| `delta.minReaderVersion` The minimum protocol reader version that a reader must support to read from this Delta table. Databricks recommends against manually configuring this property. See [How does Databricks manage Delta Lake feature compatibility?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html). Data type: `Int` Default: `1` |\n| `delta.minWriterVersion` The minimum protocol writer version that a writer must support to write to this Delta table. Databricks recommends against manually configuring this property. See [How does Databricks manage Delta Lake feature compatibility?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html). Data type: `Int` Default: `2` |\n| `delta.randomizeFilePrefixes` `true` for Delta Lake to generate a random prefix for a file path instead of partition information. For example, this may improve Amazon S3 performance when Delta Lake needs to send very high volumes of Amazon S3 calls to better partition across S3 servers. See [Delta table properties reference](https:\/\/docs.databricks.com\/delta\/table-properties.html). 
Data type: `Boolean` Default: `false` |\n| `delta.randomPrefixLength` When `delta.randomizeFilePrefixes` is set to `true`, the number of characters that Delta Lake generates for random prefixes. See [Delta table properties reference](https:\/\/docs.databricks.com\/delta\/table-properties.html). Data type: `Int` Default: `2` |\n| `delta.setTransactionRetentionDuration` The shortest duration within which new snapshots will retain transaction identifiers (for example, `SetTransaction`s). When a new snapshot sees a transaction identifier older than or equal to the duration specified by this property, the snapshot considers it expired and ignores it. The `SetTransaction` identifier is used when making the writes idempotent. See [Idempotent table writes in foreachBatch](https:\/\/docs.databricks.com\/structured-streaming\/delta-lake.html#idempotent-table-writes-in-foreachbatch) for details. Data type: `CalendarInterval` Default: (none) |\n| `delta.targetFileSize` The target file size in bytes or higher units for file tuning. For example, `104857600` (bytes) or `100mb`. See [Configure Delta Lake to control data file size](https:\/\/docs.databricks.com\/delta\/tune-file-size.html). Data type: `String` Default: (none) |\n| `delta.tuneFileSizesForRewrites` `true` to always use lower file sizes for all data layout optimization operations on the Delta table. `false` to never tune to lower file sizes, that is, prevent auto-detection from being activated. See [Configure Delta Lake to control data file size](https:\/\/docs.databricks.com\/delta\/tune-file-size.html). 
Data type: `Boolean` Default: (none) |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/table-properties.html"} +{"content":"# \n### Model serving with Databricks\n\nThis article describes Databricks Model Serving, including its advantages and limitations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html"} +{"content":"# \n### Model serving with Databricks\n#### What is Model Serving?\n\nDatabricks Model Serving provides a unified interface to deploy, govern, and query AI models. Each model you serve is available as a REST API that you can integrate into your web or client application. \nModel Serving provides a highly available and low-latency service for deploying models. The service automatically scales up or down to meet demand changes, saving infrastructure costs while optimizing latency performance. This functionality uses [serverless compute](https:\/\/docs.databricks.com\/getting-started\/overview.html#serverless). See the [Model Serving pricing page](https:\/\/www.databricks.com\/product\/pricing\/model-serving) for more details. \nModel serving supports serving: \n* [Custom models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/custom-models.html). These are Python models packaged in the MLflow format. They can be registered either in Unity Catalog or in the workspace model registry. Examples include scikit-learn, XGBoost, PyTorch, and Hugging Face transformer models.\n* State-of-the-art open models made available by [Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html). These models are curated foundation model architectures that support optimized inference. 
Base models such as Llama-2-70B-chat, BGE-Large, and Mistral-7B are available for immediate use with **pay-per-token** pricing, and workloads that require performance guarantees and fine-tuned model variants can be deployed with **provisioned throughput**.\n* [External models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html). These are models that are hosted outside of Databricks. Endpoints that serve external models can be centrally governed and customers can establish rate limits and access control for them. Examples include foundation models such as OpenAI\u2019s GPT-4, Anthropic\u2019s Claude, and others. \nNote \nYou can interact with supported large language models using the [AI Playground](https:\/\/docs.databricks.com\/large-language-models\/ai-playground.html). The AI Playground is a chat-like environment where you can test, prompt, and compare LLMs. This functionality is available in your Databricks workspace. \nModel serving offers a unified REST API and MLflow Deployment API for CRUD and querying tasks. In addition, it provides a single UI to manage all your models and their respective serving endpoints. You can also access models directly from SQL using [AI functions](https:\/\/docs.databricks.com\/large-language-models\/ai-functions.html) for easy integration into analytics workflows. \nFor an introductory tutorial on how to serve custom models on Databricks, see [Tutorial: Deploy and query a custom model](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-intro.html). 
\nFor a getting-started tutorial on how to query a foundation model on Databricks, see [Get started querying LLMs on Databricks](https:\/\/docs.databricks.com\/large-language-models\/llm-serving-intro.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html"} +{"content":"# \n### Model serving with Databricks\n#### Why use Model Serving?\n\n* **Deploy and query any models**: Model Serving provides a unified interface so that you can manage all models in one location and query them with a single API, regardless of whether they are hosted on Databricks or externally. This approach simplifies the process of experimenting with, customizing, and deploying models in production across various clouds and providers.\n* **Securely customize models with your private data**: Built on a Data Intelligence Platform, Model Serving simplifies the integration of features and embeddings into models through native integration with the [Databricks Feature Store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/automatic-feature-lookup.html) and [Databricks Vector Search](https:\/\/docs.databricks.com\/generative-ai\/vector-search.html). To further improve accuracy and contextual understanding, models can be fine-tuned with proprietary data and deployed effortlessly on Model Serving.\n* **Govern and monitor models**: The Serving UI allows you to centrally manage all model endpoints in one place, including those that are externally hosted. You can manage permissions, track and set usage limits, and monitor the [quality of all types of models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html). 
This enables you to democratize access to SaaS and open LLMs within your organization while ensuring appropriate guardrails are in place.\n* **Reduce cost with optimized inference and fast scaling**: Databricks has implemented a range of optimizations to ensure you get the best throughput and latency for large models. The endpoints automatically scale up or down to meet demand changes, saving infrastructure costs while optimizing latency performance. \nNote \nFor workloads that are latency sensitive or require high queries per second, Model Serving offers route optimization on custom model serving endpoints, see [Configure route optimization on serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/route-optimization.html). \n* **Bring reliability and security to Model Serving**: Model Serving is designed for high-availability, low-latency production use and can support over 25K queries per second with an overhead latency of less than 50 ms. The serving workloads are protected by multiple layers of security, ensuring a secure and reliable environment for even the most sensitive tasks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html"} +{"content":"# \n### Model serving with Databricks\n#### Requirements\n\n* Registered model in [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) or the [Workspace Model Registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/workspace-model-registry.html).\n* Permissions on the registered models as described in [Serving endpoint ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#serving-endpoints).\n* MLflow 1.29 or higher\n\n### Model serving with Databricks\n#### Enable Model Serving for your workspace\n\nTo use Model Serving, your account admin must read and accept the terms and conditions for enabling serverless compute in the account console. 
\nNote \nIf your account was created after March 28, 2022, serverless compute is enabled by default for your workspaces. \nIf you are not an account admin, you cannot perform these steps. Contact an account admin if your workspace needs access to serverless compute. \n1. As an account admin, go to the [feature enablement tab of the account console settings page](https:\/\/accounts.cloud.databricks.com\/settings\/feature-enablement).\n2. A banner at the top of the page prompts you to accept the additional terms. Once you read the terms, click **Accept**. If you do not see the banner asking you to accept the terms, this step has been completed already. \nAfter you\u2019ve accepted the terms, your account is enabled for serverless. \nNo additional steps are required to enable Model Serving in your workspace.\n\n### Model serving with Databricks\n#### Limitations and region availability\n\nDatabricks Model Serving imposes default limits to ensure reliable performance. See [Model Serving limits and regions](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html). If you have feedback on these limits or an endpoint in an unsupported region, reach out to your Databricks account team.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html"} +{"content":"# \n### Model serving with Databricks\n#### Data protection in Model Serving\n\nDatabricks takes data security seriously. Databricks understands the importance of the data you analyze using Databricks Model Serving, and implements the following security controls to protect your data. \n* Every customer request to Model Serving is logically isolated, authenticated, and authorized.\n* Databricks Model Serving encrypts all data at rest (AES-256) and in transit (TLS 1.2+). \nFor all paid accounts, Databricks Model Serving does not use user inputs submitted to the service or outputs from the service to train any models or improve any Databricks services. 
\nFor Databricks Foundation Model APIs, as part of providing the service, Databricks may temporarily process and store inputs and outputs for the purposes of preventing, detecting, and mitigating abuse or harmful uses. Your inputs and outputs are isolated from those of other customers, stored in the same region as your workspace for up to thirty (30) days, and only accessible for detecting and responding to security or abuse concerns.\n\n### Model serving with Databricks\n#### Additional resources\n\n* [Get started querying LLMs on Databricks](https:\/\/docs.databricks.com\/large-language-models\/llm-serving-intro.html).\n* [Tutorial: Deploy and query a custom model](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-intro.html)\n* [Deploy generative AI foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/foundation-models.html)\n* [Deploy custom models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/custom-models.html).\n* [Migrate to Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/migrate-model-serving.html)\n* [Migrate optimized LLM serving endpoints to provisioned throughput](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/migrate-provisioned-throughput.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html"} +{"content":"# \n### `\ud83d\udc4d Assessment & Evaluation Results Log` schema\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. 
\n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/assessment-log.html"} +{"content":"# \n### `\ud83d\udc4d Assessment & Evaluation Results Log` schema\n#### `df.printSchema()`\n\n```\nroot\n|-- request_id: string (nullable = true)\n|-- step_id: string (nullable = true)\n|-- source: struct (nullable = true)\n| |-- type: string (nullable = true)\n| |-- id: string (nullable = true)\n| |-- tags: map (nullable = true)\n| | |-- key: string\n| | |-- value: string (valueContainsNull = true)\n|-- timestamp: timestamp (nullable = true)\n|-- text_assessment: struct (nullable = true)\n| |-- step_id: string (nullable = true)\n| |-- ratings: map (nullable = true)\n| | |-- key: string\n| | |-- value: struct (valueContainsNull = true)\n| | | |-- bool_value: boolean (nullable = true)\n| | | |-- double_value: double (nullable = true)\n| | | |-- rationale: string (nullable = true)\n| |-- free_text_comment: string (nullable = true)\n| |-- suggested_output: string (nullable = true)\n|-- retrieval_assessment: struct (nullable = true)\n| |-- position: integer (nullable = true)\n| |-- step_id: string (nullable = true)\n| |-- ratings: map (nullable = true)\n| | |-- key: string\n| | |-- value: struct (valueContainsNull = true)\n| | | |-- bool_value: boolean (nullable = true)\n| | | |-- double_value: double (nullable = true)\n| | | |-- rationale: string (nullable = true)\n| |-- free_text_comment: string (nullable = true)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/assessment-log.html"} +{"content":"# \n### `\ud83d\udc4d Assessment & Evaluation Results Log` schema\n#### `df.schema`\n\n```\nfrom pyspark.sql.types import *\nschema = StructType(\n[\nStructField(\"request_id\", StringType(), True),\nStructField(\"step_id\", StringType(), True),\nStructField(\n\"source\",\nStructType(\n[\nStructField(\"type\", 
StringType(), True),\nStructField(\"id\", StringType(), True),\nStructField(\n\"tags\", MapType(StringType(), StringType(), True), True\n),\n]\n),\nTrue,\n),\nStructField(\"timestamp\", TimestampType(), True),\nStructField(\n\"text_assessment\",\nStructType(\n[\nStructField(\"step_id\", StringType(), True),\nStructField(\n\"ratings\",\nMapType(\nStringType(),\nStructType(\n[\nStructField(\"bool_value\", BooleanType(), True),\nStructField(\"double_value\", DoubleType(), True),\nStructField(\"rationale\", StringType(), True),\n]\n),\nTrue,\n),\nTrue,\n),\nStructField(\"free_text_comment\", StringType(), True),\nStructField(\"suggested_output\", StringType(), True),\n]\n),\nTrue,\n),\nStructField(\n\"retrieval_assessment\",\nStructType(\n[\nStructField(\"position\", IntegerType(), True),\nStructField(\"step_id\", StringType(), True),\nStructField(\n\"ratings\",\nMapType(\nStringType(),\nStructType(\n[\nStructField(\"bool_value\", BooleanType(), True),\nStructField(\"double_value\", DoubleType(), True),\nStructField(\"rationale\", StringType(), True),\n]\n),\nTrue,\n),\nTrue,\n),\nStructField(\"free_text_comment\", StringType(), True),\n]\n),\nTrue,\n),\n]\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/assessment-log.html"} +{"content":"# \n### Create an `\ud83d\udcd6 Evaluation Set`\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html) \nThis tutorial walks you through the process of creating a `\ud83d\udcd6 Evaluation Set` to evaluate a RAG Application\u2019s quality\/cost\/latency. 
\nThis evaluation set allows you to quickly and quantitatively check the quality of a new version of your application before distributing it to stakeholders for their feedback.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/5-create-eval-set.html"} +{"content":"# \n### Create an `\ud83d\udcd6 Evaluation Set`\n#### Step 1: Create an evaluation set with only questions\n\nYou can either collect `\ud83d\uddc2\ufe0f Request Log`s or manually curate questions. \n**To collect `\ud83d\uddc2\ufe0f Request Log`s:** \n1. Use the `\ud83d\udcac Review UI` to ask the RAG Application questions.\n2. Run the following SQL to create a Unity Catalog table called `<eval_table_name>`. This table can be stored in any Unity Catalog schema, but we suggest storing it in the Unity Catalog schema you configured for the RAG Application. \nNote \nYou can modify the SQL code to only select a subset of logs. If you do this, make sure you keep the original schema of the `request` column. \n```\nCREATE TABLE <eval_table_name> CLONE <catalog>.<schema>.rag_studio_<app_name>_<environment>_eval_dataset_template;\nINSERT INTO <eval_table_name> SELECT request FROM <catalog>.<schema>.<request_log> LIMIT 5;\n\n``` \nNote \nThe schema of `request` is intentionally the same between the request logs and the evaluation set. \n**To manually curate questions:** \n1. Clone the `<catalog>.<schema>.rag_studio_<app_name>_<environment>_eval_dataset_template` table to create a new table called `<eval_table_name>`. \n```\nCREATE TABLE <eval_table_name> CLONE <catalog>.<schema>.rag_studio_<app_name>_<environment>_eval_dataset_template\n\n```\n2. Add questions to the `<eval_table_name>`, ensuring that the `request` column has the same schema as shown below. 
\n```\n{\n\"request_id\": \"c20cb3a9-23d0-48ac-a2cb-98c47a0b84e2\",\n\"conversation_id\": null,\n\"timestamp\": \"2024-01-18T23:22:52.701Z\",\n\"messages\": [\n{\n\"role\": \"user\",\n\"content\": \"Hello how are you\"\n}\n],\n\"last_input\": \"Hello how are you\"\n}\n\n``` \nNote \nThe `messages` array follows the OpenAI messages format. You can include any number of role\/content pairs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/5-create-eval-set.html"} +{"content":"# \n### Create an `\ud83d\udcd6 Evaluation Set`\n#### Step 2: Optionally - add ground truth data for each question\n\nDatabricks suggests adding ground-truth answers and retrieved contexts to the questions you just created - this will allow you to more accurately measure the quality of your application. However, *this step is optional* and you can still use RAG Studio\u2019s functionality without doing so - the only missing functionality is the computation of an [answer correctness metric](https:\/\/docs.databricks.com\/rag-studio\/details\/metrics.html) + [retrieval metrics](https:\/\/docs.databricks.com\/rag-studio\/details\/metrics.html). \n1. Open a Databricks Notebook\n2. Create a Spark dataframe with the following schema \n```\nfrom pyspark.sql.types import StructType, StringType, StructField\n\nschema = StructType([StructField('id', StringType(), True), StructField('answer', StringType(), True), StructField('doc_ids', StringType(), True)])\n\nlabeled_df = spark.createDataFrame([], schema)\n\n```\n3. For each `request_id` in your `<eval_table_name>` from above, add a ground truth answer to the Dataframe.\n4. 
Append the ground-truth labels to your `<eval_table_name>`: \n```\n%pip install \"https:\/\/ml-team-public-read.s3.us-west-2.amazonaws.com\/wheels\/rag-studio\/ed24b030-3c87-40b1-b04c-bb1977254aa3\/databricks_rag-0.0.0a1-py3-none-any.whl\"\ndbutils.library.restartPython()\n\nfrom databricks.rag.utils import add_labels_to_eval_dataset\n\nhelp(add_labels_to_eval_dataset)\n\nadd_labels_to_eval_dataset(labeled_df, \"<eval_table_name>\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/5-create-eval-set.html"} +{"content":"# \n### Create an `\ud83d\udcd6 Evaluation Set`\n#### Follow the next tutorial!\n\n[Deploy a RAG application to production](https:\/\/docs.databricks.com\/rag-studio\/tutorials\/6-deploy-rag-app-to-production.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/5-create-eval-set.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Slow Spark stage with little I\/O\n\nIf you have a slow stage with not much I\/O, this could be caused by: \n* Reading a lot of small files\n* Writing a lot of small files\n* Slow UDF(s)\n* Cartesian join\n* Exploding join \nAlmost all of these issues can be identified using the SQL DAG.\n\n##### Slow Spark stage with little I\/O\n###### Open the SQL DAG\n\nTo open the SQL DAG, scroll up to the top of the job\u2019s page and click on **Associated SQL Query**: \n![SQL ID](https:\/\/docs.databricks.com\/_images\/stage-to-sql.png) \nYou should now see the DAG. If not, scroll around a bit and you should see it: \n![SQL DAG](https:\/\/docs.databricks.com\/_images\/sql-dag.png) \nBefore you move on, familiarize yourself with the DAG and where time is being spent. Some nodes in the DAG have helpful time information and others don\u2019t. 
For example, this block took 2.1 minutes and even provides the stage ID: \n![Slow Stage Node](https:\/\/docs.databricks.com\/_images\/slow-stage-in-dag.png) \nThis node requires you to open it to see that it took 1.4 minutes: \n![Slow Write Node](https:\/\/docs.databricks.com\/_images\/slow-write-node.png) \nThese times are cumulative, so it\u2019s the total time spent on all the tasks, not the clock time. But it\u2019s still very useful as they are correlated with clock time and cost. \nIt\u2019s helpful to familiarize yourself with where in the DAG the time is being spent.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/slow-spark-stage-low-io.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Slow Spark stage with little I\/O\n###### Reading a lot of small files\n\nIf you see one of your scan operators is taking a lot of time, open it up and look for the number of files read: \n![Reading Many Files](https:\/\/docs.databricks.com\/_images\/many-files-read.png) \nIf you\u2019re reading tens of thousands of files or more, you may have a small file problem. Your files should be no less than 8MB. The small file problem is most often caused by partitioning on too many columns or a high-cardinality column. \nIf you\u2019re lucky, you might just need to run [OPTIMIZE](https:\/\/docs.databricks.com\/en\/sql\/language-manual\/delta-optimize.html). 
Regardless, you need to reconsider your [file layout](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#data-layout).\n\n##### Slow Spark stage with little I\/O\n###### Writing a lot of small files\n\nIf you see your write is taking a long time, open it up and look for the number of files and how much data was written: \n![Writing many files](https:\/\/docs.databricks.com\/_images\/many-files-write.png) \nIf you\u2019re writing tens of thousands of files or more, you may have a small file problem. Your files should be no less than 8MB. The small file problem is most often caused by partitioning on too many columns or a high-cardinality column. You need to reconsider your [file layout](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#data-layout) or turn on [optimized writes](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#auto-optimize).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/slow-spark-stage-low-io.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Slow Spark stage with little I\/O\n###### Slow UDFs\n\nIf you know you have [UDFs](https:\/\/docs.databricks.com\/udf\/index.html), or see something like this in your DAG, you might be suffering from slow UDFs: \n![UDF Node](https:\/\/docs.databricks.com\/_images\/udf-dag.png) \nIf you think you\u2019re suffering from this problem, try commenting out your UDF to see how it impacts the speed of your pipeline. If the UDF is indeed where the time is being spent, your best bet is to rewrite the UDF using native functions. If that\u2019s not possible, consider the number of tasks in the stage executing your UDF. 
If it\u2019s less than the number of cores on your cluster, `repartition()` your dataframe before using the UDF: \n```\n(df\n.repartition(num_cores)\n.withColumn('new_col', udf(...))\n)\n\n``` \nUDFs can also suffer from memory issues. Consider that each task may have to load all the data in its partition into memory. If this data is too big, things can get very slow or unstable. Repartitioning can also resolve this issue by making each task smaller.\n\n##### Slow Spark stage with little I\/O\n###### Cartesian join\n\nIf you see a cartesian join or nested loop join in your DAG, you should know that these joins are very expensive. Make sure that\u2019s what you intended and see if there\u2019s another way.\n\n##### Slow Spark stage with little I\/O\n###### Exploding join or explode\n\nIf you see a few rows going into a node and orders of magnitude more coming out, you may be suffering from an exploding join or explode(): \n![Exploding Join](https:\/\/docs.databricks.com\/_images\/exploding-join.png) \nRead more about explodes in the [Databricks Optimization guide](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#data-explosion).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/slow-spark-stage-low-io.html"} +{"content":"# Technology partners\n## Connect to reverse ETL partners using Partner Connect\n#### Connect to Census\n\nCensus is a reverse ETL platform that syncs customer data from your lakehouse into downstream business tools such as Salesforce, HubSpot, and Google Ads. \nYou can integrate your Databricks SQL warehouses and Databricks clusters with Census.\n\n#### Connect to Census\n##### Connect to Census using Partner Connect\n\nTo connect your Databricks workspace to Census using Partner Connect, see [Connect to reverse ETL partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/reverse-etl.html). \nNote \nPartner Connect only supports SQL warehouses for Census. 
To connect a cluster to Census, connect to Census manually.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/reverse-etl\/census.html"} +{"content":"# Technology partners\n## Connect to reverse ETL partners using Partner Connect\n#### Connect to Census\n##### Connect to Census manually\n\nThis section describes how to connect an existing SQL warehouse or cluster in your Databricks workspace to Census manually. \nNote \nFor Databricks SQL warehouses, you can connect to Census using Partner Connect to simplify the experience. \n### Requirements \nBefore you connect to Census manually, you need the following: \n* A cluster or SQL warehouse in your Databricks workspace. \n+ [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n+ [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* The connection details for your cluster or SQL warehouse, specifically the **Server Hostname**, **Port**, and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). To create a personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**.\nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. 
If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. See your workspace administrator or the following: \n+ [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n+ [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). 
\n### Steps to connect \nTo connect to Census manually, follow [Databricks](https:\/\/docs.getcensus.com\/sources\/databricks) in the Census documentation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/reverse-etl\/census.html"} +{"content":"# Technology partners\n## Connect to reverse ETL partners using Partner Connect\n#### Connect to Census\n##### Additional resources\n\nExplore the following Census resources: \n* [Website](https:\/\/www.getcensus.com\/)\n* [Documentation](https:\/\/docs.getcensus.com\/)\n* [Support](mailto:support%40getcensus.com)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/reverse-etl\/census.html"} +{"content":"# Databricks data engineering\n## What are init scripts?\n#### Use global init scripts\n\nImportant \nGlobal init scripts only run on clusters configured with single user or legacy no-isolation shared access mode, so Databricks recommends configuring all init scripts as cluster-scoped init scripts and managing them across your workspace using cluster policies. \nA global init script runs on every cluster created in your workspace. Global init scripts are useful when you want to enforce organization-wide library configurations or security screens. Only workspace admins can create global init scripts. You can create them using either the UI or REST API. \nImportant \nBecause global init scripts run on all clusters, consider potential impacts such as the following: \n* It is easy to add libraries or make other modifications that cause unexpected impacts. Whenever possible, use cluster-scoped init scripts instead.\n* Any user who creates a cluster and enables cluster log delivery can view the `stderr` and `stdout` output from global init scripts. You should ensure that your global init scripts do not output any sensitive information. 
\nYou can troubleshoot global init scripts by configuring [cluster log delivery](https:\/\/docs.databricks.com\/compute\/configure.html#cluster-log-delivery) and examining the init script log. See [Init script logging](https:\/\/docs.databricks.com\/init-scripts\/logs.html). \nNote \nGlobal init scripts are not run on [model serving clusters](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/global.html"} +{"content":"# Databricks data engineering\n## What are init scripts?\n#### Use global init scripts\n##### Add a global init script using the UI\n\nTo configure global init scripts using the admin settings: \n1. Go to the admin settings and click the **Compute** tab.\n2. Click **Manage** next to **Global init scripts**.\n3. Click **+ Add**.\n4. Name the script and enter it by typing, pasting, or dragging a text file into the **Script** field. \nNote \nThe init script cannot be larger than 64KB. If a script exceeds that size, an error message appears when you try to save.\n5. If you have more than one global init script configured for your workspace, set the order in which the new script will run.\n6. If you want the script to be enabled for all new and restarted clusters after you save, toggle **Enabled**. \nImportant \nWhen you add a global init script or make changes to the name, run order, or enablement of init scripts, those changes do not take effect until you restart the cluster.\n7. Click **Add**.\n\n#### Use global init scripts\n##### Add a global init script using Terraform\n\nYou can add a global init script by using the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_global\\_init\\_script](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/global_init_script).\n\n#### Use global init scripts\n##### Edit a global init script using the UI\n\n1. 
Go to the admin settings and click the **Compute** tab.\n2. Click **Manage** next to **Global init scripts**.\n3. Click a script.\n4. Edit the script.\n5. Click **Confirm**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/global.html"} +{"content":"# Databricks data engineering\n## What are init scripts?\n#### Use global init scripts\n##### Configure a global init script using the API\n\nWorkspace admins can add, delete, re-order, and get information about the global init scripts in your workspace using the [Global Init Scripts API](https:\/\/docs.databricks.com\/api\/workspace\/globalinitscripts).\n\n","doc_uri":"https:\/\/docs.databricks.com\/init-scripts\/global.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n#### Use XGBoost on Databricks\n\nThis article provides examples of training machine learning models using XGBoost in Databricks. Databricks Runtime for Machine Learning includes XGBoost libraries for both Python and Scala. You can train XGBoost models on an individual machine or in a distributed fashion.\n\n#### Use XGBoost on Databricks\n##### Train XGBoost models on a single node\n\nYou can train models using the Python `xgboost` package. This package supports only single node workloads. To train a PySpark ML pipeline and take advantage of distributed training, see [Distributed training of XGBoost models](https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost.html#xgboost-pyspark). 
\n### XGBoost Python notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/xgboost-python.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n#### Use XGBoost on Databricks\n##### Distributed training of XGBoost models\n\nFor distributed training of XGBoost models, Databricks includes PySpark estimators based on the `xgboost` package. Databricks also includes the Scala package `xgboost-4j`. For details and example notebooks, see the following: \n* [Distributed training of XGBoost models using xgboost.spark](https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost-spark.html) (Databricks Runtime 12.0 ML and above)\n* [Distributed training of XGBoost models using sparkdl.xgboost](https:\/\/docs.databricks.com\/machine-learning\/train-model\/sparkdl-xgboost.html) (deprecated starting with Databricks Runtime 12.0 ML)\n* [Distributed training of XGBoost models using Scala](https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost-scala.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n#### Use XGBoost on Databricks\n##### Install XGBoost on Databricks\n\nIf you need to install XGBoost on Databricks Runtime or use a different version than the one pre-installed with Databricks Runtime ML, follow these instructions. \n### Install XGBoost on Databricks Runtime ML \nXGBoost is included in Databricks Runtime ML. You can use these libraries in Databricks Runtime ML without installing any packages. 
\nFor the version of XGBoost installed in the Databricks Runtime ML version you are using, see the [release notes](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). To install other Python versions in Databricks Runtime ML, install XGBoost as a [Databricks PyPI library](https:\/\/docs.databricks.com\/libraries\/index.html). Specify it as the following and replace `<xgboost version>` with the desired version. \n```\nxgboost==<xgboost version>\n\n``` \n### Install XGBoost on Databricks Runtime \n* **Python package**: Execute the following command in a notebook cell: \n```\n%pip install xgboost\n\n``` \nTo install a specific version, replace `<xgboost version>` with the desired version: \n```\n%pip install xgboost==<xgboost version>\n\n``` \n* **Scala\/Java packages**: Install as a [Databricks library](https:\/\/docs.databricks.com\/libraries\/index.html) with the Spark Package name `xgboost-linux64`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost.html"} +{"content":"# What is Databricks Marketplace?\n### Manage shared Databricks Marketplace data products\n\nThis article describes how to manage data products that are shared with you through Databricks Marketplace and your pending requests for access to data products. This article is intended for data consumers. \nTo learn how to request and access data products in Databricks Marketplace, see [Access data products in Databricks Marketplace (Unity Catalog-enabled workspaces)](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html).\n\n### Manage shared Databricks Marketplace data products\n#### Before you begin\n\nTo manage data products that have been shared with you, you must have the `CREATE CATALOG` and `USE PROVIDER` permissions on the Unity Catalog metastore attached to your workspace, or the [metastore admin role](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html). 
\nTo get these privileges, ask your Databricks account admin or metastore admin to grant them to you. For more information, see [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html). \nIf your workspace was enabled for Unity Catalog automatically, the workspace admin has these permissions and can grant them to other users. See [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement).\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/manage-requests-consumer.html"} +{"content":"# What is Databricks Marketplace?\n### Manage shared Databricks Marketplace data products\n#### View request status\n\nIf you have requested a data product that requires provider approval, you can view request status in the Marketplace UI: \n1. Log into your Databricks workspace. \nFor required permissions, see [Before you begin](https:\/\/docs.databricks.com\/marketplace\/manage-requests-consumer.html#requirements).\n2. In the sidebar, click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**.\n3. On the upper-right corner of the Marketplace page, click **My requests**.\n4. On the **Requests** tab, view all requests and their current review status: **Pending**, **Fulfilled**, and **Denied**, along with the requested and reviewed dates. \nTransactions that are ongoing between you and a data provider occur outside of the Databricks Marketplace system. Details of those transactions are not captured here.\n\n### Manage shared Databricks Marketplace data products\n#### View and access installed data products\n\nTo view installed data products and access them in Catalog Explorer: \n1. Log into your Databricks workspace. \nFor required permissions, see [Before you begin](https:\/\/docs.databricks.com\/marketplace\/manage-requests-consumer.html#requirements).\n2. 
In the sidebar, click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**.\n3. On the upper-right corner of the Marketplace page, click **My requests**.\n4. On the **Installed data products** page, you can view all installed data products. \nTo view details about a specific product, such as provider info, documentation, terms, and sample notebooks, click the product name. \nTo view the shared data in Catalog Explorer, click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu (also known as the three-dot menu) at the end of the data product row, and select **View data**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/manage-requests-consumer.html"} +{"content":"# What is Databricks Marketplace?\n### Manage shared Databricks Marketplace data products\n#### Work with installed data products\n\nEach Marketplace dataset is shared with you in a read-only catalog. Catalogs are the top-level container for data managed by Unity Catalog in Databricks. \nFor detailed information about accessing and granting permissions on shared catalogs, see [Access the shared data using Unity Catalog](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html#access). \nTo learn how to access shared notebooks, see [View sample notebooks](https:\/\/docs.databricks.com\/marketplace\/get-started-consumer.html#notebooks) \nFor more information about the data object hierarchy in Unity Catalog, see [The Unity Catalog object model](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#object-model).\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/manage-requests-consumer.html"} +{"content":"# What is Databricks Marketplace?\n### Manage shared Databricks Marketplace data products\n#### Delete installed data products\n\nWhen you remove an installed data product, the shared catalog is deleted from your workspace. 
Remember to update or delete any queries, dashboards, or notebooks that depend on the data in that catalog. \nTo remove an installed data product: \n1. Log into your Databricks workspace. \nFor required permissions, see [Before you begin](https:\/\/docs.databricks.com\/marketplace\/manage-requests-consumer.html#requirements).\n2. In the sidebar, click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**.\n3. On the upper-right corner of the Marketplace page, click **My requests**.\n4. On the **Installed data products** tab, find the data product you want to delete and click the product name.\n5. On the product detail page, click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu next to the **Open** button and select **Uninstall product**.\n6. On the confirmation dialog, type the repository\u2019s name (which is displayed in the sentence immediately above the confirmation field).\n7. Click **Confirm and uninstall**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/manage-requests-consumer.html"} +{"content":"# Security and compliance guide\n## Networking\n#### Serverless compute plane networking\n\nThis guide introduces tools to secure network access between the compute resources in the Databricks serverless compute plane and customer resources. To learn more about the control plane and the serverless compute plane, see [Databricks architecture overview](https:\/\/docs.databricks.com\/security\/network\/index.html#architecture). \nNote \nThere are currently no networking charges for serverless features. In a later release, you might be charged. Databricks will provide advance notice for networking pricing changes.\n\n#### Serverless compute plane networking\n##### Serverless compute plane networking overview\n\nServerless compute resources run in the serverless compute plane, which is managed by Databricks. 
Account admins can configure secure connectivity between the serverless compute plane and their resources. This network connection is labeled as 2 on the diagram below: \n![Network connectivity overview diagram](https:\/\/docs.databricks.com\/_images\/networking-serverless.png) \nConnectivity between the control plane and the serverless compute plane is always over the cloud network backbone and not the public internet. For more information on configuring security features on the other network connections in the diagram, see [Networking](https:\/\/docs.databricks.com\/security\/network\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/serverless-network-security\/index.html"} +{"content":"# Security and compliance guide\n## Networking\n#### Serverless compute plane networking\n##### What is a network connectivity configuration (NCC)?\n\nServerless network connectivity is managed with network connectivity configurations (NCC). NCCs are account-level regional constructs that are used to manage private endpoint creation and firewall enablement at scale. \nAccount admins create NCCs in the account console and an NCC can be attached to one or more workspaces to enable firewalls for resources. An NCC contains a list of stable IP addresses. When an NCC is attached to a workspace, serverless compute in that workspace uses one of those IP addresses to connect to the cloud resource. You can allow list those networks on your resource firewall. See [Configure a firewall for serverless compute access](https:\/\/docs.databricks.com\/security\/network\/serverless-network-security\/serverless-firewall.html). \nCreating a resource firewall also affects connectivity from the classic compute plane to your resource. You must also allow list the networks on your resource firewalls to connect to them from classic compute resources. \nNCC firewall enablement is not supported for Amazon S3 or Amazon DynamoDB. 
NCC firewall enablement is only supported from SQL warehouses. It is not supported from other compute resources in the serverless compute plane.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/serverless-network-security\/index.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Run Git operations on Databricks Git folders (Repos)\n\nThis article describes how to perform common Git operations in your Databricks workspace using Git folders, including cloning, branching, committing, and pushing.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Run Git operations on Databricks Git folders (Repos)\n##### Clone a repo connected to a remote Git repository\n\n1. In the sidebar, select **Workspace** and then browse to the folder where you want to create the Git repo clone.\n2. Click the down arrow to the right of **Add** in the upper right of the workspace, and select **Git folder** from the dropdown. \n![Add repo UI.](https:\/\/docs.databricks.com\/_images\/add-repo.png)\n3. In the **Create Git folder** dialog, provide the following information: \n* The URL of the Git repository you want to clone, in the format `https:\/\/example.com\/organization\/project.git`\n* The Git provider for the repository you want to clone. 
Options include GitHub, GitHub Enterprise, GitLab, and Azure DevOps (Azure Repos)\n* The name of the folder in your workspace that will contain the contents of the cloned repo\n* Whether or not you will use sparse checkout, in which only subdirectories specified using a cone pattern are cloned\n![Clone from Git folder UI.](https:\/\/docs.databricks.com\/_images\/clone-from-repo.png) \nAt this stage, you have the option to clone only a subset of your repository\u2019s directories using [sparse checkout](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html#sparse). This is useful if your repository is larger than Databricks supported [limits](https:\/\/docs.databricks.com\/repos\/limits.html). \n1. Click **Create Git folder**. The contents of the remote repository are cloned to the Databricks repo, and you can begin working with them using supported Git operations through your workspace.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Run Git operations on Databricks Git folders (Repos)\n##### Best practice: Collaborating in Git folders\n\nDatabricks Git folders effectively behave as embedded Git clients in your workspace so users can collaborate using Git-based source control and versioning. To make team collaboration more effective, **use a separate Databricks Git folder mapped to a remote Git repo for each user who works in their own development branch**. Although multiple users can contribute content to a Git folder, only **one** designated user should perform Git operations such as pull, push, commit, and branch switching. If multiple users perform Git operations on a Git folder, branch management can become difficult and error-prone, such as when a user switches a branch and unintentionally switches it for all other users of that folder. 
\nTo share a Git folder with a collaborator, click **Copy link to create Git folder** in the banner at the top of your Databricks workspace. This action copies a URL to your local clipboard which you can send to another user. When the recipient user loads that URL in a browser they will be taken to the workspace where they can create their own Git folder cloned from the same remote Git repository. They will see a **Create Git folder** modal dialog in the UI, pre-populated with the values taken from your own Git folder. When they click the blue **Create Git folder** button in the modal, the Git repository is cloned into the workspace under their current working folder, where they can now work with it directly. \n![Click the **Copy link to Git folder** button the banner to share the Git repo configuration for the folder with another user in your Databricks organization](https:\/\/docs.databricks.com\/_images\/git-folder-share1.png) \nWhen accessing someone else\u2019s Git folder in a shared workspace, click **Create Git folder** in the banner at the top. This action opens the **Create Git folder** dialog for you, pre-populated with the configuration for the Git repository that backs it. \n![When viewing another user's Git folder, click the **Create Git folder** button in the banner to make a copy of that folder in your own workspace](https:\/\/docs.databricks.com\/_images\/git-folder-copy-collab1.png) \nImportant \nCurrently you cannot use the Git CLI to perform Git operations in a Git folder. 
If you clone a Git repo using the CLI through a cluster\u2019s web terminal, the files won\u2019t display in the Databricks UI.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Run Git operations on Databricks Git folders (Repos)\n##### Access the Git dialog\n\nYou can access the Git dialog from a notebook or from the Databricks Git folders browser. \n* From a notebook, click the button next to the name of the notebook that identifies the current Git branch. \n![Git dialog button on notebook.](https:\/\/docs.databricks.com\/_images\/toolbar.png)\n* From the Databricks Git folders browser, click the button to the right of the repo name. You can also right-click the repo name and select **Git\u2026** from the menu. \n![Git dialog button and Git menu in repo browser.](https:\/\/docs.databricks.com\/_images\/git-button-repos.png) \nYou will see a full-screen dialog where you can perform Git operations. \n![The dialog used to perform Git operations in a Databricks workspace.](https:\/\/docs.databricks.com\/_images\/git-folder-ui1.png) \n1. Your current working branch. You can select other branches here. If other users have access to this Git folder, changing the branch will also change the branch for them if they share the same workspace. See a recommended [best practice](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html#best-practices) to avoid this problem.\n2. The button to create a new branch.\n3. The list of file assets and subfolders checked into your current branch.\n4. A button that takes you to your Git provider and shows you the current branch history.\n5. The button to pull content from the remote Git repository.\n6. Text box where you add a commit message and optional expanded description for your changes.\n7. 
The button to commit your work to the working branch and push the updated branch to the remote Git repository. \nClick the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab in the upper right to choose from additional Git branch operations, such as a hard reset, a merge, or a rebase. \n![Drop down menu in the Git folder dialogue for branch operations.](https:\/\/docs.databricks.com\/_images\/git-folder-ui2.png) \nThis is your home for performing Git operations on your workspace Git folder. You are limited to the Git operations presented in the user interface.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Run Git operations on Databricks Git folders (Repos)\n##### Create a new branch\n\nYou can create a new branch based on an existing branch from the Git dialog: \n![Git dialog new branch.](https:\/\/docs.databricks.com\/_images\/git-dialog-new-branch.png)\n\n#### Run Git operations on Databricks Git folders (Repos)\n##### Switch to a different branch\n\nYou can switch to (check out) a different branch using the branch dropdown in the Git dialog: \n![Git dialog switch to different branch](https:\/\/docs.databricks.com\/_images\/git-repos-switch-branch-dialog.png) \nImportant \nAfter you check out a branch in a Git folder, there is always a chance the branch may be deleted on the remote Git repository by someone else. If a branch is deleted on the remote repo, the local version can remain present in the associated Git folder for up to 7 days. 
Local branches in Databricks cannot be deleted, so if you must remove them, you must also delete and reclone the repository.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Run Git operations on Databricks Git folders (Repos)\n##### Commit and push changes to the remote Git repository\n\nWhen you have added new notebooks or files, or made changes to existing notebooks or files, the Git folder UI highlights the changes. \n![Git dialog with changes highlighted.](https:\/\/docs.databricks.com\/_images\/git-commit-push.png) \nAdd a required commit message for the changes, and click **Commit & Push** to push these changes to the remote Git repository. \nIf you don\u2019t have permission to commit to the default branch (such as the `main` branch), create a new branch and use your Git provider\u2019s interface to create a pull request (PR) to merge it into the default branch. \nNote \n* Notebook outputs are not included in commits by default when notebooks are saved in source file formats (`.py`, `.scala`, `.sql`, `.r`). For information on committing notebook outputs using the IPYNB format, see [Control IPYNB notebook output artifact commits](https:\/\/docs.databricks.com\/repos\/manage-assets.html#notebook-outputs-commit)\n\n#### Run Git operations on Databricks Git folders (Repos)\n##### Pull changes from the remote Git repository\n\nTo pull changes from the remote Git repository, click **Pull** in the Git operations dialog. Notebooks and other files are updated automatically to the latest version in your remote Git repository. If the changes pulled from the remote repo conflict with your local changes in Databricks, you need to resolve the [merge conflicts](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html#resolve-merge-conflicts). \nImportant \nGit operations that pull in upstream changes clear the notebook state. 
For more information, see [Incoming changes clear the notebook state](https:\/\/docs.databricks.com\/repos\/limits.html#incoming-changes-clear-notebook-state).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Run Git operations on Databricks Git folders (Repos)\n##### Merge branches\n\nAccess the Git **Merge** operation by selecting it from the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab in the upper right of the Git operations dialog. \nThe merge function in Databricks Git folders merges one branch into another using `git merge`. A merge operation is a way to combine the commit history from one branch into another branch; the only difference is the strategy it uses to achieve this. For Git beginners, we recommend using merge (over rebase) because it does not require force pushing to a branch and therefore does not rewrite commit history. \nTo learn more about the differences between merging and rebasing commits, please see [Atlassian\u2019s documentation on the subject](https:\/\/www.atlassian.com\/git\/tutorials\/merging-vs-rebasing). \n* If there\u2019s a merge conflict, resolve it in the Git folders UI.\n* If there\u2019s no conflict, the merge is pushed to the remote Git repo using `git push`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Run Git operations on Databricks Git folders (Repos)\n##### `Rebase` a branch on another branch\n\nAccess the Git **Rebase** operation by selecting it from the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu in the upper right of the Git operations dialog. \nRebasing alters the commit history of a branch. Like `git merge`, `git rebase` integrates changes from one branch into another. 
Rebase does the following: \n1. Saves the commits on your current branch to a temporary area.\n2. Resets the current branch to the chosen branch.\n3. Reapplies each individual commit previously saved on the current branch, resulting in a linear history that combines changes from both branches. \nFor an in-depth explanation of rebasing, see [git rebase](https:\/\/www.atlassian.com\/git\/tutorials\/rewriting-history\/git-rebase). \nWarning \nUsing rebase can cause versioning issues for collaborators working in the same repo. \nA common workflow is to rebase a feature branch on the main branch. \nTo rebase a branch on another branch: \n1. From the **Branch** menu in the Git folders UI, select the branch you want to rebase.\n2. Select **Rebase** from the kebab menu. \n![Git rebase function on the kebab menu.](https:\/\/docs.databricks.com\/_images\/rebase-option.png)\n3. Select the branch you want to rebase on. \nThe rebase operation integrates changes from the branch you choose here into the current branch. \nDatabricks Git folders runs `git commit` and `git push --force` to update the remote Git repo.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Run Git operations on Databricks Git folders (Repos)\n##### Resolve merge conflicts\n\nMerge conflicts happen when 2 or more Git users attempt to merge changes to the same lines of a file into a common branch and Git cannot choose the \u201cright\u201d changes to apply. Merge conflicts can also occur when a user attempts to pull or merge changes from another branch into a branch with uncommitted changes. 
\n![Animated GIF that shows a common merge conflict arising from uncommitted changes during a git pull](https:\/\/docs.databricks.com\/_images\/git-ops-2.gif) \nIf an operation such as pull, rebase, or merge causes a merge conflict, the Git folders UI shows a list of files with conflicts and options for resolving the conflicts. \nYou have two primary options: \n* Use the Git folders UI to resolve the conflict.\n* Abort the Git operation, manually discard the changes in the conflicting file, and try the Git operation again. \n![Animated GIF showing a merge conflict in a Databricks Git folders folder UI](https:\/\/docs.databricks.com\/_images\/git-ops-1.gif) \nWhen resolving merge conflicts with the Git folders UI, you must choose between manually resolving the conflicts in the editor or keeping all incoming or current changes. \n**Keep All Current** or **Take Incoming Changes** \nIf you know you **only** want to keep all of the current or incoming changes, click the kebab to the right of the file name in your notebook pane and select either **Keep all current changes** or **Take all incoming changes**. Click the button with the same label to commit the changes and resolve the conflict. \n![The pane for the Databricks notebook UI, showing the dropdown options for merge conflict resolution](https:\/\/docs.databricks.com\/_images\/git-ops-3-1.png) \nTip \nConfused about which option to pick? The color of each option matches the respective code changes that it will keep in the file. \n**Manually Resolving Conflicts** \nManual conflict resolution lets you determine which of the conflicting lines should be accepted in the merge. For merge conflicts, you resolve the conflict by directly editing the contents of the file with the conflicts. 
\n![Animated GIF showing a manual resolution of a merge conflict](https:\/\/docs.databricks.com\/_images\/git-ops-3.gif) \nTo resolve the conflict, select the code lines you want to preserve and delete everything else, including the Git merge conflict markers. When you\u2019re done, select **Mark As Resolved**. \nIf you decide you made the wrong choices when resolving merge conflicts, click the **Abort** button to abort the process and undo everything. Once all conflicts are resolved, click the **Continue Merge** or **Continue Rebase** option to resolve the conflict and complete the operation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Run Git operations on Databricks Git folders (Repos)\n##### Git `reset`\n\nIn Databricks Git folders, you can perform a Git `reset` within the Databricks UI. Git reset in Databricks Git folders is equivalent to `git reset --hard` combined with `git push --force`. \nGit reset replaces the branch contents and history with the most recent state of another branch. You can use this when edits are in conflict with the upstream branch, and you don\u2019t mind losing those edits when you reset to the upstream branch. [Read more about git `reset --hard`](https:\/\/git-scm.com\/docs\/git-reset#Documentation\/git-reset.txt---hard). \n### Reset to an upstream (remote) branch \nWith `git reset` in this scenario: \n* You reset your selected branch (for example, `feature_a`) to a different branch (for example, `main`).\n* You also reset the upstream (remote) branch `feature_a` to `main`. \nImportant \nWhen you reset, you lose all uncommitted and committed changes in both the local and remote version of the branch. \nTo reset a branch to a remote branch: \n1. In the Git folders UI from the **Branch** menu, choose the branch you want to reset. 
\n![Branch selector in the Git folders UI.](https:\/\/docs.databricks.com\/_images\/repos-branch-selector.png)\n2. Select **Reset** from the kebab menu. \n![Git reset operation on the kebab menu.](https:\/\/docs.databricks.com\/_images\/reset-local-branch.png)\n3. Select the branch to reset to. \n![Git reset --hard dialog.](https:\/\/docs.databricks.com\/_images\/reset-to-branch.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Run Git operations on Databricks Git folders (Repos)\n##### Configure sparse checkout mode\n\nSparse checkout is a client-side setting that allows you to clone and work with only a subset of the remote repository\u2019s directories in Databricks. This is especially useful if your repository\u2019s size is beyond the Databricks supported [limits](https:\/\/docs.databricks.com\/repos\/limits.html). \nYou can use the Sparse Checkout mode when adding (cloning) a new repo. \n1. In the **Add Git folder** dialog, open **Advanced**.\n2. Select **Sparse checkout mode**. \n![Sparse checkout option in the Add Git folder dialog.](https:\/\/docs.databricks.com\/_images\/sparse-checkout-option.png)\n3. In the **Cone patterns** box, specify the cone checkout patterns you want. Separate multiple patterns by line breaks. \nAt this time, you can\u2019t disable sparse checkout for a repo in Databricks. \n### How cone patterns work \nTo understand how cone patterns work in the sparse checkout mode, see the following diagram representing the remote repository structure. \n![Remote repository structure without sparse checkout.](https:\/\/docs.databricks.com\/_images\/repo-structure-without-sparse-checkout.png) \nIf you select **Sparse checkout mode**, but do not specify a cone pattern, the default cone pattern is applied. 
This includes only the files in the root and no subdirectories, resulting in the following repo structure: \n![Sparse checkout: default cone pattern.](https:\/\/docs.databricks.com\/_images\/sparse-checkout-default.png) \nSetting the sparse checkout cone pattern as `parent\/child\/grandchild` results in all contents of the `grandchild` directory being recursively included. The files immediately in the `\/parent`, `\/parent\/child`, and root directories are also included. See the directory structure in the following diagram: \n![Sparse checkout: Specify parent-grandchild-child folder cone pattern.](https:\/\/docs.databricks.com\/_images\/set-sparse-parent-grand-child.png) \nYou can add multiple patterns separated by line breaks. \nNote \nExclusion behaviors (`!`) are not supported in Git cone pattern syntax. \n### Modify sparse checkout settings \nOnce a repo is created, the sparse checkout cone pattern can be edited from **Settings > Advanced > Cone patterns**. \nNote the following behavior: \n* Removing a folder from the cone pattern removes it from Databricks if there are no uncommitted changes.\n* Adding a folder via editing the sparse checkout cone pattern adds it to Databricks without requiring an additional pull.\n* Sparse checkout patterns cannot be changed to remove a folder when there are uncommitted changes in that folder. \nFor example, a user edits a file in a folder and does not commit changes. She then tries to change the sparse checkout pattern to not include this folder. In this case, the pattern is accepted, but the actual folder is not deleted. She needs to revert the pattern to include that folder, commit changes, and then reapply the new pattern. \nNote \nYou can\u2019t disable sparse checkout for a repo that was created with Sparse Checkout mode enabled. \n### Make and push changes with sparse checkout \nYou can edit existing files and commit and push them from the Git folder. 
When creating new folders of files, include them in the cone pattern you specified for that repo. \nIncluding a new folder outside of the cone pattern results in an error during the commit and push operation. To fix it, edit the cone pattern to include the new folder you are trying to commit and push. \n### Patterns for a repo config file \nThe commit outputs config file uses patterns similar to [gitignore patterns](https:\/\/git-scm.com\/docs\/gitignore) and does the following: \n* Positive patterns enable output inclusion for matching notebooks.\n* Negative patterns disable output inclusion for matching notebooks.\n* Patterns are evaluated in order for all notebooks.\n* Invalid paths or paths not resolving to `.ipynb` notebooks are ignored. \n**Positive pattern:** To include outputs from a notebook path `folder\/innerfolder\/notebook.ipynb`, use the following patterns: \n```\n**\/*\nfolder\/**\nfolder\/innerfolder\/note*\n\n``` \n**Negative pattern:** To exclude outputs for a notebook, check that none of the positive patterns match, or add a negative pattern in the correct position in the configuration file. 
Negative (exclude) patterns start with `!`: \n```\n!folder\/innerfolder\/*.ipynb\n!folder\/**\/*.ipynb\n!**\/notebook.ipynb\n\n``` \n### Sparse checkout limitation \nSparse checkout currently does not work for Azure DevOps repos larger than 4GB in size.\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html"} +{"content":"# Databricks data engineering\n## Git integration with Databricks Git folders\n#### Run Git operations on Databricks Git folders (Repos)\n##### Add a repo and connect remotely later\n\nTo manage and work with Git folders programmatically, use the [Git folders REST API](https:\/\/docs.databricks.com\/api\/workspace\/gitcredentials).\n\n","doc_uri":"https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### APPLY CHANGES API: Simplify change data capture in Delta Live Tables\n\nDelta Live Tables simplifies change data capture (CDC) with the `APPLY CHANGES` API. Previously, the `MERGE INTO` statement was commonly used for processing CDC records on Databricks. However, `MERGE INTO` can produce incorrect results because of out-of-sequence records, or require complex logic to re-order records. \nBy automatically handling out-of-sequence records, the `APPLY CHANGES` API in Delta Live Tables ensures correct processing of CDC records and removes the need to develop complex logic for handling out-of-sequence records. \nThe `APPLY CHANGES` API is supported in the Delta Live Tables SQL and Python interfaces, including support for updating tables with SCD type 1 and type 2: \n* Use SCD type 1 to update records directly. History is not retained for records that are updated.\n* Use SCD type 2 to retain a history of records, either on all updates or on updates to a specified set of columns. 
\nFor syntax and other references, see: \n* [Change data capture with Python in Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html#cdc)\n* [Change data capture with SQL in Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html#cdc)\n* [Control tombstone management for SCD type 1 queries](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#cdc) \nNote \nThis article describes how to update tables in your Delta Live Tables pipeline based on changes in source data. To learn how to record and query row-level change information for Delta tables, see [Use Delta Lake change data feed on Databricks](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### APPLY CHANGES API: Simplify change data capture in Delta Live Tables\n###### How is CDC implemented with Delta Live Tables?\n\nYou must specify a column in the source data on which to sequence records, which Delta Live Tables interprets as a monotonically increasing representation of the proper ordering of the source data. Delta Live Tables automatically handles data that arrives out of order. For SCD Type 2 changes, Delta Live Tables propagates the appropriate sequencing values to the `__START_AT` and `__END_AT` columns of the target table. There should be one distinct update per key at each sequencing value, and NULL sequencing values are unsupported. \nTo perform CDC processing with Delta Live Tables, you first create a streaming table, and then use an `APPLY CHANGES INTO` statement to specify the source, keys, and sequencing for the change feed. To create the target streaming table, use the `CREATE OR REFRESH STREAMING TABLE` statement in SQL or the `create_streaming_table()` function in Python. 
To create the statement defining the CDC processing, use the `APPLY CHANGES` statement in SQL or the `apply_changes()` function in Python. For syntax details, see [Change data capture with SQL in Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html#cdc) or [Change data capture with Python in Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html#cdc).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### APPLY CHANGES API: Simplify change data capture in Delta Live Tables\n###### What data objects are used for Delta Live Tables CDC processing?\n\nWhen you declare the target table in the Hive metastore, two data structures are created: \n* A view using the name assigned to the target table.\n* An internal backing table used by Delta Live Tables to manage CDC processing. This table is named by prepending `__apply_changes_storage_` to the target table name. \nFor example, if you declare a target table named `dlt_cdc_target`, you will see a view named `dlt_cdc_target` and a table named `__apply_changes_storage_dlt_cdc_target` in the metastore. Creating a view allows Delta Live Tables to filter out the extra information (for example, tombstones and versions) required to handle out-of-order data. To view the processed data, query the target view. Because the schema of the `__apply_changes_storage_` table might change to support future features or enhancements, you should not query the table for production use. If you add data manually to the table, the records are assumed to come before other changes because the version columns are missing. 
\nIf a pipeline publishes to Unity Catalog, the internal backing tables are not accessible to users.\n\n##### APPLY CHANGES API: Simplify change data capture in Delta Live Tables\n###### Get data about records processed by a Delta Live Tables CDC query\n\nThe following metrics are captured by `apply changes` queries: \n* **`num_upserted_rows`**: The number of output rows upserted into the dataset during an update.\n* **`num_deleted_rows`**: The number of existing output rows deleted from the dataset during an update. \nThe `num_output_rows` metric, which is output for non-CDC flows, is not captured for `apply changes` queries.\n\n##### APPLY CHANGES API: Simplify change data capture in Delta Live Tables\n###### Limitations\n\nThe target of the `APPLY CHANGES INTO` query or `apply_changes` function cannot be used as a source for a streaming table. A table that reads from the target of an `APPLY CHANGES INTO` query or `apply_changes` function must be a materialized view.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### APPLY CHANGES API: Simplify change data capture in Delta Live Tables\n###### SCD type 1 and SCD type 2 on Databricks\n\nThe following sections provide examples that demonstrate Delta Live Tables SCD type 1 and type 2 queries that update target tables based on source events that: \n1. Create new user records.\n2. Delete a user record.\n3. Update user records. In the SCD type 1 example, the last `UPDATE` operations arrive late and are dropped from the target table, demonstrating the handling of out-of-order events. \nThe following examples assume familiarity with configuring and updating Delta Live Tables pipelines. See [Tutorial: Run your first Delta Live Tables pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html). 
\nTo run these examples, you must begin by creating a sample dataset. See [Generate test data](https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html#generate-data). \nThe following are the input records for these examples: \n| userId | name | city | operation | sequenceNum |\n| --- | --- | --- | --- | --- |\n| 124 | Raul | Oaxaca | INSERT | 1 |\n| 123 | Isabel | Monterrey | INSERT | 1 |\n| 125 | Mercedes | Tijuana | INSERT | 2 |\n| 126 | Lily | Cancun | INSERT | 2 |\n| 123 | null | null | DELETE | 6 |\n| 125 | Mercedes | Guadalajara | UPDATE | 6 |\n| 125 | Mercedes | Mexicali | UPDATE | 5 |\n| 123 | Isabel | Chihuahua | UPDATE | 5 | \nIf you uncomment the final row in the example data, it will insert the following record that specifies where records should be truncated: \n| userId | name | city | operation | sequenceNum |\n| --- | --- | --- | --- | --- |\n| null | null | null | TRUNCATE | 3 | \nNote \nAll the following examples include options to specify both `DELETE` and `TRUNCATE` operations, but each is optional.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### APPLY CHANGES API: Simplify change data capture in Delta Live Tables\n###### Process SCD type 1 updates\n\nThe following code example demonstrates processing SCD type 1 updates: \n```\nimport dlt\nfrom pyspark.sql.functions import col, expr\n\n@dlt.view\ndef users():\n    return spark.readStream.format(\"delta\").table(\"cdc_data.users\")\n\ndlt.create_streaming_table(\"target\")\n\ndlt.apply_changes(\n    target = \"target\",\n    source = \"users\",\n    keys = [\"userId\"],\n    sequence_by = col(\"sequenceNum\"),\n    apply_as_deletes = expr(\"operation = 'DELETE'\"),\n    apply_as_truncates = expr(\"operation = 'TRUNCATE'\"),\n    except_column_list = [\"operation\", \"sequenceNum\"],\n    stored_as_scd_type = 1\n)\n\n``` \n```\n-- Create and populate the target 
table.\nCREATE OR REFRESH STREAMING TABLE target;\n\nAPPLY CHANGES INTO\nlive.target\nFROM\nstream(cdc_data.users)\nKEYS\n(userId)\nAPPLY AS DELETE WHEN\noperation = \"DELETE\"\nAPPLY AS TRUNCATE WHEN\noperation = \"TRUNCATE\"\nSEQUENCE BY\nsequenceNum\nCOLUMNS * EXCEPT\n(operation, sequenceNum)\nSTORED AS\nSCD TYPE 1;\n\n``` \nAfter running the SCD type 1 example, the target table contains the following records: \n| userId | name | city |\n| --- | --- | --- |\n| 124 | Raul | Oaxaca |\n| 125 | Mercedes | Guadalajara |\n| 126 | Lily | Cancun | \nAfter running the SCD type 1 example with the additional `TRUNCATE` record, records `124` and `126` are truncated because of the `TRUNCATE` operation at `sequenceNum=3`, and the target table contains the following record: \n| userId | name | city |\n| --- | --- | --- |\n| 125 | Mercedes | Guadalajara |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### APPLY CHANGES API: Simplify change data capture in Delta Live Tables\n###### Process SCD type 2 updates\n\nThe following code example demonstrates processing SCD type 2 updates: \n```\nimport dlt\nfrom pyspark.sql.functions import col, expr\n\n@dlt.view\ndef users():\n    return spark.readStream.format(\"delta\").table(\"cdc_data.users\")\n\ndlt.create_streaming_table(\"target\")\n\ndlt.apply_changes(\n    target = \"target\",\n    source = \"users\",\n    keys = [\"userId\"],\n    sequence_by = col(\"sequenceNum\"),\n    apply_as_deletes = expr(\"operation = 'DELETE'\"),\n    except_column_list = [\"operation\", \"sequenceNum\"],\n    stored_as_scd_type = \"2\"\n)\n\n``` \n```\n-- Create and populate the target table.\nCREATE OR REFRESH STREAMING TABLE target;\n\nAPPLY CHANGES INTO\nlive.target\nFROM\nstream(cdc_data.users)\nKEYS\n(userId)\nAPPLY AS DELETE WHEN\noperation = \"DELETE\"\nSEQUENCE BY\nsequenceNum\nCOLUMNS * EXCEPT\n(operation, 
sequenceNum)\nSTORED AS\nSCD TYPE 2;\n\n``` \nAfter running the SCD type 2 example, the target table contains the following records: \n| userId | name | city | \\_\\_START\\_AT | \\_\\_END\\_AT |\n| --- | --- | --- | --- | --- |\n| 123 | Isabel | Monterrey | 1 | 5 |\n| 123 | Isabel | Chihuahua | 5 | 6 |\n| 124 | Raul | Oaxaca | 1 | null |\n| 125 | Mercedes | Tijuana | 2 | 5 |\n| 125 | Mercedes | Mexicali | 5 | 6 |\n| 125 | Mercedes | Guadalajara | 6 | null |\n| 126 | Lily | Cancun | 2 | null | \nAn SCD type 2 query can also specify a subset of output columns to be tracked for history in the target table. Changes to other columns are updated in place rather than generating new history records. The following example uses track history with SCD type 2 and excludes the `city` column from tracking: \n```\nimport dlt\nfrom pyspark.sql.functions import col, expr\n\n@dlt.view\ndef users():\n    return spark.readStream.format(\"delta\").table(\"cdc_data.users\")\n\ndlt.create_streaming_table(\"target\")\n\ndlt.apply_changes(\n    target = \"target\",\n    source = \"users\",\n    keys = [\"userId\"],\n    sequence_by = col(\"sequenceNum\"),\n    apply_as_deletes = expr(\"operation = 'DELETE'\"),\n    except_column_list = [\"operation\", \"sequenceNum\"],\n    stored_as_scd_type = \"2\",\n    track_history_except_column_list = [\"city\"]\n)\n\n``` \n```\n-- Create and populate the target table.\nCREATE OR REFRESH STREAMING TABLE target;\n\nAPPLY CHANGES INTO\nlive.target\nFROM\nstream(cdc_data.users)\nKEYS\n(userId)\nAPPLY AS DELETE WHEN\noperation = \"DELETE\"\nSEQUENCE BY\nsequenceNum\nCOLUMNS * EXCEPT\n(operation, sequenceNum)\nSTORED AS\nSCD TYPE 2\nTRACK HISTORY ON * EXCEPT\n(city);\n\n``` \nAfter running this example without the additional `TRUNCATE` record, the target table contains the following records: \n| userId | name | city | \\_\\_START\\_AT | \\_\\_END\\_AT |\n| --- | --- | --- | --- | --- |\n| 123 | Isabel | Chihuahua | 1 | 6 |\n| 124 | Raul | 
Oaxaca | 1 | null |\n| 125 | Mercedes | Guadalajara | 2 | null |\n| 126 | Lily | Cancun | 2 | null |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### APPLY CHANGES API: Simplify change data capture in Delta Live Tables\n###### Generate test data\n\nThe code below is provided to generate an example dataset for use in the example queries present in this tutorial. Assuming that you have the proper credentials to create a new schema and create a new table, you can execute these statements with either a notebook or Databricks SQL. The following code is **not** intended to be run as part of a Delta Live Tables pipeline: \n```\nCREATE SCHEMA IF NOT EXISTS cdc_data;\n\nCREATE TABLE\ncdc_data.users\nAS SELECT\ncol1 AS userId,\ncol2 AS name,\ncol3 AS city,\ncol4 AS operation,\ncol5 AS sequenceNum\nFROM (\nVALUES\n-- Initial load.\n(124, \"Raul\", \"Oaxaca\", \"INSERT\", 1),\n(123, \"Isabel\", \"Monterrey\", \"INSERT\", 1),\n-- New users.\n(125, \"Mercedes\", \"Tijuana\", \"INSERT\", 2),\n(126, \"Lily\", \"Cancun\", \"INSERT\", 2),\n-- Isabel is removed from the system and Mercedes moved to Guadalajara.\n(123, null, null, \"DELETE\", 6),\n(125, \"Mercedes\", \"Guadalajara\", \"UPDATE\", 6),\n-- This batch of updates arrived out of order. 
The above batch at sequenceNum 6 will be the final state.\n(125, \"Mercedes\", \"Mexicali\", \"UPDATE\", 5),\n(123, \"Isabel\", \"Chihuahua\", \"UPDATE\", 5)\n-- Uncomment to test TRUNCATE.\n-- ,(null, null, null, \"TRUNCATE\", 3)\n);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n##### APPLY CHANGES API: Simplify change data capture in Delta Live Tables\n###### Add, change, or delete data in a target streaming table\n\nIf your pipeline publishes tables to Unity Catalog, you can use [data manipulation language](https:\/\/docs.databricks.com\/sql\/language-manual\/index.html#dml-statements) (DML) statements, including insert, update, delete, and merge statements, to modify the target streaming tables created by `APPLY CHANGES INTO` statements. \nNote \n* DML statements that modify the table schema of a streaming table are not supported. Ensure that your DML statements do not attempt to evolve the table schema.\n* DML statements that update a streaming table can be run only in a shared Unity Catalog cluster or a SQL warehouse using Databricks Runtime 13.3 LTS and above.\n* Because streaming requires append-only data sources, if your processing requires streaming from a source streaming table with changes (for example, by DML statements), set the [skipChangeCommits flag](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html#ignore-changes) when reading the source streaming table. When `skipChangeCommits` is set, transactions that delete or modify records on the source table are ignored. If your processing does not require a streaming table, you can use a materialized view (which does not have the append-only restriction) as the target table. 
\nBecause Delta Live Tables uses a specified `SEQUENCE BY` column and propagates appropriate sequencing values to the `__START_AT` and `__END_AT` columns of the target table (for SCD type 2), you must ensure that DML statements use valid values for these columns to maintain the proper ordering of records. See [How is CDC implemented with Delta Live Tables?](https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html#how-is-cdc-implemented). \nFor more information about using DML statements with streaming tables, see [Add, change, or delete data in a streaming table](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html#streaming-tables-dml-statements). \nThe following example inserts an active record with a start sequence of 5: \n```\nINSERT INTO my_streaming_table (id, name, __START_AT, __END_AT) VALUES (123, 'John Doe', 5, NULL);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Use custom metrics with Databricks Lakehouse Monitoring\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis page describes how to create a custom metric in Databricks Lakehouse Monitoring. In addition to the analysis and drift statistics that are automatically calculated, you can create custom metrics. For example, you might want to track a weighted mean that captures some aspect of business logic or use a custom model quality score. You can also create custom drift metrics that track changes to the values in the primary table (compared to the baseline or the previous time window). 
\nFor more details on how to use the `databricks.lakehouse_monitoring.Metric` API, see the [API reference](https:\/\/api-docs.databricks.com\/python\/lakehouse-monitoring\/latest\/index.html).\n\n### Use custom metrics with Databricks Lakehouse Monitoring\n#### Types of custom metrics\n\nDatabricks Lakehouse Monitoring includes the following types of custom metrics: \n* Aggregate metrics, which are calculated based on columns in the primary table. Aggregate metrics are stored in the profile metrics table.\n* Derived metrics, which are calculated based on previously computed aggregate metrics and do not directly use data from the primary table. Derived metrics are stored in the profile metrics table.\n* Drift metrics, which compare previously computed aggregate or derived metrics from two different time windows, or between the primary table and the baseline table. Drift metrics are stored in the drift metrics table. \nUsing derived and drift metrics where possible minimizes recomputation over the full primary table. Only aggregate metrics access data from the primary table. Derived and drift metrics can then be computed directly from the aggregate metric values.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/custom-metrics.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Use custom metrics with Databricks Lakehouse Monitoring\n#### Custom metrics parameters\n\nTo define a custom metric, you create a [Jinja template](https:\/\/jinja.palletsprojects.com\/en\/3.0.x\/templates\/#variables) for a SQL column expression. The tables in this section describe the parameters that define the metric, and the parameters that are used in the Jinja template. \n| Parameter | Description |\n| --- | --- |\n| `type` | One of `aggregate`, `derived`, or `drift`. |\n| `name` | Column name for the custom metric in metric tables. |\n| `input_columns` | List of column names in the input table the metric should be computed for. 
To indicate that more than one column is used in the calculation, use `:table`. See the examples in this article. |\n| `definition` | Jinja template for a SQL expression that specifies how to compute the metric. See [Create metric\\_definition](https:\/\/docs.databricks.com\/lakehouse-monitoring\/custom-metrics.html#custom-metric). |\n| `output_data_type` | Spark data type of the metric output. | \n### Create `definition` \nThe `definition` parameter must be a single string expression in the form of a Jinja template. It cannot contain joins or subqueries. To construct complex definitions, you can use Python helper functions. \nThe following table lists the parameters you can use to create a SQL Jinja template to specify how to calculate the metric. \n| Parameter | Description |\n| --- | --- |\n| `{{input_column}}` | Column used to compute the custom metric. |\n| `{{prediction_col}}` | Column holding ML model predictions. Used with `InferenceLog` analysis. |\n| `{{label_col}}` | Column holding ML model ground truth labels. Used with `InferenceLog` analysis. |\n| `{{current_df}}` | For drift metrics. Data from the current time window. |\n| `{{base_df}}` | For drift metrics. Data from the comparison window, which is either the baseline table or the previous time window. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/custom-metrics.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Use custom metrics with Databricks Lakehouse Monitoring\n#### Aggregate metric example\n\nThe following example computes the average of the square of the values in a column, and is applied to columns `f1` and `f2`. The output is saved as a new column in the profile metrics table and is shown in the analysis rows corresponding to the columns `f1` and `f2`. The applicable column names are substituted for the Jinja parameter `{{input_column}}`. 
\n```\nfrom databricks import lakehouse_monitoring as lm\nfrom pyspark.sql import types as T\n\nlm.Metric(\ntype=\"aggregate\",\nname=\"squared_avg\",\ninput_columns=[\"f1\", \"f2\"],\ndefinition=\"avg(`{{input_column}}`*`{{input_column}}`)\",\noutput_data_type=T.DoubleType()\n)\n\n``` \nThe following code defines a custom metric that computes the average of the difference between columns `f1` and `f2`. This example shows the use of `[\":table\"]` in the `input_columns` parameter to indicate that more than one column from the table is used in the calculation. \n```\nfrom databricks import lakehouse_monitoring as lm\nfrom pyspark.sql import types as T\n\nlm.Metric(\ntype=\"aggregate\",\nname=\"avg_diff_f1_f2\",\ninput_columns=[\":table\"],\ndefinition=\"avg(f1 - f2)\",\noutput_data_type=T.DoubleType())\n\n``` \nThis example computes a weighted model quality score. For observations where the `critical` column is `True`, a heavier penalty is assigned when the predicted value for that row does not match the ground truth. Because it\u2019s defined on the raw columns (`prediction` and `label`), it\u2019s defined as an aggregate metric. The `:table` column indicates that this metric is calculated from multiple columns. The Jinja parameters `{{prediction_col}}` and `{{label_col}}` are replaced with the name of the prediction and ground truth label columns for the monitor. 
\n```\nfrom databricks import lakehouse_monitoring as lm\nfrom pyspark.sql import types as T\n\nlm.Metric(\ntype=\"aggregate\",\nname=\"weighted_error\",\ninput_columns=[\":table\"],\ndefinition=\"\"\"avg(CASE\nWHEN {{prediction_col}} = {{label_col}} THEN 0\nWHEN {{prediction_col}} != {{label_col}} AND critical=TRUE THEN 2\nELSE 1 END)\"\"\",\noutput_data_type=T.DoubleType()\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/custom-metrics.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Use custom metrics with Databricks Lakehouse Monitoring\n#### Derived metric example\n\nThe following code defines a custom metric that computes the square root of the `squared_avg` metric defined earlier in this section. Because this is a derived metric, it does not reference the primary table data and instead is defined in terms of the `squared_avg` aggregate metric. The output is saved as a new column in the profile metrics table. \n```\nfrom databricks import lakehouse_monitoring as lm\nfrom pyspark.sql import types as T\n\nlm.Metric(\ntype=\"derived\",\nname=\"root_mean_square\",\ninput_columns=[\"f1\", \"f2\"],\ndefinition=\"sqrt(squared_avg)\",\noutput_data_type=T.DoubleType())\n\n```\n\n### Use custom metrics with Databricks Lakehouse Monitoring\n#### Drift metrics example\n\nThe following code defines a drift metric that tracks the change in the `weighted_error` metric defined earlier in this section. The `{{current_df}}` and `{{base_df}}` parameters allow the metric to reference the `weighted_error` values from the current window and the comparison window. The comparison window can be either the baseline data or the data from the previous time window. Drift metrics are saved in the drift metrics table. 
\n```\nfrom databricks import lakehouse_monitoring as lm\nfrom pyspark.sql import types as T\n\nlm.Metric(\ntype=\"drift\",\nname=\"error_rate_delta\",\ninput_columns=[\":table\"],\ndefinition=\"{{current_df}}.weighted_error - {{base_df}}.weighted_error\",\noutput_data_type=T.DoubleType()\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/custom-metrics.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Gaps between Spark jobs\n\nSo you see gaps in your jobs timeline like the following: \n![Job Gaps](https:\/\/docs.databricks.com\/_images\/job-gaps.png) \nThere are a few reasons this could be happening. If the gaps make up a high proportion of the time spent on your workload, you need to figure out what is causing these gaps and whether they are expected. There are a few things that could be happening during the gaps: \n* There\u2019s no work to do\n* Driver is compiling a complex execution plan\n* Execution of non-Spark code\n* Driver is overloaded\n* Cluster is malfunctioning\n\n##### Gaps between Spark jobs\n###### No work\n\nOn [all-purpose compute](https:\/\/docs.databricks.com\/en\/compute\/index.html#types-of-compute), having no work to do is the most likely explanation for the gaps. Because the cluster is running and users are submitting queries, gaps are expected. These gaps are the time between query submissions.\n\n##### Gaps between Spark jobs\n###### Complex execution plan\n\nFor example, if you use `withColumn()` in a loop, it creates a very expensive plan to process. The gaps could be the time the driver is spending simply building and processing the plan. If this is the case, try simplifying the code. Use `selectExpr()` to combine multiple `withColumn()` calls into one expression, or convert the code into SQL. 
You can still embed the SQL in your Python code, using Python to manipulate the query with string functions. This often fixes this type of problem.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/spark-job-gaps.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Gaps between Spark jobs\n###### Execution of non-Spark code\n\nSpark code is either written in SQL or using a Spark API like [PySpark](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/index.html). Any execution of code that is not Spark will show up in the timeline as gaps. For example, you could have a loop in Python which calls native Python functions. This code is not executing in Spark and it can show up as a gap in the timeline. If you\u2019re not sure if your code is running Spark, try running it interactively in a notebook. If the code is using Spark, you will see Spark jobs under the cell: \n![Spark Execution](https:\/\/docs.databricks.com\/_images\/spark-execution.png) \nYou can also expand the **Spark Jobs** drop-down under the cell to see if the jobs are actively executing (in case Spark is now idle). If you\u2019re not using Spark you won\u2019t see the **Spark Jobs** under the cell, or you will see that none are active. If you can\u2019t run the code interactively, you can try logging in your code and see if you can match the gaps up with sections of your code by time stamp, but that can be tricky. \nIf you see gaps in your timeline caused by running non-Spark code, this means your workers are all idle and likely wasting money during the gaps. Maybe this is intentional and unavoidable, but if you can write this code to use Spark you will fully utilize the cluster. 
Start with [this tutorial](https:\/\/docs.databricks.com\/getting-started\/quick-start.html) to learn how to work with Spark.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/spark-job-gaps.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Gaps between Spark jobs\n###### Driver is overloaded\n\nTo determine if your driver is overloaded, you need to look at the cluster metrics. \nIf your cluster is on DBR 13.0 or later, click **Metrics** as highlighted in this screenshot: \n![New Cluster Metrics](https:\/\/docs.databricks.com\/_images\/new-cluster-metrics.png) \nNotice the **Server load distribution visualization**. You should look to see if the driver is heavily loaded. This visualization has a block of color for each machine in the cluster. Red means heavily loaded, and blue means not loaded at all. \nThe previous screenshot shows a basically idle cluster. If the driver is overloaded, it would look something like this: \n![New Metrics, Busy Driver](https:\/\/docs.databricks.com\/_images\/new-cluster-metrics-heavy.jpeg) \nWe can see that one square is red, while the others are blue. Roll your mouse over the red square to make sure the red block represents your driver. \nTo fix an overloaded driver, see [Spark driver overloaded](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/spark-driver-overloaded.html). \n### View distribution with legacy Ganglia metrics \nIf your cluster is on DBR 12.x or earlier, click on **Metrics**, and **Ganglia UI** as highlighted in this screenshot: \n![Open Ganglia](https:\/\/docs.databricks.com\/_images\/get-to-old-cluster-metrics.png) \nIf the cluster is no longer running, you can open one of the historical snapshots. 
Look at the **Server Load Distribution** visualization, which is highlighted here in red: \n![Server Load Distribution in Ganglia](https:\/\/docs.databricks.com\/_images\/old-cluster-metrics.png) \nYou should look to see if the driver is heavily loaded. This visualization has a block of color for each machine in the cluster. Red means heavily loaded, and blue means not loaded at all. The above distribution shows a basically idle cluster. If the driver is overloaded, it would look something like this: \n![Overloaded Driver in Ganglia](https:\/\/docs.databricks.com\/_images\/old-cluster-metrics-heavy.png) \nWe can see that one square is red, while the others are blue. Be careful if you only have one worker. You need to make sure the red block is your driver and not your worker. \nTo fix an overloaded driver, see [Spark driver overloaded](https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/spark-driver-overloaded.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/spark-job-gaps.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Gaps between Spark jobs\n###### Cluster is malfunctioning\n\nMalfunctioning clusters are rare, but if this is the case it can be difficult to determine what happened. You may just want to restart the cluster to see if this resolves the issue. You can also look into the logs to see if there\u2019s anything suspicious. The **Event log** tab and **Driver logs** tabs, highlighted in the screenshot below, will be the places to look: \n![Getting Driver Logs](https:\/\/docs.databricks.com\/_images\/cluster-logs.png) \nYou may want to enable [Cluster log delivery](https:\/\/docs.databricks.com\/compute\/configure.html#cluster-log-delivery) in order to access the logs of the workers. 
You can also change the log level, but you might need to reach out to your Databricks account team for help.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/spark-job-gaps.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n### Set up query federation for non-Unity-Catalog workspaces\n##### Query federation for Snowflake in Databricks SQL (Experimental)\n\nExperimental \nThe configurations described in this article are [Experimental](https:\/\/docs.databricks.com\/release-notes\/release-types.html). Experimental features are provided as-is and are not supported by Databricks through customer technical support. **To get full query federation support, you should instead use [Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html), which enables your Databricks users to take advantage of Unity Catalog syntax and data governance tools.** \nThis article describes how to configure read-only query federation to Snowflake on Serverless and Pro SQL warehouses. \nYou configure connections to Snowflake at the table level. You can use [secrets](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/secret.html) to store and access text credentials without displaying them in plaintext. 
See the following example: \n```\nDROP TABLE IF EXISTS snowflake_table;\nCREATE TABLE snowflake_table\nUSING snowflake\nOPTIONS (\ndbtable '<table-name>',\nsfUrl '<database-host-url>',\nsfUser secret('snowflake_creds', 'my_username'),\nsfPassword secret('snowflake_creds', 'my_password'),\nsfDatabase '<database-name>',\nsfSchema '<schema-name>',\nsfWarehouse '<warehouse-name>'\n);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/snowflake-no-uc.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Trigger jobs when new files arrive\n\nYou can use *file arrival triggers* to trigger a run of your Databricks job when new files arrive in an [external location](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html) such as Amazon S3, Azure storage, or Google Cloud Storage. You can use this feature when a scheduled job might be inefficient because new data arrives on an irregular schedule. \nFile arrival triggers make a best effort to check for new files every minute, although this can be affected by the performance of the underlying cloud storage. File arrival triggers do not incur additional costs other than cloud provider costs associated with listing files in the storage location. \nA file arrival trigger can be configured to monitor the root of a Unity Catalog external location or volume, or a subpath of an external location or volume. 
For example, for the Unity Catalog root volume `\/Volumes\/mycatalog\/myschema\/myvolume\/`, the following are valid paths for a file arrival trigger: \n```\n\/Volumes\/mycatalog\/myschema\/myvolume\/\n\/Volumes\/mycatalog\/myschema\/myvolume\/mydirectory\/\n\n```\n\n#### Trigger jobs when new files arrive\n##### Requirements\n\nThe following are required to use file arrival triggers: \n* The workspace must have [Unity Catalog enabled](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html).\n* You must use a storage location that\u2019s either a Unity Catalog volume or an external location added to the Unity Catalog metastore. See [Create an external location to connect cloud storage to Databricks](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html).\n* You must have `READ` permissions to the storage location and CAN MANAGE permissions on the job. For more information about job permissions, see [Job ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html#jobs).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/file-arrival-triggers.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Trigger jobs when new files arrive\n##### Limitations\n\n* A maximum of fifty jobs can be configured with a file arrival trigger in a Databricks workspace.\n* A storage location configured for a file arrival trigger can contain only up to 10,000 files. Locations with more files cannot be monitored for new file arrivals. If the configured storage location is a subpath of a Unity Catalog external location or volume, the 10,000 file limit applies to the subpath and not the root of the storage location. 
For example, the root of the storage location can contain more than 10,000 files across its subdirectories, but the configured subdirectory must not exceed the 10,000 file limit.\n* The path used for a file arrival trigger must not contain any external tables or managed locations of catalogs and schemas.\n\n#### Trigger jobs when new files arrive\n##### Add a file arrival trigger\n\nTo add a file arrival trigger to a job: \n1. In the sidebar, click **Workflows**.\n2. In the **Name** column on the **Jobs** tab, click the job name.\n3. In the **Job details** panel on the right, click **Add trigger**.\n4. In **Trigger type**, select **File arrival**.\n5. In **Storage location**, enter the URL of the root or a subpath of a Unity Catalog external location or the root or a subpath of a Unity Catalog volume to monitor.\n6. (Optional) Configure advanced options: \n* **Minimum time between triggers in seconds**: The minimum time to wait to trigger a run after a previous run completes. Files that arrive in this period trigger a run only after the waiting time expires. Use this setting to control the frequency of run creation.\n* **Wait after last change in seconds**: The time to wait to trigger a run after file arrival. Another file arrival in this period resets the timer. This setting can be used when files arrive in batches, and the whole batch needs to be processed after all files have arrived.\n7. To validate the configuration, click **Test connection**.\n8. Click **Save**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/file-arrival-triggers.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Trigger jobs when new files arrive\n##### Receive notifications of failed file arrival triggers\n\nTo be notified if a file arrival trigger fails to evaluate, configure email or system destination notifications on job failure. 
See [Add email and system notifications for job events](https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/file-arrival-triggers.html"} +{"content":"# AI and Machine Learning on Databricks\n### Tutorials: Get started with ML\n\nThe notebooks in this article are designed to get you started quickly with machine learning on Databricks. You can [import each notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html#import-notebook) to your Databricks workspace to run them. \nThese notebooks illustrate how to use Databricks throughout the machine learning lifecycle, including data loading and preparation; model training, tuning, and inference; and model deployment and management. They also demonstrate helpful tools such as [Hyperopt](https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/index.html#hyperopt-overview) for automated hyperparameter tuning, [MLflow tracking](https:\/\/docs.databricks.com\/mlflow\/tracking.html) and autologging for model development, and [Model Registry](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html) for model management.\n\n### Tutorials: Get started with ML\n#### scikit-learn notebooks\n\n| Notebook | Requirements | Features |\n| --- | --- | --- |\n| [Machine learning tutorial](https:\/\/docs.databricks.com\/machine-learning\/train-model\/scikit-learn.html#basic-example) | Databricks Runtime ML | Unity Catalog, classification model, MLflow, automated hyperparameter tuning with Hyperopt and MLflow |\n| [End-to-end example](https:\/\/docs.databricks.com\/mlflow\/end-to-end-example.html) | Databricks Runtime ML | Unity Catalog, classification model, MLflow, automated hyperparameter tuning with Hyperopt and MLflow, XGBoost |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ml-tutorials.html"} +{"content":"# AI and Machine Learning on Databricks\n### Tutorials: Get 
started with ML\n#### Apache Spark MLlib notebook\n\n| Notebook | Requirements | Features |\n| --- | --- | --- |\n| [Machine learning with MLlib](https:\/\/docs.databricks.com\/machine-learning\/train-model\/mllib.html) | Databricks Runtime ML | Logistic regression model, Spark pipeline, automated hyperparameter tuning using MLlib API |\n\n### Tutorials: Get started with ML\n#### Deep learning notebook\n\n| Notebook | Requirements | Features |\n| --- | --- | --- |\n| [Deep learning with TensorFlow Keras](https:\/\/docs.databricks.com\/machine-learning\/train-model\/tensorflow.html) | Databricks Runtime ML | Neural network model, inline TensorBoard, automated hyperparameter tuning with Hyperopt and MLflow, autologging, Model Registry |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/ml-tutorials.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Low shuffle merge on Databricks\n\nNote \nLow shuffle merge is [generally available (GA)](https:\/\/docs.databricks.com\/release-notes\/release-types.html) in Databricks Runtime 10.4 LTS and above and in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html) in Databricks Runtime 9.1 LTS. Databricks recommends that Preview customers migrate to Databricks Runtime 10.4 LTS or above. \nThe [MERGE](https:\/\/docs.databricks.com\/delta\/merge.html) command is used to perform simultaneous updates, insertions, and deletions from a Delta Lake table. Databricks has an optimized implementation of `MERGE` that improves performance substantially for common workloads by reducing the number of shuffle operations. \nDatabricks low shuffle merge provides better performance by processing unmodified rows in a separate, more streamlined processing mode, instead of processing them together with the modified rows. As a result, the amount of shuffled data is reduced significantly, leading to improved performance. 
Low shuffle merge also reduces the need for users to re-run the [OPTIMIZE ZORDER BY](https:\/\/docs.databricks.com\/delta\/data-skipping.html) command after performing a `MERGE` operation.\n\n#### Low shuffle merge on Databricks\n##### Optimized performance\n\nMany `MERGE` workloads only update a relatively small number of rows in a table. However, Delta tables can only be updated on a per-file basis. When the `MERGE` command needs to update or delete a small number of rows that are stored in a particular file, then it must also process and rewrite all remaining rows that are stored in the same file, even though these rows are unmodified. Low shuffle merge optimizes the processing of unmodified rows. Previously, they were processed in the same way as modified rows, passing them through multiple shuffle stages and expensive calculations. In low shuffle merge, the unmodified rows are instead processed without any shuffles, expensive processing, or other added overhead.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/low-shuffle-merge.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Low shuffle merge on Databricks\n##### Optimized data layout\n\nIn addition to being faster to run, low shuffle merge benefits subsequent operations as well. The earlier `MERGE` implementation caused the data layout of unmodified data to be changed entirely, resulting in lower performance on subsequent operations. Low shuffle merge tries to preserve the existing data layout of the unmodified records, including [Z-order optimization](https:\/\/docs.databricks.com\/delta\/data-skipping.html) on a best-effort basis. Hence, with low shuffle merge, the performance of operations on a Delta table will degrade more slowly after running one or more `MERGE` commands. \nNote \nLow shuffle merge tries to preserve the data layout on existing data that is not modified. 
The data layout of updated or newly inserted data may not be optimal, so it may still be necessary to run the `OPTIMIZE` or [OPTIMIZE ZORDER BY](https:\/\/docs.databricks.com\/delta\/data-skipping.html) commands.\n\n#### Low shuffle merge on Databricks\n##### Availability\n\nLow shuffle merge is enabled by default in Databricks Runtime 10.4 and above. In earlier supported Databricks Runtime versions it can be enabled by setting the configuration `spark.databricks.delta.merge.enableLowShuffle` to `true`. This flag has no effect in Databricks Runtime 10.4 and above.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/low-shuffle-merge.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### How to determine if Spark is rewriting data\n\nFirst open the SQL DAG for your write stage. Scroll up to the top of the job\u2019s page and click on the Associated SQL Query: \n![Stage to SQL](https:\/\/docs.databricks.com\/_images\/stage-to-sql.png) \nYou should now see the DAG. If not, scroll around a bit and you should see it: \n![SQL DAG](https:\/\/docs.databricks.com\/_images\/sql-dag.png) \nIf you\u2019re doing a Delete or Update operation, look at the amount of data being written by the writer versus what you expect. If you\u2019re seeing a lot more data being written than you expect, you\u2019re probably rewriting data: \n![Write Stats](https:\/\/docs.databricks.com\/_images\/write-stats.png) \nIf you\u2019re doing a merge, the merge node has explicit statistics about how much data it\u2019s rewriting.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/spark-rewriting-data.html"} +{"content":"# \n### Creating a `\ud83d\udd0d Retriever` version\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. 
\n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html)\n\n### Creating a `\ud83d\udd0d Retriever` version\n#### Conceptual overview\n\nThe `\ud83d\udd0d Retriever` is logic that retrieves relevant chunks from a Vector Index. Given the dependencies between processing logic and retrieval logic, a `\ud83d\udd0d Retriever` is associated with 1+ `\ud83d\uddc3\ufe0f Data Processor`s. A `\ud83d\udd0d Retriever` can be associated with (used by) any number of `\ud83d\udd17 Chain`s. \nA `\ud83d\udd0d Retriever` can be a simple call to a Vector Index or a more complex series of steps including a re-ranker. \nNote \nIn v2024-01-19, the `\ud83d\udd0d Retriever` provides only retriever configuration settings. In this release, you must include the code for your `\ud83d\udd0d Retriever` within your `\ud83d\udd17 Chain`\u2019s code. \nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** Support for managing the `\ud83d\udd0d Retriever` code separately from the `\ud83d\udd17 Chain`. \nNote \nIn v2024-01-19, in order to enable `\ud83d\udcdd Trace` logging, you must use a LangChain Retriever as part of a LangChain defined chain inside your `\ud83d\udd17 Chain`. \nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** Support for non-LangChain retrievers and integrations with other frameworks such as Llama-Index. \nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** Support for multiple `\ud83d\udd17 Chain` per RAG Application. In v2024-01-19, only one `\ud83d\udd17 Chain` can be created per RAG Application.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-retriever.html"} +{"content":"# \n### Creating a `\ud83d\udd0d Retriever` version\n#### Step-by-step instructions\n\n1. Open the `rag-config.yml` in your IDE\/code editor.\n2. Edit the `retrievers` configuration. 
\n```\nretrievers:\n- name: ann-retriever\ndescription: Basic ANN retriever\n# explicit link to the data processor that this retriever uses.\ndata_processors:\n- name: spark-docs-processor\n# these are key-value pairs that can be specified by the end user\nconfigurations:\nk: 5\nuse_mmr: false\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/tutorials\/7-rag-versions-retriever.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Run tasks conditionally in a Databricks job\n\nBy default, a job task runs when its dependencies have run and have all succeeded, but you can also configure tasks in a Databricks job to run only when specific conditions are met. Databricks Jobs supports the following methods to run tasks conditionally: \n* You can specify *Run if* dependencies to run a task based on the run status of the task\u2019s dependencies. For example, you can use `Run if` to run a task even when some or all of its dependencies have failed, allowing your job to recover from failures and continue running.\n* The *If\/else condition* task is used to run a part of a job DAG based on the results of a boolean expression. The `If\/else condition` task allows you to add branching logic to your job. For example, run transformation tasks only if the upstream ingestion task adds new data. Otherwise, run data processing tasks.\n\n#### Run tasks conditionally in a Databricks job\n##### Add the `Run if` condition of a task\n\nYou can configure a `Run if` condition when you [edit a task](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#task-configuration) with one or more dependencies. To add the condition to the task, select the condition from the **Run if dependencies** drop-down menu in the task configuration. The `Run if` condition is evaluated after completing all task dependencies. 
You can also add a `Run if` condition when you add a new task with one or more dependencies.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/conditional-tasks.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Run tasks conditionally in a Databricks job\n##### `Run if` condition options\n\nYou can add the following `Run if` conditions to a task: \n* **All succeeded**: All dependencies have run and succeeded. This is the default condition to run a task. The task is marked as `Upstream failed` if the condition is unmet.\n* **At least one succeeded**: At least one dependency has succeeded. The task is marked as `Upstream failed` if the condition is unmet.\n* **None failed**: None of the dependencies failed, and at least one dependency was run. The task is marked as `Upstream failed` if the condition is unmet.\n* **All done**: The task is run after all its dependencies have run, regardless of the status of the dependent runs. This condition allows you to define a task that is run without depending on the outcome of its dependent tasks.\n* **At least one failed**: At least one dependency failed. The task is marked as `Excluded` if the condition is unmet.\n* **All failed**: All dependencies have failed. The task is marked as `Excluded` if the condition is unmet. \nNote \n* Tasks configured to handle failures are marked as `Excluded` if their `Run if` condition is unmet. 
Excluded tasks are skipped and are treated as successful.\n* If all task dependencies are excluded, the task is also excluded, regardless of its `Run if` condition.\n* If you cancel a task run, the cancellation propagates through downstream tasks, and tasks with a `Run if` condition that handles failure are run, for example, to verify a cleanup task runs when a task run is canceled.\n\n#### Run tasks conditionally in a Databricks job\n##### How does Databricks Jobs determine job run status?\n\nDatabricks Jobs determines whether a job run was successful based on the outcome of the job\u2019s *leaf tasks*. A leaf task is a task that has no downstream dependencies. A job run can have one of three outcomes: \n* Succeeded: All tasks were successful.\n* Succeeded with failures: Some tasks failed, but all leaf tasks were successful.\n* Failed: One or more leaf tasks failed.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/conditional-tasks.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Run tasks conditionally in a Databricks job\n##### Add branching logic to your job with the `If\/else condition` task\n\nUse the `If\/else condition` task to run a part of a job DAG based on a boolean expression. The expression consists of a boolean operator and a pair of operands, where the operands might reference job or task state using [job and task parameter variables](https:\/\/docs.databricks.com\/workflows\/jobs\/parameter-value-references.html) or use [task values](https:\/\/docs.databricks.com\/workflows\/jobs\/share-task-context.html). \nNote \n* Numeric and non-numeric values are handled differently depending on the boolean operator: \n+ The `==` and `!=` operators perform string comparison of their operands. For example, `12.0 == 12` evaluates to false.\n+ The `>`, `>=`, `<`, and `<=` operators perform numeric comparisons of their operands. 
For example, `12.0 >= 12` evaluates to true, and `10.0 >= 12` evaluates to false.\n+ Only numeric, string, and boolean values are allowed when referencing [task values](https:\/\/docs.databricks.com\/workflows\/jobs\/share-task-context.html) in an operand. Any other types will cause the condition expression to fail. Non-numeric value types are serialized to strings and are treated as strings in `If\/else condition` expressions. For example, if a task value is set to a boolean value, it is serialized to `\"true\"` or `\"false\"`. \nYou can add an `If\/else condition` task when you [create a job](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html#job-create) or [edit a task](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#task-configuration) in an existing job. To configure an `If\/else condition` task: \n1. In the **Type** drop-down menu, select `If\/else condition`.\n2. In the first **Condition** text box, enter the operand to be evaluated. The operand can reference a job or task parameter variable or a task value.\n3. Select a boolean operator from the drop-down menu.\n4. In the second **Condition** text box, enter the value for evaluating the condition. \nTo configure dependencies on an `If\/else condition` task: \n1. Select the `If\/else condition` task in the DAG view and click **+ Add task**.\n2. After entering details for the task, click **Depends on** and select `<task-name> (true)` where `<task-name>` is the name of the `If\/else condition` task.\n3. Repeat for the condition evaluating to `false`. \nFor example, suppose you have a task named `process_records` that maintains a count of records that are not valid in a value named `bad_records`, and you want to branch processing based on whether records that are not valid are found. To add this logic to your workflow, you can create an `If\/else condition` task with an expression like `{{tasks.process_records.values.bad_records}} > 0`. 
You can then add dependent tasks based on the results of the condition. \nAfter the run of a job containing an `If\/else condition` task completes, you can view the result of the expression and details of the expression evaluation when you view the [job run details](https:\/\/docs.databricks.com\/workflows\/jobs\/monitor-job-runs.html#job-run-details) in the UI.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/conditional-tasks.html"} +{"content":"# What is Delta Lake?\n### Schema enforcement\n\nDatabricks validates data quality by enforcing schema on write. \nNote \nThis article describes the default behavior for tables on Databricks, which are backed by Delta Lake. Schema enforcement does not apply to tables backed by external data.\n\n### Schema enforcement\n#### Schema enforcement for insert operations\n\nDatabricks enforces the following rules when inserting data into a table: \n* All inserted columns must exist in the target table.\n* All column data types must match the column data types in the target table. \nNote \nDatabricks attempts to safely cast column data types to match the target table.\n\n### Schema enforcement\n#### Schema validation during `MERGE` operations\n\nDatabricks enforces the following rules when inserting or updating data as part of a `MERGE` operation: \n* If the data type in the source statement does not match the target column, `MERGE` tries to safely cast column data types to match the target table.\n* The columns that are the target of an `UPDATE` or `INSERT` action must exist in the target table.\n* When using `INSERT *` or `UPDATE SET *` syntax: \n+ Columns in the source dataset not present in the target table are ignored.\n+ The source dataset must have all the columns present in the target table.\n\n### Schema enforcement\n#### Modify a table schema\n\nYou can update the schema of a table using either explicit `ALTER TABLE` statements or automatic schema evolution. 
See [Update Delta Lake table schema](https:\/\/docs.databricks.com\/delta\/update-schema.html). \nSchema evolution has special semantics for `MERGE` operations. See [Automatic schema evolution for Delta Lake merge](https:\/\/docs.databricks.com\/delta\/update-schema.html#merge-schema-evolution).\n\n","doc_uri":"https:\/\/docs.databricks.com\/tables\/schema-enforcement.html"} +{"content":"# What is data warehousing on Databricks?\n## Write queries and explore data in the SQL Editor\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/schedule-query.html"} +{"content":"# What is data warehousing on Databricks?\n## Write queries and explore data in the SQL Editor\n#### Schedule a query\n\nYou can use scheduled query executions to update your dashboards or enable routine alerts. By default, your queries do not have a schedule. \nNote \nIf an alert uses your query, the alert runs on its own refresh schedule and does not use the query schedule. \nTo set the schedule: \n1. In the Query Editor, click **Schedule**>**Add schedule** to open a menu with schedule settings. \n![Schedule interval](https:\/\/docs.databricks.com\/_images\/schedule-modal.png)\n2. Choose when to run the query. \n* Use the dropdown pickers to specify the frequency, period, starting time, and time zone. Optionally, select the **Show cron syntax** checkbox to edit the schedule in [Quartz Cron Syntax](http:\/\/www.quartz-scheduler.org\/documentation\/quartz-2.3.0\/tutorials\/crontrigger.html).\n* Choose **More options** to show optional settings. You can also choose: \n+ A name for the schedule.\n+ A SQL warehouse to power the query. By default, the SQL warehouse used for ad hoc query execution is also used for a scheduled job. Use this optional setting to select a different warehouse to run the scheduled query.\n3. Click **Create**.\nYour query will run automatically according to the schedule. 
If you experience a scheduled query not executing according to its schedule, you should manually trigger the query to make sure it doesn\u2019t fail. \nIf a query execution fails during a scheduled run, Databricks retries with a back-off algorithm. This means that retries happen less frequently as failures persist. With persistent failures, the next retry might exceed the scheduled interval. \nAfter you create a schedule, the label on the **Schedule** button reads **Schedule(#)**, where the **#** is the number of scheduled events that are visible to you. You cannot see schedules that have not been shared with you. \nImportant \nNew schedules are not automatically shared with other users, even if those users have access to the query. To make scheduled runs and results visible to other users, use the sharing settings described in the next step.\n4. Share the schedule \nQuery permissions are not linked to schedule permissions. After creating your scheduled run interval, edit the schedule permissions to provide access to other users. \n* Click **Schedule(#)**.\n* Click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu and select **Edit schedule permissions**.\n* Choose a user or group from the drop-down menu in the dialog.\n* Choose CAN VIEW to allow the selected users to view the results of scheduled runs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/schedule-query.html"} +{"content":"# What is data warehousing on Databricks?\n## Write queries and explore data in the SQL Editor\n#### Schedule a query\n##### Refresh behavior and execution context\n\nWhen a query is \u201cRun as Owner\u201d and a schedule is added, the query owner\u2019s credential is used for execution, and anyone with at least CAN RUN sees the results of those refreshed queries. \nWhen a query is \u201cRun as Viewer\u201d and a schedule is added, the schedule owner\u2019s credential is used for execution. 
Only the users with appropriate schedule permissions see the results of the refreshed queries; all other viewers must manually refresh to see updated query results.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/schedule-query.html"} +{"content":"# \n### Plotly\n\n[Plotly](https:\/\/pypi.org\/project\/plotly\/) is an interactive graphing library. Databricks supports Plotly 2.0.7. To use Plotly, [install the Plotly PyPI package](https:\/\/docs.databricks.com\/libraries\/index.html) and attach it to your cluster. \nNote \nInside Databricks notebooks we recommend using [Plotly Offline](https:\/\/plot.ly\/python\/offline\/). Plotly Offline [may not perform well](https:\/\/community.plot.ly\/t\/offline-plotting-in-python-is-very-slow-on-big-data-sets\/3077) when handling large datasets. If you notice performance issues, you should reduce the size of your dataset. \nTo display a Plotly plot: \n1. Specify `output_type='div'` as an argument to the Plotly `plot()` function.\n2. Pass the output of the `plot()` function to Databricks `displayHTML()` function.\n\n### Plotly\n#### Notebook example: Plotly\n\nThe following notebook shows a Plotly example. \n### Plotly Python notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/plotly.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/plotly.html"} +{"content":"# What is Databricks?\n### What are all the *Delta* things in Databricks?\n\nThis article is an introduction to the technologies collectively branded *Delta* on Databricks. Delta refers to technologies related to or in the [Delta Lake open source project](https:\/\/delta.io\/). \nThis article answers: \n* What are the *Delta* technologies in Databricks?\n* What do they do? 
Or what are they used for?\n* How are they related to and distinct from one another?\n\n### What are all the *Delta* things in Databricks?\n#### What are the Delta things used for?\n\nDelta is a term introduced with Delta Lake, the foundation for storing data and tables in the Databricks lakehouse. Delta Lake was conceived of as a unified data management system for handling transactional real-time and batch big data, by extending Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling.\n\n### What are all the *Delta* things in Databricks?\n#### Delta Lake: OS data management for the lakehouse\n\n[Delta Lake](https:\/\/delta.io\/) is an open-source storage layer that brings reliability to data lakes by adding a transactional storage layer on top of data stored in cloud storage (on AWS S3, Azure Storage, and GCS). It allows for ACID transactions, data versioning, and rollback capabilities. It allows you to handle both batch and streaming data in a unified way. \nDelta tables are built on top of this storage layer and provide a table abstraction, making it easy to work with large-scale structured data using SQL and the DataFrame API.\n\n","doc_uri":"https:\/\/docs.databricks.com\/introduction\/delta-comparison.html"} +{"content":"# What is Databricks?\n### What are all the *Delta* things in Databricks?\n#### Delta tables: Default data table architecture\n\nDelta table is the default data table format in Databricks and is a feature of the Delta Lake open source data framework. Delta tables are typically used for data lakes, where data is ingested via streaming or in large batches. 
\nSee: \n* [Delta Lake quickstart: Create a table](https:\/\/docs.delta.io\/latest\/quick-start.html#create-a-table)\n* [Updating and modifying Delta Lake tables](https:\/\/docs.databricks.com\/delta\/index.html#updates)\n* [DeltaTable class](https:\/\/docs.delta.io\/latest\/delta-apidoc.html#delta-spark): Main class for interacting programmatically with Delta tables.\n\n### What are all the *Delta* things in Databricks?\n#### Delta Live Tables: Data pipelines\n\nDelta Live Tables manages the flow of data between many Delta tables, thus simplifying the work of data engineers on ETL development and management. The pipeline is the main unit of execution for [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html). Delta Live Tables offers declarative pipeline development, improved data reliability, and cloud-scale production operations. Users can perform both batch and streaming operations on the same table and the data is immediately available for querying. You define the transformations to perform on your data, and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling. Delta Live Tables Enhanced Autoscaling can handle streaming workloads that are spiky and unpredictable. \nSee the [Delta Live Tables tutorial](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/introduction\/delta-comparison.html"} +{"content":"# What is Databricks?\n### What are all the *Delta* things in Databricks?\n#### Delta tables vs. Delta Live Tables\n\nA Delta table is a way to store data in tables, whereas Delta Live Tables allows you to describe how data flows between these tables declaratively. Delta Live Tables is a declarative framework that manages many Delta tables by creating them and keeping them up to date. 
In short, Delta tables are a data table architecture, while Delta Live Tables is a data pipeline framework.\n\n### What are all the *Delta* things in Databricks?\n#### Delta: Open source or proprietary?\n\nA strength of the Databricks platform is that it doesn\u2019t lock customers into proprietary tools: Much of the technology is powered by open source projects, which Databricks contributes to. \nThe Delta OSS projects are examples: \n* [Delta Lake project](https:\/\/delta.io\/): Open source storage for a lakehouse.\n* [Delta Sharing protocol](https:\/\/delta.io\/sharing\/): Open protocol for secure data sharing. \nDelta Live Tables is a proprietary framework in Databricks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/introduction\/delta-comparison.html"} +{"content":"# What is Databricks?\n### What are all the *Delta* things in Databricks?\n#### What are the other *Delta* things on Databricks?\n\nBelow are descriptions of other features that include *Delta* in their name. \n### Delta Sharing \nAn open standard for secure data sharing, [Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html) enables data sharing between organizations regardless of their compute platform. \n### Delta engine \nA query optimizer for big data that uses Delta Lake open source technology included in Databricks. Delta engine optimizes the performance of Spark SQL, Databricks SQL, and DataFrame operations by pushing computation to the data. \n### Delta Lake transaction log (AKA DeltaLogs) \nA single source of truth tracking all changes that users make to the table and the mechanism through which Delta Lake guarantees [atomicity](https:\/\/docs.databricks.com\/lakehouse\/acid.html#atomicity). See the [Delta transaction log protocol](https:\/\/github.com\/delta-io\/delta\/blob\/master\/PROTOCOL.md) on GitHub. 
\nThe transaction log is key to understanding Delta Lake, because it is the common thread that runs through many of its most important features: \n* ACID transactions\n* Scalable metadata handling\n* Time travel\n* And more.\n\n","doc_uri":"https:\/\/docs.databricks.com\/introduction\/delta-comparison.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n### Manage privileges in Unity Catalog\n##### Upgrade to privilege inheritance\n\nIf you created your Unity Catalog metastore during the public preview (before August 25, 2022), you can upgrade to Privilege Model version 1.0 to take advantage of [privilege inheritance](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html#inheritance). Existing workloads will continue to operate as-is until you upgrade your privilege model. Databricks recommends upgrading to Privilege Model version 1.0 to get the benefits of privilege inheritance and new features.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/upgrade-privilege-model.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n### Manage privileges in Unity Catalog\n##### Upgrade to privilege inheritance\n###### Differences in Privilege Model Version 1.0\n\nPrivilege Model v1.0 in Unity Catalog has the following differences from the public preview privilege model: \n* **Privilege inheritance:** In Privilege Model v1.0, privileges are inherited on child securable objects. This means that granting a privilege on the catalog automatically grants the privilege to all current and future objects within the catalog. Similarly, privileges granted on a schema are inherited by all current and future objects within that schema. In the preview model, privileges are not inherited on child securable objects. 
For more information on privilege inheritance, see [Inheritance model](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html#inheritance).\n* **`ALL PRIVILEGES` is evaluated differently:** In the public preview privilege model, `ALL PRIVILEGES` grants the principal all available privileges at the time of the privilege grant. In Privilege Model v1.0, the `ALL PRIVILEGES` permission expands to all available privileges at the time a permission check is made. \nIn Privilege Model v1.0, when `ALL PRIVILEGES` is revoked only the `ALL PRIVILEGES` privilege itself is revoked. Users retain any other privileges that were granted to them separately.\n* **`CREATE TABLE` is updated to `CREATE EXTERNAL TABLE`:** The `CREATE TABLE` permission no longer applies to external locations or storage credentials, which are required to create external tables. In Privilege Model v1.0, you instead grant the `CREATE EXTERNAL TABLE` privilege on external locations and storage credentials to allow a user to create external tables using that external location or storage credential.\n* **`CREATE` is removed:** The `CREATE` permission is removed and replaced by the following more specific privileges: `CREATE CATALOG`, `CREATE EXTERNAL LOCATION`, `CREATE FUNCTION`, `CREATE SCHEMA`, `CREATE TABLE`, `CREATE MANAGED STORAGE`.\n* **`USAGE` is removed:** The `USAGE` permission is removed and replaced by the following more specific privileges: `USE CATALOG` and `USE SCHEMA`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/upgrade-privilege-model.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n### Manage privileges in Unity Catalog\n##### Upgrade to privilege inheritance\n###### Upgrade to Privilege Model Version 1.0\n\nWarning \nYou cannot undo this action. \n1. 
Upgrade all workloads that reference Unity Catalog to use [Databricks Runtime 11.3 LTS](https:\/\/docs.databricks.com\/release-notes\/runtime\/11.3lts.html) or above. \nYou must upgrade all clusters to use Databricks Runtime 11.3 LTS or above, and you must restart any running SQL warehouses. If you skip this step, workloads on older versions of Databricks Runtime will be rejected after you complete the upgrade.\n2. As an account admin, log in to the [account console](https:\/\/accounts.cloud.databricks.com).\n3. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n4. Click the metastore name.\n5. Under **Privilege Model**, click **Upgrade**.\n6. Click **Upgrade**. \nIf you do not see the option to upgrade, your Unity Catalog metastore is already using Privilege Model 1.0.\n\n##### Upgrade to privilege inheritance\n###### Upgrade SQL commands (optional)\n\nDatabricks will continue to support grants expressed using the old privilege model and automatically map them to the equivalent grant in Privilege Model v1.0. However, privileges returned via `SHOW GRANTS` or `information_schema` data will continue to reference Privilege Model v1.0. Databricks recommends that you upgrade existing code that performs grants to reference the updated privilege model. \n* Replace the `CREATE TABLE` privilege on external locations or storage credentials with the `CREATE EXTERNAL TABLE` privilege.\n* Replace the `CREATE` permission with the specific privilege `CREATE CATALOG`, `CREATE EXTERNAL LOCATION`, `CREATE FUNCTION`, `CREATE SCHEMA`, or `CREATE TABLE`.\n* Replace the `USAGE` permission with the specific privilege `USE CATALOG` or `USE SCHEMA`. 
\nFor more information about the Unity Catalog privilege model, see [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/upgrade-privilege-model.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### User-defined functions (UDFs) in Unity Catalog\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nDatabricks provides a SQL-native syntax to register custom functions to schemas governed by Unity Catalog. Python UDFs registered as functions in Unity Catalog differ in scope and support from PySpark UDFs scoped to a notebook or SparkSession. See [User-defined scalar functions - Python](https:\/\/docs.databricks.com\/udf\/python.html). \nFor the complete SQL language reference, see [CREATE FUNCTION (SQL and Python)](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-sql-function.html). 
\nFor information about how Unity Catalog manages permissions on functions, see [CREATE FUNCTION](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html#function).\n\n#### User-defined functions (UDFs) in Unity Catalog\n##### Requirements\n\n* Databricks Runtime 13.3 LTS or above.\n* To use Python code in UDFs that are registered in Unity Catalog, you must use a serverless or pro SQL warehouse or a cluster running Databricks Runtime 13.3 LTS or above.\n* To resolve views that were created using a UDF registered to Unity Catalog, you must use a serverless or pro SQL warehouse.\n* Graviton instances do not support UDFs on Unity Catalog-enabled clusters.\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/unity-catalog.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### User-defined functions (UDFs) in Unity Catalog\n##### Custom SQL functions in Unity Catalog\n\nWhen you create a SQL function using compute configured for Unity Catalog, the function is registered to the currently active schema by default. 
The following example demonstrates the syntax you might use to declare a target catalog and schema for a new function: \n```\nCREATE FUNCTION target_catalog.target_schema.roll_dice()\nRETURNS INT\nLANGUAGE SQL\nNOT DETERMINISTIC\nCONTAINS SQL\nCOMMENT 'Roll a single 6 sided die'\nRETURN (rand() * 6)::INT + 1;\n\n``` \nAll users with sufficient privileges on the function can then use the function in compute environments configured for Unity Catalog, as in the following example: \n```\nSELECT target_catalog.target_schema.roll_dice()\n\n``` \nNote \nYou can use UDFs using `LANGUAGE SQL` to return tables or scalar values.\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/unity-catalog.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### User-defined functions (UDFs) in Unity Catalog\n##### Register a Python UDF to Unity Catalog\n\nIn Databricks Runtime 13.3 LTS and above, you can use the SQL `CREATE FUNCTION` statement to register scalar Python UDFs to Unity Catalog. \nImportant \nOnly pro and serverless SQL warehouses support Python UDFs for Unity Catalog. \nPython UDFs are designed to provide the full expressiveness of Python directly within SQL functions, allowing for customized operations such as advanced transformations, data masking, and hashing. \nPython UDFs execute in a secure, isolated environment and do not have access to file systems or internal services. \nPython UDFs running on serverless compute or in shared access mode allow TCP\/UDP network traffic over ports 80, 443, and 53. \nSee [Which UDFs are most efficient?](https:\/\/docs.databricks.com\/udf\/index.html#udf-efficiency). \nNote \nSyntax and semantics for Python UDFs in Unity Catalog differ from Python UDFs registered to the SparkSession. See [User-defined scalar functions - Python](https:\/\/docs.databricks.com\/udf\/python.html). 
\nPython UDFs for Unity Catalog use statements set off by double dollar signs (`$$`), as in the following code example: \n```\nCREATE FUNCTION target_catalog.target_schema.greet(s STRING)\nRETURNS STRING\nLANGUAGE PYTHON\nAS $$\nreturn f\"Hello, {s}\"\n$$\n\n``` \nThe following example demonstrates using this function to return greeting statements for all names stored in the `first_name` column of a table named `students`: \n```\nSELECT target_catalog.target_schema.greet(first_name)\nFROM students;\n\n``` \nYou can define any number of Python functions within a Python UDF, but must return a scalar value. \nPython functions must handle `NULL` values independently, and all type mappings must follow Databricks [SQL language mappings](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-datatypes.html#language-mappings). \nYou can import standard Python libraries included by Databricks, but you cannot include custom libraries or external dependencies. \nIf no catalog or schema is specified, Python UDFs are registered to the current active schema. 
\nThe following example imports a library and uses multiple functions within a Python UDF: \n```\nCREATE FUNCTION roll_dice(num_dice INTEGER, num_sides INTEGER)\nRETURNS INTEGER\nLANGUAGE PYTHON\nAS $$\nimport numpy as np\n\ndef roll_die(num_sides):\n    return np.random.randint(num_sides) + 1\n\ndef sum_dice(num_dice,num_sides):\n    return sum([roll_die(num_sides) for x in range(num_dice)])\n\nreturn sum_dice(num_dice, num_sides)\n$$\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/unity-catalog.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Create an ODBC DSN for the Databricks ODBC Driver\n\nThis article describes how to create an ODBC Data Source Name (DSN) for the [Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/index.html). \nCreate your DSN as follows, depending on your operating system. \n* [Create an ODBC DSN with Windows](https:\/\/docs.databricks.com\/integrations\/odbc\/dsn.html#windows)\n* [Create an ODBC DSN with macOS](https:\/\/docs.databricks.com\/integrations\/odbc\/dsn.html#macos)\n* [Create an ODBC DSN with Linux](https:\/\/docs.databricks.com\/integrations\/odbc\/dsn.html#linux)\n\n######### Create an ODBC DSN for the Databricks ODBC Driver\n########## Create an ODBC DSN with Windows\n\n1. From the Start menu, search for **ODBC Data Sources** to launch the **ODBC Data Source Administrator**.\n2. Click the **Drivers** tab to verify that the ODBC Driver (`Simba Spark ODBC Driver`) is installed.\n3. Click the **User DSN** or **System DSN** tab and then click the **Add** button.\n4. Select **Simba Spark ODBC Driver** from the list of installed drivers and then click **Finish**.\n5. Enter a name for the DSN and set the configuration settings for your target Databricks connection.\n6. Click **OK** to finish creating the DSN. 
\nTo use your DSN with your target app, tool, client, SDK, or API, see [Technology partners](https:\/\/docs.databricks.com\/integrations\/index.html) or your provider\u2019s documentation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/dsn.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Create an ODBC DSN for the Databricks ODBC Driver\n########## Create an ODBC DSN with macOS\n\n1. Install [ODBC Manager](http:\/\/www.odbcmanager.net) for macOS in one of the following ways: \n* [Install ODBC Manager by using Homebrew](https:\/\/formulae.brew.sh\/cask\/odbc-manager)\n* [Download ODBC Manager](http:\/\/www.odbcmanager.net) and then double-click on the downloaded `.dmg` file to install it.\n2. Start ODBC Manager.\n3. Click the **Drivers** tab to verify that the ODBC driver (`Simba Spark ODBC Driver`) is installed.\n4. Click the **User DSN** or **System DSN** tab and then click the **Add** button.\n5. Select **Simba Spark ODBC Driver** from the list of installed drivers and then click **OK**.\n6. Enter some name for the DSN and set the configuration settings for your target Databricks connection.\n7. Click **OK** to finish creating the DSN. \nTo use your DSN with your target app, tool, client, SDK, or API, see [Technology partners](https:\/\/docs.databricks.com\/integrations\/index.html) or your provider\u2019s documentation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/dsn.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks ODBC Driver\n######### Create an ODBC DSN for the Databricks ODBC Driver\n########## Create an ODBC DSN with Linux\n\n1. Install [unixODBC](http:\/\/www.unixodbc.org\/).\n2. 
Locate the `odbc.ini` driver configuration file that corresponds to `SYSTEM DATA SOURCES`: \n```\nodbcinst -j\n\n```\n3. In a text editor, open the `odbc.ini` configuration file.\n4. Create an `[ODBC Data Sources]` section: \n```\n[ODBC Data Sources]\nDatabricks=Databricks ODBC Connector\n\n```\n5. Create another section with the same name as your DSN as follows: \n```\n[Databricks]\nDriver=<path-to-driver>\nHost=<server-hostname>\nPort=443\nHTTPPath=<http-path>\nSSL=1\nThriftTransport=2\n<setting1>=<value1>\n<setting2>=<value2>\n<settingN>=<valueN>\n\n``` \n* To get the value for `<path-to-driver>`, see [Download and install the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html).\n* To get the values for `<server-hostname>` and `<http-path>`, see [Compute settings for the Databricks ODBC Driver](https:\/\/docs.databricks.com\/integrations\/odbc\/compute.html).\n* `<setting>=<value>` is one or more pairs of [authentication settings](https:\/\/docs.databricks.com\/integrations\/odbc\/authentication.html) and any special or advanced [driver capability settings](https:\/\/docs.databricks.com\/integrations\/odbc\/capability.html) for your target Databricks connection. \nTo use your DSN with your target app, tool, client, SDK, or API, see [Technology partners](https:\/\/docs.databricks.com\/integrations\/index.html) or your provider\u2019s documentation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/odbc\/dsn.html"} +{"content":"# What is Databricks Marketplace?\n### List your data product in Databricks Marketplace\n\nThis article describes how to become a Databricks Marketplace provider and how to create a Databricks Marketplace listing for your data products.\n\n### List your data product in Databricks Marketplace\n#### Before you begin\n\nTo list products in the Databricks Marketplace, you must agree to provider policies, and your account and workspaces must meet certain requirements. 
\n**Policies:** \nTo list data products on the Marketplace exchange, you must agree to the [Marketplace provider policies](https:\/\/docs.databricks.com\/marketplace\/provider-policies.html). \n**Account and workspace requirements:** \nDatabricks Marketplace uses [Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html) to manage secure sharing of data products. Delta Sharing, in turn, requires that your Databricks workspace is enabled for Unity Catalog. Your Databricks account and workspaces must therefore meet the following requirements: \n* Databricks account on the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n* A Databricks workspace enabled for Unity Catalog. See [Enable a workspace for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/enable-workspaces.html). You do not need to enable all of your workspaces for Unity Catalog. You can create one simply to manage Marketplace listings. \nIf you have workspaces that meet these criteria, your users will be able to view the Marketplace home page. Additional permissions are required to create and manage listings. These are enumerated in the sections that follow. If you don\u2019t want your users to be able to view the Marketplace home page at all, contact your Databricks account team. \n**Permission requirements:** \nTo sign up as a private-exchange-only provider, you must be a Databricks account admin. 
See [Sign up to be a Databricks Marketplace provider](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#apply).\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html"} +{"content":"# What is Databricks Marketplace?\n### List your data product in Databricks Marketplace\n#### Sign up to be a Databricks Marketplace provider\n\nThe way you sign up to be a Databricks Marketplace provider depends on whether you intend to create listings in the public marketplace or only through private exchanges. In private exchanges, only consumers who are members of the exchange can browse, view, and request access to a listing. See [Create and manage private exchanges in Databricks Marketplace](https:\/\/docs.databricks.com\/marketplace\/private-exchange.html). \nTo be a private-exchange-only provider, you can sign up using the provider console. To create public listings, you apply through the Databricks Data Partner Program. \n### Apply to be a provider that can create public listings \nNote \nIf your organization is already in the Databricks Partner Program and you\u2019re interested in becoming a Marketplace provider, skip the following instructions and contact `partnerops@databricks.com` instead. \n1. On the [Databricks Data Partner Program page](https:\/\/www.databricks.com\/company\/partners\/data-partner-program), click **Apply Now**.\n2. On the next page, click **Apply now**.\n3. Enter your email address and click **Apply Now**.\n4. Fill out the application form. \nTowards the bottom of the application form, you\u2019re asked which Databricks Partner program interests you. Select Marketplace. \nThe Databricks Partner team will reach out to you to complete the application process. When you\u2019re approved, the Provider console will become available in your Unity-Catalog-enabled Databricks workspaces. To access the Provider console, a user must have the Marketplace admin role. 
\n### Sign up as a private exchange provider \nTo sign up as a private-exchange-only provider, use the **Get started as a provider** page in the provider console. \n![Marketplace home page](https:\/\/docs.databricks.com\/_images\/get-started-provider-private.png) \nTo sign up as a private exchange provider: \n1. As a Databricks account admin, sign in to your Databricks workspace.\n2. In the sidebar, click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**.\n3. On the upper-right corner of the Marketplace page, click **Provider console**.\n4. If your account has not been onboarded as a provider, the provider console displays the **Get started as a provider** page, which walks you through the process of enabling your account as a private exchange provider.\n5. Under **Accept Marketplace terms**, click the **Databricks Marketplace Private Provider terms** link to review the terms.\n6. To accept the terms, click the **Accept Private Provider terms** button. \nThis takes you to the Databricks account console in a new browser tab and opens the **Settings > Feature enablement** tab. You might need to log in to the account console if you aren\u2019t logged in already. \nNote \nIf you have multiple Databricks accounts, confirm that you are logging in to the account that contains the workspace in which you are accessing the provider console.\n7. On the **Feature enablement** tab, enable the **Marketplace Private Exchange Provider** option.\n8. Return to the **Provider console** in the workspace and click **Refresh page**. Do not use your browser\u2019s refresh page functionality.\n9. After a few minutes, the **Assign Marketplace admin** button appears. Click it to open your user page in the account console. On the **Roles** tab, enable **Marketplace admin**. \nYou can choose to assign the Marketplace admin role to another user or users. If you do, they can continue the process from this point. 
If you assign the role to yourself, you can continue the process.\n10. Return to the **Provider console** in the workspace and click **Refresh page**. Do not use your browser\u2019s refresh page functionality.\n11. After a few minutes, the **Create provider profile** button appears. Click it to open the **Create Profile** page. \nNote \nIt might take a few minutes for the system to finish assigning the Marketplace admin role. If you proceed to create a provider profile and see an error indicating that you don\u2019t have the Marketplace admin role, wait a few more minutes, refresh the page, and retry.\n12. To create your provider profile, follow the instructions in [Create your Marketplace provider profile](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#profile), starting at step 5.\n13. Create your first private exchange. See [Create and manage private exchanges in Databricks Marketplace](https:\/\/docs.databricks.com\/marketplace\/private-exchange.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html"} +{"content":"# What is Databricks Marketplace?\n### List your data product in Databricks Marketplace\n#### Assign the Marketplace admin role\n\nIf you signed up as a private exchange provider, you performed this task as part of the sign-up process. You can skip the instructions in this section, unless you want to enable other users in your Databricks account as Marketplace admins. \nOnce you\u2019ve been approved as a Marketplace provider, you must grant at least one user the Marketplace admin role. This role is required for accessing the Marketplace Provider console and for creating and managing your Marketplace provider profile and listings. A Databricks account admin can grant the role. \n1. As an account admin, log in to the [account console](https:\/\/accounts.cloud.databricks.com).\n2. 
Click ![Account Console user management icon](https:\/\/docs.databricks.com\/_images\/user-management.png) **User management**.\n3. Find and click the username.\n4. On the **Roles** tab, turn on **Marketplace admin**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html"} +{"content":"# What is Databricks Marketplace?\n### List your data product in Databricks Marketplace\n#### Create your Marketplace provider profile\n\nYour provider profile gives you the opportunity to tell prospective consumers who you are and to group your data products under a single brand or identity. Typically, a data provider has one profile but can list multiple data products. If you want more than one profile, reach out to your Databricks account team. \nIf you are a public provider, you can create your profile after your provider application has been approved. If you are a private-exchange-only provider, you create your profile as the final step of the [signup process](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#private), starting at step 5 in the procedure provided here. \n**Permission required**: Marketplace admin role \nTo create a profile: \n1. Log in to the Databricks workspace you will use for creating shares and listings.\n2. In the sidebar, click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**.\n3. On the upper-right corner of the Marketplace page, click **Provider console**.\n4. On the **Provider console** page **Profiles** tab, click **Create profile**.\n5. Enter the following information. All fields are required: \n* **Provider name**: Use a name that consumers will recognize. Consumers can filter listings by provider name.\n* **Logo**: drag and drop or browse to an image file for the logo you want to use in your listing.\n* **Description**: Describe your organization clearly and accurately. 
Include details such as industries you typically serve or represent and the types of data assets that you typically list. Consumers can see this description when they view your profile and on all of your listings.\n* **Organization website**: Link to your organization\u2019s website. Consumers can follow this link to learn more about your organization. This link appears on all of your listings.\n* **Business email**: Enter an email address that Databricks can use to send you notifications. Consumers do not see this information.\n* **Support email**: Enter an email address that consumers can use to request support. This link appears on all of your listings.\n* **Terms of service link**: This link appears on all of your listings. You can override this link by entering a different one at the listing level.\n* **Privacy policy**: This link appears on all of your listings. You can override this link by entering a different one at the listing level.\n6. Save the profile. \nIf any of this information changes while you are a Marketplace provider, update your profile. Your profile must be accurate and kept up to date.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html"} +{"content":"# What is Databricks Marketplace?\n### List your data product in Databricks Marketplace\n#### Create shares\n\nAfter you have a Databricks account enabled for Delta Sharing and a Databricks workspace enabled for Unity Catalog, you can create the *shares* that you will use to share your data in the Marketplace. \nA share is a Delta Sharing object. It\u2019s a collection of tables, views, volumes, and AI models that are shareable and securable as a unit. Tables can be shared with any consumer. Volumes, AI models, and notebooks can only be shared with consumers who have access to a Databricks workspace that is enabled for Unity Catalog. 
\nNote \nTo list a data product that is free and instantly available to consumers, you must include a share when you create the listing. Listings that require you to approve a consumer request, on the other hand, don\u2019t require that you include a share in the listing. You can create the share later, after any business agreements are complete and you\u2019ve approved the consumer\u2019s request. If that\u2019s what you want to do, skip ahead to [Create listings](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#create-listing). \n1. Add data tables, views, or volumes to your Unity Catalog metastore. \nTo learn how to create these data assets in Unity Catalog, see: \n* [Create tables in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html)\n* [Create views](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-views.html)\n* [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html)\n2. Create a share and add these data assets to the share. \nTo learn how to create and update shares, see [Create and manage shares for Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/create-share.html). \n*Permissions required:* \n* To create a share, you must be a metastore admin or user with the `CREATE SHARE` privilege for the Unity Catalog metastore where the data you want to share is registered.\n* To add a table, volume, or view to a share, you must be the share owner, have the `USE SCHEMA` privilege on the schema that contains the data asset, and have the `SELECT` privilege on the data asset. You must keep the `SELECT` privilege in order for the asset to continue to be shared. If you lose it, the recipient cannot access the asset through the share. 
Databricks therefore recommends that you use a group as the share owner. For more details about requirements for sharing tables, volumes, and views, including compute and data type requirements, see [Create and manage shares for Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/create-share.html). \nAfter your share is created, you can create or update a Marketplace listing that references it.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html"} +{"content":"# What is Databricks Marketplace?\n### List your data product in Databricks Marketplace\n#### Notebook example: Sample notebook\n\nIn addition to tables, volumes, and views, Databricks highly recommends that you also share Databricks notebooks. A notebook is a great way to demonstrate example use cases and visualize table properties. Your listing can include sample notebook previews that consumers can import into their workspaces. \n![Notebook preview on a listing](https:\/\/docs.databricks.com\/_images\/sample-notebook.png) \nFor more information about creating notebooks, see [Introduction to Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html). If you need help creating an effective sample notebook, contact `dataproviders@databricks.com`. \nNote \nThe **Sample notebooks** display and preview in the listings UI does not work in Chrome Incognito mode. \nThe following example notebook includes guidance for creating effective sample notebooks for your listings. 
\n### Marketplace starter notebook for data providers \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/marketplace\/marketplace-starter-notebook-for-data-providers.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html"} +{"content":"# What is Databricks Marketplace?\n### List your data product in Databricks Marketplace\n#### Create listings\n\nMarketplace listings enable consumers to browse, select, and access your data products. All dataset listings are automatically shareable with both consumers on Databricks workspaces and consumers on third-party platforms like Power BI, pandas, and Apache Spark. \nNote \nSome data assets, like Databricks volumes, can only be shared with consumers who have access to a Unity Catalog-enabled Databricks workspace. Tables, however, are shareable with all consumers. If you include both tables and volumes in a share, consumers who don\u2019t have access to a Unity Catalog-enabled workspace will only be able to access the tabular data. \n**Permissions required:** [Marketplace admin role](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#marketplace-admin). If you are creating and managing personalized listings (those that require provider approval before they\u2019re fulfilled), you must also have the `CREATE RECIPIENT` and `USE RECIPIENT` privileges. See [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html). \nTo create a listing: \n1. Log in to your Databricks workspace.\n2. In the sidebar, click ![Marketplace icon](https:\/\/docs.databricks.com\/_images\/marketplace.png) **Marketplace**.\n3. On the upper-right corner of the Marketplace page, click **Provider console**.\n4. On the **Provider console** page **Listings** tab, click **Create listing**.\n5. 
On the **New listing** page, enter your listing information. \nFor instructions, see [Listing fields and options](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#listing-recommendations). \nYou can save a draft and view a preview before you publish. When you click **Publish**, the listing appears in the Marketplace immediately. \n### Listing fields and options \nThis section describes each **New listing** page field and option. It also provides recommendations for creating an effective listing. \n* **Listing name**: Each listing should have a unique name that helps consumers understand what it offers. \nRecommendations: \n+ Fewer than 100 characters.\n+ Title case (capitalize primary words).\nExample \nUS Census 2022\n* **Short description**: A short, informative explanation of the dataset that expands on the listing name. This field appears in listing tiles and consumer search results. \nRecommendations: \n+ Fewer than 100 characters. Cannot exceed 160 characters.\n+ Sentence case (capitalize only the first word and any proper nouns or acronyms).\nExample \nGeneral information about US population count and demography in 2020\n* **Provider profile**: Your organization or company name. Select from the drop-down menu. Your profile is created by Databricks as part of the partner organization approval process.\n* **Terms of service**: A URL that links to your terms of service for the appropriate use of the shared data assets. \nTerms of service must be publicly accessible and require no login.\n* **Public Marketplace**: All consumers can browse and view the listing in the public Databricks Marketplace.\n* **Private exchange**: Only consumers who are members of a private exchange, created by you or another marketplace admin, can browse, view, and request the listing. See [Create and manage private exchanges in Databricks Marketplace](https:\/\/docs.databricks.com\/marketplace\/private-exchange.html). 
\nYou must select at least one private exchange from the drop-down list.\n* **Data is available instantly**: Select this option to let consumers gain access to the shared data directly from the Marketplace, without requiring your approval (but requiring acceptance of terms of service). Choose a share from the drop-down menu. This option is typically used for sample and public datasets. \nIf you have not yet created the share that you want to include, click **+ Create new share** at the bottom of the drop-down menu. You are taken to the **Create a new share** dialog box. \nIf a share that you select or create here contains no data or assets, a message appears with an **Add data** button. Click it to go to Catalog Explorer, where you can add tables to the share. \nFor more information about creating shares and adding tables to shares, including required permissions, see [Create and manage shares for Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/create-share.html).\n* **Require approval of consumer requests**: Select this option to require your approval before a consumer can access the shared data. Use this option if you require a business agreement before you make the data product available to consumers. You must manage the business agreement with consumers outside of Databricks Marketplace. You can initiate communications using the consumer email address. \nYou can view and handle consumer requests on the **Provider Console > Consumer requests** tab. See [Manage requests for your data product in Databricks Marketplace](https:\/\/docs.databricks.com\/marketplace\/manage-requests-provider.html).\n* **Categories**: Select up to five categories that consumers can use to filter listings. Categories also appear as tags on listing tiles and detail pages.\n* **Add attribute**: Attributes are optional. They include fields such as geographic coverage, update frequency, time range, data source, and dataset size. 
Adding attributes helps consumers understand more about your data product. Select as many attributes as you like.\n* **Description**: The detailed description of your data should include a summary of the data and assets being offered in the listing. \nBasic rich text formatting is supported (that is, bold, italics, bullets, and numbered lists), using Markdown syntax. To preview your content, use the buttons at the far right of the Description field toolbar. \nRecommendations: \n+ Include benefits and use cases.\n+ Provide brief guidance about how to use the data and sample use cases.\n+ Include sample datasets and field names.\n+ Specify schemas, tables, and columns.\n+ Use consistent punctuation and syntax.\n+ Add an extra line break between paragraphs.\n+ Check your spelling and grammar.\n+ Don\u2019t repeat the attributes that you defined under **Add Attribute**.\nExample \n**Overview:** \nThe US Census of Population and Dwellings is the official count of people and houses in the US in 2020. It provides a social and economic snapshot. The 2020 Census, held on March 6, 2021, is the 23rd census. \n**Use cases:** \n+ Group customers based on demographic variables like age and gender.\n+ Customize product offerings to specific consumer groups.\n**Information included in this dataset:** \n+ Population estimates\n+ Demographic components (births, deaths, migration)\n+ This data can be sorted by characteristics such as age, sex, and race, as well as by national, state, and county location\n* **Sample notebook**: Databricks highly recommends that you share sample notebooks to demonstrate how best to use the data. Add up to ten notebooks. You must save the listing and return to it in order to upload sample notebooks. 
\nFor more information about creating notebooks, see [Notebook example: Sample notebook](https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html#notebooks) and [Introduction to Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html).\n* **Documentation**: A URL that links to documentation that can help consumers use or understand your data set (for example, a dataset dictionary).\n* **Privacy policy**: A URL that links to your privacy policy. \nThe privacy policy must be publicly accessible and require no login.\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html"} +{"content":"# What is Databricks Marketplace?\n### List your data product in Databricks Marketplace\n#### Analyze consumer activity using system tables (Public Preview)\n\nIf you have [system tables](https:\/\/docs.databricks.com\/admin\/system-tables\/index.html) enabled in your account, you can use the Marketplace system tables to analyze consumer activity on your listings. For more information, see [Marketplace system tables reference](https:\/\/docs.databricks.com\/admin\/system-tables\/marketplace.html).\n\n### List your data product in Databricks Marketplace\n#### Next steps\n\n* [Manage consumer requests](https:\/\/docs.databricks.com\/marketplace\/manage-requests-provider.html)\n* [Manage existing listings](https:\/\/docs.databricks.com\/marketplace\/manage-listings.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/marketplace\/get-started-provider.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Monitoring and observability options for Delta Live Tables pipelines\n\nThis article introduces the features included in Delta Live Tables that support monitoring, reporting, and auditing your pipelines, and generating alerts and notifications for pipeline events. \nDelta Live Tables includes several features to support monitoring and observability of pipelines. 
These features support tasks such as: \n* Observing the progress and status of pipeline updates.\n* Receiving alerts for pipeline events such as success or failure of pipeline updates.\n* Extracting detailed information on pipeline updates such as data lineage, data quality metrics, and resource usage.\n* Defining custom actions to take when specific events occur. \nThe articles in this section are detailed references to help you effectively use these features to monitor your Delta Live Tables pipelines.\n\n#### Monitoring and observability options for Delta Live Tables pipelines\n##### Monitor pipelines with the Delta Live Tables UI and logs\n\nYou can use the pipeline details UI to view the status and progress of pipeline updates, a history of updates, and details about the datasets in a pipeline. For more advanced reporting and auditing tasks, you can use the Delta Live Tables event log. \nSee [Monitor Delta Live Tables pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/observability.html).\n\n#### Monitoring and observability options for Delta Live Tables pipelines\n##### Implement custom monitoring of Delta Live Tables pipelines with event hooks (Public Preview)\n\nYou can use event hooks to implement custom monitoring and alerting for pipelines. For example, you can use event hooks to send emails or write to a log when specific events occur, or integrate with third-party solutions. Event hooks are custom Python callback functions that run when events are persisted to a Delta Live Tables pipeline\u2019s event log. \nSee [Define custom monitoring of Delta Live Tables pipelines with event hooks](https:\/\/docs.databricks.com\/delta-live-tables\/event-hooks.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/monitor-pipelines.html"} +{"content":"# Technology partners\n## Connect to ML partners using Partner Connect\n#### Connect to Dataiku\n\nDataiku is an end-to-end AI platform for data preparation, AutoML, and MLOps. 
You can integrate your Databricks SQL warehouses and Databricks clusters with Dataiku.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ml\/dataiku.html"} +{"content":"# Technology partners\n## Connect to ML partners using Partner Connect\n#### Connect to Dataiku\n##### Connect to Dataiku using Partner Connect\n\nNote \nPartner Connect only supports SQL warehouses for Dataiku. To connect a cluster to Dataiku, connect to Dataiku manually. \nTo connect your Databricks workspace to Dataiku using Partner Connect, do the following: \n1. In the sidebar, click ![Partner Connect button](https:\/\/docs.databricks.com\/_images\/partner-connect.png) **Partner Connect**.\n2. Click the partner tile. \nNote \nIf the Dataiku tile has a check mark icon inside it, an administrator has already used Partner Connect to connect Dataiku to your workspace. Skip to step 5. The partner uses the email address for your Databricks account to prompt you to sign in to your existing Dataiku account.\n3. Select a catalog from the drop-down list, and then click **Next**. \nNote \nIf your workspace is Unity Catalog-enabled, but the partner doesn\u2019t support Unity Catalog with Partner Connect, the workspace default catalog is used. If your workspace isn\u2019t Unity Catalog-enabled, `hive_metastore` is used.\n4. Select a schema from the drop-down list, and then click **Add**. You can repeat this step to add multiple schemas.\n5. Click **Next**. 
\nPartner Connect creates the following resources in your workspace: \n* A Databricks [service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) named **`<PARTNER>_USER`**.\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) that is associated with the **`<PARTNER>_USER`** service principal. \nPartner Connect also grants the following privileges to the **`<PARTNER>_USER`** service principal: \n* (Unity Catalog) `USE CATALOG`: Required to interact with objects within the selected catalog.\n* (Unity Catalog) `USE SCHEMA`: Grants the ability to read the schemas you selected.\n* (Unity Catalog) `CREATE SCHEMA`: Grants the ability to create schemas in the selected catalog.\n* (Legacy Hive metastore) `USAGE`: Required to interact with objects within `hive_metastore` and the selected schemas.\n* (Legacy Hive metastore) `CREATE`: Grants the ability to create a schema in `hive_metastore`.\n* (Legacy Hive metastore) `READ_METADATA`: Grants the ability to read metadata for the schemas you selected.\n* (Legacy Hive metastore) `SELECT`: Grants the ability to read the schemas you selected.\n6. Click **Next**. \nThe **Email** box displays the email address for your Databricks account. Dataiku uses this email address to prompt you to either create a new partner account or sign in to your existing partner account.\n7. Click **Connect to Dataiku** or **Sign in**. \nA new tab opens in your web browser, which displays the Dataiku website.\n8. 
Complete the on-screen instructions on the Dataiku website to create your trial Dataiku account or sign in to your existing Dataiku account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ml\/dataiku.html"} +{"content":"# Technology partners\n## Connect to ML partners using Partner Connect\n#### Connect to Dataiku\n##### Connect to Dataiku manually\n\nThis section describes how to connect an existing SQL warehouse or cluster in your Databricks workspace to Dataiku manually. \nNote \nFor Databricks SQL warehouses, you can connect to Dataiku using Partner Connect to simplify the experience. \n### Requirements \nBefore you connect to Dataiku manually, you need the following: \n* A cluster or SQL warehouse in your Databricks workspace. \n+ [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n+ [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* The connection details for your cluster or SQL warehouse, specifically the **Server Hostname**, **Port**, and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). To create a personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. 
Copy the displayed token to a secure location, and then click **Done**.\nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. See your workspace administrator or the following: \n+ [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n+ [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens). 
\n### Steps to connect \nTo connect to Dataiku manually, follow [Databricks](https:\/\/doc.dataiku.com\/dss\/latest\/connecting\/sql\/databricks.html) in the Dataiku documentation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ml\/dataiku.html"} +{"content":"# Technology partners\n## Connect to ML partners using Partner Connect\n#### Connect to Dataiku\n##### Additional resources\n\nExplore the following Dataiku resources: \n* [Website](https:\/\/www.dataiku.com\/)\n* [Documentation](https:\/\/doc.dataiku.com\/dss\/latest\/)\n* [Support](https:\/\/support.dataiku.com\/support\/home)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ml\/dataiku.html"} +{"content":"# Technology partners\n### Connect to BI partners using Partner Connect\n\nTo connect your Databricks workspace to a BI and visualization partner solution using Partner Connect, you typically follow the steps in this article. \nImportant \nBefore you follow the steps in this article, see the appropriate partner article for important partner-specific information. There might be differences in the connection steps between partner solutions. Some partner solutions also allow you to integrate with Databricks SQL warehouses (formerly Databricks SQL endpoints) or Databricks clusters, but not both.\n\n### Connect to BI partners using Partner Connect\n#### Requirements\n\nSee the [requirements](https:\/\/docs.databricks.com\/partner-connect\/index.html#requirements) for using Partner Connect. \nImportant \nFor partner-specific requirements, see the appropriate partner article.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/bi.html"} +{"content":"# Technology partners\n### Connect to BI partners using Partner Connect\n#### Steps to connect to a BI and visualization partner\n\nTo connect your Databricks workspace to a BI and visualization partner solution, follow the steps in this section. 
\nTip \nIf you have an existing partner account, Databricks recommends that you follow the steps to connect to the partner solution manually in the appropriate partner article. This is because the connection experience in Partner Connect is optimized for new partner accounts. \n1. In the sidebar, click ![Partner Connect button](https:\/\/docs.databricks.com\/_images\/partner-connect.png) **Partner Connect**.\n2. Click the partner tile. \nNote \nIf the partner tile has a check mark icon inside it, an administrator has already used Partner Connect to connect the partner to your workspace. Skip to step 7. The partner uses the email address for your Databricks account to prompt you to sign in to your existing partner account.\n3. If there are SQL warehouses in your workspace, select a SQL warehouse from the drop-down list. If your SQL warehouse is stopped, click **Start**.\n4. If there are no SQL warehouses in your workspace, do the following: \n1. Click **Create warehouse**. A new tab opens in your browser that displays the **New SQL Warehouse** page in the Databricks SQL UI.\n2. Follow the steps in [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n3. Return to the Partner Connect tab in your browser, then close the partner tile.\n4. Re-open the partner tile.\n5. Select the SQL warehouse you just created from the drop-down list.\n5. Select a catalog and a schema from the drop-down lists, then click **Add**. You can repeat this step to add multiple schemas. \nNote \nIf a partner doesn\u2019t support Unity Catalog with Partner Connect, the workspace default catalog is used. If your workspace isn\u2019t Unity Catalog-enabled, `hive_metastore` is used.\n6. Click **Next**. 
\nPartner Connect creates the following resources in your workspace: \n* A Databricks cluster named **`<PARTNER>_CLUSTER`**.\n* A Databricks [service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) named **`<PARTNER>_USER`**.\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) that is associated with the **`<PARTNER>_USER`** service principal. \nThe **Email** box displays the email address for your Databricks account. The partner uses this email address to prompt you to either create a new partner account or sign in to your existing partner account.\n7. Click **Connect to `<Partner>`** or **Sign in**. \nA new tab opens in your web browser, which displays the partner website.\n8. Complete the on-screen instructions on the partner website to create your trial partner account or sign in to your existing partner account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/bi.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query caching\n\nCaching is an essential technique for improving the performance of data warehouse systems by avoiding the need to recompute or fetch the same data multiple times. In Databricks SQL, caching can significantly speed up query execution and minimize warehouse usage, resulting in lower costs and more efficient resource utilization. Each caching layer improves query performance, minimizes cluster usage, and optimizes resource utilization for a seamless data warehouse experience. \nCaching provides numerous advantages in data warehouses, including: \n* **Speed**: By storing query results or frequently accessed data in memory or other fast storage mediums, caching can dramatically reduce query execution times. 
This storage is particularly beneficial for repetitive queries, as the system can quickly retrieve the cached results instead of recomputing them.\n* **Reduced cluster usage**: Caching minimizes the need for additional compute resources by reusing previously computed results. This reduces the overall warehouse uptime and the demand for additional compute clusters, leading to cost savings and better resource allocation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-caching.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query caching\n##### Types of query caches in Databricks SQL\n\nDatabricks SQL performs several types of query caching. \n![query caches](https:\/\/docs.databricks.com\/_images\/query-cache.png) \n* **Databricks SQL UI cache**: Per user caching of all query and dashboard results in the Databricks SQL [UI](https:\/\/docs.databricks.com\/sql\/user\/queries\/index.html). When users first open a dashboard or SQL query, the Databricks SQL UI cache displays the most recent query result, including the results from scheduled executions. \nThe Databricks SQL UI cache has at most a 7-day life cycle. The cache is located within your Databricks filesystem in your account. You can delete query results by re-running the query that you no longer want to be stored. Once re-run, the old query results are removed from cache. Additionally, the cache is invalidated once the underlying tables have been updated.\n* **Result cache**: Per cluster caching of query results for all queries through SQL warehouses. Result caching includes both local and remote result caches, which work together to improve query performance by storing query results in memory or remote storage mediums. \n+ **Local cache**: The local cache is an in-memory cache that stores query results for the cluster\u2019s lifetime or until the cache is full, whichever comes first. 
This cache is useful for speeding up repetitive queries, eliminating the need to recompute the same results. However, once the cluster is stopped or restarted, the cache is cleaned and all query results are removed.\n+ **Remote result cache**: The remote result cache is a serverless-only cache system that retains query results by persisting them in cloud storage. As a result, this cache is not invalidated by the stopping or restarting of a SQL warehouse. Remote result cache addresses a common pain point in caching query results in-memory, which only remains available as long as the compute resources are running. The remote cache is a persistent shared cache across all warehouses in a Databricks workspace. \nAccessing the remote result cache requires a running warehouse. When processing a query, a cluster first looks in its local cache and then looks in the remote result cache if necessary. Only if the query result isn\u2019t cached in either cache is the query executed. Both the local and the remote caches have a life cycle of 24 hours, which starts at cache entry. The remote result cache persists through the stopping or restarting of a SQL warehouse. Both caches are invalidated when the underlying tables are updated. \nRemote result cache is available for queries using ODBC \/ JDBC clients and SQL Statement API. \nTo disable query result caching, you can run `SET use_cached_result = false` in the SQL editor. \nImportant \nYou should use this option only in testing or benchmarking.\n* **Disk cache**: Local SSD caching for data read from data storage for queries through SQL warehouses. The disk cache is designed to enhance query performance by storing data on disk, allowing for accelerated data reads. Data is automatically cached when files are fetched, utilizing a fast intermediate format. 
By storing copies of the files on the local storage attached to compute nodes, the disk cache ensures the data is located closer to the workers, resulting in improved query performance. See [Optimize performance with caching on Databricks](https:\/\/docs.databricks.com\/optimizations\/disk-cache.html). \nIn addition to its primary function, the disk cache automatically detects changes to the underlying data files. When it detects changes, the cache is invalidated. The disk cache shares the same lifecycle characteristics as the local result cache. This means that when the cluster is stopped or restarted, the cache is cleaned and needs to be repopulated. \nThe query results caching and [disk cache](https:\/\/docs.databricks.com\/optimizations\/disk-cache.html) affect queries in the Databricks SQL [UI](https:\/\/docs.databricks.com\/sql\/user\/queries\/index.html) and [BI and other external clients](https:\/\/docs.databricks.com\/integrations\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-caching.html"} +{"content":"# \n### Additional reference material\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/details\/index.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### External models in Databricks Model Serving\n##### Tutorial: Create external model endpoints to query OpenAI models\n\nThis article provides step-by-step instructions for configuring and querying an external model endpoint that serves OpenAI models for completions, chat, and embeddings using the [MLflow Deployments SDK](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.deployments.html). 
Learn more about [external models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html).\n\n##### Tutorial: Create external model endpoints to query OpenAI models\n###### Requirements\n\n* Databricks Runtime 13.0 ML or above.\n* MLflow 2.9 or above.\n* OpenAI API keys.\n* Databricks CLI version 0.205 or above.\n\n##### Tutorial: Create external model endpoints to query OpenAI models\n###### Step 1: Store the OpenAI API key using the Databricks Secrets CLI\n\nYou can store the OpenAI API key using the Databricks Secrets CLI (version 0.205 and above). You can also use the [REST API for secrets](https:\/\/docs.databricks.com\/api\/workspace\/secrets\/putsecret). \nThe following creates a secret scope named `my_openai_secret_scope`, and then creates the secret `openai_api_key` in that scope. \n```\ndatabricks secrets create-scope my_openai_secret_scope\ndatabricks secrets put-secret my_openai_secret_scope openai_api_key\n\n```\n\n##### Tutorial: Create external model endpoints to query OpenAI models\n###### Step 2: Install MLflow with external models support\n\nUse the following to install an MLflow version with external models support: \n```\n%pip install mlflow[genai]>=2.9.0\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/external-models\/external-models-tutorial.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### External models in Databricks Model Serving\n##### Tutorial: Create external model endpoints to query OpenAI models\n###### Step 3: Create and manage an external model endpoint\n\nImportant \nThe code examples in this section demonstrate usage of the [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html) MLflow Deployments CRUD SDK. \nYou can create an external model endpoint for each large language model (LLM) you want to use with the `create_endpoint()` method from the MLflow Deployments SDK. 
The following code snippet creates a completions endpoint for OpenAI `gpt-3.5-turbo-instruct`, as specified in the `served_entities` section of the configuration. \n```\nimport mlflow.deployments\n\nclient = mlflow.deployments.get_deploy_client(\"databricks\")\nclient.create_endpoint(\nname=\"openai-completions-endpoint\",\nconfig={\n\"served_entities\": [{\n\"name\": \"openai-completions\",\n\"external_model\": {\n\"name\": \"gpt-3.5-turbo-instruct\",\n\"provider\": \"openai\",\n\"task\": \"llm\/v1\/completions\",\n\"openai_config\": {\n\"openai_api_key\": \"{{secrets\/my_openai_secret_scope\/openai_api_key}}\"\n}\n}\n}]\n}\n)\n\n``` \nIf you are using Azure OpenAI, you can also specify the Azure OpenAI deployment name, endpoint URL, and API version in the\n`openai_config` section of the configuration. \n```\nclient.create_endpoint(\nname=\"openai-completions-endpoint\",\nconfig={\n\"served_entities\": [\n{\n\"name\": \"openai-completions\",\n\"external_model\": {\n\"name\": \"gpt-3.5-turbo-instruct\",\n\"provider\": \"openai\",\n\"task\": \"llm\/v1\/completions\",\n\"openai_config\": {\n\"openai_api_type\": \"azure\",\n\"openai_api_key\": \"{{secrets\/my_openai_secret_scope\/openai_api_key}}\",\n\"openai_api_base\": \"https:\/\/my-azure-openai-endpoint.openai.azure.com\",\n\"openai_deployment_name\": \"my-gpt-35-turbo-deployment\",\n\"openai_api_version\": \"2023-05-15\"\n},\n},\n}\n],\n},\n)\n\n``` \nTo update an endpoint, use `update_endpoint()`. The following code snippet demonstrates how to update an endpoint\u2019s rate limits to 20 calls per minute per user. 
\n```\nclient.update_endpoint(\nendpoint=\"openai-completions-endpoint\",\nconfig={\n\"rate_limits\": [\n{\n\"key\": \"user\",\n\"renewal_period\": \"minute\",\n\"calls\": 20\n}\n],\n},\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/external-models\/external-models-tutorial.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### External models in Databricks Model Serving\n##### Tutorial: Create external model endpoints to query OpenAI models\n###### Step 4: Send requests to an external model endpoint\n\nImportant \nThe code examples in this section demonstrate usage of the MLflow Deployments SDK\u2019s `predict()` method. \nYou can send chat, completions, and embeddings requests to an external model endpoint using the MLflow Deployments SDK\u2019s `predict()` method. \nThe following sends a request to `gpt-3.5-turbo-instruct` hosted by OpenAI. \n```\ncompletions_response = client.predict(\nendpoint=\"openai-completions-endpoint\",\ninputs={\n\"prompt\": \"What is the capital of France?\",\n\"temperature\": 0.1,\n\"max_tokens\": 10,\n\"n\": 2\n}\n)\ncompletions_response == {\n\"id\": \"cmpl-8QW0hdtUesKmhB3a1Vel6X25j2MDJ\",\n\"object\": \"text_completion\",\n\"created\": 1701330267,\n\"model\": \"gpt-3.5-turbo-instruct\",\n\"choices\": [\n{\n\"text\": \"The capital of France is Paris.\",\n\"index\": 0,\n\"finish_reason\": \"stop\",\n\"logprobs\": None\n},\n{\n\"text\": \"Paris is the capital of France\",\n\"index\": 1,\n\"finish_reason\": \"stop\",\n\"logprobs\": None\n},\n],\n\"usage\": {\n\"prompt_tokens\": 7,\n\"completion_tokens\": 16,\n\"total_tokens\": 23\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/external-models\/external-models-tutorial.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### External models in Databricks Model Serving\n##### Tutorial: Create external model endpoints to query OpenAI models\n###### 
Step 5: Compare models from a different provider\n\nModel serving supports many external model providers including OpenAI, Anthropic, Cohere, Amazon Bedrock, Google Cloud Vertex AI, and more. You can compare LLMs across providers, helping you optimize the accuracy, speed, and cost of your applications using the [AI Playground](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html#playground). \nThe following example creates an endpoint for Anthropic `claude-2` and compares its response to a question that uses OpenAI `gpt-3.5-turbo-instruct`. Both responses have the same standard format, which makes them easy to compare. \n### Create an endpoint for Anthropic claude-2 \n```\nimport mlflow.deployments\n\nclient = mlflow.deployments.get_deploy_client(\"databricks\")\n\nclient.create_endpoint(\nname=\"anthropic-completions-endpoint\",\nconfig={\n\"served_entities\": [\n{\n\"name\": \"claude-completions\",\n\"external_model\": {\n\"name\": \"claude-2\",\n\"provider\": \"anthropic\",\n\"task\": \"llm\/v1\/completions\",\n\"anthropic_config\": {\n\"anthropic_api_key\": \"{{secrets\/my_anthropic_secret_scope\/anthropic_api_key}}\"\n},\n},\n}\n],\n},\n)\n\n``` \n### Compare the responses from each endpoint \n```\n\nopenai_response = client.predict(\nendpoint=\"openai-completions-endpoint\",\ninputs={\n\"prompt\": \"How is Pi calculated? Be very concise.\"\n}\n)\nanthropic_response = client.predict(\nendpoint=\"anthropic-completions-endpoint\",\ninputs={\n\"prompt\": \"How is Pi calculated? Be very concise.\"\n}\n)\nopenai_response[\"choices\"] == [\n{\n\"text\": \"Pi is calculated by dividing the circumference of a circle by its diameter.\"\n\" This constant ratio of 3.14159... 
is then used to represent the relationship\"\n\" between a circle's circumference and its diameter, regardless of the size of the\"\n\" circle.\",\n\"index\": 0,\n\"finish_reason\": \"stop\",\n\"logprobs\": None\n}\n]\nanthropic_response[\"choices\"] == [\n{\n\"text\": \"Pi is calculated by approximating the ratio of a circle's circumference to\"\n\" its diameter. Common approximation methods include infinite series, infinite\"\n\" products, and computing the perimeters of polygons with more and more sides\"\n\" inscribed in or around a circle.\",\n\"index\": 0,\n\"finish_reason\": \"stop\",\n\"logprobs\": None\n}\n]\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/external-models\/external-models-tutorial.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### External models in Databricks Model Serving\n##### Tutorial: Create external model endpoints to query OpenAI models\n###### Additional resources\n\n[External models in Databricks Model Serving](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/generative-ai\/external-models\/external-models-tutorial.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### Recommendations for files in volumes and workspace files\n\nWhen you upload or save data or files to Databricks, you can choose to store these files using Unity Catalog volumes or workspace files. This article contains recommendations and requirements for using these locations. For more details on volumes and workspace files, see [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html) and [What are workspace files?](https:\/\/docs.databricks.com\/files\/workspace.html). \nDatabricks recommends using Unity Catalog volumes to store data, libraries, and build artifacts. Store notebooks, SQL queries, and code files as workspace files. 
You can configure workspace file directories as Git folders to sync with remote Git repositories. See [Git integration with Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html). Small data files used for test scenarios can also be stored as workspace files. \nThe tables below provide specific recommendations for files, depending on your type of file or feature needs. \nImportant \nThe Databricks File System (DBFS) is also available for file storage, but is not recommended, as all workspace users have access to files in DBFS. See [DBFS](https:\/\/docs.databricks.com\/dbfs\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/files-recommendations.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### Recommendations for files in volumes and workspace files\n##### File types\n\nThe following table provides storage recommendations for file types. Databricks supports many file formats beyond what are provided in this table as examples. \n| File type | Recommendation |\n| --- | --- |\n| Databricks objects, such as notebooks and queries | Store as workspace files |\n| Structured data files, such as Parquet files and ORC files | Store in Unity Catalog volumes |\n| Semi-structured data files, such as text files (`.csv`, `.txt`) and JSON files (`.json`) | Store in Unity Catalog volumes |\n| Unstructured data files, such as image files (`.png`, `.svg`), audio files (`.mp3`), and document files (`.pdf`, `.docx`) | Store in Unity Catalog volumes |\n| Raw data files used for adhoc or early data exploration | Store in Unity Catalog volumes |\n| Operational data, such as log files | Store in Unity Catalog volumes |\n| Large archive files, such as ZIP files (`.zip`) | Store in Unity Catalog volumes |\n| Source code files, such as Python files (`.py`), Java files (`.java`), and Scala files (`.scala`) | Store as workspace files, if applicable, with other related objects, such as notebooks and queries. 
Databricks recommends managing these files in a [Git folder](https:\/\/docs.databricks.com\/repos\/index.html) for version control and change tracking of these files. |\n| Build artifacts and libraries, such as Python wheels (`.whl`) and JAR files (`.jar`) | Store in Unity Catalog volumes |\n| Configuration files | Store configuration files needed across workspaces in Unity Catalog volumes, but store them as workspace files if they are project files in a [Git folder](https:\/\/docs.databricks.com\/repos\/index.html). |\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/files-recommendations.html"} +{"content":"# Databricks data engineering\n## Work with files on Databricks\n#### Recommendations for files in volumes and workspace files\n##### Feature comparison\n\nThe following table compares the feature offerings of [workspace files](https:\/\/docs.databricks.com\/files\/workspace.html) and Unity Catalog [volumes](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). \n| Feature | Workspace files | Unity Catalog volumes |\n| --- | --- | --- |\n| File access | Workspace files are only accessible to each other within the same workspace. | Files are globally accessible across workspaces. 
|\n| Programmatic access | Files can be accessed using:* Spark APIs * [FUSE](https:\/\/docs.databricks.com\/files\/index.html) * [dbutils](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#file-system-utility-dbutilsfs) * [REST API](https:\/\/docs.databricks.com\/api\/workspace\/workspace) * [Databricks SDKs](https:\/\/docs.databricks.com\/dev-tools\/index-sdk.html) * [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/fs-commands.html) | Files can be accessed using:* Spark APIs * [FUSE](https:\/\/docs.databricks.com\/files\/index.html) * [dbutils](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html#file-system-utility-dbutilsfs) * [REST API](https:\/\/docs.databricks.com\/api\/workspace\/files) * [Databricks SDKs](https:\/\/docs.databricks.com\/dev-tools\/index-sdk.html) * [Databricks SQL Connectors](https:\/\/docs.databricks.com\/dev-tools\/index-driver.html) * [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/fs-commands.html) * [Databricks Terraform Provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) |\n| Databricks Asset Bundles | By default, all files in a bundle, which includes libraries and Databricks objects such as notebooks and queries, are deployed securely as workspace files. Permissions are defined in the bundle configuration. | Bundles can be customized to include libraries already in volumes when the libraries exceed the size limit of workspace files. See [Databricks Asset Bundles library dependencies](https:\/\/docs.databricks.com\/dev-tools\/bundles\/library-dependencies.html). |\n| File permission level | Permissions are at the Git-folder level if the file is in a [Git folder](https:\/\/docs.databricks.com\/repos\/index.html), otherwise permissions are set at the file level. | Permissions are at the volume level. 
|\n| Permissions management | Permissions are managed by workspace [ACLs](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html) and are limited to the containing workspace. | Metadata and permissions are [managed by Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#data-permissions). These permissions are applicable across all workspaces that have access to the catalog. |\n| External storage mount | Does not support mounting external storage | Provides the option to point to pre-existing datasets on external storage by creating an external volume. See [Create an external volume](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html#create-an-external-volume). |\n| UDF support | Not supported | Writing from UDFs is supported using Volumes FUSE |\n| File size | Store smaller files less than 500MB, such as source code files (`.py`, `.md`, `.yml`) needed alongside notebooks. | Store very large data files at limits determined by cloud service providers. |\n| Upload & download | Support for upload and download up to 10MB. | Support for upload and download up to 5GB. |\n| Table creation support | Tables cannot be created with workspace files as the location. | Tables can be created from files in a volume by running `COPY INTO`, Auto Loader, or other options described in [Ingest data into a Databricks lakehouse](https:\/\/docs.databricks.com\/ingestion\/index.html). |\n| Directory structure & file paths | Files are organized in nested directories, each with its own permission model:* User home directories, one for each user and service principal in the workspace * Git folders * Shared | Files are organized in nested directories inside a volume. See [How can you access data in Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/paths.html#access-data). 
|\n| File history | Use [Git folder](https:\/\/docs.databricks.com\/repos\/index.html) within workspaces to track file changes. | Audit logs are available. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/files\/files-recommendations.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Apply watermarks to control data processing thresholds\n\nThis article introduces the basic concepts of watermarking and provides recommendations for using watermarks in common stateful streaming operations. You must apply watermarks to stateful streaming operations to avoid infinitely expanding the amount of data kept in state, which could introduce memory issues and increase processing latencies during long-running streaming operations.\n\n#### Apply watermarks to control data processing thresholds\n##### What is a watermark?\n\nStructured Streaming uses watermarks to control the threshold for how long to continue processing updates for a given state entity. Common examples of state entities include: \n* Aggregations over a time window.\n* Unique keys in a join between two streams. \nWhen you declare a watermark, you specify a timestamp field and a watermark threshold on a streaming DataFrame. As new data arrives, the state manager tracks the most recent timestamp in the specified field and processes all records within the lateness threshold. \nThe following example applies a 10 minute watermark threshold to a windowed count: \n```\nfrom pyspark.sql.functions import window\n\n(df\n.withWatermark(\"event_time\", \"10 minutes\")\n.groupBy(\nwindow(\"event_time\", \"5 minutes\"),\n\"id\")\n.count()\n)\n\n``` \nIn this example: \n* The `event_time` column is used to define a 10 minute watermark and a 5 minute tumbling window.\n* A count is collected for each `id` observed for each non-overlapping 5 minute window.\n* State information is maintained for each count until the end of the window is 10 minutes older than the latest observed `event_time`. 
\nImportant \nWatermark thresholds guarantee that records arriving within the specified threshold are processed according to the semantics of the defined query. Late-arriving records arriving outside the specified threshold might still be processed using query metrics, but this is not guaranteed.\n\n#### Apply watermarks to control data processing thresholds\n##### How do watermarks impact processing time and throughput?\n\nWatermarks interact with output modes to control when data is written to the sink. Because watermarks reduce the total amount of state information to be processed, effective use of watermarks is essential for efficient stateful streaming throughput. \nNote \nNot all output modes are supported for all stateful operations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/watermarks.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Apply watermarks to control data processing thresholds\n##### Watermarks and output mode for windowed aggregations\n\nThe following table details processing for queries with aggregation on a timestamp with a watermark defined: \n| Output mode | Behavior |\n| --- | --- |\n| Append | Rows are written to the target table once the watermark threshold has passed. All writes are delayed based on the lateness threshold. Old aggregation state is dropped once the threshold has passed. |\n| Update | Rows are written to the target table as results are calculated, and can be updated and overwritten as new data arrives. Old aggregation state is dropped once the threshold has passed. |\n| Complete | Aggregation state is not dropped. The target table is rewritten with each trigger. |\n\n#### Apply watermarks to control data processing thresholds\n##### Watermarks and output for stream-stream joins\n\nJoins between multiple streams only support append mode, and matched records are written in each batch they are discovered. 
For inner joins, Databricks recommends setting a watermark threshold on each streaming data source. This allows state information to be discarded for old records. Without watermarks, Structured Streaming attempts to join every key from both sides of the join with each trigger. \nStructured Streaming has special semantics to support outer joins. Watermarking is mandatory for outer joins, as it indicates when a key must be written with a null value after going unmatched. Note that while outer joins can be useful for capturing records that are never matched during data processing, because joins only write to tables as append operations, this missing data is not recorded until after the lateness threshold has passed.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/watermarks.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Apply watermarks to control data processing thresholds\n##### Control late data threshold with multiple watermark policy in Structured Streaming\n\nWhen working with multiple Structured Streaming inputs, you can set multiple watermarks to control tolerance thresholds for late-arriving data. Configuring watermarks allows you to control state information and impacts latency. \nA streaming query can have multiple input streams that are unioned or joined together. Each of the input streams can have a different threshold of late data that needs to be tolerated for stateful operations. Specify these thresholds using\n`withWatermark(\"eventTime\", delay)` on each of the input streams. The following is an example query with [stream-stream joins](https:\/\/databricks.com\/blog\/2018\/03\/13\/introducing-stream-stream-joins-in-apache-spark-2-3.html). \n```\nval inputStream1 = ... \/\/ delays up to 1 hour\nval inputStream2 = ... 
\/\/ delays up to 2 hours\n\ninputStream1.withWatermark(\"eventTime1\", \"1 hour\")\n.join(\ninputStream2.withWatermark(\"eventTime2\", \"2 hours\"),\njoinCondition)\n\n``` \nWhile running the query, Structured Streaming individually tracks the maximum event time seen in each input stream, calculates watermarks based on the corresponding delay, and chooses a single global watermark from them to be used for stateful operations. By default, the minimum is chosen as the global watermark because it ensures that no data is accidentally dropped as too late if one of the streams falls behind the others (for example, one of the streams stops receiving data due to upstream failures). In other words, the global watermark safely moves at the pace of the slowest stream and the query output is delayed accordingly. \nIf you want to get faster results, you can set the multiple watermark policy to choose the maximum value as the global watermark by setting the SQL configuration `spark.sql.streaming.multipleWatermarkPolicy` to `max` (default is `min`). This lets the global watermark move at the pace of the fastest stream. However, this configuration drops data from the slowest streams. Because of this, Databricks recommends that you use this configuration judiciously.\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/watermarks.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Apply watermarks to control data processing thresholds\n##### Drop duplicates within watermark\n\nIn Databricks Runtime 13.3 LTS and above, you can deduplicate records within a watermark threshold using a unique identifier. \nStructured Streaming provides exactly-once processing guarantees, but does not automatically deduplicate records from data sources. You can use `dropDuplicatesWithinWatermark` to deduplicate records on any specified field, allowing you to remove duplicates from a stream even if some fields differ (such as event time or arrival time). 
\nDuplicate records that arrive within the specified watermark are guaranteed to be dropped. This guarantee is strict in only one direction, and duplicate records that arrive outside of the specified threshold might also be dropped. To remove all duplicates, set the watermark delay threshold longer than the maximum timestamp difference among duplicate events. \nYou must specify a watermark to use the `dropDuplicatesWithinWatermark` method, as in the following example: \n```\nstreamingDf = spark.readStream. ...\n\n# deduplicate using guid column with watermark based on eventTime column\n(streamingDf\n.withWatermark(\"eventTime\", \"10 hours\")\n.dropDuplicatesWithinWatermark(\"guid\")\n)\n\n``` \n```\nval streamingDf = spark.readStream. ... \/\/ columns: guid, eventTime, ...\n\n\/\/ deduplicate using guid column with watermark based on eventTime column\nstreamingDf\n.withWatermark(\"eventTime\", \"10 hours\")\n.dropDuplicatesWithinWatermark(\"guid\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/watermarks.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n#### Load data with Delta Live Tables\n###### Use Azure Event Hubs as a Delta Live Tables data source\n\nThis article explains how to use Delta Live Tables to process messages from Azure Event Hubs. 
You cannot use the [Structured Streaming Event Hubs connector](https:\/\/github.com\/Azure\/azure-event-hubs-spark) because this library is not available as part of Databricks Runtime, and Delta Live Tables [does not allow you to use third-party JVM libraries](https:\/\/docs.databricks.com\/delta-live-tables\/external-dependencies.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/event-hubs.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n#### Load data with Delta Live Tables\n###### Use Azure Event Hubs as a Delta Live Tables data source\n####### How can Delta Live Tables connect to Azure Event Hubs?\n\nAzure Event Hubs provides an endpoint compatible with Apache Kafka that you can use with the [Structured Streaming Kafka connector](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-kafka-integration.html), available in Databricks Runtime, to process messages from Azure Event Hubs. For more information about Azure Event Hubs and Apache Kafka compatibility, see [Use Azure Event Hubs from Apache Kafka applications](https:\/\/learn.microsoft.com\/azure\/event-hubs\/event-hubs-for-kafka-ecosystem-overview). \nThe following steps describe connecting a Delta Live Tables pipeline to an existing Event Hubs instance and consuming events from a topic. To complete these steps, you need the following Event Hubs connection values: \n* The name of the Event Hubs namespace.\n* The name of the Event Hub instance in the Event Hubs namespace.\n* A shared access policy name and policy key for Event Hubs. By default, a `RootManageSharedAccessKey` policy is created for each Event Hubs namespace. This policy has `manage`, `send`, and `listen` permissions. If your pipeline only reads from Event Hubs, Databricks recommends creating a new policy with `listen` permission only. 
\nFor more information about the Event Hubs connection string, see [Get an Event Hubs connection string](https:\/\/learn.microsoft.com\/azure\/event-hubs\/event-hubs-get-connection-string). \nNote \n* Azure Event Hubs provides both OAuth 2.0 and shared access signature (SAS) options to authorize access to your secure resources. These instructions use SAS-based authentication.\n* If you get the Event Hubs connection string from the Azure portal, it may not contain the `EntityPath` value. The `EntityPath` value is required only when using the Structured Streaming Event Hub connector. Using the Structured Streaming Kafka Connector requires providing only the topic name.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/event-hubs.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n#### Load data with Delta Live Tables\n###### Use Azure Event Hubs as a Delta Live Tables data source\n####### Store the policy key in a Databricks secret\n\nBecause the policy key is sensitive information, Databricks recommends not hardcoding the value in your pipeline code. Instead, use Databricks secrets to store and manage access to the key. \nThe following example uses the Databricks CLI to create a secret scope and store the key in that secret scope. In your pipeline code, use the `dbutils.secrets.get()` function with the `scope-name` and `shared-policy-name` to retrieve the key value. 
\n```\ndatabricks --profile <profile-name> secrets create-scope <scope-name>\n\ndatabricks --profile <profile-name> secrets put-secret <scope-name> <shared-policy-name> --string-value <shared-policy-key>\n\n``` \nFor more information on Databricks secrets, see [Secret management](https:\/\/docs.databricks.com\/security\/secrets\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/event-hubs.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n#### Load data with Delta Live Tables\n###### Use Azure Event Hubs as a Delta Live Tables data source\n####### Create a notebook and add the pipeline code to consume events\n\nThe following example reads IoT events from a topic, but you can adapt the example for the requirements of your application. As a best practice, Databricks recommends using the Delta Live Tables pipeline settings to configure application variables. Your pipeline code then uses the `spark.conf.get()` function to retrieve values. For more information on using pipeline settings to parameterize your pipeline, see [Parameterize pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#parameterize-pipelines). 
\n```\nimport dlt\nimport pyspark.sql.types as T\nfrom pyspark.sql.functions import *\n\n# Event Hubs configuration\nEH_NAMESPACE = spark.conf.get(\"iot.ingestion.eh.namespace\")\nEH_NAME = spark.conf.get(\"iot.ingestion.eh.name\")\n\nEH_CONN_SHARED_ACCESS_KEY_NAME = spark.conf.get(\"iot.ingestion.eh.accessKeyName\")\nSECRET_SCOPE = spark.conf.get(\"io.ingestion.eh.secretsScopeName\")\nEH_CONN_SHARED_ACCESS_KEY_VALUE = dbutils.secrets.get(scope = SECRET_SCOPE, key = EH_CONN_SHARED_ACCESS_KEY_NAME)\n\nEH_CONN_STR = f\"Endpoint=sb:\/\/{EH_NAMESPACE}.servicebus.windows.net\/;SharedAccessKeyName={EH_CONN_SHARED_ACCESS_KEY_NAME};SharedAccessKey={EH_CONN_SHARED_ACCESS_KEY_VALUE}\"\n# Kafka Consumer configuration\n\nKAFKA_OPTIONS = {\n\"kafka.bootstrap.servers\" : f\"{EH_NAMESPACE}.servicebus.windows.net:9093\",\n\"subscribe\" : EH_NAME,\n\"kafka.sasl.mechanism\" : \"PLAIN\",\n\"kafka.security.protocol\" : \"SASL_SSL\",\n\"kafka.sasl.jaas.config\" : f\"kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\\\"$ConnectionString\\\" password=\\\"{EH_CONN_STR}\\\";\",\n\"kafka.request.timeout.ms\" : spark.conf.get(\"iot.ingestion.kafka.requestTimeout\"),\n\"kafka.session.timeout.ms\" : spark.conf.get(\"iot.ingestion.kafka.sessionTimeout\"),\n\"maxOffsetsPerTrigger\" : spark.conf.get(\"iot.ingestion.spark.maxOffsetsPerTrigger\"),\n\"failOnDataLoss\" : spark.conf.get(\"iot.ingestion.spark.failOnDataLoss\"),\n\"startingOffsets\" : spark.conf.get(\"iot.ingestion.spark.startingOffsets\")\n}\n\n# PAYLOAD SCHEMA\npayload_ddl = \"\"\"battery_level BIGINT, c02_level BIGINT, cca2 STRING, cca3 STRING, cn STRING, device_id BIGINT, device_name STRING, humidity BIGINT, ip STRING, latitude DOUBLE, lcd STRING, longitude DOUBLE, scale STRING, temp BIGINT, timestamp BIGINT\"\"\"\npayload_schema = T._parse_datatype_string(payload_ddl)\n\n# Basic record parsing and adding ETL audit columns\ndef parse(df):\nreturn (df\n.withColumn(\"records\", 
col(\"value\").cast(\"string\"))\n.withColumn(\"parsed_records\", from_json(col(\"records\"), payload_schema))\n.withColumn(\"iot_event_timestamp\", expr(\"cast(from_unixtime(parsed_records.timestamp \/ 1000) as timestamp)\"))\n.withColumn(\"eh_enqueued_timestamp\", expr(\"timestamp\"))\n.withColumn(\"eh_enqueued_date\", expr(\"to_date(timestamp)\"))\n.withColumn(\"etl_processed_timestamp\", current_timestamp())\n.withColumn(\"etl_rec_uuid\", expr(\"uuid()\"))\n.drop(\"records\", \"value\", \"key\")\n)\n\n@dlt.create_table(\ncomment=\"Raw IOT Events\",\ntable_properties={\n\"quality\": \"bronze\",\n\"pipelines.reset.allowed\": \"false\" # preserves the data in the delta table if you do full refresh\n},\npartition_cols=[\"eh_enqueued_date\"]\n)\n@dlt.expect(\"valid_topic\", \"topic IS NOT NULL\")\n@dlt.expect(\"valid records\", \"parsed_records IS NOT NULL\")\ndef iot_raw():\nreturn (\nspark.readStream\n.format(\"kafka\")\n.options(**KAFKA_OPTIONS)\n.load()\n.transform(parse)\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/event-hubs.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Load and transform data with Delta Live Tables\n#### Load data with Delta Live Tables\n###### Use Azure Event Hubs as a Delta Live Tables data source\n####### Create the pipeline\n\nCreate a new pipeline with the following settings, replacing the placeholder values with appropriate values for your environment. 
\n```\n{\n\"clusters\": [\n{\n\"spark_conf\": {\n\"spark.hadoop.fs.azure.account.key.<storage-account-name>.dfs.core.windows.net\": \"{{secrets\/<scope-name>\/<secret-name>}}\"\n},\n\"num_workers\": 4\n}\n],\n\"development\": true,\n\"continuous\": false,\n\"channel\": \"CURRENT\",\n\"edition\": \"ADVANCED\",\n\"photon\": false,\n\"libraries\": [\n{\n\"notebook\": {\n\"path\": \"<path-to-notebook>\"\n}\n}\n],\n\"name\": \"dlt_eventhub_ingestion_using_kafka\",\n\"storage\": \"abfss:\/\/<container-name>@<storage-account-name>.dfs.core.windows.net\/iot\/\",\n\"configuration\": {\n\"iot.ingestion.eh.namespace\": \"<eh-namespace>\",\n\"iot.ingestion.eh.accessKeyName\": \"<eh-policy-name>\",\n\"iot.ingestion.eh.name\": \"<eventhub>\",\n\"io.ingestion.eh.secretsScopeName\": \"<secret-scope-name>\",\n\"iot.ingestion.spark.maxOffsetsPerTrigger\": \"50000\",\n\"iot.ingestion.spark.startingOffsets\": \"latest\",\n\"iot.ingestion.spark.failOnDataLoss\": \"false\",\n\"iot.ingestion.kafka.requestTimeout\": \"60000\",\n\"iot.ingestion.kafka.sessionTimeout\": \"30000\"\n},\n\"target\": \"<target-database-name>\"\n}\n\n``` \nReplace \n* `<container-name>` with the name of an Azure storage account container.\n* `<storage-account-name>` with the name of an ADLS Gen2 storage account.\n* `<eh-namespace>` with the name of your Event Hubs namespace.\n* `<eh-policy-name>` with the secret scope key for the Event Hubs policy key.\n* `<eventhub>` with the name of your Event Hubs instance.\n* `<secret-scope-name>` with the name of the Databricks secret scope that contains the Event Hubs policy key. \nAs a best practice, this pipeline doesn\u2019t use the default DBFS storage path but instead uses an Azure Data Lake Storage Gen2 (ADLS Gen2) storage account. 
For more information on configuring authentication for an ADLS Gen2 storage account, see [Securely access storage credentials with secrets in a pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/load.html#configure-secrets).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/event-hubs.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### What is Databricks Feature Serving?\n##### Tutorial: Deploy and query a feature serving endpoint\n\nThis article shows how to deploy and query a feature serving endpoint in a step-by-step process. This article uses the Databricks SDK. Some steps can also be completed using the REST API or the Databricks UI and include references to the documentation for those methods. \nIn this example, you have a table of cities with their locations (latitude and longitude) and a recommender app that takes into account the user\u2019s current distance from those cities. Because the user\u2019s location changes constantly, the distance between the user and each city must be calculated at the time of inference. This tutorial illustrates how to perform those calculations with low latency using Databricks Online Tables and Databricks Feature Serving. For the full set of example code, see the [example notebook](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-serving-tutorial.html#example).\n\n##### Tutorial: Deploy and query a feature serving endpoint\n###### Step 1. Create the source table\n\nThe source table contains precomputed feature values and can be any Delta table in Unity Catalog with a primary key. In this example, the table contains a list of cities with their latitude and longitude. The primary key is `destination_id`. Sample data is shown below. 
\n| name | destination\\_id (pk) | latitude | longitude |\n| --- | --- | --- | --- |\n| Nashville, Tennessee | 0 | 36.162663 | -86.7816 |\n| Honolulu, Hawaii | 1 | 21.309885 | -157.85814 |\n| Las Vegas, Nevada | 2 | 36.171562 | -115.1391 |\n| New York, New York | 3 | 40.712776 | -74.005974 |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-serving-tutorial.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### What is Databricks Feature Serving?\n##### Tutorial: Deploy and query a feature serving endpoint\n###### Step 2. Create an online table\n\nAn online table is a read-only copy of a Delta Table that is optimized for online access. For more information, see [Use online tables for real-time feature serving](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html). \nTo create an online table, you can use the UI [Create an online table using the UI](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-tables.html#create-an-online-table-using-the-ui), the [REST API](https:\/\/docs.databricks.com\/api\/workspace\/onlinetables\/create), or the Databricks SDK, as in the following example: \n```\nfrom pprint import pprint\nfrom databricks.sdk import WorkspaceClient\nfrom databricks.sdk.service.catalog import *\nimport mlflow\n\nworkspace = WorkspaceClient()\n\n# Create an online table\nfeature_table_name = f\"main.on_demand_demo.location_features\"\nonline_table_name=f\"main.on_demand_demo.location_features_online\"\n\nspec = OnlineTableSpec(\nprimary_key_columns=[\"destination_id\"],\nsource_table_full_name = feature_table_name,\nrun_triggered=OnlineTableSpecTriggeredSchedulingPolicy.from_dict({'triggered': 'true'}),\nperform_full_copy=True)\n\n# ignore \"already exists\" error\ntry:\nonline_table_pipeline = workspace.online_tables.create(name=online_table_name, spec=spec)\nexcept Exception as e:\nif \"already exists\" in 
str(e):\npass\nelse:\nraise e\n\npprint(workspace.online_tables.get(online_table_name))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-serving-tutorial.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### What is Databricks Feature Serving?\n##### Tutorial: Deploy and query a feature serving endpoint\n###### Step 3. Create a function in Unity Catalog\n\nIn this example, the function calculates the distance between the destination (whose location does not change) and the user (whose location changes frequently and is not known until the time of inference). \n```\n# Define the function. This function calculates the distance between two locations.\nfunction_name = f\"main.on_demand_demo.distance\"\n\nspark.sql(f\"\"\"\nCREATE OR REPLACE FUNCTION {function_name}(latitude DOUBLE, longitude DOUBLE, user_latitude DOUBLE, user_longitude DOUBLE)\nRETURNS DOUBLE\nLANGUAGE PYTHON AS\n$$\nimport math\nlat1 = math.radians(latitude)\nlon1 = math.radians(longitude)\nlat2 = math.radians(user_latitude)\nlon2 = math.radians(user_longitude)\n\n# Earth's radius in kilometers\nradius = 6371\n\n# Haversine formula\ndlat = lat2 - lat1\ndlon = lon2 - lon1\na = math.sin(dlat\/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon\/2)**2\nc = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))\ndistance = radius * c\n\nreturn distance\n$$\"\"\")\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-serving-tutorial.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### What is Databricks Feature Serving?\n##### Tutorial: Deploy and query a feature serving endpoint\n###### Step 4. Create a feature spec in Unity Catalog\n\nThe feature spec specifies the features that the endpoint serves and their lookup keys. It also specifies any required functions to apply to the retrieved features with their bindings. 
For details, see [Create a FeatureSpec](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html#create-a-featurespec). \n```\nfrom databricks.feature_engineering import FeatureLookup, FeatureFunction, FeatureEngineeringClient\n\nfe = FeatureEngineeringClient()\n\nfeatures=[\nFeatureLookup(\ntable_name=feature_table_name,\nlookup_key=\"destination_id\"\n),\nFeatureFunction(\nudf_name=function_name,\noutput_name=\"distance\",\ninput_bindings={\n\"latitude\": \"latitude\",\n\"longitude\": \"longitude\",\n\"user_latitude\": \"user_latitude\",\n\"user_longitude\": \"user_longitude\"\n},\n),\n]\n\nfeature_spec_name = f\"main.on_demand_demo.travel_spec\"\n\n# The following code ignores errors raised if a feature_spec with the specified name already exists.\ntry:\nfe.create_feature_spec(name=feature_spec_name, features=features, exclude_columns=None)\nexcept Exception as e:\nif \"already exists\" in str(e):\npass\nelse:\nraise e\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-serving-tutorial.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### What is Databricks Feature Serving?\n##### Tutorial: Deploy and query a feature serving endpoint\n###### Step 5. Create a feature serving endpoint\n\nTo create a feature serving endpoint, you can use the UI [Create an endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html#create-an-endpoint), the [REST API](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/create), or the Databricks SDK, shown here. \nThe feature serving endpoint takes the `feature_spec` that you created in Step 4 as a parameter. 
\n```\nfrom databricks.sdk import WorkspaceClient\nfrom databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput\n\n# Create endpoint\nendpoint_name = \"fse-location\"\n\nstatus = workspace.serving_endpoints.create_and_wait(\nname=endpoint_name,\nconfig = EndpointCoreConfigInput(\nserved_entities=[\nServedEntityInput(\nentity_name=feature_spec_name,\nscale_to_zero_enabled=True,\nworkload_size=\"Small\"\n)\n]\n)\n)\nprint(status)\n\n# Get the status of the endpoint\nstatus = workspace.serving_endpoints.get(name=endpoint_name)\nprint(status)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-serving-tutorial.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### What is Databricks Feature Serving?\n##### Tutorial: Deploy and query a feature serving endpoint\n###### Step 6. Query the feature serving endpoint\n\nWhen you query the endpoint, you provide the primary key and optionally any context data that the function uses. In this example, the function takes as input the user\u2019s current location (latitude and longitude). Because the user\u2019s location is constantly changing, it must be provided to the function at inference time as a context feature. \nYou can also query the endpoint using the UI [Query an endpoint using the UI](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-function-serving.html#query-an-endpoint-using-the-ui) or the [REST API](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/query). \nFor simplicity, this example only calculates the distance to two cities. A more realistic scenario might calculate the user\u2019s distance from each location in the feature table to determine which cities to recommend. 
\n```\nimport mlflow.deployments\n\nclient = mlflow.deployments.get_deploy_client(\"databricks\")\nresponse = client.predict(\nendpoint=endpoint_name,\ninputs={\n\"dataframe_records\": [\n{\"destination_id\": 1, \"user_latitude\": 37, \"user_longitude\": -122},\n{\"destination_id\": 2, \"user_latitude\": 37, \"user_longitude\": -122},\n]\n},\n)\n\npprint(response)\n\n```\n\n##### Tutorial: Deploy and query a feature serving endpoint\n###### Example notebook\n\nSee this notebook for a complete illustration of the steps: \n### Feature Serving example notebook with online tables \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/machine-learning\/feature-function-serving-online-tables-dbsdk.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-serving-tutorial.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n### What is Databricks Feature Serving?\n##### Tutorial: Deploy and query a feature serving endpoint\n###### Additional information\n\nFor details about using the feature engineering Python API, see [the reference documentation](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/feature-serving-tutorial.html"} +{"content":"# Technology partners\n### Get connection details for a Databricks compute resource\n\nTo connect a participating app, tool, SDK, or API to a Databricks compute resource such as a Databricks cluster or a Databricks SQL warehouse, you must provide specific information about that cluster or SQL warehouse so that the connection can be made successfully. \nTo get the connection details for a Databricks [cluster](https:\/\/docs.databricks.com\/compute\/configure.html), do the following: \n1. Log in to your Databricks workspace.\n2. 
In the sidebar, click **Compute**.\n3. In the list of available clusters, click the target cluster\u2019s name.\n4. On the **Configuration** tab, expand **Advanced options**.\n5. Click the **JDBC\/ODBC** tab.\n6. Copy the connection details that you need, such as **Server Hostname**, **Port**, and **HTTP Path**. \nTo get the connection details for a Databricks SQL [warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html), do the following: \n1. Log in to your Databricks workspace.\n2. In the sidebar, click **SQL > SQL Warehouses**.\n3. In the list of available warehouses, click the target warehouse\u2019s name.\n4. On the **Connection Details** tab, copy the connection details that you need, such as **Server hostname**, **Port**, and **HTTP path**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/compute-details.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/testing.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Testing the Databricks JDBC Driver\n\nThis article describes how to test code that uses the [Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/index.html). \nTo test code that uses the Databricks JDBC Driver along with a collection of connection properties, you can use any test framework for a programming language that supports JDBC. For instance, the following Java code example uses [JUnit](https:\/\/junit.org\/) and [Mockito](https:\/\/site.mockito.org\/) to automate and test the Databricks JDBC Driver against a collection of connection properties. 
This example code is based on the example code in [Authentication settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html). \nThe following example code file named `Helpers.java` contains several functions that automate the Databricks JDBC Driver against a collection of connection properties: \n* The `CreateConnection` function uses a collection of connection properties to open a connection through a Databricks compute resource.\n* The `SelectNYCTaxis` function uses the connection to select the specified number of data rows from the `trips` table in the `samples` catalog\u2019s `nyctaxi` schema.\n* The `PrintResultSet` function prints the data rows\u2019 content to the screen. \n```\n\/\/ Helpers.java\n\npackage org.example;\n\nimport java.sql.*;\nimport java.util.Properties;\n\npublic class Helpers {\nstatic Connection CreateConnection(\nString url,\nProperties p\n) throws SQLException {\nConnection conn = DriverManager.getConnection(url, p);\nreturn conn;\n}\n\nstatic ResultSet SelectNYCTaxis(\nConnection conn,\nlong rows\n) throws SQLException {\nStatement stmt = conn.createStatement();\nResultSet rs = stmt.executeQuery(\"SELECT * FROM samples.nyctaxi.trips LIMIT \" + rows);\nreturn rs;\n}\n\nstatic void PrintResultSet(ResultSet rs) throws SQLException {\nResultSetMetaData md = rs.getMetaData();\nString[] columns = new String[md.getColumnCount()];\nfor (int i = 0; i < columns.length; i++) {\ncolumns[i] = md.getColumnName(i + 1);\n}\nwhile (rs.next()) {\nSystem.out.print(\"Row \" + rs.getRow() + \"=[\");\nfor (int i = 0; i < columns.length; i++) {\nif (i != 0) {\nSystem.out.print(\", \");\n}\nSystem.out.print(columns[i] + \"='\" + rs.getObject(i + 1) + \"'\");\n}\nSystem.out.println(\"]\");\n}\n}\n}\n\n``` \nThe following example code file named `Main.class` calls the functions in the `Helpers.class` file: \n```\npackage org.example;\n\nimport java.sql.Connection;\nimport java.sql.ResultSet;\nimport 
java.sql.SQLException;\nimport java.util.Properties;\n\npublic class Main {\npublic static void main(String[] args) throws ClassNotFoundException, SQLException {\nClass.forName(\"com.databricks.client.jdbc.Driver\");\nString url = \"jdbc:databricks:\/\/\" + System.getenv(\"DATABRICKS_SERVER_HOSTNAME\") + \":443\";\nProperties p = new Properties();\np.put(\"httpPath\", System.getenv(\"DATABRICKS_HTTP_PATH\"));\np.put(\"AuthMech\", \"3\");\np.put(\"UID\", \"token\");\np.put(\"PWD\", System.getenv(\"DATABRICKS_TOKEN\"));\n\nConnection conn = Helpers.CreateConnection(url, p);\nResultSet rs = Helpers.SelectNYCTaxis(conn, 2);\nHelpers.PrintResultSet(rs);\n}\n}\n\n``` \nThe following example code file named `HelpersTest.class` uses JUnit to test the `SelectNYCTaxis` function in the `Helpers.class` file. Instead of using the time and cost of actual compute resources to call the function in the `Helpers.class` file, the following example code uses Mockito to simulate this call. Simulated calls such as this are typically completed in just a few seconds, increasing your confidence in the quality of your code while not changing the state of your existing Databricks accounts or workspaces. 
\n```\npackage org.example;\n\nimport java.sql.Connection;\nimport java.sql.ResultSet;\nimport java.sql.SQLException;\nimport java.sql.Statement;\nimport org.junit.jupiter.api.Test;\nimport org.mockito.Mockito;\nimport static org.junit.jupiter.api.Assertions.assertEquals;\n\npublic class HelpersTest {\n@Test\npublic void testSelectNYCTaxis() throws SQLException {\nConnection mockConnection = Mockito.mock(Connection.class);\nStatement mockStatement = Mockito.mock(Statement.class);\nResultSet mockResultSet = Mockito.mock(ResultSet.class);\n\nMockito.when(mockConnection.createStatement()).thenReturn(mockStatement);\nMockito.when(mockStatement.executeQuery(Mockito.anyString())).thenReturn(mockResultSet);\n\nResultSet rs = Helpers.SelectNYCTaxis(mockConnection, 2);\nassertEquals(mockResultSet, rs);\n}\n}\n\n``` \nBecause the `SelectNYCTaxis` function contains a `SELECT` statement and therefore does not change the state of the `trips` table, mocking is not absolutely required in this example. However, mocking enables you to quickly run your tests without waiting for an actual connection to be made with the compute resource. Also, mocking enables you to run simulated tests multiple times for functions that might change a table\u2019s state, such as `INSERT INTO`, `UPDATE`, and `DELETE FROM`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/testing.html"} +{"content":"# Security and compliance guide\n## Networking\n### Users to Databricks networking\n#### Manage IP access lists\n###### Configure IP access lists for workspaces\n\nThis article describes how to configure IP access lists for Databricks workspaces. This article discusses the most common tasks you can perform with the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). 
You can also use the [IP Access Lists API](https:\/\/docs.databricks.com\/api\/workspace\/ipaccesslists).\n\n###### Configure IP access lists for workspaces\n####### Requirements\n\n* This feature requires the [Enterprise pricing tier](https:\/\/www.databricks.com\/product\/aws-pricing). \n* IP access lists support only Internet Protocol version 4 (IPv4) addresses. \n* Any public IPs that the compute plane uses to access the control plane must either be added to an allow list or you must configure [back-end PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html). Otherwise, classic compute resources cannot launch. \nFor example, when you configure a customer-managed VPC, subnets must have outbound access to the public network using a NAT gateway or a similar approach. Those public IPs must be present in an allow list. See [Subnets](https:\/\/docs.databricks.com\/security\/network\/classic\/customer-managed-vpc.html#subnet). Alternatively, if you use a Databricks-managed VPC and you configure the managed NAT gateway to access public IPs, those IPs must be present in an allow list. 
For more information, see the [Databricks Community post](https:\/\/community.databricks.com\/t5\/product-platform-updates\/action-required-aws-update-outbound-connectivity-for-classic\/ba-p\/70061).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-workspace.html"} +{"content":"# Security and compliance guide\n## Networking\n### Users to Databricks networking\n#### Manage IP access lists\n###### Configure IP access lists for workspaces\n####### Check if your workspace has the IP access list feature enabled\n\nTo check if your workspace has the IP access list feature enabled: \n```\ndatabricks workspace-conf get-status enableIpAccessLists\n\n```\n\n###### Configure IP access lists for workspaces\n####### Enable or disable the IP access list feature for a workspace\n\nIn a JSON request body, specify `enableIpAccessLists` as `true` (enabled) or `false` (disabled). \n```\ndatabricks workspace-conf set-status --json '{\n\"enableIpAccessLists\": \"true\"\n}'\n\n```\n\n###### Configure IP access lists for workspaces\n####### Add an IP access list\n\nWhen the IP access lists feature is enabled and there are no allow lists or block lists for the workspace, all IP addresses are allowed. Adding IP addresses to the allow list blocks all IP addresses that are not on the list. Be sure to add to an allow list any public IPs that the compute plane uses to access the control plane. Review the changes carefully to avoid unintended access restrictions. \nIP access lists have a label, which is a name for the list, and a list type. The list type is either `ALLOW` (allow list) or `BLOCK` (block list, whose addresses are excluded even if they also appear in an allow list). 
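The allow and block semantics described above can be sketched in a few lines of Python. This is only an illustration of the evaluation order, not Databricks code: the function name `is_allowed` and its arguments are hypothetical, and the case where only block lists exist (every address not blocked is allowed) is an assumption that the text above does not state explicitly.

```python
import ipaddress


def is_allowed(ip, allow_entries, block_entries):
    """Hypothetical helper: decide whether `ip` passes the access lists.

    `allow_entries` and `block_entries` are lists of IPv4 addresses or
    CIDR ranges given as strings, for example "1.1.1.1" or "10.0.0.0/24".
    """
    addr = ipaddress.ip_address(ip)

    def matches(entries):
        # A bare address such as "1.1.1.1" is parsed as a /32 network.
        return any(addr in ipaddress.ip_network(e, strict=False) for e in entries)

    # A block list wins even if the address is also on an allow list.
    if matches(block_entries):
        return False
    # Assumption: with no allow list, every address that is not blocked is allowed.
    if not allow_entries:
        return True
    # Once an allow list exists, addresses that are not on it are blocked.
    return matches(allow_entries)


print(is_allowed("1.1.1.1", ["1.1.1.1"], []))
print(is_allowed("10.0.0.5", ["10.0.0.0/24"], ["10.0.0.5"]))
```

Running a check like this against a planned list before creating it is one way to confirm that the compute plane's public IPs would still be allowed.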
\nFor example, to add an allow list: \n```\ndatabricks ip-access-lists create --json '{\n\"label\": \"office\",\n\"list_type\": \"ALLOW\",\n\"ip_addresses\": [\n\"1.1.1.1\"\n]\n}'\n\n```\n\n###### Configure IP access lists for workspaces\n####### List IP access lists\n\n```\ndatabricks ip-access-lists list\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-workspace.html"} +{"content":"# Security and compliance guide\n## Networking\n### Users to Databricks networking\n#### Manage IP access lists\n###### Configure IP access lists for workspaces\n####### Update an IP access list\n\nSpecify at least one of the following values to update: \n* `label` \u2014Label for this list.\n* `list_type` \u2014Either `ALLOW` (allow list) or `BLOCK` (block list, which means exclude even if in allow list).\n* `ip_addresses` \u2014A JSON array of IP addresses and CIDR ranges, as String values.\n* `enabled` \u2014Specifies whether this list is enabled. Pass `true` or `false`. \nThe response is a copy of the object that you passed in with additional fields for the ID and modification dates. 
\nFor example, to disable a list: \n```\ndatabricks ip-access-lists update <list-id> --json '{\n\"enabled\": \"false\"\n}'\n\n```\n\n###### Configure IP access lists for workspaces\n####### Delete an IP access list\n\nTo delete an IP access list: \n```\ndatabricks ip-access-lists delete <list-id>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-workspace.html"} +{"content":"# AI and Machine Learning on Databricks\n## Prepare data and environment for ML and DL\n### Preprocess data for machine learning and deep learning\n##### Featurization for transfer learning\n\nThis article provides an example of doing featurization for transfer learning using pandas UDFs.\n\n##### Featurization for transfer learning\n###### Featurization for transfer learning in DL models\n\nDatabricks supports featurization with deep learning models. Pre-trained deep learning models can be used to compute features for use in other downstream models. Databricks supports featurization at scale, distributing the computation across a cluster. You can perform featurization with deep learning libraries included in [Databricks Runtime ML](https:\/\/docs.databricks.com\/machine-learning\/index.html), including TensorFlow and PyTorch. \nDatabricks also supports [transfer learning](https:\/\/en.wikipedia.org\/wiki\/Transfer_learning), a technique closely related to featurization. Transfer learning allows you to reuse knowledge from one problem domain in a related domain. Featurization is itself a simple and powerful method for transfer learning: computing features using a pre-trained deep learning model transfers knowledge about good features from the original domain.\n\n##### Featurization for transfer learning\n###### Steps to compute features for transfer learning\n\nThis article demonstrates how to compute features for transfer learning using a pre-trained TensorFlow model, using the following workflow: \n1. 
Start with a pre-trained deep learning model, in this case an image classification model from `tensorflow.keras.applications`.\n2. Truncate the last layer(s) of the model. The modified model produces a tensor of features as output, rather than a prediction.\n3. Apply that model to a new image dataset from a different problem domain, computing features for the images.\n4. Use these features to train a new model. The following notebook omits this final step. For examples of training a simple model such as logistic regression, see [Model training examples](https:\/\/docs.databricks.com\/machine-learning\/train-model\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/preprocess-data\/transfer-learning-tensorflow.html"} +{"content":"# AI and Machine Learning on Databricks\n## Prepare data and environment for ML and DL\n### Preprocess data for machine learning and deep learning\n##### Featurization for transfer learning\n###### Example: Use pandas UDFs for featurization\n\nThe following notebook uses [pandas UDFs](https:\/\/docs.databricks.com\/udf\/pandas.html) to perform the featurization step. pandas UDFs, and their newer variant [Scalar Iterator pandas UDFs](https:\/\/docs.databricks.com\/udf\/pandas.html#scalar-iterator-udfs), offer flexible APIs, support any deep learning library, and give high performance. 
\n### Featurization and transfer learning with TensorFlow \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/deep-learning-transfer-learning-keras.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/preprocess-data\/transfer-learning-tensorflow.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n### Set up query federation for non-Unity-Catalog workspaces\n##### Query federation for Azure Synapse in Databricks SQL (Experimental)\n\nExperimental \nThe configurations described in this article are [Experimental](https:\/\/docs.databricks.com\/release-notes\/release-types.html). Experimental features are provided as-is and are not supported by Databricks through customer technical support. **To get full query federation support, you should instead use [Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html), which enables your Databricks users to take advantage of Unity Catalog syntax and data governance tools.** \nThis article describes how to configure read-only query federation to Azure Synapse (SQL Data Warehouse) on Serverless and Pro SQL warehouses. \nFor information about configuring Synapse Azure Data Lake Storage Gen2 credentials, see [Query data in Azure Synapse Analytics](https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html). \nYou configure connections to Synapse at the table level. You can use [secrets](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/secret.html) to store and access text credentials without displaying them in plaintext. 
See the following example: \n```\nDROP TABLE IF EXISTS synapse_table;\nCREATE TABLE synapse_table\nUSING sqldw\nOPTIONS (\ndbtable '<table-name>',\ntempdir 'abfss:\/\/<your-container-name>@<your-storage-account-name>.dfs.core.windows.net\/<your-directory-name>',\nurl 'jdbc:sqlserver:\/\/<database-host-url>',\nuser secret('synapse_creds', 'my_username'),\npassword secret('synapse_creds', 'my_password'),\nforwardSparkAzureStorageCredentials 'true'\n);\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/synapse-no-uc.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Cohort options\n\nThis section covers the configuration options for cohort visualizations. For an example, see [cohort example](https:\/\/docs.databricks.com\/visualizations\/visualization-types.html#cohort).\n\n##### Cohort options\n###### Columns\n\nTo configure column options, click **Columns** and configure each of the following required settings: \n* **Date (bucket)**: The date that uniquely identifies a cohort. Suppose you\u2019re visualizing monthly user activity by sign-up date. Your cohort date for all users that signed up in January 2018 would be January 1, 2018. The cohort date for any user who signed up in February would be February 1, 2018.\n* **Stage**: A count of how many stages transpired since the cohort date as of this sample. If you are grouping users by sign-up month, then your stage will be the count of months since these users signed up. In the above example, a measurement of activity in July for users who signed up in January would yield a value of 7 because seven stages have transpired between January and July.\n* **Bucket population size**: The denominator to use to calculate the percentage of a cohort\u2019s target satisfaction for a given stage. Continuing the example above, if 72 users signed up in January then the bucket population size would be 72. 
When the visualization is rendered, the value would be displayed as `41.67%` (`30 \u00f7 72`).\n* **Stage value**: Your actual measurement of this cohort\u2019s performance in the given stage. In the above example, if 30 users who signed up in January showed activity in July then the stage value would be 30.\n\n##### Cohort options\n###### Options\n\nTo configure options, click **Options** and configure each of the following required settings: \n* **Time interval**: Lets you choose to define the cohort on either a daily, weekly, or monthly basis.\n* **Mode**: Options are **Fill gaps with zeros (default)** or **Show data as is**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/cohorts.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Cohort options\n###### Colors\n\nTo configure color options, click **Colors** and configure the following optional settings: \n* **min color**\n* **max color**\n* **steps**\n\n##### Cohort options\n###### Appearance\n\nTo configure appearance options, click **Appearance** and configure the following optional settings: \n* **Title column title**: Override the column name with a different display name.\n* **People column title**: Override the column name with a different display name.\n* **Stage column title**: Override the column name with a different display name.\n* **Number values format**: The format to use for labels for numeric values.\n* **Percent values format**: The format to use for labels for percentages.\n* **No value placeholder**: Default is `-`. Specify other value if desired.\n* **Show tooltips**: Tooltips are displayed by default. Clear checkbox to override.\n* **Normalize values to percentage**: Values normalized to percentage by default. 
Clear checkbox to override.\n\n##### Cohort options\n###### Cohort date notes\n\nEven if you define your cohorts by month or week, Databricks expects the values in your **Date** column to be a full date value. If you are grouping by month, `2018-01-18` should be shortened to `2018-01-01` or any other full date in January, not `2018-01`. \nThe cohort visualizer converts all date and time values to GMT before rendering. To avoid rendering issues, you should adjust the date times returned from your database by your local UTC offset.\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/cohorts.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n#### Provisioned throughput Foundation Model APIs\n###### What do tokens per second ranges in provisioned throughput mean?\n\nThis article describes how and why Databricks measures tokens per second for provisioned throughput workloads for [Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html). \nPerformance for large language models (LLMs) is often measured in terms of tokens per second. When configuring production model serving endpoints, it\u2019s important to consider the number of requests your application sends to the endpoint. Doing so helps you understand if your endpoint needs to be configured to scale so as to not impact latency. \nWhen configuring the scale-out ranges for endpoints deployed with provisioned throughput, Databricks found it easier to reason about the inputs going into your system using tokens.\n\n###### What do tokens per second ranges in provisioned throughput mean?\n####### What are tokens?\n\nLLMs read and generate text in terms of what is called a **token**. Tokens can be words or sub-words, and the exact rules for splitting text into tokens vary from model to model. 
For instance, you can use online tools to see how [Llama\u2019s tokenizer converts words to tokens](https:\/\/belladoreai.github.io\/llama-tokenizer-js\/example-demo\/build\/).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/prov-throughput-tokens.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n#### Provisioned throughput Foundation Model APIs\n###### What do tokens per second ranges in provisioned throughput mean?\n####### Why measure LLM performance in terms of tokens per second?\n\nTraditionally, serving endpoints are configured based on the number of concurrent requests per second (RPS). However, an LLM inference request takes a different amount of time based on how many tokens are passed in and how many it generates, which can be imbalanced across requests. Therefore, deciding how much scale out your endpoint needs really requires measuring endpoint scale in terms of the content of your request - tokens. \nDifferent use cases feature different input and output token ratios: \n* **Varying lengths of input contexts**: While some requests might involve only a few input tokens, for example a short question, others may involve hundreds or even thousands of tokens, like a long document for summarization. This variability makes configuring a serving endpoint based only on RPS challenging since it does not account for the varying processing demands of the different requests.\n* **Varying lengths of output depending on use case**: Different use cases for LLMs can lead to vastly different output token lengths. Generating output tokens is the most time intensive part of LLM inference, so this can dramatically impact throughput. 
For example, summarization involves shorter, pithier responses, but text generation, like writing articles or product descriptions, can generate much longer answers.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/prov-throughput-tokens.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n### Databricks Foundation Model APIs\n#### Provisioned throughput Foundation Model APIs\n###### What do tokens per second ranges in provisioned throughput mean?\n####### How do I select the tokens per second range for my endpoint?\n\nProvisioned throughput serving endpoints are configured in terms of a range of tokens per second that you can send to the endpoint. The endpoint scales up and down to handle the load of your production application. You are charged per hour based on the range of tokens per second your endpoint is scaled to. \nThe best way to know what tokens per second range on your provisioned throughput serving endpoint works for your use case is to perform a load test with a representative dataset. See [Conduct your own LLM endpoint benchmarking](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/prov-throughput-run-benchmark.html). \nThere are two important factors to consider: \n* *How Databricks measures tokens per second performance of the LLM* \nDatabricks benchmarks endpoints against a workload representing summarization tasks that are common for retrieval-augmented generation use cases. Specifically, the workload consists of: \n+ 2048 input tokens\n+ 256 output tokens \nThe token ranges displayed *combine* input and output token throughput and, by default, optimize for balancing throughput and latency. \nDatabricks benchmarks that users can send that many tokens per second concurrently to the endpoint at a batch size of 1 per request. 
This simulates multiple requests hitting the endpoint at the same time, which more accurately represents how you would actually use the endpoint in production.\n* *How autoscaling works* \nModel Serving features a rapid autoscaling system that scales the underlying compute to meet the tokens per second demand of your application. Databricks scales up provisioned throughput in chunks of tokens per second, so you are charged for additional units of provisioned throughput only when you\u2019re using them.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/prov-throughput-tokens.html"} +{"content":"# AI and Machine Learning on Databricks\n## Prepare data and environment for ML and DL\n### Load data for machine learning and deep learning\n#### Prepare data for distributed training\n###### Save Apache Spark DataFrames as TFRecord files\n\nThis article shows you how to use spark-tensorflow-connector to save Apache Spark DataFrames to TFRecord files and load TFRecord with TensorFlow. \nThe TFRecord file format is a simple record-oriented binary format for ML training data. The [tf.data.TFRecordDataset](https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/data\/TFRecordDataset) class enables you to stream over the contents of one or more TFRecord files as part of an input pipeline.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/load-data\/tfrecords-save-load.html"} +{"content":"# AI and Machine Learning on Databricks\n## Prepare data and environment for ML and DL\n### Load data for machine learning and deep learning\n#### Prepare data for distributed training\n###### Save Apache Spark DataFrames as TFRecord files\n####### Use `spark-tensorflow-connector` library\n\nYou can use [spark-tensorflow-connector](https:\/\/github.com\/tensorflow\/ecosystem\/tree\/master\/spark\/spark-tensorflow-connector) to save Apache Spark DataFrames to TFRecord files. 
\n`spark-tensorflow-connector` is a library within the [TensorFlow ecosystem](https:\/\/github.com\/tensorflow\/ecosystem)\nthat enables conversion between Spark DataFrames and [TFRecords](https:\/\/www.tensorflow.org\/tutorials\/load_data\/tfrecord#tfrecord_files_in_python) (a popular format for storing data for TensorFlow). With spark-tensorflow-connector, you can use\nSpark DataFrame APIs to read TFRecords files into DataFrames and write DataFrames as TFRecords. \nNote \nThe `spark-tensorflow-connector` library is included in [Databricks Runtime for Machine Learning](https:\/\/docs.databricks.com\/machine-learning\/index.html). To use `spark-tensorflow-connector` on [Databricks Runtime release notes versions and compatibility](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html), you need to install the library from Maven. See [Maven or Spark package](https:\/\/docs.databricks.com\/libraries\/package-repositories.html#maven-libraries) for details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/load-data\/tfrecords-save-load.html"} +{"content":"# AI and Machine Learning on Databricks\n## Prepare data and environment for ML and DL\n### Load data for machine learning and deep learning\n#### Prepare data for distributed training\n###### Save Apache Spark DataFrames as TFRecord files\n####### Example: Load data from TFRecord files with TensorFlow\n\nThe example notebook demonstrates how to save data from Apache Spark DataFrames to TFRecord files and load TFRecord files for ML\ntraining. \nYou can load the TFRecord files using the `tf.data.TFRecordDataset` class. See [Reading a TFRecord file](https:\/\/www.tensorflow.org\/tutorials\/load_data\/tfrecord#reading_a_tfrecord_file) from TensorFlow for details. 
\n### Prepare image data for Distributed DL notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/tfrecords-save-load.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/load-data\/tfrecords-save-load.html"} +{"content":"# Data governance with Unity Catalog\n## What is Catalog Explorer?\n#### View the Entity Relationship Diagram\n\nThis article describes how to access the Entity Relationship Diagram (ERD) in Catalog Explorer. The ERD displays the primary key and foreign key relationships between tables in a graph, providing a clear and intuitive representation of how data entities connect. \nFor more information about primary key and foreign key constraints, see [Constraints on Databricks](https:\/\/docs.databricks.com\/tables\/constraints.html). \nTo access the ERD, do the following: \n1. Select a schema.\n2. Click the **Filter tables** field. Optionally type a string to filter the tables. \n![Filter tables](https:\/\/docs.databricks.com\/_images\/filter-tables.png)\n3. Click a table with foreign keys defined. ![Primary key pill](https:\/\/docs.databricks.com\/_images\/primary-key-icon.png) and ![Foreign key pill](https:\/\/docs.databricks.com\/_images\/foreign-key-icon.png) appear next to columns that are designated as primary keys or foreign keys. `PK (TS)` indicates a [TIMESERIES primary key](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html). \n![Table schema](https:\/\/docs.databricks.com\/_images\/table-schema.png)\n4. Click **View relationships** ![View relationships button](https:\/\/docs.databricks.com\/_images\/pk-fk-view-relationships.png) at the top-right of the **Columns** tab. The Entity Relationship Diagram (ERD) opens. 
\n![Entity relationship diagram](https:\/\/docs.databricks.com\/_images\/ce-erd.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/catalog-explorer\/entity-relationship-diagram.html"} +{"content":"# Data governance with Unity Catalog\n## What is Catalog Explorer?\n#### Add AI-generated comments to a table\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nAs a table owner or user with permission to modify a table, you can use Catalog Explorer to view and add an AI-generated comment for any table or table column managed by Unity Catalog. Comments are powered by a large language model (LLM) that takes into account the table metadata, such as the table schema and column names.\n\n#### Add AI-generated comments to a table\n##### How do AI-generated comments work?\n\nAI-generated comments (also known as AI-generated documentation) provide a quick way to help users discover data managed by Unity Catalog. \nImportant \nAI-generated comments are intended to provide a general description of tables and columns based on the schema. The descriptions are tuned for data in a business and enterprise context, using example schemas from several open datasets across various industries. The model was evaluated with hundreds of simulated samples to verify it avoids generating harmful or inappropriate descriptions. \nAI models are not always accurate and comments must be reviewed prior to saving. Databricks strongly recommends human review of AI-generated comments to check for inaccuracies. The model should not be relied on for data classification tasks such as detecting columns with PII. \nUsers with the `USE SCHEMA` and `SELECT` privileges on the table can view comments once they are added. 
\nFor information about the models that are used to generate comment suggestions, see [Frequently asked questions about AI-generated table comments](https:\/\/docs.databricks.com\/catalog-explorer\/ai-comments.html#comments-faq).\n\n","doc_uri":"https:\/\/docs.databricks.com\/catalog-explorer\/ai-comments.html"} +{"content":"# Data governance with Unity Catalog\n## What is Catalog Explorer?\n#### Add AI-generated comments to a table\n##### Add AI-generated comments\n\nYou must use Catalog Explorer to view suggested comments, edit them, and add them to tables and columns. \n**Prerequisites**: If your workspace uses the [compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html), a workspace admin must enable partner-powered AI assistive features: \n1. In **Settings**, go to the **Advanced** tab and scroll down to the **Other** section.\n2. Turn on the **Partner-powered AI assistive features** option. \nFor other workspaces, the feature is enabled by default. \n**Permissions required**: You must be the table owner or have the `MODIFY` privilege on the table to view the AI-suggested comment, edit it, and add it. \nTo add an AI-generated comment to a table: \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Search or browse for the table and select it.\n3. View the **AI Suggested Comment** field below the **Tags** field. \n![AI-generated comment edit field](https:\/\/docs.databricks.com\/_images\/ai-generated-comment.png) \nThe AI might take a moment to generate the comment.\n4. Click **Accept** to accept the comment as-is, or **Edit** to modify it before you save it. \nTo add an AI-generated comment to a column: \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Search or browse for the table and select it.\n3. On the **Columns** tab, click the **AI generate** button. 
\nA comment is generated for each column.\n4. Click the check mark next to the column comment to accept it or close it unsaved. \nThe table owner or user with the `MODIFY` privilege on the table can update table and column comments at any time, using the Catalog Explorer UI or SQL commands ([ALTER TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table.html) or [COMMENT ON](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-comment.html)).\n\n","doc_uri":"https:\/\/docs.databricks.com\/catalog-explorer\/ai-comments.html"} +{"content":"# Data governance with Unity Catalog\n## What is Catalog Explorer?\n#### Add AI-generated comments to a table\n##### Frequently asked questions about AI-generated table comments\n\nThis section provides general information about AI-generated table comments (also known as AI-generated documentation) in the form of frequently asked questions. \n### What services does the AI-generated documentation feature use? \nIn workspaces enabled with the [compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html), AI-generated comments may use external model partners to provide responses. \nFor all other workspaces on AWS, AI-generated comments use an internal large language model (LLM). \nWhether the model is internal or external, data sent to these models is not used for model training. The models themselves are stateless: no prompts or completions are stored by model providers. \n### What regions are model-serving endpoints hosted in? \nEuropean Union (EU) data stays in the EU. For external partner models, European Union (EU) workspaces use an external model hosted in the EU. All other regions use an external model hosted in the US. For internal Databricks models, European Union (EU) workspaces use a model hosted in `eu-west-1`. All other traffic is sent to the `us-west-2` region during the Public Preview. 
\n### How is data encrypted between Databricks and external model partners? \nTraffic between Databricks and external model partners is encrypted in transit using industry standard TLS 1.2 encryption. \n### Is everything encrypted at rest? \nAny data stored within a Databricks workspace is AES-256 bit encrypted. Our external partners do not store any prompts or completions sent to them. \n### What data is sent to the models? \nDatabricks sends the following metadata to the models with each API request: \n* Table schema (catalog name, schema name, table name, current comment)\n* Column names (column name, type, primary key or not, current column comment) \nApproved table or column comments are stored in the Databricks control plane database, along with the rest of the Unity Catalog metadata. The control plane database is AES-256 bit encrypted. \n### What legal terms govern the use of AI-generated comments? \nUsage is governed by the existing Databricks terms and conditions the customer has agreed to when using Databricks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/catalog-explorer\/ai-comments.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on another Databricks workspace\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes how to set up Lakehouse Federation to run federated queries on Databricks data in another Databricks workspace. To learn more about Lakehouse Federation, see [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). \nImportant \nDatabricks-to-Databricks Lakehouse Federation is a good tool for running queries on data managed by another Databricks workspace\u2019s Hive or AWS Glue metastore. 
For most other scenarios, other Databricks workflows are more efficient: \n* If Databricks workspaces share the same Unity Catalog metastore, you can manage cross-workspace queries using standard Unity Catalog queries and data governance tools.\n* If you want *read-only* access to data in a Databricks workspace attached to a different Unity Catalog metastore, whether in your Databricks account or not, Delta Sharing is a better choice. \nThere is no need to set up Lakehouse Federation in either of these scenarios. \nTo connect to a Databricks catalog in another workspace using Lakehouse Federation, you must create the following in your Databricks Unity Catalog metastore: \n* A cluster or SQL warehouse in a Databricks workspace.\n* A *connection* to the cluster or SQL warehouse.\n* A *foreign catalog* in your Unity Catalog metastore that mirrors the other Databricks catalog accessible from the cluster or SQL warehouse so that you can use Unity Catalog query syntax and data governance tools to manage Databricks user access to the data.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/databricks.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on another Databricks workspace\n##### Before you begin\n\nWorkspace requirements: \n* Workspace enabled for Unity Catalog. \nCompute requirements: \n* Network connectivity from your Databricks Runtime cluster or SQL warehouse to the target database systems. See [Networking recommendations for Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/networking.html).\n* Databricks clusters must use Databricks Runtime 13.3 LTS or above and shared or single-user access mode.\n* SQL warehouses must be Pro or Serverless. 
\nPermissions required: \n* To create a connection, you must be a metastore admin or a user with the `CREATE CONNECTION` privilege on the Unity Catalog metastore attached to the workspace.\n* To create a foreign catalog, you must have the `CREATE CATALOG` permission on the metastore and be either the owner of the connection or have the `CREATE FOREIGN CATALOG` privilege on the connection. \nAdditional permission requirements are specified in each task-based section that follows. \nYou must also have an active cluster or SQL warehouse in the Databricks workspace that you are using to configure the connection.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/databricks.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on another Databricks workspace\n##### Create a connection\n\nA connection specifies a path and credentials for accessing an external database system. To create a connection, you can use Catalog Explorer or the `CREATE CONNECTION` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** Metastore admin or user with the `CREATE CONNECTION` privilege. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **External Data** menu and select **Connections**.\n3. Click **Create connection**.\n4. Enter a user-friendly **Connection name**.\n5. Select a **Connection type** of **Databricks**.\n6. Enter the following connection properties for the other Databricks instance. \n* **Host**: Workspace instance name. To learn how to get the workspace instance name, see [Get identifiers for workspace objects](https:\/\/docs.databricks.com\/workspace\/workspace-details.html).\n* **HTTP path**: The HTTP path for your SQL warehouse. 
To get the path, go to **SQL > SQL Warehouses** in the sidebar, select the SQL warehouse, go to the **Connection details** tab, and copy the value for **HTTP path**.\n* **Personal access token**: A Databricks personal access token that enables access to the target workspace. To learn how to get a token, see [Databricks personal access token authentication](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). For connections, Databricks recommends using a personal access token for a service principal.\n7. (Optional) Click **Test connection** to confirm that it works.\n8. (Optional) Add a comment.\n9. Click **Create**. \nRun the following command in a notebook or the Databricks SQL query editor, replacing the following: \n* `<connection-name>`: User-friendly name for the connection you\u2019re creating.\n* `<workspace-instance>`: The target workspace instance. To learn how to get the workspace instance name, see [Get identifiers for workspace objects](https:\/\/docs.databricks.com\/workspace\/workspace-details.html).\n* `<sql-warehouse-path>`: The HTTP path for your SQL warehouse. To get the path, go to **SQL > SQL Warehouses** in the sidebar, select the SQL warehouse, go to the **Connection details** tab, and copy the value for **HTTP path**.\n* `<personal-access-token>`: A Databricks personal access token that enables access to the target workspace. To learn how to get a token, see [Databricks personal access token authentication](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). For connections, Databricks recommends that you use a service principal\u2019s personal access token. \n```\nCREATE CONNECTION <connection-name> TYPE databricks\nOPTIONS (\nhost '<workspace-instance>',\nhttpPath '<sql-warehouse-path>',\npersonalAccessToken '<personal-access-token>'\n);\n\n``` \nWe recommend that you use Databricks [secrets](https:\/\/docs.databricks.com\/security\/secrets\/index.html) instead of plaintext strings for sensitive values like credentials. 
For example: \n```\nCREATE CONNECTION <connection-name> TYPE databricks\nOPTIONS (\nhost '<workspace-instance>',\nhttpPath '<sql-warehouse-path>',\npersonalAccessToken secret ('<secret-scope>','<secret-key-password>')\n)\n\n``` \nFor information about setting up secrets, see [Secret management](https:\/\/docs.databricks.com\/security\/secrets\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/databricks.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on another Databricks workspace\n##### Create a foreign catalog\n\nA foreign catalog mirrors a catalog in the external Databricks workspace so that you can query and manage access to data in that external Databricks catalog as if it were a catalog in your own workspace. To create a foreign catalog, you use a connection to the external Databricks workspace that has already been defined. \nTo create a foreign catalog, you can use Catalog Explorer or the `CREATE FOREIGN CATALOG` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** `CREATE CATALOG` permission on the metastore and either ownership of the connection or the `CREATE FOREIGN CATALOG` privilege on the connection. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Click the **Create Catalog** button.\n3. On the **Create a new catalog** dialog, enter a name for the catalog and select a **Type** of **Foreign**.\n4. Select the **Connection** that provides access to the database that you want to mirror as a Unity Catalog catalog.\n5. Enter the target Databricks **Catalog** name.\n6. Click **Create.** \nRun the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional. 
Replace the placeholder values: \n* `<catalog-name>`: Name for the foreign catalog that you are creating.\n* `<connection-name>`: The [connection object](https:\/\/docs.databricks.com\/query-federation\/databricks.html#connection) that specifies the data source, path, and access credentials.\n* `<external-catalog-name>`: Name of the catalog in the external Databricks workspace that you are mirroring. \n```\nCREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>\nOPTIONS (catalog '<external-catalog-name>');\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/databricks.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on another Databricks workspace\n##### Supported pushdowns\n\nThe following pushdowns are supported on all compute: \n* Filters\n* Projections\n* Limit\n* Functions: only filter expressions are supported (string functions, Mathematical functions, Date, Time and Timestamp functions, and other miscellaneous functions, such as Alias, Cast, SortOrder). \nThe following pushdowns are supported on Databricks Runtime 13.3 LTS and above and SQL warehouse compute: \n* Aggregates\n* The following Boolean operators: =, <, <=, >, >=, <=>\n* The following mathematical functions (not supported if ANSI is disabled): +, -, \\*, %, \/\n* The following miscellaneous operators: ^, |, ~\n* Sorting, when used with limit \nThe following pushdowns are not supported: \n* Joins\n* Window functions\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/databricks.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### FedRAMP Moderate compliance controls\n\nPreview \nThe ability for admins to add Enhanced Security and Compliance features is a feature in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
The compliance security profile and support for compliance standards are generally available (GA). \nFedRAMP Moderate compliance controls provide enhancements that help you with FedRAMP Moderate compliance for your workspace. For FedRAMP High compliance, see [Databricks on AWS GovCloud](https:\/\/docs.databricks.com\/security\/privacy\/gov-cloud.html). \nFedRAMP Moderate compliance controls require enabling the *compliance security profile*, which adds monitoring agents, enforces instance types for inter-node encryption, provides a hardened compute image, and other features. For technical details, see [Compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html). It is your responsibility to [confirm that each affected workspace has the compliance security profile enabled](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html#verify) and confirm that FedRAMP is added as a compliance program. \nImportant \n* Databricks is a FedRAMP\u00ae Authorized Cloud Service Offering (CSO) at the moderate impact Level in the AWS US East-1 and US West-2 (commercial) regions.\n* US Government agencies can access the Databricks on AWS FedRAMP\u00ae package on OMB Max by submitting a [Package Access Request Form](https:\/\/www.fedramp.gov\/assets\/resources\/documents\/Agency_Package_Request_Form.pdf) and submitting it to `package-access@fedramp.gov`.\n* Additional information regarding Databricks and FedRAMP\u00ae compliance is located on the [Databricks Security and Trust Center](https:\/\/www.databricks.com\/trust\/fedramp).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/fedramp.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### FedRAMP Moderate compliance controls\n######## Which compute resources get enhanced security\n\nThe compliance security profile enhancements apply to compute resources in the [classic 
compute plane](https:\/\/docs.databricks.com\/getting-started\/overview.html) in all regions. \nSupport for serverless SQL warehouses for the compliance security profile varies by region. See [Serverless SQL warehouses support the compliance security profile in some regions](https:\/\/docs.databricks.com\/admin\/sql\/serverless.html#security-profile).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/fedramp.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### FedRAMP Moderate compliance controls\n######## Requirements\n\n* Your Databricks account must include the Enhanced Security and Compliance add-on. For details, see the [pricing page](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n* Your workspace is on the Enterprise tier.\n* Your workspace is deployed in AWS region US East-1 and US West-2.\n* [Single sign-on (SSO)](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/single-sign-on\/index.html) authentication is configured for the workspace.\n* Your workspace enables the [compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html) and adds the FedRAMP compliance standard as part of the compliance security profile configuration.\n* You must use the following VM instance types: \n+ **General purpose:** `M-fleet`, `Md-fleet`, `M5dn`, `M5n`, `M5zn`, `M7g`, `M7gd`, `M6i`, `M7i`, `M6id`, `M6in`, `M6idn`, `M6a`, `M7a`\n+ **Compute optimized:** `C5a`, `C5ad`, `C5n`, `C6gn`, `C7g`, `C7gd`, `C7gn`, `C6i`, `C6id`, `C7i`, `C6in`, `C6a`, `C7a`\n+ **Memory optimized:** `R-fleet`, `Rd-fleet`, `R7g`, `R7gd`, `R6i`, `R7i`, `R7iz`, `R6id`, `R6in`, `R6idn`, `R6a`, `R7a`\n+ **Storage optimized:** `D3`, `D3en`, `P3dn`, `R5dn`, `R5n`, `I4i`, `I4g`, `I3en`, `Im4gn`, `Is4gen`\n+ **Accelerated computing:** `G4dn`, `G5`, `P4d`, `P4de`, `P5`\n* Ensure that sensitive information is never entered in customer-defined 
input fields, such as workspace names, cluster names, and job names.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/fedramp.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### FedRAMP Moderate compliance controls\n######## Enable FedRAMP Moderate compliance controls\n\nTo configure your workspace to support processing of data regulated by the FedRAMP standard, enable the [compliance security profile](https:\/\/docs.databricks.com\/security\/privacy\/security-profile.html) and add the FedRAMP compliance standard. You can do this for all workspaces or only on some workspaces. \n* To enable the compliance security profile and add the FedRAMP compliance standard for an existing workspace, see [Enable enhanced security and compliance features on a workspace](https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-compliance.html#aws-workspace-config).\n* To set an account-level setting to enable the compliance security profile and FedRAMP for new workspaces, see [Set account-level defaults for new workspaces](https:\/\/docs.databricks.com\/security\/privacy\/enhanced-security-compliance.html#aws-account-level-defaults).\n\n####### FedRAMP Moderate compliance controls\n######## Does Databricks permit processing data protected by FedRAMP Moderate?\n\nYes, if you comply with the [requirements](https:\/\/docs.databricks.com\/security\/privacy\/fedramp.html#requirements), enable the compliance security profile, and add the FedRAMP compliance standard as part of the compliance security profile configuration.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/fedramp.html"} +{"content":"# Security and compliance guide\n## Auditing\n### privacy\n#### and compliance\n##### Compliance security profile\n####### FedRAMP Moderate compliance controls\n######## Preview features that are supported for processing data protected by FedRAMP 
Moderate\n\nThe following preview features are supported for processing data protected by FedRAMP Moderate: \n* [SCIM provisioning](https:\/\/docs.databricks.com\/admin\/users-groups\/scim\/index.html)\n* [IAM passthrough](https:\/\/docs.databricks.com\/archive\/credential-passthrough\/iam-passthrough.html)\n* [Secret paths in environment variables](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html#spark-conf-env-var)\n* [System tables](https:\/\/docs.databricks.com\/admin\/system-tables\/index.html)\n* [Serverless SQL warehouse usage when compliance security profile is enabled](https:\/\/docs.databricks.com\/admin\/sql\/serverless.html#security-profile), with support in some regions\n* [Filtering sensitive table data with row filters and column masks](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/row-and-column-filters.html)\n* [Unified login](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/single-sign-on\/index.html#unified-login)\n* [Lakehouse Federation to Redshift](https:\/\/docs.databricks.com\/query-federation\/redshift.html)\n* [Liquid clustering for Delta tables](https:\/\/docs.databricks.com\/delta\/clustering.html)\n* [Unity Catalog-enabled DLT pipelines](https:\/\/docs.databricks.com\/delta-live-tables\/unity-catalog.html)\n* [Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html)\n* Scala support for shared clusters\n* Delta Live Tables Hive metastore to Unity Catalog clone API\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/privacy\/fedramp.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### One Spark task\n\nIf you see a long-running stage with just one task, that\u2019s likely a sign of a problem. While this one task is running only one CPU is utilized and the rest of the cluster may be idle. 
This happens most frequently in the following situations: \n* Expensive [UDF](https:\/\/docs.databricks.com\/udf\/index.html) on small data\n* [Window function](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-window-functions.html) without `PARTITION BY` statement\n* Reading from an unsplittable file type. This means the file cannot be read in multiple parts, so you end up with one big task. Gzip is an example of an unsplittable file type.\n* Setting the `multiLine` option when reading a JSON or CSV file\n* Schema inference of a large file\n* Use of [repartition(1)](https:\/\/spark.apache.org\/docs\/3.1.3\/api\/python\/reference\/api\/pyspark.sql.DataFrame.repartition.html) or [coalesce(1)](https:\/\/spark.apache.org\/docs\/3.1.3\/api\/python\/reference\/api\/pyspark.sql.DataFrame.coalesce.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/one-spark-task.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query filters\n\nA query filter lets you interactively reduce the amount of data shown in a visualization. Query filters are similar to query parameters but with a few key differences. A query filter limits data *after* the query has been executed. This makes filters ideal for smaller datasets and environments where query executions are time-consuming, rate-limited, or costly. \nThe following describes some benefits of Databricks SQL. \n* While previous query filters operated client-side only, these updated filters work dynamically on either client- or server-side to optimize performance.\n* Simplified UI experience: click the **+Add filter** button and select a column from a dropdown to add a filter. You don\u2019t need to author, permission, and refresh a separate query in order to filter on the distinct values of a column.\n* Enable \u201chighlight relevant values\u201d to see which selections within a filter will return results given other filter selections. 
For example, consider a user who has both a \u201cState\u201d and \u201cCity\u201d filter. If a user chooses to highlight relevant values, selecting \u201cCalifornia\u201d in the state filter will highlight only the cities in California in the \u201cCity\u201d filter. Non-highlighted options are put under a \u201cFiltered out\u201d menu option in the dropdown.\n* Text Input filters: filters column results based on text input searches. There are three modes the search can find matches with: exact match, contains, and starts with.\n* Quick date selectors enable you to filter on predefined date ranges such as last week, last month, last year, and more.\n* You can set default date ranges when creating date filters.\n* You can also use query filters on dashboards. By default, the filter widget appears beside each visualization where the filter has been added to the query. To link together the filter widgets into a dashboard-level query filter see [Dashboard filters](https:\/\/docs.databricks.com\/sql\/user\/dashboards\/index.html#dashboard-filters).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-filters.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query filters\n##### Description of functionality\n\nAfter running a query, in the **Results** panel, click **+** and then select **Filter**. \nThe **+Add filter** button opens a popup menu where you can apply the following filters and settings. \n* Column: the column on which to apply the filter. \n+ Strings, numbers, and dates are currently supported.\n+ If the selected column contains dates, users can choose to specify a time binning by date, month, or year.\n* Type: the type of filter to apply \n+ Single Select: filter to one field value only\n+ Multi Select: filter to multiple field values\n+ Text Input: enter a string to search for matching values in a particular column. 
Supports \u201cContains,\u201d \u201cExact Match,\u201d and \u201cStarts With\u201d\n+ Date \/ time picker and range picker: ability to select a date or date range to filter on date data types\n* Sort Order: The order in which values are sorted for the filter dropdown. You can choose between \u201cAscending\u201d or \u201cDescending.\u201d\n* Highlight relevant values (y\/n): Enable this to easily see which selections within a filter will return results given other filter selections. For example, consider a user who has both a \u201cState\u201d and \u201cCity\u201d filter. If a user chooses to highlight relevant values, selecting \u201cCalifornia\u201d in the state filter will highlight the set of options available in the \u201cCity\u201d filter to only show cities in California, while non-highlighted options will be put under a \u201cFiltered out\u201d menu option in the dropdown. Note that this requires running a query each time a filter is updated.\n* Default value: When a \u201cdate\u201d type column is selected and a time-binned value is chosen (days, months, or years), users can also choose to set a default date range for the filter. The default filter is automatically applied whenever the query is refreshed.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-filters.html"} +{"content":"# What is data warehousing on Databricks?\n## Access and manage saved queries\n#### Query filters\n##### Limitations\n\n* It is important to note that query filters are applied over the entirety of the dataset. However, the dropdown selector for query filters is limited to 64k unique values. If a user wishes to filter in situations where there are more than 64k unique filter values, it is recommended to use a **Text** parameter instead.\n* Filters can only be applied to columns returned by a query, not all columns of a referenced table.\n* Filters display the distinct list of options from the designated column in returned results. 
If the results are limited (i.e. query run with **Limit 1000**), then a filter will only display unique values from within those 1000 results.\n* While filters applied to a query will optimize to run on either the client or server side for better performance, filters applied to a dashboard will always run on the server side.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sql\/user\/queries\/query-filters.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### AI Functions on Databricks\n##### Query an external model with ai\\_query()\n\nNote \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To query endpoints that serve [external models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html), you must enroll in the public preview. Please populate and submit the [AI Functions Public Preview enrollment form](https:\/\/docs.google.com\/forms\/d\/e\/1FAIpQLSeHFz5YYZX2zZQy2bH8y4v3QJmfvpepPPw7UsK3IQrpskQ8Gg\/viewform). \nThis article illustrates how to set up and query an [external model endpoint](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html) using the built-in Databricks SQL function `ai_query()`. The example uses external model support in Databricks Model Serving to query `gpt-4` provided by OpenAI and accomplish chat tasks. See [AI Functions on Databricks](https:\/\/docs.databricks.com\/large-language-models\/ai-functions.html) for more detail about this AI function.\n\n##### Query an external model with ai\\_query()\n###### Prerequisites\n\n* See the requirements of [ai\\_query SQL function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_query.html).\n* An [OpenAI API key](https:\/\/platform.openai.com\/docs\/api-reference\/authentication)\n* Store the key in a [Databricks secret](https:\/\/docs.databricks.com\/security\/secrets\/index.html). 
In this example you store the API key in *scope* `my-external-model` and *secret* `openai`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/ai-query-external-model.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### AI Functions on Databricks\n##### Query an external model with ai\\_query()\n###### Create an external model endpoint\n\nThe following creates an external model serving endpoint that serves OpenAI `gpt-4` for a chat task. \nTo create a personal access token, see [Authentication for Databricks automation](https:\/\/docs.databricks.com\/dev-tools\/auth\/index.html). \n```\nimport requests\nimport json\n\npersonal_access_token = \"your-personal-access-token\"\nheaders = {\n\"Authorization\": \"Bearer \" + personal_access_token,\n}\nhost = \"https:\/\/oregon.cloud.databricks.com\/\"\nurl = host + \"api\/2.0\/serving-endpoints\"\n\ndata = {\n\"name\": \"my-external-openai-chat\",\n\"config\": {\n\"served_entities\": [\n{\n\"name\": \"my_entity\",\n\"external_model\": {\n\"name\": \"gpt-4\",\n\"provider\": \"openai\",\n\"openai_config\": {\n\"openai_api_key\": \"{{secrets\/my-external-model\/openai}}\",\n},\n\"task\": \"llm\/v1\/chat\",\n},\n}\n],\n},\n}\n\nresponse = requests.post(url, headers=headers, json=data)\n\nprint(\"Status Code\", response.status_code)\nprint(\"JSON Response \", json.dumps(json.loads(response.text), indent=4))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/ai-query-external-model.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### AI Functions on Databricks\n##### Query an external model with ai\\_query()\n###### Query the external model with ai\\_query()\n\nIn the Databricks SQL query editor, you can write SQL queries to query the external model serving endpoint. 
\nExample queries: \n```\nSELECT ai_query(\n\"my-external-openai-chat\",\n\"What is a large language model?\"\n)\n\nSELECT question, ai_query(\n\"my-external-openai-chat\",\n\"You are a customer service agent. Answer the customer's question in 100 words: \" || question\n) AS answer\nFROM\nuc_catalog.schema.customer_questions\n\nSELECT\nsku_id,\nproduct_name,\nai_query(\n\"my-external-openai-chat\",\n\"You are a marketing expert for a winter holiday promotion targeting GenZ. Generate a promotional text in 30 words mentioning a 50% discount for product: \" || product_name\n)\nFROM\nuc_catalog.schema.retail_products\nWHERE\ninventory > 2 * forecasted_sales\n\n```\n\n##### Query an external model with ai\\_query()\n###### Additional resources\n\n* [AI Functions on Databricks](https:\/\/docs.databricks.com\/large-language-models\/ai-functions.html).\n* [Query a served model with ai\\_query()](https:\/\/docs.databricks.com\/large-language-models\/how-to-ai-query.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/ai-query-external-model.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Use custom Python libraries with Model Serving\n\nIn this article, you learn how to include custom libraries or libraries from a private mirror server when you log your model, so that you can use them with [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) model deployments. You should complete the steps detailed in this guide after you have a trained ML model ready to deploy but before you create a Databricks [Model Serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html). \nModel development often requires the use of custom Python libraries that contain functions for pre- or post-processing, custom model definitions, and other shared utilities. 
In addition, many enterprise security teams encourage the use of private PyPi mirrors, such as Nexus or Artifactory, to reduce the risk of [supply-chain attacks](https:\/\/wikipedia.org\/wiki\/Supply_chain_attack). Databricks offers [native support](https:\/\/docs.databricks.com\/libraries\/index.html) for installation of custom libraries and libraries from a private mirror in the Databricks workspace.\n\n#### Use custom Python libraries with Model Serving\n##### Requirements\n\n* MLflow 1.29 or higher\n\n#### Use custom Python libraries with Model Serving\n##### Step 1: Upload dependency file\n\nDatabricks recommends that you upload your dependency file to Unity Catalog [volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html). Alternatively, you can upload it to [Databricks File System (DBFS)](https:\/\/docs.databricks.com\/dbfs\/index.html) using the Databricks UI. \nTo ensure your library is available to your notebook, you need to install it using `%pip`. Using `%pip` installs the library in the current notebook and downloads the dependency to the cluster.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/private-libraries-model-serving.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Use custom Python libraries with Model Serving\n##### Step 2: Log the model with a custom library\n\nImportant \nThe guidance in this section is not required if you install the private library by pointing to a custom PyPi mirror. \nAfter you install the library and upload the Python wheel file to either Unity Catalog volumes or DBFS, include the following code in your script. In the `extra_pip_requirements` specify the path of your dependency file. 
\n```\nmlflow.sklearn.log_model(model, \"sklearn-model\", extra_pip_requirements=[\"\/volume\/path\/to\/dependency.whl\"])\n\n``` \nFor DBFS, use the following: \n```\nmlflow.sklearn.log_model(model, \"sklearn-model\", extra_pip_requirements=[\"\/dbfs\/path\/to\/dependency.whl\"])\n\n``` \nIf you have a custom library, you must specify all custom Python libraries associated with your model when you configure logging. You can do so with the `extra_pip_requirements` or `conda_env` parameters in [log\\_model()](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/mlflow.sklearn.html#mlflow.sklearn.log_model). \nImportant \nIf using DBFS, be sure to include a forward slash, `\/`, before your `dbfs` path when logging `extra_pip_requirements`. Learn more about DBFS paths in [Work with files on Databricks](https:\/\/docs.databricks.com\/files\/index.html). \n```\nfrom mlflow.utils.environment import _mlflow_conda_env\nconda_env = _mlflow_conda_env(\nadditional_conda_deps= None,\nadditional_pip_deps= [\"\/volumes\/path\/to\/dependency\"],\nadditional_conda_channels=None,\n)\nmlflow.pyfunc.log_model(..., conda_env = conda_env)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/private-libraries-model-serving.html"} +{"content":"# Model serving with Databricks\n## Deploy custom models\n#### Use custom Python libraries with Model Serving\n##### Step 3: Update MLflow model with Python wheel files\n\nMLflow provides the [add\\_libraries\\_to\\_model()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.models.html#mlflow.models.add_libraries_to_model) utility to log your model with all of its dependencies pre-packaged as Python wheel files. This packages your custom libraries alongside the model in addition to *all* other libraries that are specified as dependencies of your model. This guarantees that the libraries used by your model are exactly the ones accessible from your training environment. 
\nIn the following example, `model_uri` references the model registry using the syntax `models:\/<model-name>\/<model-version>`. \nWhen you use the model registry URI, this utility generates a new version under your existing registered model. \n```\nimport mlflow.models.utils\nmlflow.models.utils.add_libraries_to_model(<model-uri>)\n\n```\n\n#### Use custom Python libraries with Model Serving\n##### Step 4: Serve your model\n\nWhen a new model version with the packages included is available in the model registry, you can add this model version to an endpoint with [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/private-libraries-model-serving.html"} +{"content":"# AI and Machine Learning on Databricks\n## Reference solutions for machine learning\n#### Reference solution for image applications\n\nLearn how to do distributed image model inference from reference solution notebooks using pandas UDF, PyTorch, and TensorFlow in a common configuration shared by many real-world image applications. This configuration assumes that you store many images in an object store and optionally have continuously arriving new images.\n\n#### Reference solution for image applications\n##### Workflow for image model inferencing\n\nSuppose you have several trained deep learning (DL) models for image classification and object detection\u2014for example, MobileNetV2 for\ndetecting human objects in user-uploaded photos to help protect privacy\u2014and you want to apply these DL models to the stored images. \nYou might re-train the models and update previously computed predictions.\nHowever, it is both I\/O-heavy and compute-heavy to load many images and apply DL models.\nFortunately, the inference workload is embarrassingly parallel and in theory can be distributed easily.\nThis guide walks you through a practical solution that contains two major stages: \n1. 
ETL images into a Delta table using Auto Loader\n2. Perform distributed inference using pandas UDF\n\n#### Reference solution for image applications\n##### ETL images into a Delta table using Auto Loader\n\nFor image applications, including training and inference tasks, Databricks recommends that you ETL images into a Delta table with the [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html). The Auto Loader helps data management and automatically handles continuously arriving new images. \n### ETL image dataset into a Delta table notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/dist-img-infer-1-etl.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/images-etl-inference.html"} +{"content":"# AI and Machine Learning on Databricks\n## Reference solutions for machine learning\n#### Reference solution for image applications\n##### Perform distributed inference using pandas UDF\n\nThe following notebooks use PyTorch and TensorFlow tf.Keras to demonstrate the reference solution. 
\n### Distributed inference via PyTorch and pandas UDF notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/dist-img-infer-2-pandas-udf.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \n### Distributed inference via Keras and pandas UDF notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/dist-img-infer-3-keras-udf.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n#### Reference solution for image applications\n##### Limitations: Image file sizes\n\nFor large image files (average image size greater than 100 MB), Databricks recommends using the Delta table only to manage the metadata (list of file names) and loading the images from the object store using their paths when needed.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/reference-solutions\/images-etl-inference.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Use the UCX utilities to upgrade your workspace to Unity Catalog\n\nThis article introduces [UCX](https:\/\/github.com\/databrickslabs\/ucx), a Databricks Labs project that provides tools to help you upgrade your non-Unity-Catalog workspace to Unity Catalog. \nNote \nUCX, like all projects in the databrickslabs GitHub account, is provided for your exploration only, and is not formally supported by Databricks with service-level agreements (SLAs). It is provided as-is. We make no guarantees of any kind. Do not submit a Databricks support ticket relating to issues that arise from the use of this project. Instead, file a [GitHub issue](https:\/\/github.com\/databrickslabs\/ucx\/issues\/new\/choose). Issues will be reviewed as time permits, but there are no formal SLAs for support. \nThe UCX project provides the following migration tools and workflows: \n1. 
[Assessment workflow](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/ucx.html#assessment) to help you plan your migration.\n2. [Group migration workflow](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/ucx.html#group) to help you upgrade group membership from your workspace to your Databricks account and migrate permissions to the new account-level groups.\n3. [Table migration workflow](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/ucx.html#table) to help you upgrade tables that are registered in your workspace\u2019s Hive metastore to the Unity Catalog metastore. This workflow also helps you migrate storage locations and the credentials required to access them. \nThis diagram shows the overall migration flow, identifying migration workflows and utilities by name: \n![UCX migration workflows chart](https:\/\/docs.databricks.com\/_images\/ucx-migration-flow.png) \nNote \nThe code migration workflow that is depicted in the diagram remains under development and is not yet available.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/ucx.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Use the UCX utilities to upgrade your workspace to Unity Catalog\n##### Before you begin\n\nBefore you can install UCX and run the UCX workflows, your environment must meet the following requirements. \n**Packages installed on the computer where you run UCX**: \n* Databricks CLI v0.213 or above. See [Install or update the Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/install.html). 
\nYou must have a Databricks configuration file with configuration profiles for both the workspace and the Databricks account.\n* Python 3.10 or above.\n* If you want to run the UCX workflow that identifies storage locations used by Hive tables in your workspace (recommended, but not required), you must have the CLI for your cloud storage provider (Azure CLI or AWS CLI) installed on the computer where you run the UCX workflows. \n**Network access**: \n* Network access from the computer that runs the UCX installation to the Databricks workspace that you are migrating.\n* Network access to the internet from the computer that runs the UCX installation. This is required for access to pypi.org and github.com.\n* Network access from your Databricks workspace to pypi.org to download the `databricks-sdk` and `pyyaml` packages. \n**Databricks roles and permissions**: \n* Databricks account admin and workspace admin roles for the user who runs the UCX installation. You cannot run the installation as a service principal. \n**Other Databricks prerequisites**: \n* A Unity Catalog metastore created for every region that hosts a workspace that you want to upgrade, with each of those Databricks workspaces attached to a Unity Catalog metastore. \nTo learn how to determine whether you already have a Unity Catalog metastore in the relevant workspace regions, how to create a metastore if you don\u2019t, and how to attach a Unity Catalog metastore to a workspace, see [Step 1: Confirm that your workspace is enabled for Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#auto-enabled-check) in the Unity Catalog setup article. As an alternative, UCX provides [a utility for assigning Unity Catalog metastores to workspaces](https:\/\/github.com\/databrickslabs\/ucx\/blob\/main\/README.md#assign-metastore-command) that you can use after UCX is installed. 
\nAttaching a Unity Catalog metastore to a workspace also enables *identity federation*, in which you centralize user management at the Databricks account level, which is also a prerequisite for using UCX. See [Enable identity federation](https:\/\/docs.databricks.com\/admin\/users-groups\/best-practices.html#identity-federation).\n* If your workspace uses an external Hive metastore (such as AWS Glue) instead of the default workspace-local Hive metastore, you must perform some prerequisite setup. See [External Hive Metastore Integration](https:\/\/github.com\/databrickslabs\/ucx\/blob\/main\/docs\/external_hms_glue.md) in the databrickslabs\/ucx repo.\n* A Pro or Serverless SQL warehouse running on the workspace where you run UCX workflows, required to render the report generated by the assessment workflow.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/ucx.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Use the UCX utilities to upgrade your workspace to Unity Catalog\n##### Install UCX\n\nTo install UCX, use the Databricks CLI: \n```\ndatabricks labs install ucx\n\n``` \nYou are prompted to select the following: \n1. The Databricks configuration profile for the workspace that you want to upgrade. The configuration file must also include a configuration profile for the workspace\u2019s parent Databricks account.\n2. A name for the inventory database that will be used to store the output of the migration workflows. Typically it\u2019s fine to select the default, which is `ucx`.\n3. A SQL warehouse to run the installation process on.\n4. A list of workspace-local groups you want to migrate to account-level groups. 
If you leave this as the default (`<ALL>`), any existing account-level group whose name matches a workspace-local group will be treated as the replacement for that workspace-local group and will inherit all of its workspace permissions when you run the [group migration workflow](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/ucx.html#group) after installation. \nYou can modify the workspace-group-to-account-group mapping after you run the installer and before you run group migration. See [Group Name Conflict Resolution](https:\/\/github.com\/databrickslabs\/ucx\/blob\/main\/docs\/group_name_conflict.md) in the UCX repo.\n5. If you have an external Hive metastore, such as AWS Glue, you have the option to connect to it or not. See [External Hive Metastore Integration](https:\/\/github.com\/databrickslabs\/ucx\/blob\/main\/docs\/external_hms_glue.md) in the databrickslabs\/ucx repo.\n6. Whether to open the generated README notebook. \nWhen the installation is done, it deploys a README notebook, dashboards, databases, libraries, jobs, and other assets in your workspace. \nFor more information, see the [installation instructions in the project readme](https:\/\/github.com\/databrickslabs\/ucx#install-ucx). You can also [install UCX on all of the workspaces in your Databricks account](https:\/\/github.com\/databrickslabs\/ucx#advanced-installing-ucx-on-all-workspaces-within-a-databricks-account).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/ucx.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Use the UCX utilities to upgrade your workspace to Unity Catalog\n##### Open the README notebook\n\nEvery installation creates a README notebook that provides a detailed description of all workflows and tasks, with quick links to the workflows and dashboards. 
See [Readme notebook](https:\/\/github.com\/databrickslabs\/ucx\/blob\/main\/README.md#readme-notebook).\n\n#### Use the UCX utilities to upgrade your workspace to Unity Catalog\n##### Step 1. Run the assessment workflow\n\nThe assessment workflow assesses the Unity Catalog compatibility of group identities, storage locations, storage credentials, access controls, and tables in the current workspace and provides the information necessary for planning the migration to Unity Catalog. The tasks in the assessment workflow can be executed in parallel or sequentially, depending on specified dependencies. After the assessment workflow finishes, an assessment dashboard is populated with findings and common recommendations. \nThe output of each workflow task is stored in Delta tables in the `$inventory_database` schema that you specify during installation. You can use these tables to perform further analysis and decision-making using an [assessment report](https:\/\/github.com\/databrickslabs\/ucx\/blob\/main\/docs\/assessment.md). You can run the assessment workflow multiple times to ensure that all incompatible entities are identified and accounted for before you start the migration process. \nYou can trigger the assessment workflow from the UCX-generated README notebook and the Databricks UI (Workflows > Jobs > [UCX] Assessment), or run the following Databricks CLI command: \n```\ndatabricks labs ucx ensure-assessment-run\n\n``` \nFor detailed instructions, see [Assessment workflow](https:\/\/github.com\/databrickslabs\/ucx?tab=readme-ov-file#assessment-workflow).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/ucx.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Use the UCX utilities to upgrade your workspace to Unity Catalog\n##### Step 2. Run the group migration workflow\n\nThe group migration workflow upgrades workspace-local groups to account-level groups to support Unity Catalog. 
It ensures that the appropriate account-level groups are available in the workspace and replicates all permissions. It also removes any unnecessary groups and permissions from the workspace. The tasks in the group migration workflow depend on the output of the assessment workflow. \nThe output of each workflow task is stored in Delta tables in the `$inventory_database` schema that you specify during installation. You can use these tables to perform further analysis and decision-making. You can run the group migration workflow multiple times to ensure that all groups are upgraded successfully and that all necessary permissions are assigned. \nFor information about running the group migration workflow, see your UCX-generated README notebook and [Group migration workflow](https:\/\/github.com\/databrickslabs\/ucx\/blob\/main\/README.md#group-migration-workflow) in the UCX readme.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/ucx.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Use the UCX utilities to upgrade your workspace to Unity Catalog\n##### Step 3. Run the table migration workflow\n\nThe table migration workflow upgrades tables from the Hive metastore to the Unity Catalog metastore. External tables in the Hive metastore are upgraded as external tables in Unity Catalog, using [SYNC](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-sync.html). Managed tables in the Hive metastore that are stored in workspace storage (also known as DBFS root) are upgraded as managed tables in Unity Catalog, using [DEEP CLONE](https:\/\/docs.databricks.com\/delta\/clone.html). \nHive managed tables must be in Delta or Parquet format to be upgraded. External Hive tables must be in one of the data formats listed in [External tables](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#external-table). 
\n### Run the preparatory commands \nTable migration includes a number of preparatory tasks that you run before you run the table migration workflow. You perform these tasks using the following Databricks CLI commands: \n* The `create-table-mapping` command, which creates a CSV file that maps a target Unity Catalog catalog, schema, and table to each Hive table that will be upgraded. You should review and update the mapping file before proceeding with the migration workflow.\n* The `create-uber-principal` command, which creates a service principal with read-only access to all storage used by the tables in this workspace. The workflow job compute resource uses this principal to upgrade the tables in the workspace. Deprovision this service principal when you are done with your upgrade.\n* (Optional) The `principal-prefix-access` command, which identifies the storage accounts and storage access credentials used by the Hive tables in the workspace.\n* (Optional) The `migrate-credentials` command, which creates Unity Catalog storage credentials from the storage access credentials identified by `principal-prefix-access`.\n* (Optional) The `migrate-locations` command, which creates Unity Catalog external locations from the storage locations identified by the assessment workflow, using the storage credentials created by `migrate-credentials`.\n* (Optional) The `create-catalogs-schemas` command, which creates Unity Catalog catalogs and schemas that will hold the upgraded tables. \nFor details, including additional table migration workflow commands and options, see [Table migration commands](https:\/\/github.com\/databrickslabs\/ucx\/blob\/main\/README.md#table-migration-commands) in the UCX readme. \n### Run the table migration \nOnce you\u2019ve run the preparatory tasks, you can run the table migration workflow from the UCX-generated README notebook or from **Workflows > Jobs** in the workspace UI. 
\nThe output of each workflow task is stored in Delta tables in the `$inventory_database` schema that you specify during installation. You can use these tables to perform further analysis and decision-making. You might need to run the table migration workflow multiple times to ensure that all tables are upgraded successfully. \nFor complete table migration instructions, see your UCX-generated README notebook and the [Table Migration Workflow](https:\/\/github.com\/databrickslabs\/ucx\/blob\/main\/README.md#table-migration-workflow) in the UCX readme.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/ucx.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Use the UCX utilities to upgrade your workspace to Unity Catalog\n##### Additional tools\n\nUCX also includes debugging tools and other utilities to help you succeed with your migration. For more information, see your UCX-generated README notebook and the [UCX project readme](https:\/\/github.com\/databrickslabs\/ucx\/blob\/main\/README.md).\n\n#### Use the UCX utilities to upgrade your workspace to Unity Catalog\n##### Upgrade your UCX installation\n\nThe UCX project is updated regularly. To upgrade your UCX installation to the latest version: \n1. Verify that UCX is installed. \n```\ndatabricks labs installed\n\nName Description Version\nucx Unity Catalog Migration Toolkit (UCX) 0.20.0\n\n```\n2. 
Run the upgrade: \n```\ndatabricks labs upgrade ucx\n\n```\n\n#### Use the UCX utilities to upgrade your workspace to Unity Catalog\n##### Get help\n\nFor help with the UCX CLI, run: \n```\ndatabricks labs ucx --help\n\n``` \nFor help with a specific UCX command, run: \n```\ndatabricks labs ucx <command> --help\n\n``` \nTo troubleshoot issues: \n* Run `--debug` with any command to enable [debug logs](https:\/\/github.com\/databrickslabs\/ucx\/blob\/main\/README.md#debug-logs).\n* Use the [debug notebook that is generated automatically by UCX](https:\/\/github.com\/databrickslabs\/ucx\/blob\/main\/README.md#debug-notebook). \nTo file an issue or feature request, file a [GitHub issue](https:\/\/github.com\/databrickslabs\/ucx\/issues\/new\/choose).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/ucx.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Use the UCX utilities to upgrade your workspace to Unity Catalog\n##### UCX release notes\n\nSee the [changelog](https:\/\/github.com\/databrickslabs\/ucx\/blob\/main\/CHANGELOG.md) in the UCX GitHub repo.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/ucx.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n### Manage privileges in Unity Catalog\n##### Manage Unity Catalog object ownership\n\nEach [securable object](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html#object-model) in Unity Catalog has an owner. The owner can be any principal: a user, service principal, or account group. The principal that creates an object becomes its initial owner. An object\u2019s owner has all privileges on the object, such as `SELECT` and `MODIFY` on a table, in addition to the permission to grant privileges to other principals. 
An object\u2019s owner has the ability to drop the object.\n\n##### Manage Unity Catalog object ownership\n###### Owner privileges\n\nOwners of an object are automatically granted all privileges on that object. In addition, object owners can grant privileges on the object itself and on all of its child objects. This means that owners of a schema do not automatically have all privileges on the tables in the schema, but they can grant themselves privileges on the tables in the schema.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/ownership.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n### Manage privileges in Unity Catalog\n##### Manage Unity Catalog object ownership\n###### Metastore and catalog ownership\n\nMetastore admins are the owners of the metastore. The metastore admin role is optional. Metastore admins can reassign ownership of the metastore by transferring the metastore admin role, see [Assign a metastore admin](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html#assign-metastore-admin). \nIf your workspace was enabled for Unity Catalog automatically, the workspace is attached to a metastore by default and a workspace catalog is created for your workspace in the metastore. Workspace admins are the default owners and can reassign ownership of the workspace catalog. In these workspaces, there is no metastore admin assigned by default, but account admins can grant metastore admin permissions if needed. See [Metastore admins](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html#metastore-admins). 
\nFor more information on admin privileges in Unity Catalog, see [Admin privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/ownership.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n### Manage privileges in Unity Catalog\n##### Manage Unity Catalog object ownership\n###### View an object\u2019s owner\n\n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Select the object, such as a catalog, schema, table, view, volume, external location, or storage credential.\n3. Click **Permissions**. \nRun the following SQL command in a notebook or SQL query editor. Replace the placeholder values: \n* `<securable-type>`: The type of securable, such as `CATALOG` or `TABLE`.\n* `<catalog>`: The parent catalog for a table or view.\n* `<schema>`: The parent schema for a table or view.\n* `<securable-name>`: The name of the securable, such as a table or view. \n```\nDESCRIBE <securable-type> EXTENDED <catalog>.<schema>.<securable-name>;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/ownership.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n### Manage privileges in Unity Catalog\n##### Manage Unity Catalog object ownership\n###### Transfer ownership\n\nObject ownership can be transferred to other principals by the current owner, a metastore admin, or the owner of the container (the catalog for a schema, the schema for a table). Delta Sharing share objects are an exception: principals with the `USE SHARE` and `SET SHARE PERMISSION` privileges can also transfer share ownership. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. 
Select the object, such as a catalog, schema, table, view, external location, or storage credential.\n3. Click **Permissions**.\n4. Click the blue pencil next to the **Owner**.\n5. Select a group, user, or service principal from the dropdown list.\n6. Click **Save**. \nRun the following SQL command in a notebook or SQL query editor. Replace the placeholder values: \n* `<securable-type>`: The type of securable object, such as `CATALOG` or `TABLE`. `METASTORE` is not supported as a securable object in this command. \n+ `<securable-name>`: The name of the securable.\n+ `<principal>` is a user, service principal (represented by its applicationId value), or group. You must enclose users, service principals, and group names that include [special characters](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-identifiers.html#delimited-identifiers) in backticks ( `` `` ). See [Principal](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-principal.html).\n```\nALTER <securable-type> <securable-name> OWNER TO <principal>;\n\n``` \nFor example, to transfer ownership of a table to the `accounting` group: \n```\nALTER TABLE orders OWNER TO `accounting`;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/ownership.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create a storage credential for connecting to Cloudflare R2\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes how to create a storage credential in Unity Catalog to connect to Cloudflare R2. Cloudflare R2 object storage incurs no egress fees. Replicating or migrating data that you share to R2 enables you to share data across clouds and regions without incurring egress fees. \nNote \nUnity Catalog supports two cloud storage options for Databricks on AWS: AWS S3 buckets and Cloudflare R2 buckets. 
Cloudflare R2 is intended primarily for Delta Sharing use cases in which you want to avoid cloud provider data egress fees. S3 is appropriate for most other use cases. See [Monitor and manage Delta Sharing egress costs (for providers)](https:\/\/docs.databricks.com\/data-sharing\/manage-egress.html) and [Create a storage credential for connecting to AWS S3](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html). \nTo use an R2 bucket as a storage location for data that is managed by Unity Catalog, you must create a storage credential that authorizes access to the R2 bucket and create an external location that references the storage credential and the bucket path: \n* **Storage credentials** encapsulate a long-term cloud credential that provides access to cloud storage.\n* **External locations** contain a reference to a storage credential and a cloud storage path. \nThis article focuses on creating a storage credential. \nFor more information, see [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials-r2.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create a storage credential for connecting to Cloudflare R2\n##### Requirements\n\n* Databricks workspace enabled for Unity Catalog.\n* Databricks Runtime 14.3 or above, or SQL warehouse 2024.15 or above. \nIf you encounter the error message `No FileSystem for scheme \"r2\u201d`, your compute is probably on an unsupported version.\n* Cloudflare account. See <https:\/\/dash.cloudflare.com\/sign-up>.\n* Cloudflare R2 Admin role. See the [Cloudflare roles documentation](https:\/\/developers.cloudflare.com\/fundamentals\/setup\/manage-members\/roles\/#account-scoped-roles).\n* `CREATE STORAGE CREDENTIAL` privilege on the Unity Catalog metastore attached to the workspace. 
Account admins and metastore admins have this privilege by default.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials-r2.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create a storage credential for connecting to Cloudflare R2\n##### Configure an R2 bucket\n\n1. Create a Cloudflare R2 bucket. \nYou can use the Cloudflare dashboard or the Cloudflare Wrangler tool. \nSee the [Cloudflare R2 \u201cGet started\u201d documentation](https:\/\/developers.cloudflare.com\/r2\/get-started\/) or the [Wrangler documentation](https:\/\/developers.cloudflare.com\/r2\/buckets\/create-buckets\/).\n2. Create an R2 API Token and apply it to the bucket. \nSee the [Cloudflare R2 API authentication documentation](https:\/\/developers.cloudflare.com\/r2\/api\/s3\/tokens\/). \nSet the following token properties: \n* **Permissions**: Object Read & Write. \nThis permission grants read and write access, which is required when you use R2 storage as a replication target, as described in [Use Cloudflare R2 replicas or migrate storage to R2](https:\/\/docs.databricks.com\/data-sharing\/manage-egress.html#r2). \nIf you want to enforce read-only access from Databricks to the R2 bucket, you can instead create a token that grants read access only. However, this may be unnecessary, because you can mark the storage credential as read-only, and any write access granted by this permission will be ignored.\n* **(Optional) TTL**: The length of time that you want to share the bucket data with the data recipients.\n* **(Optional) Client IP Address Filtering**: Select if you want to limit network access to specified recipient IP addresses. 
If this option is enabled, you must specify your recipients\u2019 IP addresses and you must allowlist the Databricks control plane NAT IP address for the workspace region. See [Outbound from Databricks control plane](https:\/\/docs.databricks.com\/resources\/supported-regions.html#outbound).\n3. Copy the R2 API token values: \n* Access Key ID\n* Secret Access Key\nImportant \nToken values are shown only once.\n4. On the R2 homepage, go to **Account details** and copy the R2 account ID.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials-r2.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create a storage credential for connecting to Cloudflare R2\n##### Create the storage credential\n\n1. In Databricks, log in to your workspace.\n2. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. Click the **+Add** button and select **Add a storage credential** from the menu. \nThis option does not appear if you don\u2019t have the `CREATE STORAGE CREDENTIAL` privilege.\n4. Select a **Credential Type** of **Cloudflare API Token**.\n5. Enter a name for the credential and the following values that you copied when you configured the R2 bucket: \n* **Account ID**\n* **Access key ID**\n* **Secret access key**\n6. (Optional) If you want users to have read-only access to the external locations that use this storage credential, in **Advanced options** select **Read only**. \nDo not select this option if you want to use the storage credential to access R2 storage that you are using as a replication target, as described in [Use Cloudflare R2 replicas or migrate storage to R2](https:\/\/docs.databricks.com\/data-sharing\/manage-egress.html#r2). \nFor more information, see [Mark a storage credential as read-only](https:\/\/docs.databricks.com\/connect\/unity-catalog\/manage-storage-credentials.html#read-only).\n7. Click **Create**.\n8. 
In the **Storage credential created** dialog, copy the **External ID**.\n9. (Optional) Bind the storage credential to specific workspaces. \nBy default, a storage credential can be used by any privileged user on any workspace attached to the metastore. If you want to allow access only from specific workspaces, go to the **Workspaces** tab and assign workspaces. See [(Optional) Assign a storage credential to specific workspaces](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html#workspace-binding).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials-r2.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create a storage credential for connecting to Cloudflare R2\n##### Next step: create the external location\n\nSee [Create an external location to connect cloud storage to Databricks](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials-r2.html"} +{"content":"# \n### Ingest data into a Databricks lakehouse\n\nDatabricks offers a variety of ways to help you ingest data into a lakehouse backed by Delta Lake. Databricks recommends using Auto Loader for incremental data ingestion from cloud object storage. The add data UI provides a number of options for quickly uploading local files or connecting to external data sources.\n\n### Ingest data into a Databricks lakehouse\n#### Run your first ETL workload\n\nIf you haven\u2019t used Auto Loader on Databricks, start with a tutorial. 
See [Run your first ETL workload on Databricks](https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html).\n\n### Ingest data into a Databricks lakehouse\n#### Auto Loader\n\n[Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html) incrementally and efficiently processes new data files as they arrive in cloud storage without additional setup. Auto Loader provides a Structured Streaming source called `cloudFiles`. Given an input directory path on the cloud file storage, the `cloudFiles` source automatically processes new files as they arrive, with the option of also processing existing files in that directory.\n\n### Ingest data into a Databricks lakehouse\n#### Automate ETL with Delta Live Tables and Auto Loader\n\nYou can simplify deployment of scalable, incremental ingestion infrastructure with Auto Loader and Delta Live Tables. Note that Delta Live Tables does not use the standard interactive execution found in notebooks, instead emphasizing deployment of infrastructure ready for production. \n* [Tutorial: Run your first ETL workload on Databricks](https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html)\n* [Ingest data using streaming tables (Python\/SQL notebook)](https:\/\/docs.databricks.com\/ingestion\/onboard-data.html) \n* [Load data using streaming tables in Databricks SQL](https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/index.html"} +{"content":"# \n### Ingest data into a Databricks lakehouse\n#### Upload local data files or connect external data sources\n\nYou can securely upload local data files or ingest data from external sources to create tables. 
See [Load data using the add data UI](https:\/\/docs.databricks.com\/ingestion\/add-data\/index.html).\n\n### Ingest data into a Databricks lakehouse\n#### Ingest data into Databricks using third-party tools\n\nDatabricks validates technology partner integrations that enable you to ingest data into Databricks. These integrations enable low-code, scalable data ingestion from a variety of sources into Databricks. See [Technology partners](https:\/\/docs.databricks.com\/integrations\/index.html). Some technology partners are featured in [Databricks Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/index.html), which provides a UI that simplifies connecting third-party tools to your lakehouse data.\n\n### Ingest data into a Databricks lakehouse\n#### COPY INTO\n\n[COPY INTO](https:\/\/docs.databricks.com\/ingestion\/copy-into\/index.html) allows SQL users to idempotently and incrementally ingest data from cloud object storage into Delta tables. It can be used in Databricks SQL, notebooks, and Databricks Jobs.\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/index.html"} +{"content":"# \n### Ingest data into a Databricks lakehouse\n#### When to use COPY INTO and when to use Auto Loader\n\nHere are a few things to consider when choosing between Auto Loader and `COPY INTO`: \n* If you\u2019re going to ingest files in the order of thousands, you can use `COPY INTO`. If you are expecting files in the order of millions or more over time, use Auto Loader. Auto Loader requires fewer total operations to discover files compared to `COPY INTO` and can split the processing into multiple batches, meaning that Auto Loader is less expensive and more efficient at scale.\n* If your data schema is going to evolve frequently, Auto Loader provides better primitives around schema inference and evolution. 
See [Configure schema inference and evolution in Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/schema.html) for more details.\n* Loading a subset of re-uploaded files can be a bit easier to manage with `COPY INTO`. With Auto Loader, it\u2019s harder to reprocess a select subset of files. However, you can use `COPY INTO` to reload the subset of files while an Auto Loader stream is running simultaneously. \n* For an even more scalable and robust file ingestion experience, Auto Loader enables SQL users to leverage streaming tables. See [Load data using streaming tables in Databricks SQL](https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html). \nFor a brief overview and demonstration of Auto Loader, as well as `COPY INTO`, watch the following YouTube video (2 minutes).\n\n### Ingest data into a Databricks lakehouse\n#### Review file metadata captured during data ingestion\n\nApache Spark automatically captures data about source files during data loading. Databricks lets you access this data with the [File metadata column](https:\/\/docs.databricks.com\/ingestion\/file-metadata-column.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/index.html"} +{"content":"# \n### Ingest data into a Databricks lakehouse\n#### Upload spreadsheet exports to Databricks\n\nUse the **Create or modify table from file upload** page to upload CSV, TSV, or JSON files. See [Create or modify a table using file upload](https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-data.html).\n\n### Ingest data into a Databricks lakehouse\n#### Migrate data applications to Databricks\n\nMigrate existing data applications to Databricks so you can work with data from many source systems on a single platform. 
See [Migrate data applications to Databricks](https:\/\/docs.databricks.com\/migration\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/index.html"} +{"content":"# Databricks reference documentation\n### Delta Lake API reference\n\n[Delta Lake](https:\/\/delta.io) is an [open source storage layer](https:\/\/github.com\/delta-io\/delta) that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. \nSee the Delta Lake website for [API references](https:\/\/docs.delta.io\/latest\/delta-apidoc.html#delta-spark) for Scala, Java, and Python. \nTo learn how to use the Delta Lake APIs on Databricks, see: \n* [What is Delta Lake?](https:\/\/docs.databricks.com\/delta\/index.html)\n* [Tutorial: Delta Lake](https:\/\/docs.databricks.com\/delta\/tutorial.html) \nSee also the [Delta Lake API documentation](https:\/\/docs.databricks.com\/delta\/index.html#delta-api) in the Databricks documentation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/reference\/delta-lake.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Manage files in Unity Catalog volumes with the Databricks JDBC Driver\n\nThis article describes how to upload, download, and delete files in Unity Catalog [volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html) using the [Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/index.html).\n\n######### Manage files in Unity Catalog volumes with the Databricks JDBC Driver\n########## Requirements\n\n* Databricks JDBC Driver versions 2.6.38 or above.\n* Add the `UseNativeQuery` property to your JDBC connection properties collection, setting its value to `1`. 
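As a minimal sketch of the requirement above, the following hypothetical helper builds a connection properties collection with `UseNativeQuery` set to `1`. The class name `VolumeConnectionProperties`, the `build` method, and its optional local-path argument are assumptions made for illustration and are not part of the Databricks JDBC Driver; only the property names `UseNativeQuery` and `StagingAllowedLocalPaths` come from this article.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of the Databricks JDBC Driver.
class VolumeConnectionProperties {

    // Returns connection properties for working with files in volumes.
    // allowedLocalPaths may be null or empty when no uploads are planned.
    static Properties build(String allowedLocalPaths) {
        Properties p = new Properties();
        // Named in this article as a requirement for volume file operations.
        p.put("UseNativeQuery", "1");
        // Needed only for uploads: a path, or a comma-separated list of paths.
        if (allowedLocalPaths != null && !allowedLocalPaths.isEmpty()) {
            p.put("StagingAllowedLocalPaths", allowedLocalPaths);
        }
        return p;
    }

    public static void main(String[] args) {
        Properties p = build("/tmp/");
        System.out.println(p.getProperty("UseNativeQuery"));
        System.out.println(p.getProperty("StagingAllowedLocalPaths"));
    }
}
```

Authentication properties still have to be added to the returned collection before it is passed to `DriverManager.getConnection(url, p)`, as in the snippets in this article.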
\nFor a complete Java code example showing how to run this article\u2019s code snippets in the context of setting up Databricks authentication and running SQL statements with the Databricks JDBC Driver, see [Authentication settings for the Databricks JDBC Driver](https:\/\/docs.databricks.com\/integrations\/jdbc\/authentication.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/volumes.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Manage files in Unity Catalog volumes with the Databricks JDBC Driver\n########## Upload a file\n\nTo upload a file to a volume, you must add the `StagingAllowedLocalPaths` property to your JDBC connection properties collection, setting this property\u2019s value to the path of the file to upload. To upload multiple files from separate locations, set this property to a comma-separated list of paths, for example `\/tmp\/,\/usr\/tmp\/`. \nTo overwrite the contents of any existing file in the specified upload location, add `OVERWRITE` to the end of the `PUT` statement. \nThe following Java code snippet shows how to upload a file to a volume. \n```\n\/\/ ...\np.put(\"UseNativeQuery\", \"1\");\np.put(\"StagingAllowedLocalPaths\", \"\/tmp\/\");\n\nConnection conn = DriverManager.getConnection(url, p);\nStatement stmt = conn.createStatement();\n\nstmt.executeQuery(\"PUT '\" +\n\"\/tmp\/my-data.csv\" +\n\"' INTO '\" +\n\"\/Volumes\/main\/default\/my-volume\/my-data.csv\" +\n\"' OVERWRITE\");\n\/\/ ...\n\n```\n\n######### Manage files in Unity Catalog volumes with the Databricks JDBC Driver\n########## Download a file\n\nThe following Java code snippet shows how to download a file from a volume. 
\n```\n\/\/ ...\np.put(\"UseNativeQuery\", \"1\");\n\nConnection conn = DriverManager.getConnection(url, p);\nStatement stmt = conn.createStatement();\n\nstmt.executeQuery(\"GET '\" +\n\"\/Volumes\/main\/default\/my-volume\/my-data.csv\" +\n\"' TO '\" +\n\"\/tmp\/my-downloaded-data.csv\" +\n\"'\");\n\/\/ ...\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/volumes.html"} +{"content":"# Develop on Databricks\n## Developer tools and guidance\n### Use a SQL connector\n#### driver\n##### or API\n###### Databricks ODBC and JDBC Drivers\n####### Databricks JDBC Driver\n######### Manage files in Unity Catalog volumes with the Databricks JDBC Driver\n########## Delete a file\n\nThe following Java code snippet shows how to delete a file from a volume. \n```\n\/\/ ...\np.put(\"UseNativeQuery\", \"1\");\n\nConnection conn = DriverManager.getConnection(url, p);\nStatement stmt = conn.createStatement();\n\nstmt.executeQuery(\"REMOVE '\" +\n\"\/Volumes\/main\/default\/my-volume\/my-data.csv\" +\n\"'\");\n\/\/ ...\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/integrations\/jdbc\/volumes.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/testing.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Unit testing for notebooks\n\nYou can use *unit testing* to help improve the quality and consistency of your notebooks\u2019 code. Unit testing is an approach to testing self-contained units of code, such as functions, early and often. This helps you find problems with your code faster, uncover mistaken assumptions about your code sooner, and streamline your overall coding efforts. \nThis article is an introduction to basic [unit testing](https:\/\/en.wikipedia.org\/wiki\/Unit_testing) with functions. 
Advanced concepts such as unit testing classes and interfaces, as well as the use of [stubs](https:\/\/en.wikipedia.org\/wiki\/Method_stub), [mocks](https:\/\/en.wikipedia.org\/wiki\/Mock_object), and [test harnesses](https:\/\/en.wikipedia.org\/wiki\/Test_harness), while also supported when unit testing for notebooks, are outside the scope of this article. This article also does not cover other kinds of testing methods, such as [integration testing](https:\/\/en.wikipedia.org\/wiki\/Integration_testing), [system testing](https:\/\/en.wikipedia.org\/wiki\/System_testing), [acceptance testing](https:\/\/en.wikipedia.org\/wiki\/Acceptance_testing), or [non-functional testing](https:\/\/en.wikipedia.org\/wiki\/Non-functional_testing) methods such as [performance testing](https:\/\/en.wikipedia.org\/wiki\/Software_performance_testing) or [usability testing](https:\/\/en.wikipedia.org\/wiki\/Usability_testing). \nThis article demonstrates the following: \n* How to organize functions and their unit tests.\n* How to write functions in Python, R, Scala, as well as user-defined functions in SQL, that are well-designed to be unit tested.\n* How to call these functions from Python, R, Scala, and SQL notebooks.\n* How to write unit tests in Python, R, and Scala by using the popular test frameworks [pytest](https:\/\/docs.pytest.org) for Python, [testthat](https:\/\/testthat.r-lib.org) for R, and [ScalaTest](https:\/\/www.scalatest.org) for Scala. Also how to write SQL that unit tests SQL user-defined functions (SQL UDFs).\n* How to run these unit tests from Python, R, Scala, and SQL notebooks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/testing.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Unit testing for notebooks\n##### Organize functions and unit tests\n\nThere are a few common approaches for organizing your functions and their unit tests with notebooks. Each approach has its benefits and challenges. 
\nFor Python, R, and Scala notebooks, common approaches include the following: \n* [Store functions and their unit tests outside of notebooks.](https:\/\/docs.databricks.com\/notebooks\/test-notebooks.html#separate-test-code-from-the-notebook) \n+ Benefits: You can call these functions within and outside of notebooks. Test frameworks are better designed to run tests outside of notebooks.\n+ Challenges: This approach is not supported for Scala notebooks. This approach also increases the number of files to track and maintain.\n* [Store functions in one notebook and their unit tests in a separate notebook.](https:\/\/docs.databricks.com\/notebooks\/test-notebooks.html#separate-test-code-from-the-notebook) \n+ Benefits: These functions are easier to reuse across notebooks.\n+ Challenges: The number of notebooks to track and maintain increases. These functions cannot be used outside of notebooks. These functions can also be more difficult to test outside of notebooks.\n* [Store functions and their unit tests within the same notebook.](https:\/\/docs.databricks.com\/notebooks\/test-notebooks.html) \n+ Benefits: Functions and their unit tests are stored within a single notebook for easier tracking and maintenance.\n+ Challenges: These functions can be more difficult to reuse across notebooks. These functions cannot be used outside of notebooks. These functions can also be more difficult to test outside of notebooks. \nFor Python and R notebooks, Databricks recommends storing functions and their unit tests outside of notebooks. For Scala notebooks, Databricks recommends including functions in one notebook and their unit tests in a separate notebook. \nFor SQL notebooks, Databricks recommends that you store functions as SQL user-defined functions (SQL UDFs) in your schemas (also known as databases). 
You can then call these SQL UDFs and their unit tests from SQL notebooks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/testing.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Unit testing for notebooks\n##### Write functions\n\nThis section describes a simple set of example functions that determine the following: \n* Whether a table exists in a database.\n* Whether a column exists in a table.\n* How many rows exist in a column for a value within that column. \nThese functions are intended to be simple, so that you can focus on the unit testing details in this article rather than focus on the functions themselves. \nTo get the best unit testing results, a function should return a single predictable outcome and be of a single data type. For example, to check whether something exists, the function should return a boolean value of true or false. To return the number of rows that exist, the function should return a non-negative, whole number. It should not, in the first example, return either false if something does not exist or the thing itself if it does exist. Likewise, for the second example, it should not return either the number of rows that exist or false if no rows exist. \nYou can add these functions to an existing Databricks workspace as follows, in Python, R, Scala, or SQL. \nThe following code assumes you have [Set up Databricks Git folders (Repos)](https:\/\/docs.databricks.com\/repos\/repos-setup.html), [added a repo](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html), and have the repo open in your Databricks workspace. \n[Create a file](https:\/\/docs.databricks.com\/files\/workspace-basics.html#create-a-new-file) named `myfunctions.py` within the repo, and add the following contents to the file. Other examples in this article expect this file to be named `myfunctions.py`. You can use different names for your own files. 
\n```\nimport pyspark\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.functions import col\n\n# Because this file is not a Databricks notebook, you\n# must create a Spark session. Databricks notebooks\n# create a Spark session for you by default.\nspark = SparkSession.builder \\\n    .appName('integrity-tests') \\\n    .getOrCreate()\n\n# Does the specified table exist in the specified database?\ndef tableExists(tableName, dbName):\n    return spark.catalog.tableExists(f\"{dbName}.{tableName}\")\n\n# Does the specified column exist in the given DataFrame?\ndef columnExists(dataFrame, columnName):\n    if columnName in dataFrame.columns:\n        return True\n    else:\n        return False\n\n# How many rows are there for the specified value in the specified column\n# in the given DataFrame?\ndef numRowsInColumnForValue(dataFrame, columnName, columnValue):\n    df = dataFrame.filter(col(columnName) == columnValue)\n\n    return df.count()\n\n``` \nThe following code assumes you have [Set up Databricks Git folders (Repos)](https:\/\/docs.databricks.com\/repos\/repos-setup.html), [added a repo](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html), and have the repo open in your Databricks workspace. \n[Create a file](https:\/\/docs.databricks.com\/files\/workspace-basics.html#create-a-new-file) named `myfunctions.r` within the repo, and add the following contents to the file. Other examples in this article expect this file to be named `myfunctions.r`. You can use different names for your own files. 
\n```\nlibrary(SparkR)\n\n# Does the specified table exist in the specified database?\ntable_exists <- function(table_name, db_name) {\ntableExists(paste(db_name, \".\", table_name, sep = \"\"))\n}\n\n# Does the specified column exist in the given DataFrame?\ncolumn_exists <- function(dataframe, column_name) {\ncolumn_name %in% colnames(dataframe)\n}\n\n# How many rows are there for the specified value in the specified column\n# in the given DataFrame?\nnum_rows_in_column_for_value <- function(dataframe, column_name, column_value) {\ndf = filter(dataframe, dataframe[[column_name]] == column_value)\n\ncount(df)\n}\n\n``` \nCreate a [Scala notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) named `myfunctions` with the following contents. Other examples in this article expect this notebook to be named `myfunctions`. You can use different names for your own notebooks. \n```\nimport org.apache.spark.sql.DataFrame\nimport org.apache.spark.sql.functions.col\n\n\/\/ Does the specified table exist in the specified database?\ndef tableExists(tableName: String, dbName: String) : Boolean = {\nreturn spark.catalog.tableExists(dbName + \".\" + tableName)\n}\n\n\/\/ Does the specified column exist in the given DataFrame?\ndef columnExists(dataFrame: DataFrame, columnName: String) : Boolean = {\nval nameOfColumn = null\n\nfor(nameOfColumn <- dataFrame.columns) {\nif (nameOfColumn == columnName) {\nreturn true\n}\n}\n\nreturn false\n}\n\n\/\/ How many rows are there for the specified value in the specified column\n\/\/ in the given DataFrame?\ndef numRowsInColumnForValue(dataFrame: DataFrame, columnName: String, columnValue: String) : Long = {\nval df = dataFrame.filter(col(columnName) === columnValue)\n\nreturn df.count()\n}\n\n``` \nThe following code assumes you have the third-party sample dataset [diamonds](https:\/\/docs.databricks.com\/discover\/databricks-datasets.html) within a schema named `default` within a catalog named `main` 
that is accessible from your Databricks workspace. If the catalog or schema that you want to use has a different name, then change one or both of the following `USE` statements to match. \nCreate a [SQL notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html#create-a-notebook) and add the following contents to this new notebook. Then [attach](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) the notebook to a cluster and [run](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html) the notebook to add the following SQL UDFs to the specified catalog and schema. \nNote \nThe SQL UDFs `table_exists` and `column_exists` work only with Unity Catalog. SQL UDF support for Unity Catalog is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \n```\nUSE CATALOG main;\nUSE SCHEMA default;\n\nCREATE OR REPLACE FUNCTION table_exists(catalog_name STRING,\ndb_name STRING,\ntable_name STRING)\nRETURNS BOOLEAN\nRETURN if(\n(SELECT count(*) FROM system.information_schema.tables\nWHERE table_catalog = table_exists.catalog_name\nAND table_schema = table_exists.db_name\nAND table_name = table_exists.table_name) > 0,\ntrue,\nfalse\n);\n\nCREATE OR REPLACE FUNCTION column_exists(catalog_name STRING,\ndb_name STRING,\ntable_name STRING,\ncolumn_name STRING)\nRETURNS BOOLEAN\nRETURN if(\n(SELECT count(*) FROM system.information_schema.columns\nWHERE table_catalog = column_exists.catalog_name\nAND table_schema = column_exists.db_name\nAND table_name = column_exists.table_name\nAND column_name = column_exists.column_name) > 0,\ntrue,\nfalse\n);\n\nCREATE OR REPLACE FUNCTION num_rows_for_clarity_in_diamonds(clarity_value STRING)\nRETURNS BIGINT\nRETURN SELECT count(*)\nFROM main.default.diamonds\nWHERE clarity = clarity_value\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/testing.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Unit testing for 
notebooks\n##### Call functions\n\nThis section describes code that calls the preceding functions. You could use these functions, for example, to count the number of rows in a table where a specified value exists within a specified column. However, you would want to check whether the table actually exists, and whether the column actually exists in that table, before you proceed. The following code checks for these conditions. \nIf you added the functions from the preceding section to your Databricks workspace, you can call these functions from your workspace as follows. \n[Create a Python notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html) in the same folder as the preceding `myfunctions.py` file in your repo, and add the following contents to the notebook. Change the variable values for the table name, the schema (database) name, the column name, and the column value as needed. Then [attach](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) the notebook to a cluster and [run](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html) the notebook to see the results. 
\n```\nfrom myfunctions import *\n\ntableName = \"diamonds\"\ndbName = \"default\"\ncolumnName = \"clarity\"\ncolumnValue = \"VVS2\"\n\n# If the table exists in the specified database...\nif tableExists(tableName, dbName):\n\n    df = spark.sql(f\"SELECT * FROM {dbName}.{tableName}\")\n\n    # And the specified column exists in that table...\n    if columnExists(df, columnName):\n        # Then report the number of rows for the specified value in that column.\n        numRows = numRowsInColumnForValue(df, columnName, columnValue)\n\n        print(f\"There are {numRows} rows in '{tableName}' where '{columnName}' equals '{columnValue}'.\")\n    else:\n        print(f\"Column '{columnName}' does not exist in table '{tableName}' in schema (database) '{dbName}'.\")\nelse:\n    print(f\"Table '{tableName}' does not exist in schema (database) '{dbName}'.\")\n\n``` \n[Create an R notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html) in the same folder as the preceding `myfunctions.r` file in your repo, and add the following contents to the notebook. Change the variable values for the table name, the schema (database) name, the column name, and the column value as needed. Then [attach](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) the notebook to a cluster and [run](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html) the notebook to see the results. 
\n```\nlibrary(SparkR)\nsource(\"myfunctions.r\")\n\ntable_name <- \"diamonds\"\ndb_name <- \"default\"\ncolumn_name <- \"clarity\"\ncolumn_value <- \"VVS2\"\n\n# If the table exists in the specified database...\nif (table_exists(table_name, db_name)) {\n\ndf = sql(paste(\"SELECT * FROM \", db_name, \".\", table_name, sep = \"\"))\n\n# And the specified column exists in that table...\nif (column_exists(df, column_name)) {\n# Then report the number of rows for the specified value in that column.\nnum_rows = num_rows_in_column_for_value(df, column_name, column_value)\n\nprint(paste(\"There are \", num_rows, \" rows in table '\", table_name, \"' where '\", column_name, \"' equals '\", column_value, \"'.\", sep = \"\"))\n} else {\nprint(paste(\"Column '\", column_name, \"' does not exist in table '\", table_name, \"' in schema (database) '\", db_name, \"'.\", sep = \"\"))\n}\n\n} else {\nprint(paste(\"Table '\", table_name, \"' does not exist in schema (database) '\", db_name, \"'.\", sep = \"\"))\n}\n\n``` \nCreate another Scala notebook in the same folder as the preceding `myfunctions` Scala notebook, and add the following contents to this new notebook. \nIn this new notebook\u2019s first cell, add the following code, which calls the [%run](https:\/\/docs.databricks.com\/notebooks\/notebook-workflows.html#run) magic. This magic makes the contents of the `myfunctions` notebook available to your new notebook. \n```\n%run .\/myfunctions\n\n``` \nIn this new notebook\u2019s second cell, add the following code. Change the variable values for the table name, the schema (database) name, the column name, and the column value as needed. Then [attach](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#attach) the notebook to a cluster and [run](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html) the notebook to see the results. 
\n```\nval tableName = \"diamonds\"\nval dbName = \"default\"\nval columnName = \"clarity\"\nval columnValue = \"VVS2\"\n\n\/\/ If the table exists in the specified database...\nif (tableExists(tableName, dbName)) {\n\nval df = spark.sql(\"SELECT * FROM \" + dbName + \".\" + tableName)\n\n\/\/ And the specified column exists in that table...\nif (columnExists(df, columnName)) {\n\/\/ Then report the number of rows for the specified value in that column.\nval numRows = numRowsInColumnForValue(df, columnName, columnValue)\n\nprintln(\"There are \" + numRows + \" rows in '\" + tableName + \"' where '\" + columnName + \"' equals '\" + columnValue + \"'.\")\n} else {\nprintln(\"Column '\" + columnName + \"' does not exist in table '\" + tableName + \"' in database '\" + dbName + \"'.\")\n}\n\n} else {\nprintln(\"Table '\" + tableName + \"' does not exist in database '\" + dbName + \"'.\")\n}\n\n``` \nAdd the following code to a new cell in the preceding notebook or to a cell in a separate notebook. Change the schema or catalog names if necessary to match yours, and then run this cell to see the results. 
\n```\nSELECT CASE\n-- If the table exists in the specified catalog and schema...\nWHEN\ntable_exists(\"main\", \"default\", \"diamonds\")\nTHEN\n-- And the specified column exists in that table...\n(SELECT CASE\nWHEN\ncolumn_exists(\"main\", \"default\", \"diamonds\", \"clarity\")\nTHEN\n-- Then report the number of rows for the specified value in that column.\nprintf(\"There are %d rows in table 'main.default.diamonds' where 'clarity' equals 'VVS2'.\",\nnum_rows_for_clarity_in_diamonds(\"VVS2\"))\nELSE\nprintf(\"Column 'clarity' does not exist in table 'main.default.diamonds'.\")\nEND)\nELSE\nprintf(\"Table 'main.default.diamonds' does not exist.\")\nEND\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/testing.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Unit testing for notebooks\n##### Write unit tests\n\nThis section describes code that tests each of the functions that are described toward the beginning of this article. If you make any changes to functions in the future, you can use unit tests to determine whether those functions still work as you expect them to. \nIf you added the functions toward the beginning of this article to your Databricks workspace, you can add unit tests for these functions to your workspace as follows. \nCreate another file named `test_myfunctions.py` in the same folder as the preceding `myfunctions.py` file in your repo, and add the following contents to the file. By default, `pytest` looks for `.py` files whose names start with `test_` (or end with `_test`) to test. Similarly, by default, `pytest` looks inside of these files for functions whose names start with `test_` to test. \nIn general, it is a best practice to *not* run unit tests against functions that work with data in production. This is especially important for functions that add, remove, or otherwise change data. 
To protect your production data from being compromised by your unit tests in unexpected ways, you should run unit tests against non-production data. One common approach is to create fake data that is as close as possible to the production data. The following code example creates fake data for the unit tests to run against. \n```\nimport pytest\nimport pyspark\nfrom myfunctions import *\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType\n\ntableName = \"diamonds\"\ndbName = \"default\"\ncolumnName = \"clarity\"\ncolumnValue = \"SI2\"\n\n# Because this file is not a Databricks notebook, you\n# must create a Spark session. Databricks notebooks\n# create a Spark session for you by default.\nspark = SparkSession.builder \\\n    .appName('integrity-tests') \\\n    .getOrCreate()\n\n# Create fake data for the unit tests to run against.\n# In general, it is a best practice to not run unit tests\n# against functions that work with data in production.\nschema = StructType([ \\\n    StructField(\"_c0\", IntegerType(), True), \\\n    StructField(\"carat\", FloatType(), True), \\\n    StructField(\"cut\", StringType(), True), \\\n    StructField(\"color\", StringType(), True), \\\n    StructField(\"clarity\", StringType(), True), \\\n    StructField(\"depth\", FloatType(), True), \\\n    StructField(\"table\", IntegerType(), True), \\\n    StructField(\"price\", IntegerType(), True), \\\n    StructField(\"x\", FloatType(), True), \\\n    StructField(\"y\", FloatType(), True), \\\n    StructField(\"z\", FloatType(), True), \\\n])\n\ndata = [ (1, 0.23, \"Ideal\", \"E\", \"SI2\", 61.5, 55, 326, 3.95, 3.98, 2.43 ), \\\n         (2, 0.21, \"Premium\", \"E\", \"SI1\", 59.8, 61, 326, 3.89, 3.84, 2.31 ) ]\n\ndf = spark.createDataFrame(data, schema)\n\n# Does the table exist?\ndef test_tableExists():\n    assert tableExists(tableName, dbName) is True\n\n# Does the column exist?\ndef test_columnExists():\n    assert columnExists(df, columnName) is True\n\n# Is there at least one 
row for the value in the specified column?\ndef test_numRowsInColumnForValue():\n    assert numRowsInColumnForValue(df, columnName, columnValue) > 0\n\n``` \nCreate another file named `test_myfunctions.r` in the same folder as the preceding `myfunctions.r` file in your repo, and add the following contents to the file. By default, `testthat` looks for `.r` files whose names start with `test` to test. \nIn general, it is a best practice to *not* run unit tests against functions that work with data in production. This is especially important for functions that add, remove, or otherwise change data. To protect your production data from being compromised by your unit tests in unexpected ways, you should run unit tests against non-production data. One common approach is to create fake data that is as close as possible to the production data. The following code example creates fake data for the unit tests to run against. \n```\nlibrary(testthat)\nsource(\"myfunctions.r\")\n\ntable_name <- \"diamonds\"\ndb_name <- \"default\"\ncolumn_name <- \"clarity\"\ncolumn_value <- \"SI2\"\n\n# Create fake data for the unit tests to run against.\n# In general, it is a best practice to not run unit tests\n# against functions that work with data in production.\nschema <- structType(\nstructField(\"_c0\", \"integer\"),\nstructField(\"carat\", \"float\"),\nstructField(\"cut\", \"string\"),\nstructField(\"color\", \"string\"),\nstructField(\"clarity\", \"string\"),\nstructField(\"depth\", \"float\"),\nstructField(\"table\", \"integer\"),\nstructField(\"price\", \"integer\"),\nstructField(\"x\", \"float\"),\nstructField(\"y\", \"float\"),\nstructField(\"z\", \"float\"))\n\ndata <- list(list(as.integer(1), 0.23, \"Ideal\", \"E\", \"SI2\", 61.5, as.integer(55), as.integer(326), 3.95, 3.98, 2.43),\nlist(as.integer(2), 0.21, \"Premium\", \"E\", \"SI1\", 59.8, as.integer(61), as.integer(326), 3.89, 3.84, 2.31))\n\ndf <- createDataFrame(data, schema)\n\n# Does the table exist?\ntest_that (\"The table 
exists.\", {\nexpect_true(table_exists(table_name, db_name))\n})\n\n# Does the column exist?\ntest_that (\"The column exists in the table.\", {\nexpect_true(column_exists(df, column_name))\n})\n\n# Is there at least one row for the value in the specified column?\ntest_that (\"There is at least one row in the query result.\", {\nexpect_true(num_rows_in_column_for_value(df, column_name, column_value) > 0)\n})\n\n``` \nCreate another Scala notebook in the same folder as the preceding `myfunctions` Scala notebook, and add the following contents to this new notebook. \nIn the new notebook\u2019s first cell, add the following code, which calls the `%run` magic. This magic makes the contents of the `myfunctions` notebook available to your new notebook. \n```\n%run .\/myfunctions\n\n``` \nIn the second cell, add the following code. This code defines your unit tests and specifies how to run them. \nIn general, it is a best practice to *not* run unit tests against functions that work with data in production. This is especially important for functions that add, remove, or otherwise change data. To protect your production data from being compromised by your unit tests in unexpected ways, you should run unit tests against non-production data. One common approach is to create fake data that is as close as possible to the production data. The following code example creates fake data for the unit tests to run against. 
\n```\nimport org.scalatest._\nimport org.apache.spark.sql.types.{StructType, StructField, IntegerType, FloatType, StringType}\nimport scala.collection.JavaConverters._\n\nclass DataTests extends AsyncFunSuite {\n\nval tableName = \"diamonds\"\nval dbName = \"default\"\nval columnName = \"clarity\"\nval columnValue = \"SI2\"\n\n\/\/ Create fake data for the unit tests to run against.\n\/\/ In general, it is a best practice to not run unit tests\n\/\/ against functions that work with data in production.\nval schema = StructType(Array(\nStructField(\"_c0\", IntegerType),\nStructField(\"carat\", FloatType),\nStructField(\"cut\", StringType),\nStructField(\"color\", StringType),\nStructField(\"clarity\", StringType),\nStructField(\"depth\", FloatType),\nStructField(\"table\", IntegerType),\nStructField(\"price\", IntegerType),\nStructField(\"x\", FloatType),\nStructField(\"y\", FloatType),\nStructField(\"z\", FloatType)\n))\n\nval data = Seq(\nRow(1, 0.23, \"Ideal\", \"E\", \"SI2\", 61.5, 55, 326, 3.95, 3.98, 2.43),\nRow(2, 0.21, \"Premium\", \"E\", \"SI1\", 59.8, 61, 326, 3.89, 3.84, 2.31)\n).asJava\n\nval df = spark.createDataFrame(data, schema)\n\n\/\/ Does the table exist?\ntest(\"The table exists\") {\nassert(tableExists(tableName, dbName) == true)\n}\n\n\/\/ Does the column exist?\ntest(\"The column exists\") {\nassert(columnExists(df, columnName) == true)\n}\n\n\/\/ Is there at least one row for the value in the specified column?\ntest(\"There is at least one matching row\") {\nassert(numRowsInColumnForValue(df, columnName, columnValue) > 0)\n}\n}\n\nnocolor.nodurations.nostacks.stats.run(new DataTests)\n\n``` \nNote \nThis code example uses the `FunSuite` style of testing in ScalaTest. For other available testing styles, see [Selecting testing styles for your project](https:\/\/www.scalatest.org\/user_guide\/selecting_a_style). 
\nBefore you add unit tests, you should be aware that in general, it is a best practice to *not* run unit tests against functions that work with data in production. This is especially important for functions that add, remove, or otherwise change data. To protect your production data from being compromised by your unit tests in unexpected ways, you should run unit tests against non-production data. One common approach is to run unit tests against [views](https:\/\/docs.databricks.com\/lakehouse\/data-objects.html#view) instead of tables. \nTo create a view, you can call the [CREATE VIEW](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-view.html) command from a new cell in either the preceding notebook or a separate notebook. The following example assumes that you have an existing table named `diamonds` within a schema (database) named `default` within a catalog named `main`. Change these names to match your own as needed, and then run only that cell. \n```\nUSE CATALOG main;\nUSE SCHEMA default;\n\nCREATE VIEW view_diamonds AS\nSELECT * FROM diamonds;\n\n``` \nAfter you create the view, add each of the following `SELECT` statements to its own new cell in the preceding notebook or to its own new cell in a separate notebook. Change the names to match your own as needed. 
\n```\nSELECT if(table_exists(\"main\", \"default\", \"view_diamonds\"),\nprintf(\"PASS: The table 'main.default.view_diamonds' exists.\"),\nprintf(\"FAIL: The table 'main.default.view_diamonds' does not exist.\"));\n\nSELECT if(column_exists(\"main\", \"default\", \"view_diamonds\", \"clarity\"),\nprintf(\"PASS: The column 'clarity' exists in the table 'main.default.view_diamonds'.\"),\nprintf(\"FAIL: The column 'clarity' does not exist in the table 'main.default.view_diamonds'.\"));\n\nSELECT if(num_rows_for_clarity_in_diamonds(\"VVS2\") > 0,\nprintf(\"PASS: The table 'main.default.view_diamonds' has at least one row where the column 'clarity' equals 'VVS2'.\"),\nprintf(\"FAIL: The table 'main.default.view_diamonds' does not have at least one row where the column 'clarity' equals 'VVS2'.\"));\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/testing.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Unit testing for notebooks\n##### Run unit tests\n\nThis section describes how to run the unit tests that you coded in the preceding section. When you run the unit tests, you get results showing which unit tests passed and failed. \nIf you added the unit tests from the preceding section to your Databricks workspace, you can run these unit tests from your workspace. You can run these unit tests either [manually](https:\/\/docs.databricks.com\/notebooks\/run-notebook.html) or [on a schedule](https:\/\/docs.databricks.com\/notebooks\/schedule-notebook-jobs.html). \nCreate a Python notebook in the same folder as the preceding `test_myfunctions.py` file in your repo, and add the following contents. \nIn the new notebook\u2019s first cell, add the following code, and then run the cell, which calls the `%pip` magic. This magic installs `pytest`. \n```\n%pip install pytest\n\n``` \nIn the second cell, add the following code and then run the cell. Results show which unit tests passed and failed. 
\n```\nimport pytest\nimport sys\n\n# Skip writing pyc files on a readonly filesystem.\nsys.dont_write_bytecode = True\n\n# Run pytest.\nretcode = pytest.main([\".\", \"-v\", \"-p\", \"no:cacheprovider\"])\n\n# Fail the cell execution if there are any test failures.\nassert retcode == 0, \"The pytest invocation failed. See the log for details.\"\n\n``` \nCreate an R notebook in the same folder as the preceding `test_myfunctions.r` file in your repo, and add the following contents. \nIn the first cell, add the following code, and then run the cell, which calls the `install.packages` function. This function installs `testthat`. \n```\ninstall.packages(\"testthat\")\n\n``` \nIn the second cell, add the following code, and then run the cell. Results show which unit tests passed and failed. \n```\nlibrary(testthat)\nsource(\"myfunctions.r\")\n\ntest_dir(\".\", reporter = \"tap\")\n\n``` \nFor Scala, run the first and then the second cell in the Scala notebook from the preceding section. Results show which unit tests passed and failed. \nFor SQL, run each of the three `SELECT` cells in the notebook from the preceding section. Results show whether each unit test passed or failed. \nIf you no longer need the view after you run your unit tests, you can delete the view. To delete this view, you can add the following code to a new cell within one of the preceding notebooks and then run only that cell. \n```\nDROP VIEW view_diamonds;\n\n``` \nTip \nYou can view the results of your notebook runs (including unit test results) in your cluster\u2019s [driver logs](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#driver-logs). You can also specify a location for your cluster\u2019s [log delivery](https:\/\/docs.databricks.com\/compute\/configure.html#cluster-log-delivery). \nYou can set up a continuous integration and continuous delivery or deployment (CI\/CD) system, such as GitHub Actions, to automatically run your unit tests whenever your code changes. 
For an example, see the coverage of GitHub Actions in [Software engineering best practices for notebooks](https:\/\/docs.databricks.com\/notebooks\/best-practices.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/testing.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Unit testing for notebooks\n##### Additional resources\n\n### pytest \n* [pytest homepage](https:\/\/docs.pytest.org)\n* [pytest how-to guides](https:\/\/docs.pytest.org\/en\/7.1.x\/how-to\/index.html)\n* [pytest reference guides](https:\/\/docs.pytest.org\/en\/7.1.x\/reference\/index.html)\n* [Software engineering best practices for notebooks](https:\/\/docs.databricks.com\/notebooks\/best-practices.html) \n### testthat \n* [testthat homepage](https:\/\/testthat.r-lib.org\/)\n* [testthat function reference](https:\/\/testthat.r-lib.org\/reference\/index.html) \n### ScalaTest \n* [ScalaTest homepage](https:\/\/www.scalatest.org)\n* [ScalaTest User Guide](https:\/\/www.scalatest.org\/user_guide)\n* [ScalaTest\u2019s Scaladoc documentation](https:\/\/www.scalatest.org\/scaladoc) \n### SQL \n* [CREATE FUNCTION (SQL and Python)](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-sql-function.html)\n* [CREATE VIEW](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-view.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/testing.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Data lakehouse architecture: Databricks well-architected framework\n\nThis set of data lakehouse architecture articles provides principles and best practices for the implementation and operation of a lakehouse using Databricks.\n\n### Data lakehouse architecture: Databricks well-architected framework\n#### Databricks well-architected framework for the lakehouse\n\n![Well-architected framework: data lakehouse 
diagram.](https:\/\/docs.databricks.com\/_images\/well-architected-lakehouse.png) \nThe *well-architected lakehouse* consists of 7 pillars that describe different areas of concern for the implementation of a data lakehouse in the cloud: \n* **Data governance** \nThe oversight to ensure that data brings value and supports your business strategy.\n* **Interoperability and usability** \nThe ability of the lakehouse to interact with users and other systems.\n* **Operational excellence** \nAll operations processes that keep the lakehouse running in production.\n* **Security, privacy, compliance** \nProtect the Databricks application, customer workloads, and customer data from threats.\n* **Reliability** \nThe ability of a system to recover from failures and continue to function.\n* **Performance efficiency** \nThe ability of a system to adapt to changes in load.\n* **Cost optimization** \nManaging costs to maximize the value delivered. \nThe *well-architected lakehouse* extends the [AWS Well-Architected Framework](https:\/\/docs.aws.amazon.com\/wellarchitected\/latest\/framework) to the Databricks Data Intelligence Platform and shares the pillars \u201c*Operational Excellence*\u201d, \u201c*Security*\u201d (as \u201c*Security, privacy, compliance*\u201d), \u201c*Reliability*\u201d, \u201c*Performance Efficiency*\u201d and \u201c*Cost Optimization*\u201d. \nFor these five pillars, the principles and best practices of the cloud framework still apply to the lakehouse. 
The *well-architected lakehouse* extends these with principles and best practices that are specific to the lakehouse and important for building an effective and efficient lakehouse.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/well-architected.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Data lakehouse architecture: Databricks well-architected framework\n#### Data Governance and Interoperability & Usability in lakehouse architectures\n\nThe pillars \u201c*Data Governance*\u201d and \u201c*Interoperability and Usability*\u201d cover concerns specific to the lakehouse. \nData governance encapsulates the policies and practices implemented to securely manage the data assets within an organization. One of the fundamental aspects of a lakehouse is centralized data governance: The lakehouse unifies data warehousing and AI use cases on a single platform. This simplifies the modern data stack by eliminating the data silos that traditionally separate and complicate data engineering, analytics, BI, data science, and machine learning. To simplify data governance, the lakehouse offers a unified governance solution for data, analytics and AI. By minimizing the copies of your data and moving to a single data processing layer where all your data governance controls can run together, you improve your chances of staying in compliance and detecting a data breach. \nAnother important tenet of the lakehouse is to provide a great user experience for all the personas that work with it, and to be able to interact with a wide ecosystem of external systems. AWS already has a variety of data tools that perform most tasks a data-driven enterprise might need. However, these tools must be properly assembled to provide all the functionality, with each service offering a different user experience. 
This approach can lead to high implementation costs and typically does not provide the same user experience as a native lakehouse platform: Users are limited by inconsistencies between tools and a lack of collaboration capabilities, and often have to go through complex processes to gain access to the system and thus to the data. \nAn integrated lakehouse, on the other hand, provides a consistent user experience across all workloads and therefore increases usability. This reduces training and onboarding costs and improves collaboration between functions. In addition, new features are added automatically over time to further improve the user experience, without the need to invest internal resources and budgets. \nA multicloud approach can be a deliberate strategy of a company or the result of mergers and acquisitions or independent business units selecting different cloud providers. In this case, using a multicloud lakehouse results in a unified user experience across all clouds. This reduces the proliferation of systems across the enterprise, which in turn reduces the skill and training requirements of employees involved in data-driven tasks. \nFinally, in a networked world with cross-company business processes, systems must work together as seamlessly as possible. 
The degree of interoperability is a crucial criterion here, and the most recent data, as a core asset of any business, must flow securely between internal and external partners\u2019 systems.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/well-architected.html"} +{"content":"# Introduction to the well-architected data lakehouse\n### Data lakehouse architecture: Databricks well-architected framework\n#### Principles and best practices\n\n* [Data governance](https:\/\/docs.databricks.com\/lakehouse-architecture\/data-governance\/index.html)\n* [Interoperability & usability](https:\/\/docs.databricks.com\/lakehouse-architecture\/interoperability-and-usability\/index.html)\n* [Operational excellence](https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/index.html)\n* [Security, compliance & privacy](https:\/\/docs.databricks.com\/lakehouse-architecture\/security-compliance-and-privacy\/index.html)\n* [Reliability](https:\/\/docs.databricks.com\/lakehouse-architecture\/reliability\/index.html)\n* [Performance efficiency](https:\/\/docs.databricks.com\/lakehouse-architecture\/performance-efficiency\/index.html)\n* [Cost optimization](https:\/\/docs.databricks.com\/lakehouse-architecture\/cost-optimization\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/well-architected.html"} +{"content":"# Model serving with Databricks\n### Monitor model quality and endpoint health\n\nDatabricks Model Serving provides advanced tooling for monitoring the quality and health of models and their deployments. The following table is an overview of each monitoring tool available. \n| **Tool** | **Description** | **Purpose** | **Access** |\n| --- | --- | --- | --- |\n| [Service logs](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/logs) | Captures `stdout` and `stderr` streams from the model serving endpoint. | Useful for debugging during model deployment. 
Use `print(..., flush=True)` for immediate display in the logs. | Accessible using the **Logs** tab in the Serving UI. Logs are streamed in real time and can be exported through the API. |\n| [Build logs](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/buildlogs) | Displays output from the process that automatically creates a production-ready Python environment for the model serving endpoint. | Useful for diagnosing model deployment and dependency issues. | Available upon completion of the model serving build under **Build logs** in the **Logs** tab. Logs can be exported through the API. |\n| [Endpoint health metrics](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/metrics-export-serving-endpoint.html) | Provides insights into infrastructure metrics like latency, request rate, error rate, CPU usage, and memory usage. | Important for understanding the performance and health of the serving infrastructure. | Available by default in the Serving UI for the last 14 days. Data can also be streamed to observability tools in real time. |\n| [Inference tables](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html) | Automatically logs online prediction requests and responses into Delta tables managed by Unity Catalog. | Use this tool for monitoring and debugging model quality or responses, generating training data sets, or conducting compliance audits. | Can be enabled for existing and new model-serving endpoints using a single click in the UI or API. 
|\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/monitor-diagnose-endpoints.html"} +{"content":"# Model serving with Databricks\n### Monitor model quality and endpoint health\n#### Additional resources\n\n* [Track and export serving endpoint health metrics to Prometheus and Datadog](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/metrics-export-serving-endpoint.html)\n* [Inference tables for monitoring and debugging models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/monitor-diagnose-endpoints.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### AI Functions on Databricks\n##### Query a served model with `ai_query()`\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes how to query a [model serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) from SQL with `ai_query()`.\n\n##### Query a served model with `ai_query()`\n###### What is `ai_query()`?\n\nThe `ai_query()` function is a built-in Databricks SQL function, part of [AI functions](https:\/\/docs.databricks.com\/large-language-models\/ai-functions.html). It allows these types of models to be accessible from SQL queries: \n* Custom models hosted by a model serving endpoint.\n* Models hosted by Databricks Foundation Model APIs.\n* External models (third-party models hosted outside of Databricks). \nFor syntax and design patterns, see [ai\\_query function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_query.html). 
\nWhen this function is used to query a [model serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html), it is only available in workspaces and regions where [Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) is available and enabled.\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/how-to-ai-query.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### AI Functions on Databricks\n##### Query a served model with `ai_query()`\n###### Requirements\n\n* This function is not available on Databricks SQL Classic.\n* Querying Foundation Model APIs is enabled by default. To query endpoints that serve [custom models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) or [external models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html), you must enroll in the public preview. Please populate and submit the [AI Functions Public Preview enrollment form](https:\/\/docs.google.com\/forms\/d\/e\/1FAIpQLSeHFz5YYZX2zZQy2bH8y4v3QJmfvpepPPw7UsK3IQrpskQ8Gg\/viewform).\n* An existing model serving endpoint. See [Create custom model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html).\n* You must enable [AWS PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html) to use this feature on pro SQL warehouses.\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/how-to-ai-query.html"} +{"content":"# Generative AI and large language models (LLMs) on Databricks\n## Large language models (LLMs) on Databricks\n### AI Functions on Databricks\n##### Query a served model with `ai_query()`\n###### Query the endpoint with `ai_query()`\n\nYou can query the model behind the endpoint using `ai_query()` on serverless or pro SQL warehouses. 
For scoring request and response formats, see [Query foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html). \nNote \n* For Databricks Runtime 14.2 and above, this function is supported in notebook environments including Databricks notebooks and workflows.\n* For Databricks Runtime 14.1 and below, this function is not supported in notebook environments, including Databricks notebooks. \n### Example: Query a large language model \nThe following example queries the model behind the `sentiment-analysis` endpoint with the `text` dataset and specifies the return type of the request. \n```\nSELECT text, ai_query(\n\"sentiment-analysis\",\ntext,\nreturnType => \"STRUCT<label:STRING, score:DOUBLE>\"\n) AS predict\nFROM\ncatalog.schema.customer_reviews\n\n``` \n### Example: Query a predictive model \nThe following example queries a classification model behind the `spam-classification` endpoint to batch predict whether the `text` of each message in the `inbox_messages` table is spam. The model takes 3 input features: timestamp, sender, text. The model returns a boolean array. \n```\nSELECT text, ai_query(\nendpoint => \"spam-classification\",\nrequest => named_struct(\n\"timestamp\", timestamp,\n\"sender\", from_number,\n\"text\", text),\nreturnType => \"BOOLEAN\") AS is_spam\nFROM catalog.schema.inbox_messages\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/large-language-models\/how-to-ai-query.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### RStudio on Databricks\n\nYou can use [RStudio](https:\/\/www.rstudio.com\/), a popular integrated development environment (IDE) for R, to connect to Databricks compute resources within Databricks workspaces. 
Use [RStudio Desktop](https:\/\/docs.databricks.com\/sparkr\/rstudio.html#rstudio-desktop) to connect to a Databricks [cluster](https:\/\/docs.databricks.com\/compute\/index.html) or a [SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html) from your local development machine. You can also use your web browser to sign in to your Databricks workspace and then connect to a Databricks cluster that has [RStudio Server](https:\/\/docs.databricks.com\/sparkr\/rstudio.html#rstudio-server) installed, within that workspace.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/rstudio.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### RStudio on Databricks\n##### Connect using RStudio Desktop\n\nUse [RStudio Desktop](https:\/\/www.rstudio.com\/products\/rstudio\/#rstudio-desktop) to connect to a remote Databricks cluster or SQL warehouse from your local development machine. To connect in this scenario, use an ODBC connection and call ODBC package functions for R, which are described in this section. \nNote \nYou cannot use packages such as [SparkR](https:\/\/docs.databricks.com\/sparkr\/overview.html) or [sparklyr](https:\/\/docs.databricks.com\/sparkr\/sparklyr.html) in this RStudio Desktop scenario, unless you also use [Databricks Connect](https:\/\/docs.databricks.com\/dev-tools\/databricks-connect\/index.html). As an alternative to using RStudio Desktop, you can use your web browser to sign in to your Databricks workspace and then connect to a Databricks cluster that has [RStudio Server](https:\/\/docs.databricks.com\/sparkr\/rstudio.html#rstudio-server) installed in that workspace. \nTo set up RStudio Desktop on your local development machine: \n1. [Download and install R 3.3.0 or higher](https:\/\/cran.rstudio.com\/).\n2. [Download and install RStudio Desktop](https:\/\/www.rstudio.com\/products\/rstudio\/download\/#download).\n3. Start RStudio Desktop. \n(Optional) To create an RStudio project: \n1. 
Start RStudio Desktop.\n2. Click **File > New Project**.\n3. Select **New Directory > New Project**.\n4. Choose a new directory for the project, and then click **Create Project**. \nTo create an R script: \n1. With the project open, click **File > New File > R Script**.\n2. Click **File > Save As**.\n3. Name the file, and then click **Save**. \nTo connect to the remote Databricks cluster or SQL warehouse through ODBC for R: \n1. Get the **Server hostname**, **Port**, and **HTTP path** values for your remote [cluster](https:\/\/docs.databricks.com\/integrations\/compute-details.html) or [SQL warehouse](https:\/\/docs.databricks.com\/integrations\/compute-details.html). For a cluster, these values are on the **JDBC\/ODBC** tab of **Advanced options**. For a SQL warehouse, these values are on the **Connection details** tab.\n2. Get a Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n3. Install and configure the Databricks ODBC driver for [Windows](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html#odbc-windows), [macOS](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html#odbc-mac), or [Linux](https:\/\/docs.databricks.com\/integrations\/odbc\/download.html#odbc-linux), based on your local machine\u2019s operating system.\n4. 
Set up an ODBC Data Source Name (DSN) to your remote cluster or SQL warehouse for [Windows](https:\/\/docs.databricks.com\/integrations\/odbc\/dsn.html#windows), [macOS](https:\/\/docs.databricks.com\/integrations\/odbc\/dsn.html#macos), or [Linux](https:\/\/docs.databricks.com\/integrations\/odbc\/dsn.html#linux), based on your local machine\u2019s operating system.\n5. From the RStudio console (**View > Move Focus to Console**), install the [odbc](https:\/\/solutions.rstudio.com\/db\/r-packages\/odbc\/) and [DBI](https:\/\/dbi.r-dbi.org\/) packages from [CRAN](http:\/\/cran.us.r-project.org): \n```\nrequire(devtools)\n\ninstall_version(\npackage = \"odbc\",\nrepos = \"http:\/\/cran.us.r-project.org\"\n)\n\ninstall_version(\npackage = \"DBI\",\nrepos = \"http:\/\/cran.us.r-project.org\"\n)\n\n```\n6. Back in your R script (**View > Move Focus to Source**), load the installed `odbc` and `DBI` packages: \n```\nlibrary(odbc)\nlibrary(DBI)\n\n```\n7. Call the ODBC version of the [dbConnect](https:\/\/rdrr.io\/cran\/odbc\/man\/dbConnect-OdbcDriver-method.html) function in the `DBI` package, specifying the `odbc` driver in the `odbc` package as well as the ODBC DSN that you created, for example, an ODBC DSN of `Databricks`. \n```\nconn = dbConnect(\ndrv = odbc(),\ndsn = \"Databricks\"\n)\n\n```\n8. 
Call an operation through the ODBC DSN, for instance a `SELECT` statement through the [dbGetQuery](https:\/\/www.rdocumentation.org\/packages\/DBI\/versions\/0.5-1\/topics\/dbGetQuery) function in the `DBI` package, specifying the name of the connection variable and the `SELECT` statement itself, for example from a table named `diamonds` in a schema (database) named `default`: \n```\nprint(dbGetQuery(conn, \"SELECT * FROM default.diamonds LIMIT 2\"))\n\n``` \nThe complete R script is as follows: \n```\nlibrary(odbc)\nlibrary(DBI)\n\nconn = dbConnect(\ndrv = odbc(),\ndsn = \"Databricks\"\n)\n\nprint(dbGetQuery(conn, \"SELECT * FROM default.diamonds LIMIT 2\"))\n\n``` \nTo run the script, in source view, click **Source**. The results for the preceding R script are as follows: \n```\n_c0 carat cut color clarity depth table price x y z\n1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43\n2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/rstudio.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### RStudio on Databricks\n##### Connect using RStudio Server\n\nUse your web browser to sign in to your Databricks workspace and then connect to a Databricks cluster that has [RStudio Server](https:\/\/www.rstudio.com\/products\/rstudio\/#rstudio-server) installed, within that workspace. \nNote \nAs an alternative to RStudio Server, you can use [RStudio Desktop](https:\/\/docs.databricks.com\/sparkr\/rstudio.html#rstudio-desktop) to connect to a Databricks cluster or SQL warehouse from your local development machine through an ODBC connection, and call ODBC package functions for R. You cannot use packages such as [SparkR](https:\/\/docs.databricks.com\/sparkr\/overview.html) or [sparklyr](https:\/\/docs.databricks.com\/sparkr\/sparklyr.html) in the RStudio Desktop scenario, unless you also use [Databricks Connect](https:\/\/docs.databricks.com\/dev-tools\/databricks-connect\/index.html). 
\nFor RStudio Server, you can use either the Open Source Edition or RStudio Workbench (previously RStudio Server Pro) edition on Databricks. If you want to use RStudio Workbench \/ RStudio Server Pro, you must transfer your existing RStudio Workbench \/ RStudio Server Pro license to Databricks (see [Get started: RStudio Workbench](https:\/\/docs.databricks.com\/sparkr\/rstudio.html#rsp)). \nDatabricks recommends that you use Databricks Runtime for Machine Learning (Databricks Runtime ML) on Databricks clusters with RStudio Server, to reduce cluster start times. Databricks Runtime ML includes an unmodified version of the RStudio Server Open Source Edition package for which the source code can be found in [GitHub](https:\/\/github.com\/rstudio\/rstudio\/). The following table lists the version of RStudio Server Open Source Edition that is currently preinstalled on Databricks Runtime ML versions. \n| Databricks Runtime for ML Version | RStudio Server Version |\n| --- | --- |\n| Databricks Runtime 9.1 LTS ML and 10.4 LTS ML | 1.4 |\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/rstudio.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### RStudio on Databricks\n##### RStudio integration architecture\n\nWhen you use RStudio Server on Databricks, the RStudio Server Daemon runs on the driver node of a Databricks cluster. The RStudio web UI is proxied through Databricks webapp, which means that you do not need to make any changes to your cluster network configuration. This diagram demonstrates the RStudio integration component architecture. \n![Architecture of RStudio on Databricks](https:\/\/docs.databricks.com\/_images\/rstudio-architecture.png) \nWarning \nDatabricks proxies the RStudio web service from port 8787 on the cluster\u2019s Spark driver. This web proxy is intended for use only with RStudio. 
If you launch other web services on port 8787, you might expose your users to potential security exploits.\nDatabricks is not responsible for any issues that result from the installation of unsupported software on a cluster. \n### Requirements \n* The cluster must be an all-purpose cluster.\n* You must have CAN ATTACH TO permission for that cluster. The cluster admin can grant you this permission. See [Compute permissions](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions). \n* The cluster *must not* have [table access control](https:\/\/docs.databricks.com\/data-governance\/table-acls\/object-privileges.html), [automatic termination](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#automatic-termination), or [credential passthrough](https:\/\/docs.databricks.com\/archive\/credential-passthrough\/index.html) enabled. \n* The cluster *must not* use the **Shared** [access mode](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode).\n* The cluster *must not* have the Spark configuration `spark.databricks.pyspark.enableProcessIsolation` set to `true`.\n* You must have an RStudio Server floating Pro license to use the Pro edition. \nNote \nAlthough the cluster can use an [access mode](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode) that supports Unity Catalog, you cannot use RStudio Server from that cluster to access data in Unity Catalog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/rstudio.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### RStudio on Databricks\n##### Get started: RStudio Server OS Edition\n\nRStudio Server Open Source Edition is preinstalled on Databricks clusters that use Databricks Runtime for Machine Learning (Databricks Runtime ML). \nTo open RStudio Server OS Edition on a cluster, do the following: \n1. Open the cluster\u2019s details page.\n2. 
Start the cluster, and then click the **Apps** tab: \n![Cluster Apps tab](https:\/\/docs.databricks.com\/_images\/rstudio-apps-ui.png)\n3. On the **Apps** tab, click the **Set up RStudio** button. This generates a one-time password for you. Click the **show** link to display it and copy the password. \n![RStudio one-time password](https:\/\/docs.databricks.com\/_images\/rstudio-password-ui.png)\n4. Click the **Open RStudio** link to open the UI in a new tab. Enter your username and password in the login form and sign in. \n![RStudio login form](https:\/\/docs.databricks.com\/_images\/rstudio-login-ui.png)\n5. From the RStudio UI, you can import the `SparkR` package and set up a `SparkR` session to launch Spark jobs on your cluster. \n```\nlibrary(SparkR)\n\nsparkR.session()\n\n# Query the first two rows of a table named \"diamonds\" in a\n# schema (database) named \"default\" and display the query result.\ndf <- SparkR::sql(\"SELECT * FROM default.diamonds LIMIT 2\")\nshowDF(df)\n\n``` \n![RStudio Open Source Edition session](https:\/\/docs.databricks.com\/_images\/rstudio-session-ui.png)\n6. You can also attach the [sparklyr](https:\/\/docs.databricks.com\/sparkr\/sparklyr.html) package and set up a Spark connection. \n```\nlibrary(sparklyr)\n\nsc <- spark_connect(method = \"databricks\")\n\n# Query a table named \"diamonds\" and display the first two rows.\ndf <- spark_read_table(sc = sc, name = \"diamonds\")\nprint(x = df, n = 2)\n\n``` \n![RStudio Open Source Edition sparklyr connection](https:\/\/docs.databricks.com\/_images\/rstudio-session-ui-sparklyr.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/rstudio.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### RStudio on Databricks\n##### Get started: RStudio Workbench\n\nThis section shows you how to set up and start using RStudio Workbench (formerly RStudio Server Pro) on a Databricks cluster. 
See an [FAQ about the name change](https:\/\/support.rstudio.com\/hc\/en-us\/articles\/1500012472761). Depending on your license, RStudio Workbench may include RStudio Server Pro. \n### Set up RStudio license server \nTo use RStudio Workbench on Databricks, you need to convert your Pro License to a [floating license](https:\/\/support.rstudio.com\/hc\/articles\/115011574507-Floating-Licenses). For assistance, contact [help@rstudio.com](mailto:help%40rstudio.com). When your license is converted, you must set up a [license server](https:\/\/www.rstudio.com\/floating-license-servers\/) for RStudio Workbench. \nTo set up a license server: \n1. Launch a small instance on your cloud provider network; the license server daemon is very lightweight.\n2. Download and install the corresponding version of RStudio License Server on your instance, and start the service. For detailed instructions, see [RStudio Workbench Admin Guide](https:\/\/docs.rstudio.com\/ide\/server-pro\/license_management\/floating_licensing.html).\n3. Make sure that the license server port is open to Databricks instances. \n### Install RStudio Workbench \nTo set up RStudio Workbench on a Databricks cluster, you must create an init script to install the RStudio Workbench binary package and configure it to use your license server for license lease. \nNote \nIf you plan to install RStudio Workbench on a Databricks Runtime version that already includes RStudio Server Open Source Edition package, you need to first uninstall that package for installation to succeed. \nThe following is an example `.sh` file that you can store as an init script in a location such as in your home directory as a workspace file, in a Unity Catalog volume, or in object storage. For more information, see [Use cluster-scoped init scripts](https:\/\/docs.databricks.com\/init-scripts\/cluster-scoped.html). The script also performs additional authentication configurations that streamline integration with Databricks. 
\nWarning \nCluster-scoped init scripts on DBFS are end-of-life. Storing init scripts in DBFS exists in some workspaces to support legacy workloads and is not recommended. All init scripts stored in DBFS should be migrated. For migration instructions, see [Migrate init scripts from DBFS](https:\/\/docs.databricks.com\/init-scripts\/index.html#migrate). \n```\n#!\/bin\/bash\n\nset -euxo pipefail\n\nif [[ $DB_IS_DRIVER = \"TRUE\" ]]; then\nsudo apt-get update\nsudo dpkg --purge rstudio-server # in case open source version is installed.\nsudo apt-get install -y gdebi-core alien\n\n## Installing RStudio Workbench\ncd \/tmp\n\n# You can find new releases at https:\/\/rstudio.com\/products\/rstudio\/download-commercial\/debian-ubuntu\/.\nwget https:\/\/download2.rstudio.org\/server\/bionic\/amd64\/rstudio-workbench-2022.02.1-461.pro1-amd64.deb -O rstudio-workbench.deb\nsudo gdebi -n rstudio-workbench.deb\n\n## Configuring authentication\nsudo echo 'auth-proxy=1' >> \/etc\/rstudio\/rserver.conf\nsudo echo 'auth-proxy-user-header-rewrite=^(.*)$ $1' >> \/etc\/rstudio\/rserver.conf\nsudo echo 'auth-proxy-sign-in-url=<domain>\/login.html' >> \/etc\/rstudio\/rserver.conf\nsudo echo 'admin-enabled=1' >> \/etc\/rstudio\/rserver.conf\nsudo echo 'export PATH=\/usr\/local\/sbin:\/usr\/local\/bin:\/usr\/sbin:\/usr\/bin:\/sbin:\/bin' >> \/etc\/rstudio\/rsession-profile\n\n# Enabling floating license\nsudo echo 'server-license-type=remote' >> \/etc\/rstudio\/rserver.conf\n\n# Session configurations\nsudo echo 'session-rprofile-on-resume-default=1' >> \/etc\/rstudio\/rsession.conf\nsudo echo 'allow-terminal-websockets=0' >> \/etc\/rstudio\/rsession.conf\n\nsudo rstudio-server license-manager license-server <license-server-url>\nsudo rstudio-server restart || true\nfi\n\n``` \n1. Replace `<domain>` with your Databricks URL and `<license-server-url>` with the URL of your floating license server.\n2. 
Store this `.sh` file as an init script in a location such as in your home directory as a workspace file, in a Unity Catalog volume, or in object storage. For more information, see [Use cluster-scoped init scripts](https:\/\/docs.databricks.com\/init-scripts\/cluster-scoped.html).\n3. Before launching a cluster, add this `.sh` file as an init script from the associated location. For instructions, see [Use cluster-scoped init scripts](https:\/\/docs.databricks.com\/init-scripts\/cluster-scoped.html).\n4. Launch the cluster.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/rstudio.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### RStudio on Databricks\n##### Use RStudio Server Pro\n\n1. Open the cluster\u2019s details page.\n2. Start the cluster, and click the **Apps** tab: \n![Cluster Apps tab](https:\/\/docs.databricks.com\/_images\/rstudio-apps-ui.png)\n3. On the **Apps** tab, click the **Set up RStudio** button. \n![RStudio one-time password](https:\/\/docs.databricks.com\/_images\/rstudio-password-ui.png)\n4. You do not need the one-time password. Click the **Open RStudio UI** link and it will open an authenticated RStudio Pro session for you.\n5. From the RStudio UI, you can attach the `SparkR` package and set up a `SparkR` session to launch Spark jobs on your cluster. \n```\nlibrary(SparkR)\n\nsparkR.session()\n\n# Query the first two rows of a table named \"diamonds\" in a\n# schema (database) named \"default\" and display the query result.\ndf <- SparkR::sql(\"SELECT * FROM default.diamonds LIMIT 2\")\nshowDF(df)\n\n``` \n![RStudio Pro session](https:\/\/docs.databricks.com\/_images\/rstudio-pro-session-ui.png)\n6. You can also attach the [sparklyr](https:\/\/docs.databricks.com\/sparkr\/sparklyr.html) package and set up a Spark connection. 
\n```\nlibrary(sparklyr)\n\nsc <- spark_connect(method = \"databricks\")\n\n# Query a table named \"diamonds\" and display the first two rows.\ndf <- spark_read_table(sc = sc, name = \"diamonds\")\nprint(x = df, n = 2)\n\n``` \n![RStudio Pro sparklyr connection](https:\/\/docs.databricks.com\/_images\/rstudio-pro-session-ui-sparklyr.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/rstudio.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### RStudio on Databricks\n##### RStudio Server FAQ\n\n### What is the difference between RStudio Server Open Source Edition and RStudio Workbench? \nRStudio Workbench supports a wide range of enterprise features that are not available on the Open Source Edition. You can see the feature comparison on [RStudio\u2019s website](https:\/\/www.rstudio.com\/products\/rstudio\/#Server). \nIn addition, RStudio Server Open Source Edition is distributed under the [GNU Affero General Public License (AGPL)](https:\/\/www.gnu.org\/licenses\/agpl-3.0.en.html), while the Pro version comes with a commercial license for organizations that are not able to use AGPL software. \nFinally, RStudio Workbench comes with professional and enterprise support from RStudio, PBC, while RStudio Server Open Source Edition comes with no support. \n### Can I use my RStudio Workbench \/ RStudio Server Pro license on Databricks? \nYes, if you already have a Pro or Enterprise license for RStudio Server, you can use that license on Databricks. See [Get started: RStudio Workbench](https:\/\/docs.databricks.com\/sparkr\/rstudio.html#rsp) to learn how to set up RStudio Workbench on Databricks. \n### Where does RStudio Server run? Do I need to manage any additional services\/servers? \nAs you can see on the diagram in [RStudio integration architecture](https:\/\/docs.databricks.com\/sparkr\/rstudio.html#rsf), the RStudio Server daemon runs on the driver (master) node of your Databricks cluster. 
With RStudio Server Open Source Edition, you do not need to run any additional servers\/services. However, for RStudio Workbench, you must manage a separate instance that runs RStudio License Server. \n### Can I use RStudio Server on a standard cluster? \nNote \nThis article describes the legacy clusters UI. For information about the new clusters UI (in preview), including terminology changes for cluster access modes, see [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html). For a comparison of the new and legacy cluster types, see [Clusters UI changes and cluster access modes](https:\/\/docs.databricks.com\/archive\/compute\/cluster-ui-preview.html). \nYes, you can. \n### Can I use RStudio Server on a cluster with auto termination? \nNo, you can\u2019t use RStudio when auto termination is enabled. Auto termination can purge unsaved user scripts and data inside an RStudio session. To protect users against this unintended data loss scenario, RStudio is disabled on such clusters by default. \nFor customers who require cleaning up cluster resources when they are not used, Databricks recommends using [cluster APIs](https:\/\/docs.databricks.com\/api\/workspace\/clusters) to clean up RStudio clusters based on a schedule. \n### How should I persist my work on RStudio? \nWe strongly recommend that you persist your work using a version control system from RStudio. RStudio has great support for various version control systems and allows you to check in and manage your projects. If you do not persist your code through one of the following methods, you risk losing your work if a workspace admin restarts or terminates the cluster. \nOne method is to save your files (code or data) on DBFS (see [What is DBFS?](https:\/\/docs.databricks.com\/dbfs\/index.html)). For example, files that you save under `\/dbfs\/` are not deleted when your cluster is terminated or restarted. 
\nAnother method is to save the R notebook to your local file system by exporting it as `Rmarkdown`, then later importing the file into the RStudio instance. The blog [Sharing R Notebooks using RMarkdown](https:\/\/databricks.com\/blog\/2018\/07\/06\/sharing-r-notebooks-using-rmarkdown.html) describes the steps in more detail. \nAnother method is to mount an Amazon Elastic File System (Amazon EFS) volume to your cluster, so that when the cluster is shut down you won\u2019t lose your work. When the cluster restarts, Databricks remounts the Amazon EFS volume, and you can continue where you left off. To mount an existing Amazon EFS volume to a cluster, call the [create cluster](https:\/\/docs.databricks.com\/api\/workspace\/clusters) (`POST \/api\/2.0\/clusters\/create`) or [edit cluster](https:\/\/docs.databricks.com\/api\/workspace\/clusters) (`POST \/api\/2.0\/clusters\/edit`) operation in the Clusters API 2.0, specifying the Amazon EFS volume\u2019s mount information in the operation\u2019s `cluster_mount_infos` array. \nMake sure the cluster that you create or use does not have Unity Catalog, auto termination, or auto scaling enabled. Also make sure that the cluster has write access to the mounted volume, for example by running the command `chmod a+w <\/path\/to\/volume>` on the cluster. You can run this command on an existing cluster through the cluster\u2019s [web terminal](https:\/\/docs.databricks.com\/compute\/web-terminal.html), or on a new cluster by using an [init script](https:\/\/docs.databricks.com\/init-scripts\/index.html) that you specify in the preceding operation\u2019s `init_scripts` array. \nIf you do not have an existing Amazon EFS volume, you can create one. First, contact your Databricks administrator and get the VPC ID, public subnet ID, and security group ID for your Databricks workspace. 
Then use this information, along with the AWS Management Console, to [create a file system with custom settings using the Amazon EFS console](https:\/\/docs.aws.amazon.com\/efs\/latest\/ug\/creating-using-create-fs.html#creating-using-fs-part1-console). In the last step of this procedure, click **Attach** and copy the DNS name and mount options, which you specify in the preceding `cluster_mount_infos` array. \n### How do I start a `SparkR` session? \n`SparkR` is contained in Databricks Runtime, but you must load it into RStudio. Run the following code inside RStudio to initialize a `SparkR` session. \n```\nlibrary(SparkR)\n\nsparkR.session()\n\n``` \nIf there is an error importing the `SparkR` package, run `.libPaths()` and verify that `\/home\/ubuntu\/databricks\/spark\/R\/lib` is included in the result. \nIf it is not included, check the content of `\/usr\/lib\/R\/etc\/Rprofile.site`. List `\/home\/ubuntu\/databricks\/spark\/R\/lib\/SparkR` on the driver to verify that the `SparkR` package is installed. \n### How do I start a `sparklyr` session? \nThe `sparklyr` package must be installed on the cluster. Use one of the following methods to install the `sparklyr` package: \n* As a Databricks library\n* `install.packages()` command\n* RStudio package management UI \n```\nlibrary(sparklyr)\n\nsc <- spark_connect(method = \"databricks\")\n\n``` \n### How does RStudio integrate with Databricks R notebooks? \nYou can move your work between notebooks and RStudio through version control. \n### What is the working directory? \nWhen you start a project in RStudio, you choose a working directory. By default this is the home directory on the driver (master) container where RStudio Server is running. You can change this directory if you want. \n### Can I launch Shiny Apps from RStudio running on Databricks? \nYes, you can develop and view [Shiny applications inside RStudio Server on Databricks](https:\/\/docs.databricks.com\/sparkr\/shiny.html). 
\n### I can\u2019t use terminal or git inside RStudio on Databricks. How can I fix that? \nMake sure that you have disabled websockets. In RStudio Server Open Source Edition, you can do this from the UI. \n![RStudio Session](https:\/\/docs.databricks.com\/_images\/rstudio-terminal-options.png) \nIn RStudio Server Pro, you can add `allow-terminal-websockets=0` to `\/etc\/rstudio\/rsession.conf` to disable websockets for all users. \n### I don\u2019t see the Apps tab under cluster details. \nThis feature is not available to all customers. You must be on the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/rstudio.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Apply tags to Unity Catalog securable objects\n\nThis article shows how to apply tags in Unity Catalog. \nTags are attributes containing keys and optional values that you can apply to securable objects in Unity Catalog. Tagging is useful for organizing and categorizing securable objects in a metastore. Using tags also simplifies search and discovery of tables and views using the workspace search functionality.\n\n#### Apply tags to Unity Catalog securable objects\n##### Supported securable objects\n\nSecurable object tagging is currently supported on catalogs, schemas, tables, table columns, volumes, views, and registered models. 
For more information about securable objects, see [Securable objects in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html#securable-objects).\n\n#### Apply tags to Unity Catalog securable objects\n##### Requirements\n\nTo add tags to Unity Catalog securable objects, you must have the `APPLY TAG` privilege on the object, as well as the `USE SCHEMA` privilege on the object\u2019s parent schema and the `USE CATALOG` privilege on the object\u2019s parent catalog.\n\n#### Apply tags to Unity Catalog securable objects\n##### Constraints\n\nThe following is a list of tag constraints: \n* You can assign a maximum of 20 tags to a single securable object.\n* The maximum length of a tag key is 255 characters.\n* The maximum length of a tag value is 1000 characters.\n* Special characters cannot be used in tag names.\n* Tag search using the workspace search UI is supported only for tables, views, and table columns.\n* Tag search requires exact term matching.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/tags.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Apply tags to Unity Catalog securable objects\n##### Manage tags in Catalog Explorer\n\nTo manage securable object tags using the Catalog Explorer UI you must have at least the `BROWSE` privilege on the object. \n1. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** in the sidebar.\n2. Select a securable object to view the tag information.\n3. Click ![Edit icon](https:\/\/docs.databricks.com\/_images\/pencil-edit-icon.png)**Add\/Edit Tags** to manage tags for the current securable object. You can add and remove multiple tags simultaneously in the tag management modal. 
\nTo add or edit table column tags, click the ![Add column or tag comment icon](https:\/\/docs.databricks.com\/_images\/add-column-comment.png) **Add tag** icon.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/tags.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Apply tags to Unity Catalog securable objects\n##### Retrieve tag information in information schema tables\n\nEach catalog created in Unity Catalog includes an `INFORMATION_SCHEMA`. This schema includes tables that describe the objects known to the schema\u2019s catalog. You must have the appropriate privileges to view the schema information. \nYou can query the following to retrieve tag information: \n* [INFORMATION\\_SCHEMA.CATALOG\\_TAGS](https:\/\/docs.databricks.com\/sql\/language-manual\/information-schema\/catalog_tags.html)\n* [INFORMATION\\_SCHEMA.COLUMN\\_TAGS](https:\/\/docs.databricks.com\/sql\/language-manual\/information-schema\/column_tags.html)\n* [INFORMATION\\_SCHEMA.SCHEMA\\_TAGS](https:\/\/docs.databricks.com\/sql\/language-manual\/information-schema\/schema_tags.html)\n* [INFORMATION\\_SCHEMA.TABLE\\_TAGS](https:\/\/docs.databricks.com\/sql\/language-manual\/information-schema\/table_tags.html)\n* [INFORMATION\\_SCHEMA.VOLUME\\_TAGS](https:\/\/docs.databricks.com\/sql\/language-manual\/information-schema\/volume_tags.html) \nFor more information, see [Information schema](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-information-schema.html).\n\n#### Apply tags to Unity Catalog securable objects\n##### Manage tags using SQL commands\n\nNote \nThis feature is available only in Databricks Runtime versions 13.3 and above. \nYou can use SQL commands to tag catalogs, schemas, tables (views, materialized views, streaming tables), volumes, and table columns. For example, you can use the `SET TAGS` and `UNSET TAGS` clauses with the `ALTER TABLE` command to manage tags on a table. 
See [DDL statements](https:\/\/docs.databricks.com\/sql\/language-manual\/index.html#ddl-statements) for a list of available Data Definition Language (DDL) commands and their syntax.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/tags.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Apply tags to Unity Catalog securable objects\n##### Use tags to search for tables\n\nYou can use the Databricks workspace search bar to search for tables, views, and table columns using tag keys and tag values. You can use both table tags and table column tags. You cannot use tags to search for other tagged objects, like catalogs, schemas, or volumes. \nFor details, see [Use tags to search for tables](https:\/\/docs.databricks.com\/search\/index.html#tags).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/tags.html"} +{"content":"# Security and compliance guide\n## Authentication and access control\n#### Manage entitlements\n\nThis article describes how to manage entitlements for users, service principals, and groups. \nNote \nEntitlements are available only in the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/entitlements.html"} +{"content":"# Security and compliance guide\n## Authentication and access control\n#### Manage entitlements\n##### Entitlements overview\n\nAn entitlement is a property that allows a user, service principal, or group to interact with Databricks in a specified way. Entitlements are assigned to users at the workspace level. The following table lists entitlements and the workspace UI and API property names that you use to manage each one. You can use the workspace admin settings page and Workspace Users, Service Principals, and Groups APIs to manage entitlements. 
\n| Entitlement name | Entitlement API name | Default | Description |\n| --- | --- | --- | --- |\n| Workspace access | `workspace-access` | Granted by default. | When granted to a user or service principal, they can access the Data Science & Engineering and Databricks Machine Learning persona-based environments. Can\u2019t be removed from workspace admins. |\n| Databricks SQL access | `databricks-sql-access` | Granted by default. | When granted to a user or service principal, they can access Databricks SQL. |\n| Allow unrestricted cluster creation | `allow-cluster-create` | Not granted to users or service principals by default. | When granted to a user or service principal, they can create unrestricted clusters. You can restrict access to existing clusters using [cluster-level permissions](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions). Can\u2019t be removed from workspace admins. |\n| Allow pool creation (not available via UI) | `allow-instance-pool-create` | Can\u2019t be granted to individual users or service principals. | When granted to a group, its members can create instance pools. Can\u2019t be removed from workspace admins. | \nThe `users` group is granted the **Workspace access** and **Databricks SQL access** entitlements by default. All workspace users and service principals are members of the `users` group. To assign these entitlements on a user-by-user basis, a workspace admin must remove the entitlement from the `users` group and assign it individually to users, service principals, and groups. \nTo log in and access a Databricks workspace, a user must have the **Databricks SQL access** or **Workspace access** entitlement. \nYou cannot grant the `allow-instance-pool-create` entitlement using the admin settings page. 
Instead, use the Workspace Users, Service Principals, or Groups API.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/entitlements.html"} +{"content":"# Security and compliance guide\n## Authentication and access control\n#### Manage entitlements\n##### Manage entitlements on users\n\nWorkspace admins can add or remove an entitlement for a user using the workspace admin settings page. You can also use the [Workspace Users API](https:\/\/docs.databricks.com\/api\/workspace\/serviceprincipals). \n1. As a workspace admin, log in to the Databricks workspace.\n2. Click your username in the top bar of the Databricks workspace and select **Settings**.\n3. Click on the **Identity and access** tab.\n4. Next to **Users**, click **Manage**.\n5. Select the user.\n6. Click the **Entitlements** tab.\n7. To add an entitlement, select the toggle in the corresponding column. \nTo remove an entitlement, perform the same steps, but deselect the toggle instead. \nIf an entitlement is inherited from a group, the entitlement toggle is selected but grayed out. To remove an inherited entitlement, either remove the user from the group that has the entitlement, or remove the entitlement from the group.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/entitlements.html"} +{"content":"# Security and compliance guide\n## Authentication and access control\n#### Manage entitlements\n##### Manage entitlements on service principals\n\nWorkspace admins can add or remove an entitlement for a service principal using the workspace admin settings page. You can also use the [Workspace Service Principals API](https:\/\/docs.databricks.com\/api\/workspace\/serviceprincipals). \n1. As a workspace admin, log in to the Databricks workspace.\n2. Click your username in the top bar of the Databricks workspace and select **Settings**.\n3. Click on the **Identity and access** tab.\n4. Next to **Service principals**, click **Manage**.\n5. 
Select the service principal you want to update.\n6. To add an entitlement, under **Entitlements**, select the corresponding checkbox. \nTo remove an entitlement, perform the same steps, but clear the checkbox instead. \nIf an entitlement is inherited from a group, the entitlement toggle is selected but grayed out. To remove an inherited entitlement, either remove the service principal from the group that has the entitlement, or remove the entitlement from the group.\n\n#### Manage entitlements\n##### Manage entitlements on groups\n\nWorkspace admins can manage group entitlements at the workspace level, regardless of whether the group was created in the account or is workspace-local. \n1. As a workspace admin, log in to the Databricks workspace.\n2. Click your username in the top bar of the Databricks workspace and select **Settings**.\n3. Click on the **Identity and access** tab.\n4. Next to **Groups**, click **Manage**.\n5. Select the group you want to update. You must have the group manager role on the group to update it.\n6. On the **Entitlements** tab, select the entitlement you want to grant to all users in the group. \nTo remove an entitlement, perform the same steps, but deselect the toggle instead. Group members lose the entitlement, unless they have permission granted as an individual user or through another group membership.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/auth-authz\/entitlements.html"} +{"content":"# Technology partners\n### Connect to data prep partners using Partner Connect\n\nTo connect your Databricks workspace to a data preparation and transformation partner solution using Partner Connect, you typically follow the steps in this article. \nImportant \nBefore you follow the steps in this article, see the appropriate partner article for important partner-specific information. There might be differences in the connection steps between partner solutions. 
Some partner solutions also allow you to integrate with Databricks SQL warehouses (formerly Databricks SQL endpoints) or Databricks clusters, but not both.\n\n### Connect to data prep partners using Partner Connect\n#### Requirements\n\nSee the [requirements](https:\/\/docs.databricks.com\/partner-connect\/index.html#requirements) for using Partner Connect. \nImportant \nFor partner-specific requirements, see the appropriate partner article.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/prep.html"} +{"content":"# Technology partners\n### Connect to data prep partners using Partner Connect\n#### Steps to connect to a data preparation and transformation partner\n\nTo connect your Databricks workspace to a data preparation and transformation partner solution, follow the steps in this section. \nTip \nIf you have an existing partner account, Databricks recommends that you follow the steps to connect to the partner solution manually in the appropriate partner article. This is because the connection experience in Partner Connect is optimized for new partner accounts. \n1. In the sidebar, click ![Partner Connect button](https:\/\/docs.databricks.com\/_images\/partner-connect.png) **Partner Connect**.\n2. Click the partner tile. \nNote \nIf the partner tile has a check mark icon inside it, an administrator has already used Partner Connect to connect the partner to your workspace. Skip to step 5. The partner uses the email address for your Databricks account to prompt you to sign in to your existing partner account.\n3. Select a catalog for the partner to write to, then click **Next**. \nNote \nIf a partner doesn\u2019t support Unity Catalog with Partner Connect, the workspace default catalog is used. If your workspace isn\u2019t Unity Catalog-enabled, `hive_metastore` is used. 
\nPartner Connect creates the following resources in your workspace: \n* A Databricks [service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) named **`<PARTNER>_USER`**.\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) that is associated with the **`<PARTNER>_USER`** service principal.\n4. Click **Next**. \nThe **Email** box displays the email address for your Databricks account. The partner uses this email address to prompt you to either create a new partner account or sign in to your existing partner account.\n5. Click **Connect to `<Partner>`** or **Sign in**. \nA new tab opens in your web browser, which displays the partner website.\n6. Complete the on-screen instructions on the partner website to create your trial partner account or sign in to your existing partner account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/prep.html"} +{"content":"# \n### Notebook isolation\n\nNotebook isolation refers to the visibility of variables and classes between notebooks. Databricks supports two types of isolation: \n* Variable and class isolation\n* Spark session isolation \nNote \nDatabricks manages user isolation using [access modes configured on clusters](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode). \n* **No isolation shared**: Multiple users can use the same cluster. Users share credentials set at the cluster level. No data access controls are enforced.\n* **Single User**: Only the named user can use the cluster. All commands run with that user\u2019s privileges. Table ACLs in the Hive metastore are not enforced. This access mode supports Unity Catalog.\n* **Shared**: Multiple users can use the same cluster. Users are fully isolated from one another, and each user runs commands with their own privileges. Table ACLs in the Hive metastore are enforced. 
This access mode supports Unity Catalog.\n\n### Notebook isolation\n#### Variable and class isolation\n\nVariables and classes are available only in the current notebook. For example, two notebooks attached to the same cluster can define variables and classes with the same name, but these objects are distinct. \nTo define a class that is visible to *all notebooks attached to the same cluster*, define the class in a [package cell](https:\/\/docs.databricks.com\/notebooks\/package-cells.html). Then you can access the class by using its fully qualified name, which is the same as accessing a class in an attached Scala or Java library.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-isolation.html"} +{"content":"# \n### Notebook isolation\n#### Spark session isolation\n\nEvery notebook attached to a cluster has a pre-defined variable named `spark` that represents a `SparkSession`. `SparkSession` is the entry point for using Spark APIs as well as setting runtime configurations. \nSpark session isolation is enabled by default. You can also use *global* temporary views to share temporary views across notebooks. See [CREATE VIEW](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-view.html). To disable Spark session isolation, set `spark.databricks.session.share` to `true` in the [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration). \nImportant \nSetting `spark.databricks.session.share` to `true` breaks the monitoring used by both streaming notebook cells and streaming jobs. Specifically: \n* The graphs in streaming cells are not displayed.\n* Jobs do not block as long as a stream is running (they just finish \u201csuccessfully\u201d, stopping the stream).\n* Streams in jobs are not monitored for termination. 
Instead, you must manually call `awaitTermination()`.\n* Calling the [`display` function](https:\/\/docs.databricks.com\/visualizations\/index.html#display-function) on streaming DataFrames doesn\u2019t work. \nCells that trigger commands in other languages (that is, cells using `%scala`, `%python`, `%r`, and `%sql`) and cells that include other notebooks (that is, cells using `%run`) are part of the current notebook. Thus, these cells are in the same session as other notebook cells. By contrast, a [notebook workflow](https:\/\/docs.databricks.com\/notebooks\/notebook-workflows.html) runs a notebook with an isolated `SparkSession`, which means temporary views defined in such a notebook are *not visible* in other notebooks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/notebook-isolation.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Hyperparameter tuning\n##### Parallelize hyperparameter tuning with scikit-learn and MLflow\n\nThis notebook shows how to use Hyperopt to parallelize hyperparameter tuning calculations. It uses the `SparkTrials` class to automatically distribute calculations across the cluster workers. 
It also illustrates automated MLflow tracking of Hyperopt runs so you can save the results for later.\n\n##### Parallelize hyperparameter tuning with scikit-learn and MLflow\n###### Parallelize hyperparameter tuning with automated MLflow tracking notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/hyperopt-spark-mlflow.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nAfter you perform the actions in the last cell in the notebook, your MLflow UI should display: \n![Hyperopt MLflow demo](https:\/\/docs.databricks.com\/_images\/hyperopt-spark-mlflow.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/automl-hyperparam-tuning\/hyperopt-spark-mlflow-integration.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Dashboard visualization types\n\nThis article outlines the types of visualizations available to use in dashboards, and shows you how to create an example of each visualization type. For instructions on building a dashboard, see [Create a dashboard](https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-dashboard.html). \nNote \nTo optimize performance, charts can only render 10k elements on the canvas. Otherwise, visualizations may be truncated.\n\n#### Dashboard visualization types\n##### Area visualization\n\nArea visualizations combine the line and bar visualizations to show how one or more groups\u2019 numeric values change over the progression of a second variable, typically that of time. They are often used to show sales funnel changes through time. 
\n![Area visualization example](https:\/\/docs.databricks.com\/_images\/area.png) \n**Configuration values**: For this area visualization example, the following values were set: \n* Title: `Total price and order year by order priority and clerk`\n* X-axis: \n+ Field: `o_orderdate`\n+ Scale Type: `Temporal`\n+ Transform: `Yearly`\n+ Axis title: `Order year`\n* Y-axis: \n+ Field: `o_totalprice`\n+ Axis title: `Total price`\n+ Scale Type: `Quantitative`\n+ Transform: `Sum`\n* Group by: \n+ Field: `o_orderpriority`\n+ Legend title: `Order priority`\n* Filter \n+ Field: `TPCH orders.o_clerk` \n**SQL query**: For this area visualization, the following SQL query was used to generate the data set named `TPCH orders`. \n```\nSELECT * FROM samples.tpch.orders\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/visualization-types.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Dashboard visualization types\n##### Bar chart\n\nBar charts represent the change in metrics over time or across categories and show proportionality, similar to a [pie](https:\/\/docs.databricks.com\/dashboards\/visualization-types.html#pie) visualization. \n![Bar visualization example](https:\/\/docs.databricks.com\/_images\/bar.png) \n**Configuration values**: For this bar visualization example, the following values were set: \n* Title: `Total price and order month by order priority and clerk`\n* X-axis: \n+ Field: `o_orderdate`\n+ Transform: `Monthly`\n+ Scale Type: `Temporal`\n+ Axis title: `Order month`\n* Y-axis: \n+ Field: `o_totalprice`\n+ Scale Type: `Quantitative`\n+ Transform: `Sum`\n+ Axis title: `Total price`\n* Group by: \n+ Field: `o_orderpriority`\n+ Legend title: `Order priority`\n* Filter \n+ Field: `TPCH orders.o_clerk` \n**SQL query**: For this bar visualization, the following SQL query was used to generate the data set named `TPCH orders`. 
\n```\nSELECT * FROM samples.tpch.orders\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/visualization-types.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Dashboard visualization types\n##### Combo chart\n\nCombo charts combine line and bar charts to present the changes over time with proportionality. \n![Combo chart example](https:\/\/docs.databricks.com\/_images\/lakeview-combo-chart.png) \n**Configuration values**: For this combo chart visualization, the following values were set: \n* X-axis: `ps_partkey` \n+ Scale Type: `Quantitative`\n* Y-axis: \n+ Bar: `ps_availqty`\n+ Aggregation type: `SUM`\n+ Line: `ps_supplycost`\n+ Aggregation type: `AVG`\n* Color by Y-Series: \n+ `Sum of ps_availqty`\n+ `Average ps_supplycost` \n**SQL query**: For this combo chart visualization, the following SQL query was used to generate the data set. \n```\nSELECT * FROM samples.tpch.partsupp\n\n``` \n### Dual-axis combo chart \nYou can use combo charts to show two different y-axes. With your combo chart widget selected, click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu on the **Y axis** settings in the chart configuration panel. Turn on the **Enable dual axis** option. \n![Dual-axis combo chart example](https:\/\/docs.databricks.com\/_images\/lakeview-dual-axis-combo.png) \n**Configuration values**: For this combo chart, the **Enable dual axis** option is on. 
The other configurations are set as follows: \n* X-axis: `tpep_pickup_datetime` \n+ Scale Type: `Temporal`\n+ Transform: `Weekly`\n* Y-axis: \n+ Left Y-axis (Bar): `trip_distance` \n- Transform: `AVG`\n+ Right Y-axis (Line): `fare_amount` \n- Transform: `AVG`\n* Color by Y-series: \n+ `Average trip_distance`\n+ `Average fare_amount` \n**SQL query**: The following SQL query was used to generate the data set: \n```\nSELECT * FROM samples.nyctaxi.trips\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/visualization-types.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Dashboard visualization types\n##### Counter visualization\n\nCounters display a single value prominently, with an option to compare it against a target value. To use counters, specify which row of data to display on the counter visualization for the **Value Column** and **Target Column**. \n![Counter example](https:\/\/docs.databricks.com\/_images\/counter.png) \n**Configuration values**: For this counter visualization example, the following values were set: \n* Title: `Orders: Target amount vs. actual amount by date`\n* Value: \n+ Field: `avg(o_totalprice)`\n+ Value row number: 1\n* Target: \n+ Field: `avg(o_totalprice)`\n+ Value row number: 2\n* Filter \n+ Field: `TPCH orders.o_orderdate` \n**SQL query**: For this counter visualization, the following SQL query was used to generate the data set named `TPCH orders_target`. \n```\nSELECT o_orderdate, avg(o_totalprice)\nFROM samples.tpch.orders\nGROUP BY 1\nORDER BY 1\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/visualization-types.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Dashboard visualization types\n##### Line visualization\n\nLine visualizations present the change in one or more metrics over time. 
\n![Line visualization example](https:\/\/docs.databricks.com\/_images\/line.png) \n**Configuration values**: For this line visualization example, the following values were set: \n* Title: `Average price and order year by order priority and clerk`\n* X-axis: \n+ Field: `o_orderdate`\n+ Transform: `Yearly`\n+ Scale Type: `Temporal`\n+ Axis title: `Order year`\n* Y-axis: \n+ Field: `o_totalprice`\n+ Transform: `Average`\n+ Scale Type: `Quantitative`\n+ Axis title: `Average price`\n* Group by: \n+ Field: `o_orderpriority`\n+ Legend title: `Order priority`\n* Filter \n+ Field: `TPCH orders.o_clerk` \n**SQL query**: For this line visualization, the following SQL query was used to generate the data set named `TPCH orders`. \n```\nSELECT * FROM samples.tpch.orders\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/visualization-types.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Dashboard visualization types\n##### Heatmap chart\n\nHeatmap charts blend features of bar charts, stacking, and bubble charts, allowing you to visualize numerical data using colors. A common color palette for a heatmap shows the highest values using warmer colors, like orange or red, and the lowest values using cooler colors, like blue or purple. \nFor example, the following heatmap visualizes the most frequently occurring distances of taxi rides on each day and groups the results by the day of the week, distance, and total fare. 
\n![Heatmap example](https:\/\/docs.databricks.com\/_images\/heatmap.png) \n**Configuration values**: For this heatmap chart visualization, the following values were set: \n* X column (dataset column): `o_orderpriority`\n* Y columns (dataset column): `o_orderstatus`\n* Color column: \n+ Dataset column: `o_totalprice`\n+ Aggregation type: `Average`\n* X-axis name (override default value): `Order priority`\n* Y-axis name (override default value): `Order status`\n* Color scheme (override default value): `YlGnBu` \n**SQL query**: For this heatmap chart visualization, the following SQL query was used to generate the data set. \n```\nSELECT * FROM samples.tpch.orders\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/visualization-types.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Dashboard visualization types\n##### Histogram chart\n\nA histogram plots the frequency with which a given value occurs in a dataset. A histogram helps you to understand whether a dataset has values that are clustered around a small number of ranges or are more spread out. A histogram is displayed as a bar chart in which you control the number of distinct bars (also called bins). \n![Histogram chart example](https:\/\/docs.databricks.com\/_images\/histogram.png) \n**Configuration values**: For this histogram chart visualization, the following values were set: \n* X column (dataset column): `o_totalprice`\n* Number of bins: 20\n* X-axis name (override default value): `Total price` \n**Configuration options**: For histogram chart configuration options, see [histogram chart configuration options](https:\/\/docs.databricks.com\/visualizations\/histogram.html#options). \n**SQL query**: For this histogram chart visualization, the following SQL query was used to generate the data set. \n```\nSELECT * FROM samples.tpch.orders\n\n```\n\n#### Dashboard visualization types\n##### Pie visualization\n\nPie visualizations show proportionality between metrics. 
They are *not* meant for conveying time series data. \n![Pie visualization example](https:\/\/docs.databricks.com\/_images\/pie.png) \n**Configuration values**: For this pie visualization example, the following values were set: \n* Title: `Total price by order priority and clerk`\n* Angle: \n+ Field: `o_totalprice`\n+ Transform: `Sum`\n+ Axis title: `Total price`\n* Group by: \n+ Field: `o_orderpriority`\n+ Legend title: `Order priority`\n* Filter \n+ Field: `TPCH orders.o_clerk` \n**SQL query**: For this pie visualization, the following SQL query was used to generate the data set named `TPCH orders`. \n```\nSELECT * FROM samples.tpch.orders\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/visualization-types.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Dashboard visualization types\n##### Pivot visualization\n\nA pivot visualization aggregates records from a query result into a tabular display. It\u2019s similar to `PIVOT` or `GROUP BY` statements in SQL. You configure the pivot visualization with drag-and-drop fields. \nNote \nFor performance reasons, pivot tables only support rendering 100 columns x 100 rows. \n![Pivot example](https:\/\/docs.databricks.com\/_images\/pivot.png) \n**Configuration values**: For this pivot visualization example, the following values were set: \n* Title: `Line item quantity by return flag and ship mode by supplier`\n* Rows: \n+ Field: `l_returnflag`\n* Columns: \n+ Field: `l_shipmode`\n* Cell \n+ Dataset:\n+ Field: `l_quantity`\n+ Transform: Sum\n* Filter \n+ Field: `TPCH lineitem.l_supplierkey` \n**SQL query**: For this pivot visualization, the following SQL query was used to generate the data set named `TPCH lineitem`. 
\n```\nSELECT * FROM samples.tpch.lineitem\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/visualization-types.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Dashboard visualization types\n##### Scatter visualization\n\nScatter visualizations are commonly used to show the relationship between two numerical variables. You can encode the third dimension with color to show how the numerical variables differ across groups. \n![Scatter example](https:\/\/docs.databricks.com\/_images\/scatter.png) \n**Configuration values**: For this scatter visualization example, the following values were set: \n* Title: `Total price and quantity by ship mode and supplier`\n* X-axis: \n+ Field: `l_quantity`\n+ Axis title: `Quantity`\n+ Scale type: `Quantitative`\n+ Transform: `None`\n* Y-axis: \n+ Field: `l_extendedprice`\n+ Scale type: `Quantitative`\n+ Transform: `None`\n+ Axis title: `Price`\n* Group by: \n+ Field: `l_shipmode`\n+ Legend title: `Ship mode`\n* Filter \n+ Field: `TPCH lineitem.l_supplierkey` \n**SQL query**: For this scatter visualization, the following SQL query was used to generate the data set named `TPCH lineitem`. \n```\nSELECT * FROM samples.tpch.lineitem\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/visualization-types.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Dashboard visualization types\n##### Table visualization\n\nThe table visualization shows data in a standard table but allows you to manually reorder, hide, and format the data. 
\n![Table example](https:\/\/docs.databricks.com\/_images\/table.png) \n**Configuration values**: For this table visualization example, the following values were set: \n* Title: `Line item summary by supplier`\n* Columns: \n+ Display row number: Enabled\n+ Field: `l_orderkey`\n+ Field: `l_extendedprice` \n- Display as: `Number`\n- Number format: $0.00\n+ Field: `l_discount` \n- Display as: `Number`\n- Number format: %0.00\n+ Field: `l_tax` \n- Display as: `Number`\n- Number format: %0.00\n+ Field: `l_shipdate`\n+ Field: `l_shipmode`\n* Filter \n+ Field: `TPCH lineitem.l_supplierkey` \n**SQL query**: For this table visualization, the following SQL query was used to generate the data set named `TPCH lineitem`. \n```\nSELECT * FROM samples.tpch.lineitem\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/visualization-types.html"} +{"content":"# AI and Machine Learning on Databricks\n## MLOps workflows on Databricks\n#### How does Databricks support CI\/CD for machine learning?\n\nCI\/CD (continuous integration and continuous delivery) refers to an automated process for developing, deploying, monitoring, and maintaining your applications. By automating the building, testing, and deployment of code, development teams can deliver releases more frequently and reliably than manual processes still prevalent across many data engineering and data science teams. CI\/CD for machine learning brings together techniques of MLOps, DataOps, ModelOps, and DevOps. \nThis article describes how Databricks supports CI\/CD for machine learning solutions. In machine learning applications, CI\/CD is important not only for code assets, but is also applied to data pipelines, including both input data and the results generated by the model. 
\n![End-to-end MLOps lifecycle diagram showing elements of CI\/CD for ML.](https:\/\/docs.databricks.com\/_images\/end-to-end-ml-cycle.png)\n\n#### How does Databricks support CI\/CD for machine learning?\n##### Machine learning elements that need CI\/CD\n\nOne of the challenges of ML development is that different teams own different parts of the process. Teams may rely on different tools and have different release schedules. Databricks provides a single, unified data and ML platform with integrated tools to improve teams\u2019 efficiency and ensure consistency and repeatability of data and ML pipelines. \nIn general for machine learning tasks, the following should be tracked in an automated CI\/CD workflow: \n* Training data, including data quality, schema changes, and distribution changes.\n* Input data pipelines.\n* Code for training, validating, and serving the model.\n* Model predictions and performance.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/ci-cd-for-ml.html"} +{"content":"# AI and Machine Learning on Databricks\n## MLOps workflows on Databricks\n#### How does Databricks support CI\/CD for machine learning?\n##### Integrate Databricks into your CI\/CD processes\n\n[MLOps](https:\/\/docs.databricks.com\/machine-learning\/mlops\/mlops-workflow.html), DataOps, ModelOps, and DevOps refer to the integration of development processes with \u201coperations\u201d - making the processes and infrastructure predictable and reliable. This set of articles describes how to integrate operations (\u201cops\u201d) principles into your ML workflows on the Databricks platform. \nDatabricks incorporates all of the components required for the ML lifecycle including tools to build \u201cconfiguration as code\u201d to ensure reproducibility and \u201cinfrastructure as code\u201d to automate the provisioning of cloud services. 
It also includes logging and alerting services to help you detect and troubleshoot problems when they occur.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/ci-cd-for-ml.html"} +{"content":"# AI and Machine Learning on Databricks\n## MLOps workflows on Databricks\n#### How does Databricks support CI\/CD for machine learning?\n##### DataOps: Reliable and secure data\n\nGood ML models depend on reliable data pipelines and infrastructure. With the Databricks Data Intelligence Platform, the entire data pipeline from ingesting data to the outputs from the served model is on a single platform and uses the same toolset, which facilitates productivity, reproducibility, sharing, and troubleshooting. \n![DataOps diagram](https:\/\/docs.databricks.com\/_images\/ci-cd-dataops.png) \n### DataOps tasks and tools in Databricks \nThe table lists common DataOps tasks and tools in Databricks: \n| DataOps task | Tool in Databricks |\n| --- | --- |\n| Ingest and transform data | [Autoloader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html) and Apache Spark |\n| Track changes to data including versioning and lineage | [Delta tables](https:\/\/docs.databricks.com\/delta\/index.html) |\n| Build, manage, and monitor data processing pipelines | [Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/index.html) |\n| Ensure data security and governance | [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html) |\n| Exploratory data analysis and dashboards | [Databricks SQL](https:\/\/docs.databricks.com\/sql\/index.html), [Dashboards](https:\/\/docs.databricks.com\/dashboards\/index.html), and [Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html) |\n| General coding | [Databricks SQL](https:\/\/docs.databricks.com\/sql\/index.html) and [Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/index.html) |\n| Schedule data pipelines | [Databricks 
Workflows](https:\/\/docs.databricks.com\/workflows\/index.html) |\n| Automate general workflows | [Databricks Workflows](https:\/\/docs.databricks.com\/workflows\/index.html) |\n| Create, store, manage, and discover features for model training | [Databricks Feature Store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/index.html) |\n| Data monitoring | [Lakehouse Monitoring](https:\/\/docs.databricks.com\/lakehouse-monitoring\/index.html) |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/ci-cd-for-ml.html"} +{"content":"# AI and Machine Learning on Databricks\n## MLOps workflows on Databricks\n#### How does Databricks support CI\/CD for machine learning?\n##### ModelOps: Model development and lifecycle\n\nDeveloping a model requires a series of experiments and a way to track and compare the conditions and results of those experiments. The Databricks Data Intelligence Platform includes MLflow for model development tracking and the MLflow Model Registry to manage the model lifecycle including staging, serving, and storing model artifacts. \nAfter a model is released to production, many things can change that might affect its performance. In addition to monitoring the model\u2019s prediction performance, you should also monitor input data for changes in quality or statistical characteristics that might require retraining the model. 
\n![ModelOps diagram](https:\/\/docs.databricks.com\/_images\/ci-cd-modelops.png) \n### ModelOps tasks and tools in Databricks \nThe table lists common ModelOps tasks and tools provided by Databricks: \n| ModelOps task | Tool in Databricks |\n| --- | --- |\n| Track model development | [MLflow model tracking](https:\/\/docs.databricks.com\/mlflow\/tracking.html) |\n| Manage model lifecycle | [Models in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/manage-model-lifecycle\/index.html) |\n| Model code version control and sharing | [Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html) |\n| No-code model development | [Databricks AutoML](https:\/\/docs.databricks.com\/machine-learning\/automl\/index.html) |\n| Model monitoring | [Lakehouse Monitoring](https:\/\/docs.databricks.com\/lakehouse-monitoring\/index.html) |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/ci-cd-for-ml.html"} +{"content":"# AI and Machine Learning on Databricks\n## MLOps workflows on Databricks\n#### How does Databricks support CI\/CD for machine learning?\n##### DevOps: Production and automation\n\nThe Databricks platform supports ML models in production with the following: \n* End-to-end data and model lineage: From models in production back to the raw data source, on the same platform.\n* Production-level Model Serving: Automatically scales up or down based on your business needs.\n* Multitask workflows: Automates jobs and creates scheduled machine learning workflows.\n* Git folders: Code versioning and sharing from the workspace, which helps teams follow software engineering best practices.\n* [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html): Automates deployment infrastructure across clouds for ML inference jobs, serving endpoints, and featurization jobs. 
\n### Model serving \nFor deploying models to production, MLflow significantly simplifies the process, providing single-click deployment as a batch job for large amounts of data or as a REST endpoint on an autoscaling cluster. The integration of Databricks Feature Store with MLflow also ensures consistency of features for training and serving. MLflow models can also automatically look up features from the Feature Store, even for low-latency online serving. \nThe Databricks platform supports many model deployment options: \n* Code and containers.\n* Batch serving.\n* Low-latency online serving.\n* On-device or edge serving.\n* Multi-cloud, for example, training the model on one cloud and deploying it with another. \nFor more information, see [Databricks Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html). \n### Multitask workflows \n[Databricks workflows](https:\/\/docs.databricks.com\/workflows\/index.html) allow you to automate and schedule any type of workload, from ETL to ML. Databricks also supports integrations with popular third-party [orchestrators like Airflow](https:\/\/docs.databricks.com\/workflows\/jobs\/how-to\/use-airflow-with-jobs.html). \n### Git folders \nThe Databricks platform includes Git support in the workspace to help teams follow software engineering best practices by performing Git operations through the UI. Administrators and DevOps engineers can use APIs to set up automation with their favorite CI\/CD tools. Databricks supports any type of Git deployment, including private networks. \nFor more information about best practices for code development using Databricks Git folders, see [CI\/CD workflows with Git integration and Databricks Git folders](https:\/\/docs.databricks.com\/repos\/ci-cd-techniques-with-repos.html) and [Use CI\/CD](https:\/\/docs.databricks.com\/dev-tools\/index-ci-cd.html). 
These techniques, together with the Databricks REST API, let you build automated deployment processes with GitHub Actions, Azure DevOps pipelines, or Jenkins jobs. \n### Unity Catalog for governance and security \nThe Databricks platform includes [Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html), which lets admins set up fine-grained access control, security policies, and governance for all data and AI assets across Databricks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/mlops\/ci-cd-for-ml.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Tutorial: End-to-end ML models on Databricks\n\nMachine learning in the real world is messy. Data sources contain missing values, include redundant rows, or may not fit\nin memory. Feature engineering often requires domain expertise and can be tedious. Modeling too often mixes data\nscience and systems engineering, requiring not only knowledge of algorithms but also of machine architecture and\ndistributed systems. \nDatabricks simplifies this process. The following 10-minute tutorial notebook shows an end-to-end example of training machine learning models on tabular data. 
\nYou can [import this notebook](https:\/\/docs.databricks.com\/notebooks\/notebook-export-import.html) and run it yourself, or copy code snippets and ideas for your own use.\n\n#### Tutorial: End-to-end ML models on Databricks\n##### Notebook\n\nIf your workspace is enabled for Unity Catalog, use this version of the notebook: \n### Use scikit-learn with MLflow integration on Databricks (Unity Catalog) \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-end-to-end-example-uc.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nIf your workspace is not enabled for Unity Catalog, use this version of the notebook: \n### Use scikit-learn with MLflow integration on Databricks \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-end-to-end-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/end-to-end-example.html"} +{"content":"# AI and Machine Learning on Databricks\n## Deploy models for batch inference and prediction\n### Deep learning model inference workflow\n##### Model inference using TensorFlow and TensorRT\n\nThe example notebook in this article demonstrates the Databricks-recommended [deep learning inference workflow](https:\/\/docs.databricks.com\/machine-learning\/model-inference\/dl-model-inference.html) with TensorFlow and TensorRT. This example shows how to optimize a trained ResNet-50 model with TensorRT for model inference. \n[NVIDIA TensorRT](https:\/\/developer.nvidia.com\/tensorrt) is a high-performance inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. TensorRT is installed in the [GPU-enabled](https:\/\/docs.databricks.com\/compute\/gpu.html) version of Databricks Runtime for Machine Learning. 
\nDatabricks recommends you use the [G4 instance type series](https:\/\/aws.amazon.com\/ec2\/instance-types\/g4\/), which is optimized for deploying machine learning models in production.\n\n##### Model inference using TensorFlow and TensorRT\n###### Model inference TensorFlow-TensorRT notebook\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/deep-learning\/tensorflow-tensorrt.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-inference\/resnet-model-inference-tensorrt.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### SparkR overview\n\nSparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR also supports distributed machine learning using MLlib.\n\n#### SparkR overview\n##### SparkR function reference\n\nYou can find the latest SparkR function reference on\n[spark.apache.org](https:\/\/spark.apache.org\/docs\/latest\/api\/R\/index.html). \nYou can also view function help in R notebooks or RStudio after you import the SparkR\npackage. \n![Embedded R documentation](https:\/\/docs.databricks.com\/_images\/inline-r-docs.png)\n\n#### SparkR overview\n##### SparkR in notebooks\n\n* For Spark 2.0 and above, you do not need to explicitly pass a `sqlContext` object to every function call.\n* For Spark 2.2 and above, notebooks no longer import SparkR by default because SparkR functions were conflicting with similarly named functions from other popular packages. To use SparkR you can call `library(SparkR)` in your notebooks. 
The SparkR session is already configured, and all SparkR functions will talk to your attached cluster using the existing session.\n\n#### SparkR overview\n##### SparkR in spark-submit jobs\n\nYou can run scripts that use SparkR on Databricks as spark-submit jobs, with minor code modifications.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/overview.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### SparkR overview\n##### Create SparkR DataFrames\n\nYou can create a DataFrame from a local R `data.frame`, from a data source, or using a Spark SQL query. \n### From a local R `data.frame` \nThe simplest way to create a DataFrame is to convert a local R `data.frame` into a\n`SparkDataFrame`. Specifically we can use `createDataFrame` and pass in the local R\n`data.frame` to create a `SparkDataFrame`. Like most other SparkR functions, `createDataFrame`\nsyntax changed in Spark 2.0. You can see examples of this in the code snippet below.\nFor more examples, see [createDataFrame](https:\/\/spark.apache.org\/docs\/latest\/api\/R\/reference\/createDataFrame.html). \n```\nlibrary(SparkR)\ndf <- createDataFrame(faithful)\n\n# Displays the content of the DataFrame to stdout\nhead(df)\n\n``` \n### Using the data source API \nThe general method for creating a DataFrame from a data source is `read.df`.\nThis method takes the path for the file to load and the type of data source.\nSparkR supports reading CSV, JSON, text, and Parquet files\nnatively. \n```\nlibrary(SparkR)\ndiamondsDF <- read.df(\"\/databricks-datasets\/Rdatasets\/data-001\/csv\/ggplot2\/diamonds.csv\", source = \"csv\", header=\"true\", inferSchema = \"true\")\nhead(diamondsDF)\n\n``` \nSparkR automatically infers the schema from the CSV file. \n#### Adding a data source connector with Spark Packages \nThrough Spark Packages you can find data source connectors\nfor popular file formats such as Avro. 
As an example, use the\n[spark-avro package](https:\/\/spark-packages.org\/package\/databricks\/spark-avro)\nto load an [Avro](https:\/\/avro.apache.org\/) file. The availability of the spark-avro package depends on your cluster\u2019s [version](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html). See [Avro file](https:\/\/docs.databricks.com\/query\/formats\/avro.html). \nFirst take an existing `data.frame`, convert to a Spark DataFrame, and save it as an Avro file. \n```\nrequire(SparkR)\nirisDF <- createDataFrame(iris)\nwrite.df(irisDF, source = \"com.databricks.spark.avro\", path = \"dbfs:\/tmp\/iris.avro\", mode = \"overwrite\")\n\n``` \nTo verify that an Avro file was saved: \n```\n%fs ls \/tmp\/iris.avro\n\n``` \nNow use the spark-avro package again to read back the data. \n```\nirisDF2 <- read.df(path = \"\/tmp\/iris.avro\", source = \"com.databricks.spark.avro\")\nhead(irisDF2)\n\n``` \nThe data source API can also be used to save DataFrames into\nmultiple file formats. For example, you can save the DataFrame from the\nprevious example to a Parquet file using `write.df`. \n```\nwrite.df(irisDF2, path=\"dbfs:\/tmp\/iris.parquet\", source=\"parquet\", mode=\"overwrite\")\n\n``` \n```\n%fs ls dbfs:\/tmp\/iris.parquet\n\n``` \n### From a Spark SQL query \nYou can also create SparkR DataFrames using Spark SQL queries. \n```\n# Register earlier df as temp view\ncreateOrReplaceTempView(irisDF2, \"irisTemp\")\n\n``` \n```\n# Create a df consisting of only the 'species' column using a Spark SQL query\nspecies <- sql(\"SELECT species FROM irisTemp\")\n\n``` \n`species` is a SparkDataFrame.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/overview.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### SparkR overview\n##### DataFrame operations\n\nSpark DataFrames support a number of functions to do structured data\nprocessing. Here are some basic examples. 
A complete list can\nbe found in the [API docs](https:\/\/spark.apache.org\/docs\/latest\/api\/R\/). \n### Select rows and columns \n```\n# Import SparkR package if this is a new notebook\nrequire(SparkR)\n\n# Create DataFrame\ndf <- createDataFrame(faithful)\n\n``` \n```\n# Select only the \"eruptions\" column\nhead(select(df, df$eruptions))\n\n``` \n```\n# You can also pass in column names as strings\nhead(select(df, \"eruptions\"))\n\n``` \n```\n# Filter the DataFrame to only retain rows with wait times shorter than 50 mins\nhead(filter(df, df$waiting < 50))\n\n``` \n### Grouping and aggregation \nSparkDataFrames support a number of commonly used functions to\naggregate data after grouping. For example you can count the number of\ntimes each waiting time appears in the faithful dataset. \n```\nhead(count(groupBy(df, df$waiting)))\n\n``` \n```\n# You can also sort the output from the aggregation to get the most common waiting times\nwaiting_counts <- count(groupBy(df, df$waiting))\nhead(arrange(waiting_counts, desc(waiting_counts$count)))\n\n``` \n### Column operations \nSparkR provides a number of functions that can be directly applied to\ncolumns for data processing and aggregation. The following example shows the\nuse of basic arithmetic functions. \n```\n# Convert waiting time from minutes to seconds.\n# You can assign this to a new column in the same DataFrame\ndf$waiting_secs <- df$waiting * 60\nhead(df)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/overview.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### SparkR overview\n##### Machine learning\n\nSparkR exposes most MLlib algorithms. Under the hood, SparkR\nuses MLlib to train the model. \nThe following example shows how to build a Gaussian GLM model using\nSparkR. To run linear regression, set family to `\"gaussian\"`. To run\nlogistic regression, set family to `\"binomial\"`. 
When using SparkML GLM, SparkR\nautomatically performs one-hot encoding of\ncategorical features so that it does not need to be done manually.\nBeyond String and Double type features, it is also possible to fit over\nMLlib Vector features, for compatibility with other MLlib components. \n```\n# Create the DataFrame\ndf <- createDataFrame(iris)\n\n# Fit a linear model over the dataset.\nmodel <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = \"gaussian\")\n\n# Model coefficients are returned in a similar format to R's native glm().\nsummary(model)\n\n``` \nFor tutorials, see [Tutorial: Analyze data with glm](https:\/\/docs.databricks.com\/sparkr\/glm-tutorial.html). \nFor additional examples, see [Work with DataFrames and tables in R](https:\/\/docs.databricks.com\/sparkr\/dataframes-tables.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/overview.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Get started with MLflow experiments\n\nThis collection of notebooks demonstrates how you can get up and running with MLflow experiment runs.\n\n#### Get started with MLflow experiments\n##### MLflow components\n\n[MLflow](https:\/\/www.mlflow.org\/) is an open source platform for managing the end-to-end machine learning lifecycle. 
MLflow has three primary components: \n* Tracking\n* Models\n* Projects \nThe MLflow Tracking component lets you log and query machine learning model training sessions (*runs*) using the following APIs: \n* [Java](https:\/\/www.mlflow.org\/docs\/latest\/java_api\/index.html)\n* [Python](https:\/\/www.mlflow.org\/docs\/latest\/python_api\/index.html)\n* [R](https:\/\/www.mlflow.org\/docs\/latest\/R-api.html)\n* [REST](https:\/\/www.mlflow.org\/docs\/latest\/rest-api.html) \nAn MLflow *run* is a collection of parameters, metrics, tags, and artifacts associated with a machine learning model training process.\n\n#### Get started with MLflow experiments\n##### What are experiments in MLflow?\n\n*Experiments* are the primary unit of organization in MLflow; all MLflow runs belong to an experiment. Each experiment lets you visualize, search, and compare runs, as well as download run artifacts or metadata for analysis in other tools. Experiments are maintained in a Databricks hosted MLflow tracking server. \nExperiments are located in the [workspace](https:\/\/docs.databricks.com\/workspace\/index.html) file tree. You manage experiments using the same tools you use to manage other workspace objects such as folders, notebooks, and libraries.\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/quick-start.html"} +{"content":"# AI and Machine Learning on Databricks\n## ML lifecycle management using MLflow\n#### Get started with MLflow experiments\n##### MLflow example notebooks\n\nThe following notebooks demonstrate how to create and log to an MLflow run using the MLflow tracking APIs, as well as how to use the experiment UI to view the run. These notebooks are available in Python, Scala, and R. \nThe Python and R notebooks use a [notebook experiment](https:\/\/docs.databricks.com\/mlflow\/experiments.html#mlflow-notebook-experiments). The Scala notebook creates an experiment in the `Shared` folder. 
\nNote \nWith Databricks Runtime 10.4 LTS ML and above, [Databricks Autologging](https:\/\/docs.databricks.com\/mlflow\/databricks-autologging.html) is enabled by default for Python notebooks. \n* [Quickstart Python](https:\/\/docs.databricks.com\/mlflow\/quick-start-python.html)\n* [Quickstart Java and Scala](https:\/\/docs.databricks.com\/mlflow\/quick-start-java-scala.html)\n* [Quickstart R](https:\/\/docs.databricks.com\/mlflow\/quick-start-r.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/mlflow\/quick-start.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Shiny on Databricks\n\n[Shiny](https:\/\/shiny.rstudio.com\/) is an R package, available on CRAN, used to build interactive R applications and dashboards. You can use Shiny inside [RStudio Server](https:\/\/docs.databricks.com\/sparkr\/rstudio.html) hosted on Databricks clusters. You can also develop, host, and share Shiny applications directly from a Databricks notebook. \nTo get started with Shiny, see the [Shiny tutorials](https:\/\/shiny.rstudio.com\/tutorial\/). You can run these tutorials on Databricks notebooks. \nThis article describes how to run Shiny applications on Databricks and use Apache Spark inside Shiny applications.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/shiny.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Shiny on Databricks\n##### Shiny inside R notebooks\n\n### Get started with Shiny inside R notebooks \nThe Shiny package is included with Databricks Runtime. You can interactively develop and test Shiny applications inside Databricks R notebooks similarly to hosted RStudio. \nFollow these steps to get started: \n1. Create an R notebook.\n2. Import the Shiny package and run the example app `01_hello` as follows: \n```\nlibrary(shiny)\nrunExample(\"01_hello\")\n\n```\n3. When the app is ready, the output includes the Shiny app URL as a clickable link which opens a new tab. 
To share this app with other users, see [Share Shiny app URL](https:\/\/docs.databricks.com\/sparkr\/shiny.html#share-shiny-app-url). \n![Example Shiny app](https:\/\/docs.databricks.com\/_images\/shiny-01-notebook.png) \nNote \n* Log messages appear in the command result, similar to the default log message (`Listening on http:\/\/0.0.0.0:5150`) shown in the example.\n* To stop the Shiny application, click **Cancel**.\n* The Shiny application uses the notebook R process. If you detach the notebook from the cluster, or if you cancel the cell running the application, the Shiny application terminates. You cannot run other cells while the Shiny application is running. \n### Run Shiny apps from Databricks Git folders \nYou can run Shiny apps that are checked into [Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html). \n1. [Clone a remote Git repository](https:\/\/docs.databricks.com\/repos\/git-operations-with-repos.html).\n2. Run the application. \n```\nlibrary(shiny)\nrunApp(\"006-tabsets\")\n\n``` \n### Run Shiny apps from files \nIf your Shiny application code is part of a project managed by version control, you can run it inside the notebook. \nNote \nYou must use the absolute path or set the working directory with `setwd()`. \n1. Check out the code from a repository using code similar to: \n```\n%sh git clone https:\/\/github.com\/rstudio\/shiny-examples.git\ncloning into 'shiny-examples'...\n\n```\n2. To run the application, enter code similar to the following in another cell: \n```\nlibrary(shiny)\nrunApp(\"\/databricks\/driver\/shiny-examples\/007-widgets\/\")\n\n``` \n### Share Shiny app URL \nThe Shiny app URL generated when you start an app is shareable with other users. Any Databricks user with CAN ATTACH TO permission on the cluster can view and interact with the app as long as both the app and the cluster are running. \nIf the cluster that the app is running on terminates, the app is no longer accessible. 
You can [disable automatic termination](https:\/\/docs.databricks.com\/compute\/configure.html) in the cluster settings. \nIf you attach and run the notebook hosting the Shiny app on a different cluster, the Shiny URL changes. Also, if you restart the app on the same cluster, Shiny might pick a different random port. To ensure a stable URL, you can set the `shiny.port` option, or, when restarting the app on the same cluster, you can specify the `port` argument.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/shiny.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Shiny on Databricks\n##### Shiny on hosted RStudio Server\n\n### Requirements \n* [RStudio on Databricks](https:\/\/docs.databricks.com\/sparkr\/rstudio.html). \nImportant \nWith RStudio Server Pro, you must disable proxied authentication.\nMake sure `auth-proxy=1` is not present inside `\/etc\/rstudio\/rserver.conf`. \n### Get started with Shiny on hosted RStudio Server \n1. Open RStudio on Databricks.\n2. In RStudio, import the Shiny package and run the example app `01_hello` as follows: \n```\n> library(shiny)\n> runExample(\"01_hello\")\n\nListening on http:\/\/127.0.0.1:3203\n\n``` \nA new window appears, displaying the Shiny application. \n![First Shiny app](https:\/\/docs.databricks.com\/_images\/shiny-01-hello.png) \n### Run a Shiny app from an R script \nTo run a Shiny app from an R script, open the R script in the RStudio editor and click the **Run App** button on the top right. \n![Shiny run App](https:\/\/docs.databricks.com\/_images\/shiny-run-app.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/shiny.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Shiny on Databricks\n##### Use Apache Spark inside Shiny apps\n\nYou can use Apache Spark inside Shiny applications with either SparkR or sparklyr. 
\n### Use SparkR with Shiny in a notebook \n```\nlibrary(shiny)\nlibrary(SparkR)\nsparkR.session()\n\nui <- fluidPage(\nmainPanel(\ntextOutput(\"value\")\n)\n)\n\nserver <- function(input, output) {\noutput$value <- renderText({ nrow(createDataFrame(iris)) })\n}\n\nshinyApp(ui = ui, server = server)\n\n``` \n### Use sparklyr with Shiny in a notebook \n```\nlibrary(shiny)\nlibrary(sparklyr)\n\nsc <- spark_connect(method = \"databricks\")\n\nui <- fluidPage(\nmainPanel(\ntextOutput(\"value\")\n)\n)\n\nserver <- function(input, output) {\noutput$value <- renderText({\ndf <- sdf_len(sc, 5, repartition = 1) %>%\nspark_apply(function(e) sum(e)) %>%\ncollect()\ndf$result\n})\n}\n\nshinyApp(ui = ui, server = server)\n\n``` \n```\nlibrary(dplyr)\nlibrary(ggplot2)\nlibrary(shiny)\nlibrary(sparklyr)\n\nsc <- spark_connect(method = \"databricks\")\ndiamonds_tbl <- spark_read_csv(sc, path = \"\/databricks-datasets\/Rdatasets\/data-001\/csv\/ggplot2\/diamonds.csv\")\n\n# Define the UI\nui <- fluidPage(\nsliderInput(\"carat\", \"Select Carat Range:\",\nmin = 0, max = 5, value = c(0, 5), step = 0.01),\nplotOutput('plot')\n)\n\n# Define the server code\nserver <- function(input, output) {\noutput$plot <- renderPlot({\n# Select diamonds in carat range\ndf <- diamonds_tbl %>%\ndplyr::select(\"carat\", \"price\") %>%\ndplyr::filter(carat >= !!input$carat[[1]], carat <= !!input$carat[[2]])\n\n# Scatter plot with smoothed means\nggplot(df, aes(carat, price)) +\ngeom_point(alpha = 1\/2) +\ngeom_smooth() +\nscale_size_area(max_size = 2) +\nggtitle(\"Price vs. 
Carat\")\n})\n}\n\n# Return a Shiny app object\nshinyApp(ui = ui, server = server)\n\n``` \n![Spark Shiny app](https:\/\/docs.databricks.com\/_images\/shiny-spark.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/shiny.html"} +{"content":"# Develop on Databricks\n## Databricks for R developers\n#### Shiny on Databricks\n##### Frequently asked questions (FAQ)\n\n* [Why is my Shiny app grayed out after some time?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#why-is-my-shiny-app-grayed-out-after-some-time)\n* [Why does my Shiny viewer window disappear after a while?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#why-does-my-shiny-viewer-window-disappear-after-a-while)\n* [Why do long Spark jobs never return?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#why-do-long-spark-jobs-never-return)\n* [How can I avoid the timeout?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#how-can-i-avoid-the-timeout)\n* [My app crashes immediately after launching, but the code appears to be correct. 
What\u2019s going on?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#my-app-crashes-immediately-after-launching-but-the-code-appears-to-be-correct-whats-going-on)\n* [How many connections can be accepted for one Shiny app link during development?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#how-many-connections-can-be-accepted-for-one-shiny-app-link-during-development)\n* [Can I use a different version of the Shiny package than the one installed in Databricks Runtime?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#can-i-use-a-different-version-of-the-shiny-package-than-the-one-installed-in-databricks-runtime)\n* [How can I develop a Shiny application that can be published to a Shiny server and access data on Databricks?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#how-can-i-develop-a-shiny-application-that-can-be-published-to-a-shiny-server-and-access-data-on-databricks)\n* [Can I develop a Shiny application inside a Databricks notebook?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#can-i-develop-a-shiny-application-inside-a-databricks-notebook)\n* [How can I save the Shiny applications that I developed on hosted RStudio Server?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#how-can-i-save-the-shiny-applications-that-i-developed-on-hosted-rstudio-server) \n### [Why is my Shiny app grayed out after some time?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#id1) \nIf there is no interaction with the Shiny app, the connection to the app closes after about 10 minutes. \nTo reconnect, refresh the Shiny app page. The dashboard state resets. \n### [Why does my Shiny viewer window disappear after a while?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#id2) \nIf the Shiny viewer window disappears after idling for several minutes, it is due to the same timeout as the \u201cgray out\u201d scenario. \n### [Why do long Spark jobs never return?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#id3) \nThis is also because of the idle timeout. 
Any Spark job running for longer than the previously mentioned timeouts is not able to render its result because the connection closes before the job returns. \n### [How can I avoid the timeout?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#id4) \n* There is a workaround suggested in [Feature request: Have client send keep alive message to prevent TCP timeout on some load balancers](https:\/\/github.com\/rstudio\/shiny\/issues\/2110#issuecomment-419971302) on GitHub. The workaround sends heartbeats to keep the WebSocket connection alive when the app is idle. However, if the app is blocked by a long-running computation, this workaround does not work.\n* Shiny does not support long-running tasks. A Shiny blog post recommends using [promises and futures](https:\/\/blog.rstudio.com\/2018\/06\/26\/shiny-1-1-0\/) to run long tasks asynchronously and keep the app unblocked. Here is an example that uses heartbeats to keep the Shiny app alive, and runs a long-running Spark job in a `future` construct. 
\n```\n# Write an app that uses spark to access data on Databricks\n# First, install the following packages:\ninstall.packages(\"future\")\ninstall.packages(\"promises\")\n\nlibrary(shiny)\nlibrary(promises)\nlibrary(future)\nplan(multisession)\n\nHEARTBEAT_INTERVAL_MILLIS = 1000 # 1 second\n\n# Define the long Spark job here\nrun_spark <- function(x) {\n# Environment setting\nlibrary(\"SparkR\", lib.loc = \"\/databricks\/spark\/R\/lib\")\nsparkR.session()\n\nirisDF <- createDataFrame(iris)\ncollect(irisDF)\nSys.sleep(3)\nx + 1\n}\n\nrun_spark_sparklyr <- function(x) {\n# Environment setting\nlibrary(sparklyr)\nlibrary(dplyr)\nlibrary(\"SparkR\", lib.loc = \"\/databricks\/spark\/R\/lib\")\nsparkR.session()\nsc <- spark_connect(method = \"databricks\")\n\niris_tbl <- copy_to(sc, iris, overwrite = TRUE)\ncollect(iris_tbl)\nx + 1\n}\n\nui <- fluidPage(\nsidebarLayout(\n# Display heartbeat\nsidebarPanel(textOutput(\"keep_alive\")),\n\n# Display the Input and Output of the Spark job\nmainPanel(\nnumericInput('num', label = 'Input', value = 1),\nactionButton('submit', 'Submit'),\ntextOutput('value')\n)\n)\n)\nserver <- function(input, output) {\n#### Heartbeat ####\n# Define reactive variable\ncnt <- reactiveVal(0)\n# Define time dependent trigger\nautoInvalidate <- reactiveTimer(HEARTBEAT_INTERVAL_MILLIS)\n# Time dependent change of variable\nobserveEvent(autoInvalidate(), { cnt(cnt() + 1) })\n# Render print\noutput$keep_alive <- renderPrint(cnt())\n\n#### Spark job ####\nresult <- reactiveVal() # the result of the spark job\nbusy <- reactiveVal(0) # whether the spark job is running\n# Launch a spark job in a future when actionButton is clicked\nobserveEvent(input$submit, {\nif (busy() != 0) {\nshowNotification(\"Already running Spark job...\")\nreturn(NULL)\n}\nshowNotification(\"Launching a new Spark job...\")\n# input$num must be read outside the future\ninput_x <- input$num\nfut <- future({ run_spark(input_x) }) %...>% result()\n# Or: fut <- future({ 
run_spark_sparklyr(input_x) }) %...>% result()\nbusy(1)\n# Catch exceptions and notify the user\nfut <- catch(fut, function(e) {\nresult(NULL)\ncat(e$message)\nshowNotification(e$message)\n})\nfut <- finally(fut, function() { busy(0) })\n# Return something other than the promise so shiny remains responsive\nNULL\n})\n# When the spark job returns, render the value\noutput$value <- renderPrint(result())\n}\nshinyApp(ui = ui, server = server)\n\n```\n* There is a hard limit of 12 hours since the initial page load after which any connection, even if active, will be terminated. You must refresh the Shiny app to reconnect in these cases. However, the underlying WebSocket connection can close at any time due to a variety of factors including network instability or computer sleep mode. Databricks recommends rewriting Shiny apps such that they do not require a long-lived connection and do not over-rely on session state. \n### [My app crashes immediately after launching, but the code appears to be correct. What\u2019s going on?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#id5) \nThere is a 50 MB limit on the total amount of data that can be displayed in a Shiny app on Databricks. If the application\u2019s total data size exceeds this limit, it will crash immediately after launching. To avoid this, Databricks recommends reducing the data size, for example by downsampling the displayed data or reducing the resolution of images. \n### [How many connections can be accepted for one Shiny app link during development?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#id6) \nDatabricks recommends up to 20. \n### [Can I use a different version of the Shiny package than the one installed in Databricks Runtime?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#id7) \nYes. See [Fix the Version of R Packages](https:\/\/kb.databricks.com\/r\/pin-r-packages.html). 
\n### [How can I develop a Shiny application that can be published to a Shiny server and access data on Databricks?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#id8) \nWhile you can access data naturally using SparkR or sparklyr during development and testing on Databricks, after a Shiny application is published to a stand-alone hosting service, it cannot directly access the data and tables on Databricks. \nTo enable your application to function outside Databricks, you must rewrite how you access data. There are a few options: \n* Use [JDBC\/ODBC](https:\/\/docs.databricks.com\/integrations\/jdbc-odbc-bi.html) to submit queries to a Databricks cluster.\n* Use [Databricks Connect](https:\/\/docs.databricks.com\/dev-tools\/databricks-connect\/index.html).\n* Directly access data on object storage. \nDatabricks recommends that you work with your Databricks solutions team to find the best approach for your existing data and analytics architecture. \n### [Can I develop a Shiny application inside a Databricks notebook?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#id9) \nYes, you can develop a Shiny application inside a Databricks notebook. \n### [How can I save the Shiny applications that I developed on hosted RStudio Server?](https:\/\/docs.databricks.com\/sparkr\/shiny.html#id10) \nYou can either save your application code [on DBFS](https:\/\/docs.databricks.com\/dbfs\/index.html) or check your code into version control.\n\n","doc_uri":"https:\/\/docs.databricks.com\/sparkr\/shiny.html"} +{"content":"# What is Delta Lake?\n### Drop or replace a Delta table\n\nDatabricks supports SQL standard DDL commands for dropping and replacing tables registered with either Unity Catalog or the Hive metastore. 
This article provides examples of dropping and replacing Delta tables and recommendations for syntax depending on your configured environment and desired outcome.\n\n### Drop or replace a Delta table\n#### When to drop a table\n\nYou should use `DROP TABLE` to remove a table from the metastore when you want to permanently delete the table and have no intention of creating a new table in the same location. For example: \n```\nDROP TABLE table_name\n\n``` \n`DROP TABLE` has different semantics depending on the type of table and whether the table is registered to Unity Catalog or the legacy Hive metastore. \n| Table type | Metastore | Behavior |\n| --- | --- | --- |\n| Managed | Unity Catalog | The table is removed from the metastore and underlying data is marked for deletion. You can `UNDROP` data in Unity Catalog managed tables for 7 days. |\n| Managed | Hive | The table is removed from the metastore and the underlying data is deleted. |\n| External | Unity Catalog | The table is removed from the metastore but the underlying data remains. URI access privileges are now governed by the external location that contains the data. |\n| External | Hive | The table is removed from the metastore but the underlying data remains. Any URI access privileges are unchanged. | \n`DROP TABLE` semantics differ across table types, and Unity Catalog maintains a history of Delta tables using an internal table ID. However, all tables share the common result that after the operation completes, the previously registered table name no longer has an active link to data and table history from the metastore. \nSee [DROP TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-drop-table.html). \nNote \nDatabricks does not recommend the pattern of dropping and then recreating a table using the same name for production pipelines or systems, as this pattern can result in unexpected results for concurrent operations. 
See [Replace data with concurrent operations](https:\/\/docs.databricks.com\/delta\/drop-table.html#concurrent).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/drop-table.html"} +{"content":"# What is Delta Lake?\n### Drop or replace a Delta table\n#### When to replace a table\n\nDatabricks recommends using `CREATE OR REPLACE TABLE` statements for use cases where you want to fully overwrite the target table with new data. For example, to overwrite a Delta table with all data from a Parquet directory, you could run the following command: \n```\nCREATE OR REPLACE TABLE table_name\nAS SELECT * FROM parquet.`\/path\/to\/files`\n\n``` \n`CREATE OR REPLACE TABLE` has the same semantics regardless of the table type or metastore in use. The following are important advantages of `CREATE OR REPLACE TABLE`: \n* Table contents are replaced, but the table identity is maintained.\n* The table history is retained, and you can revert the table to an earlier version with the `RESTORE` command.\n* The operation is a single transaction, so there is never a time when the table doesn\u2019t exist.\n* Concurrent queries reading from the table can continue without interruption. Because the version before and after replacement still exists in the table history, concurrent queries can reference either version of the table as necessary. \nSee [CREATE TABLE [USING]](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-table-using.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/drop-table.html"} +{"content":"# What is Delta Lake?\n### Drop or replace a Delta table\n#### Replace data with concurrent operations\n\nWhenever you want to perform a full replacement of data in a table that might be used in concurrent operations, you must use `CREATE OR REPLACE TABLE`. \nThe following anti-pattern should not be used: \n```\n-- This is an anti-pattern. 
Avoid doing this!\nDROP TABLE IF EXISTS table_name;\n\nCREATE TABLE table_name\nAS SELECT * FROM parquet.`\/path\/to\/files`;\n\n``` \nThe reasons for this recommendation vary depending on whether you\u2019re using managed or external tables and whether you\u2019re using Unity Catalog, but across all Delta table types, using this pattern can result in an error, dropped records, or corrupted results. \nInstead, Databricks recommends always using `CREATE OR REPLACE TABLE`, as in the following example: \n```\nCREATE OR REPLACE TABLE table_name\nAS SELECT * FROM parquet.`\/path\/to\/files`\n\n``` \nBecause the table history is maintained during the atomic data replacement, concurrent transactions can validate the version of the source table referenced, and therefore fail or reconcile concurrent transactions as necessary without introducing unexpected behavior or results.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/drop-table.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on PostgreSQL\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes how to set up Lakehouse Federation to run federated queries on PostgreSQL data that is not managed by Databricks. To learn more about Lakehouse Federation, see [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). 
\nTo connect to your PostgreSQL database using Lakehouse Federation, you must create the following in your Databricks Unity Catalog metastore: \n* A *connection* to your PostgreSQL database.\n* A *foreign catalog* that mirrors your PostgreSQL database in Unity Catalog so that you can use Unity Catalog query syntax and data governance tools to manage Databricks user access to the database.\n\n#### Run federated queries on PostgreSQL\n##### Before you begin\n\nWorkspace requirements: \n* Workspace enabled for Unity Catalog. \nCompute requirements: \n* Network connectivity from your Databricks Runtime cluster or SQL warehouse to the target database systems. See [Networking recommendations for Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/networking.html).\n* Databricks clusters must use Databricks Runtime 13.3 LTS or above and shared or single-user access mode.\n* SQL warehouses must be Pro or Serverless. \nPermissions required: \n* To create a connection, you must be a metastore admin or a user with the `CREATE CONNECTION` privilege on the Unity Catalog metastore attached to the workspace.\n* To create a foreign catalog, you must have the `CREATE CATALOG` permission on the metastore and be either the owner of the connection or have the `CREATE FOREIGN CATALOG` privilege on the connection. \nAdditional permission requirements are specified in each task-based section that follows.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/postgresql.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on PostgreSQL\n##### Create a connection\n\nA connection specifies a path and credentials for accessing an external database system. To create a connection, you can use Catalog Explorer or the `CREATE CONNECTION` SQL command in a Databricks notebook or the Databricks SQL query editor. 
\n**Permissions required:** Metastore admin or user with the `CREATE CONNECTION` privilege. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **External Data** menu and select **Connections**.\n3. Click **Create connection**.\n4. Enter a user-friendly **Connection name**.\n5. Select a **Connection type** of **PostgreSQL**.\n6. Enter the following connection properties for your PostgreSQL instance. \n* **Host**: For example, `postgres-demo.lb123.us-west-2.rds.amazonaws.com`\n* **Port**: For example, `5432`\n* **User**: For example, `postgres_user`\n* **Password**: For example, `password123`\n7. (Optional) Click **Test connection** to confirm that it works.\n8. (Optional) Add a comment.\n9. Click **Create**. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nCREATE CONNECTION <connection-name> TYPE postgresql\nOPTIONS (\nhost '<hostname>',\nport '<port>',\nuser '<user>',\npassword '<password>'\n);\n\n``` \nWe recommend that you use Databricks [secrets](https:\/\/docs.databricks.com\/security\/secrets\/index.html) instead of plaintext strings for sensitive values like credentials. 
For example: \n```\nCREATE CONNECTION <connection-name> TYPE postgresql\nOPTIONS (\nhost '<hostname>',\nport '<port>',\nuser secret ('<secret-scope>','<secret-key-user>'),\npassword secret ('<secret-scope>','<secret-key-password>')\n)\n\n``` \nFor information about setting up secrets, see [Secret management](https:\/\/docs.databricks.com\/security\/secrets\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/postgresql.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on PostgreSQL\n##### Create a foreign catalog\n\nA foreign catalog mirrors a database in an external data system so that you can query and manage access to data in that database using Databricks and Unity Catalog. To create a foreign catalog, you use a connection to the data source that has already been defined. \nTo create a foreign catalog, you can use Catalog Explorer or the `CREATE FOREIGN CATALOG` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** `CREATE CATALOG` permission on the metastore and either ownership of the connection or the `CREATE FOREIGN CATALOG` privilege on the connection. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Click the **Create Catalog** button.\n3. On the **Create a new catalog** dialog, enter a name for the catalog and select a **Type** of **Foreign**.\n4. Select the **Connection** that provides access to the database that you want to mirror as a Unity Catalog catalog.\n5. Enter the name of the **Database** that you want to mirror as a catalog.\n6. Click **Create.** \nRun the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional. 
Replace the placeholder values: \n* `<catalog-name>`: Name for the catalog in Databricks.\n* `<connection-name>`: The [connection object](https:\/\/docs.databricks.com\/query-federation\/index.html#connection) that specifies the data source, path, and access credentials.\n* `<database-name>`: Name of the database you want to mirror as a catalog in Databricks. \n```\nCREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>\nOPTIONS (database '<database-name>');\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/postgresql.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on PostgreSQL\n##### Supported pushdowns\n\nThe following pushdowns are supported on all compute: \n* Filters\n* Projections\n* Limit\n* Functions: partial, only for filter expressions. (String functions, Mathematical functions, Date, Time and Timestamp functions, and other miscellaneous functions, such as Alias, Cast, SortOrder) \nThe following pushdowns are supported on Databricks Runtime 13.3 LTS and above, and on SQL warehouses: \n* The following aggregation functions: MIN, MAX, COUNT, SUM, AVG, VAR\\_POP, VAR\\_SAMP, STDDEV\\_POP, STDDEV\\_SAMP, GREATEST, LEAST, COVAR\\_POP, COVAR\\_SAMP, CORR, REGR\\_INTERCEPT, REGR\\_R2, REGR\\_SLOPE, REGR\\_SXY\n* The following Boolean functions: =, <, <=, >, >=, <=>\n* The following mathematical functions (not supported if ANSI is disabled): +, -, \\*, %, \/\n* Miscellaneous operators | and ~\n* Sorting, when used with limit \nThe following pushdowns are not supported: \n* Joins\n* Window functions\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/postgresql.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on PostgreSQL\n##### Data type mappings\n\nWhen you read from PostgreSQL to Spark, data types map as follows: \n| PostgreSQL type | Spark type |\n| --- | --- |\n| numeric | 
DecimalType |\n| int2 | ShortType |\n| int4 (if not signed) | IntegerType |\n| int8, oid, xid, int4 (if signed) | LongType |\n| float4 | FloatType |\n| double precision, float8 | DoubleType |\n| char | CharType |\n| name, varchar, tid | VarcharType |\n| bpchar, character varying, json, money, point, super, text | StringType |\n| bytea, geometry, varbyte | BinaryType |\n| bit, bool | BooleanType |\n| date | DateType |\n| tabstime, time, time with time zone, timetz, time without time zone, timestamp with time zone, timestamp, timestamptz, timestamp without time zone\\* | TimestampType\/TimestampNTZType |\n| PostgreSQL array type\\*\\* | ArrayType | \n\\*When you read from PostgreSQL, PostgreSQL `Timestamp` is mapped to Spark `TimestampType` if `preferTimestampNTZ = false` (default). PostgreSQL `Timestamp` is mapped to `TimestampNTZType` if `preferTimestampNTZ = true`. \n\\*\\*Only a limited set of array types is supported.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/postgresql.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Manage privileges in Unity Catalog\n\nThis article explains how to control access to data and other objects in Unity Catalog. To learn about how this model differs from access control in the Hive metastore, see [Work with Unity Catalog and the legacy Hive metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/hive-metastore.html).\n\n#### Manage privileges in Unity Catalog\n##### Who can manage privileges?\n\nInitially, users have no access to data in a metastore. Databricks account admins, workspace admins, and metastore admins have default privileges for managing Unity Catalog. See [Admin privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/admin-privileges.html). \nAll securable objects in Unity Catalog have an owner. 
Object owners have all privileges on that object, including the ability to grant privileges to other principals. See [Manage Unity Catalog object ownership](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/ownership.html). \nPrivileges can be granted by either a metastore admin, the owner of an object, or the owner of the catalog or schema that contains the object. Account admins can also grant privileges directly on a metastore. \n### Workspace catalog privileges \nIf your workspace was enabled for Unity Catalog automatically, the workspace is attached to a metastore by default and a workspace catalog is created for your workspace in the metastore. Workspace admins are the default owners of the workspace catalog. As owners, they can manage privileges on the workspace catalog and all child objects. \nAll workspace users receive the `USE CATALOG` privilege on the workspace catalog. Workspace users also receive the `USE SCHEMA`, `CREATE TABLE`, `CREATE VOLUME`, `CREATE MODEL`, `CREATE FUNCTION`, and `CREATE MATERIALIZED VIEW` privileges on the `default` schema in the catalog. \nFor more information, see [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Manage privileges in Unity Catalog\n##### Inheritance model\n\nSecurable objects in Unity Catalog are hierarchical, and privileges are inherited downward. The highest level object that privileges are inherited from is the catalog. This means that granting a privilege on a catalog or schema automatically grants the privilege to all current and future objects within the catalog or schema. 
For example, if you give a user the `SELECT` privilege on a catalog, then that user will be able to select (read) all tables and views in that catalog. Privileges that are granted on a Unity Catalog metastore are not inherited. \n![Unity Catalog object hierarchy](https:\/\/docs.databricks.com\/_images\/object-hierarchy.png) \nOwners of an object are automatically granted all privileges on that object. In addition, object owners can grant privileges on the object itself and on all of its child objects. This means that owners of a schema do not automatically have all privileges on the tables in the schema, but they can grant themselves privileges on the tables in the schema. \nNote \nIf you created your Unity Catalog metastore during the public preview (before August 25, 2022), you might be on an earlier privilege model that doesn\u2019t support the current inheritance model. You can upgrade to Privilege Model version 1.0 to get privilege inheritance. See [Upgrade to privilege inheritance](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/upgrade-privilege-model.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Manage privileges in Unity Catalog\n##### Show, grant, and revoke privileges\n\nYou can manage privileges for metastore objects using SQL commands, the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html), the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html), or Catalog Explorer. \nIn the SQL commands that follow, replace these placeholder values: \n* `<privilege-type>` is a Unity Catalog privilege type. See [Privilege types](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-privileges.html#privilege-types).\n* `<securable-type>`: The type of securable object, such as `CATALOG` or `TABLE`. 
See [Securable objects](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-privileges.html#securable-objects)\n* `<securable-name>`: The name of the securable. If the securable type is `METASTORE`, do not provide the securable name. It is assumed to be the metastore attached to the workspace.\n* `<principal>` is a user, service principal (represented by its applicationId value), or group. You must enclose users, service principals, and group names that include [special characters](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-identifiers.html#delimited-identifiers) in backticks ( `` `` ). See [Principal](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-principal.html). \n### Show grants on objects in a Unity Catalog metastore \n**Permissions required:** Metastore admin, the owner of the object, the owner of the catalog or schema that contains the object. You can also view your own grants. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Select the object, such as a catalog, schema, table, or view.\n3. Go to the **Permissions** tab. \nRun the following SQL command in a notebook or SQL query editor. You can show grants on a specific principal, or you can show all grants on a securable object. \n```\nSHOW GRANTS [principal] ON <securable-type> <securable-name>\n\n``` \nFor example, the following command shows all grants on a schema named *default* in the parent catalog named *main*: \n```\nSHOW GRANTS ON SCHEMA main.default;\n\n``` \nThe command returns: \n```\nprincipal actionType objectType objectKey\n------------- ------------- ---------- ------------\nfinance-team CREATE TABLE SCHEMA main.default\nfinance-team USE SCHEMA SCHEMA main.default\n\n``` \n### Grant permissions on objects in a Unity Catalog metastore \n**Permissions required:** Metastore admin, the owner of the object, or the owner of the catalog or schema that contains the object. \n1. 
In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Select the object, such as a catalog, schema, table, or view.\n3. Go to the **Permissions** tab.\n4. Click **Grant**.\n5. Enter the email address for a user or the name of a group.\n6. Select the permissions to grant.\n7. Click **OK**. \nRun the following SQL command in a notebook or SQL query editor. \n```\nGRANT <privilege-type> ON <securable-type> <securable-name> TO <principal>\n\n``` \nFor example, the following command grants a group named *finance-team* access to create tables in a schema named *default* with the parent catalog named *main*: \n```\nGRANT CREATE TABLE ON SCHEMA main.default TO `finance-team`;\nGRANT USE SCHEMA ON SCHEMA main.default TO `finance-team`;\nGRANT USE CATALOG ON CATALOG main TO `finance-team`;\n\n``` \n### Revoke permissions on objects in a Unity Catalog metastore \n**Permissions required:** Metastore admin, the owner of the object, or the owner of the catalog or schema that contains the object. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Select the object, such as a catalog, schema, table, or view.\n3. Go to the **Permissions** tab.\n4. Select a privilege that has been granted to a user, service principal, or group.\n5. Click **Revoke**.\n6. To confirm, click **Revoke**. \nRun the following SQL command in a notebook or SQL query editor. \n```\nREVOKE <privilege-type> ON <securable-type> <securable-name> FROM <principal>\n\n``` \nFor example, the following command revokes access to create tables in a schema named *default* with the parent catalog named *main* from a group named *finance-team*: \n```\nREVOKE CREATE TABLE ON SCHEMA main.default FROM `finance-team`;\n\n``` \n### Show grants on a metastore \n**Permissions required:** Metastore admin or account admin. You can also view your own grants on a metastore. \n1. 
In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Next to the **Catalog Explorer** page label, click the icon next to the metastore name.\n3. Go to the **Permissions** tab. \nRun the following SQL command in a notebook or SQL query editor. You can show grants on a specific principal, or you can show all grants on a metastore. \n```\nSHOW GRANTS [principal] ON METASTORE\n\n``` \n### Grant permissions on a metastore \n**Permissions required:** Metastore admin or account admin. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Next to the **Catalog Explorer** page label, click the icon next to the metastore name.\n3. On the **Permissions** tab, click **Grant**.\n4. Enter the email address for a user or the name of a group.\n5. Select the permissions to grant.\n6. Click **OK**. \nRun the following SQL command in a notebook or SQL query editor. \n```\nGRANT <privilege-type> ON METASTORE TO <principal>;\n\n``` \nWhen you grant privileges on a metastore, you do not include the metastore name, because the metastore that is attached to your workspace is assumed. \n### Revoke permissions on a metastore \n**Permissions required:** Metastore admin or account admin. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Next to the **Catalog Explorer** page label, click the icon next to the metastore name.\n3. On the **Permissions** tab, select a user or group and click **Revoke**.\n4. To confirm, click **Revoke**. \nRun the following SQL command in a notebook or SQL query editor. 
\n```\nREVOKE <privilege-type> ON METASTORE FROM <principal>;\n\n``` \nWhen you revoke privileges on a metastore, you do not include the metastore name, because the metastore that is attached to your workspace is assumed.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html"} +{"content":"# Get started: Account and workspace setup\n### Run your first ETL workload on Databricks\n\nLearn how to use production-ready tools from Databricks to develop and deploy your first extract, transform, and load (ETL) pipelines for data orchestration. \nBy the end of this article, you will feel comfortable: \n1. [Launching a Databricks all-purpose compute cluster](https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html#cluster).\n2. [Creating a Databricks notebook](https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html#notebook).\n3. [Configuring incremental data ingestion to Delta Lake with Auto Loader](https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html#auto-loader).\n4. [Executing notebook cells to process, query, and preview data](https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html#process).\n5. [Scheduling a notebook as a Databricks job](https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html#schedule). \nThis tutorial uses interactive notebooks to complete common ETL tasks in Python or Scala. \nYou can also use Delta Live Tables to build ETL pipelines. Databricks created Delta Live Tables to reduce the complexity of building, deploying, and maintaining production ETL pipelines. See [Tutorial: Run your first Delta Live Tables pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html). \nYou can also use the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) to create this article\u2019s resources. 
See [Create clusters, notebooks, and jobs with Terraform](https:\/\/docs.databricks.com\/dev-tools\/terraform\/cluster-notebook-job.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html"} +{"content":"# Get started: Account and workspace setup\n### Run your first ETL workload on Databricks\n#### Requirements\n\n* You are logged into a Databricks workspace.\n* You have [permission to create a cluster](https:\/\/docs.databricks.com\/compute\/use-compute.html). \nNote \nIf you do not have cluster control privileges, you can still complete most of the steps below as long as you have [access to a cluster](https:\/\/docs.databricks.com\/compute\/use-compute.html#permissions).\n\n### Run your first ETL workload on Databricks\n#### Step 1: Create a cluster\n\nTo do exploratory data analysis and data engineering, create a cluster to provide the compute resources needed to execute commands. \n1. Click ![compute icon](https:\/\/docs.databricks.com\/_images\/clusters-icon.png) **Compute** in the sidebar.\n2. On the Compute page, click **Create Cluster**. This opens the New Cluster page.\n3. Specify a unique name for the cluster, leave the remaining values in their default state, and click **Create Cluster**. \nTo learn more about Databricks clusters, see [Compute](https:\/\/docs.databricks.com\/compute\/index.html).\n\n### Run your first ETL workload on Databricks\n#### Step 2: Create a Databricks notebook\n\nTo get started writing and executing interactive code on Databricks, create a notebook. \n1. Click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** in the sidebar, then click **Notebook**.\n2. On the Create Notebook page: \n* Specify a unique name for your notebook.\n* Make sure the default language is set to **Python** or **Scala**.\n* Select the cluster you created in step 1 from the **Cluster** dropdown.\n* Click **Create**. \nA notebook opens with an empty cell at the top. 
\nTo learn more about creating and managing notebooks, see [Manage notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html"} +{"content":"# Get started: Account and workspace setup\n### Run your first ETL workload on Databricks\n#### Step 3: Configure Auto Loader to ingest data to Delta Lake\n\nDatabricks recommends using [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html) for incremental data ingestion. Auto Loader automatically detects and processes new files as they arrive in cloud object storage. \nDatabricks recommends storing data with [Delta Lake](https:\/\/docs.databricks.com\/delta\/index.html). Delta Lake is an open source storage layer that provides ACID transactions and enables the data lakehouse. Delta Lake is the default format for tables created in Databricks. \nTo configure Auto Loader to ingest data to a Delta Lake table, copy and paste the following code into the empty cell in your notebook: \n```\n# Import functions\nfrom pyspark.sql.functions import col, current_timestamp\n\n# Define variables used in code below\nfile_path = \"\/databricks-datasets\/structured-streaming\/events\"\nusername = spark.sql(\"SELECT regexp_replace(current_user(), '[^a-zA-Z0-9]', '_')\").first()[0]\ntable_name = f\"{username}_etl_quickstart\"\ncheckpoint_path = f\"\/tmp\/{username}\/_checkpoint\/etl_quickstart\"\n\n# Clear out data from previous demo execution\nspark.sql(f\"DROP TABLE IF EXISTS {table_name}\")\ndbutils.fs.rm(checkpoint_path, True)\n\n# Configure Auto Loader to ingest JSON data to a Delta table\n(spark.readStream\n.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"json\")\n.option(\"cloudFiles.schemaLocation\", checkpoint_path)\n.load(file_path)\n.select(\"*\", col(\"_metadata.file_path\").alias(\"source_file\"), current_timestamp().alias(\"processing_time\"))\n.writeStream\n.option(\"checkpointLocation\", 
checkpoint_path)\n.trigger(availableNow=True)\n.toTable(table_name))\n\n``` \n```\n\/\/ Imports\nimport org.apache.spark.sql.functions.current_timestamp\nimport org.apache.spark.sql.streaming.Trigger\nimport spark.implicits._\n\n\/\/ Define variables used in code below\nval file_path = \"\/databricks-datasets\/structured-streaming\/events\"\nval username = spark.sql(\"SELECT regexp_replace(current_user(), '[^a-zA-Z0-9]', '_')\").first.get(0)\nval table_name = s\"${username}_etl_quickstart\"\nval checkpoint_path = s\"\/tmp\/${username}\/_checkpoint\"\n\n\/\/ Clear out data from previous demo execution\nspark.sql(s\"DROP TABLE IF EXISTS ${table_name}\")\ndbutils.fs.rm(checkpoint_path, true)\n\n\/\/ Configure Auto Loader to ingest JSON data to a Delta table\nspark.readStream\n.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"json\")\n.option(\"cloudFiles.schemaLocation\", checkpoint_path)\n.load(file_path)\n.select($\"*\", $\"_metadata.file_path\".as(\"source_file\"), current_timestamp.as(\"processing_time\"))\n.writeStream\n.option(\"checkpointLocation\", checkpoint_path)\n.trigger(Trigger.AvailableNow)\n.toTable(table_name)\n\n``` \nNote \nThe variables defined in this code should allow you to safely execute it without risk of conflicting with existing workspace assets or other users. Restricted network or storage permissions will raise errors when executing this code; contact your workspace administrator to troubleshoot these restrictions. \nTo learn more about Auto Loader, see [What is Auto Loader?](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html"} +{"content":"# Get started: Account and workspace setup\n### Run your first ETL workload on Databricks\n#### Step 4: Process and interact with data\n\nNotebooks execute logic cell-by-cell. To execute the logic in your cell: \n1. 
To run the cell you completed in the previous step, select the cell and press **SHIFT+ENTER**.\n2. To query the table you\u2019ve just created, copy and paste the following code into an empty cell, then press **SHIFT+ENTER** to run the cell. \n```\ndf = spark.read.table(table_name)\n\n``` \n```\nval df = spark.read.table(table_name)\n\n```\n3. To preview the data in your DataFrame, copy and paste the following code into an empty cell, then press **SHIFT+ENTER** to run the cell. \n```\ndisplay(df)\n\n``` \n```\ndisplay(df)\n\n``` \nTo learn more about interactive options for visualizing data, see [Visualizations in Databricks notebooks](https:\/\/docs.databricks.com\/visualizations\/index.html).\n\n### Run your first ETL workload on Databricks\n#### Step 5: Schedule a job\n\nYou can run Databricks notebooks as production scripts by adding them as a task in a Databricks job. In this step, you will create a new job that you can trigger manually. \nTo schedule your notebook as a task: \n1. Click **Schedule** on the right side of the header bar.\n2. Enter a unique name for the **Job name**.\n3. Click **Manual**.\n4. In the **Cluster** drop-down, select the cluster you created in step 1.\n5. Click **Create**.\n6. In the window that appears, click **Run now**.\n7. To see the job run results, click the ![External Link](https:\/\/docs.databricks.com\/_images\/external-link.png) icon next to the **Last run** timestamp. 
\nFor more information on jobs, see [What is Databricks Jobs?](https:\/\/docs.databricks.com\/workflows\/index.html#what-is-jobs).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html"} +{"content":"# Get started: Account and workspace setup\n### Run your first ETL workload on Databricks\n#### Additional Integrations\n\nLearn more about integrations and tools for data engineering with Databricks: \n* [Connect your favorite IDE](https:\/\/docs.databricks.com\/dev-tools\/index.html)\n* [Use dbt with Databricks](https:\/\/docs.databricks.com\/partners\/prep\/dbt.html)\n* [Learn about the Databricks Command Line Interface (CLI)](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html)\n* [Learn about the Databricks Terraform Provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/etl-quick-start.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Develop Delta Live Tables pipeline code in your local development environment\n\nIn addition to using notebooks or the file editor in your Databricks workspace to implement pipeline code that uses the Delta Live Tables [Python interface](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html), you can also develop your code in your local development environment. For example, you can use your favorite integrated development environment (IDE) such as Visual Studio Code or PyCharm. After writing your pipeline code locally, you can manually move it into your Databricks workspace or use Databricks tools to operationalize your pipeline, including deploying and running the pipeline. \nThis article describes the tools and methods available to develop your Python pipelines locally and deploy those pipelines to your Databricks workspace. 
Links to articles that provide more details on using these tools and methods are also provided.\n\n##### Develop Delta Live Tables pipeline code in your local development environment\n###### Get syntax checking, autocomplete, and type checking in your IDE\n\nDatabricks provides a Python module you can install in your local environment to assist with the development of code for your Delta Live Tables pipelines. This module has the interfaces and docstring references for the Delta Live Tables Python interface, providing syntax checking, autocomplete, and data type checking as you write code in your IDE. \nThis module includes interfaces but no functional implementations. You cannot use this library to create or run a Delta Live Tables pipeline locally. Instead, use one of the methods described below to deploy your code. \nThe Python module for local development is available on PyPI. For installation and usage instructions, see [Python stub for Delta Live Tables](https:\/\/pypi.org\/project\/databricks-dlt\/).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/develop-locally.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Develop Delta Live Tables pipeline code in your local development environment\n###### Validate, deploy, and run your pipeline code with Databricks Asset Bundles\n\nAfter implementing your Delta Live Tables pipeline code, Databricks recommends using Databricks Asset Bundles to operationalize the code. Databricks Asset Bundles provide CI\/CD capabilities to your pipeline development lifecycle, including validation of the pipeline artifacts, packaging of all pipeline artifacts such as source code and configuration, deployment of the code to your Databricks workspace, and starting pipeline updates. 
\nTo learn how to create a bundle to manage your pipeline code using Databricks Asset Bundles, see [Develop Delta Live Tables pipelines with Databricks Asset Bundles](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-bundles.html).\n\n##### Develop Delta Live Tables pipeline code in your local development environment\n###### Develop and sync pipeline code in your IDE\n\nIf you use the Visual Studio Code IDE for development, you can use the [Python module](https:\/\/docs.databricks.com\/delta-live-tables\/develop-locally.html#dlt-module) to develop your code and then use the Databricks extension for Visual Studio Code to sync your code directly from Visual Studio Code to your workspace. See [What is the Databricks extension for Visual Studio Code?](https:\/\/docs.databricks.com\/dev-tools\/vscode-ext\/index.html). \nTo learn how to create a pipeline using the code you synced to your workspace using the Databricks extension for Visual Studio Code, see [Import Python modules from Git folders or workspace files](https:\/\/docs.databricks.com\/delta-live-tables\/import-workspace-files.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/develop-locally.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Develop Delta Live Tables pipeline code in your local development environment\n###### Manually sync your pipeline code to your Databricks workspace\n\nInstead of creating a bundle using Databricks Asset Bundles or using the Databricks extension for Visual Studio Code, you can sync your code to your Databricks workspace and use that code to create a pipeline inside the workspace. This can be particularly useful during development and testing stages when you want to iterate on code quickly. Databricks supports several methods to move code from your local environment to your workspace. 
\nTo learn how to create a pipeline using the code you synced to your workspace using one of the following methods, see [Import Python modules from Git folders or workspace files](https:\/\/docs.databricks.com\/delta-live-tables\/import-workspace-files.html). \n* **Workspace files**: You can use Databricks workspace files to upload your pipeline source code to your Databricks workspace and then import that code into a pipeline. To learn how to use workspace files, see [What are workspace files?](https:\/\/docs.databricks.com\/files\/workspace.html).\n* **Databricks Git folders**: To facilitate collaboration and version control, Databricks recommends using Databricks Git folders to sync code between your local environment and your Databricks workspace. Git folders integrate with your Git provider, allowing you to push code from your local environment and then import that code into a pipeline in your workspace. To learn how to use Databricks Git folders, see [Git integration with Databricks Git folders](https:\/\/docs.databricks.com\/repos\/index.html).\n* **Manually copy your code**: You can copy the code from your local environment, paste the code into a Databricks notebook, and use the Delta Live Tables UI to create a new pipeline with the notebook. To learn how to create a pipeline in the UI, see [Tutorial: Run your first Delta Live Tables pipeline](https:\/\/docs.databricks.com\/delta-live-tables\/tutorial-pipelines.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/develop-locally.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Develop Delta Live Tables pipelines\n##### Develop Delta Live Tables pipeline code in your local development environment\n###### Implement custom CI\/CD workflows\n\nIf you prefer to write scripts to manage your pipelines, Databricks has a REST API, the Databricks command line interface (CLI), and software development kits (SDKs) for popular programming languages. 
You can also use the `databricks_pipeline` Resource in the Databricks Terraform provider. \nTo learn how to use the REST API, see [Delta Live Tables](https:\/\/docs.databricks.com\/api\/workspace\/pipelines) in the Databricks REST API Reference. \nTo learn how to use the Databricks CLI, see [What is the Databricks CLI?](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html). \nTo learn how to use the Databricks Python SDK, see [Databricks SDK for Python](https:\/\/docs.databricks.com\/dev-tools\/sdk-python.html) and the [pipeline examples](https:\/\/github.com\/databricks\/databricks-sdk-py\/tree\/main\/examples\/pipelines) in the project GitHub repository. \nTo learn how to use Databricks SDKs for other languages, see [Use SDKs with Databricks](https:\/\/docs.databricks.com\/dev-tools\/index-sdk.html). \nTo learn how to use the Databricks Terraform provider, see [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and the Terraform documentation for the [databricks\\_pipeline Resource](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/pipeline).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/develop-locally.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Production considerations for Structured Streaming\n\nThis article contains recommendations to configure production incremental processing workloads with Structured Streaming on Databricks to fulfill latency and cost requirements for real-time or batch applications. Understanding key concepts of Structured Streaming on Databricks can help you avoid common pitfalls as you scale up the volume and velocity of data and move from development to production. \nDatabricks has introduced Delta Live Tables to reduce the complexities of managing production infrastructure for Structured Streaming workloads. 
Databricks recommends using Delta Live Tables for new Structured Streaming pipelines; see [What is Delta Live Tables?](https:\/\/docs.databricks.com\/delta-live-tables\/index.html). \nNote \nCompute auto-scaling has limitations scaling down cluster size for Structured Streaming workloads. Databricks recommends using Delta Live Tables with Enhanced Autoscaling for streaming workloads. See [Optimize the cluster utilization of Delta Live Tables pipelines with Enhanced Autoscaling](https:\/\/docs.databricks.com\/delta-live-tables\/auto-scaling.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/production.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Production considerations for Structured Streaming\n##### Using notebooks for Structured Streaming workloads\n\nInteractive development with Databricks notebooks requires you attach your notebooks to a cluster in order to execute queries manually. You can schedule Databricks notebooks for automated deployment and automatic recovery from query failure using [Workflows](https:\/\/docs.databricks.com\/workflows\/jobs\/create-run-jobs.html). \n* [Recover from Structured Streaming query failures with workflows](https:\/\/docs.databricks.com\/structured-streaming\/query-recovery.html)\n* [Monitoring Structured Streaming queries on Databricks](https:\/\/docs.databricks.com\/structured-streaming\/stream-monitoring.html)\n* [Use scheduler pools for multiple streaming workloads](https:\/\/docs.databricks.com\/structured-streaming\/scheduler-pools.html) \nYou can visualize Structured Streaming queries in notebooks during interactive development, or for interactive monitoring of production workloads. You should only visualize a Structured Streaming query in production if a human will regularly monitor the output of the notebook. 
While the `trigger` and `checkpointLocation` parameters are optional, as a best practice Databricks recommends that you *always* specify them in production.\n\n#### Production considerations for Structured Streaming\n##### Controlling batch size and frequency for Structured Streaming on Databricks\n\nStructured Streaming on Databricks has enhanced options for helping to control costs and latency while streaming with Auto Loader and Delta Lake. \n* [Configure Structured Streaming batch size on Databricks](https:\/\/docs.databricks.com\/structured-streaming\/batch-size.html)\n* [Configure Structured Streaming trigger intervals](https:\/\/docs.databricks.com\/structured-streaming\/triggers.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/production.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Production considerations for Structured Streaming\n##### What is stateful streaming?\n\nA *stateful* Structured Streaming query requires incremental updates to intermediate state information, whereas a *stateless* Structured Streaming query only tracks information about which rows have been processed from the source to the sink. \nStateful operations include streaming aggregation, streaming `dropDuplicates`, stream-stream joins, `mapGroupsWithState`, and `flatMapGroupsWithState`. \nThe intermediate state information required for stateful Structured Streaming queries can lead to unexpected latency and production problems if not configured properly. \nIn Databricks Runtime 13.3 LTS and above, you can enable changelog checkpointing with RocksDB to lower checkpoint duration and end-to-end latency for Structured Streaming workloads. Databricks recommends enabling changelog checkpointing for all Structured Streaming stateful queries. See [Enable changelog checkpointing](https:\/\/docs.databricks.com\/structured-streaming\/rocksdb-state-store.html#changelog-checkpoint). 
\n* [Optimize stateful Structured Streaming queries](https:\/\/docs.databricks.com\/structured-streaming\/stateful-streaming.html)\n* [Configure RocksDB state store on Databricks](https:\/\/docs.databricks.com\/structured-streaming\/rocksdb-state-store.html)\n* [Apply watermarks to control data processing thresholds](https:\/\/docs.databricks.com\/structured-streaming\/watermarks.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/production.html"} +{"content":"# Compute\n### Use compute\n\nThis article explains how to connect to compute in your Databricks workspace. You need access to compute to run data engineering, data science, and data analytics workloads.\n\n### Use compute\n#### Who can access or create compute?\n\nThe ability to access or create compute depends on a user\u2019s entitlements. \n* If your workspace is enabled for serverless compute for notebooks (Public Preview), all users in the workspace have access to the serverless compute resource to run interactive workloads in notebooks and workflows. \n* Workspace admins can create any type of compute. They also inherit the CAN MANAGE permission on all compute created in their workspace.\n* Non-admin users with the **Unrestricted cluster creation** entitlement have access to all configuration settings when creating compute. They can access compute they\u2019ve been given permissions to and can create any type of new compute. To learn about available configuration settings, see [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html). Workspace admins can assign this entitlement to any user, group, or service principal. 
See [Manage entitlements](https:\/\/docs.databricks.com\/security\/auth-authz\/entitlements.html).\n* Non-admin users without the **Unrestricted cluster creation** entitlement can only access compute they are granted permissions to or compute they create using policies they are assigned permission to.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/use-compute.html"} +{"content":"# Compute\n### Use compute\n#### Use serverless compute (Public Preview)\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nIf your workspace has been enabled for the serverless compute Public Preview, you will automatically have access to the serverless compute resource in any of your available notebooks. Serverless compute gives you on-demand access to scalable compute in notebooks, letting you immediately write and run your Python or SQL code. \nTo attach to the serverless compute, click the **Connect** drop-down menu in the notebook and select **Serverless**. In new notebooks, the attached compute automatically defaults to serverless upon code execution if no other resource has been selected. \nFor details on enablement, see [Enable serverless compute public preview](https:\/\/docs.databricks.com\/admin\/workspace-settings\/serverless.html).\n\n### Use compute\n#### Use compute configured by another user\n\nIf you don\u2019t have unrestricted cluster creation permissions, you only have access to the compute and compute policies granted to you by your workspace admins. Users can have any of these permissions on a compute: \n* CAN ATTACH TO: Allows you to attach your notebook to compute and view the compute metrics and Spark UI.\n* CAN RESTART: Allows you to start, restart, and terminate compute. Also includes CAN ATTACH TO permissions.\n* CAN MANAGE: Allows you to edit compute details, permissions, and size. Also includes CAN ATTACH TO and CAN RESTART permissions.\n* NO PERMISSIONS: No permissions on the compute. 
\nIf you have permissions to attach to a compute, you can select it from the **Connect** drop-down menu in an open notebook or from the **Compute** drop-down menu when creating a new job. For more information on compute permissions, see [Compute permissions](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/use-compute.html"} +{"content":"# Compute\n### Use compute\n#### Create new compute using a policy\n\nIf you have permission to use a compute policy, you can create your own compute. Policies have minimal configuration options and are designed to be efficient resources using their default settings. If you do wish to edit any settings, you can learn about each setting in the [configuration settings reference](https:\/\/docs.databricks.com\/compute\/configure.html). \n1. Click **New** > **Cluster** in your workspace sidebar.\n2. Select a policy from the **Policy** drop-down menu.\n3. (Optional) Update the name of the compute.\n4. (Optional) Configure any available settings.\n5. Click **Create compute**. \nYou now have a compute resource you can use to run your workloads. \n### Policies \nWorkspace admins can create and manage the compute policies in your workspace. If you don\u2019t have access to a policy that allows you to create the compute you need, reach out to your workspace admin. For more on policies, see [Create and manage compute policies](https:\/\/docs.databricks.com\/admin\/clusters\/policies.html). \nYour workspace might have custom policies or use the Databricks default policies. The default policies include: \n* **Personal Compute:** Allows users to create an individually assigned single-node compute resource with minimal configuration options.\n* **Shared Compute:** Allows users to create larger multi-node resources intended for multiple users to share.\n* **Power User Compute:** Allows users to create larger multi-node resources. 
The policy is intended for single-user workloads that require more compute resources than Personal Compute allows.\n* **Job Compute:** Allows users to create a general-purpose default compute for jobs. \nBy default, all users have access to the Personal Compute policy. If you don\u2019t see the Personal Compute policy, your organization has removed it from your workspace.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/use-compute.html"} +{"content":"# Compute\n### Use compute\n#### Unrestricted compute creation\n\nIf you are a workspace admin or a user with the **Unrestricted cluster creation** entitlement, you can create compute using the **Unrestricted** policy. This gives you access to all compute settings in the **New compute** UI. For a reference of all available settings, see [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/use-compute.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create and manage schemas (databases)\n\nThis article shows how to create and manage schemas (databases) in Unity Catalog. A schema contains tables, views, volumes, models, and functions. You create schemas inside [catalogs](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html).\n\n#### Create and manage schemas (databases)\n##### Requirements\n\n* You must have a Unity Catalog metastore [linked to the workspace](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-metastore.html) where you perform the schema creation.\n* You must have the `USE CATALOG` and `CREATE SCHEMA` [data permissions](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html) on the schema\u2019s parent catalog. Either a metastore admin or the owner of the catalog can grant you these privileges. 
If you are a metastore admin, you can grant these privileges to yourself.\n* The cluster that you use to run a notebook to create a schema must use a Unity Catalog-compliant access mode. See [Access modes](https:\/\/docs.databricks.com\/compute\/configure.html#access-mode). \nSQL warehouses always support Unity Catalog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-schemas.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create and manage schemas (databases)\n##### Create a schema\n\nTo create a schema, you can use Catalog Explorer or SQL commands. \n1. Log in to a workspace that is linked to the metastore.\n2. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. In the **Catalog** pane on the left, click the catalog you want to create the schema in.\n4. In the detail pane, click **Create schema**.\n5. Give the schema a name and add any comment that would help users understand the purpose of the schema.\n6. (Optional) Specify a managed storage location. Requires the `CREATE MANAGED STORAGE` privilege on the target external location. See [Specify a managed storage location in Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/managed-storage.html).\n7. Click **Create**.\n8. Assign permissions for your catalog. See [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html).\n9. Click **Save**. \n1. Run the following SQL commands in a notebook or Databricks SQL editor. Items in brackets are optional. You can use either `SCHEMA` or `DATABASE`. Replace the placeholder values: \n* `<catalog-name>`: The name of the parent catalog for the schema.\n* `<schema-name>`: A name for the schema.\n* `<location-path>`: Optional. Requires additional privileges. 
See [Specify a managed storage location in Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/managed-storage.html).\n* `<comment>`: Optional description or other comment.\n* `<property-key> = <property-value> [ , ... ]`: Optional. Spark SQL properties and values to set for the schema. \nFor parameter descriptions, see [CREATE SCHEMA](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-schema.html). \n```\nUSE CATALOG <catalog-name>;\nCREATE { DATABASE | SCHEMA } [ IF NOT EXISTS ] <schema-name>\n[ MANAGED LOCATION '<location-path>' ]\n[ COMMENT <comment> ]\n[ WITH DBPROPERTIES ( <property-key> = <property-value> [ , ... ] ) ];\n\n``` \nYou can optionally omit the `USE CATALOG` statement and replace `<schema-name>` with `<catalog-name>.<schema-name>`.\n2. Assign privileges to the schema. See [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html). \nYou can also create a schema by using the [Databricks Terraform provider](https:\/\/docs.databricks.com\/dev-tools\/terraform\/index.html) and [databricks\\_schema](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/schema). You can retrieve a list of schema IDs by using [databricks\\_schemas](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/data-sources\/schemas).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-schemas.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Create and manage schemas (databases)\n##### Delete a schema\n\nTo delete (or drop) a schema, you can use Catalog Explorer or a SQL command. To drop a schema you must be its owner. \nYou must delete all tables in the schema before you can delete it. \n1. Log in to a workspace that is linked to the metastore.\n2. 
Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. In the **Catalog** pane, on the left, click the schema that you want to delete.\n4. In the detail pane, click the three-dot menu in the upper right corner and select **Delete**.\n5. On the **Delete schema** dialog, click **Delete**. \nRun the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional. Replace the placeholder `<schema-name>`. \nFor parameter descriptions, see [DROP SCHEMA](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-drop-schema.html). \nIf you use `DROP SCHEMA` without the `CASCADE` option, you must delete all tables in the schema before you can delete it. \n```\nDROP SCHEMA [ IF EXISTS ] <schema-name> [ RESTRICT | CASCADE ]\n\n``` \nFor example, to delete a schema named `inventory_schema` and its tables: \n```\nDROP SCHEMA inventory_schema CASCADE\n\n``` \n### Next steps \nNow you can add tables or volumes to your schema. See [Create tables in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html) and [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-schemas.html"} +{"content":"# Databricks data engineering\n### Optimization recommendations on Databricks\n\nDatabricks provides many optimizations supporting a variety of workloads on the lakehouse, ranging from large-scale ETL processing to ad-hoc, interactive queries. Many of these optimizations take place automatically. You get their benefits simply by using Databricks. Additionally, most Databricks Runtime features require Delta Lake, the default format used to create tables in Databricks. \nDatabricks configures default values that optimize most workloads. 
But, in some cases, changing configuration settings improves performance.\n\n### Optimization recommendations on Databricks\n#### Databricks Runtime performance enhancements\n\nNote \nUse the latest Databricks Runtime to leverage the newest performance enhancements. All behaviors documented here are enabled by default in Databricks Runtime 10.4 LTS and above. \n* [Disk caching](https:\/\/docs.databricks.com\/optimizations\/disk-cache.html) accelerates repeated reads against Parquet data files by loading data to disk volumes attached to compute clusters.\n* [Dynamic file pruning](https:\/\/docs.databricks.com\/optimizations\/dynamic-file-pruning.html) improves query performance by skipping directories that do not contain data files that match query predicates.\n* [Low shuffle merge](https:\/\/docs.databricks.com\/optimizations\/low-shuffle-merge.html) reduces the number of data files rewritten by `MERGE` operations and reduces the need to recalculate `ZORDER` clusters.\n* Apache Spark 3.0 introduced [adaptive query execution](https:\/\/docs.databricks.com\/optimizations\/aqe.html), which provides enhanced performance for many operations.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/index.html"} +{"content":"# Databricks data engineering\n### Optimization recommendations on Databricks\n#### Databricks recommendations for enhanced performance\n\n* You can [clone](https:\/\/docs.databricks.com\/delta\/clone.html) tables on Databricks to make deep or shallow copies of source datasets.\n* The [cost-based optimizer](https:\/\/docs.databricks.com\/optimizations\/cbo.html) accelerates query performance by leveraging table statistics.\n* You can use Spark SQL to interact with [semi-structured JSON data](https:\/\/docs.databricks.com\/optimizations\/semi-structured.html) without parsing strings.\n* [Higher order functions](https:\/\/docs.databricks.com\/optimizations\/higher-order-lambda-functions.html) provide built-in, optimized performance for many 
operations that do not have common Spark operators. Higher order functions provide a performance benefit over user defined functions.\n* Databricks provides a number of built-in operators and special syntax for working with [complex data types](https:\/\/docs.databricks.com\/optimizations\/complex-types.html), including arrays, structs, and JSON strings.\n* You can manually tune settings for range joins. See [Range join optimization](https:\/\/docs.databricks.com\/optimizations\/range-join.html).\n\n### Optimization recommendations on Databricks\n#### Opt-in behaviors\n\n* Databricks provides a write serializable isolation guarantee by default; changing the [isolation level](https:\/\/docs.databricks.com\/optimizations\/isolation-level.html) to serializable can reduce throughput for concurrent operations, but might be necessary when read serializability is required.\n* You can use [bloom filter indexes](https:\/\/docs.databricks.com\/optimizations\/bloom-filters.html) to reduce the likelihood of scanning data files that don\u2019t contain records matching a given condition.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/index.html"} +{"content":"# AI and Machine Learning on Databricks\n## Prepare data and environment for ML and DL\n### Preprocess data for machine learning and deep learning\n##### Feature engineering with scikit-learn\n\nThe example notebook on this page illustrates how to use scikit-learn on Databricks for feature engineering.\n\n##### Feature engineering with scikit-learn\n###### Use scikit-learn with MLflow integration on Databricks\n\nThis notebook shows a complete end-to-end example of loading data, training a model, distributed hyperparameter tuning, and model inference. It also illustrates how to use MLflow and the model registry. 
\nIf your workspace is enabled for Unity Catalog, use this version of the notebook: \n### Use scikit-learn with MLflow integration on Databricks (Unity Catalog) \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-end-to-end-example-uc.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import \nIf your workspace is not enabled for Unity Catalog, use this version of the notebook: \n### Use scikit-learn with MLflow integration on Databricks \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/mlflow\/mlflow-end-to-end-example.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/preprocess-data\/scikit-learn.html"} +{"content":"# Compute\n### What is a SQL warehouse?\n\nA SQL warehouse is a compute resource that lets you query and explore data on Databricks. \nMost users have access to SQL warehouses configured by administrators. \nDatabricks recommends using serverless SQL warehouses when available.\n\n### What is a SQL warehouse?\n#### Use SQL warehouses\n\nThe SQL warehouses you have access to appear in the compute drop-down menus of workspace UIs that support SQL warehouse compute, including the query editor, Catalog Explorer, and dashboards. \nYou can also view, sort, and search available SQL warehouses by clicking ![Endpoints Icon](https:\/\/docs.databricks.com\/_images\/warehouses-icon.png) **SQL Warehouses** in the sidebar. By default, warehouses are sorted by state (running warehouses first), then in alphabetical order. \nThe UI indicates whether or not a warehouse is currently running. Running a query against a stopped warehouse starts it automatically if you have access to the warehouse. See [Start a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html#start). 
\nNote \nTo help you get started, Databricks creates a small SQL warehouse called **Starter Warehouse** automatically. You can edit or delete this SQL warehouse. \nImportant \nYou can also attach a notebook to a SQL warehouse. See [Notebooks and SQL warehouses](https:\/\/docs.databricks.com\/notebooks\/notebook-ui.html#notebook-sql-warehouse) for more information and limitations.\n\n### What is a SQL warehouse?\n#### Start a SQL warehouse\n\nTo manually start a stopped SQL warehouse, click ![Endpoints Icon](https:\/\/docs.databricks.com\/_images\/warehouses-icon.png) **SQL Warehouses** in the sidebar then click the start icon next to the warehouse. \nA SQL warehouse auto-restarts in the following conditions: \n* A warehouse is stopped and you attempt to run a query.\n* A job assigned to a stopped warehouse is scheduled to run.\n* A connection is established to a stopped warehouse from a JDBC\/ODBC interface.\n* A dashboard associated with a dashboard-level warehouse is opened.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html"} +{"content":"# Compute\n### What is a SQL warehouse?\n#### Create a SQL warehouse\n\nConfiguring and launching SQL warehouses requires elevated permissions generally restricted to an administrator. See [SQL warehouse admin settings](https:\/\/docs.databricks.com\/admin\/sql\/index.html) and [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html). \nUnity Catalog governs data access permissions on SQL warehouses for most assets. Administrators configure most data access permissions. SQL warehouses can have custom data access configured instead of or in addition to Unity Catalog. See [Enable data access configuration](https:\/\/docs.databricks.com\/admin\/sql\/data-access-configuration.html). 
\nYou should contact an administrator in the following situations: \n* You cannot connect to any SQL warehouses.\n* You cannot run queries because a SQL warehouse is stopped.\n* You cannot access tables or data from your SQL warehouse. \nNote \nSome organizations might allow users to modify privileges on either database objects or SQL warehouses. Check with your teammates and admins to understand how your organization manages data access.\n\n### What is a SQL warehouse?\n#### Warehouse sizing and autoscaling behavior\n\nFor information on how classic and pro SQL warehouses are sized and how autoscaling works, see [SQL warehouse sizing, scaling, and queuing behavior](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/warehouse-behavior.html).\n\n### What is a SQL warehouse?\n#### SQL warehouses and third party BI tools\n\nDatabricks SQL supports many third party [BI and visualization tools](https:\/\/docs.databricks.com\/integrations\/index.html#bi) that can connect to SQL warehouses, including the following: \n* [Connect Power BI to Databricks](https:\/\/docs.databricks.com\/partners\/bi\/power-bi.html)\n* [Connect Tableau to Databricks](https:\/\/docs.databricks.com\/partners\/bi\/tableau.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html"} +{"content":"# Compute\n### What is a SQL warehouse?\n#### Developer tools for SQL warehouses\n\nYou can use the REST API, CLI, and other drivers and integrations to configure and run commands on SQL warehouses. 
See the following: \n* [Databricks SQL REST API](https:\/\/docs.databricks.com\/api\/workspace\/warehouses)\n* [Databricks SQL CLI](https:\/\/docs.databricks.com\/dev-tools\/databricks-sql-cli.html)\n* [Databricks Driver for SQLTools for Visual Studio Code](https:\/\/docs.databricks.com\/dev-tools\/sqltools-driver.html)\n* [DataGrip integration with Databricks](https:\/\/docs.databricks.com\/dev-tools\/datagrip.html)\n* [DBeaver integration with Databricks](https:\/\/docs.databricks.com\/dev-tools\/dbeaver.html)\n* [Connect to SQL Workbench\/J](https:\/\/docs.databricks.com\/partners\/bi\/workbenchj.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Connect to Azure Data Lake Storage Gen2 and Blob Storage\n\nThis article explains how to connect to Azure Data Lake Storage Gen2 and Blob Storage from Databricks. \nNote \n* The legacy Windows Azure Storage Blob driver (WASB) has been deprecated. ABFS has numerous benefits over WASB. See [Azure documentation on ABFS](https:\/\/learn.microsoft.com\/azure\/storage\/blobs\/data-lake-storage-abfs-driver). For documentation for working with the legacy WASB driver, see [Connect to Azure Blob Storage with WASB (legacy)](https:\/\/docs.databricks.com\/archive\/storage\/wasb-blob.html).\n* Azure has announced the pending retirement of [Azure Data Lake Storage Gen1](https:\/\/learn.microsoft.com\/azure\/data-lake-store\/data-lake-store-overview). Databricks recommends migrating all data from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2. 
If you have not yet migrated, see [Accessing Azure Data Lake Storage Gen1 from Databricks](https:\/\/docs.databricks.com\/archive\/storage\/azure-datalake.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/azure-storage.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Connect to Azure Data Lake Storage Gen2 and Blob Storage\n##### Connect to Azure Data Lake Storage Gen2 or Blob Storage using Azure credentials\n\nThe following credentials can be used to access Azure Data Lake Storage Gen2 or Blob Storage: \n* **OAuth 2.0 with a Microsoft Entra ID service principal**: Databricks recommends using Microsoft Entra ID service principals to connect to Azure storage. To create a Microsoft Entra ID service principal and provide it access to Azure storage accounts, see [Access storage using a service principal & Microsoft Entra ID (Azure Active Directory)](https:\/\/docs.databricks.com\/connect\/storage\/aad-storage-service-principal.html). \nTo create a Microsoft Entra ID service principal, you must have the `Application Administrator` role or the `Application.ReadWrite.All` permission in Microsoft Entra ID (formerly Azure Active Directory). To assign roles on a storage account you must be an Owner or a user with the User Access Administrator Azure RBAC role on the storage account.\n* **Shared access signatures (SAS)**: You can use storage [SAS tokens](https:\/\/learn.microsoft.com\/azure\/storage\/common\/storage-sas-overview) to access Azure storage. With SAS, you can restrict access to a storage account using temporary tokens with fine-grained access control. \nYou can only grant a SAS token permissions that you have on the storage account, container, or file yourself.\n* **Account keys**: You can use [storage account access keys](https:\/\/learn.microsoft.com\/azure\/storage\/common\/storage-account-keys-manage?tabs=azure-portal) to manage access to Azure Storage. 
Storage account access keys provide full access to the configuration of a storage account, as well as the data. Databricks recommends using a Microsoft Entra ID service principal or a SAS token to connect to Azure storage instead of account keys. \nTo view an account\u2019s access keys, you must have the Owner, Contributor, or Storage Account Key Operator Service role on the storage account. \nDatabricks recommends using secret scopes for storing all credentials. You can grant users, service principals, and groups in your workspace access to read the secret scope. This protects the Azure credentials while allowing users to access Azure storage. To create a secret scope, see [Secret scopes](https:\/\/docs.databricks.com\/security\/secrets\/secret-scopes.html). \n### Set Spark properties to configure Azure credentials to access Azure storage \nYou can set Spark properties to configure Azure credentials to access Azure storage. The credentials can be scoped to either a cluster or a notebook. Use both cluster access control and notebook access control together to protect access to Azure storage. See [Compute permissions](https:\/\/docs.databricks.com\/compute\/clusters-manage.html#cluster-level-permissions) and [Collaborate using Databricks notebooks](https:\/\/docs.databricks.com\/notebooks\/notebooks-collaborate.html). 
\nTo set Spark properties, use the following snippet in a cluster\u2019s Spark configuration or a notebook: \nUse the following format to set the cluster Spark configuration: \n```\nspark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth\nspark.hadoop.fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\nspark.hadoop.fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <application-id>\nspark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets\/<secret-scope>\/<service-credential-key>}}\nspark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https:\/\/login.microsoftonline.com\/<directory-id>\/oauth2\/token\n\n``` \nYou can use `spark.conf.set` in notebooks, as shown in the following example: \n```\nservice_credential = dbutils.secrets.get(scope=\"<secret-scope>\",key=\"<service-credential-key>\")\n\nspark.conf.set(\"fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net\", \"OAuth\")\nspark.conf.set(\"fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net\", \"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\")\nspark.conf.set(\"fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net\", \"<application-id>\")\nspark.conf.set(\"fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net\", service_credential)\nspark.conf.set(\"fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net\", \"https:\/\/login.microsoftonline.com\/<directory-id>\/oauth2\/token\")\n\n``` \nReplace \n* `<secret-scope>` with the Databricks secret scope name.\n* `<service-credential-key>` with the name of the key containing the client secret.\n* `<storage-account>` with the name of the Azure storage account.\n* `<application-id>` with the **Application (client) ID** for the Microsoft 
Entra ID application.\n* `<directory-id>` with the **Directory (tenant) ID** for the Microsoft Entra ID application. \nYou can configure SAS tokens for multiple storage accounts in the same Spark session. \n```\nspark.conf.set(\"fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net\", \"SAS\")\nspark.conf.set(\"fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net\", \"org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider\")\nspark.conf.set(\"fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net\", dbutils.secrets.get(scope=\"<scope>\", key=\"<sas-token-key>\"))\n\n``` \nReplace \n* `<storage-account>` with the Azure Storage account name.\n* `<scope>` with the Databricks secret scope name.\n* `<sas-token-key>` with the name of the key containing the Azure storage SAS token. \n```\nspark.conf.set(\n\"fs.azure.account.key.<storage-account>.dfs.core.windows.net\",\ndbutils.secrets.get(scope=\"<scope>\", key=\"<storage-account-access-key>\"))\n\n``` \nReplace \n* `<storage-account>` with the Azure Storage account name.\n* `<scope>` with the Databricks secret scope name.\n* `<storage-account-access-key>` with the name of the key containing the Azure storage account access key. \n### Access Azure storage \nOnce you have properly configured credentials to access your Azure storage container, you can interact with resources in the storage account using URIs. Databricks recommends using the `abfss` driver for greater security. 
\n```\nspark.read.load(\"abfss:\/\/<container-name>@<storage-account-name>.dfs.core.windows.net\/<path-to-data>\")\n\ndbutils.fs.ls(\"abfss:\/\/<container-name>@<storage-account-name>.dfs.core.windows.net\/<path-to-data>\")\n\n``` \n```\nCREATE TABLE <database-name>.<table-name>;\n\nCOPY INTO <database-name>.<table-name>\nFROM 'abfss:\/\/container@storageAccount.dfs.core.windows.net\/path\/to\/folder'\nFILEFORMAT = CSV\nCOPY_OPTIONS ('mergeSchema' = 'true');\n\n``` \n### Example notebook \n#### ADLS Gen2 OAuth 2.0 with Microsoft Entra ID (formerly Azure Active Directory) service principals notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/adls-gen2-service-principal.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/azure-storage.html"} +{"content":"# Connect to data sources\n## Configure access to cloud object storage for Databricks\n#### Connect to Azure Data Lake Storage Gen2 and Blob Storage\n##### Azure Data Lake Storage Gen2 known issues\n\nIf you try accessing a storage container created through the Azure portal, you might receive the following error: \n```\nStatusCode=404\nStatusDescription=The specified filesystem does not exist.\nErrorCode=FilesystemNotFound\nErrorMessage=The specified filesystem does not exist.\n\n``` \nWhen a hierarchical namespace is enabled, you don\u2019t need to create containers through Azure portal. If you see this issue, delete the Blob container through Azure portal. After a few minutes, you can access the container. Alternatively, you can change your `abfss` URI to use a different container, as long as this container is not created through Azure portal. 
\nSee [Known issues with Azure Data Lake Storage Gen2](https:\/\/aka.ms\/adlsgen2knownissues) in the Microsoft documentation.\n\n#### Connect to Azure Data Lake Storage Gen2 and Blob Storage\n##### Deprecated patterns for storing and accessing data from Databricks\n\nThe following are deprecated storage patterns: \n* Databricks no longer recommends mounting external data locations to Databricks Filesystem. See [Mounting cloud object storage on Databricks](https:\/\/docs.databricks.com\/dbfs\/mounts.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/azure-storage.html"} +{"content":"# Get started: Account and workspace setup\n### Tutorial: Create your first table and grant privileges\n\nThis article provides a quick walkthrough of creating a table and granting privileges in Databricks using the Unity Catalog data governance model. As of November 8, 2023, workspaces in new accounts are automatically enabled for Unity Catalog and include the permissions required for all users to complete this tutorial. \nIf you are unsure if your workspace is enabled for Unity Catalog, see [Set up and manage Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html). If you would like to familiarize yourself with Unity Catalog data objects, see [What is Unity Catalog?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/index.html). \nThis article is intended for users but may also be of interest to admins who are newly responsible for workspace management.\n\n### Tutorial: Create your first table and grant privileges\n#### Before you begin\n\nIn order to perform the tasks described in this article, you must have: \n* A Databricks workspace that was enabled for Unity Catalog automatically.\n* Permissions to attach to a compute resource. 
See [Use compute](https:\/\/docs.databricks.com\/compute\/use-compute.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/create-table.html"} +{"content":"# Get started: Account and workspace setup\n### Tutorial: Create your first table and grant privileges\n#### Create your first table\n\nUnity Catalog includes a three-level namespace for data objects: `catalog.schema.table`. In this example, you\u2019ll run a notebook that creates a table named `department` in the workspace catalog and `default` schema (database). \nNote \nThe workspace catalog is the default catalog created with your workspace that all users have access to. It shares a name with your workspace. \nYou can define access to tables declaratively using SQL or the Catalog Explorer UI: \n1. In the sidebar, click **+New** > **Notebook**.\n2. Select `SQL` as your notebook language.\n3. Click **Connect** and attach the notebook to a compute resource.\n4. Add the following commands to the notebook and run them (replace `<workspace-catalog>` with the name of your workspace catalog): \n```\nUSE CATALOG <workspace-catalog>;\n\n``` \n```\nCREATE TABLE IF NOT EXISTS default.department\n(\ndeptcode INT,\ndeptname STRING,\nlocation STRING\n);\n\n``` \n```\nINSERT INTO default.department VALUES\n(10, 'FINANCE', 'EDINBURGH'),\n(20, 'SOFTWARE', 'PADDINGTON');\n\n```\n5. In the sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** and then search for the workspace catalog (`<workspace-name>`) and the `default` schema, where you\u2019ll find your new `department` table. 
\n![Use Catalog Explorer to find a table in workspace catalog](https:\/\/docs.databricks.com\/_images\/table-search-explorer-workspace-catalog.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/create-table.html"} +{"content":"# Get started: Account and workspace setup\n### Tutorial: Create your first table and grant privileges\n#### Manage permissions on your table\n\nAs the original table creator, you\u2019re the table *owner*, and you can grant other users permission to read or write to the table. You can even transfer ownership, but we won\u2019t do that here. For more information about the Unity Catalog privileges and permissions model, see [Manage privileges in Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html). \n### Grant permissions using the UI \nTo give users permissions on your table using the UI: \n1. Click the table name in Catalog Explorer to open the table details page, and go to the **Permissions** tab.\n2. Click **Grant**.\n3. On the **Grant on** dialog: \n1. Select the users and groups you want to give permission to.\n2. Select the privileges you want to grant. For this example, assign the `SELECT` (read) privilege and click **Grant**. \n### Grant permissions using SQL statements \nYou can also grant those permissions using the following SQL statement in a Databricks notebook or the SQL query editor. 
In this example, you give a group called `data-consumers` permissions on your table: \n```\nGRANT SELECT ON default.department TO `data-consumers`;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/create-table.html"} +{"content":"# Get started: Account and workspace setup\n### Tutorial: Create your first table and grant privileges\n#### Next steps\n\nLearn more about: \n* [Create tables](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html)\n* [Creating catalogs](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html)\n* [Creating schemas](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html)\n* [Creating views](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-views.html)\n* [Creating volumes (non-tabular data)](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html)\n* [Managing privileges](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/create-table.html"} +{"content":"# Query data\n## Data format options\n#### MLflow experiment\n\nThe MLflow experiment data source provides a standard API to load MLflow experiment run data.\nYou can load data from the [notebook experiment](https:\/\/docs.databricks.com\/mlflow\/experiments.html),\nor you can use the MLflow experiment name or experiment ID.\n\n#### MLflow experiment\n##### Requirements\n\nDatabricks Runtime 6.0 ML or above.\n\n#### MLflow experiment\n##### Load data from the notebook experiment\n\nTo load data from the notebook experiment, use `load()`. 
\n```\ndf = spark.read.format(\"mlflow-experiment\").load()\ndisplay(df)\n\n``` \n```\nval df = spark.read.format(\"mlflow-experiment\").load()\ndisplay(df)\n\n```\n\n#### MLflow experiment\n##### Load data using experiment IDs\n\nTo load data from one or more workspace experiments, specify the experiment IDs as shown. Separate multiple IDs with commas. \n```\ndf = spark.read.format(\"mlflow-experiment\").load(\"3270527066281272\")\ndisplay(df)\n\n``` \n```\nval df = spark.read.format(\"mlflow-experiment\").load(\"3270527066281272,953590262154175\")\ndisplay(df)\n\n```\n\n#### MLflow experiment\n##### Load data using experiment name\n\nYou can also look up an experiment by name and pass its ID to the `load()` method. \n```\nimport mlflow\n\nexpId = mlflow.get_experiment_by_name(\"\/Shared\/diabetes_experiment\/\").experiment_id\ndf = spark.read.format(\"mlflow-experiment\").load(expId)\ndisplay(df)\n\n``` \n```\nval expId = mlflow.getExperimentByName(\"\/Shared\/diabetes_experiment\/\").get.getExperimentId\nval df = spark.read.format(\"mlflow-experiment\").load(expId)\ndisplay(df)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/mlflow-experiment.html"} +{"content":"# Query data\n## Data format options\n#### MLflow experiment\n##### Filter data based on metrics and parameters\n\nThe examples in this section show how you can filter data after loading it from an experiment. 
\n```\ndf = spark.read.format(\"mlflow-experiment\").load(\"3270527066281272\")\nfiltered_df = df.filter(\"metrics.loss < 0.01 AND params.learning_rate > '0.001'\")\ndisplay(filtered_df)\n\n``` \n```\nval df = spark.read.format(\"mlflow-experiment\").load(\"3270527066281272\")\nval filtered_df = df.filter(\"metrics.loss < 1.85 AND params.num_epochs > '30'\")\ndisplay(filtered_df)\n\n```\n\n#### MLflow experiment\n##### Schema\n\nThe schema of the DataFrame returned by the data source is: \n```\nroot\n|-- run_id: string\n|-- experiment_id: string\n|-- metrics: map\n| |-- key: string\n| |-- value: double\n|-- params: map\n| |-- key: string\n| |-- value: string\n|-- tags: map\n| |-- key: string\n| |-- value: string\n|-- start_time: timestamp\n|-- end_time: timestamp\n|-- status: string\n|-- artifact_uri: string\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/mlflow-experiment.html"} +{"content":"# Query data\n## Data format options\n#### Read Delta Sharing shared tables using Apache Spark DataFrames\n\nThis article provides syntax examples of using Apache Spark to query data shared using [Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html). Use the `deltasharing` keyword as a format option for DataFrame operations.\n\n#### Read Delta Sharing shared tables using Apache Spark DataFrames\n##### Other options for querying shared data\n\nYou can also create queries that use shared table names in Delta Sharing catalogs registered in the metastore, such as those in the following examples: \n```\nSELECT * FROM shared_table_name\n\n``` \n```\nspark.read.table(\"shared_table_name\")\n\n``` \nFor more on configuring Delta Sharing in Databricks and querying data using shared table names, see [Read data shared using Databricks-to-Databricks Delta Sharing (for recipients)](https:\/\/docs.databricks.com\/data-sharing\/read-data-databricks.html). \nYou can use Structured Streaming to process records in shared tables incrementally. 
To use Structured Streaming, you must enable history sharing for the table. See [ALTER SHARE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-share.html). History sharing requires Databricks Runtime 12.2 LTS or above. \nIf the shared table has change data feed enabled on the source Delta table and history enabled on the share, you can use change data feed while reading a Delta share with Structured Streaming or batch operations. See [Use Delta Lake change data feed on Databricks](https:\/\/docs.databricks.com\/delta\/delta-change-data-feed.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/deltasharing.html"} +{"content":"# Query data\n## Data format options\n#### Read Delta Sharing shared tables using Apache Spark DataFrames\n##### Read with the Delta Sharing format keyword\n\nThe `deltasharing` keyword is supported for Apache Spark DataFrame read operations, as shown in the following example: \n```\ndf = (spark.read\n.format(\"deltasharing\")\n.load(\"<profile-path>#<share-name>.<schema-name>.<table-name>\")\n)\n\n```\n\n#### Read Delta Sharing shared tables using Apache Spark DataFrames\n##### Read change data feed for Delta Sharing shared tables\n\nFor tables that have history shared and change data feed enabled, you can read change data feed records using Apache Spark DataFrames. History sharing requires Databricks Runtime 12.2 LTS or above. 
\n```\ndf = (spark.read\n.format(\"deltasharing\")\n.option(\"readChangeFeed\", \"true\")\n.option(\"startingTimestamp\", \"2021-04-21 05:45:46\")\n.option(\"endingTimestamp\", \"2021-05-21 12:00:00\")\n.load(\"<profile-path>#<share-name>.<schema-name>.<table-name>\")\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/deltasharing.html"} +{"content":"# Query data\n## Data format options\n#### Read Delta Sharing shared tables using Apache Spark DataFrames\n##### Read Delta Sharing shared tables using Structured Streaming\n\nFor tables that have history shared, you can use the shared table as a source for Structured Streaming. History sharing requires Databricks Runtime 12.2 LTS or above. \n```\nstreaming_df = (spark.readStream\n.format(\"deltasharing\")\n.load(\"<profile-path>#<share-name>.<schema-name>.<table-name>\")\n)\n\n# If CDF is enabled on the source table\nstreaming_cdf_df = (spark.readStream\n.format(\"deltasharing\")\n.option(\"readChangeFeed\", \"true\")\n.option(\"startingTimestamp\", \"2021-04-21 05:45:46\")\n.load(\"<profile-path>#<share-name>.<schema-name>.<table-name>\")\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query\/formats\/deltasharing.html"} +{"content":"# \n### DatabricksIQ-powered features\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis page provides information about the DatabricksIQ-powered features that can make your work in Databricks more efficient. These features include Databricks Assistant for help with coding and with creating dashboards, automatically generated table documentation in Catalog Explorer, and help in the workspace.\n\n### DatabricksIQ-powered features\n#### What is DatabricksIQ?\n\nDatabricksIQ is the data intelligence engine powering the Databricks Platform. 
It is a compound AI system that combines the use of AI models, retrieval, ranking, and personalization systems to understand the semantics of your organization\u2019s data and usage patterns. \nDatabricksIQ has no end-user UI, but it enables existing product experiences, such as Databricks Assistant, [AI-generated comments](https:\/\/docs.databricks.com\/catalog-explorer\/ai-comments.html), and [intelligent search](https:\/\/docs.databricks.com\/search\/index.html#intelligent-search), to be more accurate and provide more relevant results. \nThese DatabricksIQ-powered features enable everyone in an organization to be more productive using data and AI, while maintaining the governance and controls established in Unity Catalog.\n\n### DatabricksIQ-powered features\n#### DatabricksIQ features: trust and safety\n\nDatabricks understands the importance of your data and the trust you place in us when you use Databricks services. Databricks is committed to the highest standards of data protection and has implemented rigorous measures to ensure your information is protected. For more details, see [DatabricksIQ trust and safety](https:\/\/docs.databricks.com\/databricksiq\/databricksiq-trust.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/databricksiq\/index.html"} +{"content":"# \n### DatabricksIQ-powered features\n#### How do I enable or disable Databricks Assistant?\n\nDatabricks Assistant is enabled by default. An admin can disable or enable Databricks Assistant for all workspaces in an account. If an admin has permitted workspace setting overrides, workspace admins can enable or disable Databricks Assistant for specific workspaces. 
For more details, see [Enable or disable Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html#enable-or-disable).\n\n### DatabricksIQ-powered features\n#### Use Databricks Assistant to develop code\n\nDatabricks Assistant works as an AI-based companion pair-programmer to make you more efficient as you create notebooks, queries, and files. It provides inline code suggestions as you type and can help you rapidly answer questions by generating, optimizing, completing, explaining, and fixing code and queries. Databricks Assistant is available in notebooks, the SQL editor, and when creating dashboards. For details, see [Use Databricks Assistant](https:\/\/docs.databricks.com\/notebooks\/use-databricks-assistant.html) and [What is Databricks Assistant?](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html).\n\n### DatabricksIQ-powered features\n#### Create dashboards with Databricks Assistant\n\nWhen drafting a dashboard, you can use natural-language prompts to build the charts you want to see. For details, see [Create visualizations with Databricks Assistant](https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-w-db-assistant.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/databricksiq\/index.html"} +{"content":"# \n### DatabricksIQ-powered features\n#### AI-generated table comments in Catalog Explorer\n\nFor tables in Catalog Explorer, Databricks automatically generates comments that describe the table based on the table metadata. You do not have to do anything to generate this documentation. You can edit the comment, accept it as-is, or delete it. For details, see [Add comments to a table](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#comments).\n\n### DatabricksIQ-powered features\n#### Use Databricks Assistant for help\n\nThe help assistant is always available to you in the Databricks workspace. 
You can type in a question, and it generates answers based on Databricks documentation and knowledge base articles. For details, see [Get help in the Databricks workspace](https:\/\/docs.databricks.com\/workspace\/index.html#get-help).\n\n","doc_uri":"https:\/\/docs.databricks.com\/databricksiq\/index.html"} +{"content":"# Connect to data sources\n### Configure access to cloud object storage for Databricks\n\nNote \nThe articles in this section describe legacy patterns for configuring access to cloud object storage. Databricks recommends using Unity Catalog to manage connections to storage. See [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html). \nDatabricks uses cloud object storage to store data files and tables. During workspace deployment, Databricks configures a cloud object storage location known as the [DBFS root](https:\/\/docs.databricks.com\/dbfs\/dbfs-root.html). You can configure connections to other cloud object storage locations in your account. \nThe following articles describe configuration options when you are not using Unity Catalog. 
\n* [Connect to Amazon S3](https:\/\/docs.databricks.com\/connect\/storage\/amazon-s3.html)\n* [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html)\n* [Connect to Google Cloud Storage](https:\/\/docs.databricks.com\/connect\/storage\/gcs.html)\n* [Connect to Azure Data Lake Storage Gen2 and Blob Storage](https:\/\/docs.databricks.com\/connect\/storage\/azure-storage.html)\n* [Access storage using a service principal & Microsoft Entra ID(Azure Active Directory)](https:\/\/docs.databricks.com\/connect\/storage\/aad-storage-service-principal.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/storage\/index.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Work with Unity Catalog and the legacy Hive metastore\n\nThis article explains how to use the per-workspace Hive metastore when your Databricks workspace is enabled for Unity Catalog. \nIf your workspace was in service before it was enabled for Unity Catalog, it likely has a Hive metastore that contains data that you want to continue to use. Databricks recommends that you [migrate the tables managed by the Hive metastore to the Unity Catalog metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html), but if you choose not to, this article explains how to work with data managed by both metastores.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/hive-metastore.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Work with Unity Catalog and the legacy Hive metastore\n##### Query the Hive metastore in Unity Catalog\n\nThe Unity Catalog metastore is additive, meaning it can be used with the per-workspace Hive metastore in Databricks. The Hive metastore appears as a top-level catalog called `hive_metastore` in the three-level namespace. 
\nFor example, you can refer to a table called `sales_raw` in the `sales` schema in the legacy Hive metastore by using the following notation: \n```\nSELECT * FROM hive_metastore.sales.sales_raw;\n\n``` \n```\ndisplay(spark.table(\"hive_metastore.sales.sales_raw\"))\n\n``` \n```\nlibrary(SparkR)\n\ndisplay(tableToDF(\"hive_metastore.sales.sales_raw\"))\n\n``` \n```\ndisplay(spark.table(\"hive_metastore.sales.sales_raw\"))\n\n``` \nYou can also specify the catalog and schema with a `USE` statement: \n```\nUSE hive_metastore.sales;\nSELECT * FROM sales_raw;\n\n``` \n```\nspark.sql(\"USE hive_metastore.sales\")\ndisplay(spark.table(\"sales_raw\"))\n\n``` \n```\nlibrary(SparkR)\n\nsql(\"USE hive_metastore.sales\")\ndisplay(tableToDF(\"sales_raw\"))\n\n``` \n```\nspark.sql(\"USE hive_metastore.sales\")\ndisplay(spark.table(\"sales_raw\"))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/hive-metastore.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Work with Unity Catalog and the legacy Hive metastore\n##### Access control in Unity Catalog and the Hive metastore\n\nIf you configured [table access control](https:\/\/docs.databricks.com\/data-governance\/table-acls\/index.html) on the Hive metastore, Databricks continues to enforce those access controls for data in the `hive_metastore` catalog for clusters running in the shared access mode. The Unity Catalog access model differs slightly from legacy access controls; for example, it has no `DENY` statements. The Hive metastore is a workspace-level object. Permissions defined within the `hive_metastore` catalog always refer to the local users and groups in the workspace. 
See [Differences from table access control](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/hive-metastore.html#differences-from-table-access-control).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/hive-metastore.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Work with Unity Catalog and the legacy Hive metastore\n##### Differences from table access control\n\nUnity Catalog has the following key differences from using [table access controls](https:\/\/docs.databricks.com\/data-governance\/table-acls\/index.html) in the legacy Hive metastore in each workspace. \nThe access control model in Unity Catalog has the following differences from table access control: \n* **Account groups**: Access control policies in Unity Catalog are applied to account groups, while access control policies for the Hive metastore are applied to workspace-local groups. See [Difference between account groups and workspace-local groups](https:\/\/docs.databricks.com\/admin\/users-groups\/groups.html#account-vs-workspace-group).\n* **`USE CATALOG` and `USE SCHEMA` permissions are required on the catalog and schema for all operations on objects inside the catalog or schema**: Regardless of a principal\u2019s privileges on a table, the principal must also have the `USE CATALOG` privilege on its parent catalog to access the schema and the `USE SCHEMA` privilege to access objects within the schema. With workspace-level table access controls, on the other hand, granting `USAGE` on the root catalog automatically grants `USAGE` on all databases, but `USAGE` on the root catalog is not required.\n* **Views**: In Unity Catalog, the owner of a view does not need to be an owner of the view\u2019s referenced tables and views. Having the `SELECT` privilege is sufficient, along with `USE SCHEMA` on the views\u2019 parent schema and `USE CATALOG` on the parent catalog. 
With workspace-level table access controls, a view\u2019s owner needs to be an owner of all referenced tables and views.\n* **No support for `ANY FILE` or `ANONYMOUS FUNCTION`**: In Unity Catalog, there is no concept of an `ANY FILE` or `ANONYMOUS FUNCTION` securable that might allow an unprivileged user to run privileged code.\n* **No `READ_METADATA` privilege**: Unity Catalog manages access to view metadata in a different way. See [Unity Catalog privileges and securable objects](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/hive-metastore.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Work with Unity Catalog and the legacy Hive metastore\n##### Joins between Unity Catalog and Hive metastore objects\n\nBy using three-level namespace notation, you can join data in a Unity Catalog metastore with data in the legacy Hive metastore. \nNote \nA join with data in the legacy Hive metastore will only work on the workspace where that data resides. Trying to run such a join in another workspace results in an error. Databricks recommends that you [upgrade](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html) legacy tables and views to Unity Catalog. \nThe following example joins results from the `sales_current` table in the legacy Hive metastore with the `sales_historical` table in the Unity Catalog metastore when the `order_id` fields are equal. 
\n```\nSELECT * FROM hive_metastore.sales.sales_current\nJOIN main.shared_sales.sales_historical\nON hive_metastore.sales.sales_current.order_id = main.shared_sales.sales_historical.order_id;\n\n``` \n```\ndfCurrent = spark.table(\"hive_metastore.sales.sales_current\")\ndfHistorical = spark.table(\"main.shared_sales.sales_historical\")\n\ndisplay(dfCurrent.join(\nother = dfHistorical,\non = dfCurrent.order_id == dfHistorical.order_id\n))\n\n``` \n```\nlibrary(SparkR)\n\ndfCurrent = tableToDF(\"hive_metastore.sales.sales_current\")\ndfHistorical = tableToDF(\"main.shared_sales.sales_historical\")\n\ndisplay(join(\nx = dfCurrent,\ny = dfHistorical,\njoinExpr = dfCurrent$order_id == dfHistorical$order_id))\n\n``` \n```\nval dfCurrent = spark.table(\"hive_metastore.sales.sales_current\")\nval dfHistorical = spark.table(\"main.shared_sales.sales_historical\")\n\ndisplay(dfCurrent.join(\nright = dfHistorical,\njoinExprs = dfCurrent(\"order_id\") === dfHistorical(\"order_id\")\n))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/hive-metastore.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Work with Unity Catalog and the legacy Hive metastore\n##### Default catalog\n\nA default catalog is configured for each workspace that is enabled for Unity Catalog. \nIf you omit the top-level catalog name when you perform data operations, the default catalog is assumed. \nThe default catalog that was initially configured for your workspace depends on how your workspace was enabled for Unity Catalog: \n* If your workspace was enabled for Unity Catalog automatically, the *workspace catalog* was set as the default catalog. See [Automatic enablement of Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/get-started.html#enablement).\n* If your workspace was enabled for Unity Catalog manually, the `hive_metastore` catalog was set as the default catalog. 
\nIf you are transitioning from the Hive metastore to Unity Catalog within an existing workspace, it typically makes sense to use `hive_metastore` as the default catalog to avoid impacting existing code that references the Hive metastore. \nTo learn how to get and switch the default catalog, see [Manage the default catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#default).\n\n#### Work with Unity Catalog and the legacy Hive metastore\n##### Cluster-scoped data access permissions\n\nWhen you use the Hive metastore alongside Unity Catalog, data access credentials associated with the cluster are used to access Hive metastore data but not data registered in Unity Catalog. \nIf users access paths that are outside Unity Catalog (such as a path not registered as a table or external location), the access credentials assigned to the cluster are used. \nSee [Connect to Amazon S3](https:\/\/docs.databricks.com\/connect\/storage\/amazon-s3.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/hive-metastore.html"} +{"content":"# Data governance with Unity Catalog\n## What is Unity Catalog?\n#### Work with Unity Catalog and the legacy Hive metastore\n##### Upgrade legacy tables to Unity Catalog\n\nTables in the Hive metastore do not benefit from the full set of security and governance features that Unity Catalog introduces, such as built-in auditing and access control. 
Databricks recommends that you [upgrade your legacy tables](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html) by adding them to Unity Catalog.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/hive-metastore.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Dashboard tutorials\n\nFollow along with tutorials designed to teach you how to build and manage dashboards.\n\n#### Dashboard tutorials\n##### Get started\n\nIf you\u2019re new to working with dashboards on Databricks, use the following tutorials to familiarize yourself with some of the available tools and features. \n| Tutorial | Description |\n| --- | --- |\n| [Create a dashboard](https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-dashboard.html) | Create your first dashboard using a sample dataset. |\n| [Create visualizations with Databricks Assistant](https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-w-db-assistant.html) | Use natural language prompts to generate visualizations on the dashboard canvas. |\n\n#### Dashboard tutorials\n##### Apply filters\n\nFilters allow viewers to interact with your dashboards and explore data. You can set up filters to restrict dataset results based on field values or to insert parameter values into a dataset query at runtime. \n| Tutorial | Description |\n| --- | --- |\n| [Use query-based parameters](https:\/\/docs.databricks.com\/dashboards\/tutorials\/query-based-params.html) | Set up query-based parameters. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/index.html"} +{"content":"# What is data warehousing on Databricks?\n## Dashboards\n#### Dashboard tutorials\n##### Use Databricks APIs to manage dashboards\n\nDashboards are data objects that you can manage using Databricks REST APIs. You can use the Workspace API to work with them as generic workspace objects or use Lakeview API tools, which are specifically designed for dashboard management. 
\nThe tutorials listed in this section are designed to help you manage dashboards throughout their lifecycle, create new Lakeview dashboards from existing legacy dashboards, and move dashboards across different workspaces. \n| Tutorial | Description |\n| --- | --- |\n| [Manage dashboards with Workspace APIs](https:\/\/docs.databricks.com\/dashboards\/tutorials\/workspace-lakeview-api.html) | If you\u2019re already using the [Workspace](https:\/\/docs.databricks.com\/api\/workspace\/workspace) API to manage workspace objects like notebooks, you can continue to use it for many dashboard management operations. This tutorial includes examples that demonstrate how to use the Workspace and Lakeview APIs to manage dashboards. |\n| [Use the Lakeview APIs to create and manage dashboards](https:\/\/docs.databricks.com\/dashboards\/tutorials\/lakeview-crud-api.html) | The [Lakeview](https:\/\/docs.databricks.com\/api\/workspace\/lakeview) API is designed to work with dashboard objects. Certain functionality in the Lakeview API mirrors available tools in the Workspace API but includes additional functionality specifically for dashboard management. This tutorial includes examples that demonstrate how to use Lakeview APIs to manage dashboards. |\n| [Manage dashboard permissions using the Workspace API](https:\/\/docs.databricks.com\/dashboards\/tutorials\/manage-permissions.html) | Manage dashboard permissions using the [Workspace](https:\/\/docs.databricks.com\/api\/workspace\/workspace) API. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/dashboards\/tutorials\/index.html"} +{"content":"# What is Delta Lake?\n### Type widening\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html) in Databricks Runtime 15.2 and above. \nTables with type widening enabled allow you to change column data types to a wider type without rewriting underlying data files. 
You can either change column types manually or use schema evolution to evolve column types. \nType widening requires Delta Lake. All Unity Catalog managed tables use Delta Lake by default.\n\n### Type widening\n#### Supported type changes\n\nYou can widen types according to the following rules: \n| Source type | Supported wider types |\n| --- | --- |\n| `byte` | `short`, `int`, `long`, `decimal`, `double` |\n| `short` | `int`, `long`, `decimal`, `double` |\n| `int` | `long`, `decimal`, `double` |\n| `long` | `decimal` |\n| `float` | `double` |\n| `decimal` | `decimal` with greater precision and scale |\n| `date` | `timestampNTZ` | \nTo avoid accidental promotion of integer values to decimals, you must manually commit type changes from `byte`, `short`, `int`, or `long` to `decimal` or `double`. \nNote \nWhen changing any numeric type to `decimal`, the total precision must be equal to or greater than the starting precision. If you also increase the scale, the total precision must increase by a corresponding amount. \nThe minimum target for `byte`, `short`, and `int` types is `decimal(10,0)`. The minimum target for `long` is `decimal(20,0)`. \nIf you want to add two decimal places to a field with `decimal(10,1)`, the minimum target is `decimal(12,3)`. 
\nType changes are supported for top-level columns and fields nested inside structs, maps, and arrays.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/type-widening.html"} +{"content":"# What is Delta Lake?\n### Type widening\n#### Enable type widening\n\nYou can enable type widening on an existing table by setting the `delta.enableTypeWidening` table property to `true`: \n```\nALTER TABLE <table_name> SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true')\n\n``` \nYou can also enable type widening during table creation: \n```\nCREATE TABLE T(c1 INT) USING DELTA TBLPROPERTIES('delta.enableTypeWidening' = 'true')\n\n``` \nImportant \nEnabling type widening sets the table feature `typeWidening-preview`, which upgrades the reader and writer protocols. You must use Databricks Runtime 15.2 or above to interact with tables with type widening enabled. If external clients also interact with the table, verify that they support this table feature. See [How does Databricks manage Delta Lake feature compatibility?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html).\n\n### Type widening\n#### Manually apply a type change\n\nUse the `ALTER COLUMN` command to manually change types: \n```\nALTER TABLE <table_name> ALTER COLUMN <col_name> TYPE <new_type>\n\n``` \nThis operation updates the table schema without rewriting the underlying data files.\n\n### Type widening\n#### Widen types with automatic schema evolution\n\nSchema evolution works with type widening to update data types in target tables to match the type of incoming data. \nNote \nWithout type widening enabled, schema evolution always attempts to safely downcast data to match column types in the target table. If you don\u2019t want to automatically widen data types in your target tables, disable type widening before you run workloads with schema evolution enabled. 
\nTo use schema evolution to widen the data type of a column, you must meet the following conditions: \n* The command uses `INSERT` or `MERGE INTO`.\n* The command runs with automatic schema evolution enabled.\n* The target table has type widening enabled.\n* The source column type is wider than the target column type.\n* Type widening supports the type change. \nType mismatches that don\u2019t meet all of these conditions follow normal schema enforcement rules. See [Schema enforcement](https:\/\/docs.databricks.com\/tables\/schema-enforcement.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/type-widening.html"} +{"content":"# What is Delta Lake?\n### Type widening\n#### Disable the type widening table feature\n\nYou can prevent accidental type widening on enabled tables by setting the property to `false`: \n```\nALTER TABLE <table_name> SET TBLPROPERTIES ('delta.enableTypeWidening' = 'false')\n\n``` \nThis setting prevents future type changes to the table, but doesn\u2019t remove the type widening table feature or undo types that have changed. \nIf you need to completely remove the type widening table features, you can use the `DROP FEATURE` command as shown in the following example: \n```\nALTER TABLE <table-name> DROP FEATURE 'typeWidening-preview' [TRUNCATE HISTORY]\n\n``` \nWhen dropping type widening, all data files that don\u2019t conform to the current table schema are rewritten. See [Drop Delta table features](https:\/\/docs.databricks.com\/delta\/drop-feature.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/type-widening.html"} +{"content":"# Get started: Account and workspace setup\n## Build an end-to-end data pipeline in Databricks\n#### Explore the source data for a data pipeline\n\nA common first step in creating a data pipeline is understanding the source data for the pipeline. 
In this step, you will run [Databricks Utilities](https:\/\/docs.databricks.com\/dev-tools\/databricks-utils.html) and PySpark commands in a notebook to examine the source data and artifacts. \nTo learn more about exploratory data analysis, see [Exploratory data analysis on Databricks: Tools and techniques](https:\/\/docs.databricks.com\/exploratory-data-analysis\/index.html).\n\n#### Explore the source data for a data pipeline\n##### Video: Introduction to Databricks notebooks\n\nFor an introduction to Databricks notebooks, watch this video:\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/data-pipeline-explore-data.html"} +{"content":"# Get started: Account and workspace setup\n## Build an end-to-end data pipeline in Databricks\n#### Explore the source data for a data pipeline\n##### Create a data exploration notebook\n\n1. In the sidebar, click ![New Icon](https:\/\/docs.databricks.com\/_images\/create-icon.png) **New** and select **Notebook** from the menu. The notebook opens with a default name that you can replace.\n2. Enter a name for the notebook, for example, `Explore songs data`. By default: \n* **Python** is the selected language.\n* The notebook is attached to the last cluster you used. In this case, the cluster you created in [Step 1: Create a cluster](https:\/\/docs.databricks.com\/getting-started\/data-pipeline-get-started.html#create-a-cluster).\n3. To view the contents of the directory containing the dataset, enter the following in the first cell of the notebook, click ![Run Menu](https:\/\/docs.databricks.com\/_images\/run-menu.png), and select **Run Cell**. 
\n```\n%fs ls \"\/databricks-datasets\/songs\"\n\n``` \n| | path | name | size | modificationTime |\n| --- | --- | --- | --- | --- |\n| 1 | dbfs:\/databricks-datasets\/songs\/README.md | README.md | 1719 | 1454620183000 |\n| 2 | dbfs:\/databricks-datasets\/songs\/data-001\/ | data-001\/ | 0 | 1672791237846 |\n| 3 | dbfs:\/databricks-datasets\/songs\/data-002\/ | data-002\/ | 0 | 1672791237846 |\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/data-pipeline-explore-data.html"} +{"content":"# Get started: Account and workspace setup\n## Build an end-to-end data pipeline in Databricks\n#### Explore the source data for a data pipeline\n##### Explore the data\n\n1. The README file has information about the dataset, including a description of the data schema. The schema information is used in the next step when ingesting the data. To view the contents of the README, click ![Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret.png) in the cell actions menu, select **Add Cell Below**, enter the following in the new cell, click ![Run Menu](https:\/\/docs.databricks.com\/_images\/run-menu.png), and select **Run Cell**. \n```\n%fs head --maxBytes=10000 \"\/databricks-datasets\/songs\/README.md\"\n\n``` \n```\nSample of Million Song Dataset\n===============================\n\n## Source\nThis data is a small subset of the [Million Song Dataset](http:\/\/labrosa.ee.columbia.edu\/millionsong\/).\nThe original data was contributed by The Echo Nest.\nPrepared by T. 
Bertin-Mahieux <tb2332 '@' columbia.edu>\n\n## Attribute Information\n- artist_id:string\n- artist_latitude:double\n- artist_longitude:double\n- artist_location:string\n- artist_name:string\n- duration:double\n- end_of_fade_in:double\n- key:int\n- key_confidence:double\n- loudness:double\n- release:string\n- song_hotnes:double\n- song_id:string\n- start_of_fade_out:double\n- tempo:double\n- time_signature:double\n- time_signature_confidence:double\n- title:string\n- year:double\n- partial_sequence:int\n...\n\n```\n2. The records used in this example are in the `\/databricks-datasets\/songs\/data-001\/` directory. To view the contents of this directory, click ![Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret.png) in the cell actions menu, select **Add Cell Below**, enter the following in the new cell, click ![Run Menu](https:\/\/docs.databricks.com\/_images\/run-menu.png), and select **Run Cell**. \n```\n%fs ls \"\/databricks-datasets\/songs\/data-001\"\n\n``` \n| | path | name | size | modificationTime |\n| --- | --- | --- | --- | --- |\n| 1 | dbfs:\/databricks-datasets\/songs\/data-001\/header.txt | header.txt | 377 | 1454633901000 |\n| 2 | dbfs:\/databricks-datasets\/songs\/data-001\/part-00000 | part-00000 | 52837 | 1454547464000 |\n| 3 | dbfs:\/databricks-datasets\/songs\/data-001\/part-00001 | part-00001 | 52469 | 1454547465000 |\n3. Because the README and file names do not indicate the file format, you can view a sample of the records to better understand the contents and format of each record. To read and display the first ten records from one of the data files, click ![Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret.png) in the cell actions menu, select **Add Cell Below**, enter the following in the new cell, click ![Run Menu](https:\/\/docs.databricks.com\/_images\/run-menu.png), and select **Run Cell**. 
\n```\n%fs head --maxBytes=10000 \"\/databricks-datasets\/songs\/data-001\/part-00000\"\n\n``` \n```\nAR81V6H1187FB48872 nan nan Earl Sixteen 213.7073 0.0 11 0.419 -12.106 Soldier of Jah Army nan SOVNZSZ12AB018A9B8 208.289 125.882 1 0.0 Rastaman 2003 --\nARVVZQP11E2835DBCB nan nan Wavves 133.25016 0.0 0 0.282 0.596 Wavvves 0.471578247701 SOJTQHQ12A8C143C5F 128.116 89.519 1 0.0 I Want To See You (And Go To The Movies) 2009 --\nARFG9M11187FB3BBCB nan nan Nashua USA C-Side 247.32689 0.0 9 0.612 -4.896 Santa Festival Compilation 2008 vol.1 nan SOAJSQL12AB0180501 242.196 171.278 5 1.0 Loose on the Dancefloor 0 225261\n...\n\n``` \nYou can observe a few things about the data from viewing a sample of the records. You\u2019ll use these observations later when processing the data: \n* The records do not contain a header. Instead, the header is stored in a separate file in the same directory. \n+ The files appear to be in tab-separated value (TSV) format.\n+ Some fields are missing or invalid.\n4. To further explore and analyze the data, use these observations to load the TSV formatted song data into a [PySpark DataFrame](https:\/\/docs.databricks.com\/getting-started\/dataframes.html). To do this, click ![Down Caret](https:\/\/docs.databricks.com\/_images\/down-caret.png) in the cell actions menu, select **Add Cell Below**, enter the following code in the new cell, and then click ![Run Menu](https:\/\/docs.databricks.com\/_images\/run-menu.png) > **Run Cell**. \n```\ndf = spark.read.format('csv').option(\"sep\", \"\\t\").load('dbfs:\/databricks-datasets\/songs\/data-001\/part-00000')\ndf.display()\n\n``` \nBecause the data file is missing a header, the column names display as `_c0`, `_c1`, and so on. Each column is interpreted as a `string` regardless of the actual data type. 
The ingestion of the raw data in the [next step](https:\/\/docs.databricks.com\/getting-started\/data-pipeline-get-started.html#ingest-prepare-data) shows an example of how you can impose a valid schema when you load the data. \n![DataFrame created from raw songs data](https:\/\/docs.databricks.com\/_images\/songs-dataframe.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/getting-started\/data-pipeline-explore-data.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Predictive optimization for Delta Lake\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nPredictive optimization removes the need to manually manage maintenance operations for Delta tables on Databricks. \nWith predictive optimization enabled, Databricks automatically identifies tables that would benefit from maintenance operations and runs them for the user. Maintenance operations are only run as necessary, eliminating both unnecessary runs for maintenance operations and the burden associated with tracking and troubleshooting performance.\n\n#### Predictive optimization for Delta Lake\n##### What operations does predictive optimization run?\n\nPredictive optimization runs the following operations automatically for enabled Delta tables: \n| Operation | Description |\n| --- | --- |\n| `OPTIMIZE` | Improves query performance by optimizing file sizes. See [Compact data files with optimize on Delta Lake](https:\/\/docs.databricks.com\/delta\/optimize.html). |\n| `VACUUM` | Reduces storage costs by deleting data files no longer referenced by the table. See [Remove unused data files with vacuum](https:\/\/docs.databricks.com\/delta\/vacuum.html). | \nNote \n`OPTIMIZE` does not run `ZORDER` when executed with predictive optimization. \nWarning \nThe retention window for the `VACUUM` command is determined by the `delta.deletedFileRetentionDuration` table property, which defaults to 7 days. 
This means `VACUUM` removes data files that are no longer referenced by a Delta table version in the last 7 days. If you\u2019d like to retain data for longer (such as to support time travel for longer durations), you must set this table property appropriately before you enable predictive optimization, as in the following example: \n```\nALTER TABLE table_name SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = '30 days');\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/predictive-optimization.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Predictive optimization for Delta Lake\n##### Where does predictive optimization run?\n\nPredictive optimization identifies tables that would benefit from `OPTIMIZE` and `VACUUM` operations and queues them to run using jobs compute. Your account is billed for compute associated with these workloads using a SKU specific to Databricks Managed Services. See pricing for [Databricks managed services](https:\/\/www.databricks.com\/product\/pricing\/platform-addons). Databricks provides system tables for observability into predictive optimization operations, costs, and impact. See [Use system tables to track predictive optimization](https:\/\/docs.databricks.com\/optimizations\/predictive-optimization.html#observability). \nNote \nPredictive optimization does not run `OPTIMIZE` commands on tables that use liquid clustering.\n\n#### Predictive optimization for Delta Lake\n##### Prerequisites for predictive optimization\n\nYou must fulfill the following requirements to enable predictive optimization: \n* Your Databricks workspace must be on the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons) in a region that supports predictive optimization. 
See [Databricks clouds and regions](https:\/\/docs.databricks.com\/resources\/supported-regions.html).\n* You must use SQL warehouses or Databricks Runtime 12.2 LTS or above when you enable predictive optimization.\n* Only Unity Catalog managed tables are supported. \n* Serverless compute must be enabled in your account. See [Enable serverless compute in your account](https:\/\/docs.databricks.com\/admin\/sql\/serverless.html#accept-terms).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/predictive-optimization.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Predictive optimization for Delta Lake\n##### Enable predictive optimization\n\nYou must enable predictive optimization at the account level. \nYou must have the following privileges to enable or disable predictive optimization at the specified level: \n| Unity Catalog object | Privilege |\n| --- | --- |\n| Account | Account admin |\n| Catalog | Catalog owner |\n| Schema | Schema owner | \nNote \nWhen you enable predictive optimization for the first time, Databricks automatically creates a service principal in your Databricks account. Databricks uses this service principal to perform the requested maintenance operations. See [Manage service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html). \n### Enable predictive optimization for your account \nAn account admin must complete the following steps to enable predictive optimization for all metastores in an account: \n1. Access the accounts console.\n2. Navigate to **Settings**, then **Feature enablement**.\n3. Select **Enabled** next to **Predictive optimization**. \nNote \nMetastores in regions that don\u2019t support predictive optimization aren\u2019t enabled. \n### Enable or disable predictive optimization for a catalog or schema \nPredictive optimization uses an inheritance model. When enabled for a catalog, schemas inherit the property. 
Tables within an enabled schema inherit predictive optimization. To override this inheritance behavior, you can explicitly disable predictive optimization for a catalog or schema. \nNote \nYou can disable predictive optimization at the catalog or schema level before enabling it at the account level. If predictive optimization is later enabled on the account, it is blocked for tables in these objects. \nUse the following syntax to enable or disable predictive optimization: \n```\nALTER CATALOG [catalog_name] {ENABLE | DISABLE} PREDICTIVE OPTIMIZATION;\nALTER {SCHEMA | DATABASE} schema_name {ENABLE | DISABLE} PREDICTIVE OPTIMIZATION;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/predictive-optimization.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Predictive optimization for Delta Lake\n##### Check whether predictive optimization is enabled\n\nThe `Predictive Optimization` field is a Unity Catalog property that details if predictive optimization is enabled. If predictive optimization is inherited from a parent object, this is indicated in the field value. \nUse the following syntax to see if predictive optimization is enabled: \n```\nDESCRIBE (CATALOG | SCHEMA | TABLE) EXTENDED name\n\n```\n\n#### Predictive optimization for Delta Lake\n##### Use system tables to track predictive optimization\n\nDatabricks provides a system table to track the history of predictive optimization operations. See [Predictive optimization system table reference](https:\/\/docs.databricks.com\/admin\/system-tables\/predictive-optimization.html).\n\n#### Predictive optimization for Delta Lake\n##### Limitations\n\nPredictive optimization is not available in all regions. See [Databricks clouds and regions](https:\/\/docs.databricks.com\/resources\/supported-regions.html). \nPredictive optimization does not run `OPTIMIZE` commands on tables that use liquid clustering or Z-order. 
\nPredictive optimization does not run `VACUUM` operations on tables with a file retention window configured below the default of 7 days. See [Configure data retention for time travel queries](https:\/\/docs.databricks.com\/delta\/history.html#data-retention). \nPredictive optimization does not perform maintenance operations on the following tables: \n* Tables loaded to a workspace as Delta Sharing recipients.\n* Materialized views. See [Use materialized views in Databricks SQL](https:\/\/docs.databricks.com\/sql\/user\/materialized-views.html).\n* Streaming tables. See [Load data using streaming tables in Databricks SQL](https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/predictive-optimization.html"} +{"content":"# Technology partners\n## Connect to data governance partners using Partner Connect\n#### Connect to Lightup\n\nLightup provides out-of-the-box data quality indicators (DQIs) to measure data health. \nYou can integrate your Databricks SQL warehouses (formerly Databricks SQL endpoints) and Databricks clusters with Lightup.\n\n#### Connect to Lightup\n##### Connect to Lightup using Partner Connect\n\nTo connect your Databricks workspace to Lightup using Partner Connect, see [Connect to data governance partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/data-governance.html). \nNote \nPartner Connect only supports Databricks SQL warehouses for Lightup. To connect a cluster in your Databricks workspace to Lightup, connect to Lightup manually.\n\n#### Connect to Lightup\n##### Connect to Lightup manually\n\nTo connect to Lightup manually, see [Connect to Databricks](https:\/\/docs.lightup.ai\/docs\/connect-to-databricks) in the Lightup documentation. 
\nNote \nTo connect a SQL warehouse with Lightup faster, connect to Lightup using Partner Connect.\n\n#### Connect to Lightup\n##### Next steps\n\n* [Lightup website](https:\/\/www.lightup.ai)\n* [Lightup documentation](https:\/\/docs.lightup.ai\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/data-governance\/lightup.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Query foundation models\n\nIn this article, you learn how to format query requests for [foundation models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/foundation-models.html) and send them to your model serving endpoint. \nFor query requests to traditional ML or Python models, see [Query serving endpoints for custom models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-custom-model-endpoints.html). \n[Databricks Model Serving](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html) supports [Foundation Models APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html) and [external models](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html) for accessing foundation models and uses a unified OpenAI-compatible API and SDK for querying them. This makes it possible to experiment with and customize foundation models for production across supported clouds and providers. 
\n* [Query a chat completion model](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html#chat)\n* [Query an embedding model](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html#embedding)\n* [Query a text completion model](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html#completion) \nDatabricks Model Serving provides the following options for sending scoring requests to foundation models: \n| Method | Details |\n| --- | --- |\n| OpenAI client | Query a model hosted by a Databricks Model Serving endpoint using the OpenAI client. Specify the model serving endpoint name as the `model` input. Supported for chat, embeddings, and completions models made available by Foundation Model APIs or external models. |\n| Serving UI | Select **Query endpoint** from the **Serving endpoint** page. Insert JSON format model input data and click **Send Request**. If the model has an input example logged, use **Show Example** to load it. |\n| REST API | Call and query the model using the REST API. See [POST \/serving-endpoints\/{name}\/invocations](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/query) for details. For scoring requests to endpoints serving multiple models, see [Query individual models behind an endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/serve-multiple-models-to-serving-endpoint.html#query). |\n| MLflow Deployments SDK | Use MLflow Deployments SDK\u2019s [predict()](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.deployments.html#mlflow.deployments.DatabricksDeploymentClient.predict) function to query the model. |\n| Databricks GenAI SDK | Databricks GenAI SDK is a layer on top of the REST API. It handles low-level details, such as authentication and mapping model IDs to endpoint URLs, making it easier to interact with the models. 
The SDK is designed to be used from inside Databricks notebooks. |\n| SQL function | Invoke model inference directly from SQL using the `ai_query` SQL function. See [Query a served model with ai\\_query()](https:\/\/docs.databricks.com\/large-language-models\/how-to-ai-query.html). |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Query foundation models\n##### Requirements\n\n* A [model serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html).\n* A Databricks workspace in a supported region. \n+ [Foundation Model APIs regions](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html#regions)\n+ [External models regions](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/model-serving-limits.html#regions)\n* To send a scoring request through the OpenAI client, REST API, or MLflow Deployments SDK, you must have a Databricks API token. \nImportant \nAs a security best practice for production scenarios, Databricks recommends that you use [machine-to-machine OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html) for authentication. \nFor testing and development, Databricks recommends using a personal access token belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. 
To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Query foundation models\n##### Install packages\n\nAfter you have selected a querying method, you must first install the appropriate package to your cluster. \nTo use the OpenAI client, the `openai` package needs to be installed on your cluster. Run the following in your notebook or your local terminal: \n```\n!pip install openai\n\n``` \nThe following is only required when installing the package on a Databricks Notebook \n```\ndbutils.library.restartPython()\n\n``` \nAccess to the Serving REST API is available in Databricks Runtime for Machine Learning. \n```\n!pip install mlflow\n\n``` \nThe following is only required when installing the package on a Databricks Notebook \n```\ndbutils.library.restartPython()\n\n``` \n```\n!pip install databricks-genai\n\n``` \nThe following is only required when installing the package on a Databricks Notebook \n```\ndbutils.library.restartPython()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Query foundation models\n##### Query a chat completion model\n\nThe following are examples for querying a chat model. \nFor a batch inference example, see [Batch inference using Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/fmapi-batch-inference.html). \nThe following is a chat request for the DBRX Instruct model made available by the Foundation Model APIs pay-per-token endpoint, `databricks-dbrx-instruct` in your workspace. 
\nTo use the OpenAI client, specify the model serving endpoint name as the `model` input. The following example assumes you have a [Databricks API token](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html#required) and `openai` installed on your compute. You also need your [Databricks workspace instance](https:\/\/docs.databricks.com\/workspace\/workspace-details.html#workspace-url) to connect the OpenAI client to Databricks. \n```\n\nimport os\nimport openai\nfrom openai import OpenAI\n\nclient = OpenAI(\napi_key=\"dapi-your-databricks-token\",\nbase_url=\"https:\/\/example.staging.cloud.databricks.com\/serving-endpoints\"\n)\n\nresponse = client.chat.completions.create(\nmodel=\"databricks-dbrx-instruct\",\nmessages=[\n{\n\"role\": \"system\",\n\"content\": \"You are a helpful assistant.\"\n},\n{\n\"role\": \"user\",\n\"content\": \"What is a mixture of experts model?\",\n}\n],\nmax_tokens=256\n)\n\n``` \nImportant \nThe following example uses REST API parameters for querying serving endpoints that serve foundation models. These parameters are [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html) and the definition might change. See [POST \/serving-endpoints\/{name}\/invocations](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/query). \nThe following is a chat request for the DBRX Instruct model made available by the Foundation Model APIs pay-per-token endpoint, `databricks-dbrx-instruct` in your workspace. 
\n```\ncurl \\\n-u token:$DATABRICKS_TOKEN \\\n-X POST \\\n-H \"Content-Type: application\/json\" \\\n-d '{\n\"messages\": [\n{\n\"role\": \"system\",\n\"content\": \"You are a helpful assistant.\"\n},\n{\n\"role\": \"user\",\n\"content\": \"What is a mixture of experts model?\"\n}\n]\n}' \\\nhttps:\/\/<workspace_host>.databricks.com\/serving-endpoints\/databricks-dbrx-instruct\/invocations\n\n``` \nImportant \nThe following example uses the `predict()` API from the [MLflow Deployments SDK](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.deployments.html#mlflow.deployments.DatabricksDeploymentClient.predict). \nThe following is a chat request for the DBRX Instruct model made available by the Foundation Model APIs pay-per-token endpoint, `databricks-dbrx-instruct` in your workspace. \n```\n\nimport os\nimport mlflow.deployments\n\n# Only required when running this example outside of a Databricks Notebook\nos.environ[\"DATABRICKS_HOST\"] = \"https:\/\/<workspace_host>.databricks.com\"\nos.environ[\"DATABRICKS_TOKEN\"] = \"dapi-your-databricks-token\"\n\nclient = mlflow.deployments.get_deploy_client(\"databricks\")\n\nchat_response = client.predict(\nendpoint=\"databricks-dbrx-instruct\",\ninputs={\n\"messages\": [\n{\n\"role\": \"user\",\n\"content\": \"Hello!\"\n},\n{\n\"role\": \"assistant\",\n\"content\": \"Hello! How can I assist you today?\"\n},\n{\n\"role\": \"user\",\n\"content\": \"What is a mixture of experts model?\"\n}\n],\n\"temperature\": 0.1,\n\"max_tokens\": 20\n}\n)\n\n``` \nThe following is a chat request for the DBRX Instruct model made available by the Foundation Model APIs pay-per-token endpoint, `databricks-dbrx-instruct` in your workspace. 
\n```\nimport os\nfrom databricks_genai_inference import ChatCompletion\n\n# Only required when running this example outside of a Databricks Notebook\nos.environ[\"DATABRICKS_HOST\"] = \"https:\/\/<workspace_host>.databricks.com\"\nos.environ[\"DATABRICKS_TOKEN\"] = \"dapi-your-databricks-token\"\n\nresponse = ChatCompletion.create(model=\"databricks-dbrx-instruct\",\nmessages=[{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n{\"role\": \"user\",\"content\": \"What is a mixture of experts model?\"}],\nmax_tokens=128)\nprint(f\"response.message:{response.message}\")\n\n``` \nTo [query a foundation model endpoint using LangChain](https:\/\/docs.databricks.com\/large-language-models\/langchain.html#wrap-endpoint), you can do either of the following: \n* Import the `Databricks` LLM class and specify the `endpoint_name` and `transform_input_fn`.\n* Import the `ChatDatabricks` ChatModel class and specify the `endpoint`. \nThe following example uses the `Databricks` LLM class in LangChain to query the Foundation Model APIs pay-per-token endpoint, `databricks-dbrx-instruct`. Foundation Model APIs expects `messages` in the request dictionary, while the LangChain Databricks LLM provides `prompt` in the request dictionary by default. Use the `transform_input` function to convert the request dictionary into the expected format. \n```\nfrom langchain.llms import Databricks\nfrom langchain_core.messages import HumanMessage, SystemMessage\n\n# Convert the \"prompt\" field that LangChain sends into the \"messages\" list that Foundation Model APIs expects\ndef transform_input(**request):\n    request[\"messages\"] = [\n        {\n            \"role\": \"user\",\n            \"content\": request[\"prompt\"]\n        }\n    ]\n    del request[\"prompt\"]\n    return request\n\nllm = Databricks(endpoint_name=\"databricks-dbrx-instruct\", transform_input_fn=transform_input)\nllm(\"What is a mixture of experts model?\")\n\n``` \nThe following example uses the `ChatDatabricks` ChatModel class and specifies the `endpoint`. 
\n```\nfrom langchain.chat_models import ChatDatabricks\nfrom langchain_core.messages import HumanMessage, SystemMessage\n\nmessages = [\nSystemMessage(content=\"You're a helpful assistant\"),\nHumanMessage(content=\"What is a mixture of experts model?\"),\n]\nchat_model = ChatDatabricks(endpoint=\"databricks-dbrx-instruct\", max_tokens=500)\nchat_model.invoke(messages)\n\n``` \nImportant \nThe following example uses the built-in SQL function, [ai\\_query](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_query.html). This function is [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html) and the definition might change. See [Query a served model with ai\\_query()](https:\/\/docs.databricks.com\/large-language-models\/how-to-ai-query.html). \nThe following is a chat request for `llama-2-70b-chat` made available by the Foundation Model APIs pay-per-token endpoint, `databricks-llama-2-70b-chat` in your workspace. \nNote \nThe `ai_query()` function does not support query endpoints that serve the DBRX or the DBRX Instruct model. \n```\nSELECT ai_query(\n\"databricks-llama-2-70b-chat\",\n\"Can you explain AI in ten words?\"\n)\n\n``` \nThe following is the expected request format for a chat model. For external models, you can include additional parameters that are valid for a given provider and endpoint configuration. See [Additional query parameters](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html#extra-parameters). 
\n```\n{\n\"messages\": [\n{\n\"role\": \"user\",\n\"content\": \"What is a mixture of experts model?\"\n}\n],\n\"max_tokens\": 100,\n\"temperature\": 0.1\n}\n\n``` \nThe following is an expected response format: \n```\n{\n\"model\": \"databricks-dbrx-instruct\",\n\"choices\": [\n{\n\"message\": {},\n\"index\": 0,\n\"finish_reason\": null\n}\n],\n\"usage\": {\n\"prompt_tokens\": 7,\n\"completion_tokens\": 74,\n\"total_tokens\": 81\n},\n\"object\": \"chat.completion\",\n\"id\": null,\n\"created\": 1698824353\n}\n\n``` \n### Chat session \nThe Databricks GenAI SDK provides the `ChatSession` class to manage multi-round chat conversations. It provides the following functions: \n| Function | Return | Description |\n| --- | --- | --- |\n| `reply (string)` | | Takes a new user message. |\n| `last` | string | Last message from the assistant. |\n| `history` | list of dict | Messages in chat history, including roles. |\n| `count` | int | Number of chat rounds conducted so far. | \nTo initialize `ChatSession`, you use the same set of arguments as `ChatCompletion`, and those arguments are used throughout the chat session. \n```\n\nfrom databricks_genai_inference import ChatSession\n\nchat = ChatSession(model=\"llama-2-70b-chat\", system_message=\"You are a helpful assistant.\", max_tokens=128)\nchat.reply(\"Knock, knock!\")\nchat.last # return \"Hello! Who's there?\"\nchat.reply(\"Guess who!\")\nchat.last # return \"Okay, I'll play along! Is it a person, a place, or a thing?\"\n\nchat.history\n# return: [\n# {'role': 'system', 'content': 'You are a helpful assistant.'},\n# {'role': 'user', 'content': 'Knock, knock!'},\n# {'role': 'assistant', 'content': \"Hello! Who's there?\"},\n# {'role': 'user', 'content': 'Guess who!'},\n# {'role': 'assistant', 'content': \"Okay, I'll play along! 
Is it a person, a place, or a thing?\"}\n# ]\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Query foundation models\n##### Query an embedding model\n\nThe following is an embeddings request for the `bge-large-en` model made available by Foundation Model APIs. \nTo use the OpenAI client, specify the model serving endpoint name as the `model` input. The following example assumes you have a Databricks API token and `openai` installed on your cluster. \n```\n\nimport os\nimport openai\nfrom openai import OpenAI\n\nclient = OpenAI(\napi_key=\"dapi-your-databricks-token\",\nbase_url=\"https:\/\/example.staging.cloud.databricks.com\/serving-endpoints\"\n)\n\nresponse = client.embeddings.create(\nmodel=\"databricks-bge-large-en\",\ninput=\"what is databricks\"\n)\n\n``` \nImportant \nThe following example uses REST API parameters for querying serving endpoints that serve foundation models. These parameters are [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html) and the definition might change. See [POST \/serving-endpoints\/{name}\/invocations](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/query). \n```\ncurl \\\n-u token:$DATABRICKS_TOKEN \\\n-X POST \\\n-H \"Content-Type: application\/json\" \\\n-d '{ \"input\": \"Embed this sentence!\"}' \\\nhttps:\/\/<workspace_host>.databricks.com\/serving-endpoints\/databricks-bge-large-en\/invocations\n\n``` \nImportant \nThe following example uses the `predict()` API from the [MLflow Deployments SDK](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.deployments.html#mlflow.deployments.DatabricksDeploymentClient.predict). 
\n```\n\nimport os\nimport mlflow.deployments\n\n# Only required when running this example outside of a Databricks Notebook\nos.environ[\"DATABRICKS_HOST\"] = \"https:\/\/<workspace_host>.databricks.com\"\nos.environ[\"DATABRICKS_TOKEN\"] = \"dapi-your-databricks-token\"\n\nclient = mlflow.deployments.get_deploy_client(\"databricks\")\n\nembeddings_response = client.predict(\nendpoint=\"databricks-bge-large-en\",\ninputs={\n\"input\": \"Here is some text to embed\"\n}\n)\n\n``` \n```\n\nimport os\nfrom databricks_genai_inference import Embedding\n\n# Only required when running this example outside of a Databricks Notebook\nos.environ[\"DATABRICKS_HOST\"] = \"https:\/\/<workspace_host>.databricks.com\"\nos.environ[\"DATABRICKS_TOKEN\"] = \"dapi-your-databricks-token\"\n\nresponse = Embedding.create(\nmodel=\"bge-large-en\",\ninput=\"3D ActionSLAM: wearable person tracking in multi-floor environments\")\nprint(f'embeddings: {response.embeddings}')\n\n``` \nTo use a [Databricks Foundation Model APIs model in LangChain](https:\/\/python.langchain.com\/docs\/integrations\/providers\/databricks#databricks-foundation-model-apis) as an embedding model, import the `DatabricksEmbeddings` class and specify the `endpoint` parameter as follows: \n```\nfrom langchain.embeddings import DatabricksEmbeddings\n\nembeddings = DatabricksEmbeddings(endpoint=\"databricks-bge-large-en\")\nembeddings.embed_query(\"Can you explain AI in ten words?\")\n\n``` \nImportant \nThe following example uses the built-in SQL function, [ai\\_query](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_query.html). This function is [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html) and the definition might change. See [Query a served model with ai\\_query()](https:\/\/docs.databricks.com\/large-language-models\/how-to-ai-query.html). \n```\nSELECT ai_query(\n\"databricks-bge-large-en\",\n\"Can you explain AI in ten words?\"\n)\n\n``` \nThe following is the expected request format for an embeddings model. 
For external models, you can include additional parameters that are valid for a given provider and endpoint configuration. See [Additional query parameters](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html#extra-parameters). \n```\n{\n\"input\": [\n\"embedding text\"\n]\n}\n\n``` \nThe following is the expected response format: \n```\n{\n\"object\": \"list\",\n\"data\": [\n{\n\"object\": \"embedding\",\n\"index\": 0,\n\"embedding\": []\n}\n],\n\"model\": \"text-embedding-ada-002-v2\",\n\"usage\": {\n\"prompt_tokens\": 2,\n\"total_tokens\": 2\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Query foundation models\n##### Query a text completion model\n\nThe following is a completions request for the `databricks-mpt-30b-instruct` model made available by Foundation Model APIs. For the parameters and syntax, see [Completion task](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html#completion). \nTo use the OpenAI client, specify the model serving endpoint name as the `model` input. The following example assumes you have a Databricks API token and `openai` installed on your cluster. \n```\n\nimport os\nimport openai\nfrom openai import OpenAI\n\nclient = OpenAI(\napi_key=\"dapi-your-databricks-token\",\nbase_url=\"https:\/\/example.staging.cloud.databricks.com\/serving-endpoints\"\n)\n\ncompletion = client.completions.create(\nmodel=\"databricks-mpt-30b-instruct\",\nprompt=\"what is databricks\",\ntemperature=1.0\n)\n\n``` \nImportant \nThe following example uses REST API parameters for querying serving endpoints that serve foundation models. These parameters are [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html) and the definition might change. 
See [POST \/serving-endpoints\/{name}\/invocations](https:\/\/docs.databricks.com\/api\/workspace\/servingendpoints\/query). \n```\ncurl \\\n-u token:$DATABRICKS_TOKEN \\\n-X POST \\\n-H \"Content-Type: application\/json\" \\\n-d '{\"prompt\": \"What is a quoll?\", \"max_tokens\": 64}' \\\nhttps:\/\/<workspace_host>.databricks.com\/serving-endpoints\/databricks-mpt-30b-instruct\/invocations\n\n``` \nImportant \nThe following example uses the `predict()` API from the [MLflow Deployments SDK](https:\/\/mlflow.org\/docs\/latest\/python_api\/mlflow.deployments.html#mlflow.deployments.DatabricksDeploymentClient.predict). \n```\n\nimport os\nimport mlflow.deployments\n\n# Only required when running this example outside of a Databricks Notebook\nos.environ[\"DATABRICKS_HOST\"] = \"https:\/\/<workspace_host>.databricks.com\"\nos.environ[\"DATABRICKS_TOKEN\"] = \"dapi-your-databricks-token\"\n\nclient = mlflow.deployments.get_deploy_client(\"databricks\")\n\ncompletions_response = client.predict(\nendpoint=\"databricks-mpt-30b-instruct\",\ninputs={\n\"prompt\": \"What is the capital of France?\",\n\"temperature\": 0.1,\n\"max_tokens\": 10,\n\"n\": 2\n}\n)\n\n``` \n```\n\nimport os\nfrom databricks_genai_inference import Completion\n\n# Only required when running this example outside of a Databricks Notebook\nos.environ[\"DATABRICKS_HOST\"] = \"https:\/\/<workspace_host>.databricks.com\"\nos.environ[\"DATABRICKS_TOKEN\"] = \"dapi-your-databricks-token\"\n\nresponse = Completion.create(\nmodel=\"databricks-mpt-30b-instruct\",\nprompt=\"Write 3 reasons why you should train an AI model on domain specific data sets.\",\nmax_tokens=128)\nprint(f\"response.text:{response.text}\")\n\n``` \nImportant \nThe following example uses the built-in SQL function, [ai\\_query](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/ai_query.html). This function is [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html) and the definition might change. 
See [Query a served model with ai\\_query()](https:\/\/docs.databricks.com\/large-language-models\/how-to-ai-query.html). \n```\nSELECT ai_query(\n\"databricks-mpt-30b-instruct\",\n\"Can you explain AI in ten words?\"\n)\n\n``` \nThe following is the expected request format for a completions model. For external models, you can include additional parameters that are valid for a given provider and endpoint configuration. See [Additional query parameters](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html#extra-parameters). \n```\n{\n\"prompt\": \"What is mlflow?\",\n\"max_tokens\": 100,\n\"temperature\": 0.1,\n\"stop\": [\n\"Human:\"\n],\n\"n\": 1,\n\"stream\": false,\n\"extra_params\":{\n\"top_p\": 0.9\n}\n}\n\n``` \nThe following is the expected response format: \n```\n{\n\"id\": \"cmpl-8FwDGc22M13XMnRuessZ15dG622BH\",\n\"object\": \"text_completion\",\n\"created\": 1698809382,\n\"model\": \"gpt-3.5-turbo-instruct\",\n\"choices\": [\n{\n\"text\": \"MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It provides tools for tracking experiments, managing and deploying models, and collaborating on projects. MLflow also supports various machine learning frameworks and languages, making it easier to work with different tools and environments. 
It is designed to help data scientists and machine learning engineers streamline their workflows and improve the reproducibility and scalability of their models.\",\n\"index\": 0,\n\"logprobs\": null,\n\"finish_reason\": \"stop\"\n}\n],\n\"usage\": {\n\"prompt_tokens\": 5,\n\"completion_tokens\": 83,\n\"total_tokens\": 88\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html"} +{"content":"# Model serving with Databricks\n## Deploy generative AI foundation models\n#### Query foundation models\n##### Chat with supported LLMs using AI Playground\n\nYou can interact with supported large language models using the [AI Playground](https:\/\/docs.databricks.com\/large-language-models\/ai-playground.html). The AI Playground is a chat-like environment where you can test, prompt, and compare LLMs from your Databricks workspace. \n![AI playground](https:\/\/docs.databricks.com\/_images\/ai-playground.png)\n\n#### Query foundation models\n##### Additional resources\n\n* [Inference tables for monitoring and debugging models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html)\n* [Batch inference using Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/fmapi-batch-inference.html)\n* [Databricks Foundation Model APIs](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/index.html)\n* [External models in Databricks Model Serving](https:\/\/docs.databricks.com\/generative-ai\/external-models\/index.html)\n* [Supported models for pay-per-token](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/supported-models.html)\n* [Foundation model REST API reference](https:\/\/docs.databricks.com\/machine-learning\/foundation-models\/api-reference.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/score-foundation-models.html"} +{"content":"# DatabricksIQ-powered 
features\n","doc_uri":"https:\/\/docs.databricks.com\/databricksiq\/databricksiq-trust.html"} +{"content":"# DatabricksIQ-powered features\n### DatabricksIQ trust and safety\n\nDatabricks understands the importance of your data and the trust you place in us when you use Databricks services. Databricks is committed to the highest standards of data protection and has implemented rigorous measures to ensure your information is protected. \n* **Your data is not used for training and our model partners do not retain your data.** \n+ Neither Databricks nor our model partner (Azure OpenAI) trains models using customer data. Your code and input are **not** used to generate suggestions displayed for other customers.\n+ Data submitted by Databricks in these features is not retained by Azure OpenAI, even for abuse monitoring. These features have been opted-out of Azure\u2019s retention of data for that purpose.\n* **Protection from harmful output.** Databricks also uses Azure OpenAI [content filtering](https:\/\/learn.microsoft.com\/azure\/ai-services\/openai\/concepts\/content-filter?tabs=python) to protect users from harmful content. In addition, Databricks has performed an extensive evaluation with thousands of simulated user interactions to ensure that the protections put in place to protect against harmful content, jailbreaks, insecure code generation, and use of third-party copyright content are effective.\n* **Databricks uses only the data necessary to provide the service.** Data is only sent when you interact with AI-assistive features. Databricks sends your natural language input, relevant table metadata, errors, as well as input code or queries to help return more relevant results for your data. 
Databricks does **not** send row-level data.\n* **Data is protected in transit.** All traffic between Databricks and Azure OpenAI is encrypted in transit with industry standard TLS encryption.\n* **EU Data Stays in the EU.** For European Union (EU) workspaces, AI-assistive features will use an Azure OpenAI model hosted in the EU. All other regions will use a model hosted in the US. \nThe following features use Azure OpenAI Service: \n* [Databricks Assistant for notebooks, SQL editor, and file editor](https:\/\/docs.databricks.com\/notebooks\/databricks-assistant-faq.html)\n* [Databricks Assistant for help](https:\/\/docs.databricks.com\/workspace\/index.html#get-help)\n* [Databricks Assistant for dashboards](https:\/\/docs.databricks.com\/dashboards\/tutorials\/create-w-db-assistant.html) \nAll other AI-assistive features, including AI-generated Unity Catalog documentation, use Databricks-managed models.\n\n","doc_uri":"https:\/\/docs.databricks.com\/databricksiq\/databricksiq-trust.html"} +{"content":"# \n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-dev.html"} +{"content":"# \n### Development environment\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html) \nRAG Studio applications are developed using a local environment. Follow these steps to configure your development environment. \nTip \n**\ud83d\udea7 Roadmap \ud83d\udea7** Support for developing RAG Studio apps using only the Databricks Notebook as your IDE. \n1. Follow the [Databricks CLI installation instructions](https:\/\/docs.databricks.com\/dev-tools\/cli\/install.html) to install or upgrade the [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) on your development machine. 
\nWarning \nUsing an older and incompatible Databricks CLI version generates an error. If that happens, follow the [upgrade instructions](https:\/\/docs.databricks.com\/dev-tools\/cli\/install.html#update)\n2. Authenticate the Databricks CLI following the steps for using your [personal access token](https:\/\/docs.databricks.com\/dev-tools\/cli\/authentication.html#token-auth) or [OAuth user-to-machine (U2M) authentication](https:\/\/docs.databricks.com\/dev-tools\/cli\/authentication.html#token-auth). \nNote \nYou must use the same Workspace in which you create RAG Studio\u2019s [required infrastructure](https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-infra.html).\n3. Set up your Python environment \n1. Install Python 3.10 \nWarning \nRAG Studio has been tested in Python 3.10. Although you can use a higher Python version, we suggest using Python 3.10.\n2. Using your preferred virtual environment manager, create a virtual environment. \nNote \nWhile technically optional, we strongly suggest using a virtual environment since after creating a RAG Application, you will install RAG Studio\u2019s `requirements.txt`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/setup\/env-setup-dev.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Delta Live Tables properties reference\n\nThis article provides a reference for Delta Live Tables JSON setting specification and table properties in Databricks. 
For more details on using these various properties and configurations, see the following articles: \n* [Configure pipeline settings for Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html)\n* [Delta Live Tables API guide](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/properties.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Delta Live Tables properties reference\n###### Delta Live Tables pipeline configurations\n\n| Fields |\n| --- |\n| **`id`** Type: `string` A globally unique identifier for this pipeline. The identifier is assigned by the system and cannot be changed. |\n| **`name`** Type: `string` A user-friendly name for this pipeline. The name can be used to identify pipeline jobs in the UI. |\n| **`storage`** Type: `string` A location on DBFS or cloud storage where output data and metadata required for pipeline execution are stored. Tables and metadata are stored in subdirectories of this location. When the `storage` setting is not specified, the system will default to a location in `dbfs:\/pipelines\/`. The `storage` setting cannot be changed after a pipeline is created. |\n| **`configuration`** Type: `object` An optional list of settings to add to the Spark configuration of the cluster that will run the pipeline. These settings are read by the Delta Live Tables runtime and available to pipeline queries through the Spark configuration. Elements must be formatted as `key:value` pairs. |\n| **`libraries`** Type: `array of objects` An array of notebooks containing the pipeline code and required artifacts. |\n| **`clusters`** Type: `array of objects` An array of specifications for the clusters to run the pipeline. If this is not specified, pipelines will automatically select a default cluster configuration for the pipeline. 
|\n| **`development`** Type: `boolean` A flag indicating whether to run the pipeline in `development` or `production` mode. The default value is `true`. |\n| **`notifications`** Type: `array of objects` An optional array of specifications for email notifications when a pipeline update completes, fails with a retryable error, fails with a non-retryable error, or a flow fails. |\n| **`continuous`** Type: `boolean` A flag indicating whether to run the pipeline continuously. The default value is `false`. |\n| **`target`** Type: `string` The name of a database for persisting pipeline output data. Configuring the `target` setting allows you to view and query the pipeline output data from the Databricks UI. |\n| **`channel`** Type: `string` The version of the Delta Live Tables runtime to use. The supported values are: * `preview` to test your pipeline with upcoming changes to the runtime version. * `current` to use the current runtime version. The `channel` field is optional. The default value is `current`. Databricks recommends using the current runtime version for production workloads. |\n| **`edition`** Type: `string` The Delta Live Tables product edition to run the pipeline. This setting allows you to choose the best product edition based on the requirements of your pipeline: * `CORE` to run streaming ingest workloads. * `PRO` to run streaming ingest and change data capture (CDC) workloads. * `ADVANCED` to run streaming ingest workloads, CDC workloads, and workloads that require Delta Live Tables expectations to enforce data quality constraints. The `edition` field is optional. The default value is `ADVANCED`. |\n| **`photon`** Type: `boolean` A flag indicating whether to use [Photon](https:\/\/docs.databricks.com\/compute\/photon.html) to run the pipeline. Photon is the Databricks high-performance Spark engine. Photon-enabled pipelines are billed at a different rate than non-Photon pipelines. The `photon` field is optional. The default value is `false`. 
|\n| **`pipelines.maxFlowRetryAttempts`** Type: `int` The maximum number of attempts to retry a flow before failing a pipeline update when a retryable failure occurs. The default value is two. By default, when a retryable failure occurs, the Delta Live Tables runtime attempts to run the flow three times including the original attempt. |\n| **`pipelines.numUpdateRetryAttempts`** Type: `int` The maximum number of attempts to retry an update before failing the update when a retryable failure occurs. The retry is run as a full update. The default is five. This parameter applies only to triggered updates run in production mode. There is no retry when your pipeline runs in development mode. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/properties.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Delta Live Tables properties reference\n###### Delta Live Tables table properties\n\nIn addition to the table properties supported by [Delta Lake](https:\/\/docs.databricks.com\/delta\/table-properties.html), you can set the following table properties. \n| Table properties |\n| --- |\n| **`pipelines.autoOptimize.managed`** Default: `true` Enables or disables automatically scheduled optimization of this table. |\n| **`pipelines.autoOptimize.zOrderCols`** Default: None An optional string containing a comma-separated list of column names to z-order this table by. For example, `pipelines.autoOptimize.zOrderCols = \"year,month\"` |\n| **`pipelines.reset.allowed`** Default: `true` Controls whether a full refresh is allowed for this table. |\n\n##### Delta Live Tables properties reference\n###### CDC table properties\n\nNote \nThese properties, which control tombstone management behavior, are deprecated and replaced by pipeline settings. Any existing or new pipelines should use the new pipeline settings. 
See [Control tombstone management for SCD type 1 queries](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#cdc). \nThe following table properties are added to control the behavior of tombstone management for `DELETE` events when using CDC: \n| Table properties |\n| --- |\n| **`pipelines.cdc.tombstoneGCThresholdInSeconds`** Default: 5 minutes Set this value to match the highest expected interval between out-of-order data. |\n| **`pipelines.cdc.tombstoneGCFrequencyInSeconds`** Default: 60 seconds Controls how frequently tombstones are checked for cleanup. | \nSee [APPLY CHANGES API: Simplify change data capture in Delta Live Tables](https:\/\/docs.databricks.com\/delta-live-tables\/cdc.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/properties.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Delta Live Tables properties reference\n###### Pipelines trigger interval\n\nYou can specify a pipeline trigger interval for the entire Delta Live Tables pipeline or as part of a dataset declaration. See [Pipelines trigger interval](https:\/\/docs.databricks.com\/delta-live-tables\/settings.html#trigger-interval). \n| `pipelines.trigger.interval` |\n| --- |\n| The default is based on flow type:* Five seconds for streaming queries. * One minute for complete queries when all input data is from Delta sources. * Ten minutes for complete queries when some data sources may be non-Delta. The value is a number plus the time unit. 
The following are the valid time units:* `second`, `seconds` * `minute`, `minutes` * `hour`, `hours` * `day`, `days` You can use the singular or plural unit when defining the value, for example:* `{\"pipelines.trigger.interval\" : \"1 hour\"}` * `{\"pipelines.trigger.interval\" : \"10 seconds\"}` * `{\"pipelines.trigger.interval\" : \"30 second\"}` * `{\"pipelines.trigger.interval\" : \"1 minute\"}` * `{\"pipelines.trigger.interval\" : \"10 minutes\"}` * `{\"pipelines.trigger.interval\" : \"10 minute\"}` |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/properties.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Delta Live Tables properties reference\n###### Cluster attributes that are not user settable\n\nBecause Delta Live Tables manages cluster lifecycles, many cluster settings are set by Delta Live Tables and cannot be manually configured by users, either in a pipeline configuration or in a cluster policy used by a pipeline. The following table lists these settings and why they cannot be manually set. \n| Fields |\n| --- |\n| **`cluster_name`** Delta Live Tables sets the names of the clusters used to run pipeline updates. These names cannot be overridden. |\n| **`data_security_mode`** **`access_mode`** These values are automatically set by the system. |\n| **`spark_version`** Delta Live Tables clusters run on a custom version of Databricks Runtime that is continually updated to include the latest features. The version of Spark is bundled with the Databricks Runtime version and cannot be overridden. |\n| **`autotermination_minutes`** Because Delta Live Tables manages cluster auto-termination and reuse logic, the cluster auto-termination time cannot be overridden. |\n| **`runtime_engine`** Although you can control this field by enabling Photon for your pipeline, you cannot set this value directly. 
|\n| **`effective_spark_version`** This value is automatically set by the system. |\n| **`cluster_source`** This field is set by the system and is read-only. |\n| **`docker_image`** Because Delta Live Tables manages the cluster lifecycle, you cannot use a custom container with pipeline clusters. |\n| **`workload_type`** This value is set by the system and cannot be overridden. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/properties.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Concepts\n\nThis section describes concepts to help you use Databricks Feature Store and feature tables.\n\n#### Concepts\n##### Feature tables\n\nFeatures are organized as feature tables. Each table must have a primary key, and is backed by a [Delta table](https:\/\/docs.databricks.com\/delta\/index.html) and additional metadata. Feature table metadata tracks the data sources from which a table was generated and the notebooks and jobs that created or wrote to the table. \nWith Databricks Runtime 13.3 LTS and above, if your workspace is enabled for Unity Catalog, you can use any Delta table in Unity Catalog with a primary key as a feature table. See [Feature Engineering in Unity Catalog](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/uc\/feature-tables-uc.html). Feature tables that are stored in the local Workspace Feature Store are called \u201cWorkspace feature tables\u201d. See [Work with features in Workspace Feature Store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/workspace-feature-store\/feature-tables.html). \nFeatures in a feature table are typically computed and updated using a common computation function. 
\nYou can publish a feature table to an [online store](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/concepts.html#online-store-1) for real-time model inference.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/concepts.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Concepts\n##### `FeatureLookup`\n\nMany different models might use features from a particular feature table, and not all models will need every feature. To train a model using features, you create a `FeatureLookup` for each feature table. The `FeatureLookup` specifies which features to use from the table, and also defines the keys to use to join the feature table to the label data passed to `create_training_set`. \nThe diagram illustrates how a `FeatureLookup` works. In this example, you want to train a model using features from two feature tables, `customer_features` and `product_features`. You create a `FeatureLookup` for each feature table, specifying the name of the table, the features (columns) to select from the table, and the lookup key to use when joining the features to create a training dataset. \nYou then call `create_training_set`, also shown in the diagram. This API call specifies the DataFrame that contains the raw training data (`label_df`), the `FeatureLookups` to use, and `label`, a column that contains the ground truth. The training data must contain column(s) corresponding to each of the primary keys of the feature tables. The data in the feature tables is joined to the input DataFrame according to these keys. The result is shown in the diagram as the \u201cTraining dataset\u201d. 
\n![FeatureLookup diagram](https:\/\/docs.databricks.com\/_images\/feature-lookup-diagram.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/concepts.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Concepts\n##### Training set\n\nA training set consists of a list of features and a DataFrame containing raw training data, labels, and primary keys by which to look up features. You create the training set by specifying features to extract from Feature Store, and provide the training set as input during model training. \nSee [Create a training dataset](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/train-models-with-feature-store.html#create-a-training-dataset) for an example of how to create and use a training set. \nWhen you train a model using Feature Engineering in Unity Catalog, you can view the model\u2019s lineage in Catalog Explorer. Tables and functions that were used to create the model are automatically tracked and displayed. See [View feature store lineage](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/lineage.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/concepts.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Concepts\n##### Time series feature tables\n\nThe data used to train a model often has time dependencies built into it. When you build the model, you must consider only feature values up until the time of the observed target value. If you train on features based on data measured after the timestamp of the target value, the model\u2019s performance may suffer. \n[Time series feature tables](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html) include a timestamp column that ensures that each row in the training dataset represents the latest known feature values as of the row\u2019s timestamp. 
You should use time series feature tables whenever feature values change over time, for example with time series data, event-based data, or time-aggregated data. \nWhen you create a time series feature table, you specify time-related columns in your primary keys to be timeseries columns using the `timeseries_columns` argument (for Feature Engineering in Unity Catalog) or the `timestamp_keys` argument (for Workspace Feature Store). This enables point-in-time lookups when you use `create_training_set` or `score_batch`. The system performs an as-of timestamp join, using the `timestamp_lookup_key` you specify. \nIf you do not use the `timeseries_columns` argument or the `timestamp_keys` argument, and only designate a timeseries column as a primary key column, Feature Store does not apply point-in-time logic to the timeseries column during joins. Instead, it matches only rows with an exact time match, rather than all rows prior to the timestamp.\n\n#### Concepts\n##### Offline store\n\nThe offline feature store is used for feature discovery, model training, and batch inference. It contains feature tables materialized as [Delta tables](https:\/\/docs.databricks.com\/delta\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/concepts.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Concepts\n##### Online store\n\nAn online store is a low-latency database used for real-time model inference. For a list of online stores that Databricks supports, see [Third-party online stores](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/online-feature-stores.html).\n\n#### Concepts\n##### Streaming\n\nIn addition to batch writes, Databricks Feature Store supports streaming. 
You can write feature values to a feature table from a streaming source, and feature computation code can utilize [Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/index.html) to transform raw data streams into features. \nYou can also stream feature tables from the offline store to an online store.\n\n#### Concepts\n##### Model packaging\n\nA machine learning model trained using features from Databricks Feature Store retains references to these features. At inference time, the model can optionally retrieve feature values from Feature Store. The caller only needs to provide the primary key of the features used in the model (for example, `user_id`), and the model retrieves all required feature values from Feature Store. \nIn batch inference, feature values are retrieved from the offline store and joined with new data prior to scoring.\nIn real-time inference, feature values are retrieved from the online store. \nTo package a model with feature metadata, use `FeatureEngineeringClient.log_model` (for Feature Engineering in Unity Catalog) or `FeatureStoreClient.log_model` (for Workspace Feature Store).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/concepts.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Rivery\n\nRivery helps you ingest, orchestrate, and take action on all of your data. Rivery empowers organizations to support more data sources, larger and more complex datasets, accelerate time to insights, and increase access to data across the entire organization. \nYou can integrate your Databricks SQL warehouses (formerly Databricks SQL endpoints) with Rivery. \nNote \nRivery does not integrate with Databricks clusters.\n\n#### Connect to Rivery\n##### Connect to Rivery using Partner Connect\n\nTo connect to Rivery using Partner Connect, do the following: \n1. 
[Connect to ingestion partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/ingestion.html).\n2. In your Rivery account, choose **Connections**. The connection to your SQL warehouse should be displayed. \nIf the connection is not displayed, you can troubleshoot by skipping ahead to connect to Rivery manually.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/rivery.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Rivery\n##### Connect to Rivery manually\n\nUse the following instructions to manually connect Rivery to a SQL warehouse in your workspace. \nNote \nTo connect your SQL warehouses faster to Rivery, use Partner Connect. \n### Requirements \nBefore you connect to Rivery manually, you need the following: \n* A Databricks [personal access token](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement). \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n* A SQL warehouse. \n+ To create a SQL warehouse in your workspace, see [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* Connection details for your SQL warehouse. See [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html). 
Specifically, you need the SQL warehouse\u2019s **Server Hostname**, **Port**, and **HTTP Path** field values. \n### Steps to connect \nTo connect to Rivery manually, follow the steps in this section. \nTip \nIf the **Rivery** tile in Partner Connect has a check mark icon inside of it, you can get the connection details for the connected SQL warehouse by clicking the tile and then expanding **Connection details**. Note however that the **Personal access token** here is hidden; you must [create a replacement personal access token](https:\/\/docs.databricks.com\/partner-connect\/index.html#how-to-create-token) and enter that new token instead when Rivery asks you for it. \n1. Sign in to your Rivery account, or create a new Rivery account, at <https:\/\/console.rivery.io>.\n2. Click **Connections**. \nImportant \nIf you sign in to your organization\u2019s Rivery account, there might already be a list of existing connection entries with the Databricks logo. *These entries might contain connection details for SQL warehouses in workspaces that are separate from yours.* If you still want to reuse one of these connections, and you trust the SQL warehouse and have access to it, choose that destination and skip the remaining steps in this section.\n3. Click **Create New Connection**.\n4. Choose **Databricks**. Use the **Filter Data Sources** box to find it if necessary.\n5. Enter a **Connection Name** and an optional **Description**.\n6. Enter the **Server Hostname**, **Port**, and **HTTP Path** from the connection details for your SQL warehouse.\n7. Enter your token in **Personal Access Token**.\n8. Click **Save**.\n9. 
Click the lightning bolt (**Test connection**) icon.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/rivery.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Rivery\n##### Additional resources\n\nSee the following resources on the [Rivery website](https:\/\/rivery.io): \n* [Data Source to Target Overview](https:\/\/rivery.io\/docs\/data-source-to-target-overview)\n* [Documentation](https:\/\/rivery.io\/docs\/start-here)\n* [Community](https:\/\/community.rivery.io\/)\n* [Support](https:\/\/rivery.io\/docs\/working-with-rivery-support)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/rivery.html"} +{"content":"# AI and Machine Learning on Databricks\n## Prepare data and environment for ML and DL\n### Preprocess data for machine learning and deep learning\n##### Feature engineering with MLlib\n\nApache Spark MLlib contains many utility functions for performing feature engineering at scale, including methods for encoding and transforming features. These methods can also be used to process features for other machine learning libraries. 
\nDatabricks recommends the following Apache Spark MLlib guides: \n* [Extracting, transforming and selecting features with MLlib](https:\/\/spark.apache.org\/docs\/latest\/ml-features)\n* [MLlib Programming Guide](https:\/\/spark.apache.org\/docs\/latest\/ml-guide.html)\n* [Python API Reference](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/)\n* [Scala API Reference](https:\/\/api-docs.databricks.com\/scala\/spark\/latest\/org\/apache\/spark\/ml\/index.html) \nThe following PySpark-based notebook includes preprocessing steps that convert categorical data to numeric variables using category indexing and one-hot encoding.\n\n##### Feature engineering with MLlib\n###### Binary classification example\n\n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/binary-classification.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/preprocess-data\/mllib.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Enable inference tables on model serving endpoints using the API\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article explains how to use the Databricks API to enable inference tables for a [model serving endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/index.html). For general information about using inference tables, including how to enable them using the Databricks UI, see [Inference tables for monitoring and debugging models](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html). \nYou can enable inference tables when you create a new endpoint or on an existing endpoint. 
Databricks recommends that you create the endpoint with a service principal so that the inference table is not affected if the user who created the endpoint is removed from the workspace. \nThe owner of the inference tables is the user who created the endpoint. All access control lists (ACLs) on the table follow the standard Unity Catalog permissions and can be modified by the table owner.\n\n#### Enable inference tables on model serving endpoints using the API\n##### Requirements\n\n* Your workspace must have Unity Catalog enabled.\n* To enable inference tables on an endpoint, both the creator of the endpoint and the modifier need the following permissions: \n+ CAN MANAGE permission on the endpoint.\n+ `USE CATALOG` permissions on the specified catalog.\n+ `USE SCHEMA` permissions on the specified schema.\n+ `CREATE TABLE` permissions in the schema.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/enable-model-serving-inference-tables.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Enable inference tables on model serving endpoints using the API\n##### Enable inference tables at endpoint creation using the API\n\nYou can enable inference tables for an endpoint during endpoint creation using the API. For instructions on creating an endpoint, see [Create custom model serving endpoints](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/create-manage-serving-endpoints.html). \nIn the API, the request body has an `auto_capture_config` to specify: \n* The Unity Catalog catalog: string representing the catalog to store the table\n* The Unity Catalog schema: string representing the schema to store the table\n* (optional) table prefix: string used as a prefix for the inference table name. If this isn\u2019t specified, the endpoint name is used.\n* (optional) enabled: boolean value used to enable or disable inference tables. This is true by default. 
\nAfter you specify a catalog, schema, and optionally a table prefix, a table is created at `<catalog>.<schema>.<table_prefix>_payload`. The table is created automatically as a [Unity Catalog managed table](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html). The owner of the table is the user who creates the endpoint. \nNote \nSpecifying an existing table is not supported since the inference table is always automatically created on endpoint creation or endpoint updates. \nWarning \nThe inference table could become corrupted if you do any of the following: \n* Change the table schema.\n* Change the table name.\n* Delete the table.\n* Lose permissions to the Unity Catalog catalog or schema. \nIn this case, the `auto_capture_config` of the endpoint status shows a `FAILED` state for the payload table. If this happens, you must create a new endpoint to continue using inference tables. \nThe following example demonstrates how to enable inference tables during endpoint creation. 
\n```\nPOST \/api\/2.0\/serving-endpoints\n\n{\n\"name\": \"feed-ads\",\n\"config\":{\n\"served_entities\": [\n{\n\"entity_name\": \"ads1\",\n\"entity_version\": \"1\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": true\n}\n],\n\"auto_capture_config\":{\n\"catalog_name\": \"ml\",\n\"schema_name\": \"ads\",\n\"table_name_prefix\": \"feed-ads-prod\"\n}\n}\n}\n\n``` \nThe response looks like: \n```\n{\n\"name\": \"feed-ads\",\n\"creator\": \"customer@example.com\",\n\"creation_timestamp\": 1666829055000,\n\"last_updated_timestamp\": 1666829055000,\n\"state\": {\n\"ready\": \"NOT_READY\",\n\"config_update\": \"IN_PROGRESS\"\n},\n\"pending_config\": {\n\"start_time\": 1666718879000,\n\"served_entities\": [\n{\n\"name\": \"ads1-1\",\n\"entity_name\": \"ads1\",\n\"entity_version\": \"1\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": true,\n\"state\": {\n\"deployment\": \"DEPLOYMENT_CREATING\",\n\"deployment_state_message\": \"Creating\"\n},\n\"creator\": \"customer@example.com\",\n\"creation_timestamp\": 1666829055000\n}\n],\n\"config_version\": 1,\n\"traffic_config\": {\n\"routes\": [\n{\n\"served_model_name\": \"ads1-1\",\n\"traffic_percentage\": 100\n}\n]\n},\n\"auto_capture_config\": {\n\"catalog_name\": \"ml\",\n\"schema_name\": \"ads\",\n\"table_name_prefix\": \"feed-ads-prod\",\n\"state\": {\n\"payload_table\": {\n\"name\": \"feed-ads-prod_payload\"\n}\n},\n\"enabled\": true\n}\n},\n\"id\": \"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\",\n\"permission_level\": \"CAN_MANAGE\"\n}\n\n``` \nOnce logging to inference tables has been enabled, wait until your endpoint is ready. Then you can start calling it. \nAfter you create an inference table, schema evolution and adding data should be handled by the system. \nThe following operations do not impact the integrity of the table: \n* Running OPTIMIZE, ANALYZE, and VACUUM against the table.\n* Deleting old unused data. 
\nIf you don\u2019t specify an `auto_capture_config`, the settings from the previous configuration version are reused by default. For example, if inference tables were already enabled, the same settings are used on the next endpoint update; if inference tables were disabled, they remain disabled. \n```\n{\n\"served_entities\": [\n{\n\"name\":\"current\",\n\"entity_name\":\"model-A\",\n\"entity_version\":\"1\",\n\"workload_size\":\"Small\",\n\"scale_to_zero_enabled\":true\n}\n],\n\"auto_capture_config\": {\n\"enabled\": false\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/enable-model-serving-inference-tables.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Enable inference tables on model serving endpoints using the API\n##### Enable inference tables on an existing endpoint using the API\n\nYou can also enable inference tables on an existing endpoint using the API. After inference tables are enabled, continue specifying the same `auto_capture_config` body in future update endpoint API calls to continue using inference tables. \nNote \nChanging the table location after enabling inference tables is not supported. 
\n```\nPUT \/api\/2.0\/serving-endpoints\/{name}\/config\n\n{\n\"served_entities\": [\n{\n\"name\":\"current\",\n\"entity_name\":\"model-A\",\n\"entity_version\":\"1\",\n\"workload_size\":\"Small\",\n\"scale_to_zero_enabled\":true\n},\n{\n\"name\":\"challenger\",\n\"entity_name\":\"model-B\",\n\"entity_version\":\"1\",\n\"workload_size\":\"Small\",\n\"scale_to_zero_enabled\":true\n}\n],\n\"traffic_config\":{\n\"routes\": [\n{\n\"served_model_name\":\"current\",\n\"traffic_percentage\":\"50\"\n},\n{\n\"served_model_name\":\"challenger\",\n\"traffic_percentage\":\"50\"\n}\n]\n},\n\"auto_capture_config\":{\n\"catalog_name\": \"catalog\",\n\"schema_name\": \"schema\",\n\"table_name_prefix\": \"my-endpoint\"\n}\n}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/enable-model-serving-inference-tables.html"} +{"content":"# Model serving with Databricks\n## Monitor model quality and endpoint health\n#### Enable inference tables on model serving endpoints using the API\n##### Disable inference tables\n\nWhen disabling inference tables, you do not need to specify catalog, schema, or table prefix. The only required field is `enabled: false`. \n```\nPOST \/api\/2.0\/serving-endpoints\n\n{\n\"name\": \"feed-ads\",\n\"config\":{\n\"served_entities\": [\n{\n\"entity_name\": \"ads1\",\n\"entity_version\": \"1\",\n\"workload_size\": \"Small\",\n\"scale_to_zero_enabled\": true\n}\n],\n\"auto_capture_config\":{\n\"enabled\": false\n}\n}\n}\n\n``` \nTo re-enable a disabled inference table follow the instructions in [Enable inference tables on an existing endpoint](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/enable-model-serving-inference-tables.html#enable-existing). 
You can use either the same table or specify a new table.\n\n#### Enable inference tables on model serving endpoints using the API\n##### Next steps\n\nAfter you enable inference tables, you can monitor the served models in your model serving endpoint with Databricks Lakehouse Monitoring. For details, see [Workflow: Monitor model performance using inference tables](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html#model-monitoring).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/model-serving\/enable-model-serving-inference-tables.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### View feature store lineage\n\nWhen you log a model using `FeatureEngineeringClient.log_model`, the features used in the model are automatically tracked and can be viewed in the **Lineage** tab of Catalog Explorer. In addition to feature tables, Python UDFs that are used to compute on-demand features are also tracked.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/lineage.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### View feature store lineage\n##### How to capture lineage of a feature table, function, or model\n\nLineage information tracking feature tables and functions used in models is automatically captured when you call `log_model`. See the following example code. 
\n```\nimport mlflow\nfrom databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup, FeatureFunction\n\nfe = FeatureEngineeringClient()\n\n# Look up stored features and compute on-demand features with Python UDFs.\nfeatures = [\n    FeatureLookup(\n        table_name=\"main.on_demand_demo.restaurant_features\",\n        feature_names=[\"latitude\", \"longitude\"],\n        rename_outputs={\"latitude\": \"restaurant_latitude\", \"longitude\": \"restaurant_longitude\"},\n        lookup_key=\"restaurant_id\",\n        timestamp_lookup_key=\"ts\",\n    ),\n    FeatureFunction(\n        udf_name=\"main.on_demand_demo.extract_user_latitude\",\n        output_name=\"user_latitude\",\n        input_bindings={\"blob\": \"json_blob\"},\n    ),\n    FeatureFunction(\n        udf_name=\"main.on_demand_demo.extract_user_longitude\",\n        output_name=\"user_longitude\",\n        input_bindings={\"blob\": \"json_blob\"},\n    ),\n    FeatureFunction(\n        udf_name=\"main.on_demand_demo.haversine_distance\",\n        output_name=\"distance\",\n        input_bindings={\"x1\": \"restaurant_longitude\", \"y1\": \"restaurant_latitude\", \"x2\": \"user_longitude\", \"y2\": \"user_latitude\"},\n    ),\n]\n\n# label_df is assumed to be an existing DataFrame that contains the lookup key,\n# the json_blob and ts columns, and the label column.\ntraining_set = fe.create_training_set(\n    label_df,\n    feature_lookups=features,\n    label=\"label\",\n    exclude_columns=[\"restaurant_id\", \"json_blob\", \"restaurant_latitude\", \"restaurant_longitude\", \"user_latitude\", \"user_longitude\", \"ts\"],\n)\n\nclass IsClose(mlflow.pyfunc.PythonModel):\n    def predict(self, ctx, inp):\n        return (inp['distance'] < 2.5).values\n\nmodel_name = \"fe_packaged_model\"\n# Assumed three-level Unity Catalog name for the registered model; replace with your own.\nregistered_model_name = \"main.on_demand_demo.fe_packaged_model\"\nmlflow.set_registry_uri(\"databricks-uc\")\n\n# Logging the model with its training set captures the lineage automatically.\nfe.log_model(\n    IsClose(),\n    model_name,\n    flavor=mlflow.pyfunc,\n    training_set=training_set,\n    registered_model_name=registered_model_name,\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/lineage.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### View feature store lineage\n##### View the lineage of a feature table, model, or function\n\nTo view the lineage of a feature table, model, or function, follow these steps: \n1. 
Navigate to the table, model version, or function page in Catalog Explorer.\n2. Select the **Lineage** tab. The left sidebar shows Unity Catalog components that were logged with this table, model version, or function. \n![Lineage tab on model page in Catalog Explorer](https:\/\/docs.databricks.com\/_images\/model-page-lineage-tab1.png)\n3. Click **See lineage graph**. The lineage graph appears. For details about exploring the lineage graph, see [Capture and explore lineage](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html#capture-and-explore-lineage). \n![lineage screen](https:\/\/docs.databricks.com\/_images\/lineage-graph1.png)\n4. To close the lineage graph, click ![close button for lineage graph](https:\/\/docs.databricks.com\/_images\/close-lineage-graph.png) in the upper-right corner.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/lineage.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Add email and system notifications for job events\n\nYou can monitor the runs of a job and the tasks that are part of that job by configuring notifications when a run starts, completes successfully, fails, or its duration exceeds a configured threshold. Notifications can be sent to one or more email addresses or system destinations such as Slack, Microsoft Teams, PagerDuty, or any webhook-based service.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Add email and system notifications for job events\n##### Configure system notifications\n\nNote \n* For each job or task, you can configure a maximum of three system destinations for each notification event type.\n* An administrator must configure system destinations. 
System destinations are configured by selecting **Create new destination** in the **Edit system notifications** dialog or the [admin settings page](https:\/\/docs.databricks.com\/admin\/workspace-settings\/notification-destinations.html).\n* Notifications you set at the job level are not sent when failed tasks are retried. To receive a failure notification after every failed task (including every failed retry), use task notifications instead. To add system notifications for task runs, click **+ Add** next to **Notifications** in the task panel when you add or edit a job task.\n* A job that has completed in a `Succeeded with failures` state is considered to be in a successful state. To receive alerts for jobs that complete in this state, you must select **Success** when you configure notifications. \nSystem notifications integrate with popular notification tools, including: \n* Slack\n* PagerDuty\n* Microsoft Teams\n* [HTTP webhooks](https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html#webhook-payloads) \nTo add one or more system notifications when runs of this job have a notifiable event such as a job start, completion, or failure: \n1. In the **Job details** panel for your job, click **Edit notifications**.\n2. Click **Add Notification** and select a system destination in **Destination**.\n3. In **Select a system destination**, select a destination, and click the checkbox for each notification type to send to that destination.\n4. To add another destination, click **Add notification** again.\n5. Click **Confirm**. \nImportant \nThe content of Slack and Microsoft Teams messages might change in future releases. You should not implement clients or processing that depend on the specific content or formatting of these messages. 
If you require a specific schema or formatting for notifications, Databricks recommends configuring a user-defined webhook.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Add email and system notifications for job events\n##### Configure email notifications\n\nNote \n* Notifications you set at the job level are not sent when failed tasks are retried. To receive a failure notification after every failed task (including every failed retry), use task notifications instead. To add email notifications for task runs, click **+ Add** next to **Notifications** in the task panel when you add or edit a job task.\n* A job that has completed in a `Succeeded with failures` state is considered to be in a successful state. To receive alerts for jobs that complete in this state, you must select **Success** when you configure notifications. \nTo add one or more email addresses to notify when runs of this job begin, complete, or fail: \n1. In the **Job details** panel for your job, click **Edit notifications**.\n2. Click **Add Notification** and select **Email address** in **Destination**.\n3. Enter an email address and click the checkbox for each notification type to send to that address.\n4. To enter another email address for notification, click **Add notification** again.\n5. Click **Confirm**. \nYou can use email notifications to integrate with tools such as [Amazon SES and SNS](https:\/\/docs.aws.amazon.com\/ses\/latest\/DeveloperGuide\/receiving-email-setting-up.html).\n\n#### Add email and system notifications for job events\n##### Configure notifications for slow running or late jobs\n\nIf you have configured an [expected duration for a job](https:\/\/docs.databricks.com\/workflows\/jobs\/settings.html#timeout-setting-job), you can add an email or system notification if the job exceeds the configured duration. 
To receive a notification for jobs that exceed the duration threshold, click the checkbox for **Duration Warning** when you add or edit a notification.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Add email and system notifications for job events\n##### Filter out notifications for skipped or canceled runs\n\nYou can reduce the number of notifications sent by filtering out notifications when a run is skipped or canceled. To filter notifications, check **Mute notifications for skipped runs** or **Mute notifications for canceled runs** when you add or modify [email notifications](https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html#email-notifications) or [system notifications](https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html#system-notifications). \nNote \nSelecting **Mute notifications for skipped runs** or **Mute notifications for canceled runs** for a job does not filter out notifications configured for job tasks. To filter all notifications for skipped or canceled runs, you must also filter out any task-level notifications you have configured.\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks Workflows\n#### Add email and system notifications for job events\n##### HTTP webhook payloads\n\nIf you are using an HTTP webhook to send notifications, the following are example payloads sent by Databricks to your configured endpoint. 
\nNotification for a job run start event: \n```\n{\n\"event_type\": \"jobs.on_start\",\n\"workspace_id\": \"your_workspace_id\",\n\"run\": {\n\"run_id\": \"run_id\"\n},\n\"job\": {\n\"job_id\": \"job_id\",\n\"name\": \"job_name\"\n}\n}\n\n``` \nNotification for a task run start event: \n```\n{\n\"event_type\": \"jobs.on_start\",\n\"workspace_id\": \"your_workspace_id\",\n\"task\": {\n\"task_key\": \"task_name\"\n},\n\"run\": {\n\"run_id\": \"run_id_of_task\",\n\"parent_run_id\": \"run_id_of_parent_job_run\"\n},\n\"job\": {\n\"job_id\": \"job_id\",\n\"name\": \"job_name\"\n}\n}\n\n``` \nWhen configuring destinations, webhooks can be configured for the following event types: \n| Event code | When is it sent? |\n| --- | --- |\n| `jobs.on_start` | A run starts. |\n| `jobs.on_success` | A run stops and completes in a successful or succeeded with failures state. |\n| `jobs.on_failure` | A run stops in an unsuccessful state. |\n| `jobs.on_duration_warning_threshold_exceeded` | A run has been running for more than the configured expected duration. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/workflows\/jobs\/job-notifications.html"} +{"content":"# What is Delta Lake?\n### Update Delta Lake table schema\n\nDelta Lake lets you update the schema of a table. The following types of changes are supported: \n* Adding new columns (at arbitrary positions)\n* Reordering existing columns\n* Renaming existing columns \nYou can make these changes explicitly using DDL or implicitly using DML. \nImportant \nAn update to a Delta table schema is an operation that conflicts with all concurrent Delta write operations. \nWhen you update a Delta table schema, streams that read from that table terminate. If you want the stream to continue you must restart it. 
For recommended methods, see [Production considerations for Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/production.html).\n\n### Update Delta Lake table schema\n#### Explicitly update schema to add columns\n\n```\nALTER TABLE table_name ADD COLUMNS (col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name], ...)\n\n``` \nBy default, nullability is `true`. \nTo add a column to a nested field, use: \n```\nALTER TABLE table_name ADD COLUMNS (col_name.nested_col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name], ...)\n\n``` \nFor example, if the schema before running `ALTER TABLE boxes ADD COLUMNS (colB.nested STRING AFTER field1)` is: \n```\n- root\n| - colA\n| - colB\n| +-field1\n| +-field2\n\n``` \nthe schema after is: \n```\n- root\n| - colA\n| - colB\n| +-field1\n| +-nested\n| +-field2\n\n``` \nNote \nAdding nested columns is supported only for structs. Arrays and maps are not supported.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/update-schema.html"} +{"content":"# What is Delta Lake?\n### Update Delta Lake table schema\n#### Explicitly update schema to change column comment or ordering\n\n```\nALTER TABLE table_name ALTER [COLUMN] col_name (COMMENT col_comment | FIRST | AFTER colA_name)\n\n``` \nTo change a column in a nested field, use: \n```\nALTER TABLE table_name ALTER [COLUMN] col_name.nested_col_name (COMMENT col_comment | FIRST | AFTER colA_name)\n\n``` \nFor example, if the schema before running `ALTER TABLE boxes ALTER COLUMN colB.field2 FIRST` is: \n```\n- root\n| - colA\n| - colB\n| +-field1\n| +-field2\n\n``` \nthe schema after is: \n```\n- root\n| - colA\n| - colB\n| +-field2\n| +-field1\n\n```\n\n### Update Delta Lake table schema\n#### Explicitly update schema to replace columns\n\n```\nALTER TABLE table_name REPLACE COLUMNS (col_name1 col_type1 [COMMENT col_comment1], ...)\n\n``` \nFor example, when running the following DDL: \n```\nALTER TABLE boxes REPLACE COLUMNS (colC STRING, colB 
STRUCT<field2:STRING, nested:STRING, field1:STRING>, colA STRING)\n\n``` \nif the schema before is: \n```\n- root\n| - colA\n| - colB\n| +-field1\n| +-field2\n\n``` \nthe schema after is: \n```\n- root\n| - colC\n| - colB\n| +-field2\n| +-nested\n| +-field1\n| - colA\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/update-schema.html"} +{"content":"# What is Delta Lake?\n### Update Delta Lake table schema\n#### Explicitly update schema to rename columns\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nNote \nThis feature is available in Databricks Runtime 10.4 LTS and above. \nTo rename columns without rewriting any of the columns\u2019 existing data, you must enable column mapping for the table. See [Rename and drop columns with Delta Lake column mapping](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html). \nTo rename a column: \n```\nALTER TABLE table_name RENAME COLUMN old_col_name TO new_col_name\n\n``` \nTo rename a nested field: \n```\nALTER TABLE table_name RENAME COLUMN col_name.old_nested_field TO new_nested_field\n\n``` \nFor example, when you run the following command: \n```\nALTER TABLE boxes RENAME COLUMN colB.field1 TO field001\n\n``` \nIf the schema before is: \n```\n- root\n| - colA\n| - colB\n| +-field1\n| +-field2\n\n``` \nThen the schema after is: \n```\n- root\n| - colA\n| - colB\n| +-field001\n| +-field2\n\n``` \nSee [Rename and drop columns with Delta Lake column mapping](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/update-schema.html"} +{"content":"# What is Delta Lake?\n### Update Delta Lake table schema\n#### Explicitly update schema to drop columns\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nNote \nThis feature is available in Databricks Runtime 11.3 LTS and above. 
\nTo drop columns as a metadata-only operation without rewriting any data files, you must enable column mapping for the table. See [Rename and drop columns with Delta Lake column mapping](https:\/\/docs.databricks.com\/delta\/delta-column-mapping.html). \nImportant \nDropping a column from metadata does not delete the underlying data for the column in files. To purge the dropped column data, you can use [REORG TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-reorg-table.html) to rewrite files. You can then use [VACUUM](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-vacuum.html) to physically delete the files that contain the dropped column data. \nTo drop a column: \n```\nALTER TABLE table_name DROP COLUMN col_name\n\n``` \nTo drop multiple columns: \n```\nALTER TABLE table_name DROP COLUMNS (col_name_1, col_name_2)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/update-schema.html"} +{"content":"# What is Delta Lake?\n### Update Delta Lake table schema\n#### Explicitly update schema to change column type or name\n\nYou can change a column\u2019s type or name or drop a column by rewriting the table. To do this, use the `overwriteSchema` option. 
\nThe following example shows changing a column type: \n```\n(spark.read.table(...)\n.withColumn(\"birthDate\", col(\"birthDate\").cast(\"date\"))\n.write\n.mode(\"overwrite\")\n.option(\"overwriteSchema\", \"true\")\n.saveAsTable(...)\n)\n\n``` \nThe following example shows changing a column name: \n```\n(spark.read.table(...)\n.withColumnRenamed(\"dateOfBirth\", \"birthDate\")\n.write\n.mode(\"overwrite\")\n.option(\"overwriteSchema\", \"true\")\n.saveAsTable(...)\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/update-schema.html"} +{"content":"# What is Delta Lake?\n### Update Delta Lake table schema\n#### Add columns with automatic schema update\n\nColumns that are present in the DataFrame but missing from the table are automatically added as part of a write transaction when: \n* You use the `WITH SCHEMA EVOLUTION` SQL syntax. See [Schema evolution syntax for SQL](https:\/\/docs.databricks.com\/delta\/update-schema.html#sql-evo).\n* `write` or `writeStream` have `.option(\"mergeSchema\", \"true\")`\n* `spark.databricks.delta.schema.autoMerge.enabled` is `true` \nWhen both options are specified, the option from the `DataFrameWriter` takes precedence. The added columns are appended to the end of the struct they are present in. Case is preserved when appending a new column. \nNote \n* `mergeSchema` cannot be used with `INSERT INTO` or `.write.insertInto()`. 
\n### Schema evolution syntax for SQL \nIn Databricks Runtime 15.2 and above, you can specify schema evolution in a merge statement using the following syntax: \n```\nMERGE WITH SCHEMA EVOLUTION INTO target_table\nUSING source\nON source.key = target_table.key\nWHEN MATCHED THEN\nUPDATE SET *\nWHEN NOT MATCHED THEN\nINSERT *\nWHEN NOT MATCHED BY SOURCE THEN\nDELETE\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/update-schema.html"} +{"content":"# What is Delta Lake?\n### Update Delta Lake table schema\n#### Automatic schema evolution for Delta Lake merge\n\nSchema evolution allows users to resolve schema mismatches between the target and source tables in a merge. It handles the following two cases: \n1. A column in the source table is not present in the target table. The new column is added to the target schema, and its values are inserted or updated using the source values.\n2. A column in the target table is not present in the source table. The target schema is left unchanged; the values in the additional target column are either left unchanged (for `UPDATE`) or set to `NULL` (for `INSERT`). \nImportant \nTo use schema evolution, you must set the Spark session configuration `spark.databricks.delta.schema.autoMerge.enabled` to `true` before you run the `merge` command. \nNote \n* In Databricks Runtime 12.2 LTS and above, columns present in the source table can be specified by name in insert or update actions. In Databricks Runtime 11.3 LTS and below, only `INSERT *` or `UPDATE SET *` actions can be used for schema evolution with merge. \nHere are a few examples of the effects of the `merge` operation with and without schema evolution. 
\n| Columns | Query (in SQL) | Behavior without schema evolution (default) | Behavior with schema evolution |\n| --- | --- | --- | --- |\n| Target columns: `key, value` Source columns: `key, value, new_value` | ``` MERGE INTO target_table t USING source_table s ON t.key = s.key WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * ``` | The table schema remains unchanged; only columns `key`, `value` are updated\/inserted. | The table schema is changed to `(key, value, new_value)`. Existing records with matches are updated with the `value` and `new_value` in the source. New rows are inserted with the schema `(key, value, new_value)`. |\n| Target columns: `key, old_value` Source columns: `key, new_value` | ``` MERGE INTO target_table t USING source_table s ON t.key = s.key WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * ``` | `UPDATE` and `INSERT` actions throw an error because the target column `old_value` is not in the source. | The table schema is changed to `(key, old_value, new_value)`. Existing records with matches are updated with the `new_value` in the source leaving `old_value` unchanged. New records are inserted with the specified `key`, `new_value`, and `NULL` for the `old_value`. |\n| Target columns: `key, old_value` Source columns: `key, new_value` | ``` MERGE INTO target_table t USING source_table s ON t.key = s.key WHEN MATCHED THEN UPDATE SET new_value = s.new_value ``` | `UPDATE` throws an error because column `new_value` does not exist in the target table. | The table schema is changed to `(key, old_value, new_value)`. Existing records with matches are updated with the `new_value` in the source leaving `old_value` unchanged, and unmatched records have `NULL` entered for `new_value`. See note [(1)](https:\/\/docs.databricks.com\/delta\/update-schema.html#1). 
|\n| Target columns: `key, old_value` Source columns: `key, new_value` | ``` MERGE INTO target_table t USING source_table s ON t.key = s.key WHEN NOT MATCHED THEN INSERT (key, new_value) VALUES (s.key, s.new_value) ``` | `INSERT` throws an error because column `new_value` does not exist in the target table. | The table schema is changed to `(key, old_value, new_value)`. New records are inserted with the specified `key`, `new_value`, and `NULL` for the `old_value`. Existing records have `NULL` entered for `new_value` leaving `old_value` unchanged. See note [(1)](https:\/\/docs.databricks.com\/delta\/update-schema.html#1). | \n**(1)** This behavior is available in Databricks Runtime 12.2 LTS and above; Databricks Runtime 11.3 LTS and below error in this condition.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/update-schema.html"} +{"content":"# What is Delta Lake?\n### Update Delta Lake table schema\n#### Exclude columns with Delta Lake merge\n\nIn Databricks Runtime 12.2 LTS and above, you can use `EXCEPT` clauses in merge conditions to explicitly exclude columns. The behavior of the `EXCEPT` keyword varies depending on whether or not schema evolution is enabled. \nWith schema evolution disabled, the `EXCEPT` keyword applies to the list of columns in the target table and allows excluding columns from `UPDATE` or `INSERT` actions. Excluded columns are set to `null`. \nWith schema evolution enabled, the `EXCEPT` keyword applies to the list of columns in the source table and allows excluding columns from schema evolution. A new column in the source that is not present in the target is not added to the target schema if it is listed in the `EXCEPT` clause. Excluded columns that are already present in the target are set to `null`. 
\nThe following examples demonstrate this syntax: \n| Columns | Query (in SQL) | Behavior without schema evolution (default) | Behavior with schema evolution |\n| --- | --- | --- | --- |\n| Target columns: `id, title, last_updated` Source columns: `id, title, review, last_updated` | ``` MERGE INTO target t USING source s ON t.id = s.id WHEN MATCHED THEN UPDATE SET last_updated = current_date() WHEN NOT MATCHED THEN INSERT * EXCEPT (last_updated) ``` | Matched rows are updated by setting the `last_updated` field to the current date. New rows are inserted using values for `id` and `title`. The excluded field `last_updated` is set to `null`. The field `review` is ignored because it is not in the target. | Matched rows are updated by setting the `last_updated` field to the current date. Schema is evolved to add the field `review`. New rows are inserted using all source fields except `last_updated` which is set to `null`. |\n| Target columns: `id, title, last_updated` Source columns: `id, title, review, internal_count` | ``` MERGE INTO target t USING source s ON t.id = s.id WHEN MATCHED THEN UPDATE SET last_updated = current_date() WHEN NOT MATCHED THEN INSERT * EXCEPT (last_updated, internal_count) ``` | `INSERT` throws an error because column `internal_count` does not exist in the target table. | Matched rows are updated by setting the `last_updated` field to the current date. The `review` field is added to the target table, but the `internal_count` field is ignored. New rows inserted have `last_updated` set to `null`. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/update-schema.html"} +{"content":"# What is Delta Lake?\n### Update Delta Lake table schema\n#### Automatic schema evolution for arrays of structs\n\nDelta `MERGE INTO` supports resolving struct fields by name and evolving schemas for arrays of structs. With schema evolution enabled, target table schemas will evolve for arrays of structs, which also works with any nested structs inside of arrays. 
\nNote \nIn Databricks Runtime 12.2 LTS and above, struct fields present in the source table can be specified by name in insert or update commands. In Databricks Runtime 11.3 LTS and below, only `INSERT *` or `UPDATE SET *` commands can be used for schema evolution with merge. \nHere are a few examples of the effects of merge operations with and without schema evolution for arrays of structs. \n| Source schema | Target schema | Behavior without schema evolution (default) | Behavior with schema evolution |\n| --- | --- | --- | --- |\n| array<struct<b: string, a: string>> | array<struct<a: int, b: int>> | The table schema remains unchanged. Columns will be resolved by name and updated or inserted. | The table schema remains unchanged. Columns will be resolved by name and updated or inserted. |\n| array<struct<a: int, c: string, d: string>> | array<struct<a: string, b: string>> | `update` and `insert` throw errors because `c` and `d` do not exist in the target table. | The table schema is changed to array<struct<a: string, b: string, c: string, d: string>>. `c` and `d` are inserted as `NULL` for existing entries in the target table. `update` and `insert` fill entries in the source table with `a` cast to string and `b` as `NULL`. |\n| array<struct<a: string, b: struct<c: string, d: string>>> | array<struct<a: string, b: struct<c: string>>> | `update` and `insert` throw errors because `d` does not exist in the target table. | The target table schema is changed to array<struct<a: string, b: struct<c: string, d: string>>>. `d` is inserted as `NULL` for existing entries in the target table. |\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/update-schema.html"} +{"content":"# What is Delta Lake?\n### Update Delta Lake table schema\n#### Dealing with `NullType` columns in schema updates\n\nBecause Parquet doesn\u2019t support `NullType`, `NullType` columns are dropped from the DataFrame when writing into Delta tables, but are still stored in the schema. 
When a different data type is received for that column, Delta Lake merges the schema to the new data type. If Delta Lake receives a `NullType` for an existing column, the old schema is retained and the new column is dropped during the write. \n`NullType` in streaming is not supported. Since you must set schemas when using streaming, this should be very rare. `NullType` is also not accepted for complex types such as `ArrayType` and `MapType`.\n\n### Update Delta Lake table schema\n#### Replace table schema\n\nBy default, overwriting the data in a table does not overwrite the schema. When overwriting a table using `mode(\"overwrite\")` without `replaceWhere`, you may still want to overwrite the schema of the data being written. You replace the schema and partitioning of the table by setting the `overwriteSchema` option to `true`: \n```\ndf.write.option(\"overwriteSchema\", \"true\")\n\n``` \nImportant \nYou cannot specify `overwriteSchema` as `true` when using dynamic partition overwrite.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/update-schema.html"} +{"content":"# Technology partners\n## What is Databricks Partner Connect?\n#### Troubleshoot Partner Connect\n\nThis section provides information to help address common issues with Partner Connect.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/troubleshoot.html"} +{"content":"# Technology partners\n## What is Databricks Partner Connect?\n#### Troubleshoot Partner Connect\n##### Troubleshoot general errors\n\nYou might see the following general errors when you try to connect using Partner Connect. \n### When trying to connect to a partner, an error message is displayed \n**Issue**: When you try to connect your Databricks workspace to a partner solution by using Partner Connect, an error displays, and you cannot create the connection. 
\n**Causes**: \nThere are multiple reasons why this issue may occur: \n* If multiple individuals within an organization try to create an account with a partner, only the first individual succeeds. This is because the partner may offer some accounts at only the organizational level, and the first individual who creates such an account also establishes the account for the organization.\n* If you already have an account with the partner, the connection may still fail. This is because the partner may mistakenly try to create a duplicate account. \n**Solutions**: \nDo one of the following: \n* Ask the first individual who created the organizational account with the partner to add your email address to that account. Then bypass Partner Connect and sign in to the partner account directly to begin using the partner\u2019s solution.\n* Try making the connection again through Partner Connect, but this time, specify an email address that is not associated with your organization\u2019s domain, such as a personal email address. This may require you to also add that email address as a user to your Databricks workspace.\n* Bypass Partner Connect and sign in to the partner directly with your existing account, and begin using the partner solution. \nIf your workspace is not already connected after you sign in to the partner, complete the connection by following the instructions in the appropriate partner connection guide. \n### When trying to sign in to a partner\u2019s account or website, a pop-up blocker is displayed \n**Issue**: For a partner solution that uses Partner Connect to sign in to the partner\u2019s account or website, when you click **Sign in**, Partner Connect opens a new tab in your web browser and a pop-up blocker is displayed. This pop-up blocker prevents you from signing in to the partner\u2019s account or website. \n**Cause**: Your web browser is blocking pop-ups. \n**Solution**: Allow pop-ups for the partner\u2019s website in your web browser. 
Specific instructions vary by web browser. For example, for Google Chrome, see [Block or allow popups in Chrome](https:\/\/support.google.com\/chrome\/answer\/95472) on the Google Chrome Help website. For other web browsers, search the Internet with a phrase such as \u201chow do I allow pop-ups for a specific website?\u201d \n### Insufficient permissions to create a partner connection \n**Issue**: When you try to create the partner connection, a permissions error displays and you can\u2019t create the connection. \n**Cause**: This issue might occur because you aren\u2019t a workspace admin. \n**Solution**: An account admin or a workspace admin must add your user to the admins group. See [Manage groups](https:\/\/docs.databricks.com\/admin\/users-groups\/groups.html). \n### Insufficient permissions to create a partner connection to data managed by Unity Catalog \n**Issue**: When you try to create the partner connection, the following *missing permissions* error displays and you can\u2019t create the connection: \n```\nYou do not have permission to grant <read|write> access to the workspace default catalog. To enable granting <read|write> access to the default catalog or other catalogs, contact an account admin\n\n``` \n**Cause**: This issue might occur because you don\u2019t own the workspace default catalog. If a partner solution isn\u2019t enabled for Unity Catalog with Partner Connect, you can only connect to the default catalog, and the default catalog must be `hive_metastore` or another catalog that you explicitly own. \n**Solution**: Do one of the following: \n* Ask an owner of the default catalog to create the connection.\n* Ask an owner of the default catalog to add you to the group that owns the default catalog. See [Manage groups](https:\/\/docs.databricks.com\/admin\/users-groups\/groups.html).\n* Ask an account admin to change the default catalog by running `databricks metastores assign`. 
In the following command example, replace these placeholders: \n+ Replace `<workspace-id>` with the ID of the workspace.\n+ Replace `<metastore-id>` with the ID of the metastore.\n+ Replace `<catalog-name>` with the name of the default catalog.\n```\ndatabricks metastores assign <workspace-id> <metastore-id> <catalog-name>\n\n``` \nWarning \nChanging the default catalog can break existing data operations that depend on it. In particular, data operations that don\u2019t specify the catalog assume the default catalog. Changing the default causes those operations to be unable to find the data objects that they reference. \nFor more information, see [Link a metastore with a workspace](https:\/\/docs.databricks.com\/archive\/dev-tools\/cli\/unity-catalog-cli.html#link-a-metastore-with-a-workspace). \n### Existing partner connection can no longer access Databricks data \n**Issue**: When you sign in to an existing partner account that connects to your workspace using Partner Connect, and the partner tries to access Databricks data, an error displays, stating that you don\u2019t have permissions to access the workspace default catalog. \n**Cause**: If your workspace is Unity Catalog-enabled, this issue might occur because the default catalog isn\u2019t `hive_metastore`. If the partner solution doesn\u2019t support connections to data managed by Unity Catalog using Partner Connect, this breaks your existing connection. \n**Solution**: Do one of the following: \n* Disconnect from the partner and then reconnect, specifying the default catalog. See [Disconnect from a partner](https:\/\/docs.databricks.com\/partner-connect\/admin.html#disconnect). This solution requires that you own the default catalog.\n* Ask an account admin to change the default catalog to `hive_metastore` by running `databricks metastores assign`. 
In the following command example, replace these placeholders: \n+ Replace `<workspace-id>` with the ID of the workspace.\n+ Replace `<metastore-id>` with the ID of the metastore.\n```\ndatabricks metastores assign <workspace-id> <metastore-id> hive_metastore\n\n``` \nWarning \nChanging the default catalog can break existing data operations that depend on it. In particular, data operations that don\u2019t specify the catalog assume the default catalog. Changing the default causes those operations to be unable to find the data objects that they reference. \nFor more information, see [Link a metastore with a workspace](https:\/\/docs.databricks.com\/archive\/dev-tools\/cli\/unity-catalog-cli.html#link-a-metastore-with-a-workspace).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/troubleshoot.html"} +{"content":"# Technology partners\n## What is Databricks Partner Connect?\n#### Troubleshoot Partner Connect\n##### Troubleshoot security errors\n\nThe security configurations mentioned in this section might cause Partner Connect integrations to fail. \n### IP access lists \n**Issue**: The following warning appears in the **Connect to partner** dialog box, and you\u2019re unable to create a connection to a partner: \n```\nPartner Connect might not work because IP access list is enabled.\n\n``` \n**Cause**: If the IP access list feature is enabled on your workspace, you might have to add the partner integration to the allowlist. \n**Solution**: For steps to add partner integrations to the allowlist, see the partner documentation or contact the partner\u2019s support team. 
\n* dbt: [IP Restrictions](https:\/\/docs.getdbt.com\/docs\/get-started\/connect-your-database#ip-restrictions)\n* Fivetran: [Fivetran IP Addresses](https:\/\/fivetran.com\/docs\/getting-started\/ips)\n* Hevo Data: [Verifying IP Address Whitelisting](https:\/\/docs.hevodata.com\/getting-started\/creating-your-hevo-account\/regions\/#verifying-ip-address-whitelisting)\n* Hex: [Allow connections from Hex IP addresses](https:\/\/learn.hex.tech\/docs\/connect-to-data\/allow-connections-from-hex-ip-addresses)\n* Hightouch: [IP addresses](https:\/\/hightouch.com\/docs\/security\/networking#ip-addresses)\n* Labelbox: [Labelbox IP address to whitelist](https:\/\/docs.labelbox.com\/docs\/webhooks)\n* Lightup: [Whitelist Lightup IP addresses (recommended)](https:\/\/docs.lightup.ai\/docs\/lightup-cloud#whitelist-lightup-ip-address-recommended)\n* Preset: [Connecting Your Data](https:\/\/docs.preset.io\/docs\/connecting-your-data)\n* Qlik Sense: [Allowlisting domain names and IP addresses](https:\/\/help.qlik.com\/en-US\/cloud-services\/Subsystems\/Hub\/Content\/Sense_Hub\/Introduction\/qlik-cloud-dns-ip.htm)\n* Rivery: [Rivery\u2019s Whitelist IPs](https:\/\/docs.rivery.io\/docs\/rivery-whitelist-ips) \nFor more information about IP access lists, see [Configure IP access lists for workspaces](https:\/\/docs.databricks.com\/security\/network\/front-end\/ip-access-list-workspace.html). \n### AWS PrivateLink \nFollow the solution in this section to configure AWS PrivateLink settings to enable partner integrations if: \n* You enabled a front-end PrivateLink connection for your workspace\n* You encounter the following error in Partner Connect \n```\nPartner Connect might not work because front-end PrivateLink is enabled.\n\n``` \n**Solution**: Configure PrivateLink settings to enable partner integrations. For steps to configure PrivateLink settings, see the partner documentation or contact the partner\u2019s support team. 
\n* Fivetran: \n+ [AWS PrivateLink](https:\/\/fivetran.com\/docs\/databases\/connection-options#awsprivatelink)\n+ [(Optional) Connect using AWS PrivateLink](https:\/\/fivetran.com\/docs\/destinations\/databricks\/databricks-setup-guide#optionalconnectusingawsprivatelink)\n* Preset: [Announcing Preset Managed Private Cloud for AWS](https:\/\/preset.io\/blog\/announcing-preset-managed-private-cloud-for-aws\/)\n* Rivery: [Configuring AWS PrivateLink](https:\/\/docs.rivery.io\/docs\/configuring-aws-privatelink)\n* ThoughtSpot: [Enabling an AWS PrivateLink between ThoughtSpot Cloud and your Databricks data warehouse](https:\/\/docs.thoughtspot.com\/cloud\/latest\/connections-databricks-private-link) \nFor more information about AWS PrivateLink, see [Enable AWS PrivateLink](https:\/\/docs.databricks.com\/security\/network\/classic\/privatelink.html). \n### Personal access tokens \n**Issue**: When you try to create a connection, the following **Missing permissions** error displays: \n```\nPartner Connect requires personal access tokens to be enabled.\n\n``` \n**Cause**: If token-based authentication is disabled on your workspace, you can\u2019t use Partner Connect to integrate with partner solutions unless you use your Databricks username and password to authenticate with Power BI or Tableau. Username and password authentication might be disabled if your Databricks workspace is [enabled for single sign-on (SSO)](https:\/\/docs.databricks.com\/admin\/users-groups\/single-sign-on\/index.html). \nFor more information about personal access tokens, see [Monitor and manage personal access tokens](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partner-connect\/troubleshoot.html"} +{"content":"# Security and compliance guide\n### Networking\n\nThis article introduces networking configurations for the deployment and management of Databricks accounts and workspaces. 
\nNote \nThere are currently no networking charges for serverless features. In a later release, you might be charged. Databricks will provide advance notice for networking pricing changes.\n\n### Networking\n#### Databricks architecture overview\n\nDatabricks operates out of a *control plane* and a *compute plane*. \n* The **control plane** includes the backend services that Databricks manages in your Databricks account. The web application is in the control plane.\n* The **compute plane** is where your data is processed. There are two types of compute planes depending on the compute that you are using. \n+ For classic Databricks compute, the compute resources are in your AWS account in what is called the *classic compute plane*. This refers to the network in your AWS account and its resources.\n+ For serverless compute, the serverless compute resources run in a *serverless compute plane* in your Databricks account. \nFor additional architecture information, see [Databricks architecture overview](https:\/\/docs.databricks.com\/getting-started\/overview.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/index.html"} +{"content":"# Security and compliance guide\n### Networking\n#### Secure network connectivity\n\nDatabricks provides a secure networking environment by default, but if your organization has additional needs, you can configure network connectivity features between the different networking connections shown in the diagram below. \n![Network connectivity overview diagram](https:\/\/docs.databricks.com\/_images\/networking-classic-serverless.png) \n1. **Users and applications to Databricks**: You can configure features to control access and provide private connectivity between users and their Databricks workspaces. See [Users to Databricks networking](https:\/\/docs.databricks.com\/security\/network\/front-end\/index.html).\n2. 
**The control plane and the classic compute plane**: Classic compute resources, such as clusters, are deployed in your AWS account and connect to the control plane. You can use classic network connectivity features to deploy classic compute plane resources in your own virtual private cloud and to enable private connectivity from the clusters to the control plane. See [Classic compute plane networking](https:\/\/docs.databricks.com\/security\/network\/classic\/index.html).\n3. **The serverless compute plane and storage**: You can configure firewalls on your resources to allow access from the Databricks serverless compute plane. See [Serverless compute plane networking](https:\/\/docs.databricks.com\/security\/network\/serverless-network-security\/index.html). \nYou can configure your AWS storage networking features to secure the connection between the classic compute plane and S3. For more information, see [Configure Databricks S3 commit service-related settings](https:\/\/docs.databricks.com\/security\/network\/classic\/s3-commit-service.html) and [Networking recommendations for Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/networking.html). \nConnectivity between the control plane and the serverless compute plane is always over the cloud network backbone and not the public internet.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/network\/index.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Query semi-structured data in Databricks\n\nThis article describes the Databricks SQL operators you can use to query and transform semi-structured data stored as JSON. \nNote \nThis feature lets you read semi-structured data without flattening the files. However, for optimal read query performance Databricks recommends that you extract nested columns with the correct data types. 
\nYou extract a column from fields containing JSON strings using the syntax `<column-name>:<extraction-path>`, where `<column-name>` is the string column name and `<extraction-path>` is the path to the field to extract. The returned results are strings.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/semi-structured.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Query semi-structured data in Databricks\n##### Create a table with highly nested data\n\nRun the following query to create a table with highly nested data. The examples in this article all reference this table. \n```\nCREATE TABLE store_data AS SELECT\n'{\n\"store\":{\n\"fruit\": [\n{\"weight\":8,\"type\":\"apple\"},\n{\"weight\":9,\"type\":\"pear\"}\n],\n\"basket\":[\n[1,2,{\"b\":\"y\",\"a\":\"x\"}],\n[3,4],\n[5,6]\n],\n\"book\":[\n{\n\"author\":\"Nigel Rees\",\n\"title\":\"Sayings of the Century\",\n\"category\":\"reference\",\n\"price\":8.95\n},\n{\n\"author\":\"Herman Melville\",\n\"title\":\"Moby Dick\",\n\"category\":\"fiction\",\n\"price\":8.99,\n\"isbn\":\"0-553-21311-3\"\n},\n{\n\"author\":\"J. R. R. Tolkien\",\n\"title\":\"The Lord of the Rings\",\n\"category\":\"fiction\",\n\"reader\":[\n{\"age\":25,\"name\":\"bob\"},\n{\"age\":26,\"name\":\"jack\"}\n],\n\"price\":22.99,\n\"isbn\":\"0-395-19395-8\"\n}\n],\n\"bicycle\":{\n\"price\":19.95,\n\"color\":\"red\"\n}\n},\n\"owner\":\"amy\",\n\"zip code\":\"94025\",\n\"fb:testid\":\"1234\"\n}' as raw\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/semi-structured.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Query semi-structured data in Databricks\n##### Extract a top-level column\n\nTo extract a column, specify the name of the JSON field in your extraction path. \nYou can provide column names within brackets. Columns referenced inside brackets are\nmatched case *sensitively*. 
The column name is also referenced case insensitively. \n```\nSELECT raw:owner, RAW:owner FROM store_data\n\n``` \n```\n+-------+-------+\n| owner | owner |\n+-------+-------+\n| amy | amy |\n+-------+-------+\n\n``` \n```\n-- References are case sensitive when you use brackets\nSELECT raw:OWNER case_insensitive, raw:['OWNER'] case_sensitive FROM store_data\n\n``` \n```\n+------------------+----------------+\n| case_insensitive | case_sensitive |\n+------------------+----------------+\n| amy | null |\n+------------------+----------------+\n\n``` \nUse backticks to escape spaces and special characters. The field names are matched case *insensitively*. \n```\n-- Use backticks to escape special characters. References are case insensitive when you use backticks.\n-- Use brackets to make them case sensitive.\nSELECT raw:`zip code`, raw:`Zip Code`, raw:['fb:testid'] FROM store_data\n\n``` \n```\n+----------+----------+-----------+\n| zip code | Zip Code | fb:testid |\n+----------+----------+-----------+\n| 94025 | 94025 | 1234 |\n+----------+----------+-----------+\n\n``` \nNote \nIf a JSON record contains multiple columns that can match your extraction path due to case insensitive matching, you will receive an error asking you to use brackets. If you have matches of columns across rows, you will not receive any errors. The following will throw an error: `{\"foo\":\"bar\", \"Foo\":\"bar\"}`, and the following won\u2019t throw an error: \n```\n{\"foo\":\"bar\"}\n{\"Foo\":\"bar\"}\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/semi-structured.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Query semi-structured data in Databricks\n##### Extract nested fields\n\nYou specify nested fields through dot notation or using brackets. When you use brackets, columns are matched case sensitively. 
\n```\n-- Use dot notation\nSELECT raw:store.bicycle FROM store_data\n-- the column returned is a string\n\n``` \n```\n+------------------+\n| bicycle |\n+------------------+\n| { |\n| \"price\":19.95, |\n| \"color\":\"red\" |\n| } |\n+------------------+\n\n``` \n```\n-- Use brackets\nSELECT raw:store['bicycle'], raw:store['BICYCLE'] FROM store_data\n\n``` \n```\n+------------------+---------+\n| bicycle | BICYCLE |\n+------------------+---------+\n| { | null |\n| \"price\":19.95, | |\n| \"color\":\"red\" | |\n| } | |\n+------------------+---------+\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/semi-structured.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Query semi-structured data in Databricks\n##### Extract values from arrays\n\nYou index elements in arrays with brackets. Indices are 0-based. You can use an asterisk (`*`) followed by dot or bracket notation to extract subfields from all elements in an array. 
\n```\n-- Index elements\nSELECT raw:store.fruit[0], raw:store.fruit[1] FROM store_data\n\n``` \n```\n+------------------+-----------------+\n| fruit | fruit |\n+------------------+-----------------+\n| { | { |\n| \"weight\":8, | \"weight\":9, |\n| \"type\":\"apple\" | \"type\":\"pear\" |\n| } | } |\n+------------------+-----------------+\n\n``` \n```\n-- Extract subfields from arrays\nSELECT raw:store.book[*].isbn FROM store_data\n\n``` \n```\n+--------------------+\n| isbn |\n+--------------------+\n| [ |\n| null, |\n| \"0-553-21311-3\", |\n| \"0-395-19395-8\" |\n| ] |\n+--------------------+\n\n``` \n```\n-- Access arrays within arrays or structs within arrays\nSELECT\nraw:store.basket[*],\nraw:store.basket[*][0] first_of_baskets,\nraw:store.basket[0][*] first_basket,\nraw:store.basket[*][*] all_elements_flattened,\nraw:store.basket[0][2].b subfield\nFROM store_data\n\n``` \n```\n+----------------------------+------------------+---------------------+---------------------------------+----------+\n| basket | first_of_baskets | first_basket | all_elements_flattened | subfield |\n+----------------------------+------------------+---------------------+---------------------------------+----------+\n| [ | [ | [ | [1,2,{\"b\":\"y\",\"a\":\"x\"},3,4,5,6] | y |\n| [1,2,{\"b\":\"y\",\"a\":\"x\"}], | 1, | 1, | | |\n| [3,4], | 3, | 2, | | |\n| [5,6] | 5 | {\"b\":\"y\",\"a\":\"x\"} | | |\n| ] | ] | ] | | |\n+----------------------------+------------------+---------------------+---------------------------------+----------+\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/semi-structured.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Query semi-structured data in Databricks\n##### Cast values\n\nYou can use `::` to cast values to basic data types. 
Use the [from\\_json](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-functions-builtin.html#json-functions) method to cast nested results into more complex data types, such as arrays or structs. \n```\n-- price is returned as a double, not a string\nSELECT raw:store.bicycle.price::double FROM store_data\n\n``` \n```\n+------------------+\n| price |\n+------------------+\n| 19.95 |\n+------------------+\n\n``` \n```\n-- use from_json to cast into more complex types\nSELECT from_json(raw:store.bicycle, 'price double, color string') bicycle FROM store_data\n-- the column returned is a struct containing the columns price and color\n\n``` \n```\n+------------------+\n| bicycle |\n+------------------+\n| { |\n| \"price\":19.95, |\n| \"color\":\"red\" |\n| } |\n+------------------+\n\n``` \n```\nSELECT from_json(raw:store.basket[*], 'array<array<string>>') baskets FROM store_data\n-- the column returned is an array of string arrays\n\n``` \n```\n+------------------------------------------+\n| basket |\n+------------------------------------------+\n| [ |\n| [\"1\",\"2\",\"{\\\"b\\\":\\\"y\\\",\\\"a\\\":\\\"x\\\"}]\", |\n| [\"3\",\"4\"], |\n| [\"5\",\"6\"] |\n| ] |\n+------------------------------------------+\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/semi-structured.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n#### Query semi-structured data in Databricks\n##### NULL behavior\n\nWhen a JSON field exists with a `null` value, you will receive a SQL `null` value for that column, not a `null` text value. 
\n```\nselect '{\"key\":null}':key is null sql_null, '{\"key\":null}':key == 'null' text_null\n\n``` \n```\n+-------------+-----------+\n| sql_null | text_null |\n+-------------+-----------+\n| true | null |\n+-------------+-----------+\n\n```\n\n#### Query semi-structured data in Databricks\n##### Transform nested data using Spark SQL operators\n\nApache Spark has a number of built-in functions for working with complex and nested data. The following notebook contains examples. \nAdditionally, [higher order functions](https:\/\/docs.databricks.com\/optimizations\/higher-order-lambda-functions.html) provide many additional options when built-in Spark operators aren\u2019t available for transforming data the way you want. \n### Complex nested data notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/complex-nested-structured.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/semi-structured.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### What is the medallion lakehouse architecture?\n\nThe medallion architecture describes a series of data layers that denote the quality of data stored in the lakehouse. Databricks recommends taking a multi-layered approach to building a single source of truth for enterprise data products. This architecture guarantees atomicity, consistency, isolation, and durability as data passes through multiple layers of validations and transformations before being stored in a layout optimized for efficient analytics. The terms [bronze](https:\/\/docs.databricks.com\/lakehouse\/medallion.html#bronze) (raw), [silver](https:\/\/docs.databricks.com\/lakehouse\/medallion.html#silver) (validated), and [gold](https:\/\/docs.databricks.com\/lakehouse\/medallion.html#gold) (enriched) describe the quality of the data in each of these layers. 
\nIt is important to note that this medallion architecture does not replace other dimensional modeling techniques. Schemas and tables within each layer can take on a variety of forms and degrees of normalization depending on the frequency and nature of data updates and the downstream use cases for the data. \nOrganizations can leverage the Databricks lakehouse to create and maintain validated datasets accessible throughout the company. Adopting an organizational mindset focused on curating data-as-products is a key step in successfully building a data lakehouse.\n\n#### What is the medallion lakehouse architecture?\n##### Ingest raw data to the bronze layer\n\nThe bronze layer contains unvalidated data. Data ingested in the bronze layer typically: \n* Maintains the raw state of the data source.\n* Is appended incrementally and grows over time.\n* Can be any combination of streaming and batch transactions. \nRetaining the full, unprocessed history of each dataset in an efficient storage format provides the ability to recreate any state of a given data system. \nAdditional metadata (such as source file names or recording the time data was processed) may be added to data on ingest for enhanced discoverability, description of the state of the source dataset, and optimized performance in downstream applications.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/medallion.html"} +{"content":"# What is Databricks?\n## What is a data lakehouse?\n#### What is the medallion lakehouse architecture?\n##### Validate and deduplicate data in the silver layer\n\nRecall that while the bronze layer contains the entire data history in a nearly raw state, the silver layer represents a validated, enriched version of our data that can be trusted for downstream analytics. 
\nWhile Databricks believes strongly in the lakehouse vision driven by bronze, silver, and gold tables, simply implementing a silver layer efficiently will immediately unlock many of the potential benefits of the lakehouse. \nFor any data pipeline, the silver layer may contain more than one table.\n\n#### What is the medallion lakehouse architecture?\n##### Power analytics with the gold layer\n\nThis gold data is often highly refined and aggregated, containing data that powers analytics, machine learning, and production applications. While all tables in the lakehouse should serve an important purpose, gold tables represent data that has been transformed into knowledge, rather than just information. \nAnalysts largely rely on gold tables for their core responsibilities, and data shared with a customer would rarely be stored outside this level. \nUpdates to these tables are completed as part of regularly scheduled production workloads, which helps control costs and allows service level agreements (SLAs) for data freshness to be established. \nWhile the lakehouse doesn\u2019t have the same deadlock issues that you may encounter in an enterprise data warehouse, gold tables are often stored in a separate storage container to help avoid cloud limits on data requests. \nIn general, because aggregations, joins, and filtering are handled before data is written to the gold layer, users should see low-latency query performance on data in gold tables.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse\/medallion.html"} +{"content":"# Databricks data engineering\n## What are init scripts?\n#### Allowlist libraries and init scripts on shared compute\n\nIn Databricks Runtime 13.3 LTS and above, you can add libraries and init scripts to the `allowlist` in Unity Catalog. This allows users to leverage these artifacts on compute configured with shared access mode. \nYou can allowlist a directory or filepath before that directory or file exists. 
See [Upload files to a Unity Catalog volume](https:\/\/docs.databricks.com\/ingestion\/add-data\/upload-to-volume.html). \nNote \nYou must be a metastore admin or have the `MANAGE ALLOWLIST` privilege to modify the allowlist. See [MANAGE ALLOWLIST](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html#manage-allowlist). \nImportant \nLibraries used as JDBC drivers or custom Spark data sources on Unity Catalog-enabled shared compute require `ANY FILE` permissions. \nSome installed libraries store data of all users in one common temp directory. These libraries might compromise user isolation.\n\n#### Allowlist libraries and init scripts on shared compute\n##### How to add items to the allowlist\n\nYou can add items to the `allowlist` with [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html) or the [REST API](https:\/\/docs.databricks.com\/api\/workspace\/artifactallowlists). \nTo open the dialog for adding items to the allowlist in Catalog Explorer, do the following: \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Click ![Gear icon](https:\/\/docs.databricks.com\/_images\/gear-icon.png) to open the metastore details and permissions UI.\n3. Select **Allowed JARs\/Init Scripts**.\n4. Click **Add**. \nImportant \nThis option only displays for sufficiently privileged users. If you cannot access the allowlist UI, contact your metastore admin for assistance in allowlisting libraries and init scripts.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/allowlist.html"} +{"content":"# Databricks data engineering\n## What are init scripts?\n#### Allowlist libraries and init scripts on shared compute\n##### Add an init script to the allowlist\n\nComplete the following steps in the allowlist dialog to add an init script to the allowlist: \n1. For **Type**, select **Init Script**.\n2. 
For **Source Type**, select **Volume** or the object storage protocol.\n3. Specify the source path to add to the allowlist. See [How are permissions on paths enforced in the allowlist?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/allowlist.html#paths).\n\n#### Allowlist libraries and init scripts on shared compute\n##### Add a JAR to the allowlist\n\nComplete the following steps in the allowlist dialog to add a JAR to the allowlist: \n1. For **Type**, select **JAR**.\n2. For **Source Type**, select **Volume** or the object storage protocol.\n3. Specify the source path to add to the allowlist. See [How are permissions on paths enforced in the allowlist?](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/allowlist.html#paths).\n\n#### Allowlist libraries and init scripts on shared compute\n##### Add Maven coordinates to the allowlist\n\nComplete the following steps in the allowlist dialog to add Maven coordinates to the allowlist: \n1. For **Type**, select **Maven**.\n2. For **Source Type**, select **Coordinates**.\n3. Enter coordinates in the following format: `groupId:artifactId:version`. \n* You can include all versions of a library by allowlisting the following format: `groupId:artifactId`.\n* You can include all artifacts in a group by allowlisting the following format: `groupId`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/allowlist.html"} +{"content":"# Databricks data engineering\n## What are init scripts?\n#### Allowlist libraries and init scripts on shared compute\n##### How are permissions on paths enforced in the allowlist?\n\nYou can use the allowlist to grant access to JARs or init scripts stored in Unity Catalog volumes and object storage. If you add a path for a directory rather than a file, allowlist permissions propagate to contained files and directories. 
\nPrefix matching is used for all artifacts stored in Unity Catalog volumes or object storage. To prevent prefix matching at a given directory level, include a trailing slash (`\/`). For example: `\/Volumes\/prod-libraries\/`. \nYou can define permissions at the following levels: \n1. The base path for the volume or storage container.\n2. A directory nested at any depth from the base path.\n3. A single file. \nAdding a path to the allowlist only means that the path can be used for either init scripts or JAR installation. Databricks still checks for permissions to access data in the specified location. \nThe principal used must have `READ VOLUME` permissions on the specified volume. See [READ VOLUME](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/privileges.html#read-volume). \nIn single user access mode, the identity of the assigned principal (a user or service principal) is used. \nIn shared access mode: \n* Libraries use the identity of the library installer.\n* Init scripts use the identity of the cluster owner. \nNote \nNo-isolation shared access mode does not support volumes, but uses the same identity assignment as shared access mode. \nDatabricks recommends configuring all object storage privileges related to init scripts and libraries with read-only permissions. Users with write permissions on these locations can potentially modify code in library files or init scripts. \nDatabricks recommends using instance profiles to manage access to JARs or init scripts stored in S3. Use the linked documentation to complete this setup: \n1. Create an IAM role with read and list permissions on your desired buckets. See [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html).\n2. Launch a cluster with the instance profile. See [Instance profiles](https:\/\/docs.databricks.com\/compute\/configure.html#instance-profiles). 
\nNote \nAllowlist permissions for JARs and init scripts are managed separately. If you use the same location to store both types of objects, you must add the location to the allowlist for each.\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/manage-privileges\/allowlist.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Use XGBoost on Databricks\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/sparkdl-xgboost.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Use XGBoost on Databricks\n##### Distributed training of XGBoost models using `sparkdl.xgboost`\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nNote \n`sparkdl.xgboost` is deprecated starting with Databricks Runtime 12.0 ML, and is removed in Databricks Runtime 13.0 ML and above. For information about migrating your workloads to `xgboost.spark`, see [Migration guide for the deprecated sparkdl.xgboost module](https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost-spark.html#xgboost-migration). \nDatabricks Runtime ML includes PySpark estimators based on the Python `xgboost` package, `sparkdl.xgboost.XgboostRegressor` and `sparkdl.xgboost.XgboostClassifier`. You can create an ML pipeline based on these estimators. For more information, see [XGBoost for PySpark Pipeline](https:\/\/databricks.github.io\/spark-deep-learning\/#module-sparkdl.xgboost). \nDatabricks strongly recommends that `sparkdl.xgboost` users use Databricks Runtime 11.3 LTS ML or above. Previous Databricks Runtime versions are affected by bugs in older versions of `sparkdl.xgboost`. \nNote \n* The `sparkdl.xgboost` module is deprecated since Databricks Runtime 12.0 ML. Databricks recommends that you migrate your code to use the `xgboost.spark` module instead. 
See [the migration guide](https:\/\/docs.databricks.com\/machine-learning\/train-model\/xgboost-spark.html#xgboost-migration).\n* The following parameters from the `xgboost` package are not supported: `gpu_id`, `output_margin`, `validate_features`.\n* The parameters `sample_weight`, `eval_set`, and `sample_weight_eval_set` are not supported. Instead, use the parameters `weightCol` and `validationIndicatorCol`. See [XGBoost for PySpark Pipeline](https:\/\/databricks.github.io\/spark-deep-learning\/#module-sparkdl.xgboost) for details.\n* The parameters `base_margin` and `base_margin_eval_set` are not supported. Use the parameter `baseMarginCol` instead. See [XGBoost for PySpark Pipeline](https:\/\/databricks.github.io\/spark-deep-learning\/#module-sparkdl.xgboost) for details.\n* The parameter `missing` has different semantics from the `xgboost` package. In the `xgboost` package, the zero values in a SciPy sparse matrix are treated as missing values regardless of the value of `missing`. For the PySpark estimators in the `sparkdl` package, zero values in a Spark sparse vector are not treated as missing values unless you set `missing=0`. If you have a sparse training dataset (most feature values are missing), Databricks recommends setting `missing=0` to reduce memory consumption and achieve better performance.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/sparkdl-xgboost.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Use XGBoost on Databricks\n##### Distributed training of XGBoost models using `sparkdl.xgboost`\n###### Distributed training\n\nDatabricks Runtime ML supports distributed XGBoost training using the `num_workers` parameter. To use distributed training, create a classifier or regressor and set `num_workers` to a value less than or equal to the total number of Spark task slots on your cluster. To use all Spark task slots, set `num_workers=sc.defaultParallelism`. 
\nFor example: \n```\nclassifier = XgboostClassifier(num_workers=sc.defaultParallelism)\nregressor = XgboostRegressor(num_workers=sc.defaultParallelism)\n\n```\n\n##### Distributed training of XGBoost models using `sparkdl.xgboost`\n###### Limitations of distributed training\n\n* You cannot use `mlflow.xgboost.autolog` with distributed XGBoost.\n* You cannot use `baseMarginCol` with distributed XGBoost.\n* You cannot use distributed XGBoost on a cluster with autoscaling enabled. See [Enable autoscaling](https:\/\/docs.databricks.com\/compute\/configure.html#autoscaling) for instructions to disable autoscaling.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/sparkdl-xgboost.html"} +{"content":"# AI and Machine Learning on Databricks\n## Model training examples\n### Use XGBoost on Databricks\n##### Distributed training of XGBoost models using `sparkdl.xgboost`\n###### GPU training\n\nNote \nDatabricks Runtime 11.3 LTS ML includes XGBoost 1.6.1, which does not support GPU clusters with [compute capability](https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html#compute-capability) 5.2 and below. \nDatabricks Runtime 9.1 LTS ML and above support GPU clusters for XGBoost training. To use a GPU cluster, set `use_gpu` to `True`. \nFor example: \n```\nclassifier = XgboostClassifier(num_workers=N, use_gpu=True)\nregressor = XgboostRegressor(num_workers=N, use_gpu=True)\n\n```\n\n##### Distributed training of XGBoost models using `sparkdl.xgboost`\n###### Example notebook\n\nThis notebook shows the use of the Python package `sparkdl.xgboost` with Spark MLlib. The `sparkdl.xgboost` package is deprecated since Databricks Runtime 12.0 ML. 
\n### PySpark-XGBoost notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/xgboost-pyspark.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/train-model\/sparkdl-xgboost.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Losing spot instances\n\nIf you\u2019re losing spot instances, you may be using an instance type that has a high reclaim rate. Consider changing your instance types. If you\u2019re on AWS, use the [Spot Instance Advisor](https:\/\/aws.amazon.com\/ec2\/spot\/instance-advisor\/) to see how frequently each instance type gets reclaimed. Or you can simply stop using spot instances. You may also want to review our [spot instance recommendations](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#spot-instances).\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/losing-spot-instances.html"} +{"content":"# Get started: Account and workspace setup\n## Navigate the workspace\n#### Workspace browser\n\nWith the workspace browser you can create, browse, and organize Databricks objects, including notebooks, libraries, experiments, queries, dashboards, and alerts, in a single place. You can then share objects and assign permissions at the folder level to organize objects by team or project. You can also browse content in Databricks Git folders. 
The workspace browser introduces a contextual browser that allows you to browse content, including content in Git folders, from within a notebook.\n\n#### Workspace browser\n##### View objects in the workspace browser\n\nYou can view objects, including content in Git folders, in the workspace browser by clicking ![Workspace Icon](https:\/\/docs.databricks.com\/_images\/workspace-icon.png) **Workspace** in the sidebar. Objects created outside the workspace browser (for example, from the query list page) are viewable, by default, in the **Home** folder, where you can organize them within subfolders if you want.\n\n#### Workspace browser\n##### Work with folders and folder objects\n\nThe workspace enables you to create folders, move objects between folders, and share objects to groups of users with a choice of permission levels. \n* To create a folder, click **Add** and then select **Folder**.\n* To move objects between folders, select the object you want to move and then drag it to the folder where you want to move it.\n* To share and grant permissions to all objects in a folder, right-click the folder and select **Share**. Enter the users, groups or service principals to which you want to share the folder and its objects, and then select the permission level. Click **Add**. \n![Sharing folder](https:\/\/docs.databricks.com\/_images\/sql_workspace_share_permissions.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/workspace\/workspace-browser\/index.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Create a monitor using the API\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis page describes how to create a monitor in Databricks using the Python API and describes all of the parameters used in the API call. You can also create and manage monitors using the REST API. 
For reference information, see the [Lakehouse monitoring Python API reference](https:\/\/api-docs.databricks.com\/python\/lakehouse-monitoring\/latest\/index.html) and the [REST API reference](https:\/\/docs.databricks.com\/api\/workspace\/qualitymonitors). \nYou can create a monitor on any managed or external Delta table registered in Unity Catalog. Only a single monitor can be created in a Unity Catalog metastore for any table.\n\n### Create a monitor using the API\n#### Requirements\n\nThe Lakehouse Monitoring API is built into Databricks Runtime 14.3 LTS and above. To use the most recent version of the API, or to use it with earlier Databricks Runtime versions, use the following command at the beginning of your notebook to install the Python client: \n```\n%pip install \"https:\/\/ml-team-public-read.s3.amazonaws.com\/wheels\/data-monitoring\/a4050ef7-b183-47a1-a145-e614628e3146\/databricks_lakehouse_monitoring-0.4.14-py3-none-any.whl\"\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-api.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Create a monitor using the API\n#### Profile type parameter\n\nThe `profile_type` parameter determines the class of metrics that monitoring computes for the table. There are three types: TimeSeries, InferenceLog, and Snapshot. This section briefly describes the parameters. For details, see the [API reference](https:\/\/api-docs.databricks.com\/python\/lakehouse-monitoring\/latest\/index.html) or the [REST API reference](https:\/\/docs.databricks.com\/api\/workspace\/qualitymonitors). \nNote \n* When you first create a time series or inference profile, the monitor analyzes only data from the 30 days prior to its creation. After the monitor is created, all new data is processed.\n* Monitors defined on materialized views and streaming tables do not support incremental processing. 
\n### `TimeSeries` profile \nA `TimeSeries` profile compares data distributions across time windows. For a `TimeSeries` profile, you must provide the following: \n* A timestamp column (`timestamp_col`). The timestamp column data type must be either `TIMESTAMP` or a type that can be converted to timestamps using the `to_timestamp` [PySpark function](https:\/\/spark.apache.org\/docs\/latest\/api\/python\/reference\/pyspark.sql\/api\/pyspark.sql.functions.to_timestamp.html).\n* The set of `granularities` over which to calculate metrics. Available granularities are \u201c5 minutes\u201d, \u201c30 minutes\u201d, \u201c1 hour\u201d, \u201c1 day\u201d, \u201cn week(s)\u201d, \u201c1 month\u201d, \u201c1 year\u201d. \n```\nfrom databricks import lakehouse_monitoring as lm\n\nlm.create_monitor(\ntable_name=f\"{catalog}.{schema}.{table_name}\",\nprofile_type=lm.TimeSeries(\ntimestamp_col=\"ts\",\ngranularities=[\"30 minutes\"]\n),\noutput_schema_name=f\"{catalog}.{schema}\"\n)\n\n``` \n### `InferenceLog` profile \nAn `InferenceLog` profile is similar to a `TimeSeries` profile but also includes model quality metrics. For an `InferenceLog` profile, the following parameters are required: \n| Parameter | Description |\n| --- | --- |\n| `problem_type` | \u201cclassification\u201d or \u201cregression\u201d. |\n| `prediction_col` | Column containing the model\u2019s predicted values. |\n| `timestamp_col` | Column containing the timestamp of the inference request. |\n| `model_id_col` | Column containing the id of the model used for prediction. |\n| `granularities` | Determines how to partition the data in windows across time. Possible values: \u201c5 minutes\u201d, \u201c30 minutes\u201d, \u201c1 hour\u201d, \u201c1 day\u201d, \u201cn week(s)\u201d, \u201c1 month\u201d, \u201c1 year\u201d. | \nThere is also an optional parameter: \n| Optional parameter | Description |\n| --- | --- |\n| `label_col` | Column containing the ground truth for model predictions. 
| \n```\nfrom databricks import lakehouse_monitoring as lm\n\nlm.create_monitor(\ntable_name=f\"{catalog}.{schema}.{table_name}\",\nprofile_type=lm.InferenceLog(\nproblem_type=\"classification\",\nprediction_col=\"preds\",\ntimestamp_col=\"ts\",\ngranularities=[\"30 minutes\", \"1 day\"],\nmodel_id_col=\"model_ver\",\nlabel_col=\"label\", # optional\n),\noutput_schema_name=f\"{catalog}.{schema}\"\n)\n\n``` \nFor InferenceLog profiles, slices are automatically created based on the distinct values of `model_id_col`. \n### `Snapshot` profile \nIn contrast to `TimeSeries`, a `Snapshot` profile monitors how the full contents of the table change over time. Metrics are calculated over all data in the table and reflect the table state each time the monitor is refreshed. \n```\nfrom databricks import lakehouse_monitoring as lm\n\nlm.create_monitor(\ntable_name=f\"{catalog}.{schema}.{table_name}\",\nprofile_type=lm.Snapshot(),\noutput_schema_name=f\"{catalog}.{schema}\"\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-api.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Create a monitor using the API\n#### Refresh and view monitor results\n\nTo refresh metrics tables, use `run_refresh`. For example: \n```\nfrom databricks import lakehouse_monitoring as lm\n\nlm.run_refresh(\ntable_name=f\"{catalog}.{schema}.{table_name}\"\n)\n\n``` \nWhen you call `run_refresh` from a notebook, the monitor metric tables are created or updated. This calculation runs on serverless compute, not on the cluster that the notebook is attached to. You can continue to run commands in the notebook while the monitor statistics are updated. \nFor information about the statistics that are stored in metric tables, see [Monitor metric tables](https:\/\/docs.databricks.com\/lakehouse-monitoring\/monitor-output.html). Metric tables are Unity Catalog tables. 
You can query them in notebooks or in the SQL query explorer, and view them in Catalog Explorer. \nTo display the history of all refreshes associated with a monitor, use `list_refreshes`. \n```\nfrom databricks import lakehouse_monitoring as lm\n\nlm.list_refreshes(\ntable_name=f\"{catalog}.{schema}.{table_name}\"\n)\n\n``` \nTo get the status of a specific run that is queued, running, or finished, use `get_refresh`. \n```\nfrom databricks import lakehouse_monitoring as lm\n\nrun_info = lm.run_refresh(table_name=f\"{catalog}.{schema}.{table_name}\")\n\nlm.get_refresh(\ntable_name=f\"{catalog}.{schema}.{table_name}\",\nrefresh_id=run_info.refresh_id\n)\n\n``` \nTo cancel a refresh that is queued or running, use `cancel_refresh`. \n```\nfrom databricks import lakehouse_monitoring as lm\n\nrun_info = lm.run_refresh(table_name=f\"{catalog}.{schema}.{table_name}\")\n\nlm.cancel_refresh(\ntable_name=f\"{catalog}.{schema}.{table_name}\",\nrefresh_id=run_info.refresh_id\n)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-api.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Create a monitor using the API\n#### View monitor settings\n\nYou can review monitor settings using the `get_monitor` API. 
\n```\nfrom databricks import lakehouse_monitoring as lm\n\nlm.get_monitor(table_name=TABLE_NAME)\n\n```\n\n### Create a monitor using the API\n#### Schedule\n\nTo set up a monitor to run on a scheduled basis, use the `schedule` parameter of `create_monitor`: \n```\nlm.create_monitor(\ntable_name=f\"{catalog}.{schema}.{table_name}\",\nprofile_type=lm.TimeSeries(\ntimestamp_col=\"ts\",\ngranularities=[\"30 minutes\"]\n),\nschedule=lm.MonitorCronSchedule(\nquartz_cron_expression=\"0 0 12 * * ?\", # schedules a refresh every day at 12 noon\ntimezone_id=\"PST\",\n),\noutput_schema_name=f\"{catalog}.{schema}\"\n)\n\n``` \nSee [cron expressions](https:\/\/en.wikipedia.org\/wiki\/Cron) for more information.\n\n### Create a monitor using the API\n#### Notifications\n\nTo set up notifications for a monitor, use the `notifications` parameter of `create_monitor`: \n```\nlm.create_monitor(\ntable_name=f\"{catalog}.{schema}.{table_name}\",\nprofile_type=lm.TimeSeries(\ntimestamp_col=\"ts\",\ngranularities=[\"30 minutes\"]\n),\nnotifications=lm.Notifications(\n# Notify the given email when a monitoring refresh fails or times out.\non_failure=lm.Destination(\nemail_addresses=[\"your_email@domain.com\"]\n)\n),\noutput_schema_name=f\"{catalog}.{schema}\"\n)\n\n``` \nA maximum of 5 email addresses is supported per event type (for example, \u201con\\_failure\u201d).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-api.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Create a monitor using the API\n#### Control access to metric tables\n\nThe metric tables and dashboard created by a monitor are owned by the user who created the monitor. You can use Unity Catalog privileges to control access to metric tables. 
To share dashboards within a workspace, use the **Share** button at the upper-right of the dashboard.\n\n### Create a monitor using the API\n#### Delete a monitor\n\nTo delete a monitor: \n```\nlm.delete_monitor(table_name=TABLE_NAME)\n\n``` \nThis command does not delete the profile tables and the dashboard created by the monitor. You must delete those assets in a separate step, or you can save them in a different location.\n\n### Create a monitor using the API\n#### Example notebooks\n\nThe following example notebooks illustrate how to create a monitor, refresh the monitor, and examine the metric tables it creates.\n\n### Create a monitor using the API\n#### Notebook example: Time series profile\n\nThis notebook illustrates how to create a `TimeSeries` type monitor. \n### TimeSeries Lakehouse Monitor example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/monitoring\/timeseries-monitor.html)\n\n### Create a monitor using the API\n#### Notebook example: Inference profile (regression)\n\nThis notebook illustrates how to create an `InferenceLog` type monitor for a regression problem. \n### Inference Lakehouse Monitor regression example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/monitoring\/regression-monitor.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-api.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### Create a monitor using the API\n#### Notebook example: Inference profile (classification)\n\nThis notebook illustrates how to create an `InferenceLog` type monitor for a classification problem. 
\n### Inference Lakehouse Monitor classification example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/monitoring\/classification-monitor.html)\n\n### Create a monitor using the API\n#### Notebook example: Snapshot profile\n\nThis notebook illustrates how to create a `Snapshot` type monitor. \n### Snapshot Lakehouse Monitor example notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/monitoring\/snapshot-monitor.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/create-monitor-api.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Interoperability and usability for the data lakehouse\n##### Best practices for interoperability and usability\n\nThis article covers best practices for **interoperability and usability**, organized by architectural principles listed in the following sections.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/interoperability-and-usability\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Interoperability and usability for the data lakehouse\n##### Best practices for interoperability and usability\n###### 1. Define standards for integration\n\n### Use the Databricks REST API for external integration \nThe Databricks Lakehouse comes with a comprehensive [REST API](https:\/\/docs.databricks.com\/api\/workspace\/introduction) that allows you to manage nearly all aspects of the platform programmatically. 
The REST API server runs in the control plane and provides a unified endpoint to manage the Databricks platform. This is the preferred way to integrate Databricks, for example, into existing tools for CI\/CD or MLOps. For integration in shell-based devices, [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html) encapsulates many of the REST APIs in a command line interface. \n### Use optimized connectors to access data sources from the lakehouse \nDatabricks offers a variety of ways to help you ingest data into Delta Lake. Therefore, the lakehouse provides optimized connectors for many data formats and cloud services. See [Query databases using JDBC](https:\/\/docs.databricks.com\/connect\/external-systems\/jdbc.html). Many of them are already included in the Databricks Runtime. These connectors are often built and optimized for specific data sources. \n### Use partners available in Partner Connect \nBusinesses have different needs, and no single tool can meet all of them. [Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/index.html) allows you to explore and easily integrate with our partners, which cover all aspects of the lakehouse: data ingestion, preparation and transformation, BI and visualization, machine learning, data quality, and so on. Partner Connect lets you create trial accounts with selected Databricks technology partners and connect your Databricks workspace to partner solutions from the Databricks UI. Try partner solutions using your data in the Databricks lakehouse, and then adopt the solutions that best meet your business needs. \n### Use Delta Live Tables and Auto Loader \nDelta Live Tables is a framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations to perform on your data, and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling. 
See [What is Delta Live Tables?](https:\/\/docs.databricks.com\/delta-live-tables\/index.html). \n[Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html) incrementally and efficiently processes new data files as they arrive in cloud storage. It can reliably read data files from cloud storage. An essential aspect of both Delta Live Tables and Auto Loader is their declarative nature: Without them, one has to build complex pipelines that integrate different cloud services - such as a notification service and a queuing service - to reliably read cloud files based on events and allow the combining of batch and streaming sources reliably. \nAuto Loader and Delta Live Tables reduce system dependencies and complexity and significantly improve the interoperability with the cloud storage and between different paradigms like batch and streaming. As a side effect, the simplicity of pipelines increases platform usability. \n### Use Infrastructure as Code for deployments and maintenance \nHashiCorp Terraform is a popular open source tool for creating safe and predictable cloud infrastructure across several cloud providers. See [Operational-Excellence > Use Infrastructure as Code for deployments and maintenance](https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/best-practices.html#use-infrastructure-as-code-for-deployments-and-maintenance)\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/interoperability-and-usability\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Interoperability and usability for the data lakehouse\n##### Best practices for interoperability and usability\n###### 2. 
Prefer open interfaces and open data formats\n\n### Use the Delta data format \nThe [Delta Lake](https:\/\/delta.io\/) framework has many advantages, from reliability features to high-performance enhancements, and it is also a fully open data format. See: \n* [Use Delta Lake](https:\/\/docs.databricks.com\/lakehouse-architecture\/reliability\/best-practices.html#delta-lake).\n* [Best practices for performance efficiency](https:\/\/docs.databricks.com\/lakehouse-architecture\/performance-efficiency\/best-practices.html). \nAdditionally, Delta Lake comes with a Delta Standalone library, which opens the Delta format for development projects. It is a single-node Java library that can read from and write to Delta tables. Dozens of third-party tools and applications [support Delta Lake](https:\/\/delta.io\/integrations). Specifically, this library provides APIs to interact with table metadata in the transaction log, implementing the Delta Transaction Log Protocol to achieve the transactional guarantees of the Delta format. See [What is Delta Lake?](https:\/\/docs.databricks.com\/delta\/index.html). \n### Use Delta Sharing to exchange data with partners \n[Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html) is an open protocol developed by Databricks for secure data sharing with other organizations regardless of which computing platforms they use. A Databricks user, called a \u201cdata provider\u201d, can use Delta Sharing to share data with a person or group outside their organization, named a \u201cdata recipient\u201d. Data recipients can immediately begin working with the latest version of the shared data. Delta Sharing is available for data in the Unity Catalog metastore. \n### Use MLflow to manage machine learning workflows \n[MLflow](https:\/\/mlflow.org\/) is an open source platform for managing the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. 
Using MLflow on Databricks provides both advantages: You can write your ML workflow using an open and portable tool and use reliable services operated by Databricks (Tracking Server, Model Registry). See [ML lifecycle management using MLflow](https:\/\/docs.databricks.com\/mlflow\/index.html). It also adds enterprise-grade, scalable model serving, allowing you to host MLflow models as REST endpoints.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/interoperability-and-usability\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Interoperability and usability for the data lakehouse\n##### Best practices for interoperability and usability\n###### 3. Lower the barriers for implementing use cases\n\n### Provide a self-service experience across the platform \nThe Databricks Data Intelligence Platform has all the capabilities required to provide a self-service experience. While there might be a mandatory approval step, the best practice is to fully automate the setup when a business unit requests access to the lakehouse. Automatically provision their new environment, sync users and use SSO for authentication, provide access control to common data and separate object storages for their own data, and so on. Together with a central data catalog containing semantically consistent and business-ready data sets, this quickly and securely provides access for new business units to the lakehouse capabilities and the data they need. \n### Use the serverless services of the platform \nFor [serverless compute](https:\/\/docs.databricks.com\/lakehouse-architecture\/performance-efficiency\/best-practices.html#use-serverless-compute) on the Databricks platform, the compute layer runs in the customer\u2019s Databricks account. 
Cloud administrators no longer have to manage complex cloud environments that involve adjusting quotas, creating and maintaining networking assets, and joining billing sources. Users benefit from near-zero waiting times for cluster start and improved concurrency on their queries. \n### Offer predefined clusters and SQL warehouses for each use case \nIf using serverless services is not possible, remove the burden of defining a cluster (VM type, node size, and cluster size) from end users. This can be achieved in the following ways: \n* Provide shared clusters as immediate environments for users. On these clusters, use autoscaling with a very low minimum number of nodes to avoid high idle costs.\n* Use [cluster policies](https:\/\/docs.databricks.com\/lakehouse-architecture\/operational-excellence\/best-practices.html#use-cluster-policies) to define t-shirt-sized clusters (S, M, L) for projects as a standardized work environment.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/interoperability-and-usability\/best-practices.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n### Interoperability and usability for the data lakehouse\n##### Best practices for interoperability and usability\n###### 4. Ensure data consistency and usability\n\n### Offer reusable data-as-products that the business can trust \nProducing high-quality data-as-product is a primary goal of any data platform. The idea is that data engineering teams apply product thinking to the curated data: The data assets are their products, and the data scientists, ML and BI engineers, or any other business teams that consume data are their customers. These customers should be able to discover, address, and create value from these data-as-products through a self-service experience without the intervention of the specialized data teams. 
\n### Publish data products semantically consistent across the enterprise \nA data lake usually contains data from different source systems. These systems sometimes name the same concept differently (such as *customer* vs. *account*) or mean different concepts by the same identifier. For business users to easily combine these data sets in a meaningful way, the data must be made homogeneous across all sources to be semantically consistent. In addition, for some data to be valuable for analysis, internal business rules must be applied correctly, such as revenue recognition. To ensure that all users are using the correctly interpreted data, data sets with these rules must be made available and published to Unity Catalog. Access to source data must be limited to teams that understand the correct usage. \n### Use Unity Catalog for data discovery and lineage exploration \nIn Unity Catalog, administrators and data stewards manage users and their access to data *centrally across all workspaces* in a Databricks account. Users in different workspaces can share the same data and, depending on user privileges granted centrally in Unity Catalog, joint data access is possible. See [What is Catalog Explorer?](https:\/\/docs.databricks.com\/catalog-explorer\/index.html). \nFrom a usability perspective, Unity Catalog provides the following two capabilities: \n* Catalog Explorer is the main UI for many Unity Catalog features. You can use Catalog Explorer to view schema details, preview sample data, and see table details and properties. Administrators can view and change owners, and admins and data object owners can grant and revoke permissions. You can also use Databricks Search, which enables users to find data assets (such as tables, columns, views, dashboards, models, and so on) easily and seamlessly. Users will be shown results that are relevant to their search requests and that they have access to. 
See [Capture and view data lineage using Unity Catalog](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/data-lineage.html). \n* Data lineage across all queries run on a Databricks cluster or SQL warehouse. Lineage is supported for all languages and is captured down to the column level. Lineage data includes notebooks, workflows, and dashboards related to the query. Lineage can be visualized in Catalog Explorer in near real-time and retrieved with the Databricks REST API. \nTo allow enterprises to provide their users a holistic view of all data across all data platforms, Unity Catalog provides integration with enterprise data catalogs (sometimes referred to as the \u201ccatalog of catalogs\u201d).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/interoperability-and-usability\/best-practices.html"} +{"content":"# Compute\n## What is a SQL warehouse?\n#### Monitor a SQL warehouse\n\nYou can monitor a SQL warehouse from the Databricks UI.\n\n#### Monitor a SQL warehouse\n##### View SQL warehouse monitoring metrics\n\nTo monitor a SQL warehouse, click the name of a SQL warehouse and then the **Monitoring** tab. On the **Monitoring** tab, you see the following monitoring elements: \n![Screenshot with numbered annotations to denote the defined parts of the page that follow.](https:\/\/docs.databricks.com\/_images\/warehouse-monitoring-tab.png) \n1. **Live statistics**: Live statistics show the currently running and queued queries, active SQL sessions, the warehouse status, and the current cluster count.\n2. **Time scale filter**: The monitoring time scale filter sets the time range for the query count chart, running cluster chart, and the query history and event log table. The default time range is 8 hours, but you can specify 24 hours, 7 days, or 14 days. You can also click and drag on the bar chart to change the time range.\n3. 
**Peak query count chart**: The peak query count chart shows the maximum number of concurrent queries, either running or queued, on the warehouse during the selected time frame. The data that supplies this chart does not include metadata queries. Each data point in the chart is the peak within a 5-minute window.\n4. **Running clusters chart**: The running clusters chart shows the number of clusters allocated to the warehouse during the selected time frame. During a cluster recycle, this count might temporarily exceed the configured maximum.\n5. **Query history table**: The query history table shows all of the queries active during the selected time frame, their start time and duration, and the user that executed the query. You can filter the queries by user, query duration, query status, and query type. \nNote \nThe cluster count can be greater than one only if [scaling](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/warehouse-behavior.html#scaling) is enabled and configured.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/sql-warehouse\/monitor.html"} +{"content":"# \n### Key concepts of RAG Studio\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html)\n\n### Key concepts of RAG Studio\n#### Logging\n\nCore to RAG Studio is always-on `\ud83d\udcdd Trace` logging. Every time your app is invoked, RAG Studio automatically captures a detailed, step-by-step log of every action taken inside the `\ud83d\udd17 Chain`, saving it to the `\ud83d\uddc2\ufe0f Request Log`, which is simply a Delta table. \nThis logging is based on Model Serving\u2019s [Inference Tables](https:\/\/docs.databricks.com\/machine-learning\/model-serving\/inference-tables.html) functionality. 
\nView the `\ud83d\uddc2\ufe0f Request Log` [schema](https:\/\/docs.databricks.com\/rag-studio\/details\/request-log.html) for more details.\n\n### Key concepts of RAG Studio\n#### Assessments\n\nFor every `\ud83d\uddc2\ufe0f Request Log`, you can associate a `\ud83d\udc4d Assessment & Evaluation Results Log` with that log. An assessment represents feedback about that `\ud83d\udcdd Trace`, for example, whether the retrieved documents were relevant or whether the answer was correct. Each `\ud83d\udcdd Trace` can have multiple assessments from different sources: one of your `\ud83e\udde0 Expert Users`, `\ud83d\udc64 End Users`, or a `\ud83e\udd16 LLM Judge`. \nView the `\ud83d\udc4d Assessment & Evaluation Results Log` [schema](https:\/\/docs.databricks.com\/rag-studio\/details\/assessment-log.html) for more details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/concepts\/terminology-key-concepts.html"} +{"content":"# \n### Key concepts of RAG Studio\n#### Online evaluations\n\n`\ud83d\uddc2\ufe0f Request Log` and `\ud83d\udc4d Assessment & Evaluation Results Log` are used to compute metrics that allow you to understand the quality, cost, and latency of your RAG Application based on feedback collected from `\ud83d\udc64 End Users` and `\ud83e\udde0 Expert Users`. The metric computations are added to the `\ud83d\udc4d Assessment & Evaluation Results Log` table and can be accessed through the `\ud83d\udd75\ufe0f\u2640\ufe0f Exploration & Investigation UI`. \nView the [metrics](https:\/\/docs.databricks.com\/rag-studio\/details\/metrics.html) computed by RAG Studio for more details.\n\n### Key concepts of RAG Studio\n#### Offline evaluations\n\nOffline evaluations allow you to curate `\ud83d\udcd6 Evaluation Set`s which are `\ud83d\uddc2\ufe0f Request Log` (optionally linked with the ground-truth answer from a `\ud83d\udc4d Assessment & Evaluation Results Log`) that contain representative queries your RAG Application supports. 
You use a `\ud83d\udcd6 Evaluation Set` to compute the same metrics as in online evaluations; however, offline evaluation is typically done to assess the quality, cost, and latency of a new version before deploying the RAG Application to your users.\n\n### Key concepts of RAG Studio\n#### Versions\n\nTo create RAG Applications that deliver accurate answers, you must be able to quickly create and compare both end-to-end versions of your RAG Application and versions of the individual components (`\ud83d\uddc3\ufe0f Data Processor`, `\ud83d\udd17 Chain`, etc) that make up your RAG Application. For example, you might want to see how `chunk_size = 500` compares to `chunk_size = 1000`. RAG Studio supports logging versions - each version represents the code and configuration for the individual components.\n\n### Key concepts of RAG Studio\n#### Unified online and offline schemas\n\nA core concept of RAG Studio is that all infrastructure and data schemas *are unified between development and production.* This enables you to quickly test a new version with `\ud83e\udde0 Expert Users`, then deploy it to production once validated \u2013 using the same instrumentation code and measuring the same metrics in both environments.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/concepts\/terminology-key-concepts.html"} +{"content":"# \n### Key concepts of RAG Studio\n#### Environments\n\nHowever, having the same infrastructure and schemas in development and production can create a blurry line between these environments. RAG Studio supports multiple environments, because it is critically important that developers maintain a clean separation between these environments. 
\nView the [environments](https:\/\/docs.databricks.com\/rag-studio\/details\/environments.html) for more details.\n\n### Key terminology\n#### Application configuration\n\n* **`\u2699\ufe0f Global Configuration`:** the app\u2019s name, Databricks workspace where the app is deployed, the Unity Catalog schema where assets are stored, and (optionally) the MLflow experiment and [vector search](https:\/\/docs.databricks.com\/generative-ai\/vector-search.html) endpoint.\n* **`\ud83e\udd16 LLM Judge` configuration:** configuration for how `\ud83e\udd16 LLM Judge`s are run by RAG Studio.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/concepts\/terminology-key-concepts.html"} +{"content":"# \n### Key terminology\n#### Component code & configuration\n\n* **`\ud83d\udce5 Data Ingestor`:** A data pipeline that ingests raw unstructured documents from a 3rd party raw data source (such as Confluence, Google Drive, etc) into a [UC Volume](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html). Each `\ud83d\udce5 Data Ingestor` can be associated with any number of `\ud83d\uddc3\ufe0f Data Processor`.\n* **`\ud83d\uddc3\ufe0f Data Processor`:** A data pipeline that parses, chunks, and embeds unstructured documents from a `\ud83d\udce5 Data Ingestor` into chunks stored in a Vector Index. A `\ud83d\uddc3\ufe0f Data Processor` is associated with 1+ `\ud83d\udce5 Data Ingestor`.\n* **`\ud83d\udd0d Retriever`:** Logic that retrieves relevant chunks from a Vector Index. Given the dependencies between processing logic and retrieval logic, a `\ud83d\udd0d Retriever` is associated to 1+ `\ud83d\uddc3\ufe0f Data Processor`s. A `\ud83d\udd0d Retriever` can be a simple call to a Vector Index or a more complex series of steps including a re-ranker.\\*\n* **`\ud83d\udd17 Chain`:** The orchestration code that glues together `\ud83d\udd0d Retriever` and Generative AI Models to turn a user query (question) into bot response (answer). 
Each `\ud83d\udd17 Chain` is associated with 1+ `\ud83d\udd0d Retriever`s.\n\n### Key terminology\n#### Data generated by RAG Studio\n\n* **`\ud83d\uddc2\ufe0f Request Log`:** The step-by-step `\ud83d\udcdd Trace` of every `\ud83d\udd17 Chain` invocation e.g., every user query & bot response along with detailed traces of the steps taken by the `\ud83d\udd17 Chain` to generate that response.\n* **`\ud83d\udc4d Assessment & Evaluation Results Log`:** User provided or `\ud83e\udd16 LLM Judge` feedback (thumbs up \/ down, edited bot responses, etc) that is linked to a `\ud83d\udcdd Trace`. The results from RAG Studio computing evaluations (aka metrics) are added to each row of this table.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/concepts\/terminology-key-concepts.html"} +{"content":"# \n### Key terminology\n#### Data curated by the `\ud83d\udc69\ud83d\udcbb RAG App Developer`\n\n* **`\ud83d\udcd6 Evaluation Set`:** `\ud83d\uddc2\ufe0f Request Log`, optionally with associated `\ud83d\udc4d Assessment & Evaluation Results Log`, that contain representative questions\/answers used for offline evaluation of the RAG Application.\n* **`\ud83d\udccb Review Set`:** `\ud83d\uddc2\ufe0f Request Log` that are curated by the developer for the purposes of collecting `\ud83e\udde0 Expert Users`\u2019s feedback in order to create `\ud83d\udcd6 Evaluation Set`s.\n\n### Key terminology\n#### RAG Studio User interfaces\n\n* **`\ud83d\udcac Review UI`:** A chat-based web app for soliciting feedback from `\ud83e\udde0 Expert Users` or for a `\ud83d\udc69\ud83d\udcbb RAG App Developer` to test the app.\n* **`\ud83d\udd75\ufe0f\u2640\ufe0f Exploration & Investigation UI`:** A UI, built into Databricks, for viewing computed evaluations (metrics) about a RAG Application version and investigating individual `\ud83d\uddc2\ufe0f Request Log`s and `\ud83d\udc4d Assessment & Evaluation Results 
Log`s.\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/concepts\/terminology-key-concepts.html"} +{"content":"# Data governance with Unity Catalog\n### Hive metastore table access control (legacy)\n\nEach Databricks workspace deploys with a built-in Hive metastore as a managed service. An instance of the metastore deploys to each cluster and securely accesses metadata from a central per-workspace repository. \nBy default, a cluster allows all users to access all data managed by the workspace\u2019s built-in Hive metastore unless table access control is enabled for that cluster. Table access control lets you programmatically grant and revoke access to objects in your workspace\u2019s Hive metastore from Python and SQL. When table access control is enabled, users can set permissions for data objects that are accessed using that cluster. \nNote \nHive metastore table access control is a legacy data governance model. Databricks recommends that you [upgrade the tables managed by the Hive metastore to the Unity Catalog metastore](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/migrate.html). Unity Catalog simplifies security and governance of your data by providing a central place to administer and audit data access across multiple workspaces in your account.\n\n### Hive metastore table access control (legacy)\n#### Requirements\n\n* This feature requires the [Premium plan or above](https:\/\/databricks.com\/product\/pricing\/platform-addons).\n* This feature requires a Data Science & Engineering cluster with an [appropriate configuration](https:\/\/docs.databricks.com\/data-governance\/table-acls\/table-acl.html#table-access-control) or a SQL warehouse. 
\nThis section covers: \n* [Enable Hive metastore table access control on a cluster (legacy)](https:\/\/docs.databricks.com\/data-governance\/table-acls\/table-acl.html)\n* [Hive metastore privileges and securable objects (legacy)](https:\/\/docs.databricks.com\/data-governance\/table-acls\/object-privileges.html)\n* [What is the `ANY FILE` securable?](https:\/\/docs.databricks.com\/data-governance\/table-acls\/any-file.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/data-governance\/table-acls\/index.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create an external location to connect cloud storage to Databricks\n\nThis article describes how to configure an external location in Unity Catalog to connect cloud storage to Databricks. \nExternal locations associate Unity Catalog storage credentials with cloud object storage containers. External locations are used to define managed storage locations for catalogs and schemas, and to define locations for external tables and external volumes. \nYou can create an external location that references storage in an AWS S3 or Cloudflare R2 bucket. \nYou can create an external location using Catalog Explorer, the Databricks CLI, SQL commands in a notebook or Databricks SQL query, or [Terraform](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/external_location). 
\nNote \nWhen you define a volume, cloud URI access to data under the volume path is governed by the permissions of the volume.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create an external location to connect cloud storage to Databricks\n##### Before you begin\n\n**Prerequisites**: \n* You must create the AWS S3 or Cloudflare R2 bucket that you want to use as an external location before you create the external location object in Databricks. \n+ The AWS CloudFormation template supports only S3 buckets.\n+ The name of an S3 bucket that you want users to read from and write to cannot use dot notation (for example, `incorrect.bucket.name.notation`). For more bucket naming guidance, see the [AWS bucket naming rules](https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/bucketnamingrules.html). \n* If you don\u2019t use the AWS CloudFormation template to create the external location, you must first create a storage credential in Databricks that gives access to the cloud storage location path. See [Create a storage credential for connecting to AWS S3](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html) and [Create a storage credential for connecting to Cloudflare R2](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials-r2.html). \nIf you use the AWS CloudFormation flow, that storage credential is created for you. \n**Permissions requirements**: \n* You must have the `CREATE EXTERNAL LOCATION` privilege on both the metastore and the storage credential referenced in the external location. Metastore admins have `CREATE EXTERNAL LOCATION` on the metastore by default. \n* If you are using the AWS CloudFormation template, you must also have the `CREATE STORAGE CREDENTIAL` privilege on the metastore. 
Metastore admins have `CREATE STORAGE CREDENTIAL` on the metastore by default.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create an external location to connect cloud storage to Databricks\n##### Create an external location for an S3 bucket using an AWS CloudFormation template\n\nIf you create an external location using the AWS CloudFormation template, Databricks configures the external location and creates a storage credential for you. You also have the option to [create the external location manually](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#catalog-explorer), which requires that you first create an IAM role that gives access to the S3 bucket that is referenced by the external location and a storage credential that references that IAM role. For details, see [Create a storage credential for connecting to AWS S3](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html). \nNote \nYou cannot create external locations for Cloudflare R2 buckets using the AWS CloudFormation template. Instead use the [manual flow in Catalog Explorer](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#catalog-explorer) or [SQL statements in a Databricks notebook or SQL query editor](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#sql). \n**Permissions and prerequisites:** see [Before you begin](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#requirements). \nTo create the external location: \n1. Log in to a workspace that is attached to the metastore.\n2. Click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog** to open Catalog Explorer.\n3. Click the **+ Add** button and select **Add an external location**.\n4. 
On the **Create a new external location** dialog, select **AWS Quickstart (Recommended)** then click **Next**. \nThe AWS Quickstart configures the external location and creates a storage credential for you. If you choose to use the **Manual** option, you must manually create an IAM role that gives access to the S3 bucket and create the storage credential in Databricks yourself.\n5. On the **Create external location with Quickstart** dialog, enter the path to the S3 bucket in the **Bucket Name** field.\n6. Click **Generate new token** to generate the personal access token that you will use to authenticate between Databricks and your AWS account.\n7. Copy the token and click **Launch in Quickstart**.\n8. In the AWS CloudFormation template that launches (labeled **Quick create stack**), paste the token into the **Databricks Account Credentials** field.\n9. Accept the terms at the bottom of the page (**I acknowledge that AWS CloudFormation might create IAM resources with custom names**).\n10. Click **Create stack**. \nIt may take a few minutes for the CloudFormation template to finish creating the external location object in Databricks.\n11. Return to your Databricks workspace and click **Catalog** to open **Catalog Explorer**.\n12. In the left pane of Catalog Explorer, scroll down and click **External Data > External Locations**.\n13. Confirm that a new external location has been created. \nAutomatically-generated external locations use the naming syntax `db_s3_external_databricks-S3-ingest-<id>`.\n14. (Optional) Bind the external location to specific workspaces. \nBy default, any privileged user can use the external location on any workspace attached to the metastore. If you want to allow access only from specific workspaces, go to the **Workspaces** tab and assign workspaces. See [(Optional) Assign an external location to specific workspaces](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#workspace-binding).\n15. 
Grant permission to use the external location. \nFor anyone to use the external location, you must grant permissions: \n* To use the external location to add a managed storage location to a metastore, catalog, or schema, grant the `CREATE MANAGED LOCATION` privilege.\n* To create external tables or volumes, grant `CREATE EXTERNAL TABLE` or `CREATE EXTERNAL VOLUME`. \nTo use Catalog Explorer to grant permissions: \n1. Click the external location name to open the details pane.\n2. On the **Permissions** tab, click **Grant**.\n3. On the **Grant on `<external location>`** dialog, select users, groups, or service principals in the **Principals** field, and select the privilege you want to grant.\n4. Click **Grant**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create an external location to connect cloud storage to Databricks\n##### Create an external location using Catalog Explorer\n\nYou can create an external location manually using Catalog Explorer. \n**Permissions and prerequisites:** see [Before you begin](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#requirements). \nIf you are creating an external location for an S3 bucket, Databricks recommends that you use the [AWS CloudFormation template](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#cloudformation) rather than the procedure described here. If you use the AWS CloudFormation template, you do not need to create a storage credential. It is created for you. \nTo create the external location: \n1. Log in to a workspace that is attached to the metastore.\n2. In the sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. Click the **+ Add** button and select **Add an external location**.\n4. 
On the **Create a new external location** dialog, click **Manual**, then **Next**. \nTo learn about the AWS Quickstart option, see [Create an external location for an S3 bucket using an AWS CloudFormation template](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#cloudformation).\n5. In the **Create a new external location manually** dialog, enter an **External location name**.\n6. Optionally copy the bucket path from an existing mount point (S3 buckets only).\n7. If you aren\u2019t copying from an existing mount point, use the **URL** field to enter the S3 or R2 bucket path that you want to use as the external location. \nFor example, `S3:\/\/mybucket\/<path>` or `r2:\/\/mybucket@my-account-id.r2.cloudflarestorage.com\/<path>`.\n8. Select the storage credential that grants access to the external location.\n9. (Optional) If you want users to have read-only access to the external location, click **Advanced Options** and select **Read only**. For more information, see [Mark an external location as read-only](https:\/\/docs.databricks.com\/connect\/unity-catalog\/manage-external-locations.html#read-only).\n10. Click **Create**.\n11. (Optional) Bind the external location to specific workspaces. \nBy default, any privileged user can use the external location on any workspace attached to the metastore. If you want to allow access only from specific workspaces, go to the **Workspaces** tab and assign workspaces. See [(Optional) Assign an external location to specific workspaces](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#workspace-binding).\n12. Grant permission to use the external location. 
\nFor anyone to use the external location, you must grant permissions: \n* To use the external location to add a managed storage location to a metastore, catalog, or schema, grant the `CREATE MANAGED LOCATION` privilege.\n* To create external tables or volumes, grant `CREATE EXTERNAL TABLE` or `CREATE EXTERNAL VOLUME`. \nTo use Catalog Explorer to grant permissions: \n1. Click the external location name to open the details pane.\n2. On the **Permissions** tab, click **Grant**.\n3. On the **Grant on `<external location>`** dialog, select users, groups, or service principals in the **Principals** field, and select the privilege you want to grant.\n4. Click **Grant**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create an external location to connect cloud storage to Databricks\n##### Create an external location using SQL\n\nTo create an external location using SQL, run the following command in a notebook or the SQL query editor. Replace the placeholder values. \n**Permissions and prerequisites:** see [Before you begin](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#requirements). \n* `<location-name>`: A name for the external location. If `<location-name>` includes special characters, such as hyphens (`-`), it must be surrounded by backticks ( `` `` ). See [Names](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-names.html). \n* `<bucket-path>`: The path in your cloud tenant that this external location grants access to. For example, `S3:\/\/mybucket` or `r2:\/\/mybucket@my-account-id.r2.cloudflarestorage.com`.\n* `<storage-credential-name>`: The name of the storage credential that authorizes reading from and writing to the bucket. If the storage credential name includes special characters, such as hyphens (`-`), it must be surrounded by backticks ( `` `` ). 
\n```\nCREATE EXTERNAL LOCATION [IF NOT EXISTS] `<location-name>`\nURL '<bucket-path>'\nWITH ([STORAGE] CREDENTIAL `<storage-credential-name>`)\n[COMMENT '<comment-string>'];\n\n``` \nIf you want to limit external location access to specific workspaces in your account, also known as workspace binding or external location isolation, see [(Optional) Assign an external location to specific workspaces](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#workspace-binding).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create an external location to connect cloud storage to Databricks\n##### (Optional) Assign an external location to specific workspaces\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nBy default, an external location is accessible from all of the workspaces in the metastore. This means that if a user has been granted a privilege (such as `READ FILES`) on that external location, they can exercise that privilege from any workspace attached to the metastore. If you use workspaces to isolate user data access, you might want to allow access to an external location only from specific workspaces. This feature is known as workspace binding or external location isolation. \nTypical use cases for binding an external location to specific workspaces include: \n* Ensuring that data engineers who have the `CREATE EXTERNAL TABLE` privilege on an external location that contains production data can create external tables on that location only in a production workspace.\n* Ensuring that data engineers who have the `READ FILES` privilege on an external location that contains sensitive data can only use specific workspaces to access that data. 
\nFor more information about how to restrict other types of data access by workspace, see [Workspace-catalog binding example](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-catalogs.html#catalog-binding-example). \nImportant \nWorkspace bindings are referenced when privileges against external locations are exercised. For example, if a user creates an external table by issuing the statement `CREATE TABLE myCat.mySch.myTable LOCATION 's3:\/\/bucket\/path\/to\/table'` from the `myWorkspace` workspace, the following workspace binding checks are performed in addition to regular user privilege checks: \n* Is the external location covering `'s3:\/\/bucket\/path\/to\/table'` bound to `myWorkspace`?\n* Is the catalog `myCat` bound to `myWorkspace` with access level `Read & Write`? \nIf the external location is subsequently unbound from `myWorkspace`, then the external table continues to function. \nThis feature also allows you to populate a catalog from a central workspace and make it available to other workspaces using catalog bindings, without also having to make the external location available in those other workspaces. \n### Bind an external location to one or more workspaces \nTo assign an external location to specific workspaces, you can use Catalog Explorer or the Unity Catalog REST API. \n**Permissions required**: Metastore admin or external location owner. \nNote \nMetastore admins can see all external locations in a metastore using Catalog Explorer\u2014and external location owners can see all external locations that they own in a metastore\u2014regardless of whether the external location is assigned to the current workspace. External locations that are not assigned to the workspace appear grayed out. \n1. Log in to a workspace that is linked to the metastore.\n2. In the sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n3. 
At the bottom of the screen, click **External Data > External Locations**.\n4. Select the external location and go to the **Workspaces** tab.\n5. On the **Workspaces** tab, clear the **All workspaces have access** checkbox. \nIf your external location is already bound to one or more workspaces, this checkbox is already cleared.\n6. Click **Assign to workspaces** and enter or find the workspaces you want to assign. \nTo revoke access, go to the **Workspaces** tab, select the workspace, and click **Revoke**. To allow access from all workspaces, select the **All workspaces have access** checkbox. \nThere are two APIs and two steps required to assign an external location to a workspace. In the following examples, replace `<workspace-url>` with your workspace instance name. To learn how to get the workspace instance name and workspace ID, see [Get identifiers for workspace objects](https:\/\/docs.databricks.com\/workspace\/workspace-details.html). To learn about getting access tokens, see [Authentication for Databricks automation - overview](https:\/\/docs.databricks.com\/dev-tools\/auth\/index.html). \n1. Use the `external locations` API to set the external location\u2019s `isolation_mode` to `ISOLATED`: \n```\ncurl -L -X PATCH 'https:\/\/<workspace-url>\/api\/2.1\/unity-catalog\/external-locations\/<my-location>' \\\n-H 'Authorization: Bearer <my-token>' \\\n-H 'Content-Type: application\/json' \\\n--data-raw '{\n\"isolation_mode\": \"ISOLATED\"\n}'\n\n``` \nThe default `isolation_mode` is `OPEN` to all workspaces attached to the metastore. See [Catalogs](https:\/\/docs.databricks.com\/api\/workspace\/catalogs) in the REST API reference.\n2. 
Use the update `bindings` API to assign the workspaces to the external location: \n```\ncurl -L -X PATCH 'https:\/\/<workspace-url>\/api\/2.1\/unity-catalog\/bindings\/external-locations\/<my-location>' \\\n-H 'Authorization: Bearer <my-token>' \\\n-H 'Content-Type: application\/json' \\\n--data-raw '{\n\"add\": [{\"workspace_id\": <workspace-id>}, ...],\n\"remove\": [{\"workspace_id\": <workspace-id>}, ...]\n}'\n\n``` \nUse the `\"add\"` and `\"remove\"` properties to add or remove workspace bindings. \nNote \nRead-only binding (`BINDING_TYPE_READ_ONLY`) is not available for external locations. Therefore there is no reason to set `binding_type` for the external locations binding. \nTo list all workspace assignments for an external location, use the list `bindings` API: \n```\ncurl -L -X GET 'https:\/\/<workspace-url>\/api\/2.1\/unity-catalog\/bindings\/external-locations\/<my-location>' \\\n-H 'Authorization: Bearer <my-token>'\n\n``` \nSee [Workspace Bindings](https:\/\/docs.databricks.com\/api\/workspace\/workspacebindings) in the REST API reference. \n### Unbind an external location from a workspace \nInstructions for revoking workspace access to an external location using Catalog Explorer or the `bindings` API are included in [Bind an external location to one or more workspaces](https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html#bind).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Create an external location to connect cloud storage to Databricks\n##### Next steps\n\n* Grant other users permission to use external locations. See [Manage external locations](https:\/\/docs.databricks.com\/connect\/unity-catalog\/manage-external-locations.html).\n* Define managed storage locations using external locations. 
See [Specify a managed storage location in Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/managed-storage.html).\n* Define external tables using external locations. See [Create an external table](https:\/\/docs.databricks.com\/data-governance\/unity-catalog\/create-tables.html#create-an-external-table).\n* Define external volumes using external locations. See [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/external-locations.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Amazon Redshift\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article describes how to set up Lakehouse Federation to run federated queries on Amazon Redshift data that is not managed by Databricks. To learn more about Lakehouse Federation, see [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). \nTo connect to your Amazon Redshift database using Lakehouse Federation, you must create the following in your Databricks Unity Catalog metastore: \n* A *connection* to your Amazon Redshift database.\n* A *foreign catalog* that mirrors your Amazon Redshift database in Unity Catalog so that you can use Unity Catalog query syntax and data governance tools to manage Databricks user access to the database.\n\n#### Run federated queries on Amazon Redshift\n##### Before you begin\n\nWorkspace requirements: \n* Workspace enabled for Unity Catalog. \nCompute requirements: \n* Network connectivity from your Databricks Runtime cluster or SQL warehouse to the target database systems. 
See [Networking recommendations for Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/networking.html).\n* Databricks clusters must use Databricks Runtime 13.3 LTS or above and shared or single-user access mode.\n* SQL warehouses must be Pro or Serverless. \nPermissions required: \n* To create a connection, you must be a metastore admin or a user with the `CREATE CONNECTION` privilege on the Unity Catalog metastore attached to the workspace.\n* To create a foreign catalog, you must have the `CREATE CATALOG` permission on the metastore and be either the owner of the connection or have the `CREATE FOREIGN CATALOG` privilege on the connection. \nAdditional permission requirements are specified in each task-based section that follows.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/redshift.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Amazon Redshift\n##### Create a connection\n\nA connection specifies a path and credentials for accessing an external database system. To create a connection, you can use Catalog Explorer or the `CREATE CONNECTION` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** Metastore admin or user with the `CREATE CONNECTION` privilege. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. In the left pane, expand the **External Data** menu and select **Connections**.\n3. Click **Create connection**.\n4. Enter a user-friendly **Connection name**.\n5. Select a **Connection type** of **Redshift**.\n6. Enter the following connection properties for your Redshift instance. \n* **Host**: For example, `redshift-demo.us-west-2.redshift.amazonaws.com`\n* **Port**: For example, `5439`\n* **User**: For example, `redshift_user`\n* **Password**: For example, `password123`\n7. 
(Optional) Click **Test connection** to confirm that it works.\n8. (Optional) Add a comment.\n9. Click **Create**. \nRun the following command in a notebook or the Databricks SQL query editor. \n```\nCREATE CONNECTION <connection-name> TYPE redshift\nOPTIONS (\nhost '<hostname>',\nport '<port>',\nuser '<user>',\npassword '<password>'\n);\n\n``` \nWe recommend that you use Databricks [secrets](https:\/\/docs.databricks.com\/security\/secrets\/index.html) instead of plaintext strings for sensitive values like credentials. For example: \n```\nCREATE CONNECTION <connection-name> TYPE redshift\nOPTIONS (\nhost '<hostname>',\nport '<port>',\nuser secret ('<secret-scope>','<secret-key-user>'),\npassword secret ('<secret-scope>','<secret-key-password>')\n)\n\n``` \nFor information about setting up secrets, see [Secret management](https:\/\/docs.databricks.com\/security\/secrets\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/redshift.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Amazon Redshift\n##### Create a foreign catalog\n\nA foreign catalog mirrors a database in an external data system so that you can query and manage access to data in that database using Databricks and Unity Catalog. To create a foreign catalog, you use a connection to the data source that has already been defined. \nTo create a foreign catalog, you can use Catalog Explorer or the `CREATE FOREIGN CATALOG` SQL command in a Databricks notebook or the Databricks SQL query editor. \n**Permissions required:** `CREATE CATALOG` permission on the metastore and either ownership of the connection or the `CREATE FOREIGN CATALOG` privilege on the connection. \n1. In your Databricks workspace, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. Click the **Create Catalog** button.\n3. 
On the **Create a new catalog** dialog, enter a name for the catalog and select a **Type** of **Foreign**.\n4. Select the **Connection** that provides access to the database that you want to mirror as a Unity Catalog catalog.\n5. Enter the name of the **Database** that you want to mirror as a catalog.\n6. Click **Create.** \nRun the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional. Replace the placeholder values: \n* `<catalog-name>`: Name for the catalog in Databricks.\n* `<connection-name>`: The [connection object](https:\/\/docs.databricks.com\/query-federation\/index.html#connection) that specifies the data source, path, and access credentials.\n* `<database-name>`: Name of the database you want to mirror as a catalog in Databricks. \n```\nCREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>\nOPTIONS (database '<database-name>');\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/redshift.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Run federated queries on Amazon Redshift\n##### Supported pushdowns\n\nThe following pushdowns are supported: \n* Filters\n* Projections\n* Limit\n* Joins\n* Aggregates (Average, Count, Max, Min, StddevPop, StddevSamp, Sum, VarianceSamp)\n* Functions (String functions and other miscellaneous functions, such as Alias, Cast, SortOrder)\n* Sorting \nThe following pushdowns are not supported: \n* Window functions\n\n#### Run federated queries on Amazon Redshift\n##### Data type mappings\n\nWhen you read from Redshift to Spark, data types map as follows: \n| Redshift type | Spark type |\n| --- | --- |\n| numeric | DecimalType |\n| int2, int4 | IntegerType |\n| int8, oid, xid | LongType |\n| float4 | FloatType |\n| double precision, float8, money | DoubleType |\n| bpchar, char, character varying, name, super, text, tid, varchar | StringType |\n| bytea, geometry, varbyte | BinaryType |\n| bit, bool | 
BooleanType |\n| date | DateType |\n| tabstime, time, time with time zone, timetz, time without time zone, timestamp with time zone, timestamp, timestamptz, timestamp without time zone\\* | TimestampType\/TimestampNTZType | \n\\*When you read from Redshift, Redshift `Timestamp` is mapped to Spark `TimestampType` if `infer_timestamp_ntz_type = false` (default). Redshift `Timestamp` is mapped to `TimestampNTZType` if `infer_timestamp_ntz_type = true`.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/redshift.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### Get started using COPY INTO to load data\n\nThe `COPY INTO` SQL command lets you load data from a file location into a Delta table. This is a re-triable and idempotent operation; files in the source location that have already been loaded are skipped. \n`COPY INTO` offers the following capabilities: \n* Easily configurable file or directory filters from cloud storage, including S3, ADLS Gen2, ABFS, GCS, and Unity Catalog volumes.\n* Support for multiple source file formats: CSV, JSON, XML, [Avro](https:\/\/avro.apache.org\/docs\/), [ORC](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-orc.html), [Parquet](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-parquet.html), text, and binary files\n* Exactly-once (idempotent) file processing by default\n* Target table schema inference, mapping, merging, and evolution \nNote \nFor a more scalable and robust file ingestion experience, Databricks recommends that SQL users leverage streaming tables. See [Load data using streaming tables in Databricks SQL](https:\/\/docs.databricks.com\/sql\/load-data-streaming-table.html). \nWarning \n`COPY INTO` respects the workspace setting for deletion vectors. If enabled, deletion vectors are enabled on the target table when `COPY INTO` runs on a SQL warehouse or compute running Databricks Runtime 14.0 or above. 
Once enabled, deletion vectors block queries against a table in Databricks Runtime 11.3 LTS and below. See [What are deletion vectors?](https:\/\/docs.databricks.com\/delta\/deletion-vectors.html) and [Auto-enable deletion vectors](https:\/\/docs.databricks.com\/admin\/workspace-settings\/deletion-vectors.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/index.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### Get started using COPY INTO to load data\n#### Requirements\n\nAn account admin must follow the steps in [Configure data access for ingestion](https:\/\/docs.databricks.com\/ingestion\/copy-into\/configure-data-access.html) to configure access to data in cloud object storage before users can load data using `COPY INTO`.\n\n### Get started using COPY INTO to load data\n#### Example: Load data into a schemaless Delta Lake table\n\nNote \nThis feature is available in Databricks Runtime 11.3 LTS and above. \nYou can create empty placeholder Delta tables so that the schema is later inferred during a `COPY INTO` command by setting `mergeSchema` to `true` in `COPY_OPTIONS`: \n```\nCREATE TABLE IF NOT EXISTS my_table\n[COMMENT <table-description>]\n[TBLPROPERTIES (<table-properties>)];\n\nCOPY INTO my_table\nFROM '\/path\/to\/files'\nFILEFORMAT = <format>\nFORMAT_OPTIONS ('mergeSchema' = 'true')\nCOPY_OPTIONS ('mergeSchema' = 'true');\n\n``` \nThe SQL statement above is idempotent and can be scheduled to run to ingest data exactly-once into a Delta table. \nNote \nThe empty Delta table is not usable outside of `COPY INTO`. `INSERT INTO` and `MERGE INTO` are not supported to write data into schemaless Delta tables. After data is inserted into the table with `COPY INTO`, the table becomes queryable. 
\nSee [Create target tables for COPY INTO](https:\/\/docs.databricks.com\/ingestion\/copy-into\/examples.html#target-table).\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/index.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### Get started using COPY INTO to load data\n#### Example: Set schema and load data into a Delta Lake table\n\nThe following example shows how to create a Delta table and then use the `COPY INTO` SQL command to load sample data from [Databricks datasets](https:\/\/docs.databricks.com\/discover\/databricks-datasets.html) into the table. You can run the example Python, R, Scala, or SQL code from a [notebook](https:\/\/docs.databricks.com\/notebooks\/notebooks-manage.html) attached to a Databricks [cluster](https:\/\/docs.databricks.com\/compute\/index.html). You can also run the SQL code from a [query](https:\/\/docs.databricks.com\/sql\/user\/sql-editor\/index.html) associated with a [SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html) in [Databricks SQL](https:\/\/docs.databricks.com\/sql\/index.html). 
\n```\nDROP TABLE IF EXISTS default.loan_risks_upload;\n\nCREATE TABLE default.loan_risks_upload (\nloan_id BIGINT,\nfunded_amnt INT,\npaid_amnt DOUBLE,\naddr_state STRING\n);\n\nCOPY INTO default.loan_risks_upload\nFROM '\/databricks-datasets\/learning-spark-v2\/loans\/loan-risks.snappy.parquet'\nFILEFORMAT = PARQUET;\n\nSELECT * FROM default.loan_risks_upload;\n\n-- Result:\n-- +---------+-------------+-----------+------------+\n-- | loan_id | funded_amnt | paid_amnt | addr_state |\n-- +=========+=============+===========+============+\n-- | 0 | 1000 | 182.22 | CA |\n-- +---------+-------------+-----------+------------+\n-- | 1 | 1000 | 361.19 | WA |\n-- +---------+-------------+-----------+------------+\n-- | 2 | 1000 | 176.26 | TX |\n-- +---------+-------------+-----------+------------+\n-- ...\n\n``` \n```\ntable_name = 'default.loan_risks_upload'\nsource_data = '\/databricks-datasets\/learning-spark-v2\/loans\/loan-risks.snappy.parquet'\nsource_format = 'PARQUET'\n\nspark.sql(\"DROP TABLE IF EXISTS \" + table_name)\n\nspark.sql(\"CREATE TABLE \" + table_name + \" (\" \\\n\"loan_id BIGINT, \" + \\\n\"funded_amnt INT, \" + \\\n\"paid_amnt DOUBLE, \" + \\\n\"addr_state STRING)\"\n)\n\nspark.sql(\"COPY INTO \" + table_name + \\\n\" FROM '\" + source_data + \"'\" + \\\n\" FILEFORMAT = \" + source_format\n)\n\nloan_risks_upload_data = spark.sql(\"SELECT * FROM \" + table_name)\n\ndisplay(loan_risks_upload_data)\n\n'''\nResult:\n+---------+-------------+-----------+------------+\n| loan_id | funded_amnt | paid_amnt | addr_state |\n+=========+=============+===========+============+\n| 0 | 1000 | 182.22 | CA |\n+---------+-------------+-----------+------------+\n| 1 | 1000 | 361.19 | WA |\n+---------+-------------+-----------+------------+\n| 2 | 1000 | 176.26 | TX |\n+---------+-------------+-----------+------------+\n...\n'''\n\n``` \n```\nlibrary(SparkR)\nsparkR.session()\n\ntable_name = \"default.loan_risks_upload\"\nsource_data = 
\"\/databricks-datasets\/learning-spark-v2\/loans\/loan-risks.snappy.parquet\"\nsource_format = \"PARQUET\"\n\nsql(paste(\"DROP TABLE IF EXISTS \", table_name, sep = \"\"))\n\nsql(paste(\"CREATE TABLE \", table_name, \" (\",\n\"loan_id BIGINT, \",\n\"funded_amnt INT, \",\n\"paid_amnt DOUBLE, \",\n\"addr_state STRING)\",\nsep = \"\"\n))\n\nsql(paste(\"COPY INTO \", table_name,\n\" FROM '\", source_data, \"'\",\n\" FILEFORMAT = \", source_format,\nsep = \"\"\n))\n\nloan_risks_upload_data = tableToDF(table_name)\n\ndisplay(loan_risks_upload_data)\n\n# Result:\n# +---------+-------------+-----------+------------+\n# | loan_id | funded_amnt | paid_amnt | addr_state |\n# +=========+=============+===========+============+\n# | 0 | 1000 | 182.22 | CA |\n# +---------+-------------+-----------+------------+\n# | 1 | 1000 | 361.19 | WA |\n# +---------+-------------+-----------+------------+\n# | 2 | 1000 | 176.26 | TX |\n# +---------+-------------+-----------+------------+\n# ...\n\n``` \n```\nval table_name = \"default.loan_risks_upload\"\nval source_data = \"\/databricks-datasets\/learning-spark-v2\/loans\/loan-risks.snappy.parquet\"\nval source_format = \"PARQUET\"\n\nspark.sql(\"DROP TABLE IF EXISTS \" + table_name)\n\nspark.sql(\"CREATE TABLE \" + table_name + \" (\" +\n\"loan_id BIGINT, \" +\n\"funded_amnt INT, \" +\n\"paid_amnt DOUBLE, \" +\n\"addr_state STRING)\"\n)\n\nspark.sql(\"COPY INTO \" + table_name +\n\" FROM '\" + source_data + \"'\" +\n\" FILEFORMAT = \" + source_format\n)\n\nval loan_risks_upload_data = spark.table(table_name)\n\ndisplay(loan_risks_upload_data)\n\n\/*\nResult:\n+---------+-------------+-----------+------------+\n| loan_id | funded_amnt | paid_amnt | addr_state |\n+=========+=============+===========+============+\n| 0 | 1000 | 182.22 | CA |\n+---------+-------------+-----------+------------+\n| 1 | 1000 | 361.19 | WA |\n+---------+-------------+-----------+------------+\n| 2 | 1000 | 176.26 | TX 
|\n+---------+-------------+-----------+------------+\n...\n*\/\n\n``` \nTo clean up, run the following code, which deletes the table: \n```\nspark.sql(\"DROP TABLE \" + table_name)\n\n``` \n```\nsql(paste(\"DROP TABLE \", table_name, sep = \"\"))\n\n``` \n```\nspark.sql(\"DROP TABLE \" + table_name)\n\n``` \n```\nDROP TABLE default.loan_risks_upload\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/index.html"} +{"content":"# Ingest data into a Databricks lakehouse\n### Get started using COPY INTO to load data\n#### Reference\n\n* Databricks Runtime 7.x and above: [COPY INTO](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-copy-into.html)\n\n### Get started using COPY INTO to load data\n#### Additional resources\n\n* [Load data using COPY INTO with Unity Catalog volumes or external locations](https:\/\/docs.databricks.com\/ingestion\/copy-into\/unity-catalog.html) \n* [Load data using COPY INTO with an instance profile](https:\/\/docs.databricks.com\/ingestion\/copy-into\/tutorial-dbsql.html) \n* For common use patterns, including examples of multiple `COPY INTO` operations against the same Delta table, see [Common data loading patterns using COPY INTO](https:\/\/docs.databricks.com\/ingestion\/copy-into\/examples.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/copy-into\/index.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/non-uc.html"} +{"content":"# Connect to data sources\n## What is Lakehouse Federation\n#### Set up query federation for non-Unity-Catalog workspaces\n\nExperimental \nThe configurations described in this article are [Experimental](https:\/\/docs.databricks.com\/release-notes\/release-types.html). Experimental features are provided as-is and are not supported by Databricks through customer technical support. 
**To get full query federation support, you should instead use [Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html), which enables your Databricks users to take advantage of Unity Catalog syntax and data governance tools.** \nThe term *query federation* describes a collection of features that enable users and systems to run queries against multiple external data sources without needing to migrate all data to a unified system. \n**To get full query federation support, you should use [Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html), which enables your Databricks users to take advantage of Unity Catalog syntax and data governance tools.** However, if you do not have access to a Unity-Catalog-enabled workspace, you have many options for running queries against data in external data sources, including: \n* Databricks-provided [JDBC and ODBC drivers](https:\/\/docs.databricks.com\/integrations\/jdbc-odbc-bi.html) that are compatible with many BI tools.\n* Databricks partner integrations, with a number of available [BI and visualization tools](https:\/\/docs.databricks.com\/integrations\/index.html#bi) that support querying data in the lakehouse.\n* Delta Sharing, which provides an [open source protocol](https:\/\/delta.io\/sharing\/) and [extended Databricks support](https:\/\/docs.databricks.com\/data-sharing\/index.html) for sharing Delta Lake tables with users connecting from numerous third-party clients and other Databricks workspaces.\n* [Open source storage integrations](https:\/\/delta.io\/integrations) for data that uses the Delta protocol.\n* The experimental connection configurations described in the following articles: \n+ [PostgreSQL](https:\/\/docs.databricks.com\/query-federation\/postgresql-no-uc.html)\n+ [MySQL](https:\/\/docs.databricks.com\/query-federation\/mysql-no-uc.html)\n+ [Snowflake](https:\/\/docs.databricks.com\/query-federation\/snowflake-no-uc.html)\n+ 
[Redshift](https:\/\/docs.databricks.com\/query-federation\/redshift-no-uc.html)\n+ [Synapse](https:\/\/docs.databricks.com\/query-federation\/synapse-no-uc.html)\n+ [SQL Server](https:\/\/docs.databricks.com\/query-federation\/sql-server-no-uc.html) \nDrivers for the database solutions described in the articles listed above are included on all Databricks Runtime clusters, as well as Serverless and Pro SQL warehouses.\n\n","doc_uri":"https:\/\/docs.databricks.com\/query-federation\/non-uc.html"} +{"content":"# Compute\n### What is Photon?\n\n**Applies to:** ![check marked yes](https:\/\/docs.databricks.com\/_images\/check.png) Databricks SQL ![check marked yes](https:\/\/docs.databricks.com\/_images\/check.png) Databricks Runtime 9.1 and above ![check marked yes](https:\/\/docs.databricks.com\/_images\/check.png) Databricks Runtime 15.2 ML and above \nLearn about the advantages of running your workloads on Photon, the features it supports, and how to enable or disable Photon. Photon is turned on by default in Databricks SQL warehouses and is compatible with Apache Spark APIs, so it works with your existing code.\n\n### What is Photon?\n#### What is Photon used for?\n\nPhoton is a high-performance Databricks-native vectorized query engine that runs your SQL workloads and DataFrame API calls faster to reduce your total cost per workload. \nThe following are key features and advantages of using Photon. 
\n* Support for SQL and equivalent DataFrame operations with Delta and Parquet tables.\n* Accelerated queries that process data faster and include aggregations and joins.\n* Faster performance when data is accessed repeatedly from the disk cache.\n* Robust scan performance on tables with many columns and many small files.\n* Faster Delta and Parquet writing using `UPDATE`, `DELETE`, `MERGE INTO`, `INSERT`, and `CREATE TABLE AS SELECT`, including wide tables that contain thousands of columns.\n* Replaces sort-merge joins with hash-joins.\n* For AI and ML workloads, Photon improves performance for applications using Spark SQL, Spark DataFrames, feature engineering, GraphFrames, and xgboost4j.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/photon.html"} +{"content":"# Compute\n### What is Photon?\n#### Get started with Photon\n\nPhoton is enabled by default on clusters running [Databricks Runtime 9.1 LTS](https:\/\/docs.databricks.com\/release-notes\/runtime\/9.1lts.html) and above. Photon is also available on clusters running [Databricks Runtime 15.2 for Machine Learning](https:\/\/docs.databricks.com\/release-notes\/runtime\/15.2ml.html) and above. \nTo manually disable or enable Photon on your cluster, select the **Use Photon Acceleration** checkbox when you [create or edit the cluster](https:\/\/docs.databricks.com\/compute\/configure.html#photon-image). \nIf you create a cluster using the [Clusters API](https:\/\/docs.databricks.com\/api\/workspace\/clusters), set `runtime_engine` to `PHOTON`. \n### Instance types \nPhoton supports a number of instance types on the driver and worker nodes. Photon instance types consume DBUs at a different rate than the same instance type running the non-Photon runtime. 
For more information about Photon instances and DBU consumption, see the [Databricks pricing page](https:\/\/databricks.com\/product\/aws-pricing\/instance-types).\n\n### What is Photon?\n#### Operators, expressions, and data types\n\nThe following are the operators, expressions, and data types that Photon covers. \n**Operators** \n* Scan, Filter, Project\n* Hash Aggregate\/Join\/Shuffle\n* Nested-Loop Join\n* Null-Aware Anti Join\n* Union, Expand, ScalarSubquery\n* Delta\/Parquet Write Sink\n* Sort\n* Window Function \n**Expressions** \n* Comparison \/ Logic\n* Arithmetic \/ Math (most)\n* Conditional (IF, CASE, etc.)\n* String (common ones)\n* Casts\n* Aggregates(most common ones)\n* Date\/Timestamp \n**Data types** \n* Byte\/Short\/Int\/Long\n* Boolean\n* String\/Binary\n* Decimal\n* Float\/Double\n* Date\/Timestamp\n* Struct\n* Array\n* Map\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/photon.html"} +{"content":"# Compute\n### What is Photon?\n#### Features that require Photon\n\nThe following are features that require Photon. \n* Predictive I\/O for read and write. See [What is predictive I\/O?](https:\/\/docs.databricks.com\/optimizations\/predictive-io.html).\n* H3 geospatial expressions. See [H3 geospatial functions](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-h3-geospatial-functions.html).\n* Dynamic file pruning. See [Dynamic file pruning](https:\/\/docs.databricks.com\/optimizations\/dynamic-file-pruning.html).\n\n### What is Photon?\n#### Limitations\n\n* Structured Streaming: Photon currently supports stateless streaming with Delta, Parquet, CSV, and JSON. Stateless Kafka and Kinesis streaming is supported when writing to a Delta or Parquet sink.\n* Photon does not support UDFs or RDD APIs.\n* Photon doesn\u2019t impact queries that normally run in under two seconds. 
\nFeatures not supported by Photon run the same way they would with Databricks Runtime.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/photon.html"} +{"content":"# Databricks data engineering\n## What is DBFS?\n#### What are the root directories?\n\nDatabricks historically used directories in the workspace root directory for common storage locations. Most of these locations are deprecated. \n`\/Volumes` provides an alias for path-based access to data in Unity Catalog volumes. See [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html). \n* `\/databricks-datasets`\n* `\/user\/hive\/warehouse`\n* `\/FileStore`\n* `\/databricks-results`\n* `\/databricks\/init`\n\n#### What are the root directories?\n##### What is stored in the `\/databricks-datasets` directory?\n\nThe `\/databricks-datasets` directory is available on all access mode configurations unless custom workspace permissions set by workspace administrators prevent access. \nDatabricks provides a number of open source datasets in this directory. Many of the tutorials and demos provided by Databricks reference these datasets, but you can also use them to independently explore the functionality of Databricks.\n\n#### What are the root directories?\n##### What is stored in the `\/user\/hive\/warehouse` directory?\n\nThis is the default location for data for managed tables registered to the `hive_metastore`.\n\n#### What are the root directories?\n##### What is stored in the `\/FileStore` directory?\n\nThe `\/FileStore` directory might contain data and libraries uploaded through the Databricks UI or image files for generated plots. 
\nThis is primarily legacy behavior, and most UI options now upload files using either workspace files or volumes.\n\n#### What are the root directories?\n##### What is stored in the `\/databricks-results` directory?\n\n`\/databricks-results` stores files generated by downloading the [full results](https:\/\/docs.databricks.com\/notebooks\/notebook-outputs.html#download-full-results) of a query.\n\n","doc_uri":"https:\/\/docs.databricks.com\/dbfs\/root-locations.html"} +{"content":"# Databricks data engineering\n## What is DBFS?\n#### What are the root directories?\n##### What is stored in the `\/databricks\/init` directory?\n\nSome workspaces might contain this directory, which was used to hold legacy global init scripts, which should not be used. See [Global init scripts (legacy)](https:\/\/docs.databricks.com\/archive\/init-scripts\/legacy-global.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dbfs\/root-locations.html"} +{"content":"# \n### Conceptual overview\n\nPreview \nThis feature is in [Private Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). To try it, reach out to your Databricks contact. \n*Looking for a different RAG Studio doc?* [Go to the RAG documentation index](https:\/\/docs.databricks.com\/rag-studio\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/rag-studio\/concepts\/index.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n### Query databases using JDBC\n##### Query MySQL with Databricks\n\nThis example queries MySQL using its JDBC driver. For more details on reading, writing, configuring parallelism, and query pushdown, see [Query databases using JDBC](https:\/\/docs.databricks.com\/connect\/external-systems\/jdbc.html). \nNote \nYou may prefer Lakehouse Federation for managing queries to MySQL. 
See [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/mysql.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n### Query databases using JDBC\n##### Query MySQL with Databricks\n###### Using JDBC\n\n```\ndriver = \"org.mariadb.jdbc.Driver\"\n\ndatabase_host = \"<database-host-url>\"\ndatabase_port = \"3306\" # update if you use a non-default port\ndatabase_name = \"<database-name>\"\ntable = \"<table-name>\"\nuser = \"<username>\"\npassword = \"<password>\"\n\nurl = f\"jdbc:mysql:\/\/{database_host}:{database_port}\/{database_name}\"\n\nremote_table = (spark.read\n.format(\"jdbc\")\n.option(\"driver\", driver)\n.option(\"url\", url)\n.option(\"dbtable\", table)\n.option(\"user\", user)\n.option(\"password\", password)\n.load()\n)\n\n``` \n```\nval driver = \"org.mariadb.jdbc.Driver\"\n\nval database_host = \"<database-host-url>\"\nval database_port = \"3306\" \/\/ update if you use a non-default port\nval database_name = \"<database-name>\"\nval table = \"<table-name>\"\nval user = \"<username>\"\nval password = \"<password>\"\n\nval url = s\"jdbc:mysql:\/\/${database_host}:${database_port}\/${database_name}\"\n\nval remote_table = spark.read\n.format(\"jdbc\")\n.option(\"driver\", driver)\n.option(\"url\", url)\n.option(\"dbtable\", table)\n.option(\"user\", user)\n.option(\"password\", password)\n.load()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/mysql.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n### Query databases using JDBC\n##### Query MySQL with Databricks\n###### Using the MySQL connector in Databricks Runtime\n\nUsing Databricks Runtime 11.3 LTS and above, you can use the named connector to query MySQL. 
See the following examples: \n```\nremote_table = (spark.read\n.format(\"mysql\")\n.option(\"dbtable\", \"table_name\")\n.option(\"host\", \"database_hostname\")\n.option(\"port\", \"3306\") # Optional - will use default port 3306 if not specified.\n.option(\"database\", \"database_name\")\n.option(\"user\", \"username\")\n.option(\"password\", \"password\")\n.load()\n)\n\n``` \n```\nDROP TABLE IF EXISTS mysql_table;\nCREATE TABLE mysql_table\nUSING mysql\nOPTIONS (\ndbtable '<table-name>',\nhost '<database-host-url>',\nport '3306', \/* Optional - will use default port 3306 if not specified. *\/\ndatabase '<database-name>',\nuser '<username>',\npassword '<password>'\n);\nSELECT * FROM mysql_table;\n\n``` \n```\nval remote_table = spark.read\n.format(\"mysql\")\n.option(\"dbtable\", \"table_name\")\n.option(\"host\", \"database_hostname\")\n.option(\"port\", \"3306\") \/\/ Optional - will use default port 3306 if not specified.\n.option(\"database\", \"database_name\")\n.option(\"user\", \"username\")\n.option(\"password\", \"password\")\n.load()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/mysql.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Infoworks\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). 
\nInfoworks DataFoundry is an automated enterprise data operations and orchestration system that runs natively on Databricks\nand leverages the full power of Databricks to deliver an easy solution for data onboarding\u2014an important first step in operationalizing your data lake.\nDataFoundry not only automates data ingestion, but also automates the key functionality that must accompany ingestion to establish a foundation for analytics.\nData onboarding with DataFoundry automates: \n* Data ingestion: from all enterprise and external data sources\n* Data synchronization: CDC to keep data synchronized with the source\n* Data governance: cataloging, lineage, metadata management, audit, and history \nHere are the steps for using Infoworks with Databricks.\n\n#### Connect to Infoworks\n##### Step 1: Generate a Databricks personal access token\n\nInfoworks authenticates with Databricks using a Databricks personal access token. \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/infoworks.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Infoworks\n##### Step 2: Set up a cluster to support integration needs\n\nInfoworks will write data to an S3 bucket and the Databricks integration cluster will read data from that location. 
Therefore the integration cluster requires secure access to the S3 bucket. \n### Secure access to an S3 bucket \nTo access AWS resources, you can launch the Databricks integration cluster with an instance profile. The instance profile should have access to the staging S3 bucket and the target S3 bucket where you want to write the Delta tables. To create an instance profile and configure the integration cluster to use the role, follow the instructions in [Tutorial: Configure S3 access with an instance profile](https:\/\/docs.databricks.com\/connect\/storage\/tutorial-s3-instance-profile.html). \nAs an alternative, you can use [IAM credential passthrough](https:\/\/docs.databricks.com\/archive\/credential-passthrough\/iam-passthrough.html), which enables user-specific access to S3 data from a shared cluster. \n### Specify the cluster configuration \n1. Set **Cluster Mode** to **Standard**.\n2. Set **Databricks Runtime Version** to a Databricks runtime version.\n3. Enable [optimized writes and auto compaction](https:\/\/docs.databricks.com\/delta\/tune-file-size.html) by adding the following properties to your [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration): \n```\nspark.databricks.delta.optimizeWrite.enabled true\nspark.databricks.delta.autoCompact.enabled true\n\n```\n4. Configure your cluster depending on your integration and scaling needs. \nFor cluster configuration details, see [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html). 
\nSee [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html) for the steps to obtain the JDBC URL and HTTP path.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/infoworks.html"} +{"content":"# Technology partners\n## Connect to ingestion partners using Partner Connect\n#### Connect to Infoworks\n##### Step 3: Obtain JDBC and ODBC connection details to connect to a cluster\n\nTo connect a Databricks cluster to Infoworks you need the following JDBC\/ODBC connection properties: \n* JDBC URL\n* HTTP Path\n\n#### Connect to Infoworks\n##### Step 4: Get Infoworks for Databricks\n\nGo to [Infoworks](https:\/\/www.infoworks.io\/datafoundry-for-databricks\/?utm_source=Databricks&utm_medium=website&utm_campaign=Databricks%20Ingestion%20Network%20Launch&utm_term=ingestion%2C%20onboarding) to learn more and get a demo.\n\n#### Connect to Infoworks\n##### Additional resources\n\n[Support](https:\/\/support.infoworks.io\/support\/home)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/ingestion\/infoworks.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Use the Databricks interactive debugger\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis page describes how to use the built-in interactive debugger in the Databricks notebook. The debugger is available only for Python. 
\nThe interactive debugger provides breakpoints, step-by-step execution, variable inspection, and more tools to help you develop code in notebooks more efficiently.\n\n#### Use the Databricks interactive debugger\n##### Requirements\n\nYour notebook must be attached to a cluster that meets the following requirements: \n* Databricks Runtime version 13.3 LTS or above.\n* The access mode must be **Single user** (Assigned) or **No isolation shared**.\n\n#### Use the Databricks interactive debugger\n##### Enable or disable the debugger\n\nTo enable or disable the debugger, do the following: \n1. Click your username at the upper-right of the workspace and select **Settings** from the dropdown list.\n2. In the **Settings** sidebar, select **Developer**.\n3. In the **Experimental features** section, toggle **Python Notebook Interactive Debugger**.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/debugger.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Use the Databricks interactive debugger\n##### Start the debugger\n\nTo start the debugger, follow these steps: \n1. Add one or more breakpoints by clicking in the gutter of a cell. To remove a breakpoint, click on it again. \n![create and remove breakpoints video](https:\/\/docs.databricks.com\/_images\/breakpoints.gif)\n2. Do one of the following: \n* Click **Run > Debug cell**.\n* Use the keyboard shortcut **Ctrl + Shift + D**.\n* From the cell run menu, select **Debug cell**.\n![debug cell item in cell run menu](https:\/\/docs.databricks.com\/_images\/debug-in-cell-menu.png) \nA debug session starts automatically and runs the selected cell. \nYou can also start the debugger if a cell triggers an error. At the bottom of the cell output, click ![Debug button](https:\/\/docs.databricks.com\/_images\/debug-button.png). 
\nWhen a debug session is active, the debug toolbar ![Debug toolbar](https:\/\/docs.databricks.com\/_images\/debug-toolbar.png) appears at the top of the cell.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/debugger.html"} +{"content":"# Databricks data engineering\n## Introduction to Databricks notebooks\n#### Use the Databricks interactive debugger\n##### Use the debugger\n\n![annotated debugger toolbar](https:\/\/docs.databricks.com\/_images\/debugger-toolbar.png) \nIn a debug session, you can do the following: \n* Set or remove breakpoints.\n* View the values of variables at a breakpoint.\n* Step through the code.\n* Step into or out of a function. \nWhen the code reaches a breakpoint, it stops before the line is run, not after. \nUse the buttons in the debugger toolbar to step through the code. As you step through the code, the current line is highlighted in the cell. You can view variable values in the variable explorer pane in the right sidebar. \nWhen you step through a function, local function variables appear in the variable pane, marked `[local]`.\n\n#### Use the Databricks interactive debugger\n##### Terminate a debugger session\n\nTo end the debugging session, click ![debugger stop button](https:\/\/docs.databricks.com\/_images\/debugger-stop.png) at the upper-left of the cell or click ![notebook stop button](https:\/\/docs.databricks.com\/_images\/stop-button.png) at the top of the notebook.\n\n#### Use the Databricks interactive debugger\n##### Limitations\n\n* The debugger works only with Python. 
It does not support Scala or R.\n* The debugger does not work on **Shared** access mode clusters.\n* The debugger does not support stepping into external files or modules.\n* When a debug session is active, you cannot run other commands in the notebook.\n\n","doc_uri":"https:\/\/docs.databricks.com\/notebooks\/debugger.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n#### Interoperability and usability for the data lakehouse\n\nThis article covers architectural principles of the **interoperability and usability** pillar, referring to the lakehouse\u2019s interaction with users and other systems. One of the fundamental ideas of the lakehouse is to provide a great user experience for all the personas that work with it, and to be able to interact with a wide ecosystem of external systems. \n* **Interoperability** is the ability of a system to work with and integrate with other systems. It implies interaction between different components and products, possibly from multiple vendors, and between past and future versions of the same product.\n* **Usability** is the characteristic of a system to provide its users with the best possible experience to perform tasks safely, effectively, and efficiently. \n![Interoperability and usability lakehouse architecture diagram for Databricks.](https:\/\/docs.databricks.com\/_images\/interoperability-usability.png) \nFollowing the principles of this pillar helps to: \n* Achieve a consistent and collaborative user experience.\n* Leverage synergies across clouds.\n* Simplify integration from and to the lakehouse.\n* Reduce training and enablement costs. 
\nAnd ultimately lead to a faster time-to-value.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/interoperability-and-usability\/index.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n#### Interoperability and usability for the data lakehouse\n##### Principles of interoperability and usability\n\n1. **Define standards for integration** \nIntegration has different aspects and can be done in many different ways. To avoid proliferating tools and approaches, best practices must be defined and a list of well-supported and preferred tools and connectors should be provided. \nOne of the key architectural principles is modularity and loose coupling rather than tight integration. This reduces dependencies between components and workloads, helps eliminate side effects, and enables independent development on different time scales. Use datasets and their schema as a contract. Separate workloads such as data wrangling jobs (such as loading and transforming data into a data lake) from value-adding jobs (for example reporting, dashboards, and data science feature engineering). Define a central data catalog with guidelines for data formats, data quality, and data lifecycle.\n2. **Prefer open interfaces and open data formats** \nOften, solutions are developed where data can only be accessed through a specific system. This can lead to vendor lock-in, but it can also become a huge cost driver if data access via that system is subject to license fees. Using open data formats and interfaces helps to avoid this. They also simplify integration with existing systems and open up an ecosystem of partners who have already integrated their tools with the lakehouse. \nIf you use open source ecosystems such as Python or R for data science, or Spark or ANSI SQL for data access and access rights control, you will have an easier time finding personnel for projects. 
It will also simplify potential migrations to and from a platform.\n3. **Lower the barriers for implementing use cases** \nTo get the most out of the data in the data lake, users must be able to easily deploy their use cases on the platform. This starts with lean processes around platform access and data management. For example, self-service access to the platform helps prevent a central team from becoming a bottleneck. Shared environments and predefined blueprints for deploying new environments ensure that the platform is quickly available to any business user.\n4. **Ensure data consistency and usability** \nTwo important activities on a data platform are *data publishing* and *data consumption*. From a publishing perspective, data should be offered as a product. Publishers need to follow a defined lifecycle with consumers in mind, and the data needs to be clearly defined with managed schemas, descriptions, and so on. \nIt is also important to provide semantically consistent data so that consumers can easily understand and correctly combine different data sets. 
In addition, all data must be easily discoverable and accessible to consumers through a central catalog with properly curated metadata and data lineage.\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/interoperability-and-usability\/index.html"} +{"content":"# Introduction to the well-architected data lakehouse\n## Data lakehouse architecture: Databricks well-architected framework\n#### Interoperability and usability for the data lakehouse\n##### Next: Best practices for interoperability and usability\n\nSee [Best practices for interoperability and usability](https:\/\/docs.databricks.com\/lakehouse-architecture\/interoperability-and-usability\/best-practices.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-architecture\/interoperability-and-usability\/index.html"} +{"content":"# Connect to data sources\n### Connect to external systems\n\nDatabricks provides built-in integrations to many cloud-native data systems, as well as extensible JDBC support to connect to other data systems. \nThe connectors documented in this section mostly focus on configuring a connection to a single table in the external data system. You can use some of these drivers to write data back to external systems as well. \nFor read-only data connections, Databricks recommends using Lakehouse Federation, which enables syncing entire databases to Databricks from external systems and is governed by Unity Catalog. See [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). \nPartner Connect also provides integrations to many popular enterprise data systems. Many Partner Connect solutions not only connect to data sources, but also facilitate easy ETL to keep data in your lakehouse fresh. 
See [What is Databricks Partner Connect?](https:\/\/docs.databricks.com\/partner-connect\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/index.html"} +{"content":"# Connect to data sources\n### Connect to external systems\n#### What data sources connect to Databricks with JDBC?\n\nYou can use [JDBC](https:\/\/docs.databricks.com\/connect\/external-systems\/jdbc.html) to connect with many data sources. Databricks Runtime includes drivers for a number of JDBC databases, but you might need to install a driver or different driver version to connect to your preferred database. Supported databases include the following: \n* [Query PostgreSQL with Databricks](https:\/\/docs.databricks.com\/connect\/external-systems\/postgresql.html)\n* [Query MySQL with Databricks](https:\/\/docs.databricks.com\/connect\/external-systems\/mysql.html)\n* [Query MariaDB with Databricks](https:\/\/docs.databricks.com\/connect\/external-systems\/mariadb.html)\n* [Query SQL Server with Databricks](https:\/\/docs.databricks.com\/connect\/external-systems\/sql-server.html)\n* [Use the Databricks connector to connect to another Databricks workspace](https:\/\/docs.databricks.com\/connect\/external-systems\/databricks.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/index.html"} +{"content":"# Connect to data sources\n### Connect to external systems\n#### What data services does Databricks integrate with?\n\nThe following data services require you to configure connection settings, security credentials, and networking settings. You might need administrator or power user privileges in your AWS account or Databricks workspace. 
Some also require that you create a Databricks [library](https:\/\/docs.databricks.com\/libraries\/index.html) and install it in a cluster: \n* [Query Amazon Redshift using Databricks](https:\/\/docs.databricks.com\/connect\/external-systems\/amazon-redshift.html)\n* [Google BigQuery](https:\/\/docs.databricks.com\/connect\/external-systems\/bigquery.html)\n* [MongoDB](https:\/\/docs.databricks.com\/connect\/external-systems\/mongodb.html)\n* [Cassandra](https:\/\/docs.databricks.com\/connect\/external-systems\/cassandra.html)\n* [Couchbase](https:\/\/docs.databricks.com\/connect\/external-systems\/couchbase.html)\n* [ElasticSearch](https:\/\/docs.databricks.com\/connect\/external-systems\/elasticsearch.html)\n* [Read and write data from Snowflake](https:\/\/docs.databricks.com\/connect\/external-systems\/snowflake.html)\n* [Query data in Azure Synapse Analytics](https:\/\/docs.databricks.com\/connect\/external-systems\/synapse-analytics.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/index.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n#### Delta Live Tables language references\n\nThis article has information on the programming interfaces available to implement Delta Live Tables pipelines and has links to documentation with detailed specifications and examples for each interface. \nData loading and transformations are implemented in a Delta Live Tables pipeline by queries that define streaming tables and materialized views. To implement these queries, Delta Live Tables supports SQL and Python interfaces. Because these interfaces provide equivalent functionality for most data processing use cases, pipeline developers can choose the interface that they are most comfortable with. 
The articles in this section are detailed references for the SQL and Python interfaces and should be used by developers as they implement pipelines in their interface of choice.\n\n#### Delta Live Tables language references\n##### Delta Live Tables SQL language reference\n\nFor pipeline developers familiar with writing queries in SQL, Delta Live Tables has a simple but powerful SQL interface designed to support the loading and transformation of data. To learn about the details of the SQL interface, including how to define streaming tables for tasks such as loading data and materialized views for transforming data, see [Delta Live Tables SQL language reference](https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html).\n\n#### Delta Live Tables language references\n##### Delta Live Tables Python language reference\n\nFor Python developers, Delta Live Tables has a Python interface designed to support the loading and transformation of data. For tasks that require processing not supported by SQL, developers can use Python to write pipeline source code that combines Delta Live Tables queries with Python functions that implement the processing not supported by the Delta Live Tables interfaces. To learn about the Delta Live Tables Python interface, including detailed specifications for the Python functions included in the interface, see [Delta Live Tables Python language reference](https:\/\/docs.databricks.com\/delta-live-tables\/python-ref.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/language-references.html"} +{"content":"# \n### File metadata column\n\nYou can get metadata information for input files with the `_metadata` column. The `_metadata` column is a *hidden* column, and is available for all input file formats. To include the `_metadata` column in the returned DataFrame, you must explicitly reference it in your query. 
\nIf the data source contains a column named `_metadata`, queries return the column from the data source, and not the file metadata. \nWarning \nNew fields might be added to the `_metadata` column in future releases. To prevent schema evolution errors if the `_metadata` column is updated, Databricks recommends selecting specific fields from the column in your queries. See [examples](https:\/\/docs.databricks.com\/ingestion\/file-metadata-column.html#metadata-examples).\n\n### File metadata column\n#### Supported metadata\n\nThe `_metadata` column is a `STRUCT` containing the following fields: \n| Name | Type | Description | Example | Minimum Databricks Runtime release |\n| --- | --- | --- | --- | --- |\n| **file\\_path** | `STRING` | File path of the input file. | `file:\/tmp\/f0.csv` | 10.5 |\n| **file\\_name** | `STRING` | Name of the input file along with its extension. | `f0.csv` | 10.5 |\n| **file\\_size** | `LONG` | Length of the input file, in bytes. | 628 | 10.5 |\n| **file\\_modification\\_time** | `TIMESTAMP` | Last modification timestamp of the input file. | `2021-12-20 20:05:21` | 10.5 |\n| **file\\_block\\_start** | `LONG` | Start offset of the block being read, in bytes. | 0 | 13.0 |\n| **file\\_block\\_length** | `LONG` | Length of the block being read, in bytes. 
| 628 | 13.0 |\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/file-metadata-column.html"} +{"content":"# \n### File metadata column\n#### Examples\n\n### Use in a basic file-based data source reader \n```\ndf = spark.read \\\n.format(\"csv\") \\\n.schema(schema) \\\n.load(\"dbfs:\/tmp\/*\") \\\n.select(\"*\", \"_metadata\")\n\ndisplay(df)\n\n'''\nResult:\n+---------+-----+----------------------------------------------------+\n| name | age | _metadata |\n+=========+=====+====================================================+\n| | | { |\n| | | \"file_path\": \"dbfs:\/tmp\/f0.csv\", |\n| Debbie | 18 | \"file_name\": \"f0.csv\", |\n| | | \"file_size\": 12, |\n| | | \"file_block_start\": 0, |\n| | | \"file_block_length\": 12, |\n| | | \"file_modification_time\": \"2021-07-02 01:05:21\" |\n| | | } |\n+---------+-----+----------------------------------------------------+\n| | | { |\n| | | \"file_path\": \"dbfs:\/tmp\/f1.csv\", |\n| Frank | 24 | \"file_name\": \"f1.csv\", |\n| | | \"file_size\": 12, |\n| | | \"file_block_start\": 0, |\n| | | \"file_block_length\": 12, |\n| | | \"file_modification_time\": \"2021-12-20 02:06:21\" |\n| | | } |\n+---------+-----+----------------------------------------------------+\n'''\n\n``` \n```\nval df = spark.read\n.format(\"csv\")\n.schema(schema)\n.load(\"dbfs:\/tmp\/*\")\n.select(\"*\", \"_metadata\")\n\ndisplay(df)\n\n\/* Result:\n+---------+-----+----------------------------------------------------+\n| name | age | _metadata |\n+=========+=====+====================================================+\n| | | { |\n| | | \"file_path\": \"dbfs:\/tmp\/f0.csv\", |\n| Debbie | 18 | \"file_name\": \"f0.csv\", |\n| | | \"file_size\": 12, |\n| | | \"file_block_start\": 0, |\n| | | \"file_block_length\": 12, |\n| | | \"file_modification_time\": \"2021-07-02 01:05:21\" |\n| | | } |\n+---------+-----+----------------------------------------------------+\n| | | { |\n| | | \"file_path\": \"dbfs:\/tmp\/f1.csv\", |\n| Frank | 24 | 
\"file_name\": \"f1.csv\", |\n| | | \"file_size\": 10, |\n| | | \"file_block_start\": 0, |\n| | | \"file_block_length\": 12, |\n| | | \"file_modification_time\": \"2021-12-20 02:06:21\" |\n| | | } |\n+---------+-----+----------------------------------------------------+\n*\/\n\n``` \n### Select specific fields \n```\nspark.read \\\n.format(\"csv\") \\\n.schema(schema) \\\n.load(\"dbfs:\/tmp\/*\") \\\n.select(\"_metadata.file_name\", \"_metadata.file_size\")\n\n``` \n```\nspark.read\n.format(\"csv\")\n.schema(schema)\n.load(\"dbfs:\/tmp\/*\")\n.select(\"_metadata.file_name\", \"_metadata.file_size\")\n\n``` \n### Use in filters \n```\nspark.read \\\n.format(\"csv\") \\\n.schema(schema) \\\n.load(\"dbfs:\/tmp\/*\") \\\n.select(\"*\") \\\n.filter(col(\"_metadata.file_name\") == lit(\"test.csv\"))\n\n``` \n```\nspark.read\n.format(\"csv\")\n.schema(schema)\n.load(\"dbfs:\/tmp\/*\")\n.select(\"*\")\n.filter(col(\"_metadata.file_name\") === lit(\"test.csv\"))\n\n``` \n### Use in COPY INTO \n```\nCOPY INTO my_delta_table\nFROM (\nSELECT *, _metadata FROM 's3:\/\/my-bucket\/csvData'\n)\nFILEFORMAT = CSV\n\n``` \n### Use in Auto Loader \nNote \nWhen writing the `_metadata` column, we rename it to `source_metadata`. Writing it as `_metadata` would make it impossible to access the metadata column in the target table, because if the data source contains a column named `_metadata`, queries will return the column from the data source, and not the file metadata. 
\n```\nspark.readStream \\\n.format(\"cloudFiles\") \\\n.option(\"cloudFiles.format\", \"csv\") \\\n.schema(schema) \\\n.load(\"s3:\/\/my-bucket\/csvData\") \\\n.selectExpr(\"*\", \"_metadata as source_metadata\") \\\n.writeStream \\\n.format(\"delta\") \\\n.option(\"checkpointLocation\", checkpointLocation) \\\n.start(targetTable)\n\n``` \n```\nspark.readStream\n.format(\"cloudFiles\")\n.option(\"cloudFiles.format\", \"csv\")\n.schema(schema)\n.load(\"s3:\/\/my-bucket\/csvData\")\n.selectExpr(\"*\", \"_metadata as source_metadata\")\n.writeStream\n.format(\"delta\")\n.option(\"checkpointLocation\", checkpointLocation)\n.start(targetTable)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/file-metadata-column.html"} +{"content":"# \n### File metadata column\n#### Related articles\n\n* [COPY INTO](https:\/\/docs.databricks.com\/sql\/language-manual\/delta-copy-into.html)\n* [Auto Loader](https:\/\/docs.databricks.com\/ingestion\/auto-loader\/index.html)\n* [Structured Streaming](https:\/\/docs.databricks.com\/structured-streaming\/index.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/ingestion\/file-metadata-column.html"} +{"content":"# \n### Databricks reference documentation\n\nThis article contains links to Databricks reference documentation and guidance.\n\n### Databricks reference documentation\n#### API reference documentation\n\nDatabricks provides the following API reference documentation: \n* [REST API reference](https:\/\/docs.databricks.com\/api\/workspace\/introduction)\n* [Apache Spark APIs](https:\/\/docs.databricks.com\/reference\/spark.html)\n* [MLflow API](https:\/\/docs.databricks.com\/reference\/mlflow-api.html)\n* [Feature Store Python API](https:\/\/docs.databricks.com\/dev-tools\/api\/python\/latest\/index.html)\n* [Delta Lake API](https:\/\/docs.databricks.com\/reference\/delta-lake.html)\n* [Delta Live Tables API](https:\/\/docs.databricks.com\/delta-live-tables\/api-guide.html)\n\n### Databricks reference 
documentation\n#### SQL language reference documentation\n\n* [SQL language reference](https:\/\/docs.databricks.com\/sql\/language-manual\/index.html)\n* [Delta Live Tables SQL language reference](https:\/\/docs.databricks.com\/delta-live-tables\/sql-ref.html)\n\n### Databricks reference documentation\n#### CLI reference documentation\n\n* [Databricks CLI](https:\/\/docs.databricks.com\/dev-tools\/cli\/commands.html)\n* [Databricks SQL CLI](https:\/\/docs.databricks.com\/dev-tools\/databricks-sql-cli.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/reference\/api.html"} +{"content":"# \n### Databricks reference documentation\n#### SDK reference documentation\n\n* [Databricks SDK for Python](https:\/\/databricks-sdk-py.readthedocs.io)\n* [Databricks SDK for R](https:\/\/databrickslabs.github.io\/databricks-sdk-r\/index.html)\n* [Databricks SDK for Java](https:\/\/javadoc.io\/doc\/com.databricks\/databricks-sdk-java)\n* [Databricks SDK for Go](https:\/\/pkg.go.dev\/github.com\/databricks\/databricks-sdk-go)\n\n### Databricks reference documentation\n#### Common error codes in Databricks\n\n* [SQLSTATE error codes](https:\/\/docs.databricks.com\/error-messages\/sqlstates.html)\n* [Error classes in Databricks](https:\/\/docs.databricks.com\/error-messages\/error-classes.html)\n\n### Databricks reference documentation\n#### Additional developer resources\n\nFor additional resources on developing with Python, SparkR, and Scala on Databricks, see: \n* [Databricks for Python developers](https:\/\/docs.databricks.com\/languages\/python.html)\n* [Databricks for R developers](https:\/\/docs.databricks.com\/sparkr\/index.html)\n* [Databricks for Scala developers](https:\/\/docs.databricks.com\/languages\/scala.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/reference\/api.html"} +{"content":"# Transform data\n### Aggregate data on Databricks\n\nThis article introduces the general semantics for aggregation and discusses the differences between results computed using 
batch queries, materialized views, and streaming.\n\n### Aggregate data on Databricks\n#### Batch aggregates\n\nBatch aggregation is the default behavior observed when running an ad hoc query in SQL or processing data with Apache Spark DataFrames. \nAn aggregate query written against a table or data source computes the aggregate statistics for all records in the data source. Databricks leverages optimizations and metadata whenever possible to optimize these queries, and can compute many aggregates efficiently for large datasets. \nBatch aggregation latency and compute costs can increase as data size increases, and pre-computing frequently referenced aggregate values can save users substantial time and money. Databricks recommends using materialized views to incrementally update aggregate values. See [Incremental aggregates](https:\/\/docs.databricks.com\/transform\/aggregation.html#incremental).\n\n### Aggregate data on Databricks\n#### Stateful aggregates\n\nAggregates defined in streaming workloads are stateful. Stateful aggregates track observed records over time and recompute results when processing new data. \nYou must use watermarks when computing stateful aggregates. Omitting a watermark from a stateful aggregate query results in state information building up indefinitely over time. This results in processing slowdowns and can lead to out-of-memory errors. \nYou should not use a stateful aggregate to calculate statistics over an entire dataset. Databricks recommends using materialized views for incremental aggregate calculation on an entire dataset. See [Incremental aggregates](https:\/\/docs.databricks.com\/transform\/aggregation.html#incremental). 
\nConfiguring workloads that compute stateful aggregates efficiently and correctly requires understanding how data arrives from source systems and how Databricks uses watermarks, output modes, and trigger intervals to control query state and results computation.\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/aggregation.html"} +{"content":"# Transform data\n### Aggregate data on Databricks\n#### Incremental aggregates\n\nYou can use materialized views to compute many aggregate values incrementally. Materialized views automatically track changes in the data source and apply appropriate updates to aggregate values on refresh. The results returned by a materialized view are equivalent to those returned by recomputing aggregate results on source data with a batch job or ad hoc query.\n\n### Aggregate data on Databricks\n#### Approximate aggregates\n\nWhile Databricks excels at computing on extremely large datasets, using approximation for aggregates can accelerate query processing and reduce costs when you don\u2019t require precise results. \nUsing `LIMIT` statements is sometimes good enough for getting a quick snapshot of data, but does not introduce randomness or guarantee that sampling is distributed across the dataset. \nSpark SQL has the following native methods for approximating aggregations on numeric or categorical data: \n* [approx\\_count\\_distinct aggregate function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/approx_count_distinct.html)\n* [approx\\_percentile aggregate function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/approx_percentile.html)\n* [approx\\_top\\_k aggregate function](https:\/\/docs.databricks.com\/sql\/language-manual\/functions\/approx_top_k.html) \nYou can also specify a sample percent with `TABLESAMPLE` to generate a random sample from a dataset and calculate approximate aggregates. 
See [TABLESAMPLE clause](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-qry-select-sampling.html).\n\n### Aggregate data on Databricks\n#### Monitor datasets using aggregate statistics\n\nLakehouse Monitoring uses aggregate statistics and data distributions to track data quality over time. You can generate reports to visualize trends and schedule alerts to flag unexpected changes in data. See [Introduction to Databricks Lakehouse Monitoring](https:\/\/docs.databricks.com\/lakehouse-monitoring\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/transform\/aggregation.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Manage Python dependencies for Delta Live Tables pipelines\n\nDelta Live Tables supports external dependencies in your pipelines. Databricks recommends using one of two patterns to install Python packages: \n1. Use the `%pip install` command to install packages for all source files in a pipeline.\n2. Import modules or libraries from source code stored in workspace files. See [Import Python modules from Git folders or workspace files](https:\/\/docs.databricks.com\/delta-live-tables\/import-workspace-files.html). \nDelta Live Tables also supports using global and cluster-scoped [init scripts](https:\/\/docs.databricks.com\/init-scripts\/index.html). However, these external dependencies, particularly init scripts, increase the risk of issues with runtime upgrades. To mitigate these risks, minimize using init scripts in your pipelines. If your processing requires init scripts, automate testing of your pipeline to detect problems early. If you use init scripts, Databricks recommends increasing your testing frequency. 
\nImportant \nBecause [JVM libraries are not supported](https:\/\/docs.databricks.com\/delta-live-tables\/external-dependencies.html#jvm-library-support) in Delta Live Tables pipelines, do not use an init script to install JVM libraries. However, you can install other library types, such as Python libraries, with an init script.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/external-dependencies.html"} +{"content":"# Databricks data engineering\n## What is Delta Live Tables?\n### Manage configuration of Delta Live Tables pipelines\n##### Manage Python dependencies for Delta Live Tables pipelines\n###### Python libraries\n\nTo specify external Python libraries, use the `%pip install` magic command. When an update starts, Delta Live Tables runs all cells containing a `%pip install` command before running any table definitions. Every Python notebook included in the pipeline shares a library environment and has access to all installed libraries. \nImportant \n* `%pip install` commands must be in a separate cell at the top of your Delta Live Tables pipeline notebook. Do not include any other code in cells containing `%pip install` commands.\n* Because every notebook in a pipeline shares a library environment, you cannot define different library versions in a single pipeline. If your processing requires different library versions, you must define them in different pipelines. \nThe following example installs the `simplejson` library and makes it globally available to any Python notebook in the pipeline: \n```\n%pip install simplejson\n\n``` \nTo install a Python wheel package, add the Python wheel path to the `%pip install` command. Installed Python wheel packages are available to all tables in the pipeline. 
The following example installs a Python wheel file named `dltfns-1.0-py3-none-any.whl` from the DBFS directory `\/dbfs\/dlt\/`: \n```\n%pip install \/dbfs\/dlt\/dltfns-1.0-py3-none-any.whl\n\n``` \nSee [Install a Python wheel package with %pip](https:\/\/docs.databricks.com\/libraries\/notebooks-python-libraries.html#pip-install-wheel).\n\n##### Manage Python dependencies for Delta Live Tables pipelines\n###### Can I use Scala or Java libraries in a Delta Live Tables pipeline?\n\nNo, Delta Live Tables supports only SQL and Python. You cannot use JVM libraries in a pipeline. Installing JVM libraries will cause unpredictable behavior, and may break with future Delta Live Tables releases. If your pipeline uses an init script, you must also ensure that JVM libraries are not installed by the script.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta-live-tables\/external-dependencies.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### pandas user-defined functions\n\nA pandas user-defined function (UDF)\u2014also known as vectorized UDF\u2014is a user-defined function that uses\n[Apache Arrow](https:\/\/arrow.apache.org\/) to transfer data and pandas to work with the data. pandas UDFs allow\nvectorized operations that can increase performance up to 100x compared to row-at-a-time [Python UDFs](https:\/\/docs.databricks.com\/udf\/python.html). \nFor background information, see the blog post\n[New Pandas UDFs and Python Type Hints in the Upcoming Release of Apache Spark 3.0](https:\/\/databricks.com\/blog\/2020\/05\/20\/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html). 
\nYou define a pandas UDF using the keyword `pandas_udf` as a decorator and wrap the function with a [Python type hint](https:\/\/www.python.org\/dev\/peps\/pep-0484\/).\nThis article describes the different types of pandas UDFs and shows how to use pandas UDFs with type hints.\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/pandas.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### pandas user-defined functions\n##### Series to Series UDF\n\nYou use a Series to Series pandas UDF to vectorize scalar operations.\nYou can use them with APIs such as `select` and `withColumn`. \nThe Python function should take a pandas Series as an input and return a\npandas Series of the same length, and you should specify these in the Python\ntype hints. Spark runs a pandas UDF by splitting columns into batches, calling the function\nfor each batch as a subset of the data, then concatenating the results. \nThe following example shows how to create a pandas UDF that computes the product of 2 columns. 
\n```\nimport pandas as pd\nfrom pyspark.sql.functions import col, pandas_udf\nfrom pyspark.sql.types import LongType\n\n# Declare the function and create the UDF\ndef multiply_func(a: pd.Series, b: pd.Series) -> pd.Series:\n    return a * b\n\nmultiply = pandas_udf(multiply_func, returnType=LongType())\n\n# The function for a pandas_udf should be able to execute with local pandas data\nx = pd.Series([1, 2, 3])\nprint(multiply_func(x, x))\n# 0 1\n# 1 4\n# 2 9\n# dtype: int64\n\n# Create a Spark DataFrame, 'spark' is an existing SparkSession\ndf = spark.createDataFrame(pd.DataFrame(x, columns=[\"x\"]))\n\n# Execute function as a Spark vectorized UDF\ndf.select(multiply(col(\"x\"), col(\"x\"))).show()\n# +-------------------+\n# |multiply_func(x, x)|\n# +-------------------+\n# | 1|\n# | 4|\n# | 9|\n# +-------------------+\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/pandas.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### pandas user-defined functions\n##### Iterator of Series to Iterator of Series UDF\n\nAn iterator UDF is the same as a scalar pandas UDF except: \n* The Python function \n+ Takes an iterator of batches instead of a single input batch as input.\n+ Returns an iterator of output batches instead of a single output batch.\n* The length of the entire output in the iterator should be the same as the length of the entire input.\n* The wrapped pandas UDF takes a single Spark column as an input. \nYou should specify the Python type hint as\n`Iterator[pandas.Series]` -> `Iterator[pandas.Series]`. \nThis pandas UDF is useful when the UDF execution requires initializing some state, for example,\nloading a machine learning model file to apply inference to every input batch. \nThe following example shows how to create a pandas UDF with iterator support. 
\n```\nimport pandas as pd\nfrom typing import Iterator\nfrom pyspark.sql.functions import col, pandas_udf, struct\n\npdf = pd.DataFrame([1, 2, 3], columns=[\"x\"])\ndf = spark.createDataFrame(pdf)\n\n# When the UDF is called with the column,\n# the input to the underlying function is an iterator of pd.Series.\n@pandas_udf(\"long\")\ndef plus_one(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:\n    for x in batch_iter:\n        yield x + 1\n\ndf.select(plus_one(col(\"x\"))).show()\n# +-----------+\n# |plus_one(x)|\n# +-----------+\n# | 2|\n# | 3|\n# | 4|\n# +-----------+\n\n# In the UDF, you can initialize some state before processing batches.\n# Wrap your code with try\/finally or use context managers to ensure\n# the release of resources at the end.\ny_bc = spark.sparkContext.broadcast(1)\n\n@pandas_udf(\"long\")\ndef plus_y(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:\n    y = y_bc.value # initialize states\n    try:\n        for x in batch_iter:\n            yield x + y\n    finally:\n        pass # release resources here, if any\n\ndf.select(plus_y(col(\"x\"))).show()\n# +---------+\n# |plus_y(x)|\n# +---------+\n# | 2|\n# | 3|\n# | 4|\n# +---------+\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/pandas.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### pandas user-defined functions\n##### Iterator of multiple Series to Iterator of Series UDF\n\nAn Iterator of multiple Series to Iterator of Series UDF has similar characteristics and\nrestrictions as [Iterator of Series to Iterator of Series UDF](https:\/\/docs.databricks.com\/udf\/pandas.html#scalar-iterator-udfs). The specified function takes an iterator of batches and\noutputs an iterator of batches. It is also useful when the UDF execution requires initializing some\nstate. \nThe differences are: \n* The underlying Python function takes an iterator of a *tuple* of pandas Series.\n* The wrapped pandas UDF takes *multiple* Spark columns as an input. 
\nYou specify the type hints as `Iterator[Tuple[pandas.Series, ...]]` -> `Iterator[pandas.Series]`. \n```\nfrom typing import Iterator, Tuple\nimport pandas as pd\n\nfrom pyspark.sql.functions import col, pandas_udf, struct\n\npdf = pd.DataFrame([1, 2, 3], columns=[\"x\"])\ndf = spark.createDataFrame(pdf)\n\n@pandas_udf(\"long\")\ndef multiply_two_cols(\n        iterator: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:\n    for a, b in iterator:\n        yield a * b\n\ndf.select(multiply_two_cols(\"x\", \"x\")).show()\n# +-----------------------+\n# |multiply_two_cols(x, x)|\n# +-----------------------+\n# | 1|\n# | 4|\n# | 9|\n# +-----------------------+\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/pandas.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### pandas user-defined functions\n##### Series to scalar UDF\n\nSeries to scalar pandas UDFs are similar to Spark aggregate functions.\nA Series to scalar pandas UDF defines an aggregation from one or more\npandas Series to a scalar value, where each pandas Series represents a Spark column.\nYou use a Series to scalar pandas UDF with APIs such as `select`, `withColumn`, `groupBy.agg`, and\n[pyspark.sql.Window](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.Window.html). \nYou express the type hint as `pandas.Series, ...` -> `Any`. The return type should be a\nprimitive data type, and the returned scalar can be either a Python primitive type, for example,\n`int` or `float` or a NumPy data type such as `numpy.int64` or `numpy.float64`. `Any` should ideally\nbe a specific scalar type. \nThis type of UDF *does not* support partial aggregation and all data for each group is loaded into memory. 
\nThe following example shows how to use this type of UDF to compute mean with `select`, `groupBy`, and `window` operations: \n```\nimport pandas as pd\nfrom pyspark.sql.functions import pandas_udf\nfrom pyspark.sql import Window\n\ndf = spark.createDataFrame(\n    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],\n    (\"id\", \"v\"))\n\n# Declare the function and create the UDF\n@pandas_udf(\"double\")\ndef mean_udf(v: pd.Series) -> float:\n    return v.mean()\n\ndf.select(mean_udf(df['v'])).show()\n# +-----------+\n# |mean_udf(v)|\n# +-----------+\n# | 4.2|\n# +-----------+\n\ndf.groupby(\"id\").agg(mean_udf(df['v'])).show()\n# +---+-----------+\n# | id|mean_udf(v)|\n# +---+-----------+\n# | 1| 1.5|\n# | 2| 6.0|\n# +---+-----------+\n\nw = Window \\\n    .partitionBy('id') \\\n    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)\ndf.withColumn('mean_v', mean_udf(df['v']).over(w)).show()\n# +---+----+------+\n# | id| v|mean_v|\n# +---+----+------+\n# | 1| 1.0| 1.5|\n# | 1| 2.0| 1.5|\n# | 2| 3.0| 6.0|\n# | 2| 5.0| 6.0|\n# | 2|10.0| 6.0|\n# +---+----+------+\n\n``` \nFor detailed usage, see [pyspark.sql.functions.pandas\\_udf](https:\/\/api-docs.databricks.com\/python\/pyspark\/latest\/pyspark.sql\/api\/pyspark.sql.functions.pandas_udf.html?highlight=pandas%20udf#pyspark-sql-functions-pandas-udf).\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/pandas.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### pandas user-defined functions\n##### Usage\n\n### Setting Arrow batch size \nNote \nThis configuration has no impact on compute configured with shared access mode and Databricks Runtime 13.3 LTS through 14.2. \nData partitions in Spark are converted into Arrow record batches, which\ncan temporarily lead to high memory usage in the JVM. 
To avoid possible\nout of memory exceptions, you can adjust the size of the Arrow record batches\nby setting the `spark.sql.execution.arrow.maxRecordsPerBatch` configuration to an integer that\ndetermines the maximum number of rows for each batch. The default value\nis 10,000 records per batch. If the number of columns is large, the\nvalue should be adjusted accordingly. Using this limit, each data\npartition is divided into 1 or more record batches for processing. \n### Timestamp with time zone semantics \nSpark internally stores timestamps as UTC values, and timestamp data\nbrought in without a specified time zone is converted as local\ntime to UTC with microsecond resolution. \nWhen timestamp data is exported or displayed in Spark,\nthe session time zone is used to localize the\ntimestamp values. The session time zone is set with the\n`spark.sql.session.timeZone` configuration and defaults to the JVM system local\ntime zone. pandas uses a `datetime64` type with nanosecond\nresolution, `datetime64[ns]`, with optional time zone on a per-column\nbasis. \nWhen timestamp data is transferred from Spark to pandas it is\nconverted to nanoseconds and each column is converted to the Spark\nsession time zone then localized to that time zone, which removes the\ntime zone and displays values as local time. This occurs when\ncalling `toPandas()` or `pandas_udf` with timestamp columns. \nWhen timestamp data is transferred from pandas to Spark, it is\nconverted to UTC microseconds. This occurs when calling\n`createDataFrame` with a pandas DataFrame or when returning a\ntimestamp from a pandas UDF. These conversions are done\nautomatically to ensure Spark has data in the expected format, so\nit is not necessary to do any of these conversions yourself. Any\nnanosecond values are truncated. \nA standard UDF loads timestamp data as Python\ndatetime objects, which is different than a pandas timestamp. 
To get the best performance, we\nrecommend that you use pandas time series functionality when working with\ntimestamps in a pandas UDF. For details, see [Time Series \/ Date functionality](https:\/\/pandas.pydata.org\/pandas-docs\/stable\/user_guide\/timeseries.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/pandas.html"} +{"content":"# Develop on Databricks\n## What are user-defined functions (UDFs)?\n#### pandas user-defined functions\n##### Example notebook\n\nThe following notebook illustrates the performance improvements you can achieve with pandas UDFs: \n### pandas UDFs benchmark notebook \n[Open notebook in new tab](https:\/\/docs.databricks.com\/_extras\/notebooks\/source\/pandas-udfs-benchmark.html)\n![Copy to clipboard](https:\/\/docs.databricks.com\/_static\/clippy.svg) Copy link for import\n\n","doc_uri":"https:\/\/docs.databricks.com\/udf\/pandas.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Python API\n\nThis page provides links to the Python API documentation of Databricks Feature Engineering and Databricks Workspace Feature Store, and information about the client packages `databricks-feature-engineering` and `databricks-feature-store`. \nNote \nAs of version 0.17.0, `databricks-feature-store` has been deprecated. All existing modules from this package are now available in `databricks-feature-engineering` version 0.2.0 and later. For information about migrating to `databricks-feature-engineering`, see [Migrate to databricks-feature-engineering](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html#migrate-to-feature-engineering).\n\n#### Python API\n##### Compatibility matrix\n\nThe package and client you should use depend on where your feature tables are located and what Databricks Runtime ML version you are running, as shown in the following table. 
\nTo identify the package version that is built in to your Databricks Runtime ML version, see the [Feature Engineering compatibility matrix](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html#feature-engineering-compatibility-matrix). \n| Databricks Runtime version | For feature tables in | Use package | Use Python client |\n| --- | --- | --- | --- |\n| Databricks Runtime 14.3 ML and above | Unity Catalog | `databricks-feature-engineering` | `FeatureEngineeringClient` |\n| Databricks Runtime 14.3 ML and above | Workspace | `databricks-feature-engineering` | `FeatureStoreClient` |\n| Databricks Runtime 14.2 ML and below | Unity Catalog | `databricks-feature-engineering` | `FeatureEngineeringClient` |\n| Databricks Runtime 14.2 ML and below | Workspace | `databricks-feature-store` | `FeatureStoreClient` |\n\n#### Python API\n##### Feature Engineering Python API reference\n\nSee the Feature Engineering [Python API reference](https:\/\/api-docs.databricks.com\/python\/feature-engineering\/latest\/index.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Python API\n##### Workspace Feature Store Python API reference (deprecated)\n\nNote \n* As of version 0.17.0, `databricks-feature-store` has been deprecated. All existing modules from this package are now available in `databricks-feature-engineering` version 0.2.0 and later. \nFor `databricks-feature-store` v0.17.0, see Databricks `FeatureStoreClient` in [Feature Engineering Python API reference](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html#feature-engineering-api-reference) for the latest Workspace Feature Store API reference. \nFor v0.16.3 and below, use the links in the table to download or display the Feature Store Python API reference. 
To determine the pre-installed version for your Databricks Runtime ML version, see [the compatibility matrix](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html#feature-engineering-compatibility-matrix). \n| Version | Download PDF | Online API reference |\n| --- | --- | --- |\n| v0.3.5 to v0.16.3 | [Feature Store Python API 0.16.3 reference PDF](https:\/\/docs.databricks.com\/_extras\/documents\/feature-store-python-api-reference-0-16-3.pdf) | [Online API reference](https:\/\/api-docs.databricks.com\/python\/feature-store\/latest\/index.html) |\n| v0.3.5 and below | [Feature Store Python API 0.3.5 reference PDF](https:\/\/docs.databricks.com\/_extras\/documents\/feature-store-python-api-reference-0-3-5.pdf) | Online API reference not available |\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Python API\n##### Python package\n\nThis section describes how to install the Python packages to use Databricks Feature Engineering and Databricks Workspace Feature Store. \n### Feature Engineering \nNote \n* As of version 0.2.0, `databricks-feature-engineering` contains modules for working with feature tables in both Unity Catalog and Workspace Feature Store. `databricks-feature-engineering` below version 0.2.0 only works with feature tables in Unity Catalog. \nThe Databricks Feature Engineering APIs are available through the Python client package `databricks-feature-engineering`. The client is available on [PyPI](https:\/\/pypi.org\/project\/databricks-feature-engineering) and is pre-installed in Databricks Runtime 13.3 LTS ML and above. \nFor a reference of which client version corresponds to which runtime version, see the [compatibility matrix](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html#feature-engineering-compatibility-matrix). 
\nTo install the client in Databricks Runtime: \n```\n%pip install databricks-feature-engineering\n\n``` \nTo install the client in a local Python environment: \n```\npip install databricks-feature-engineering\n\n``` \n### Workspace Feature Store (deprecated) \nNote \n* As of version 0.17.0, `databricks-feature-store` has been deprecated. All existing modules from this package are now available in `databricks-feature-engineering`, version 0.2.0 and later.\n* See [Migrate to databricks-feature-engineering](https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html#migrate-to-feature-engineering) for more information. \nThe Databricks Feature Store APIs are available through the Python client package `databricks-feature-store`. The client is available on [PyPI](https:\/\/pypi.org\/project\/databricks-feature-store) and is pre-installed in Databricks Runtime for Machine Learning. For a reference of which runtime includes which client version, see the [compatibility matrix](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html#feature-engineering-compatibility-matrix). \nTo install the client in Databricks Runtime: \n```\n%pip install databricks-feature-store\n\n``` \nTo install the client in a local Python environment: \n```\npip install databricks-feature-store\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Python API\n##### Migrate to `databricks-feature-engineering`\n\nTo install the `databricks-feature-engineering` package, use `pip install databricks-feature-engineering` instead of `pip install databricks-feature-store`. All of the modules in `databricks-feature-store` have been moved to `databricks-feature-engineering`, so you do not have to change any code. 
Import statements such as `from databricks.feature_store import FeatureStoreClient` will continue to work after you install `databricks-feature-engineering`. \nTo work with feature tables in Unity Catalog, use `FeatureEngineeringClient`. To use Workspace Feature Store, you must use `FeatureStoreClient`.\n\n#### Python API\n##### Supported scenarios\n\nOn Databricks, including Databricks Runtime and Databricks Runtime for Machine Learning, you can: \n* Create, read, and write feature tables.\n* Train and score models on feature data.\n* Publish feature tables to online stores for real-time serving. \nFrom a local environment or an environment external to Databricks, you can: \n* Develop code with local IDE support.\n* Unit test using mock frameworks.\n* Write integration tests to be run on Databricks.\n\n#### Python API\n##### Limitations\n\nThe client library can only be run on Databricks, including Databricks Runtime and Databricks Runtime for Machine Learning. It does\nnot support calling Feature Engineering in Unity Catalog or Feature Store APIs from a local environment, or from an environment other than Databricks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html"} +{"content":"# AI and Machine Learning on Databricks\n## What is a feature store?\n#### Python API\n##### Use the clients for unit testing\n\nYou can install the Feature Engineering in Unity Catalog client or the Feature Store client locally to aid in running unit tests. 
\nFor example, to validate that a method `update_customer_features` correctly calls\n`FeatureEngineeringClient.write_table` (or for Workspace Feature Store,\n`FeatureStoreClient.write_table`), you could write: \n```\nfrom unittest.mock import MagicMock, patch\n\nfrom my_feature_update_module import update_customer_features\nfrom databricks.feature_engineering import FeatureEngineeringClient\n\n@patch.object(FeatureEngineeringClient, \"write_table\")\n@patch(\"my_feature_update_module.compute_customer_features\")\ndef test_something(compute_customer_features, mock_write_table):\n    customer_features_df = MagicMock()\n    compute_customer_features.return_value = customer_features_df\n\n    update_customer_features()  # Function being tested\n\n    mock_write_table.assert_called_once_with(\n        name='ml.recommender_system.customer_features',\n        df=customer_features_df,\n        mode='merge'\n    )\n\n```\n\n#### Python API\n##### Use the clients for integration testing\n\nYou can run integration tests with the Feature Engineering in Unity Catalog client or the Feature Store client on Databricks. For details, see\n[Developer Tools and Guidance: Use CI\/CD](https:\/\/docs.databricks.com\/dev-tools\/index-ci-cd.html#dev-tools-ci-cd).\n\n#### Python API\n##### Use the clients in an integrated development environment (IDE)\n\nYou can use the Feature Engineering in Unity Catalog client or the Feature Store client with an IDE for software development with Databricks. For details, see [Use dbx with Visual Studio Code](https:\/\/docs.databricks.com\/archive\/dev-tools\/dbx\/ide-how-to.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/machine-learning\/feature-store\/python-api.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Box options\n\nThis section covers the configuration options for box chart visualizations. 
For an example, see [box chart example](https:\/\/docs.databricks.com\/visualizations\/visualization-types.html#box).\n\n##### Box options\n###### General\n\nTo configure general options, click **General** and configure each of the following required settings: \n* **Horizontal Chart**: If selected, the box, which shows number distribution, is plotted along the X axis.\n* **X column**: X axis values. For a vertical chart, choose a categorical column. For a horizontal chart, choose a number column.\n* **Y columns**: Y axis values. For a vertical chart, choose a number column. For a horizontal chart, choose a categorical column.\n* **Group by**: Additional columns to group by, after the default grouping is applied. By default, results are grouped by the X axis unless **Horizontal Chart** is also selected, in which case results are grouped by the Y axis.\n* **Legend placement**: **Automatic (Flexible)**, **Flexible**, **Right**, **Bottom**, or **Hidden**.\n* **Legend items order**: **Normal** or **Reversed**.\n* **Show all points**: Whether to show each X axis value separately or to show the range of values using a box.\n* **Missing and NULL values**: Whether to hide missing or NULL values or to convert them to 0 and show them in the visualization.\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/boxplot.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Box options\n###### X axis\n\nTo configure formatting options for the X axis, click **X axis** and configure the following optional settings: \n* **Scale**: **Categorical**.\n* **Name**: Override the column name with a different display name.\n* **Sort values**: Whether to sort the X axis values, even if they are not sorted in the query.\n* **Reverse Order**: Whether to reverse the sorting order.\n* **Show labels**: Whether to show the X axis values as labels.\n* **Hide axis**: If enabled, hides the X axis labels and scale 
markers.\n\n##### Box options\n###### Y axis\n\nTo configure formatting options for the Y axis, click **Y axis** and configure the following optional settings: \n* **Scale**: **Automatic (Linear)**, **Datetime**, **Linear**, **Logarithmic**, or **Categorical**.\n* **Name**: Specify a display name for the Y axis column if different from the column name.\n* **Start Value**: Show only values higher than a given value, regardless of the query result.\n* **End Value**: Show only values lower than a given value, regardless of the query result.\n* **Hide axis**: Whether to hide the Y axis labels and line.\n\n##### Box options\n###### Series\n\nTo configure series, click **Series** and configure the following optional settings: \n* Order: The order the values appear in the box chart.\n* Label: The label for the values in the legend.\n\n##### Box options\n###### Colors\n\nTo configure colors, click **Colors** and optionally override automatic colors and configure custom colors.\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/boxplot.html"} +{"content":"# Discover data\n## Exploratory data analysis on Databricks: Tools and techniques\n### Visualization types\n##### Box options\n###### Data labels\n\nTo configure labels for each data point in the visualization, click **Data labels** and configure the following optional settings: \n* **Number values format**: The format to use for labels for numeric values.\n* **Percent values format**: The format to use for labels for percentages.\n* **Date\/time values format**: The format to use for labels for date\/time values.\n* **Data labels**: The format to use for labels for other types of values.\n\n","doc_uri":"https:\/\/docs.databricks.com\/visualizations\/boxplot.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Manage storage credentials\n\nThis article describes how to list, view, update, grant permissions on, and delete [storage 
credentials](https:\/\/docs.databricks.com\/connect\/unity-catalog\/storage-credentials.html). \nDatabricks recommends that you grant only `CREATE EXTERNAL LOCATION` and no other privileges on storage credentials. \nThis article describes how to manage storage credentials using Catalog Explorer and SQL commands in a notebook or Databricks SQL query. For information about using the Databricks CLI or Terraform instead, see the [Databricks Terraform documentation](https:\/\/registry.terraform.io\/providers\/databricks\/databricks\/latest\/docs\/resources\/external_location) and [What is the Databricks CLI?](https:\/\/docs.databricks.com\/dev-tools\/cli\/index.html).\n\n#### Manage storage credentials\n##### List storage credentials\n\nTo view the list of all storage credentials in a metastore, you can use Catalog Explorer or a SQL command. \n1. In the sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. At the bottom of the screen, click **External Data > Storage Credentials**. \nRun the following command in a notebook or the Databricks SQL editor. \n```\nSHOW STORAGE CREDENTIALS;\n\n```\n\n#### Manage storage credentials\n##### View a storage credential\n\nTo view the properties of a storage credential, you can use Catalog Explorer or a SQL command. \n1. In the sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. At the bottom of the screen, click **External Data > Storage Credentials**.\n3. Click the name of a storage credential to see its properties. \nRun the following command in a notebook or the Databricks SQL editor. Replace `<credential-name>` with the name of the credential. 
\n```\nDESCRIBE STORAGE CREDENTIAL <credential-name>;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/manage-storage-credentials.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Manage storage credentials\n##### Show grants on a storage credential\n\nTo show grants on a storage credential, use a command like the following. You can optionally filter the results to show only the grants for the specified principal. \n```\nSHOW GRANTS [<principal>] ON STORAGE CREDENTIAL <storage-credential-name>;\n\n``` \nReplace the placeholder values: \n* `<principal>`: The email address of the account-level user or the name of the account-level group whose grants you want to show.\n* `<storage-credential-name>`: The name of a storage credential. \nNote \nIf a group name contains a space, use back-ticks around it (not apostrophes).\n\n#### Manage storage credentials\n##### Grant permissions to create external locations\n\nTo grant permission to create an external location using a storage credential, complete the following steps: \n1. In the sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. At the bottom of the screen, click **External Data > Storage Credentials**.\n3. Click the name of a storage credential to open its properties.\n4. Click **Permissions**.\n5. To grant permission to users or groups, select each identity, then click **Grant**.\n6. To revoke permissions from users or groups, select each identity, then click **Revoke**. \nRun the following command in a notebook or the SQL query editor: \n```\nGRANT CREATE EXTERNAL LOCATION ON STORAGE CREDENTIAL <storage-credential-name> TO <principal>;\n\n``` \nReplace the placeholder values: \n* `<principal>`: The email address of the account-level user or the name of the account-level group to whom to grant the permission.\n* `<storage-credential-name>`: The name of a storage credential. 
\nNote \nIf a group name contains a space, use back-ticks around it (not apostrophes).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/manage-storage-credentials.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Manage storage credentials\n##### Change the owner of a storage credential\n\nA storage credential\u2019s creator is its initial owner. To change the owner to a different account-level user or group, do the following: \nRun the following command in a notebook or the Databricks SQL editor. Replace the placeholder values: \n* `<credential-name>`: The name of the credential.\n* `<principal>`: The email address of an account-level user or the name of an account-level group. \n```\nALTER STORAGE CREDENTIAL <credential-name> OWNER TO <principal>;\n\n```\n\n#### Manage storage credentials\n##### Mark a storage credential as read-only\n\nIf you want users to have read-only access to all data managed by a storage credential, you can use Catalog Explorer to mark the storage credential as read-only. \nMaking storage credentials read-only means that any storage configured with that credential is read-only. \nYou can mark storage credentials as read-only when you create them. \nYou can also use Catalog Explorer to change read-only status after creating a storage credential: \n1. In Catalog Explorer, find the storage credential, click the ![Kebab menu](https:\/\/docs.databricks.com\/_images\/kebab-menu.png) kebab menu (also known as the three-dot menu) on the object row, and select **Edit**.\n2. On the edit dialog, select the **Read only** option.\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/manage-storage-credentials.html"} +{"content":"# Connect to data sources\n## Connect to cloud object storage using Unity Catalog\n#### Manage storage credentials\n##### Rename a storage credential\n\nTo rename a storage credential, you can use Catalog Explorer or a SQL command. 
\n1. In the sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. At the bottom of the screen, click **External Data > Storage Credentials**.\n3. Click the name of a storage credential to open the edit dialog.\n4. Rename the storage credential and save it. \nRun the following command in a notebook or the Databricks SQL editor. Replace the placeholder values: \n* `<credential-name>`: The name of the credential.\n* `<new-credential-name>`: A new name for the credential. \n```\nALTER STORAGE CREDENTIAL <credential-name> RENAME TO <new-credential-name>;\n\n```\n\n#### Manage storage credentials\n##### Delete a storage credential\n\nTo delete (drop) a storage credential, you must be its owner. To delete a storage credential, you can use Catalog Explorer or a SQL command. \n1. In the sidebar, click ![Catalog icon](https:\/\/docs.databricks.com\/_images\/data-icon.png) **Catalog**.\n2. At the bottom of the screen, click **External Data > Storage Credentials**.\n3. Click the name of a storage credential to open the edit dialog.\n4. Click the **Delete** button. \nRun the following command in a notebook or the Databricks SQL editor. Replace `<credential-name>` with the name of the credential. Portions of the command that are in brackets are optional. By default, if the credential is used by an external location, it is not deleted. \n`IF EXISTS` does not return an error if the credential does not exist. 
\n```\nDROP STORAGE CREDENTIAL [IF EXISTS] <credential-name>;\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/unity-catalog\/manage-storage-credentials.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect Power BI to Databricks\n\n[Microsoft Power BI](https:\/\/powerbi.microsoft.com) is a business analytics service that provides interactive visualizations with self-service business intelligence capabilities, enabling end users to create reports and dashboards by themselves without having to depend on information technology staff or database administrators. \nWhen you use Databricks as a data source with Power BI, you can bring the advantages of Databricks performance and technology beyond data scientists and data engineers to all business users.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/power-bi.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect Power BI to Databricks\n##### Publish to Power BI Online from Databricks\n\nWhen using Databricks as a data source with Power BI Online, you can create Power BI datasets from tables or schemas directly from the Databricks UI. \n### Requirements \n* Your data must be on Unity Catalog, and your compute (cluster) must be Unity Catalog enabled. Hive metastore is not currently supported.\n* You must have a premium (premium capacity or premium per-user license) Power BI license.\n* You must enable \u201cUsers can edit data models in Power BI service (preview)\u201d under Workspace settings and Data model settings to edit the Semantic Model after it is published. 
You can also edit the Semantic Model using Tabular Editor by making a connection using the XMLA endpoint.\n* If you need to enable XMLA read\/write in your Power BI workspace, follow this [link](https:\/\/learn.microsoft.com\/power-bi\/enterprise\/service-premium-connect-tools#enable-xmla-read-write) for instructions.\n* If your workspace is under a private link, you will need to update the dataset\u2019s datasource credentials manually in Power BI. \n### How to Use It \nPublish Databricks tables to a Power BI dataset \n1. Sign in to your Databricks workspace and navigate to the Catalog Explorer. Select the schema\/tables to be published. Do not select from a Hive metastore or the samples catalog.\n2. From the compute dropdown, select the data warehouse you want to use in this Power BI publish.\n3. With the desired table\/schema to be published open in the Catalog Explorer, click the \u201cUse with BI tools\u201d button on the upper right.\n4. In the dropdown list that opens, click the \u201cPublish to Power BI workspace\u201d option. \nAt this point, a menu will open over the right side of the window. Follow the prompts given by the menu, detailed below: \n5. Click \u201cConnect to Microsoft Entra ID\u201d to authenticate with your Microsoft account.\n6. In the following menu, select the desired workspace to be published to in the \u201cPower BI workspaces\u201d dropdown. In the \u201cDataset Mode\u201d dropdown, select either DirectQuery (selected by default) or Import mode.\n7. Click the blue \u201cPublish to Power BI\u201d button at the bottom of the menu.\n8. Wait for the dataset to publish. This normally takes about 10 to 20 seconds.\n9. When the dataset is published, the blue button will have a link labeled \u201cOpen Power BI\u201d. Click this to open your new Power BI dataset in a new tab. \n### Features and Notes \n* When publishing a schema containing multiple tables, all tables with columns will be published. 
If no columns are present in any table, the publishing will not be performed.\n* Comments on a table\u2019s columns in Databricks are copied to the descriptions of corresponding columns in Power BI.\n* Foreign key relationships are preserved in the published dataset. However, Power BI only supports one active relationship path between any two tables. Thus, when multiple paths are present in the schema in Databricks, some of the corresponding relationships in Power BI will be set to inactive. You may later change which relationships are active\/inactive in the data model view in Power BI.\n* A Personal Access Token (PAT) is created on your behalf to allow Power BI to access the semantic model. This authentication method can be changed later in the Power BI datasource settings.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/power-bi.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect Power BI to Databricks\n##### Connect Power BI Desktop to Databricks\n\nYou can connect Power BI Desktop to your Databricks clusters and Databricks SQL warehouses. \n### Requirements \n* Power BI Desktop 2.85.681.0 or above. To use data managed by Unity Catalog with Power BI, you must use Power BI Desktop 2.98.683.0 or above (October 2021 release). \nNote \nPower BI Desktop requires Windows. An alternative for other operating systems is to run Power BI Desktop on a physical host or a Windows-based virtual machine and then connect to it from your operating system. \nIf you use a version of Power BI Desktop below 2.85.681.0, you also need to install the [Databricks ODBC driver](https:\/\/databricks.com\/spark\/odbc-drivers-download) in the same environment as Power BI Desktop. \n* One of the following to authenticate: \n+ (Recommended) Power BI enabled as an OAuth application in your account. This is enabled by default.\n+ A Databricks [personal access token](https:\/\/docs.databricks.com\/api\/workspace\/tokenmanagement). 
\nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n+ A Databricks [username](https:\/\/docs.databricks.com\/admin\/users-groups\/users.html) (typically your email address) and password. \nUsername and password authentication might be disabled if your Databricks workspace is [enabled for single sign-on (SSO)](https:\/\/docs.databricks.com\/security\/auth-authz\/index.html#sso).\n* A Databricks [cluster](https:\/\/docs.databricks.com\/compute\/configure.html) or Databricks SQL [warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/index.html). \n### Connect Power BI Desktop to Databricks using Partner Connect \nYou can use Partner Connect to connect to a cluster or SQL warehouse from Power BI Desktop in just a few clicks. \n1. Make sure your Databricks account, workspace, and the signed-in user meet the [requirements](https:\/\/docs.databricks.com\/partner-connect\/index.html#requirements) for Partner Connect.\n2. In the sidebar, click ![Partner Connect button](https:\/\/docs.databricks.com\/_images\/partner-connect.png) **Partner Connect**.\n3. Click the **Power BI** tile.\n4. In the **Connect to partner** dialog, for **Compute**, choose the name of the Databricks compute resource that you want to connect.\n5. Choose **Download connection file**.\n6. Open the downloaded connection file, which starts Power BI Desktop.\n7. 
In Power BI Desktop, enter your authentication credentials: \n* **Personal Access Token**: Enter your Databricks personal access token.\n* **Username \/ Password**: Enter your Databricks username (typically your email address) and password. Username and password authentication might be disabled if your Databricks workspace is [enabled for single sign-on (SSO)](https:\/\/docs.databricks.com\/admin\/users-groups\/single-sign-on\/index.html). If you cannot log in using your Databricks username and password, try using the **Personal Access Token** option instead.\n* **Microsoft Entra ID**: Not applicable.\n8. Click **Connect**.\n9. Select the Databricks data to query from the Power BI **Navigator**. \n![Power BI Navigator](https:\/\/docs.databricks.com\/_images\/power-bi-navigator.png) \n### Connect Power BI Desktop to Databricks manually \nFollow these instructions, depending on your chosen authentication method, to connect to a cluster or SQL warehouse with Power BI Desktop. Databricks SQL warehouses are recommended when using Power BI in **DirectQuery** mode. \nNote \nTo connect faster with Power BI Desktop, use Partner Connect. \n1. Get the [Server Hostname and HTTP Path](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n2. Start Power BI Desktop.\n3. Click **Get data** or **File > Get data**.\n4. Click **Get data to get started**.\n5. Search for **Databricks**, then click the connector: \n* **Azure Databricks**, if you authenticate using a personal access token or your Databricks username and password. \nNote \nAlthough the connector name is **Azure Databricks**, it works with Databricks on AWS.\n* **Databricks (Beta)**, if you authenticate using OAuth.\n6. Click **Connect**.\n7. Enter the **Server Hostname** and **HTTP Path**.\n8. Select your **Data Connectivity mode**. 
For information about the difference between **Import** and **DirectQuery**, see [Use DirectQuery in Power BI Desktop](https:\/\/learn.microsoft.com\/power-bi\/connect-data\/desktop-use-directquery).\n9. Click **OK**.\n10. Click your authentication method: \n* **Username \/ Password**: Enter your Databricks username and password. Username and password authentication may be disabled if your Databricks workspace is [enabled for single sign-on (SSO)](https:\/\/docs.databricks.com\/admin\/users-groups\/single-sign-on\/index.html). If you cannot log in by using your Databricks username and password, try using the **Personal Access Token** option instead.\n* **Personal Access Token**: Enter your personal access token.\n* **OAuth**: Click **Sign in**. A browser window opens and prompts you to sign in with your IdP. After the success message appears, exit your browser and return to Power BI Desktop.\n11. Click **Connect**.\n12. Select the Databricks data to query from the Power BI **Navigator**. If Unity Catalog is enabled for your workspace, select a catalog before you select a schema and a table. \n![Power BI Navigator](https:\/\/docs.databricks.com\/_images\/power-bi-navigator.png) \n### Using a custom SQL query \nThe Databricks connector provides the `Databricks.Query` data source that allows a user to provide a custom SQL query. \n1. Follow the steps described in [Connect with Power BI Desktop](https:\/\/docs.databricks.com\/partners\/bi\/power-bi.html#manual-connection) to create a connection, using **Import** as the data connectivity mode.\n2. In the **Navigator**, right click the top-most item containing the selected host name and HTTP path and click **Transform Data** to open the Power Query Editor. \n![Click Transform Data in the Navigator](https:\/\/docs.databricks.com\/_images\/power-bi-navigator-transform-data-source.png)\n3. In the function bar, replace the function name `Databricks.Catalogs` with `Databricks.Query` and apply the change. 
This creates a Power Query function that takes a SQL query as a parameter.\n4. Enter the desired SQL query in the parameter field and click **Invoke**. This executes the query and creates a new table with the query results as its contents. \n### Automated HTTP proxy detection \nPower BI Desktop version 2.104.941.0 and above (May 2022 release) can automatically detect and use your Windows system-wide HTTP proxy configuration. \nIf the proxy server does not provide a CRL distribution point (CDP), Power BI might show the following error message: \n```\nDetails: \"ODBC: ERROR [HY000] [Microsoft][DriverSupport] (1200)\n-The revocation status of the certificate or one of the certificates in the certificate chain is unknown.\"\n\n``` \nTo fix this error, complete the following steps: \n1. Create the file `C:\\Program Files\\Microsoft Power BI Desktop\\bin\\ODBC Drivers\\Simba Spark ODBC Driver\\microsoft.sparkodbc.ini` if it does not exist.\n2. Add the following config to your `microsoft.sparkodbc.ini` file: \n```\n[Driver]\nCheckCertRevocation=0\n\n``` \n### Power BI Delta Sharing connector \nThe Power BI Delta Sharing connector allows users to discover, analyze, and visualize datasets shared with them through the [Delta Sharing](https:\/\/docs.databricks.com\/data-sharing\/index.html) open protocol. The protocol enables secure exchange of datasets across products and platforms by leveraging REST and cloud storage. \nFor connection instructions, see [Power BI: Read shared data](https:\/\/docs.databricks.com\/data-sharing\/read-data-open.html#power-bi). \n### Limitations \n* The Databricks connector supports [web proxy](https:\/\/learn.microsoft.com\/power-bi\/connect-data\/desktop-troubleshooting-sign-in#using-default-system-credentials-for-web-proxy). 
However, automatic proxy settings defined in .pac files aren\u2019t supported.\n* In the Databricks connector, the `Databricks.Query` data source is not supported in combination with DirectQuery mode.\n* The data that the Delta Sharing connector loads must fit into the memory of your machine. To ensure this, the connector limits the number of imported rows to the **Row Limit** that was set earlier.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/power-bi.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect Power BI to Databricks\n##### Additional resources\n\n[Support](https:\/\/powerbi.microsoft.com\/support\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/power-bi.html"} +{"content":"# Databricks data engineering\n### What is DBFS?\n\nThe term *DBFS* is used to describe two parts of the platform: \n* DBFS root\n* DBFS mounts \nStoring and accessing data using DBFS root or DBFS mounts is a deprecated pattern and not recommended by Databricks.\n\n### What is DBFS?\n#### What is the Databricks File System?\n\nThe term *DBFS* comes from Databricks File System, which describes the distributed file system used by Databricks to interact with cloud-based storage. \nThe underlying technology associated with DBFS is still part of the Databricks platform. For example, `dbfs:\/` is an optional scheme when interacting with Unity Catalog volumes. \nPast and current warnings and caveats about DBFS only apply to the DBFS root or DBFS mounts.\n\n### What is DBFS?\n#### How does DBFS work with Unity Catalog?\n\nDatabricks recommends using Unity Catalog to manage access to all data. \nUnity Catalog adds the concepts of external locations, storage credentials, and volumes to help organizations provide least privileges access to data in cloud object storage. 
\nSome security configurations provide direct access to both Unity Catalog-managed resources and DBFS, primarily for organizations that have completed migrations or have partially migrated to Unity Catalog. See [Best practices for DBFS and Unity Catalog](https:\/\/docs.databricks.com\/dbfs\/unity-catalog.html).\n\n### What is DBFS?\n#### What is the DBFS root?\n\nThe *DBFS root* is a storage location provisioned as part of workspace creation in the cloud account containing the Databricks workspace. For details on Databricks Filesystem root configuration and deployment, see [Create an S3 bucket for workspace deployment](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/storage.html). \nDatabricks does not recommend storing any production data, libraries, or scripts in DBFS root. See [Recommendations for working with DBFS root](https:\/\/docs.databricks.com\/dbfs\/dbfs-root.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dbfs\/index.html"} +{"content":"# Databricks data engineering\n### What is DBFS?\n#### Mount object storage\n\nNote \nDBFS mounts are deprecated. Databricks recommends using Unity Catalog volumes. See [Create and work with volumes](https:\/\/docs.databricks.com\/connect\/unity-catalog\/volumes.html). \nMounting object storage to DBFS allows you to access objects in object storage as if they were on the local file system. Mounts store Hadoop configurations necessary for accessing storage. 
For more information, see [Mounting cloud object storage on Databricks](https:\/\/docs.databricks.com\/dbfs\/mounts.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/dbfs\/index.html"} +{"content":"# Compute\n## Use compute\n#### GPU-enabled compute\n\nNote \nSome GPU-enabled instance types are in **Beta** and are marked as such in the drop-down list when you select the driver and worker types during compute creation.\n\n#### GPU-enabled compute\n##### Overview\n\nDatabricks supports compute accelerated with graphics processing units (GPUs).\nThis article describes how to create compute with GPU-enabled instances and describes\nthe GPU drivers and libraries installed on those instances. \nTo learn more about deep learning on GPU-enabled compute, see [Deep learning](https:\/\/docs.databricks.com\/machine-learning\/train-model\/deep-learning.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/gpu.html"} +{"content":"# Compute\n## Use compute\n#### GPU-enabled compute\n##### Create a GPU compute\n\nCreating a GPU compute is similar to creating any compute. You should keep in mind the following: \n* The **Databricks Runtime Version** must be a GPU-enabled version, such as **Runtime 13.3 LTS ML (GPU, Scala 2.12.15, Spark 3.4.1)**.\n* The **Worker Type** and **Driver Type** must be GPU instance types. \n### Supported instance types \nDatabricks supports the following GPU-accelerated instance types: \n* [[Deprecated] P2 instance type series](https:\/\/aws.amazon.com\/ec2\/instance-types\/p2\/): p2.xlarge, p2.8xlarge, and p2.16xlarge \n+ P2 instances are available only in select AWS regions. For information, see [Amazon EC2 Pricing](https:\/\/aws.amazon.com\/ec2\/pricing\/on-demand\/). 
Your Databricks deployment must reside in a supported region to launch GPU-enabled compute.\n+ The [default on-demand limit for P2 instances is one](https:\/\/aws.amazon.com\/ec2\/faqs\/#How_many_instances_can_I_run_in_Amazon_EC2).\n+ P2 instances require EBS volumes for storage.\nWarning \nAfter August 31 2023, Databricks will no longer support spinning up compute using Amazon EC2 P2 instances.\n* [P3 instance type series](https:\/\/aws.amazon.com\/ec2\/instance-types\/p3\/): p3.2xlarge, p3.8xlarge, and p3.16xlarge. \n+ P3 instances are available only in select AWS regions. For information, see [Amazon EC2 Pricing](https:\/\/aws.amazon.com\/ec2\/pricing\/on-demand\/). Your Databricks deployment must reside in a supported region to launch GPU-enabled compute.\n* [P4d instance type series](https:\/\/aws.amazon.com\/ec2\/instance-types\/p4\/): p4d.24xlarge, p4de.24xlarge.\n* [P5 instance type series](https:\/\/aws.amazon.com\/ec2\/instance-types\/p5\/): p5.48xlarge.\n* [G4 instance type series](https:\/\/aws.amazon.com\/ec2\/instance-types\/g4\/), which are optimized for deploying machine learning models in production.\n* [G5 instance type series](https:\/\/aws.amazon.com\/ec2\/instance-types\/g5\/), which can be used for a wide range of graphics-intensive and machine learning use cases. \n+ G5 instances require Databricks Runtime 9.1 LTS ML or above. \n#### Considerations \nFor all GPU-accelerated instance types, keep the following in mind: \n* Due to Amazon spot instance price surges, GPU spot instances are difficult to retain. Use on-demand if needed.\n* You might need to [request a limit increase](https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/ec2-resource-limits.html) in order to create\nGPU-enabled compute. 
\nSee [Supported Instance Types](https:\/\/databricks.com\/product\/aws-pricing\/instance-types) for a list of supported GPU instance types and their attributes.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/gpu.html"} +{"content":"# Compute\n## Use compute\n#### GPU-enabled compute\n##### GPU scheduling\n\nDatabricks Runtime supports [GPU-aware scheduling](https:\/\/spark.apache.org\/docs\/3.0.0-preview\/configuration.html#custom-resource-scheduling-and-configuration-overview) from Apache Spark 3.0. Databricks preconfigures it on GPU compute. \nGPU scheduling is not enabled on single-node compute. \n`spark.task.resource.gpu.amount` is the only Spark config related to GPU-aware scheduling that you might need to change.\nThe default configuration uses one GPU per task, which is ideal for distributed inference workloads and distributed training, if you use all GPU nodes.\nTo do distributed training on a subset of nodes, which helps reduce communication overhead during distributed training, Databricks recommends setting `spark.task.resource.gpu.amount` to the number of GPUs per worker node\nin the compute [Spark configuration](https:\/\/docs.databricks.com\/compute\/configure.html#spark-configuration). \nFor PySpark tasks, Databricks automatically remaps assigned GPU(s) to indices 0, 1, \u2026.\nUnder the default configuration that uses one GPU per task, your code can simply use the default GPU without checking which GPU is assigned to the task.\nIf you set multiple GPUs per task, for example 4, your code can assume that the indices of the assigned GPUs are always 0, 1, 2, and 3. If you do need the physical indices of the assigned GPUs, you can get them from the `CUDA_VISIBLE_DEVICES` environment variable. \nIf you use Scala, you can get the indices of the GPUs assigned to the task from `TaskContext.resources().get(\"gpu\")`. 
\nFor Databricks Runtime releases below 7.0, to avoid conflicts among multiple Spark tasks trying to use the same GPU, Databricks automatically configures GPU compute so that there is at most one running task per node.\nThat way the task can use all GPUs on the node without running into conflicts with other tasks.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/gpu.html"} +{"content":"# Compute\n## Use compute\n#### GPU-enabled compute\n##### NVIDIA GPU driver, CUDA, and cuDNN\n\nDatabricks installs the NVIDIA driver and libraries required to use GPUs on Spark driver and worker instances: \n* [CUDA Toolkit](https:\/\/developer.nvidia.com\/cuda-toolkit), installed under `\/usr\/local\/cuda`.\n* [cuDNN](https:\/\/developer.nvidia.com\/cudnn): NVIDIA CUDA Deep Neural Network Library.\n* [NCCL](https:\/\/developer.nvidia.com\/nccl): NVIDIA Collective Communications Library. \nThe version of the NVIDIA driver included is 535.54.03, which supports CUDA 11.0. \nFor the versions of the libraries included, see the [release notes](https:\/\/docs.databricks.com\/release-notes\/runtime\/index.html) for the specific Databricks Runtime version you are using. \nNote \nThis software contains source code provided by NVIDIA Corporation. Specifically, to support GPUs, Databricks includes code from [CUDA Samples](https:\/\/docs.nvidia.com\/cuda\/eula\/#nvidia-cuda-samples-preface). 
\n### NVIDIA End User License Agreement (EULA) \nWhen you select a GPU-enabled \u201cDatabricks Runtime Version\u201d in Databricks, you implicitly agree to the terms and conditions outlined in the\nNVIDIA EULA with respect to the CUDA, cuDNN, and Tesla libraries,\nand the [NVIDIA End User License Agreement (with NCCL Supplement)](https:\/\/docs.nvidia.com\/deeplearning\/sdk\/nccl-sla\/index.html#supplement\/) for the NCCL library.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/gpu.html"} +{"content":"# Compute\n## Use compute\n#### GPU-enabled compute\n##### Databricks Container Services on GPU compute\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nYou can use [Databricks Container Services](https:\/\/docs.databricks.com\/compute\/custom-containers.html) on compute with GPUs to create portable deep learning environments with customized libraries. See [Customize containers with Databricks Container Service](https:\/\/docs.databricks.com\/compute\/custom-containers.html) for instructions. \nTo create custom images for GPU compute, you must select a standard runtime version instead of Databricks Runtime ML for GPU. When you select **Use your own Docker container**, you can choose GPU compute with a standard runtime version. The custom images for GPU compute are based on the [official CUDA containers](https:\/\/hub.docker.com\/r\/nvidia\/cuda\/), which differ from Databricks Runtime ML for GPU. \nWhen you create custom images for GPU compute, you cannot change the NVIDIA driver version, because it must match the driver version on the host machine. \nThe `databricksruntime` [Docker Hub](https:\/\/hub.docker.com\/u\/databricksruntime) contains example base images with GPU capability. 
The Dockerfiles used to generate these images are located in the [example containers GitHub repository](https:\/\/github.com\/databricks\/containers\/tree\/master\/ubuntu\/gpu), which also has details on what the example images provide and how to customize them.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/gpu.html"} +{"content":"# Compute\n## Use compute\n#### GPU-enabled compute\n##### Error messages\n\n* The following error indicates that the AWS cloud provider does not have enough capacity for the requested compute resource.\n`Error: Cluster terminated. Reason: AWS Insufficient Instance Capacity Failure` \nTo resolve this error, try creating compute in a different availability zone. The availability zone is in the [compute configuration](https:\/\/docs.databricks.com\/compute\/configure.html#cluster-aws-config), under **Advanced options**. You can also review [AWS reserved instances pricing](https:\/\/aws.amazon.com\/ec2\/pricing\/reserved-instances\/pricing\/) to purchase additional quota.\n* If your compute uses P4d or G5 instance types and Databricks Runtime 7.3 LTS ML, the CUDA package version in 7.3 is incompatible with newer GPU instances. In those cases, ML packages such as TensorFlow Keras and PyTorch will produce errors such as: \n+ TensorFlow Keras: `InternalError: CUDA runtime implicit initialization on GPU:x failed. 
Status: device kernel image is invalid`\n+ PyTorch: `UserWarning: NVIDIA A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.` \nYou can resolve these errors by upgrading to Databricks Runtime 10.4 LTS ML or above.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/gpu.html"} +{"content":"# Databricks reference documentation\n### MLflow API reference\n\nThe open-source MLflow REST API allows you to create, list, and get experiments and runs, and allows you to log parameters, metrics, and artifacts. 
The Databricks Runtime for Machine Learning provides a managed version of the MLflow server, which includes experiment tracking and the Model Registry. \nFor MLflow, there are two REST API reference guides: \n* [Databricks MLflow REST API 2.0 reference](https:\/\/docs.databricks.com\/api\/workspace\/experiments)\n* [Open Source MLflow REST API reference](https:\/\/mlflow.org\/docs\/latest\/rest-api.html)\n\n","doc_uri":"https:\/\/docs.databricks.com\/reference\/mlflow-api.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Looker Studio\n\nThis article describes how to use Looker Studio with a Databricks cluster or Databricks SQL warehouse (formerly Databricks SQL endpoint).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/looker-studio.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Looker Studio\n##### Requirements\n\nBefore you connect to Looker Studio manually, you need the following: \n* A cluster or SQL warehouse in your Databricks workspace. \n+ [Compute configuration reference](https:\/\/docs.databricks.com\/compute\/configure.html).\n+ [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n* The connection details for your cluster or SQL warehouse, specifically the **Server Hostname**, **Port**, and **HTTP Path** values. \n+ [Get connection details for a Databricks compute resource](https:\/\/docs.databricks.com\/integrations\/compute-details.html).\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/dev-tools\/auth\/pat.html). To create a personal access token, do the following: \n1. In your Databricks workspace, click your Databricks username in the top bar, and then select **Settings** from the drop down.\n2. Click **Developer**.\n3. Next to **Access tokens**, click **Manage**.\n4. Click **Generate new token**.\n5. 
(Optional) Enter a comment that helps you to identify this token in the future, and change the token\u2019s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the **Lifetime (days)** box empty (blank).\n6. Click **Generate**.\n7. Copy the displayed token to a secure location, and then click **Done**.\nNote \nBe sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (**Revoke**) icon next to the token on the **Access tokens** page. \nIf you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. See your workspace administrator or the following: \n+ [Enable or disable personal access token authentication for the workspace](https:\/\/docs.databricks.com\/admin\/access-control\/tokens.html#enable-tokens)\n+ [Personal access token permissions](https:\/\/docs.databricks.com\/security\/auth-authz\/api-access-permissions.html#pat) \nNote \nAs a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use [OAuth tokens](https:\/\/docs.databricks.com\/dev-tools\/auth\/oauth-m2m.html). \nIf you use personal access token authentication, Databricks recommends using personal access tokens belonging to [service principals](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) instead of workspace users. 
To create tokens for service principals, see [Manage tokens for a service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html#personal-access-tokens).\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/looker-studio.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Looker Studio\n##### Connect to Looker Studio manually\n\nTo connect to Looker Studio manually, do the following: \n1. Go to <https:\/\/lookerstudio.google.com\/data> and search for the Databricks connector.\n2. Click **Authorize**, and sign in to your Google account. \nNote \nDatabricks Connector for Data Studio use and transfer to any other app of information received from Google APIs will adhere to [Google API Services User Data Policy](https:\/\/developers.google.com\/terms\/api-services-user-data-policy), including the Limited Use requirements.\n3. Enter your Databricks credentials. Enter `token` in the username field, and enter your personal access token in the password field. \n![Authorize](https:\/\/docs.databricks.com\/_images\/looker-studio-authorize.png)\n4. For **Server Hostname**, enter the Databricks server hostname. \n![Connection parameters](https:\/\/docs.databricks.com\/_images\/looker-studio-connect.png)\n5. For **SQL Warehouse**, select a SQL warehouse from the drop-down list. You can filter by keyword to search.\n6. For **SQL Query**, write your SQL query. Your query must include the complete path with the catalog, the schema, and the table specified and enclosed in backticks. \nFor example: \n```\nselect * from `catalog`.`schema`.`table` limit 100\n\n``` \nNote \nThe maximum result size you can retrieve is 16MB.\n\n#### Connect to Looker Studio\n##### Reset your access\n\nTo revoke your authorization from a connector: \n1. Find the Databricks data source card on the Looker Studio home page.\n2. Click the kebab menu in the top right, and then click **Revoke Access**. 
\n![Revoke access](https:\/\/docs.databricks.com\/_images\/looker-studio-remove-access.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/looker-studio.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Sigma\n\nSigma delivers cloud-scale analytics and business intelligence with the simplicity of a spreadsheet, complete with pivot tables and dashboards. \nYou can connect Databricks SQL warehouses (formerly Databricks SQL endpoints) to Sigma.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/sigma.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Sigma\n##### Connect to Sigma using Partner Connect\n\nTo connect to Sigma using Partner Connect, do the following: \n1. In the sidebar, click ![Partner Connect button](https:\/\/docs.databricks.com\/_images\/partner-connect.png) **Partner Connect**.\n2. Click the partner tile. \nNote \nIf the partner tile has a check mark icon inside it, an administrator has already used Partner Connect to connect the partner to your workspace. Skip to step 8. The partner uses the email address for your Databricks account to prompt you to sign in to your existing partner account.\n3. Select a catalog for Sigma to write to, then click **Next**.\n4. If there are SQL warehouses in your workspace, select a SQL warehouse from the drop-down list. If your SQL warehouse is stopped, click **Start**.\n5. If there are no SQL warehouses in your workspace, do the following: \n1. Click **Create warehouse**. A new tab opens in your browser that displays the **New SQL Warehouse** page in the Databricks SQL UI.\n2. Follow the steps in [Create a SQL warehouse](https:\/\/docs.databricks.com\/compute\/sql-warehouse\/create.html).\n3. Return to the Partner Connect tab in your browser, then close the partner tile.\n4. Re-open the partner tile.\n5. Select the SQL warehouse you just created from the drop-down list.\n6. 
Select a schema for Sigma to write to, then click **Add**. You can repeat this step to add multiple schemas.\n7. Click **Next**. \nPartner Connect creates the following resources in your workspace: \n* A Databricks [service principal](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) named **SIGMA\\_USER**.\n* A Databricks [personal access token](https:\/\/docs.databricks.com\/admin\/users-groups\/service-principals.html) that is associated with the **SIGMA\\_USER** service principal. \nPartner Connect also grants the following privileges to the **SIGMA\\_USER** service principal: \n* (Unity Catalog) `USE CATALOG`: Required to interact with objects in the selected catalog.\n* (Unity Catalog) `CREATE SCHEMA`: Grants the ability to create objects in the schemas you selected.\n* (Hive metastore) `USAGE`: Required to grant privileges for the schemas you selected.\n* (Hive metastore) `CREATE`: Grants the ability to create objects in the schemas you selected.\n* **CAN\\_USE**: Grants permissions to use the SQL warehouse you selected. \nThe **Email** box displays the email address for your Databricks account. The partner uses this email address to prompt you to either create a new partner account or sign in to your existing partner account.\n8. Click **Connect to Sigma** or **Sign in**. \nA new tab opens in your web browser, which displays the partner website.\n9. 
Complete the on-screen instructions on the partner website to create your trial partner account or sign in to your existing partner account.\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/sigma.html"} +{"content":"# Technology partners\n## Connect to BI partners using Partner Connect\n#### Connect to Sigma\n##### Connect to Sigma manually\n\nTo connect to Sigma manually, see [Connect to Databricks](https:\/\/help.sigmacomputing.com\/hc\/en-us\/articles\/6963295723411-Connect-to-Databricks) in the Sigma documentation.\n\n#### Connect to Sigma\n##### Next steps\n\nSee the Sigma documentation to learn how to create [datasets](https:\/\/help.sigmacomputing.com\/hc\/en-us\/articles\/4409226917011#h_01FJX6890SS09YJFDCKRP7C5TK) and [workbooks](https:\/\/help.sigmacomputing.com\/hc\/en-us\/articles\/1500010599982).\n\n#### Connect to Sigma\n##### Additional resources\n\nSee the following resources on the [Sigma website](https:\/\/www.sigmacomputing.com\/): \n* [Documentation](https:\/\/help.sigmacomputing.com\/)\n* [Support](mailto:support%40sigmacomputing.com)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/bi\/sigma.html"} +{"content":"# Databricks data engineering\n## Streaming on Databricks\n#### Using Unity Catalog with Structured Streaming\n\nUse Structured Streaming with Unity Catalog to manage data governance for your incremental and streaming workloads on Databricks. This document outlines supported functionality and suggests best practices for using Unity Catalog and Structured Streaming together.\n\n#### Using Unity Catalog with Structured Streaming\n##### What Structured Streaming functionality does Unity Catalog support?\n\nUnity Catalog does not add any explicit limits for Structured Streaming sources and sinks available on Databricks. The Unity Catalog data governance model allows you to stream data from managed and external tables in Unity Catalog. 
You can also use external locations managed by Unity Catalog to interact with data using object storage URIs. You can write to external tables using either table names or file paths. You must interact with managed tables on Unity Catalog using the table name. \nUse external locations managed by Unity Catalog when specifying paths for Structured Streaming checkpoints. To learn more about securely connecting storage with Unity Catalog, see [Connect to cloud object storage using Unity Catalog](https:\/\/docs.databricks.com\/connect\/unity-catalog\/index.html). \nStructured streaming feature support differs depending on the Databricks Runtime version you are running and whether you are using assigned or shared cluster access mode. For details, see [Streaming limitations for Unity Catalog](https:\/\/docs.databricks.com\/compute\/access-mode-limitations.html#structured-streaming). \nFor an end-to-end demo using Structured Streaming on Unity Catalog, see [Tutorial: Run an end-to-end lakehouse analytics pipeline](https:\/\/docs.databricks.com\/getting-started\/lakehouse-e2e.html).\n\n#### Using Unity Catalog with Structured Streaming\n##### What Structured Streaming functionality is not supported on Unity Catalog?\n\nFor a list of Structured Streaming features that are not supported on Unity Catalog, see [Streaming limitations for Unity Catalog](https:\/\/docs.databricks.com\/compute\/access-mode-limitations.html#structured-streaming).\n\n","doc_uri":"https:\/\/docs.databricks.com\/structured-streaming\/unity-catalog.html"} +{"content":"# What is Delta Lake?\n### Review Delta Lake table details with describe detail\n\nYou can retrieve detailed information about a Delta table (for example, number of files, data size) using `DESCRIBE DETAIL`. 
\n```\nDESCRIBE DETAIL '\/data\/events\/'\n\nDESCRIBE DETAIL eventsTable\n\n``` \nFor Spark SQL syntax details, see [DESCRIBE DETAIL](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-aux-describe-table.html#describe-detail). \nSee the [Delta Lake API documentation](https:\/\/docs.databricks.com\/delta\/index.html#delta-api) for Scala\/Java\/Python syntax details.\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/table-details.html"} +{"content":"# What is Delta Lake?\n### Review Delta Lake table details with describe detail\n#### Detail schema\n\nThe output of this operation has only one row with the following schema. \nNote \nThe columns you see depend on the Databricks Runtime version that you are using and the table features that you\u2019ve enabled. \n| Column | Type | Description |\n| --- | --- | --- |\n| format | string | Format of the table, that is, `delta`. |\n| id | string | Unique ID of the table. |\n| name | string | Name of the table as defined in the metastore. |\n| description | string | Description of the table. |\n| location | string | Location of the table. |\n| createdAt | timestamp | When the table was created. |\n| lastModified | timestamp | When the table was last modified. |\n| partitionColumns | array of strings | Names of the partition columns if the table is partitioned. |\n| numFiles | long | Number of the files in the latest version of the table. |\n| sizeInBytes | int | The size of the latest snapshot of the table in bytes. |\n| properties | string-string map | All the properties set for this table. |\n| minReaderVersion | int | Minimum version of readers (according to the log protocol) that can read the table. |\n| minWriterVersion | int | Minimum version of writers (according to the log protocol) that can write to the table. |\n| statistics | map with string keys | Additional table-level statistics. |\n| tableFeatures | array of strings | A list of the table features supported by the table. 
See [How does Databricks manage Delta Lake feature compatibility?](https:\/\/docs.databricks.com\/delta\/feature-compatibility.html). |\n| clusteringColumns | array of strings | The columns being used for liquid clustering. See [Use liquid clustering for Delta tables](https:\/\/docs.databricks.com\/delta\/clustering.html) | \nBelow is an example of what the output looks like: \n```\n+------+--------------------+------------------+-----------+--------------------+--------------------+-------------------+----------------+--------+-----------+----------+----------------+----------------+\n|format| id| name|description| location| createdAt| lastModified|partitionColumns|numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion|\n+------+--------------------+------------------+-----------+--------------------+--------------------+-------------------+----------------+--------+-----------+----------+----------------+----------------+\n| delta|d31f82d2-a69f-42e...|default.deltatable| null|file:\/Users\/tuor\/...|2020-06-05 12:20:...|2020-06-05 12:20:20| []| 10| 12345| []| 1| 2|\n+------+--------------------+------------------+-----------+--------------------+--------------------+-------------------+----------------+--------+-----------+----------+----------------+----------------+\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/delta\/table-details.html"} +{"content":"# Data governance with Unity Catalog\n## What is Catalog Explorer?\n#### Document data in Catalog Explorer using markdown comments\n\nUsers can use [Catalog Explorer](https:\/\/docs.databricks.com\/catalog-explorer\/index.html) to view comments about data assets like catalogs, schemas, and tables. This article describes how object owners or users with modify permission on objects can add those comments manually using Catalog Explorer. \nNote \nFor tables and columns, Catalog Explorer also lets you see AI-generated comment suggestions and apply them. 
See [Add AI-generated comments to a table](https:\/\/docs.databricks.com\/catalog-explorer\/ai-comments.html). \nIf you\u2019re using Unity Catalog, you can use Catalog Explorer to add and edit comments on all objects other than those in a Delta Sharing catalog. \nFor data in the Hive metastore, you can use Catalog Explorer to edit table comments only. \nMarkdown provides a robust set of options for documenting data, enhancing the options Databricks users have for increasing the discoverability and understanding of shared data assets. Using markdown comments has no impact on query performance. Markdown does not render when returned by `DESCRIBE` statements.\n\n#### Document data in Catalog Explorer using markdown comments\n##### Add markdown comments to data objects using Catalog Explorer\n\nCatalog Explorer displays comments for catalogs, schemas, tables, and other assets below the object name. \n* If no comment exists, an **Add comment** option is shown.\n* You can toggle comment display with the **Hide comment** and **Show comment** options. \nMarkdown in table comments renders in Catalog Explorer as soon as you save changes. \n* Click the pencil icon to modify comments.\n* Click **Save** to update comments. \nYou can also use SQL to add table comments during table creation or `ALTER TABLE` actions. \nWhen modifying comments on a Delta Lake table, a `SET TBLPROPERTIES` operation in the table history records the SQL query used to define the current table comments.\n\n","doc_uri":"https:\/\/docs.databricks.com\/catalog-explorer\/markdown-data-comments.html"} +{"content":"# Data governance with Unity Catalog\n## What is Catalog Explorer?\n#### Document data in Catalog Explorer using markdown comments\n##### Example of supported markdown documentation\n\nCatalog Explorer supports basic markdown syntax. You cannot use markdown for emojis, images, and rendered markdown tables. Catalog Explorer renders only two levels of markdown headers. 
\nThe following example shows a code block of raw markdown. Copy this markdown to a comment in Catalog Explorer and click **Save** to preview. \n```\n# Header 1\n## Header 2\n\n**bold text**\n\n*italics text*\n\n~~strikethrough text~~\n\n`monospace text`\n\n---\n\n> Block quote\n\nOrdered list:\n1. Item 1\n1. Item 2\n1. Item 3\n\nUnordered list:\n- Item a\n- Item b\n- Item c\n\n```\ndef my_function():\nreturn my_value\n```\n\n[Link](https:\/\/www.markdownguide.org\/cheat-sheet\/#basic-syntax)\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/catalog-explorer\/markdown-data-comments.html"} +{"content":"# Data governance with Unity Catalog\n## What is Catalog Explorer?\n#### Document data in Catalog Explorer using markdown comments\n##### More resources\n\nYou can also use the following functionality to add comments to data objects: \n* The [COMMENT ON](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-comment.html) command. This option does not support column comments.\n* The `COMMENT` option when you use the `CREATE <object>` or `ALTER <object>` command. For example, see [CREATE TABLE [USING]](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-create-table-using.html) and [ALTER TABLE](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-syntax-ddl-alter-table.html). This option supports column comments.\n* AI-generated comments (also known as AI-generated documentation) in Catalog Explorer. You can view a comment suggested by a large language model (LLM) that takes into account the table metadata, such as the table schema and column names, and edit or accept the comment as-is to add it. This option supports tables and columns only. 
See [Add AI-generated comments to a table](https:\/\/docs.databricks.com\/catalog-explorer\/ai-comments.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/catalog-explorer\/markdown-data-comments.html"} +{"content":"# Security and compliance guide\n## Secret management\n#### Secret redaction\n\nStoring credentials as Databricks secrets makes it easy to protect your credentials when you run notebooks and jobs.\nHowever, it is easy to accidentally print a secret to standard output buffers or display the value during variable assignment. \nTo prevent this, Databricks redacts all secret values that are read using `dbutils.secrets.get()`. When displayed in notebook cell output, the secret values are replaced with `[REDACTED]`. \nFor example, if you set a variable to a secret value using `dbutils.secrets.get()` and then print that variable, that variable is replaced with `[REDACTED]`. \nWarning \nSecret redaction for notebook cell output applies only to literals. The secret redaction functionality does not prevent deliberate and arbitrary transformations of a secret literal. To ensure the proper control of secrets, you should use [Access control lists](https:\/\/docs.databricks.com\/security\/auth-authz\/access-control\/index.html) (limiting permission to run commands) to prevent unauthorized access to shared notebook contexts.\n\n","doc_uri":"https:\/\/docs.databricks.com\/security\/secrets\/redaction.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query databases using JDBC\n\nDatabricks supports connecting to external databases using JDBC. This article provides the basic syntax for configuring and using these connections with examples in Python, SQL, and Scala. \nNote \nYou may prefer Lakehouse Federation for managing queries to external database systems. See [What is Lakehouse Federation](https:\/\/docs.databricks.com\/query-federation\/index.html). 
\nPartner Connect provides optimized integrations for syncing data with many external data sources. See [What is Databricks Partner Connect?](https:\/\/docs.databricks.com\/partner-connect\/index.html). \nImportant \nThe examples in this article do not include usernames and passwords in JDBC URLs. Databricks recommends using [secrets](https:\/\/docs.databricks.com\/security\/secrets\/index.html) to store your database credentials. For example: \n```\nusername = dbutils.secrets.get(scope = \"jdbc\", key = \"username\")\npassword = dbutils.secrets.get(scope = \"jdbc\", key = \"password\")\n\n``` \n```\nval username = dbutils.secrets.get(scope = \"jdbc\", key = \"username\")\nval password = dbutils.secrets.get(scope = \"jdbc\", key = \"password\")\n\n``` \nTo reference Databricks secrets with SQL, you must [configure a Spark configuration property during cluster initialization](https:\/\/docs.databricks.com\/security\/secrets\/secrets.html#ref-spark-conf-secret). \nFor a full example of secret management, see [Secret workflow example](https:\/\/docs.databricks.com\/security\/secrets\/example-secret-workflow.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/jdbc.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query databases using JDBC\n##### Establish cloud connectivity\n\nDatabricks VPCs are configured to allow only Spark clusters. When connecting to another infrastructure, the best practice is to use [VPC peering](https:\/\/docs.databricks.com\/security\/network\/classic\/vpc-peering.html). Once VPC peering is established, you can check with the `netcat` utility on the cluster. 
\n```\n%sh nc -vz <jdbcHostname> <jdbcPort>\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/jdbc.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query databases using JDBC\n##### Read data with JDBC\n\nYou must configure a number of settings to read data using JDBC. Note that each database uses a different format for the `<jdbc-url>`. \n```\nemployees_table = (spark.read\n.format(\"jdbc\")\n.option(\"url\", \"<jdbc-url>\")\n.option(\"dbtable\", \"<table-name>\")\n.option(\"user\", \"<username>\")\n.option(\"password\", \"<password>\")\n.load()\n)\n\n``` \n```\nCREATE TEMPORARY VIEW employees_table_vw\nUSING JDBC\nOPTIONS (\nurl \"<jdbc-url>\",\ndbtable \"<table-name>\",\nuser '<username>',\npassword '<password>'\n)\n\n``` \n```\nval employees_table = spark.read\n.format(\"jdbc\")\n.option(\"url\", \"<jdbc-url>\")\n.option(\"dbtable\", \"<table-name>\")\n.option(\"user\", \"<username>\")\n.option(\"password\", \"<password>\")\n.load()\n\n``` \nSpark automatically reads the schema from the database table and maps its types back to Spark SQL types. \n```\nemployees_table.printSchema()\n\n``` \n```\nDESCRIBE employees_table_vw\n\n``` \n```\nemployees_table.printSchema\n\n``` \nYou can run queries against this JDBC table: \n```\ndisplay(employees_table.select(\"age\", \"salary\").groupBy(\"age\").avg(\"salary\"))\n\n``` \n```\nSELECT age, avg(salary) as salary\nFROM employees_table_vw\nGROUP BY age\n\n``` \n```\ndisplay(employees_table.select(\"age\", \"salary\").groupBy(\"age\").avg(\"salary\"))\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/jdbc.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query databases using JDBC\n##### Write data with JDBC\n\nSaving data to tables with JDBC uses similar configurations to reading. 
See the following example: \n```\n(employees_table.write\n.format(\"jdbc\")\n.option(\"url\", \"<jdbc-url>\")\n.option(\"dbtable\", \"<new-table-name>\")\n.option(\"user\", \"<username>\")\n.option(\"password\", \"<password>\")\n.save()\n)\n\n``` \n```\nCREATE TABLE new_employees_table\nUSING JDBC\nOPTIONS (\nurl \"<jdbc-url>\",\ndbtable \"<table-name>\",\nuser '<username>',\npassword '<password>'\n) AS\nSELECT * FROM employees_table_vw\n\n``` \n```\nemployees_table.write\n.format(\"jdbc\")\n.option(\"url\", \"<jdbc-url>\")\n.option(\"dbtable\", \"<new-table-name>\")\n.option(\"user\", \"<username>\")\n.option(\"password\", \"<password>\")\n.save()\n\n``` \nThe default behavior attempts to create a new table and throws an error if a table with that name already exists. \nYou can append data to an existing table using the following syntax: \n```\n(employees_table.write\n.format(\"jdbc\")\n.option(\"url\", \"<jdbc-url>\")\n.option(\"dbtable\", \"<new-table-name>\")\n.option(\"user\", \"<username>\")\n.option(\"password\", \"<password>\")\n.mode(\"append\")\n.save()\n)\n\n``` \n```\nCREATE TABLE IF NOT EXISTS new_employees_table\nUSING JDBC\nOPTIONS (\nurl \"<jdbc-url>\",\ndbtable \"<table-name>\",\nuser '<username>',\npassword '<password>'\n);\n\nINSERT INTO new_employees_table\nSELECT * FROM employees_table_vw;\n\n``` \n```\nemployees_table.write\n.format(\"jdbc\")\n.option(\"url\", \"<jdbc-url>\")\n.option(\"dbtable\", \"<new-table-name>\")\n.option(\"user\", \"<username>\")\n.option(\"password\", \"<password>\")\n.mode(\"append\")\n.save()\n\n``` \nYou can overwrite an existing table using the following syntax: \n```\n(employees_table.write\n.format(\"jdbc\")\n.option(\"url\", \"<jdbc-url>\")\n.option(\"dbtable\", \"<new-table-name>\")\n.option(\"user\", \"<username>\")\n.option(\"password\", \"<password>\")\n.mode(\"overwrite\")\n.save()\n)\n\n``` \n```\nCREATE OR REPLACE TABLE new_employees_table\nUSING JDBC\nOPTIONS (\nurl \"<jdbc-url>\",\ndbtable 
\"<table-name>\",\nuser '<username>',\npassword '<password>'\n) AS\nSELECT * FROM employees_table_vw;\n\n``` \n```\nemployees_table.write\n.format(\"jdbc\")\n.option(\"url\", \"<jdbc-url>\")\n.option(\"dbtable\", \"<new-table-name>\")\n.option(\"user\", \"<username>\")\n.option(\"password\", \"<password>\")\n.mode(\"overwrite\")\n.save()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/jdbc.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query databases using JDBC\n##### Control parallelism for JDBC queries\n\nBy default, the JDBC driver queries the source database with only a single thread. To improve performance for reads, you need to specify a number of options to control how many simultaneous queries Databricks makes to your database. For small clusters, setting the `numPartitions` option equal to the number of executor cores in your cluster ensures that all nodes query data in parallel. \nWarning \nSetting `numPartitions` to a high value on a large cluster can result in negative performance for the remote database, as too many simultaneous queries might overwhelm the service. This is especially troublesome for application databases. Be wary of setting this value above 50. \nNote \nSpeed up queries by selecting a column with an index calculated in the source database for the `partitionColumn`. 
\nThe following code example demonstrates configuring parallelism for a cluster with eight cores: \n```\nemployees_table = (spark.read\n.format(\"jdbc\")\n.option(\"url\", \"<jdbc-url>\")\n.option(\"dbtable\", \"<table-name>\")\n.option(\"user\", \"<username>\")\n.option(\"password\", \"<password>\")\n# a column with a uniformly distributed range of values that can be used for parallelization\n.option(\"partitionColumn\", \"<partition-key>\")\n# lowest value to pull data for with the partitionColumn\n.option(\"lowerBound\", \"<min-value>\")\n# max value to pull data for with the partitionColumn\n.option(\"upperBound\", \"<max-value>\")\n# number of partitions to distribute the data into. Do not set this to a very large value (~hundreds)\n.option(\"numPartitions\", 8)\n.load()\n)\n\n``` \n```\nCREATE TEMPORARY VIEW employees_table_vw\nUSING JDBC\nOPTIONS (\nurl \"<jdbc-url>\",\ndbtable \"<table-name>\",\nuser '<username>',\npassword '<password>',\npartitionColumn \"<partition-key>\",\nlowerBound \"<min-value>\",\nupperBound \"<max-value>\",\nnumPartitions 8\n)\n\n``` \n```\nval employees_table = spark.read\n.format(\"jdbc\")\n.option(\"url\", \"<jdbc-url>\")\n.option(\"dbtable\", \"<table-name>\")\n.option(\"user\", \"<username>\")\n.option(\"password\", \"<password>\")\n\/\/ a column with a uniformly distributed range of values that can be used for parallelization\n.option(\"partitionColumn\", \"<partition-key>\")\n\/\/ lowest value to pull data for with the partitionColumn\n.option(\"lowerBound\", \"<min-value>\")\n\/\/ max value to pull data for with the partitionColumn\n.option(\"upperBound\", \"<max-value>\")\n\/\/ number of partitions to distribute the data into. Do not set this to a very large value (~hundreds)\n.option(\"numPartitions\", 8)\n.load()\n\n``` \nNote \nDatabricks supports all Apache Spark [options for configuring JDBC](https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-jdbc.html). 
\nWhen writing to databases using JDBC, Apache Spark uses the number of partitions in memory to control parallelism. You can repartition data before writing to control parallelism. Avoid a high number of partitions on large clusters to prevent overwhelming your remote database. The following example demonstrates repartitioning to eight partitions before writing: \n```\n(employees_table.repartition(8)\n.write\n.format(\"jdbc\")\n.option(\"url\", \"<jdbc-url>\")\n.option(\"dbtable\", \"<new-table-name>\")\n.option(\"user\", \"<username>\")\n.option(\"password\", \"<password>\")\n.save()\n)\n\n``` \n```\nCREATE TABLE new_employees_table\nUSING JDBC\nOPTIONS (\nurl \"<jdbc-url>\",\ndbtable \"<table-name>\",\nuser '<username>',\npassword '<password>'\n) AS\nSELECT \/*+ REPARTITION(8) *\/ * FROM employees_table_vw\n\n``` \n```\nemployees_table.repartition(8)\n.write\n.format(\"jdbc\")\n.option(\"url\", \"<jdbc-url>\")\n.option(\"dbtable\", \"<new-table-name>\")\n.option(\"user\", \"<username>\")\n.option(\"password\", \"<password>\")\n.save()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/jdbc.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query databases using JDBC\n##### Push down a query to the database engine\n\nYou can push down an entire query to the database and return just the result. The `dbtable` option identifies the JDBC table to read. You can use anything that is valid in a SQL query `FROM` clause. 
\n```\npushdown_query = \"(select * from employees where emp_no < 10008) as emp_alias\"\n\nemployees_table = (spark.read\n.format(\"jdbc\")\n.option(\"url\", \"<jdbc-url>\")\n.option(\"dbtable\", pushdown_query)\n.option(\"user\", \"<username>\")\n.option(\"password\", \"<password>\")\n.load()\n)\n\n``` \n```\nCREATE TEMPORARY VIEW employees_table_vw\nUSING JDBC\nOPTIONS (\nurl \"<jdbc-url>\",\ndbtable \"(select * from employees where emp_no < 10008) as emp_alias\",\nuser '<username>',\npassword '<password>'\n)\n\n``` \n```\nval pushdown_query = \"(select * from employees where emp_no < 10008) as emp_alias\"\n\nval employees_table = spark.read\n.format(\"jdbc\")\n.option(\"url\", \"<jdbc-url>\")\n.option(\"dbtable\", pushdown_query)\n.option(\"user\", \"<username>\")\n.option(\"password\", \"<password>\")\n.load()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/jdbc.html"} +{"content":"# Connect to data sources\n## Connect to external systems\n#### Query databases using JDBC\n##### Control number of rows fetched per query\n\nJDBC drivers have a `fetchSize` parameter that controls the number of rows fetched at a time from the remote database. \n| Setting | Result |\n| --- | --- |\n| Too low | High latency due to many roundtrips (few rows returned per query) |\n| Too high | Out of memory error (too much data returned in one query) | \nThe optimal value is workload dependent. Considerations include: \n* How many columns are returned by the query?\n* What data types are returned?\n* How long are the strings in each column returned? \nSystems might have a very small default and benefit from tuning. For example, Oracle\u2019s default `fetchSize` is 10. Increasing it to 100 reduces the number of total queries that need to be executed by a factor of 10. JDBC results are network traffic, so avoid very large numbers, but optimal values might be in the thousands for many datasets. 
\nUse the `fetchSize` option, as in the following example: \n```\nemployees_table = (spark.read\n.format(\"jdbc\")\n.option(\"url\", \"<jdbc-url>\")\n.option(\"dbtable\", \"<table-name>\")\n.option(\"user\", \"<username>\")\n.option(\"password\", \"<password>\")\n.option(\"fetchSize\", \"100\")\n.load()\n)\n\n``` \n```\nCREATE TEMPORARY VIEW employees_table_vw\nUSING JDBC\nOPTIONS (\nurl \"<jdbc-url>\",\ndbtable \"<table-name>\",\nuser '<username>',\npassword '<password>',\nfetchSize 100\n)\n\n``` \n```\nval employees_table = spark.read\n.format(\"jdbc\")\n.option(\"url\", \"<jdbc-url>\")\n.option(\"dbtable\", \"<table-name>\")\n.option(\"user\", \"<username>\")\n.option(\"password\", \"<password>\")\n.option(\"fetchSize\", \"100\")\n.load()\n\n```\n\n","doc_uri":"https:\/\/docs.databricks.com\/connect\/external-systems\/jdbc.html"} +{"content":"# Introduction to Databricks Lakehouse Monitoring\n### View Lakehouse Monitoring expenses\n\nPreview \nThis feature is in [Public Preview](https:\/\/docs.databricks.com\/release-notes\/release-types.html). \nThis article shows you how to track your Lakehouse Monitoring expenses. You can check expenses using a query or using the billing portal.\n\n### View Lakehouse Monitoring expenses\n#### View usage from the system table `system.billing.usage`\n\nYou can check Lakehouse Monitoring expenses using the system table `system.billing.usage`. For more information, see [Billable usage system table reference](https:\/\/docs.databricks.com\/admin\/system-tables\/billing.html). \n```\nSELECT usage_date, sum(usage_quantity) as dbus\nFROM system.billing.usage\nWHERE\nusage_date >= DATE_SUB(current_date(), 30) AND\nsku_name like \"%JOBS_SERVERLESS%\" AND\ncustom_tags[\"LakehouseMonitoring\"] = \"true\"\nGROUP BY usage_date\nORDER BY usage_date DESC\n\n```\n\n### View Lakehouse Monitoring expenses\n#### View usage from the billing portal\n\nYou can also check Lakehouse Monitoring expenses using the billing portal. \n1. 
Log in to the [Databricks account console](https:\/\/accounts.cloud.databricks.com\/login).\n2. In the sidebar, click the **Usage** icon.\n3. On the Usage page, select **By tags**.\n4. In the first drop-down menu, select **LakehouseMonitoring** as the tag key.\n5. In the second drop-down menu, select **true** as the tag value. After you do this, **true** appears in the UI as shown in the diagram, and the second drop-down menu shows `LakehouseMonitoring(1)` to indicate that one tag key is selected. \n![track monitoring expenses AWS](https:\/\/docs.databricks.com\/_images\/track-expenses-aws.png)\n\n","doc_uri":"https:\/\/docs.databricks.com\/lakehouse-monitoring\/expense.html"} +{"content":"# Compute\n## What are Databricks pools?\n#### Pool configuration reference\n\nThis article describes the available settings when creating a pool using the UI. To learn how to use the Databricks CLI to create a pool, see [Instance Pools CLI (legacy)](https:\/\/docs.databricks.com\/archive\/dev-tools\/cli\/instance-pools-cli.html). To learn how to use the REST API to create a pool, see the [Instance Pools API](https:\/\/docs.databricks.com\/api\/workspace\/instancepools).\n\n#### Pool configuration reference\n##### Pool size and auto termination\n\nWhen you create a pool, in order to control its size, you can set three parameters: minimum idle instances, maximum capacity, and idle instance auto termination. \n### Minimum Idle Instances \nThe minimum number of instances the pool keeps idle. These instances do not terminate, regardless of the auto termination settings. If a cluster consumes idle instances from the pool, Databricks provisions additional instances to maintain the minimum. \n### Maximum Capacity \nThe maximum number of instances the pool can provision. If set, this value constrains *all instances* (idle + used). 
If a cluster using the pool requests more instances than this number during [autoscaling](https:\/\/docs.databricks.com\/compute\/configure.html#autoscaling), the request fails with an `INSTANCE_POOL_MAX_CAPACITY_FAILURE` error. \nThis configuration is *optional*. Databricks recommends setting a value only in the following circumstances: \n* You have an instance quota you *must* stay under.\n* You want to protect one set of work from impacting another set of work. For example, suppose your instance quota is 100 and you have teams A and B that need to run jobs. You can create pool A with a max of 50 and pool B with a max of 50 so that the two teams share the 100 quota fairly.\n* You need to cap cost. \n### Idle Instance Auto Termination \nThe time in minutes above the value set in [Minimum Idle Instances](https:\/\/docs.databricks.com\/compute\/pools.html#pool-min) that instances can be idle before being terminated by the pool.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/pools.html"} +{"content":"# Compute\n## What are Databricks pools?\n#### Pool configuration reference\n##### Instance types\n\nA pool consists of both idle instances kept ready for new clusters and instances in use by running clusters. All of these instances are of the same instance provider type, selected when creating a pool. \nA pool\u2019s instance type cannot be edited. Clusters attached to a pool use the same instance type for the driver and worker nodes. Different families of instance types fit different use cases, such as memory-intensive or compute-intensive workloads. \nDatabricks always provides one year\u2019s deprecation notice before ceasing support for an instance type.\n\n#### Pool configuration reference\n##### Preloaded Databricks Runtime version\n\nYou can speed up cluster launches by selecting a Databricks Runtime version to be loaded on idle instances in the pool. 
If a user selects that runtime when they create a cluster backed by the pool, that cluster will launch even more quickly than a pool-backed cluster that doesn\u2019t use a preloaded Databricks Runtime version. \nSetting this option to **None** slows down cluster launches, as it causes the Databricks Runtime version to download on demand to idle instances in the pool. When the cluster releases the instances in the pool, the Databricks Runtime version remains cached on those instances. The next cluster creation operation that uses the same Databricks Runtime version might benefit from this caching behavior, but it is not guaranteed.\n\n#### Pool configuration reference\n##### Preloaded Docker image\n\nDocker images are supported with pools if you use the [Instance Pools API](https:\/\/docs.databricks.com\/api\/workspace\/instancepools) to create the pool.\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/pools.html"} +{"content":"# Compute\n## What are Databricks pools?\n#### Pool configuration reference\n##### Pool tags\n\nPool tags allow you to easily monitor the cost of cloud resources used by various groups in your organization. You can specify tags as key-value pairs when you create a pool, and Databricks applies these tags to cloud resources like VMs and disk volumes, as well as [DBU usage reports](https:\/\/docs.databricks.com\/admin\/account-settings\/usage.html). \nFor convenience, Databricks applies three default tags to each pool: `Vendor`,\n`DatabricksInstancePoolId`, and `DatabricksInstancePoolCreatorId`. You can also add custom tags when you create a pool. You can add up to 43 custom tags. \n### Custom tags \nTo add additional tags to the pool, navigate to the **Tags** tab at the bottom of the **Create Pool** page. Click the **+ Add** button, then enter the key-value pair. \nPool-backed clusters inherit default and custom tags from the pool configuration. 
For detailed information about how pool tags and cluster tags work together, see [Monitor usage using tags](https:\/\/docs.databricks.com\/admin\/account-settings\/usage-detail-tags.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/pools.html"} +{"content":"# Compute\n## What are Databricks pools?\n#### Pool configuration reference\n##### AWS configurations\n\nWhen you configure a pool\u2019s AWS instances you can choose the availability zone (AZ), whether to use spot instances and the max spot price, and the EBS volume type and size. All clusters attached to the pool inherit these configurations. \n### Availability zones \nChoosing a specific AZ for a pool is useful primarily if your organization has purchased reserved instances in specific availability zones. For more information on AZs, see [AWS availability zones](https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/using-regions-availability-zones.html). \n#### Auto-AZ with pools \nIf you use a fleet instance type with your pool, you can select **auto** as the availability zone. When you use auto-AZ, the availability zone is automatically selected based on available cloud provider capacity. The pool will be moved to the best AZ right before every scale-up-from-zero event, and will remain fixed to a single AZ while the pool is non-empty. For more information, see [AWS Fleet instance types](https:\/\/docs.databricks.com\/compute\/configure.html#fleet). \nClusters that you attach to a pool inherit the pool\u2019s availability zone. You cannot specify the availability zone for individual clusters in pools. \n### Spot instances \nYou can specify whether you want the pool to use spot instances. A pool can either be all spot instances or all on-demand instances. \nYou can also set the max spot price to use when launching spot instances. This is set as a percentage of the corresponding on-demand price. By default, Databricks sets the max spot price at 100% of the on-demand price. 
See [AWS spot pricing](https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/using-spot-instances.html). \n### EBS volumes \nDatabricks provisions EBS volumes for every instance as follows: \n* A 30 GB unencrypted EBS instance root volume used only by the host operating system and Databricks internal services.\n* A 150 GB encrypted EBS container root volume used by the Spark worker. This hosts Spark services and logs.\n* (HIPAA only) a 75 GB encrypted EBS worker log volume that stores logs for Databricks internal services. \n#### Add EBS shuffle volumes \nTo add shuffle volumes, select **General Purpose SSD** in the **EBS Volume Type** dropdown list. \nBy default, Spark shuffle outputs go to the instance local disk. For instance types that do not\nhave a local disk, or if you want to increase your Spark shuffle storage space, you can specify\nadditional EBS volumes. This is particularly useful to prevent out of disk space errors when you\nrun Spark jobs that produce large shuffle outputs. \nDatabricks encrypts these EBS volumes for both on-demand and spot instances. Read more about\n[AWS EBS volumes](https:\/\/aws.amazon.com\/ebs\/features\/). \n#### AWS EBS limits \nEnsure that your AWS EBS limits are high enough to satisfy the runtime requirements for all\ninstances in all pools. For information on the default EBS limits and how to change them,\nsee [Amazon Elastic Block Store (EBS) Limits](https:\/\/docs.aws.amazon.com\/general\/latest\/gr\/aws_service_limits.html#limits_ebs). \n### Autoscaling local storage \nIf you don\u2019t want to allocate a fixed number of EBS volumes at pool creation time, use\nautoscaling local storage. With autoscaling local storage, Databricks monitors the amount of free\ndisk space available on your pool\u2019s Spark workers. If a worker begins to run too low on\ndisk, Databricks automatically attaches a new EBS volume to the worker before it runs out of disk\nspace. 
EBS volumes are attached up to a limit of 5 TB of total disk space per instance\n(including the instance\u2019s local storage). \nTo configure autoscaling storage, select **Enable autoscaling local storage**. \nThe EBS volumes attached to an instance are detached only when the instance is returned to AWS.\nThat is, EBS volumes are never detached from an instance as long as it is in the pool.\nTo scale down EBS usage, Databricks recommends configuring the [Pool size and auto termination](https:\/\/docs.databricks.com\/compute\/pools.html#instance-pool-sizing). \nNote \n* Databricks uses Throughput Optimized HDD (st1) to extend the local storage of an instance. The [default AWS capacity limit](https:\/\/docs.aws.amazon.com\/general\/latest\/gr\/aws_service_limits.html#limits_ebs) for these volumes is 20 TiB. To avoid hitting this limit, administrators should request an increase in this limit based on their usage requirements.\n* If you want to use autoscaling local storage, the IAM role or keys used to create your account must include the permissions `ec2:AttachVolume`, `ec2:CreateVolume`, `ec2:DeleteVolume`, and `ec2:DescribeVolumes`. For the complete list of permissions and instructions on how to update your existing IAM role or keys, see [Create an IAM role for workspace deployment](https:\/\/docs.databricks.com\/admin\/account-settings-e2\/credentials.html).\n\n","doc_uri":"https:\/\/docs.databricks.com\/compute\/pools.html"} +{"content":"# Technology partners\n## Connect to data governance partners using Partner Connect\n#### Connect Databricks to Alation\n\nThis article describes how to connect your Databricks workspace to Alation. 
The Databricks integration with Alation\u2019s data governance platform extends the data discovery, governance, and catalog capabilities of Unity Catalog across data sources.\n\n#### Connect Databricks to Alation\n##### Connect to Alation using Partner Connect\n\nTo connect to Alation using Partner Connect, see [Connect to data governance partners using Partner Connect](https:\/\/docs.databricks.com\/partner-connect\/data-governance.html).\n\n#### Connect Databricks to Alation\n##### Connect to Alation manually\n\nTo connect to Databricks from Alation manually, see the following articles in the Alation documentation: \n* (Recommended: Unity Catalog) [Databricks Unity Catalog OCF Connector: Install and Configure](https:\/\/docs2.alationdata.com\/en\/latest\/sources\/OpenConnectorFramework\/DatabricksUnityCatalog\/DatabricksUnityCatalogOCFInstallConfig.html) \n* (Legacy Hive metastore) [Databricks on AWS OCF Connector: Install and Configure](https:\/\/docs.alationdata.com\/en\/latest\/sources\/OpenConnectorFramework\/DatabricksonAWS\/DatabricksonAWSOCFConnectorInstallandConfigure.html)\n\n#### Connect Databricks to Alation\n##### Additional resources\n\n* [Alation website](https:\/\/www.alation.com\/)\n* [Alation documentation](https:\/\/docs.alationdata.com\/)\n\n","doc_uri":"https:\/\/docs.databricks.com\/partners\/data-governance\/alation.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Spark memory issues\n###### Verifying a memory issue\n\nMemory issues often result in error messages such as the following: \n```\nSparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 30) (10.139.64.114 executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. 
Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.\n\n``` \nThese error messages, however, are often generic and can be caused by other issues. So, if you suspect you have a memory issue, you can verify the issue by doubling the memory per core to see if it impacts your problem. \nFor example, if you have a worker type with 4 cores and 16GB of memory, you can try switching to a worker type that has 4 cores and 32GB of memory. That will give you 8GB per core compared to the 4GB per core you had before. It\u2019s the ratio of memory to cores that matters here. If it takes longer to fail with the extra memory or doesn\u2019t fail at all, that\u2019s a good sign that you\u2019re on the right track. \nIf you can fix your issue by increasing the memory, great! Maybe that\u2019s the solution. If it doesn\u2019t fix the issue, or you can\u2019t bear the extra cost, you should dig deeper.\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/spark-memory-issues.html"} +{"content":"# Databricks data engineering\n## Optimization recommendations on Databricks\n### Diagnose cost and performance issues using the Spark UI\n##### Spark memory issues\n###### Possible causes\n\nThere are a lot of potential reasons for memory problems: \n* [Too few shuffle partitions](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#data-spilling)\n* [Large broadcast](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#broadcast-hash)\n* [UDFs](https:\/\/docs.databricks.com\/udf\/index.html)\n* [Window function](https:\/\/docs.databricks.com\/sql\/language-manual\/sql-ref-window-functions.html) without a `PARTITION BY` clause\n* [Skew](https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide#data-skewness)\n* [Streaming 
State](https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html#state-store)\n\n","doc_uri":"https:\/\/docs.databricks.com\/optimizations\/spark-ui-guide\/spark-memory-issues.html"}