This repository contains the code files for a complete data science project on the Databricks platform. The example solves a classification problem using the Pima Indian Diabetes dataset.
- Open your Databricks workspace and create a compute resource with Databricks Runtime 13.3 LTS ML or later.
- Clone this repo (Read How).
  NOTE: If you are trying to connect to a repo that is access controlled, configure GitHub integration first (Read How).
- Run the `00-Setup` notebook first. This needs to be run only once.
- Run the `01-Data Ingestion` notebook next. This also needs to be run only once.
- Run the notebooks `02-Exploratory Data Analysis`, `03-Feature Engineering`, `04-b HPT Training Evaluation NoFS`, and `04-c HPT Training Evaluation FS`, in that order.
  NOTE: The `04-b` and `04-c` notebooks register a model in the Model Registry or in Unity Catalog, depending on your workspace configuration. Before running any other notebooks, make sure the registered model versions are marked for production use (a notebook sketch follows this list):
  - If Unity Catalog is enabled, make sure the model version has the `champion` alias (Read How).
  - If Unity Catalog is not enabled, use the Model Registry UI to transition the version to `Production` (Read How).
- Any time you want to clean up all the resources you created, run the `09-Cleanup` notebook. You can always recreate the resources by re-running the notebooks from Step 3 (`00-Setup`) onwards.
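The production marking described in the NOTE above can also be done programmatically. A minimal sketch using the MLflow client, assuming a registered model named `pima_diabetes_model` (hypothetical; use the name the `04-b`/`04-c` notebooks actually registered) at version `1`:

```python
from mlflow import MlflowClient

client = MlflowClient()

# Unity Catalog enabled: give the version the "champion" alias.
# (Point MLflow at the UC registry first: mlflow.set_registry_uri("databricks-uc"))
client.set_registered_model_alias("pima_diabetes_model", "champion", "1")

# Unity Catalog not enabled: transition the version to the Production stage.
client.transition_model_version_stage(
    name="pima_diabetes_model", version="1", stage="Production"
)
```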
If you don't have access to public repos in your workspace, you can follow the steps below instead:
- Download the project as a zip file by navigating to `<> Code` and selecting `Download ZIP` from the `Code` dropdown.
- In your Databricks workspace, import the zip file using the `Import` option from the menu bar (Read How).
Built on an open lakehouse architecture, AI and Machine Learning on Databricks empowers ML teams to prepare and process data, streamlines cross-team collaboration, and standardizes the full ML lifecycle from experimentation to production, including for generative AI and large language models.
AutoML: Databricks AutoML allows you to quickly generate baseline models and notebooks. Read More
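As a rough sketch of what that looks like in a notebook (the table and column names are placeholders for this project's diabetes data):

```python
from databricks import automl

# `spark` is predefined in Databricks notebooks; the table name is illustrative.
df = spark.table("pima_diabetes")

# Train a set of baseline classifiers and generate the accompanying notebooks.
summary = automl.classify(dataset=df, target_col="Outcome", timeout_minutes=30)
print(summary.best_trial.model_path)  # MLflow URI of the best baseline model
```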
Distributed Hyperparameter Tuning: Databricks Runtime ML includes Hyperopt, a Python library that facilitates distributed hyperparameter tuning and model selection. With Hyperopt, you can scan a set of Python models while varying algorithms and hyperparameters across spaces that you define. Read More
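A minimal sketch of distributed tuning with Hyperopt's `SparkTrials` (the toy objective stands in for real model training):

```python
from hyperopt import fmin, tpe, hp, SparkTrials

# Toy objective; a real one would train a model with `params` and return a loss.
def objective(params):
    return (params["x"] - 3) ** 2

search_space = {"x": hp.uniform("x", -10, 10)}

# SparkTrials fans the trials out across the cluster's worker nodes.
best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=64,
    trials=SparkTrials(parallelism=4),
)
print(best)  # e.g. {'x': 2.99...}
```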
Distributed Training: Databricks enables distributed training and inference when your models or data are too large to fit in memory on a single machine. For these workloads, Databricks Runtime ML includes the TorchDistributor, Horovod, and spark-tensorflow-distributor packages. Read More
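A skeleton of a TorchDistributor launch, assuming a CPU cluster (the training body is elided; a real job would build the network and loop over a distributed DataLoader):

```python
from pyspark.ml.torch.distributor import TorchDistributor

def train_fn(learning_rate):
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("gloo")  # use "nccl" when use_gpu=True
    model = DDP(torch.nn.Linear(8, 1))  # toy model; real code builds the network here
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    # ... training loop over a DistributedSampler-backed DataLoader ...
    dist.destroy_process_group()

# Launch two worker processes across the cluster; rank 0's result is returned.
TorchDistributor(num_processes=2, local_mode=False, use_gpu=False).run(train_fn, 1e-3)
```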
Automate experiment tracking and governance: Managed MLflow automatically tracks your experiments and logs parameters, metrics, versioning of data and code, as well as model artifacts with each training run. You can quickly see previous runs, compare results and reproduce a past result, as needed. Once you have identified the best version of a model for production, register it to the Model Registry to simplify handoffs along the deployment lifecycle. Read More
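A minimal sketch of a tracked training run (the train/validation splits are assumed to be prepared upstream):

```python
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# X_train, y_train, X_val, y_val are assumed to exist already.
with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 200, "max_depth": 5}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params).fit(X_train, y_train)

    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    mlflow.log_metric("val_auc", auc)

    # Log the fitted model as a run artifact for later registration.
    mlflow.sklearn.log_model(model, artifact_path="model")
```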
Manage the full model lifecycle from data to production — and back: Once trained models are registered, you can collaboratively manage them through their lifecycle with the Model Registry. Models can be versioned and moved through various stages, like experimentation, staging, production and archived. The lifecycle management integrates with approval and governance workflows according to role-based access controls. Comments and email notifications provide a rich collaborative environment for data teams.
- Read More About Model Lifecycle Management in a Non UC Workspace
- Read More About Model Lifecycle Management in a UC Enabled Workspace
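A short sketch of registering a logged model, assuming a run like the tracking example above (the model name is illustrative):

```python
import mlflow

# For Unity Catalog, point the client at the UC registry and use a
# three-level name, e.g. "main.default.pima_diabetes_model":
# mlflow.set_registry_uri("databricks-uc")

run_id = mlflow.last_active_run().info.run_id  # run from the example above
result = mlflow.register_model(f"runs:/{run_id}/model", "pima_diabetes_model")
print(result.version)  # the newly created model version
```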
Deploy ML models at scale and low latency: Deploy models with a single click without having to worry about server management or scale constraints. With Databricks, you can deploy your models as REST API endpoints anywhere with enterprise-grade availability.
- Deployment for Batch and Streaming Inference
- Deployment for Real Time Inference - Model Serving
- Model deployment patterns
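For the batch pattern, a minimal sketch of scoring with the registered model as a Spark UDF (names are illustrative; the `@champion` alias URI assumes Unity Catalog, while the workspace registry would use `models:/pima_diabetes_model/Production` instead):

```python
import mlflow.pyfunc

# Load the production model as a Spark UDF for batch (or streaming) scoring.
predict = mlflow.pyfunc.spark_udf(spark, "models:/pima_diabetes_model@champion")

features = spark.table("pima_diabetes_features")  # hypothetical feature table
scored = features.withColumn("prediction", predict(*features.columns))
display(scored)
```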
Use generative AI and large language models: Integrate existing pretrained models — such as those from the Hugging Face transformers library or other open source libraries — into your workflow. Transformer pipelines make it easy to use GPUs and allow batching of items sent to the GPU for better throughput.
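As a rough illustration of pipeline batching (the inputs are placeholders, and `device=0` assumes a GPU is attached):

```python
from transformers import pipeline

# A GPU-backed text-classification pipeline; batch_size groups the inputs
# sent to the GPU, improving throughput over one-at-a-time calls.
classifier = pipeline("text-classification", device=0)

texts = ["Patient shows elevated glucose.", "All vitals within normal range."]
results = classifier(texts, batch_size=8)
print(results)  # e.g. [{'label': ..., 'score': ...}, ...]
```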
Customize a model on your data for your specific task. With the support of open source tooling, such as Hugging Face and DeepSpeed, you can quickly and efficiently take a foundation LLM and start training with your own data to have more accuracy for your domain and workload. This also gives you control to govern the data used for training so you can make sure you’re using AI responsibly. Read More