This repository contains the code files for a complete data science project on the Databricks platform. The example solves a classification problem using the Pima Indian Diabetes dataset.
- Open your Databricks workspace and create a compute resource with Databricks Runtime 13.3 LTS ML or later.
- Clone this repo (Read How).
  NOTE: If you are trying to connect to a repo that is access controlled, configure GitHub integration first (Read How).
- Run the `00-Setup` notebook first. This needs to be run only once.
- Run the `01-Data Ingestion` notebook next. This also needs to be run only once.
- Run the notebooks `02-Exploratory Data Analysis`, `03-Feature Engineering`, `04-b HPT Training Evaluation NoFS`, and `04-c HPT Training Evaluation FS`, in that order.
  NOTE: The `04-b` and `04-c` notebooks register a model in the Model Registry or in Unity Catalog, depending on your workspace configuration. Before running any other notebooks, make sure the registered model versions are marked for production use (a notebook sketch follows this list):
  - If Unity Catalog is enabled, make sure the model version has the `champion` alias (Read How).
  - If Unity Catalog is not enabled, use the Model Registry UI to transition the version to `Production` (Read How).
- Any time you want to clean up all the resources you created, run the `09-Cleanup` notebook. You can always recreate the resources by re-running the notebooks from Step 3 (`00-Setup`) onwards.
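The production marking described in the NOTE above can also be done programmatically. A minimal sketch using the MLflow client, assuming a registered model named `pima_diabetes_model` (hypothetical; use the name the `04-b`/`04-c` notebooks actually registered) at version `1`:

```python
from mlflow import MlflowClient

client = MlflowClient()

# Unity Catalog enabled: give the version the "champion" alias.
# (Point MLflow at the UC registry first: mlflow.set_registry_uri("databricks-uc"))
client.set_registered_model_alias("pima_diabetes_model", "champion", "1")

# Unity Catalog not enabled: transition the version to the Production stage.
client.transition_model_version_stage(
    name="pima_diabetes_model", version="1", stage="Production"
)
```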
If you don't have access to public repos in your workspace, you can follow the steps below instead:
- Download the project as a zip file by navigating to `<> Code` and selecting `Download ZIP` from the `Code` dropdown.
- In your Databricks workspace, import the zip file using the `Import` option from the menu bar (Read How).
Built on an open lakehouse architecture, AI and Machine Learning on Databricks empowers ML teams to prepare and process data, streamlines cross-team collaboration, and standardizes the full ML lifecycle from experimentation to production, including for generative AI and large language models.
AutoML: Databricks AutoML allows you to quickly generate baseline models and notebooks. Read More
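As a rough sketch of what that looks like in a notebook (the table and column names are placeholders for this project's diabetes data):

```python
from databricks import automl

# `spark` is predefined in Databricks notebooks; the table name is illustrative.
df = spark.table("pima_diabetes")

# Train a set of baseline classifiers and generate the accompanying notebooks.
summary = automl.classify(dataset=df, target_col="Outcome", timeout_minutes=30)
print(summary.best_trial.model_path)  # MLflow URI of the best baseline model
```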
Distributed Hyperparameter Tuning: Databricks Runtime ML includes Hyperopt, a Python library that facilitates distributed hyperparameter tuning and model selection. With Hyperopt, you can scan a set of Python models while varying algorithms and hyperparameters across spaces that you define. Read More
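A minimal sketch of distributed tuning with Hyperopt's `SparkTrials` (the toy objective stands in for real model training):

```python
from hyperopt import fmin, tpe, hp, SparkTrials

# Toy objective; a real one would train a model with `params` and return a loss.
def objective(params):
    return (params["x"] - 3) ** 2

search_space = {"x": hp.uniform("x", -10, 10)}

# SparkTrials fans the trials out across the cluster's worker nodes.
best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=64,
    trials=SparkTrials(parallelism=4),
)
print(best)  # e.g. {'x': 2.99...}
```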
Distributed Training: Databricks enables distributed training and inference when your models or data are too large to fit in memory on a single machine. For these workloads, Databricks Runtime ML includes the TorchDistributor, Horovod, and spark-tensorflow-distributor packages. Read More
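A skeleton of a TorchDistributor launch, assuming a CPU cluster (the training body is elided; a real job would build the network and loop over a distributed DataLoader):

```python
from pyspark.ml.torch.distributor import TorchDistributor

def train_fn(learning_rate):
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("gloo")  # use "nccl" when use_gpu=True
    model = DDP(torch.nn.Linear(8, 1))  # toy model; real code builds the network here
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    # ... training loop over a DistributedSampler-backed DataLoader ...
    dist.destroy_process_group()

# Launch two worker processes across the cluster; rank 0's result is returned.
TorchDistributor(num_processes=2, local_mode=False, use_gpu=False).run(train_fn, 1e-3)
```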
Automate experiment tracking and governance: Managed MLflow automatically tracks your experiments and logs parameters, metrics, versioning of data and code, as well as model artifacts with each training run. You can quickly see previous runs, compare results and reproduce a past result, as needed. Once you have identified the best version of a model for production, register it to the Model Registry to simplify handoffs along the deployment lifecycle. Read More
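A minimal sketch of a tracked training run (the train/validation splits are assumed to be prepared upstream):

```python
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# X_train, y_train, X_val, y_val are assumed to exist already.
with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 200, "max_depth": 5}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params).fit(X_train, y_train)

    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    mlflow.log_metric("val_auc", auc)

    # Log the fitted model as a run artifact for later registration.
    mlflow.sklearn.log_model(model, artifact_path="model")
```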
Manage the full model lifecycle from data to production — and back: Once trained models are registered, you can collaboratively manage them through their lifecycle with the Model Registry. Models can be versioned and moved through various stages, like experimentation, staging, production and archived. The lifecycle management integrates with approval and governance workflows according to role-based access controls. Comments and email notifications provide a rich collaborative environment for data teams.
- Read More About Model Lifecycle Management in a Non UC Workspace
- Read More About Model Lifecycle Management in a UC Enabled Workspace
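A short sketch of registering a logged model, assuming a run like the tracking example above (the model name is illustrative):

```python
import mlflow

# For Unity Catalog, point the client at the UC registry and use a
# three-level name, e.g. "main.default.pima_diabetes_model":
# mlflow.set_registry_uri("databricks-uc")

run_id = mlflow.last_active_run().info.run_id  # run from the example above
result = mlflow.register_model(f"runs:/{run_id}/model", "pima_diabetes_model")
print(result.version)  # the newly created model version
```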
Deploy ML models at scale and low latency: Deploy models with a single click without having to worry about server management or scale constraints. With Databricks, you can deploy your models as REST API endpoints anywhere with enterprise-grade availability.
- Deployment for Batch and Streaming Inference
- Deployment for Real Time Inference - Model Serving
- Model deployment patterns
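For the batch pattern, a minimal sketch of scoring with the registered model as a Spark UDF (names are illustrative; the `@champion` alias URI assumes Unity Catalog, while the workspace registry would use `models:/pima_diabetes_model/Production` instead):

```python
import mlflow.pyfunc

# Load the production model as a Spark UDF for batch (or streaming) scoring.
predict = mlflow.pyfunc.spark_udf(spark, "models:/pima_diabetes_model@champion")

features = spark.table("pima_diabetes_features")  # hypothetical feature table
scored = features.withColumn("prediction", predict(*features.columns))
display(scored)
```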
Use generative AI and large language models: Integrate existing pretrained models — such as those from the Hugging Face transformers library or other open source libraries — into your workflow. Transformer pipelines make it easy to use GPUs and allow batching of items sent to the GPU for better throughput.
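As a rough illustration of pipeline batching (the inputs are placeholders, and `device=0` assumes a GPU is attached):

```python
from transformers import pipeline

# A GPU-backed text-classification pipeline; batch_size groups the inputs
# sent to the GPU, improving throughput over one-at-a-time calls.
classifier = pipeline("text-classification", device=0)

texts = ["Patient shows elevated glucose.", "All vitals within normal range."]
results = classifier(texts, batch_size=8)
print(results)  # e.g. [{'label': ..., 'score': ...}, ...]
```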
Customize a model on your data for your specific task. With the support of open source tooling, such as Hugging Face and DeepSpeed, you can quickly and efficiently take a foundation LLM and start training with your own data to have more accuracy for your domain and workload. This also gives you control to govern the data used for training so you can make sure you’re using AI responsibly. Read More