Machine Learning on Databricks

This repository contains the code files for a complete Data Science project on the Databricks platform. The example solves a classification problem using the Pima Indian Diabetes dataset.

Usage:

Using Databricks Repos
  1. Open your Databricks Workspace. Create a cluster with Databricks Runtime 13.3 LTS ML or later.

  2. Clone this repo (Read How)

    NOTE: If you are trying to connect to a repo that is access controlled, configure GitHub integration first (Read how)

  3. Run the 00-Setup Notebook first. This needs to be run only once.

  4. Run the 01-Data Ingestion Notebook next. This also needs to be run only once.

  5. Run the notebooks 02-Exploratory Data Analysis, 03-Feature Engineering, 04-b HPT Training Evaluation NoFS, and 04-c HPT Training Evaluation FS, in that order.

NOTE: The 04-b and 04-c notebooks register the model in either the Model Registry or Unity Catalog, depending on your Workspace configuration. Before running any other notebooks, make sure the registered models are marked for production use (see the sketch after this list):

  • If Unity Catalog is enabled, make sure the model version has the champion alias. Read How
  • If Unity Catalog is not enabled, use the Model Registry UI to transition the version to Production. Read How
  6. Any time you want to clean up all the resources you created, run the 09-Cleanup Notebook. You can always recreate the resources by re-running the notebooks from Step 3.
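
For reference, a minimal sketch of marking a registered version for production from a notebook cell, assuming MLflow is available on the cluster. The model names and version number are placeholders; substitute whatever the 04-b/04-c notebooks registered, and use only the branch that matches your Workspace:

```python
import mlflow
from mlflow.tracking import MlflowClient

# --- If Unity Catalog is enabled: point the "champion" alias at the version ---
mlflow.set_registry_uri("databricks-uc")
MlflowClient().set_registered_model_alias(
    name="main.default.diabetes_prediction",  # placeholder three-level UC name
    alias="champion",
    version=1,
)

# --- If Unity Catalog is NOT enabled: transition the stage instead ---
mlflow.set_registry_uri("databricks")
MlflowClient().transition_model_version_stage(
    name="diabetes_prediction",  # placeholder registered model name
    version=1,
    stage="Production",
)
```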
Import via zip file

If you don't have access to public repos in your workspace, you can follow the steps below:

  1. Download the project as a zip file by navigating to <> Code and selecting Download ZIP from the Code dropdown.
  2. In your Databricks workspace, import the zip file using the Import option from the menu bar. (Read How)

Overview of ML on Databricks:

Built on an open lakehouse architecture, AI and Machine Learning on Databricks empowers ML teams to prepare and process data, streamlines cross-team collaboration, and standardizes the full ML lifecycle from experimentation to production, including for generative AI and large language models.

AutoML: Databricks AutoML allows you to quickly generate baseline models and notebooks. Read More
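
As an illustration of the API, a minimal AutoML classification call; the feature table name is an assumption, while the Outcome column comes from the Pima dataset:

```python
from databricks import automl

# Launch an AutoML classification experiment; AutoML trains a set of
# baseline models and generates an editable notebook for each trial.
df = spark.table("diabetes")  # placeholder table name from the ingestion step
summary = automl.classify(
    dataset=df,
    target_col="Outcome",     # Pima Indian Diabetes label column
    timeout_minutes=30,
)
print(summary.best_trial.notebook_url)  # open the best trial's generated notebook
```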

Distributed Hyperparameter Tuning: Databricks Runtime ML includes Hyperopt, a Python library that facilitates distributed hyperparameter tuning and model selection. With Hyperopt, you can scan a set of Python models while varying algorithms and hyperparameters across spaces that you define. Read More
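
A minimal Hyperopt sketch with SparkTrials, assuming X and y are a feature matrix and label vector already in memory (for example, the Pima features and Outcome labels):

```python
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

search_space = {
    "n_estimators": hp.quniform("n_estimators", 50, 500, 50),
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
}

def objective(params):
    # With SparkTrials, each evaluation runs as a Spark task on the cluster
    model = RandomForestClassifier(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
    )
    auc = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    return {"loss": -auc, "status": STATUS_OK}  # fmin minimizes the loss

best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=32,
    trials=SparkTrials(parallelism=8),  # run up to 8 trials concurrently
)
```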

Distributed Training: Databricks enables distributed training and inference if your model or your data are too large to fit in memory on a single machine. For these workloads, Databricks Runtime ML includes the TorchDistributor, Horovod and spark-tensorflow-distributor packages. Read More
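
A skeleton of the TorchDistributor API with the training loop elided; the process count, backend, and learning-rate argument are illustrative:

```python
from pyspark.ml.torch.distributor import TorchDistributor

def train_fn(learning_rate):
    # TorchDistributor sets the usual torch.distributed environment
    # variables (RANK, WORLD_SIZE, ...) before invoking this function.
    import torch.distributed as dist
    dist.init_process_group("gloo")  # use "nccl" on GPU clusters
    # ... standard PyTorch DDP training loop goes here ...
    dist.destroy_process_group()

TorchDistributor(
    num_processes=2,   # total number of training processes
    local_mode=False,  # False: run on workers; True: run on the driver only
    use_gpu=False,
).run(train_fn, 1e-3)
```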

Automate experiment tracking and governance: Managed MLflow automatically tracks your experiments and logs parameters, metrics, versioning of data and code, as well as model artifacts with each training run. You can quickly see previous runs, compare results and reproduce a past result, as needed. Once you have identified the best version of a model for production, register it to the Model Registry to simplify handoffs along the deployment lifecycle. Read More
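
A minimal tracking sketch, assuming X_train/X_test and y_train/y_test splits already exist:

```python
import mlflow
from sklearn.linear_model import LogisticRegression

mlflow.autolog()  # automatically log params, metrics, and the model artifact

with mlflow.start_run(run_name="diabetes_baseline"):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
```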

Manage the full model lifecycle from data to production — and back: Once trained models are registered, you can collaboratively manage them through their lifecycle with the Model Registry. Models can be versioned and moved through various stages, like experimentation, staging, production and archived. The lifecycle management integrates with approval and governance workflows according to role-based access controls. Comments and email notifications provide a rich collaborative environment for data teams.
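
For example, the registered versions and their lifecycle stages can be inspected programmatically; the model name is a placeholder:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
# List every version of a registered model with its current stage
for mv in client.search_model_versions("name='diabetes_prediction'"):
    print(mv.version, mv.current_stage, mv.run_id)
```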

Deploy ML models at scale and low latency: Deploy models with a single click without having to worry about server management or scale constraints. With Databricks, you can deploy your models as REST API endpoints anywhere with enterprise-grade availability.
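
Once an endpoint exists, it can be queried over REST. The host, endpoint name, and token below are placeholders, and the feature columns are an illustrative subset of the Pima schema:

```python
import requests

url = "https://<workspace-host>/serving-endpoints/diabetes-endpoint/invocations"
headers = {"Authorization": "Bearer <personal-access-token>"}
payload = {
    "dataframe_records": [
        # The real input schema depends on the feature engineering notebook
        {"Pregnancies": 2, "Glucose": 120, "BMI": 32.0, "Age": 45}
    ]
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
```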

Use generative AI and large language models: Integrate existing pretrained models — such as those from the Hugging Face transformers library or other open source libraries — into your workflow. Transformer pipelines make it easy to use GPUs and allow batching of items sent to the GPU for better throughput.
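
For instance, a transformers pipeline with GPU placement and batching (the model falls back to the library's default for the task):

```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    device=0,       # index of a GPU attached to the cluster; use -1 for CPU
    batch_size=16,  # batch inputs sent to the GPU for better throughput
)
print(classifier(["Databricks clusters spin up quickly.", "This run failed."]))
```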

Customize a model on your data for your specific task: With the support of open source tooling, such as Hugging Face and DeepSpeed, you can quickly and efficiently take a foundation LLM and start training with your own data to achieve greater accuracy for your domain and workload. This also gives you control over the data used for training, so you can make sure you're using AI responsibly. Read More
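
A compressed fine-tuning skeleton using the Hugging Face Trainer with a DeepSpeed config, assuming train_ds is a tokenized dataset you have prepared and ds_config.json is a DeepSpeed configuration file you provide; the base model name is a placeholder:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # placeholder foundation model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

args = TrainingArguments(
    output_dir="/tmp/finetune",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    deepspeed="ds_config.json",  # hand optimizer/sharding off to DeepSpeed
)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```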
