Azure batch
- Services:
  - Azure Batch account
  - Azure Storage account
  - Data Factory - may be useful for parameterised running
Structure of running jobs:
- Pools
  - Define the VM configuration for a job (see the sketch below)
  - Best practice
    - Pools should have more than one compute node, for redundancy on failure
    - Have jobs use pools dynamically: if moving a job, move it to a new pool and delete the old pool once it has completed
    - Resize pools to zero every few months
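A minimal sketch of the above with the azure-batch Python SDK; the account name, key, URL, pool id and VM settings are illustrative placeholders:

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

# Placeholder credentials - replace with your own Batch account details
credentials = SharedKeyCredentials("mybatchaccount", "<account key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com")

# Pool of two Ubuntu nodes (more than one node, for redundancy on failure)
pool = batchmodels.PoolAddParameter(
    id="example-pool",
    vm_size="STANDARD_D2_V3",
    target_dedicated_nodes=2,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="canonical", offer="ubuntuserver",
            sku="18.04-lts", version="latest"),
        node_agent_sku_id="batch.node.ubuntu 18.04"),
)
batch_client.pool.add(pool)

# Periodically resize an idle pool down to zero nodes so it costs nothing
batch_client.pool.resize(
    "example-pool",
    batchmodels.PoolResizeParameter(
        target_dedicated_nodes=0, target_low_priority_nodes=0))
```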
- Applications
- Jobs
  - Set of tasks to be run
  - Best practice
    - 1000 tasks in one job is more efficient than 10 jobs with 100 tasks each
    - A job has to be explicitly terminated to be marked complete; the onAllTasksComplete property or maxWallClockTime can do this automatically (see the sketch below)
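A sketch of creating a job that terminates itself once all tasks finish, assuming the same placeholder account details and the pool from the previous sketch:

```python
import datetime
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("mybatchaccount", "<account key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com")

job = batchmodels.JobAddParameter(
    id="example-job",
    pool_info=batchmodels.PoolInformation(pool_id="example-pool"),
    # Terminate (complete) the job automatically once every task has finished
    on_all_tasks_complete=batchmodels.OnAllTasksComplete.terminate_job,
    # Safety net: end the job if it runs for longer than 24 hours
    constraints=batchmodels.JobConstraints(
        max_wall_clock_time=datetime.timedelta(hours=24)),
)
batch_client.job.add(job)
```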
- Tasks
  - Individual scripts/commands
  - Best practice
    - Task nodes are ephemeral, so any data will be lost unless it is uploaded to storage via OutputFiles
    - Setting a retention time is a good idea for clarity and for cleaning up data
    - Bulk submit collections of up to 100 tasks at a time
    - Build in retries so tasks can withstand transient failures (see the sketch below)
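A sketch of bulk-submitting tasks that upload their results and are cleaned up afterwards; the account details, script name and container SAS URL are placeholders:

```python
import datetime
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("mybatchaccount", "<account key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com")

# SAS URL (with write permission) for the blob container that collects results
output_container_sas_url = "https://mystorage.blob.core.windows.net/output?<sas-token>"

tasks = []
for i in range(100):
    tasks.append(batchmodels.TaskAddParameter(
        id=f"task-{i}",
        command_line=f"/bin/bash -c 'python3 process.py {i} > result_{i}.txt'",
        # Nodes are ephemeral: persist results to blob storage via OutputFiles
        output_files=[batchmodels.OutputFile(
            file_pattern="result_*.txt",
            destination=batchmodels.OutputFileDestination(
                container=batchmodels.OutputFileBlobContainerDestination(
                    container_url=output_container_sas_url)),
            upload_options=batchmodels.OutputFileUploadOptions(
                upload_condition=batchmodels.OutputFileUploadCondition.task_success))],
        # Clean task files off the node after an hour; retry transient failures twice
        constraints=batchmodels.TaskConstraints(
            retention_time=datetime.timedelta(hours=1),
            max_task_retry_count=2)))

# add_collection submits up to 100 tasks per call
batch_client.task.add_collection("example-job", tasks)
```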
- Images
  - Custom images with OS
    - the storage blob containing the VM?
  - Conda comes from the Linux Data Science VM
    - Windows has Python 3.7
    - Linux has Python 3.5, but a newer Python (or an f-string backport) could be installed
All of these are defined at the pool level.
- Define start task (see the sketch below)
  - Each compute node runs this command as it joins the pool
  - Seems a bit wasteful to run this for each job, or does it do it once?
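A sketch of a pool whose start task installs Python dependencies; the packages and account details are placeholders. (The start task runs once per node when it joins the pool, and again if the node is rebooted or reimaged, rather than once per job.)

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("mybatchaccount", "<account key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com")

pool = batchmodels.PoolAddParameter(
    id="pool-with-start-task",
    vm_size="STANDARD_D2_V3",
    target_dedicated_nodes=2,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="canonical", offer="ubuntuserver",
            sku="18.04-lts", version="latest"),
        node_agent_sku_id="batch.node.ubuntu 18.04"),
    # Runs once on each node as it joins the pool, before any task is scheduled
    start_task=batchmodels.StartTask(
        command_line="/bin/bash -c 'apt-get update && apt-get install -y python3-pip && pip3 install pandas'",
        wait_for_success=True,
        user_identity=batchmodels.UserIdentity(
            auto_user=batchmodels.AutoUserSpecification(
                scope=batchmodels.AutoUserScope.pool,
                elevation_level=batchmodels.ElevationLevel.admin))),
)
batch_client.pool.add(pool)
```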
- Create an application package
  - A zip file with all dependencies
    - Seems like a pain to redo if you update requirements
  - Can version these and define which version you want to run (see the sketch below)
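A sketch of attaching a versioned application package to a pool, assuming a package called `myapp`, version `1.0`, has already been uploaded to the Batch account:

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("mybatchaccount", "<account key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com")

pool = batchmodels.PoolAddParameter(
    id="pool-with-app-package",
    vm_size="STANDARD_D2_V3",
    target_dedicated_nodes=1,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="canonical", offer="ubuntuserver",
            sku="18.04-lts", version="latest"),
        node_agent_sku_id="batch.node.ubuntu 18.04"),
    # Pin a specific version of the uploaded package; its contents are unpacked
    # on each node under an AZ_BATCH_APP_PACKAGE_* directory
    application_package_references=[
        batchmodels.ApplicationPackageReference(
            application_id="myapp", version="1.0")],
)
batch_client.pool.add(pool)
```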
- Use a custom image
  - Limit of 2500 dedicated compute nodes or 1000 low-priority nodes in a pool
  - Can create a VHD and then import it for batch service mode
  - Can use Packer directly to build a Linux image for user subscription mode (see the sketch below)
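A sketch of pointing a pool's VM configuration at a custom Shared Image Gallery image; the resource ID path uses the placeholder names from the `az` commands further down:

```python
import azure.batch.models as batchmodels

# Reference a custom Shared Image Gallery image by its full resource ID
custom_image = batchmodels.ImageReference(
    virtual_machine_image_id=(
        "/subscriptions/<subscriptionID>/resourceGroups/<sigResourceGroup>"
        "/providers/Microsoft.Compute/galleries/<sigName>"
        "/images/<imageDefName>/versions/1.0.0"))

vm_config = batchmodels.VirtualMachineConfiguration(
    image_reference=custom_image,
    # node_agent_sku_id must match the OS of the custom image
    node_agent_sku_id="batch.node.ubuntu 18.04")
```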
- Use containers
  - Can prefetch container images (see the sketch below), but ...
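A sketch of prefetching a container image at pool creation and running a task inside it; the registry, image name and marketplace VM image are illustrative:

```python
import azure.batch.models as batchmodels

# Prefetch the container image onto every node when the pool is created,
# so tasks don't pay the image pull cost at run time
container_conf = batchmodels.ContainerConfiguration(
    container_image_names=["myregistry.azurecr.io/analysis:latest"],
    container_registries=[batchmodels.ContainerRegistry(
        registry_server="myregistry.azurecr.io",
        user_name="myregistry",
        password="<registry password>")])

vm_config = batchmodels.VirtualMachineConfiguration(
    image_reference=batchmodels.ImageReference(
        publisher="microsoft-azure-batch",
        offer="ubuntu-server-container",
        sku="16-04-lts",
        version="latest"),
    node_agent_sku_id="batch.node.ubuntu 16.04",
    container_configuration=container_conf)

# Run a task inside the prefetched container
task = batchmodels.TaskAddParameter(
    id="container-task",
    command_line="python3 /app/main.py",
    container_settings=batchmodels.TaskContainerSettings(
        image_name="myregistry.azurecr.io/analysis:latest"))
```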
```sh
az login

# register for the new feature
az feature register --namespace Microsoft.VirtualMachineImages --name VirtualMachineTemplatePreview

# check registration status, seems to take a while (more than 10 min, less than an hour)
az feature show --namespace Microsoft.VirtualMachineImages --name VirtualMachineTemplatePreview | grep state

# once feature is registered, propagate the feature
az provider register -n Microsoft.VirtualMachineImages
az provider register -n Microsoft.Compute
az provider register -n Microsoft.KeyVault
az provider register -n Microsoft.Storage

## env variables to use
sigResourceGroup=vm-testing
location=westus2
additionalregion=eastus
# shared image gallery name
sigName=sp_test_images
# name of image definition
imageDefName=sp_test_image
# image distribution metadata reference name
runOutputName=sp_linux_test
# subscription id
az account show | grep id
subscriptionID=<Subscription ID>

# if resource group doesn't exist, create it
az group create -n $sigResourceGroup -l $location

## create user-assigned identity and set permissions on the resource group
```
- [ ] make notes here
- Python batch examples
  - Ran the first few examples; straightforward
Running a python script in Azure
- Using the Batch Explorer tool, you can find the data science desktop VM image
- Select a VM with a start task for installing requirements
- Use input and output storage blobs for input and output (see the sketch below)
- Create an Azure Data Factory pipeline to run the python script on inputs and upload outputs
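A sketch of a task that downloads its script and input from blob storage and uploads its output; the SAS URLs and file names are placeholders:

```python
import azure.batch.models as batchmodels

# Placeholder SAS URLs - generate these for your own storage account
script_sas_url = "https://mystorage.blob.core.windows.net/scripts/script.py?<sas-token>"
input_sas_url = "https://mystorage.blob.core.windows.net/input/data.csv?<sas-token>"
output_container_sas_url = "https://mystorage.blob.core.windows.net/output?<sas-token>"

task = batchmodels.TaskAddParameter(
    id="run-script",
    command_line="/bin/bash -c 'python3 script.py data.csv > result.txt'",
    # Download the script and input blob onto the node before the command runs
    resource_files=[
        batchmodels.ResourceFile(http_url=script_sas_url, file_path="script.py"),
        batchmodels.ResourceFile(http_url=input_sas_url, file_path="data.csv")],
    # Upload the result to the output container when the task succeeds
    output_files=[batchmodels.OutputFile(
        file_pattern="result.txt",
        destination=batchmodels.OutputFileDestination(
            container=batchmodels.OutputFileBlobContainerDestination(
                container_url=output_container_sas_url)),
        upload_options=batchmodels.OutputFileUploadOptions(
            upload_condition=batchmodels.OutputFileUploadCondition.task_success))])
```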