Azure batch

  • Services:
    • Azure Batch account
    • Azure Storage account
    • Data Factory - may be useful for parameterised running

Useful tools

Notes

Structure of running jobs:

  • Pools

    • Define VM configuration for a job
    • Best practice
      • Pools should have more than one compute node for redundancy on failure
      • Have jobs use pools dynamically; if you need to move jobs, point them at a new pool and delete the old pool once they have completed
      • Resize pools to zero every few months
  • Applications

  • Jobs

    • Set of tasks to be run
    • Best practice
      • 1000 tasks in one job is more efficient than 10 jobs with 100 tasks each
      • A job has to be explicitly terminated to be marked complete; the onAllTasksComplete property or maxWallClockTime can do this automatically
  • Tasks

    • Individual scripts/commands
    • Best practice
      • Task nodes are ephemeral, so any data will be lost unless it is uploaded to storage via OutputFiles
      • Setting a retention time is a good idea for clarity and for cleaning up data
      • Bulk submit collections of up to 100 tasks at a time (see the sketch after this list)
      • Build in retries so tasks can withstand transient failures
  • Images

    • Custom images with OS
    • The storage blob containing the VM?
    • Conda from: Linux Data Science VM
      • Windows has Python 3.7
      • Linux has Python 3.5 (no f-strings), but f-string support could be installed
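Putting the pool/job/task structure together, here is a minimal sketch using the azure-batch Python SDK (2020-era, v9-style models). The account name, key, URL, pool/job ids, VM image and task command line are placeholder assumptions; the sketch shows a two-node pool, a job that terminates itself via onAllTasksComplete, tasks with a retry count and retention time, and bulk submission in chunks of 100.

```python
from datetime import timedelta

from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

# Placeholder account details - swap in your own Batch account name, key and URL
credentials = SharedKeyCredentials("mybatchaccount", "<batch account key>")
client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com"
)

# Pool: more than one node so a single node failure doesn't stall the job
pool = batchmodels.PoolAddParameter(
    id="example-pool",
    vm_size="STANDARD_D2_V3",
    target_dedicated_nodes=2,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="canonical", offer="ubuntuserver", sku="18.04-lts", version="latest"
        ),
        node_agent_sku_id="batch.node.ubuntu 18.04",
    ),
)
client.pool.add(pool)

# Job: terminate automatically once every task has finished
job = batchmodels.JobAddParameter(
    id="example-job",
    pool_info=batchmodels.PoolInformation(pool_id="example-pool"),
    on_all_tasks_complete=batchmodels.OnAllTasksComplete.terminate_job,
)
client.job.add(job)

# Tasks: retries for transient failures, a retention time for cleanup,
# and bulk submission in chunks of up to 100 tasks per call
tasks = [
    batchmodels.TaskAddParameter(
        id=f"task-{i}",
        command_line=f"/bin/bash -c 'echo processing chunk {i}'",
        constraints=batchmodels.TaskConstraints(
            max_task_retry_count=2, retention_time=timedelta(days=7)
        ),
    )
    for i in range(250)
]
for start in range(0, len(tasks), 100):
    client.task.add_collection("example-job", tasks[start:start + 100])
```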

Options for running your own packages

All of these are defined at the pool level; a Python sketch of the pool-level settings follows the list below.

  • Define a start task
    • Each compute node runs this command as it joins the pool
    • Seems a bit wasteful if it ran for each job, but it runs once per node as the node joins the pool (and again after a reboot/reimage)
  • Create an application package
    • Zip file with all dependencies
    • Seems like a pain to redo if you update requirements
    • Can version these and define which version you want to run
  • Use a custom image
    • Limit of 2500 dedicated compute nodes or 1000 low-priority nodes in a pool
    • Can create a VHD and then import it for Batch service mode
    • Can use Packer directly to build a Linux image for user subscription mode (the az snippet below registers the Image Builder feature for this)
  • Use containers
    • Can prefetch container images at pool creation
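As a rough illustration of these options, here is a sketch of the pool-level settings using the same azure-batch Python SDK models (v9-style, 2020-era). The package names, registry, image references and credentials are placeholders, and newer SDK versions may additionally require an explicit container type on ContainerConfiguration.

```python
import azure.batch.models as batchmodels

# Option 1: start task - runs on each node as it joins the pool (and after reboots)
start_task = batchmodels.StartTask(
    command_line="/bin/bash -c 'apt-get update && apt-get install -y python3-pip && pip3 install pandas'",
    user_identity=batchmodels.UserIdentity(
        auto_user=batchmodels.AutoUserSpecification(
            elevation_level=batchmodels.ElevationLevel.admin,
            scope=batchmodels.AutoUserScope.pool,
        )
    ),
    wait_for_success=True,  # don't schedule tasks on a node until setup succeeds
)

# Option 2: application package - a versioned zip uploaded to the Batch account
app_package = batchmodels.ApplicationPackageReference(application_id="myapp", version="1.0")

# Option 4: container configuration with images prefetched onto every node
container_conf = batchmodels.ContainerConfiguration(
    container_image_names=["myregistry.azurecr.io/myimage:latest"],
    container_registries=[
        batchmodels.ContainerRegistry(
            registry_server="myregistry.azurecr.io",
            user_name="myregistry",
            password="<registry password>",
        )
    ],
)

pool = batchmodels.PoolAddParameter(
    id="deps-pool",
    vm_size="STANDARD_D2_V3",
    target_dedicated_nodes=2,
    start_task=start_task,
    application_package_references=[app_package],
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        # marketplace container-ready image; option 3 (custom image) would
        # swap in your own image reference here instead
        image_reference=batchmodels.ImageReference(
            publisher="microsoft-azure-batch", offer="ubuntu-server-container",
            sku="16-04-lts", version="latest"
        ),
        node_agent_sku_id="batch.node.ubuntu 16.04",
        container_configuration=container_conf,
    ),
)
```

The az commands below register the Azure Image Builder (VM image templates) preview feature, which is what the custom image / Packer route needs.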
az login
# register for the new feature
az feature register --namespace Microsoft.VirtualMachineImages --name VirtualMachineTemplatePreview

# check registration status, seems to take a while (more than 10 min, less than an hour)
az feature show --namespace Microsoft.VirtualMachineImages --name VirtualMachineTemplatePreview | grep state

# once feature is registered, propagate the feature
az provider register -n Microsoft.VirtualMachineImages
az provider register -n Microsoft.Compute
az provider register -n Microsoft.KeyVault
az provider register -n Microsoft.Storage

## env variables to use
sigResourceGroup=vm-testing
location=westus2
additionalregion=eastus
# shared image gallery name
sigName=sp_test_images
# name of image definition
imageDefName=sp_test_image
# image distribution metadata reference name
runOutputName=sp_linux_test
# subscription id
subscriptionID=$(az account show --query id --output tsv)

# if resource group doesn't exist, create it
az group create -n $sigResourceGroup -l $location

## create user-assigned identity and set permissions on the resource group

- [ ] make notes here
 

Orchestrating via the Python API

Running Python scripts in Batch

Running a Python script in Azure

  • Using the Batch Explorer tool, you can find the data science desktop image (a sketch of submitting a script as a Batch task follows below)
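A minimal sketch of submitting a Python script as a Batch task, assuming the 2020-era azure-storage-blob (BlockBlobService) and azure-batch SDKs; the storage account, key, container names, script name and output-container SAS URL are placeholders.

```python
from datetime import datetime, timedelta

import azure.batch.models as batchmodels
from azure.storage.blob import BlockBlobService, BlobPermissions

# Upload the script to blob storage and build a read-only SAS URL for it
blob_service = BlockBlobService(account_name="mystorageaccount", account_key="<storage key>")
blob_service.create_container("scripts", fail_on_exist=False)
blob_service.create_blob_from_path("scripts", "run_analysis.py", "run_analysis.py")
sas = blob_service.generate_blob_shared_access_signature(
    "scripts", "run_analysis.py",
    permission=BlobPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=12),
)
script_url = blob_service.make_blob_url("scripts", "run_analysis.py", sas_token=sas)

# SAS URL (with write permission) for the container that will receive outputs
output_container_url = "<container SAS URL with write access>"

# Task: pull the script onto the node, run it, upload results on success
task = batchmodels.TaskAddParameter(
    id="run-analysis",
    command_line="/bin/bash -c 'python3 run_analysis.py'",
    resource_files=[batchmodels.ResourceFile(http_url=script_url, file_path="run_analysis.py")],
    output_files=[
        batchmodels.OutputFile(
            file_pattern="results/*.csv",
            destination=batchmodels.OutputFileDestination(
                container=batchmodels.OutputFileBlobContainerDestination(
                    container_url=output_container_url, path="results"
                )
            ),
            upload_options=batchmodels.OutputFileUploadOptions(
                upload_condition=batchmodels.OutputFileUploadCondition.task_success
            ),
        )
    ],
)
```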

Data factories

  • Select a VM with a start task that installs the requirements
  • Use input and output storage blob containers for the input and output data
  • Create an Azure Data Factory pipeline to run the Python script on the inputs and upload the outputs (a sketch of such a script is below)
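The script the pipeline runs typically just pulls its inputs from one blob container and pushes its results to another. A minimal sketch, again with the 2020-era BlockBlobService API; the storage account, environment variable and container names are placeholders.

```python
import os

from azure.storage.blob import BlockBlobService

# Placeholder storage account and key (read from an environment variable here)
blob_service = BlockBlobService(
    account_name="mystorageaccount", account_key=os.environ["STORAGE_KEY"]
)

# Download every blob from the input container
os.makedirs("inputs", exist_ok=True)
for blob in blob_service.list_blobs("input"):
    blob_service.get_blob_to_path("input", blob.name, os.path.join("inputs", blob.name))

# ... run the actual analysis on the files in ./inputs, writing to ./outputs ...
os.makedirs("outputs", exist_ok=True)
with open(os.path.join("outputs", "summary.txt"), "w") as f:
    f.write("placeholder result\n")

# Upload the results to the output container
for name in os.listdir("outputs"):
    blob_service.create_blob_from_path("output", name, os.path.join("outputs", name))
```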
