Azure batch
- Services:
  - Azure Batch account
  - Azure Storage account
  - Data Factory - may be useful for parameterised running
Structure of running jobs:
- Pools
  - Define the VM configuration for a job (see the sketch below)
  - Best practice
    - Pools should have more than one compute node, for redundancy on failure
    - Have jobs use pools dynamically: if moving a job, move it to a new pool and delete the old pool once it has completed
    - Resize pools to zero every few months
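A minimal sketch of the above with the azure-batch Python SDK; the account name, key, URL, pool id and VM settings are illustrative placeholders:

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

# Placeholder credentials - replace with your own Batch account details
credentials = SharedKeyCredentials("mybatchaccount", "<account key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com")

# Pool of two Ubuntu nodes (more than one node, for redundancy on failure)
pool = batchmodels.PoolAddParameter(
    id="example-pool",
    vm_size="STANDARD_D2_V3",
    target_dedicated_nodes=2,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="canonical", offer="ubuntuserver",
            sku="18.04-lts", version="latest"),
        node_agent_sku_id="batch.node.ubuntu 18.04"),
)
batch_client.pool.add(pool)

# Periodically resize an idle pool down to zero nodes so it costs nothing
batch_client.pool.resize(
    "example-pool",
    batchmodels.PoolResizeParameter(
        target_dedicated_nodes=0, target_low_priority_nodes=0))
```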
- Applications
- Jobs
  - Set of tasks to be run
  - Best practice
    - 1000 tasks in one job is more efficient than 10 jobs with 100 tasks each
    - A job has to be explicitly terminated to be marked complete; the onAllTasksComplete property or maxWallClockTime can do this automatically (see the sketch below)
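A sketch of creating a job that terminates itself once all tasks finish, assuming the same placeholder account details and the pool from the previous sketch:

```python
import datetime
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("mybatchaccount", "<account key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com")

job = batchmodels.JobAddParameter(
    id="example-job",
    pool_info=batchmodels.PoolInformation(pool_id="example-pool"),
    # Terminate (complete) the job automatically once every task has finished
    on_all_tasks_complete=batchmodels.OnAllTasksComplete.terminate_job,
    # Safety net: end the job if it runs for longer than 24 hours
    constraints=batchmodels.JobConstraints(
        max_wall_clock_time=datetime.timedelta(hours=24)),
)
batch_client.job.add(job)
```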
- Tasks
  - Individual scripts/commands
  - Best practice
    - Task nodes are ephemeral, so any data will be lost unless it is uploaded to storage via OutputFiles
    - Setting a retention time is a good idea for clarity and for cleaning up data
    - Bulk submit collections of up to 100 tasks at a time
    - Build in retries so tasks can withstand transient failures (see the sketch below)
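A sketch of bulk-submitting tasks that upload their results and are cleaned up afterwards; the account details, script name and container SAS URL are placeholders:

```python
import datetime
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("mybatchaccount", "<account key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com")

# SAS URL (with write permission) for the blob container that collects results
output_container_sas_url = "https://mystorage.blob.core.windows.net/output?<sas-token>"

tasks = []
for i in range(100):
    tasks.append(batchmodels.TaskAddParameter(
        id=f"task-{i}",
        command_line=f"/bin/bash -c 'python3 process.py {i} > result_{i}.txt'",
        # Nodes are ephemeral: persist results to blob storage via OutputFiles
        output_files=[batchmodels.OutputFile(
            file_pattern="result_*.txt",
            destination=batchmodels.OutputFileDestination(
                container=batchmodels.OutputFileBlobContainerDestination(
                    container_url=output_container_sas_url)),
            upload_options=batchmodels.OutputFileUploadOptions(
                upload_condition=batchmodels.OutputFileUploadCondition.task_success))],
        # Clean task files off the node after an hour; retry transient failures twice
        constraints=batchmodels.TaskConstraints(
            retention_time=datetime.timedelta(hours=1),
            max_task_retry_count=2)))

# add_collection submits up to 100 tasks per call
batch_client.task.add_collection("example-job", tasks)
```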
- Images
  - Custom images with OS
    - the storage blob containing the VM?
  - Conda comes from the Linux Data Science VM
    - Windows has Python 3.7
    - Linux has Python 3.5, but a newer Python (or an f-string backport) could be installed
All of these are defined at the pool level.
- Define start task (see the sketch below)
  - Each compute node runs this command as it joins the pool
  - Seems a bit wasteful to run this for each job, or does it do it once?
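A sketch of a pool whose start task installs Python dependencies; the packages and account details are placeholders. (The start task runs once per node when it joins the pool, and again if the node is rebooted or reimaged, rather than once per job.)

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("mybatchaccount", "<account key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com")

pool = batchmodels.PoolAddParameter(
    id="pool-with-start-task",
    vm_size="STANDARD_D2_V3",
    target_dedicated_nodes=2,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="canonical", offer="ubuntuserver",
            sku="18.04-lts", version="latest"),
        node_agent_sku_id="batch.node.ubuntu 18.04"),
    # Runs once on each node as it joins the pool, before any task is scheduled
    start_task=batchmodels.StartTask(
        command_line="/bin/bash -c 'apt-get update && apt-get install -y python3-pip && pip3 install pandas'",
        wait_for_success=True,
        user_identity=batchmodels.UserIdentity(
            auto_user=batchmodels.AutoUserSpecification(
                scope=batchmodels.AutoUserScope.pool,
                elevation_level=batchmodels.ElevationLevel.admin))),
)
batch_client.pool.add(pool)
```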
- Create an application package
  - A zip file with all dependencies
    - Seems like a pain to redo if you update requirements
  - Can version these and define which version you want to run (see the sketch below)
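A sketch of attaching a versioned application package to a pool, assuming a package called `myapp`, version `1.0`, has already been uploaded to the Batch account:

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("mybatchaccount", "<account key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com")

pool = batchmodels.PoolAddParameter(
    id="pool-with-app-package",
    vm_size="STANDARD_D2_V3",
    target_dedicated_nodes=1,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="canonical", offer="ubuntuserver",
            sku="18.04-lts", version="latest"),
        node_agent_sku_id="batch.node.ubuntu 18.04"),
    # Pin a specific version of the uploaded package; its contents are unpacked
    # on each node under an AZ_BATCH_APP_PACKAGE_* directory
    application_package_references=[
        batchmodels.ApplicationPackageReference(
            application_id="myapp", version="1.0")],
)
batch_client.pool.add(pool)
```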
- Use a custom image
  - Limit of 2500 dedicated compute nodes or 1000 low-priority nodes in a pool
  - Can create a VHD and then import it for batch service mode
  - Can use Packer directly to build a Linux image for user subscription mode (see the sketch below)
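A sketch of pointing a pool's VM configuration at a custom Shared Image Gallery image; the resource ID path uses the placeholder names from the `az` commands further down:

```python
import azure.batch.models as batchmodels

# Reference a custom Shared Image Gallery image by its full resource ID
custom_image = batchmodels.ImageReference(
    virtual_machine_image_id=(
        "/subscriptions/<subscriptionID>/resourceGroups/<sigResourceGroup>"
        "/providers/Microsoft.Compute/galleries/<sigName>"
        "/images/<imageDefName>/versions/1.0.0"))

vm_config = batchmodels.VirtualMachineConfiguration(
    image_reference=custom_image,
    # node_agent_sku_id must match the OS of the custom image
    node_agent_sku_id="batch.node.ubuntu 18.04")
```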
- Use containers
  - Can prefetch container images (see the sketch below), but ...
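A sketch of prefetching a container image at pool creation and running a task inside it; the registry, image name and marketplace VM image are illustrative:

```python
import azure.batch.models as batchmodels

# Prefetch the container image onto every node when the pool is created,
# so tasks don't pay the image pull cost at run time
container_conf = batchmodels.ContainerConfiguration(
    container_image_names=["myregistry.azurecr.io/analysis:latest"],
    container_registries=[batchmodels.ContainerRegistry(
        registry_server="myregistry.azurecr.io",
        user_name="myregistry",
        password="<registry password>")])

vm_config = batchmodels.VirtualMachineConfiguration(
    image_reference=batchmodels.ImageReference(
        publisher="microsoft-azure-batch",
        offer="ubuntu-server-container",
        sku="16-04-lts",
        version="latest"),
    node_agent_sku_id="batch.node.ubuntu 16.04",
    container_configuration=container_conf)

# Run a task inside the prefetched container
task = batchmodels.TaskAddParameter(
    id="container-task",
    command_line="python3 /app/main.py",
    container_settings=batchmodels.TaskContainerSettings(
        image_name="myregistry.azurecr.io/analysis:latest"))
```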
```sh
az login

# register for the new feature
az feature register --namespace Microsoft.VirtualMachineImages --name VirtualMachineTemplatePreview

# check registration status, seems to take a while (more than 10 min, less than an hour)
az feature show --namespace Microsoft.VirtualMachineImages --name VirtualMachineTemplatePreview | grep state

# once feature is registered, propagate the feature
az provider register -n Microsoft.VirtualMachineImages
az provider register -n Microsoft.Compute
az provider register -n Microsoft.KeyVault
az provider register -n Microsoft.Storage

## env variables to use
sigResourceGroup=vm-testing
location=westus2
additionalregion=eastus
# shared image gallery name
sigName=sp_test_images
# name of image definition
imageDefName=sp_test_image
# image distribution metadata reference name
runOutputName=sp_linux_test
# subscription id
az account show | grep id
subscriptionID=<Subscription ID>

# if resource group doesn't exist, create it
az group create -n $sigResourceGroup -l $location

## create user-assigned identity and set permissions on the resource group
```
- [ ] make notes here
- Python batch examples
  - Ran the first few examples; straightforward
Running a python script in Azure
- Using the Batch Explorer tool, you can find the data science desktop VM image
- Select a VM with a start task for installing requirements
- Use input and output storage blobs for input and output (see the sketch below)
- Create an Azure Data Factory pipeline to run the python script on inputs and upload outputs
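A sketch of a task that downloads its script and input from blob storage and uploads its output; the SAS URLs and file names are placeholders:

```python
import azure.batch.models as batchmodels

# Placeholder SAS URLs - generate these for your own storage account
script_sas_url = "https://mystorage.blob.core.windows.net/scripts/script.py?<sas-token>"
input_sas_url = "https://mystorage.blob.core.windows.net/input/data.csv?<sas-token>"
output_container_sas_url = "https://mystorage.blob.core.windows.net/output?<sas-token>"

task = batchmodels.TaskAddParameter(
    id="run-script",
    command_line="/bin/bash -c 'python3 script.py data.csv > result.txt'",
    # Download the script and input blob onto the node before the command runs
    resource_files=[
        batchmodels.ResourceFile(http_url=script_sas_url, file_path="script.py"),
        batchmodels.ResourceFile(http_url=input_sas_url, file_path="data.csv")],
    # Upload the result to the output container when the task succeeds
    output_files=[batchmodels.OutputFile(
        file_pattern="result.txt",
        destination=batchmodels.OutputFileDestination(
            container=batchmodels.OutputFileBlobContainerDestination(
                container_url=output_container_sas_url)),
        upload_options=batchmodels.OutputFileUploadOptions(
            upload_condition=batchmodels.OutputFileUploadCondition.task_success))])
```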