This repository was archived by the owner on Jan 7, 2025. It is now read-only.

digits Slurm #1435

Open
wants to merge 106 commits into
base: master
106 commits
50e8a8a
init
Nov 28, 2016
9a49297
basic slurm tasks and detection
Nov 28, 2016
b6eb54e
adding new gitignore
Nov 28, 2016
0be0458
saving before branch
Nov 28, 2016
5b1c1c5
override gpu check
Nov 29, 2016
fb1df8d
minor change to gpu selector
Nov 29, 2016
7f57546
line ending issues
Nov 29, 2016
7ea2a02
Work around for inference
Nov 30, 2016
0d8185a
Fixed - tasks will use the gpus that are set by CUDA_VISIBLE_DEVICES
Dec 1, 2016
3acd953
fixed gpu selector to only show when not in slurm mode
Dec 1, 2016
95fa40a
removed junk
Dec 1, 2016
c47fde8
fixed line endings
Dec 1, 2016
830da0b
fixing build lint
Dec 1, 2016
2d8af6d
formatting
Dec 1, 2016
e0c0353
more build fixes
Dec 1, 2016
f882924
fixing lint issues
Dec 1, 2016
b149331
check build test
Dec 1, 2016
27b3b8c
changed test file back
Dec 2, 2016
bfc3fe5
merged with nvidia master
Dec 2, 2016
d4f4040
added check box for slurm
Dec 2, 2016
ce59562
slurm flags
Dec 5, 2016
dc90667
check
Dec 5, 2016
f222935
removed task debug code
Dec 5, 2016
0339605
fix lint
Dec 5, 2016
99d1d47
lint
Dec 5, 2016
7afc30f
lint
Dec 5, 2016
684a94e
lint
Dec 5, 2016
6d41dd3
build.sh
Dec 5, 2016
55b414b
added gui changed for slurm db tasks
Dec 6, 2016
b53b3c0
Merge remote-tracking branch 'upstream/master' into dev
Dec 6, 2016
7f82268
removed old jobs
Dec 6, 2016
299f487
test
Dec 6, 2016
facde02
Minor lint fix
Dec 6, 2016
bb8c1ee
job numbers in gui
Dec 6, 2016
26a99a9
build fix
Dec 6, 2016
dec9bde
fix build
Dec 6, 2016
fc82aff
fixed setup test
Dec 6, 2016
36561b1
test
Dec 6, 2016
2b9f63c
set db tasks back to int mode
Dec 7, 2016
ee7eff4
Fix s_mem not being popped
Dec 7, 2016
7d2a599
enabled slurm db tasks
Dec 7, 2016
8d6af9b
exceptions for jobs
Dec 7, 2016
2b0c3f1
testing
Dec 7, 2016
29ff26f
jenkins build
Dec 7, 2016
b470ee8
Debug print statements for jenkins
Dec 7, 2016
e102d79
more debug
Dec 7, 2016
b55b284
Debug task jenkins
Dec 7, 2016
966abdd
Testing digits.dataset.tasks.create_generic_db.CreateGenericDbTask ex…
Dec 8, 2016
34d72be
tidy of task - issue of db tasks not working on slurm is still on goi…
Dec 8, 2016
eb8fcf4
Inference excluded to make DB errors clear
Dec 8, 2016
fbc1760
set slurm tmp dir - this fixes the slurm chdir errors as the envar TM…
Dec 9, 2016
191c710
S1
Dec 9, 2016
c12b939
Fixed caffe slurm timeout errors
Dec 11, 2016
9a9bec4
re-enabled slurm
Dec 11, 2016
0252800
changed setting for slurm
Dec 12, 2016
4331613
Changed slurm settings to lower values and fixed form submit error
Dec 12, 2016
ec22501
end of day
Dec 13, 2016
a1e3f1f
torch works
Dec 15, 2016
53a2137
Changed timeouts
Dec 15, 2016
a108a01
time limit form fix
Dec 15, 2016
8ecadf7
Fixed up inference
Dec 16, 2016
2292144
.gitignore is now working
Dec 16, 2016
7e16d53
.gitignore is now working
Dec 16, 2016
a7c96e1
.gitignore is now working
Dec 16, 2016
2a1facb
removed redundant code from generic/views.py
Dec 18, 2016
d75d246
Revert "Changed timeouts"
Dec 19, 2016
ec5932c
lint fix
Dec 19, 2016
480e449
Refactored cluster management into classes
Dec 21, 2016
bbc8004
Refactored cluster management into classes
Dec 21, 2016
7b2928c
fixing up system types
Dec 21, 2016
0b84511
Selection of system types working - issues with gpu allocation again
Dec 21, 2016
122a3cc
make jenkins set slurm
Dec 21, 2016
cdbb196
fix for node local tmpdir
Dec 22, 2016
e63aa64
inference working
Dec 22, 2016
d1c9c9b
testing jenkins issues
Dec 22, 2016
11fe393
more debug
Dec 22, 2016
1a835b2
cast gpu for caffe to int
Dec 22, 2016
46864a1
fixed gpu --gpu=all
Dec 22, 2016
c69ff3b
trying to fix cuda errors
Dec 22, 2016
8bd5d0e
fixed gpu issues
Dec 23, 2016
3cfe233
inf testing
Jan 2, 2017
da6b9f2
inf change gpu to id 0
Jan 2, 2017
a6d8c58
fixed gpu selection for generic tasks
Jan 3, 2017
f9358c2
gpu fix
Jan 4, 2017
b1eaf3c
Got changes from login branch
Jan 4, 2017
0adf134
testing
Jan 4, 2017
a062372
Fixed dataset pages
Jan 10, 2017
27a0a66
added check for cudaDeviceGetPCIBusId()
Jan 10, 2017
6efb6d2
gpu debugging
Jan 11, 2017
6e76413
fixed digits job number
Jan 12, 2017
d5e8210
Updated tasks default and cluster management layout
Jan 16, 2017
5b950d6
Fixed cluster manager
Jan 16, 2017
b212d64
refactor job cancel into cluster manager class
Jan 16, 2017
989b9b5
removed prints
Jan 16, 2017
b7bc101
Merge remote-tracking branch 'upstream/master' into dev
Jan 17, 2017
70cc3f4
updated
Jan 17, 2017
3b2001f
formatting
Jan 17, 2017
16baab5
formatting
Jan 17, 2017
80207b2
more formatting
Jan 17, 2017
5d6bfe4
Even more formatting
Jan 17, 2017
2f30e4e
merge
Jan 18, 2017
82e76a3
fixed error code
Jan 23, 2017
8296a51
Merge branch 'master' of https://github.com/NVIDIA/DIGITS into dev
Apr 26, 2017
edc101e
removed log file
Apr 26, 2017
cd882e8
fix travis errors
May 3, 2017
cd6864d
s_mem in generic jobs
May 3, 2017
44 changes: 44 additions & 0 deletions .gitignore
100644 → 100755
@@ -16,3 +16,47 @@ TAGS
/build/
/dist/
*.egg-info/

# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and Webstorm
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839

# User-specific stuff:
.idea/
.idea/workspace.xml
.idea/tasks.xml

# Sensitive or high-churn files:
.idea/dataSources/
.idea/dataSources.ids
.idea/dataSources.xml
.idea/dataSources.local.xml
.idea/sqlDataSources.xml
.idea/dynamic.xml
.idea/uiDesigner.xml

# Gradle:
.idea/gradle.xml
.idea/libraries

# Mongo Explorer plugin:
.idea/mongoSettings.xml

## File-based project format:
*.iws

## Plugin-specific files:

# IntelliJ
/out/

# mpeltonen/sbt-idea plugin
.idea_modules/

# JIRA plugin
atlassian-ide-plugin.xml

# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties
6 changes: 6 additions & 0 deletions .nfs000000000f7c03bc0003551c
@@ -0,0 +1,6 @@
#!/bin/bash
# Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.

set -e

python2 -m digits "$@"
3 changes: 2 additions & 1 deletion .travis.yml
@@ -121,7 +121,8 @@ before_install:
- deactivate
- virtualenv --system-site-packages ~/venv
- source ~/venv/bin/activate

- "sudo apt-get install libboost-filesystem1.55-dev
libboost-python1.55-dev libboost-system1.55-dev libboost-thread1.55-dev"
install:
- mkdir -p ~/.config/matplotlib
- echo "backend:agg" > ~/.config/matplotlib/matplotlibrc
2 changes: 2 additions & 0 deletions digits/config/__init__.py
100644 → 100755
@@ -5,13 +5,15 @@
option_list = {}

from . import ( # noqa
system_type,
caffe,
gpu_list,
jobs_dir,
log_file,
torch,
server_name,
store_option,

)


Empty file modified digits/config/jobs_dir.py
100644 → 100755
Empty file.
10 changes: 10 additions & 0 deletions digits/config/system_type.py
@@ -0,0 +1,10 @@
from __future__ import absolute_import
from . import option_list
from digits.extensions.cluster_management.cluster_factory import cluster_factory

if cluster_factory.use_cluster:
    system_type = cluster_factory.selected_system
else:
    system_type = 'interactive'

option_list['system_type'] = system_type
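
For readers following the new option, here is a minimal sketch of how a factory object exposing `use_cluster` and `selected_system` could feed the if/else above. The real `cluster_factory` module is not shown in this section, so the `ClusterFactory` class, the `resolve_system_type` helper, and the "is `sbatch` on the PATH?" detection rule are all assumptions made for illustration (Python 3, standard library only).

```python
import shutil


class ClusterFactory(object):
    """Hypothetical stand-in for the PR's cluster_factory object."""

    def __init__(self, which=shutil.which):
        # Assumption: the host counts as a Slurm cluster when the
        # `sbatch` executable can be found on the PATH.
        self.use_cluster = which('sbatch') is not None
        self.selected_system = 'slurm' if self.use_cluster else None


def resolve_system_type(factory):
    """Mirror the if/else in system_type.py for a given factory."""
    if factory.use_cluster:
        return factory.selected_system
    return 'interactive'


if __name__ == '__main__':
    print(resolve_system_type(ClusterFactory()))
```

Passing `which` as a parameter keeps the detection rule replaceable, which makes the fallback to `'interactive'` easy to exercise on a machine without Slurm.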
11 changes: 11 additions & 0 deletions digits/dataset/forms.py
100644 → 100755
@@ -3,6 +3,7 @@

from flask.ext.wtf import Form
from wtforms.validators import DataRequired
from wtforms import validators

from digits import utils

@@ -20,3 +21,13 @@ class DatasetForm(Form):
group_name = utils.forms.StringField('Group Name',
tooltip="An optional group name for organization on the main page."
)

# slurm options
slurm_selector = utils.forms.BooleanField('Use slurm?')
slurm_time_limit = utils.forms.IntegerField('Task time limit', tooltip='in minutes', default=0, )
slurm_cpu_count = utils.forms.IntegerField('Use this many cores', validators=[
validators.NumberRange(min=1, max=128)
], default=8, )
slurm_mem = utils.forms.IntegerField('Use this much memory (GB)', validators=[
validators.NumberRange(min=1, max=128)
], default=10, )
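
The three form fields above carry a time limit in minutes, a core count, and a memory size in GB. As a rough sketch of how such values might be turned into a Slurm submission, the helper below builds an `sbatch` argument list using the standard `--time`, `--cpus-per-task`, and `--mem` flags. The function name `build_sbatch_args` is hypothetical, and treating a time limit of 0 as "omit the flag" is an assumption; this section does not show how the PR actually maps the form values.

```python
def build_sbatch_args(time_limit, cpu_count, mem_gb):
    """Translate the three form values into sbatch command-line flags.

    time_limit is in minutes; 0 is treated here as "no explicit limit"
    (an assumption, not confirmed by the PR).
    """
    # Same bounds as the NumberRange validators on the form fields.
    if not 1 <= cpu_count <= 128:
        raise ValueError('cpu_count must be between 1 and 128')
    if not 1 <= mem_gb <= 128:
        raise ValueError('mem_gb must be between 1 and 128')
    if time_limit < 0:
        raise ValueError('time_limit must not be negative')

    args = [
        'sbatch',
        '--cpus-per-task=%d' % cpu_count,
        '--mem=%dG' % mem_gb,
    ]
    if time_limit > 0:
        # sbatch accepts a bare number of minutes for --time.
        args.append('--time=%d' % time_limit)
    return args


if __name__ == '__main__':
    # Form defaults: no time limit, 8 cores, 10 GB.
    print(' '.join(build_sbatch_args(0, 8, 10)))
```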
12 changes: 12 additions & 0 deletions digits/dataset/generic/job.py
100644 → 100755
@@ -35,6 +35,15 @@ def __init__(self,
self.extension_id = extension_id
self.extension_userdata = extension_userdata

# dict.pop() with a default never raises, so no try/except is needed
self.time_limit = kwargs.pop('time_limit', None)
self.s_cpu_count = kwargs.pop('s_cpu_count', None)
self.s_mem = kwargs.pop('s_mem', None)

super(GenericDatasetJob, self).__init__(**kwargs)
self.pickver_job_dataset_extension = PICKLE_VERSION

@@ -45,6 +54,9 @@ def __init__(self,
job=self,
backend=self.backend,
stage=stage,
time_limit=self.time_limit,
s_cpu_count=self.s_cpu_count,
s_mem=self.s_mem,
)
)
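
The constructor change above removes the Slurm-specific keys from `kwargs` before delegating to the base class. A minimal, self-contained sketch of that pattern follows; `Job` and `SlurmAwareJob` are hypothetical stand-ins rather than the DIGITS classes, and the base class is assumed to reject keyword arguments it does not know.

```python
class Job(object):
    """Stand-in base class that accepts only `name`."""

    def __init__(self, name):
        self.name = name


class SlurmAwareJob(Job):
    """Pops the scheduler options so the base class never sees them."""

    def __init__(self, **kwargs):
        # pop() with a default returns None when the key is absent,
        # so older callers that omit these options keep working.
        self.time_limit = kwargs.pop('time_limit', None)
        self.s_cpu_count = kwargs.pop('s_cpu_count', None)
        self.s_mem = kwargs.pop('s_mem', None)
        super(SlurmAwareJob, self).__init__(**kwargs)


if __name__ == '__main__':
    job = SlurmAwareJob(name='mnist', s_cpu_count=8, s_mem=10)
    print(job.name, job.time_limit, job.s_cpu_count, job.s_mem)
```

Popping before the `super` call matters: if a key such as `s_mem` were left in `kwargs`, the base constructor in this sketch would raise a `TypeError` for the unexpected keyword.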

15 changes: 12 additions & 3 deletions digits/dataset/generic/views.py
100644 → 100755
@@ -3,6 +3,8 @@

import os
# Find the best implementation available
from digits.config import config_value

try:
from cStringIO import StringIO
except ImportError:
@@ -54,7 +56,8 @@ def new(extension_id):
extension_title=extension.get_title(),
extension_id=extension_id,
extension_html=rendered_extension,
-form=form
+form=form,
+system_type=config_value('system_type')
)


@@ -96,14 +99,17 @@ def create(extension_id):
extension_id=extension_id,
extension_html=rendered_extension,
form=form,
-errors=errors), 400
+errors=errors,
+system_type=config_value('system_type')
+), 400

# create instance of extension class
extension = extension_class(**extension_form.data)

job = None
try:
# create job

job = GenericDatasetJob(
username=utils.auth.get_username(),
name=form.dataset_name.data,
@@ -116,6 +122,9 @@ def create(extension_id):
force_same_shape=form.dsopts_force_same_shape.data,
extension_id=extension_id,
extension_userdata=extension.get_user_data(),
time_limit=form.slurm_time_limit.data,
s_cpu_count=form.slurm_cpu_count.data,
s_mem=form.slurm_mem.data,
)

# Save form data with the job so we can easily clone it later.
@@ -199,7 +208,7 @@ def explore():
return flask.render_template(
'datasets/images/explore.html',
page=page, size=size, job=job, imgs=imgs, labels=None,
-pages=pages, label=None, total_entries=total_entries, db=db)
+pages=pages, label=None, total_entries=total_entries, db=db, system_type=config_value('system_type'))


def show(job, related_jobs=None):
33 changes: 29 additions & 4 deletions digits/dataset/images/classification/views.py
100644 → 100755
@@ -5,6 +5,8 @@
import shutil

# Find the best implementation available
from digits.config import config_value

try:
from cStringIO import StringIO
except ImportError:
@@ -23,7 +25,6 @@
from digits.utils.routing import request_wants_json, job_from_request
from digits.webapp import scheduler


blueprint = flask.Blueprint(__name__, __name__)


@@ -115,6 +116,9 @@ def from_folders(job, form):
compression=compression,
mean_file=utils.constants.MEAN_FILE_CAFFE,
labels_file=job.labels_file,
time_limit=form.slurm_time_limit.data,
s_cpu_count=form.slurm_cpu_count.data,
s_mem=form.slurm_mem.data,
)
)

@@ -131,6 +135,9 @@ def from_folders(job, form):
encoding=encoding,
compression=compression,
labels_file=job.labels_file,
time_limit=form.slurm_time_limit.data,
s_cpu_count=form.slurm_cpu_count.data,
s_mem=form.slurm_mem.data,
)
)

@@ -147,6 +154,9 @@ def from_folders(job, form):
encoding=encoding,
compression=compression,
labels_file=job.labels_file,
time_limit=form.slurm_time_limit.data,
s_cpu_count=form.slurm_cpu_count.data,
s_mem=form.slurm_mem.data,
)
)

@@ -198,6 +208,10 @@ def from_files(job, form):
mean_file=utils.constants.MEAN_FILE_CAFFE,
labels_file=job.labels_file,
shuffle=shuffle,
time_limit=form.slurm_time_limit.data,
s_cpu_count=form.slurm_cpu_count.data,
s_mem=form.slurm_mem.data,

)
)

@@ -229,6 +243,9 @@ def from_files(job, form):
compression=compression,
labels_file=job.labels_file,
shuffle=shuffle,
time_limit=form.slurm_time_limit.data,
s_cpu_count=form.slurm_cpu_count.data,
s_mem=form.slurm_mem.data,
)
)

@@ -260,6 +277,9 @@ def from_files(job, form):
compression=compression,
labels_file=job.labels_file,
shuffle=shuffle,
time_limit=form.slurm_time_limit.data,
s_cpu_count=form.slurm_cpu_count.data,
s_mem=form.slurm_mem.data,
)
)

@@ -275,7 +295,8 @@ def new():
# Is there a request to clone a job with ?clone=<job_id>
fill_form_if_cloned(form)

-return flask.render_template('datasets/images/classification/new.html', form=form)
+return flask.render_template('datasets/images/classification/new.html', form=form,
+system_type=config_value('system_type'))


@blueprint.route('.json', methods=['POST'])
@@ -296,7 +317,8 @@ def create():
if request_wants_json():
return flask.jsonify({'errors': form.errors}), 400
else:
-return flask.render_template('datasets/images/classification/new.html', form=form), 400
+return flask.render_template('datasets/images/classification/new.html', form=form,
+system_type=config_value('system_type')), 400

job = None
try:
@@ -309,7 +331,10 @@ def create():
int(form.resize_width.data),
int(form.resize_channels.data),
),
-resize_mode=form.resize_mode.data
+resize_mode=form.resize_mode.data,
+time_limit=form.slurm_time_limit.data,
+s_cpu_count=form.slurm_cpu_count.data,
+s_mem=form.slurm_mem.data,
)

if form.method.data == 'folder':
1 change: 1 addition & 0 deletions digits/dataset/images/generic/test_lmdb_creator.py
@@ -209,3 +209,4 @@ def _save_mean(mean, filename):
)

print 'Done after %s seconds' % (time.time() - start_time,)

20 changes: 17 additions & 3 deletions digits/dataset/images/generic/views.py
100644 → 100755
@@ -10,6 +10,7 @@
from digits.webapp import scheduler
from digits.utils.forms import fill_form_if_cloned, save_form_to_job
from digits.utils.routing import request_wants_json
from digits.config import config_value

blueprint = flask.Blueprint(__name__, __name__)

@@ -24,8 +25,7 @@ def new():

# Is there a request to clone a job with ?clone=<job_id>
fill_form_if_cloned(form)

-return flask.render_template('datasets/images/generic/new.html', form=form)
+return flask.render_template('datasets/images/generic/new.html', form=form, system_type=config_value('system_type'))


@blueprint.route('.json', methods=['POST'])
@@ -46,7 +46,8 @@ def create():
if request_wants_json():
return flask.jsonify({'errors': form.errors}), 400
else:
-return flask.render_template('datasets/images/generic/new.html', form=form), 400
+return flask.render_template('datasets/images/generic/new.html', form=form,
+system_type=config_value('system_type')), 400

job = None
try:
@@ -55,6 +56,7 @@ def create():
name=form.dataset_name.data,
group=form.group_name.data,
mean_file=form.prebuilt_mean_file.data.strip(),

)

if form.method.data == 'prebuilt':
@@ -70,6 +72,9 @@ def create():
database=form.prebuilt_train_images.data,
purpose=form.prebuilt_train_images.label.text,
force_same_shape=force_same_shape,
time_limit=form.slurm_time_limit.data,
s_cpu_count=form.slurm_cpu_count.data,
s_mem=form.slurm_mem.data,
)
)

@@ -80,6 +85,9 @@ def create():
database=form.prebuilt_train_labels.data,
purpose=form.prebuilt_train_labels.label.text,
force_same_shape=force_same_shape,
time_limit=form.slurm_time_limit.data,
s_cpu_count=form.slurm_cpu_count.data,
s_mem=form.slurm_mem.data,
)
)

@@ -90,6 +98,9 @@ def create():
database=form.prebuilt_val_images.data,
purpose=form.prebuilt_val_images.label.text,
force_same_shape=force_same_shape,
time_limit=form.slurm_time_limit.data,
s_cpu_count=form.slurm_cpu_count.data,
s_mem=form.slurm_mem.data,
)
)
if form.prebuilt_val_labels.data:
@@ -99,6 +110,9 @@ def create():
database=form.prebuilt_val_labels.data,
purpose=form.prebuilt_val_labels.label.text,
force_same_shape=force_same_shape,
time_limit=form.slurm_time_limit.data,
s_cpu_count=form.slurm_cpu_count.data,
s_mem=form.slurm_mem.data,
)
)
