TitanPackaging
This page documents a sub-task of the HPC adaptation project at UiO; please see the TitanTop page for background.
For the first few weeks of the project, participants from both groups will establish a shared knowledge of the relevant DELPH-IN tools, standard tasks and use patterns, and their interaction with the HPC environment. This work package will be organized as a series of hands-on 'walk-through' sessions, most likely at a rate of about one half day per week. The main result of this phase of the project will be a mutual, in-depth understanding of the available resources, requirements, and limitations in current software versions. This knowledge will be documented in a collection of collaboratively authored wiki pages.
Based on the observations made in the initial project phase, VD and user staff will jointly adapt and 'package' DELPH-IN tools for batch processing, i.e. prepare a gradual shift of use patterns, away from the currently predominant interactive style. This work package will create a standard set of job scripts and configuration mechanisms that automate common tasks and provide a good balance between ease of use and flexibility, i.e. user customization of relevant parameters. The resources developed in this work package will be contributed as new open-source components to the DELPH-IN repository.
There is a preliminary collection of TITAN-specific files, including a few (imperfect) SLURM job scripts, in the sub-directory uio/titan/ inside the LOGON tree. Note that, while we are experimenting, these files may change frequently; please remember to update your $LOGONROOT tree often.
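Assuming the LOGON tree is maintained as a Subversion checkout (which is an assumption here; see the LOGON installation instructions for the authoritative procedure), refreshing it amounts to something like:

  cd $LOGONROOT
  svn update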
Parsing a corpus (i.e. a sequence of sentences) is an 'embarrassingly parallel' problem, in the sense that processing each sentence is completely independent of all other sentences. Total memory usage in parsing is comparatively tame: each parser client (process) is typically limited to a maximum process size of 1G, and the controller (process) tends to run in less than 2G. However, parsing is memory-intensive in a different sense: it frequently and actively writes to and reads from large blocks of memory. On multi-core nodes, it may therefore be worthwhile to watch for saturation of the memory sub-system.
The itsdb controller (written predominantly in Lisp) supports parallelization (and distribution) across nodes at the sentence level, using PVM. The parser itself is part of the [http://www.delph-in.net/pet PET] package, implemented in C++ (with critical routines in pure C). A typical interactive parser invocation could be the following:
$LOGONROOT/parse --erg+tnt --count 7 --best 500 cb
The parse command is a shell script that will (a) launch the controller; (b) load the grammar identified as erg+tnt (the English Resource Grammar, [http://www.delph-in.net/erg ERG], used in conjunction with the TnT tagger); (c) use PVM routines to create seven PET client processes for parsing; (d) configure the parsing environment to return up to the 500 most probable parses; and (e) work through the corpus identified as cb ([http://www.catb.org/esr/writings/cathedral-bazaar/cathedral-bazaar/ The Cathedral and the Bazaar]). The parse script and a few more existing LOGON scripts are discussed in more detail on the LogonProcessing pages.
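As a first approximation, such an invocation can be wrapped in a SLURM batch script more or less verbatim. The following is a sketch only: the account name and time limit are placeholders, and the memory request merely reflects the rough per-process figures quoted above.

  #!/bin/bash
  #SBATCH --job-name=parse.cb
  #SBATCH --account=MyProject      # placeholder: site-specific project account
  #SBATCH --time=04:00:00          # placeholder wall-clock limit
  #SBATCH --nodes=1
  #SBATCH --cpus-per-task=8        # one controller plus seven PET clients
  #SBATCH --mem-per-cpu=2G         # clients peak near 1G; the controller below 2G

  # invoke the LOGON parse script exactly as one would interactively
  $LOGONROOT/parse --erg+tnt --count 7 --best 500 cb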
By default, itsdb will launch a PVM daemon on the current node if necessary (i.e. if there is no existing daemon on that node for the current user). That means that putting seven PET clients on a single eight-core node is easy, as would be putting 31 such clients on a 32-core node. To take advantage of multiple nodes, however, PVM initialization needs to be informed of the set of nodes allocated to the job (and of the number of cores available on each node), i.e. it must inspect $SLURM_JOB_NODELIST and friends. StephanOepen used to have half-baked script support to retrieve that information from the older SGE environment, create a PVM initialization file (.pvm_hosts), and then ask itsdb to use all available PVM nodes. These steps should be adapted for SLURM, made robust (to a reasonable degree), and supported in the standard scripts.
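The sketch below shows what the SLURM side of those steps might look like: scontrol show hostnames is the standard SLURM idiom for expanding the compact node list, whereas the bare-hostname .pvm_hosts format is an assumption that would need checking against what itsdb actually expects.

  # expand the compact node list (e.g. 'c17-[2,5-7]') into one hostname per line
  nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")

  # PVM host files list one host per line, optionally followed by options;
  # the plain-hostname form is assumed to suffice here
  : > "$HOME/.pvm_hosts"
  for node in $nodes; do
    echo "$node" >> "$HOME/.pvm_hosts"
  done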
The task of learning the parameters of a statistical model, say to rank alternate parses by probability, is substantially more cpu- and memory-intensive than parsing. The task is typically broken down into three sub-tasks: (a) preparing a so-called feature cache, extracting all relevant information from the original training data (a so-called treebank) and storing it as a Berkeley DB (depending on the size of the training data, feature caches can vary between 10G and 100G in size); (b) performing a so-called grid search for the best-performing feature choices and estimation (hyper-)parameters (typically using cross-validation, i.e. repeatedly training on ninety percent of the total data and testing on the remaining ten); and (c) training and serializing the actual model, reflecting the parameters found to work best during the grid search.
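Because the individual folds of a cross-validation run are mutually independent, the grid search maps naturally onto a SLURM array job. In the sketch below, train-fold is a purely hypothetical stand-in for whatever command actually performs estimation on one fold, and the resource figures are placeholders informed by the cache sizes quoted above.

  #!/bin/bash
  #SBATCH --job-name=grid.search
  #SBATCH --array=0-9              # one array task per cross-validation fold
  #SBATCH --time=24:00:00          # placeholder: estimation is cpu-intensive
  #SBATCH --mem-per-cpu=8G         # placeholder: feature caches run 10G to 100G

  # each task trains on ninety percent of the data, holding out fold N;
  # 'train-fold' is hypothetical, standing in for the actual estimation command
  train-fold --fold "$SLURM_ARRAY_TASK_ID" --cache features.bdb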
- My initial attempts at producing a 'parse' job script had erroneously specified --mem-per-cpu=2M; the jobs were started and terminated immediately, but even when requesting --mail-type=ALL, I did not see any indication of why my jobs were killed (nor did I find a log file providing that information).
- For use with the older SGE environment, I had a user configuration file with SGE defaults (project, maximum wall-clock time, mail notification, and so on). Is there a corresponding facility in SLURM? If not, is there a way to include #SBATCH statements near the top of each job file? (See the sketch at the end of this section.)
- The sbatch documentation suggests that it should be possible to pass command-line options to the job script, i.e. not have to maintain separate job scripts for every single (distinct) way of invoking the LOGON 'parse' script, say.
Yes, sbatch stops consuming command-line options as soon as it sees the first option that it does not recognize as its own; that option and all following ones are passed on as parameters to the job script (accessible as positional parameters, $1 and so on, inside the script). For example:
sbatch $LOGONROOT/uio/titan/parse --erg+tnt --best 500 cb
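On the configuration question above: sbatch honors a family of input environment variables (e.g. SBATCH_ACCOUNT, SBATCH_TIMELIMIT) that could be set in a shell start-up file, and #SBATCH directives are indeed read from the top of each job file, provided they precede the first executable command. A minimal header along those lines, with placeholder values:

  #!/bin/bash
  #SBATCH --account=MyProject      # placeholder: site-specific project account
  #SBATCH --time=01:00:00          # maximum wall-clock time
  #SBATCH --mail-type=ALL          # mail notification on all job state changes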