TitanPackaging
This page documents a sub-task of the HPC adaptation project at UiO; please see the TitanTop page for background.
For the first few weeks of the project, participants from both groups will establish a shared knowledge of the relevant DELPH-IN tools, standard tasks and use patterns, and their interaction with the HPC environment. This work package will be organized as a series of hands-on 'walk-through' sessions, most likely at a rate of about one half day per week. The main result of this phase of the project will be a mutual, in-depth understanding of the available resources, requirements, and limitations of current software versions. This knowledge will be documented in a collection of collaboratively authored wiki pages.
Based on the observations made in the initial project phase, VD and user staff will jointly adapt and 'package' DELPH-IN tools for batch processing, i.e. prepare a gradual shift of use patterns, away from the currently predominant interactive style. This work package will create a standard set of job scripts and configuration mechanisms that automate common tasks and provide a good balance between ease of use and flexibility, i.e. user customization of relevant parameters. The resources developed in this work package will be contributed as new open-source components to the DELPH-IN repository.
There is a preliminary collection of TITAN-specific files, including a few (imperfect) SLURM job scripts, in the sub-directory uio/titan/ inside the LOGON tree. Note that, while we are experimenting, these files may change frequently; please remember to update your $LOGONROOT tree often.
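Assuming the LOGON tree was installed as a Subversion checkout (the usual setup), bringing it up to date amounts to something like the following:
cd $LOGONROOT && svn update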
Parsing a corpus (i.e. a sequence of sentences) is an 'embarrassingly parallel' problem, in the sense that processing each sentence is completely independent of all other sentences. The memory footprint of parsing is comparatively tame: each parser client (process) is typically limited to a maximum process size of 1G, and the controller (process) tends to run in less than 2G. However, parsing is quite memory-intensive in the sense of frequently writing to and reading from large blocks of memory; on multi-core nodes, it may be worthwhile to watch out for possible saturation of the memory sub-system.
The itsdb controller (written predominantly in Lisp) supports parallelization (and distribution) across nodes at the sentence level, using PVM. The parser itself is part of the [http://www.delph-in.net/pet PET] package, implemented in C++ (with critical routines in pure C). A typical interactive parser invocation could be the following:
$LOGONROOT/parse --erg+tnt --count 7 --best 500 cb
The parse command is a shell script that will (a) launch the controller; (b) load the grammar identified as erg+tnt (the English Resource Grammar, [http://www.delph-in.net/erg ERG], used in conjunction with the TnT tagger); (c) use PVM routines to create 7 PET client processes for parsing; (d) configure the parsing environment to return up to the 500 most probable parses; and (e) work through the corpus identified as cb ([http://www.catb.org/esr/writings/cathedral-bazaar/cathedral-bazaar/ The Cathedral and the Bazaar]). The parse script and a few more existing LOGON scripts are discussed in more detail on the LogonProcessing pages.
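For batch use, the same invocation can be wrapped in a small SLURM job script. The following is a minimal, single-node sketch only; the account name, wall-clock limit, and memory request are placeholders that need to be adapted to the actual TITAN project and to the memory figures above (roughly 1G per PET client plus up to 2G for the controller):
#!/bin/bash
#SBATCH --job-name=erg-parse
#SBATCH --account=uioXXXXX         # placeholder: the relevant TITAN project account
#SBATCH --time=04:00:00            # placeholder wall-clock limit
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8          # one core for the controller, seven for PET clients
#SBATCH --mem-per-cpu=2G           # note the unit: 2G per core, not 2M

# sbatch exports the submission environment by default, so $LOGONROOT
# should be inherited from the login session that submits this job
$LOGONROOT/parse --erg+tnt --count 7 --best 500 cb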
By default, itsdb will launch a PVM daemon on the current node if necessary (i.e. if there is no existing daemon on that node for the current user). That means that putting seven PET clients on a single eight-core node is easy, as would be putting 31 such clients on a 32-core node. To take advantage of multiple nodes, however, PVM initialization will need to be informed of the set of nodes (and number of cores per node available), i.e. inspect $SLURM_JOB_NODELIST and friends. StephanOepen used to have half-baked script support to retrieve that information from the older SGE environment, then create a PVM initialization file (.pvm_hosts), and then ask itsdb to use all available PVM nodes. These steps should be adapted for SLURM, made robust (to a certain degree), and supported in standard scripts.
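As a starting point for multi-node parsing, a job script could expand the SLURM node list into a PVM host file along the following lines. This is an untested sketch: the location and exact format of the host file, how the PVM daemons get started on the remote nodes, and how itsdb is told to use them all need to be confirmed against the old SGE-era scripts:
#!/bin/bash
#SBATCH --nodes=4                  # placeholder: say, four eight-core nodes
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2G

# expand the compact SLURM host list (e.g. 'c1-[2-5]') into one hostname
# per line, as expected in a PVM host file
scontrol show hostnames $SLURM_JOB_NODELIST > $HOME/.pvm_hosts

# the total number of PET clients (here 31, leaving one core for the
# controller) is still requested through the --count option
$LOGONROOT/parse --erg+tnt --count 31 --best 500 cb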
The task of learning the parameters of a statistical model, say to rank alternate parses by probability, is substantially more cpu- and memory-intensive than parsing. The task is typically broken down into three sub-tasks: (a) preparing a so-called feature cache, extracting all relevant information from the original training data (a so-called treebank) and storing it as a Berkeley DB (depending on the size of the training data, feature caches can vary between 10G and 100G in size); (b) performing a so-called grid search for best-performing feature choices and estimation (hyper-)parameters (typically using cross-validation, i.e. repeatedly training on ninety percent of the total data and testing on the remaining ten); and (c) training and serializing an actual model, reflecting the parameters found to work best during the grid search.
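Since the individual folds (and hyper-parameter settings) of the grid search are mutually independent, that sub-task lends itself naturally to a SLURM array job. The sketch below only illustrates the general pattern: 'experiment.sh' is a hypothetical stand-in for whatever LOGON command actually performs one training and evaluation run, and the resource requests are placeholders:
#!/bin/bash
#SBATCH --job-name=grid-search
#SBATCH --array=0-9                # e.g. one array task per cross-validation fold
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=8G           # placeholder: feature caches can be 10G to 100G
#SBATCH --time=12:00:00            # placeholder wall-clock limit

# each array task trains on nine tenths of the data and evaluates on the
# remaining tenth, identified by its array index
./experiment.sh --fold $SLURM_ARRAY_TASK_ID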
- My initial attempts at producing a 'parse' job script had erroneously specified --mem-per-cpu=2M (presumably intending 2G); the jobs started and were terminated immediately, but even when requesting --mail-type=ALL, I saw no indication of why my jobs had been killed (nor did I find a log file providing that information).
- For use with the older SGE environment, I had a user configuration file with SGE defaults (project, maximum wall-clock time, mail notification, and so on). Is there a corresponding facility in SLURM? If not, is there a way to include #SBATCH statements near the top of each job file?
- The sbatch documentation suggests that it should be possible to pass command-line arguments on to the job script, i.e. not have to maintain separate job scripts for every single (distinct) way of invoking the LOGON 'parse' script, say; see the sketch below.
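Indeed, anything given on the sbatch command line after the name of the job script is passed on to the script as positional parameters, and the #SBATCH lines are honoured as long as they appear before the first executable command. A generic wrapper could thus look roughly as follows (a sketch with placeholder resource requests):
#!/bin/bash
#SBATCH --account=uioXXXXX         # placeholder project account
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2G

# forward all command-line arguments to the LOGON parse script
$LOGONROOT/parse "$@"
Assuming the wrapper is saved as, say, parse.slurm, it could then be submitted as:
sbatch parse.slurm --erg+tnt --count 7 --best 500 cb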