LUMI setup
Recommended reads:
- https://docs.lumi-supercomputer.eu/storage/
- https://docs.lumi-supercomputer.eu/runjobs/
- https://docs.lumi-supercomputer.eu/software/installing/container-wrapper/
- https://github.com/paracrawl/cirrus-scripts#readme
- https://docs.google.com/document/d/1YyjdWofZ65ib9qTnGiJ8n0Rvgm4PKRhwvnFYfXrSMRg/edit?usp=sharing
- https://docs.google.com/presentation/d/1zRPEm2QM3MSrmE6894U-E4TAhFH1cEz_pbpH3895dqA/edit?usp=sharing
Please ignore "Compiling Software" section in README, instead follow these steps. The conda container that will be used, contains most of the software needed.
Clone the repo (do not clone recursively), switch to the `lumi` branch, and initialise only the needed submodules:
```
git clone https://github.com/paracrawl/cirrus-scripts
cd cirrus-scripts
git checkout lumi
git submodule update --init env/src/preprocess
```
Edit `env/init.d/lumi.sh` and set the `PATH` variable to the `bin` directory of the conda container. Right now it is set to `project_462000252/zaragoza/bitextor-8.1/bin`, which is a working environment you can use, so there is no need to change it.
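For reference, the relevant line might look roughly like this (the `/projappl` prefix is an assumption; point it at whichever container you actually use):
```bash
# Sketch of the PATH line in env/init.d/lumi.sh; the exact path prefix is an assumption,
# use the real location of your conda container's bin directory.
export PATH="/projappl/project_462000252/zaragoza/bitextor-8.1/bin:$PATH"
```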
Edit `config.d/10.lumi.sh` to set up the working directories for processed data (a hypothetical example follows this list):
- Change `PROJ_DIR` and `SCRATCH_DIR` to your directories in the projappl and scratch partitions of the project (e.g. `/projappl/project_462000252/user`). The project partition will be used to store the code and models, scratch to store the data.
- Set up collection names and directories. For the test runs there is no need for additional changes, only to copy the data (explained below).
- Other relevant variables that may not need modifications for the test runs:
  - `SBATCH_ACCOUNT` specifies the project that will be billed for the computing hours.
  - `SBATCH_PARTITION`: we will be using `small` for the test but will probably change to `standard`.
  - `SBATCH_MEM_PER_CPU`: only needed for the `small` partition. Comment this line out for the `standard` partition.
  - `SLURM_LOGS`: directory to store the logs of all the jobs. THIS DIRECTORY NEEDS TO BE CREATED before running jobs, otherwise they will fail. Also note that this directory grows significantly in number of files, so make sure to clean it from time to time.
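A minimal sketch of what `config.d/10.lumi.sh` could contain for a test run on the `small` partition; the directory layout, log location and memory value are assumptions, not the shared configuration:
```bash
# Hypothetical 10.lumi.sh values for a small test run; adjust to your project and user.
export PROJ_DIR=/projappl/project_462000252/$USER      # code and models
export SCRATCH_DIR=/scratch/project_462000252/$USER    # processed data

export SBATCH_ACCOUNT=project_462000252                # project billed for the compute hours
export SBATCH_PARTITION=small                          # switch to 'standard' in later runs
export SBATCH_MEM_PER_CPU=1750M                        # only for 'small'; comment out on 'standard'
export SLURM_LOGS=$SCRATCH_DIR/slurm-logs              # create this directory before running jobs
```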
To install the software that is not included in the container, run:
```
cd env
./setup.sh install paracrawl
```
For users without access to the project, the bitextor container mentioned above is not available, but it can be created with this configuration file for the LUMI conda container wrapper:
```yaml
channels:
  - conda-forge
  - bitextor
  - dmnapolitano
  - esarrias
dependencies:
  - bitextor=8.1
```
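Building it with the LUMI container wrapper would look roughly like this, assuming the configuration above is saved as `bitextor.yaml`; the module names and flags follow the LUMI documentation linked above, so double-check them there:
```bash
# Sketch of creating the conda container on LUMI (verify module and flag names in the LUMI docs)
module load LUMI lumi-container-wrapper
conda-containerize new --prefix /projappl/project_462000252/$USER/bitextor-8.1 bitextor.yaml
```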
To configure the translation step with a Bergamot student model, the following steps are required (a sketch of the commands follows this list):
- Create the language pair directory, e.g. `models/es-en`.
- Download the student model files to `models/es-en/esen.student.tiny11` and create a `model` symlink pointing to it.
- Create a `translate.sh` symlink pointing to `models/translate-bergamot.sh`.
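For example, from the repository root (hypothetical commands reproducing the es-en layout shown below):
```bash
# Hypothetical commands for the es-en example; adjust the language pair and model name
mkdir -p models/es-en/esen.student.tiny11
# ...download or copy the student model files into that directory, then:
ln -s esen.student.tiny11 models/es-en/model
ln -s "$PWD/models/translate-bergamot.sh" models/es-en/translate.sh
```
The resulting layout: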
```
zaragoza2@uan01:~/proj_462000252/zaragoza/cirrus-scripts> ll models/es-en/
total 8.0K
drwxrws--- 2 zaragoza2 project_462000252 4.0K May 11 13:03 esen.student.tiny11
lrwxrwxrwx 1 zaragoza2 project_462000252   19 May 11 13:14 model -> esen.student.tiny11
lrwxrwxrwx 1 zaragoza2 project_462000252   84 May 11 13:00 translate.sh -> /users/zaragoza2/proj_462000252/zaragoza/cirrus-scripts/models/translate-bergamot.sh
```
Note that `translate-bergamot.sh` will look for the `marian-decoder` config at `models/es-en/model/config.yml`. This is an optimized example for Bergamot models:
```yaml
quiet-translation: true
relative-paths: true
models:
  - model.intgemm.alphas.bin
vocabs:
  - vocab.esen.spm
  - vocab.esen.spm
shortlist:
  - lex.s2t.bin
  - false
beam-size: 1
normalize: 1.0
word-penalty: 0
mini-batch: 16
maxi-batch: 100
maxi-batch-sort: src
workspace: 256
max-length: 300
max-length-crop: true
gemm-precision: int8shiftAlphaAll
```
`max-length-crop` avoids super long lines freezing Marian.
The Marian Bergamot CPU version is already compiled and configured in `translate-bergamot.sh`, so there is no need to compile it.
To use other types of translators, you will need to compile/install them yourself and configure `translate.sh`.
Take a look at the translation template scripts in the `models/` directory to get an idea of what is needed.
Note the use of the `foldfilter` wrapper to chop very long lines before translation and rejoin them at the output.
WARNING: `foldfilter` can mess up spaces in some cases and cannot handle languages without spaces, for example.
Check outputs before using it.
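As a hedged sketch, a custom `translate.sh` might wrap the decoder with `foldfilter` roughly like this; the `-w` flag and the decoder options are assumptions, so compare with the real templates in `models/` before relying on it:
```bash
#!/bin/bash
# Hypothetical custom translate.sh: fold very long lines, translate stdin to stdout,
# and rejoin the pieces at the output. Flag names and decoder options are assumptions;
# check the templates shipped in models/ and the tools' help output.
foldfilter -w 1000 \
    marian-decoder -c model/config.yml --cpu-threads 4 --quiet
```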
The sharding step can be used as it is, but for test runs with small amounts of data we do not need that level of parallelization and we need to keep the number of files low, so we can configure it to create only 4 (2²) shards:
```diff
diff --git a/02.giashard b/02.giashard
index e367863..6c2a6f4 100755
--- a/02.giashard
+++ b/02.giashard
@@ -17,7 +17,7 @@ mkdir $SHARD_PATH.$$
 cat "$BATCH_LIST" \
 | awk "NR > $GROUP_START && NR <= $GROUP_END" \
-| xargs giashard -d $SCRIPTS/domain-suffixes.txt -f text,url -b 1024 -n 8 -o $SHARD_PATH.$$
+| xargs giashard -d $SCRIPTS/domain-suffixes.txt -f text,url -b 1024 -n 2 -o $SHARD_PATH.$$
 # Fix filenames
 for BATCH in $SHARD_PATH.$$/*/*/; do
diff --git a/02.giashard.sh b/02.giashard.sh
index 99fd834..0ed3904 100755
--- a/02.giashard.sh
+++ b/02.giashard.sh
@@ -35,7 +35,7 @@ esac
 export BATCHES_PER_TASK
 export TASKS_PER_BATCH=1 # more than 1 is not supported by 02.giashard
-export SHARDS_PER_TASK=16 # for 02.giamerge -> 1-16 * 16 = 256 shards
+export SHARDS_PER_TASK=1 # for 02.giamerge -> 1-16 * 16 = 256 shards
 for language in $@; do
 	batch_list=$(make_batch_list $collection $language)
@@ -62,7 +62,7 @@ for language in $@; do
 	merge_job_id=$(schedule \
 		-J merge-shard-${language}-${collection} \
 		--dependency afterok:$shard_job_id \
-		-a 1-16 \
+		-a 1-2 \
 		--time 24:00:00 \
 		--cpus-per-task 8 `#really just need 4, but 8 for more memory and better spread` \
 		-e ${SLURM_LOGS}/02.merge-${language}-%A_%a.err \
```
CAUTION: all languages of a language pair that is going to be aligned need the same sharding configuration, so that they end up with the same number of shards.
For each collection we need `$collection-{text,shards,batches}` directories.
The `-{shards,batches}` directories need to be created manually, otherwise job scheduling will fail.
The data that sharding will use as a starting point needs to be located in the `$collection-text` directory.
For the test runs you can copy the data from `/scratch/project_462000252/zaragoza/data/*-text`.
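For example, assuming the collection directories live directly under `$SCRATCH_DIR` (adjust to wherever your `config.d/10.lumi.sh` points them):
```bash
# Hedged sketch: copy the test data and create the shards/batches dirs for each collection
cp -r /scratch/project_462000252/zaragoza/data/*-text "$SCRATCH_DIR/"
for t in "$SCRATCH_DIR"/*-text; do
    c=${t%-text}                       # strip the -text suffix to get the collection path
    mkdir -p "$c-shards" "$c-batches"  # must exist before scheduling the sharding jobs
done
```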
After all the configuration, all the steps can be followed as the README explains.
Each processing step follows the scheme "run the process writing to an output file with a temporary suffix, then remove the suffix to mark it as finished". So every time a job fails or does not finish properly, it will leave temp files all over the place. Cleaning them regularly is advised in order to reduce the number of files.
Some steps, like `tokenise` and `split-text`, run serial jobs, so allocating more CPUs per job does not parallelize them.
But allocating more than one CPU will, in the case of the `small` partition, provide more memory to prevent running out of it.
So running
```
TPB=1 TPN=3 ./05.tokenise.sh output_wide15_filtered_sample12 es
```
will schedule an array job whose size is the total number of batches we have for that language, where each job processes its batch serially. Having 3 CPUs will provide more RAM.
In a scenario where we have more batches, and therefore a job array bigger than the job limit (200), we can increase `TPB` so that each job processes more than one batch. Note that this won't increase the parallelization, it only avoids the scheduler limits.
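For instance, with a hypothetical 600 batches and a 200-task array limit:
```bash
# Hypothetical example: 600 batches with TPB=3 -> a 200-task array, each task processing 3 batches serially
TPB=3 TPN=3 ./05.tokenise.sh output_wide15_filtered_sample12 es
```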
To allocate more CPUs per job and let threading parallelize the non-serial steps, do:
```
TPN=1 SBATCH_CPUS_PER_TASK=128 ./04.translate.sh
```
We are not using the `standard` partition for now, but will probably use it in the next runs.
It is important to know that the `standard` partition allocates full nodes, not sub-node resources.
So the `TPN` variable won't affect the number of CPUs allocated: each job in the array will have 128 cores.
To take advantage of the full node in serial steps, we will need to run with
```
TPB=128 ./05.tokenise.sh ...
```
or something high like 64, to have most of the cores processing a batch. Note that 128 would spawn a lot of processes and can lead to OOM, so decreasing it a bit could be reasonable. But this is not tested.