Description
Hello all,
I am having an odd issue where a 3D refinement stalls and terminates without any error message. I am on a laptop, logged in remotely to RELION on another computer with 4 GPUs. I set up the job, watch it through the first iteration or first few iterations, and then let it run in the background or overnight. However, several times when I have gone back to view the results, the job is still listed as active, but no progress is being made and all RELION-related processes have stopped running on the GPUs and CPUs. The stall is not specific to any one iteration. If I continue the job from the last iteration, it completes with no issues. Any idea why this could be?
I should also note that this has happened on runs where the particles are copied to the scratch directory, and I haven't had this problem (yet) when running jobs using RAM.
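For reference, here is a minimal sketch of how the last written iteration could be located before continuing a stalled run. It assumes the job directory contains optimiser files named `run_itNNN_optimiser.star` (the same numbering as the `run_it025_data.star` input above); the helper name `latest_optimiser` is hypothetical and not part of RELION.

```python
import re
from pathlib import Path

# Assumption: optimiser files are named run_itNNN_optimiser.star.
_ITERATION = re.compile(r"run_it(\d+)_optimiser\.star$")


def latest_optimiser(job_dir):
    """Return (iteration, path) for the highest-numbered optimiser file.

    Returns None if the directory holds no matching file.
    """
    best = None
    for path in Path(job_dir).glob("run_it*_optimiser.star"):
        match = _ITERATION.search(path.name)
        if match is None:
            continue
        iteration = int(match.group(1))
        if best is None or iteration > best[0]:
            best = (iteration, path)
    return best
```

Calling `latest_optimiser("Refine3D/job034")` would give the file to hand to the GUI's continue option (or, as far as I understand, to `--continue` on the command line).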
Environment:
- OS: Ubuntu 22.04.5 LTS
- RELION version: 5.0.0-commit-5b1a65
- Memory: 512 GB
- GPU: 4x Nvidia RTX A4500
Dataset:
- Box size: 300px
- Pixel size: 0.847 Å/px
- Number of particles: 250,000
- Description: subtracted particles containing a heterodimer of about 110 kDa
Job options:
- Type of job: 3D auto-refine
- Number of MPI processes: 5
- Number of threads: 10
- Full command (see note.txt in the job directory):
`which relion_refine_mpi` --o Refine3D/job034/run --auto_refine --split_random_halves --i Class3D/job033/run_it025_data.star --ref Import/job021/p1p2_subref_300px.mrc --trust_ref_size --ini_high 60 --dont_combine_weights_via_disc --no_parallel_disc_io --scratch_dir /mnt/scratch/ --pool 100 --pad 2 --ctf --particle_diameter 150 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 10 --gpu "" --pipeline_control Refine3D/job034/
Error message:
No error message. Listed as an active job and makes no progress. No relion-related processes left running on GPUs or CPUs (other than the gui).
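To make the symptom easier to catch when it happens, below is a rough sketch of a check that separates "still running" from "died silently". It relies on several assumptions: that `pgrep` is available on the workstation, that the refine processes have names starting with `relion_refine` (so the GUI is not matched), that the job directory contains a `run.out` log, and that RELION marks finished jobs with `RELION_JOB_EXIT_SUCCESS`, `RELION_JOB_EXIT_FAILURE` or `RELION_JOB_EXIT_ABORTED` files. The function names are hypothetical.

```python
import subprocess
import time
from pathlib import Path

# Assumption: RELION writes one of these marker files when a job ends.
EXIT_MARKERS = (
    "RELION_JOB_EXIT_SUCCESS",
    "RELION_JOB_EXIT_FAILURE",
    "RELION_JOB_EXIT_ABORTED",
)


def refine_process_running():
    """Return True if pgrep finds a process whose name matches relion_refine."""
    result = subprocess.run(
        ["pgrep", "relion_refine"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def log_age_seconds(job_dir, now=None):
    """Seconds since run.out was last modified, or None if it does not exist."""
    log = Path(job_dir) / "run.out"
    if not log.exists():
        return None
    now = time.time() if now is None else now
    return now - log.stat().st_mtime


def classify_job(job_dir, process_running):
    """Classify a job as 'finished', 'running' or 'died_silently'."""
    job_dir = Path(job_dir)
    if any((job_dir / marker).exists() for marker in EXIT_MARKERS):
        return "finished"
    if process_running:
        return "running"
    # No exit marker and no refine process: this matches the state described above.
    return "died_silently"
```

Running `classify_job("Refine3D/job034", refine_process_running())` from a cron job or a loop, together with `log_age_seconds`, would at least record roughly when the processes disappeared, which could then be compared against the system logs.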