Relion 5 Stalls out with no error message during 3D Refinement #1279

@egelliott6

Hello all,

I am having an odd issue where a 3D refinement seems to stall and then terminate without any error message. I am on a laptop, logging in remotely to RELION on another computer with 4 GPUs. I set up the job, watch it through the first iteration or the first few iterations, and then let it run in the background or overnight. However, several times when I have gone back to view the results, the job is still listed as active but is making no progress, and all RELION-related processes have stopped running on the GPUs and CPUs. The stalling is not specific to any particular iteration. If I continue the job from the last iteration, it completes with no issues. Any idea why this could be?

I should also note that this has happened on jobs where the particles are read into the scratch directory, and I haven't had this problem (yet) on jobs that use RAM instead.
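
For reference, continuing from the last iteration needs the newest optimiser file in the job directory. Below is a minimal sketch of how to find it, assuming RELION's usual `run_itNNN_optimiser.star` naming with zero-padded iteration numbers. The function name `latest_optimiser` is made up for illustration and is not part of RELION.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RELION): print the optimiser file of the
# highest iteration found in a job directory, or return 1 if there is none.
# Assumes file names of the form run_itNNN_optimiser.star with zero-padded NNN.
latest_optimiser() {
    local job_dir="${1%/}" f latest=""
    for f in "$job_dir"/run_it*_optimiser.star; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
        latest="$f"               # glob results are sorted, so the last one wins
    done
    [ -n "$latest" ] && printf '%s\n' "$latest"
}
```

Calling `latest_optimiser Refine3D/job034` prints the path to hand to the GUI's continue option, or, as far as I understand it, to the `--continue` argument of `relion_refine_mpi`.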

Environment:

  • OS: Ubuntu 22.04.5 LTS
  • RELION version: 5.0.0-commit-5b1a65
  • Memory: 512 GB
  • GPU: 4x Nvidia RTX A4500

Dataset:

  • Box size: 300px
  • Pixel size: 0.847 Å/px
  • Number of particles: 250,000
  • Description: subtracted particles containing a heterodimer of about 110 kDa

Job options:

  • Type of job: 3D auto-refine
  • Number of MPI processes: 5
  • Number of threads: 10
  • Full command (see note.txt in the job directory):
    `which relion_refine_mpi` --o Refine3D/job034/run --auto_refine --split_random_halves --i Class3D/job033/run_it025_data.star --ref Import/job021/p1p2_subref_300px.mrc --trust_ref_size --ini_high 60 --dont_combine_weights_via_disc --no_parallel_disc_io --scratch_dir /mnt/scratch/ --pool 100 --pad 2 --ctf --particle_diameter 150 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 10 --gpu "" --pipeline_control Refine3D/job034/

Error message:
No error message. The job is listed as active but makes no progress. No RELION-related processes are left running on the GPUs or CPUs (other than the GUI).
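
The state described above (still listed as active, nothing being written) can be checked from a script instead of the GUI. This is only a rough sketch under two assumptions: that the job directory gets a `RELION_JOB_EXIT_SUCCESS`, `RELION_JOB_EXIT_FAILURE` or `RELION_JOB_EXIT_ABORTED` marker file when the job ends, and that a healthy run keeps updating `run.out`. The function name `job_state` and the 60-minute default are hypothetical choices, not anything RELION provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RELION): classify a job directory as
# finished, failed, aborted, running, or stalled.
# Assumption: "stalled" means there is no exit marker and run.out has not
# been modified within the last max_idle_min minutes (a missing run.out
# is also reported as stalled).
job_state() {
    local job_dir="${1%/}" max_idle_min="${2:-60}"
    if [ -e "$job_dir/RELION_JOB_EXIT_SUCCESS" ]; then echo finished; return; fi
    if [ -e "$job_dir/RELION_JOB_EXIT_FAILURE" ]; then echo failed; return; fi
    if [ -e "$job_dir/RELION_JOB_EXIT_ABORTED" ]; then echo aborted; return; fi
    # find prints the file only if it was modified inside the time window
    if [ -n "$(find "$job_dir/run.out" -mmin "-$max_idle_min" 2>/dev/null)" ]; then
        echo running
    else
        echo stalled
    fi
}
```

For example, `job_state Refine3D/job034 30` reports `stalled` once `run.out` has gone untouched for 30 minutes with no exit marker present. A long iteration could in principle trip a short threshold, so the window should be generous.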
