Schwarz preconditioned CG gives nans in develop (again) #1601

@jcosborn

The issue reported in #1535 appears to be back in develop. It shows up in both the invert_test_mobius_sym and invert_test_mobius_asym tests, which exit with output like:

[ RUN      ] SchwarzNormal/InvertTest.verify/double_double_pcg_mat_pc_dag_mat_pc_normop_pc_additive_schwarz_cg_half_l2
Computed plaquette is 1.233908e-01 (spatial = 1.223209e-01, temporal = 1.244607e-01)
Solution = mat_pc_dag_mat_pc, Solve = normop_pc, Solver = pcg, Precision = double, Sloppy precision = double
CG: Convergence at 10 iterations, L2 relative residual: iterated = 1.652612e+06 (requested = 1.000000e-01)
CG: Convergence at 10 iterations, L2 relative residual: iterated = 1.398104e+04 (requested = 1.000000e-01)
CG: Convergence at 10 iterations, L2 relative residual: iterated = 2.604118e+05 (requested = 1.000000e-01)
...
CG: Convergence at 10 iterations, L2 relative residual: iterated = 1.014964e+02 (requested = 1.000000e-01)
CG: Convergence at 10 iterations, L2 relative residual: iterated = 9.442425e+04 (requested = 1.000000e-01)
CG: Convergence at 10 iterations, L2 relative residual: iterated = 2.110695e+04 (requested = 1.000000e-01)
ERROR: Solver appears to have diverged with residual       nan (rank 0, host plate, solver.cpp:417 in bool quda::Solver::convergence(quda::cvector<double>&, quda::cvector<double>&, quda::cvector<double>&, quda::cvector<double>&)())
       last kernel called was (name=N4quda4blas11axpyCGNorm2IddEE,volume=1x4x6x8x4,aux=GPU-offline,large_kernel_arg,vol=768,parity=1,precision=8,Ns=4,Nc=3,order=0,N=2,n_rhs=1)
       last tune param used was block=(32,1,1), grid=(24,1,1), shared_bytes=0, shared_carve_out=0, aux=(-1,-1,-1,-1)
Saving 294 sets of cached parameters to /home/josborn/lqcd/build/quda-git/qudatune/tunecache_notune_error.tsv
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
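
For triage, it may help to pull the residuals out of the CG lines in a log like the one above and list every line where the iterated residual is NaN or exceeds the requested one. The following is only a rough sketch: the regular expression is inferred from the log lines shown here, and the function name `suspicious_cg_lines` is hypothetical, not part of QUDA.

```python
import math
import re

# Pattern inferred from log lines of the form:
#   CG: Convergence at 10 iterations, L2 relative residual: iterated = 1.652612e+06 (requested = 1.000000e-01)
# This is an assumption based on the output above, not a documented QUDA format.
CG_LINE = re.compile(
    r"CG: Convergence at (\d+) iterations, L2 relative residual: "
    r"iterated = ([^\s)]+) \(requested = ([^\s)]+)\)"
)


def suspicious_cg_lines(log_text):
    """Return (iterations, iterated, requested) for every CG line whose
    iterated residual is NaN or larger than the requested residual."""
    flagged = []
    for match in CG_LINE.finditer(log_text):
        iterations = int(match.group(1))
        iterated = float(match.group(2))   # float() also accepts "nan"
        requested = float(match.group(3))
        if math.isnan(iterated) or iterated > requested:
            flagged.append((iterations, iterated, requested))
    return flagged


if __name__ == "__main__":
    import sys

    # Read a captured test log from stdin and print the flagged lines.
    for iterations, iterated, requested in suspicious_cg_lines(sys.stdin.read()):
        print(f"iters={iterations} iterated={iterated:e} requested={requested:e}")
```

Piping the captured test output into the script on stdin lists the flagged lines, which makes it easier to see how far the reported residuals are from the requested tolerance before the NaN appears.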

The test given in #1535 fails with a different error:

Computed plaquette is 1.231117e-01 (spatial = 1.236920e-01, temporal = 1.225315e-01)
Solution = mat, Solve = normop_pc, Solver = pcg, Precision = single, Sloppy precision = half
ERROR: Solver appears to have diverged for n = 0 (rank 0, host plate, solver.cpp:479 in void quda::Solver::PrintStats(const char*, int, quda::cvector<double>&, quda::cvector<double>&, quda::cvector<double>&)())
       last kernel called was (name=N4quda4blas9axpyZpbx_IfEE,volume=6x12x12x16x8,aux=GPU-offline,large_kernel_arg,vol=110592,parity=1,precision=2,Ns=4,Nc=3,order=0,N=8,n_rhs=1)
       last tune param used was block=(640,1,1), grid=(76,1,1), shared_bytes=0, shared_carve_out=0, aux=(-1,-1,-1,-1)
