PIConGPU mallocMC error on taurusml (possible duplicate of #3064) #3433
Comments
Might this be caused by extremely slow I/O? It took 15 minutes to create |
I do not remember this specific error. You may try to run with blocking kernel on for debugging ( |
I wanted to check whether linking on the nodes was correct, but I only get the output of |
Okay -the missing |
@steindev Reducing the simulation size by a factor of 4 and switching the blocking kernel on still leads to the same error. |
@franzpoeschel Did your PIConGPU simulation run? |
Sorry for the late response.
From the exception message alone I would assume something is wrong with device memory. But given the aforementioned |
It looks like the message we always got from |
@PrometheusPi On Friday I will create a patched version from psychocoderHPC@5e93302 that we can run on a single GPU, to build a template we can later use to write native CUDA code. |
@psychocoderHPC That sounds great. Thanks for the info. |
Some more details of the error messages from ./picongpu -s 100 -g 128 128 128:
full simulation time: 8sec 402msec = 8 sec
Unhandled exception of type 'St13runtime_error' with message '/scratch/ws/1/s5960712-ml_streaming/pic_env/build/picongpu/thirdParty/cupla/alpaka/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(360) 'cudaMalloc( &memPtr, static_cast<std::size_t>(widthBytes))' returned error : 'cudaErrorMemoryAllocation': 'out of memory'!', terminating
*** Error in `./picongpu': double free or corruption (!prev): 0x0000000035bd9690 ***
======= Backtrace: =========
/lib64/libc.so.6(cfree+0x4a0)[0x200000f09be0]
/sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_ZdlPv+0x18)[0x200000c47c38]
/sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_ZdlPvm+0x18)[0x200000c47c78]
./picongpu(_ZN16cupla_cuda_async13cuplaFreeHostEPv+0x15c)[0x1055152c]
./picongpu(_ZN5pmacc6BufferINS_9SuperCellINS_5FrameINS_15ParticlesBufferINS_19ParticleDescriptionINS_4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENS_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESD_NSC_IiLi4EEEEEN5boost3mpl6v_itemIN8picongpu9weightingENSI_INSJ_8momentumENSI_INSJ_8positionINSJ_12position_picENS_13pmacc_isAliasEEENSH_7vector0INSB_2naEEELi0EEELi0EEELi0EEENSI_INSJ_11chargeRatioINSJ_20ChargeRatioElectronsESO_EENSI_INSJ_9massRatioINSJ_18MassRatioElectronsESO_EENSI_INSJ_7currentINSJ_13currentSolver9EsirkepovINSJ_9particles6shapes3TSCENS13_8strategy16CachedSupercellsELj3EEESO_EENSI_INSJ_13interpolationINSJ_28FieldToParticleInterpolationIS17_NSJ_30AssignedTrilinearInterpolationEEESO_EENSI_INSJ_5shapeIS17_SO_EENSI_INSJ_14particlePusherINS15_6pusher5BorisESO_EESS_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENS_17HandleGuardRegionINS_9particles8policies17ExchangeParticlesENS15_8boundary29CallPluginsAndDeleteParticlesEEESS_SS_EESF_N8mallocMC9AllocatorIN6alpaka3acc12AccGpuCudaRtISt17integral_constantImLm3EEjEENS21_16CreationPolicies7ScatterINSJ_16DeviceHeapConfigENS29_11ScatterConf27DefaultScatterHashingParamsEEENS21_20DistributionPolicies4NoopENS21_11OOMPolicies10ReturnNullENS21_19ReservePoolPolicies9AlpakaBufIS28_EENS21_17AlignmentPolicies6ShrinkINS2M_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENS4_IS7_SF_NSI_INS_9multiMaskENSI_INS_12localCellIdxESV_Li0EEELi0EEES1S_S1Z_SS_NSI_INS_12NextFramePtrINSB_3argILi1EEEEENSI_INS_16PreviousFramePtrIS31_EESS_Li0EEELi0EEEEEEEEELj3EED1Ev+0x2c)[0x1046504c]
./picongpu(_ZN5pmacc18DeviceBufferInternINS_9SuperCellINS_5FrameINS_15ParticlesBufferINS_19ParticleDescriptionINS_4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENS_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESD_NSC_IiLi4EEEEEN5boost3mpl6v_itemIN8picongpu9weightingENSI_INSJ_8momentumENSI_INSJ_8positionINSJ_12position_picENS_13pmacc_isAliasEEENSH_7vector0INSB_2naEEELi0EEELi0EEELi0EEENSI_INSJ_11chargeRatioINSJ_20ChargeRatioElectronsESO_EENSI_INSJ_9massRatioINSJ_18MassRatioElectronsESO_EENSI_INSJ_7currentINSJ_13currentSolver9EsirkepovINSJ_9particles6shapes3TSCENS13_8strategy16CachedSupercellsELj3EEESO_EENSI_INSJ_13interpolationINSJ_28FieldToParticleInterpolationIS17_NSJ_30AssignedTrilinearInterpolationEEESO_EENSI_INSJ_5shapeIS17_SO_EENSI_INSJ_14particlePusherINS15_6pusher5BorisESO_EESS_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENS_17HandleGuardRegionINS_9particles8policies17ExchangeParticlesENS15_8boundary29CallPluginsAndDeleteParticlesEEESS_SS_EESF_N8mallocMC9AllocatorIN6alpaka3acc12AccGpuCudaRtISt17integral_constantImLm3EEjEENS21_16CreationPolicies7ScatterINSJ_16DeviceHeapConfigENS29_11ScatterConf27DefaultScatterHashingParamsEEENS21_20DistributionPolicies4NoopENS21_11OOMPolicies10ReturnNullENS21_19ReservePoolPolicies9AlpakaBufIS28_EENS21_17AlignmentPolicies6ShrinkINS2M_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENS4_IS7_SF_NSI_INS_9multiMaskENSI_INS_12localCellIdxESV_Li0EEELi0EEES1S_S1Z_SS_NSI_INS_12NextFramePtrINSB_3argILi1EEEEENSI_INS_16PreviousFramePtrIS31_EESS_Li0EEELi0EEEEEEEEELj3EED2Ev+0x68)[0x10465188]
./picongpu(_ZN5pmacc10GridBufferINS_9SuperCellINS_5FrameINS_15ParticlesBufferINS_19ParticleDescriptionINS_4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENS_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESD_NSC_IiLi4EEEEEN5boost3mpl6v_itemIN8picongpu9weightingENSI_INSJ_8momentumENSI_INSJ_8positionINSJ_12position_picENS_13pmacc_isAliasEEENSH_7vector0INSB_2naEEELi0EEELi0EEELi0EEENSI_INSJ_11chargeRatioINSJ_20ChargeRatioElectronsESO_EENSI_INSJ_9massRatioINSJ_18MassRatioElectronsESO_EENSI_INSJ_7currentINSJ_13currentSolver9EsirkepovINSJ_9particles6shapes3TSCENS13_8strategy16CachedSupercellsELj3EEESO_EENSI_INSJ_13interpolationINSJ_28FieldToParticleInterpolationIS17_NSJ_30AssignedTrilinearInterpolationEEESO_EENSI_INSJ_5shapeIS17_SO_EENSI_INSJ_14particlePusherINS15_6pusher5BorisESO_EESS_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENS_17HandleGuardRegionINS_9particles8policies17ExchangeParticlesENS15_8boundary29CallPluginsAndDeleteParticlesEEESS_SS_EESF_N8mallocMC9AllocatorIN6alpaka3acc12AccGpuCudaRtISt17integral_constantImLm3EEjEENS21_16CreationPolicies7ScatterINSJ_16DeviceHeapConfigENS29_11ScatterConf27DefaultScatterHashingParamsEEENS21_20DistributionPolicies4NoopENS21_11OOMPolicies10ReturnNullENS21_19ReservePoolPolicies9AlpakaBufIS28_EENS21_17AlignmentPolicies6ShrinkINS2M_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENS4_IS7_SF_NSI_INS_9multiMaskENSI_INS_12localCellIdxESV_Li0EEELi0EEES1S_S1Z_SS_NSI_INS_12NextFramePtrINSB_3argILi1EEEEENSI_INS_16PreviousFramePtrIS31_EESS_Li0EEELi0EEEEEEEEELj3ES39_ED2Ev+0x5c8)[0x104660d8]
./picongpu(_ZThn56_N8picongpu9ParticlesIN5pmacc4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEEN5boost3mpl6v_itemINS_11chargeRatioINS_20ChargeRatioElectronsENS1_13pmacc_isAliasEEENS7_INS_9massRatioINS_18MassRatioElectronsESA_EENS7_INS_7currentINS_13currentSolver9EsirkepovINS_9particles6shapes3TSCENSG_8strategy16CachedSupercellsELj3EEESA_EENS7_INS_13interpolationINS_28FieldToParticleInterpolationISK_NS_30AssignedTrilinearInterpolationEEESA_EENS7_INS_5shapeISK_SA_EENS7_INS_14particlePusherINSI_6pusher5BorisESA_EENS6_7vector0IN4mpl_2naEEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEENS7_INS_9weightingENS7_INS_8momentumENS7_INS_8positionINS_12position_picESA_EES13_Li0EEELi0EEELi0EEEED0Ev+0xbc)[0x1047779c]
./picongpu(_ZNSt19_Sp_counted_deleterIPN5pmacc15ISimulationDataESt14default_deleteIS1_ESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0x38)[0x102d4ce8]
./picongpu(_ZN5pmacc13DataConnectorD2Ev+0x24c)[0x10456aac]
/lib64/libc.so.6(+0x44994)[0x200000eb4994]
/lib64/libc.so.6(exit+0x24)[0x200000eb49e4]
/lib64/libc.so.6(+0x25208)[0x200000e95208]
/lib64/libc.so.6(__libc_start_main+0xc4)[0x200000e953f4]
======= Memory map: ========
10000000-10ac0000 r-xp 00000000 00:32 162131688502177885 /lustre/scratch2/ws/1/s5960712-ml_streaming/picInput/LaserOnly_streaming/.build/picongpu
10ac0000-10ad0000 r--p 00ab0000 00:32 162131688502177885 /lustre/scratch2/ws/1/s5960712-ml_streaming/picInput/LaserOnly_streaming/.build/picongpu
10ad0000-10ae0000 rw-p 00ac0000 00:32 162131688502177885 /lustre/scratch2/ws/1/s5960712-ml_streaming/picInput/LaserOnly_streaming/.build/picongpu
10ae0000-10b70000 rw-p 00000000 00:00 0
31860000-31bf0000 rw-p 00000000 00:00 0 [heap]
31bf0000-31c10000 rw-p 00000000 00:00 0 [heap]
31c10000-31c20000 rw-p 00000000 00:00 0 [heap]
31c20000-31c30000 rw-p 00000000 00:00 0 [heap]
31c30000-31c40000 rw-p 00000000 00:00 0 [heap]
31c40000-31c50000 rw-p 00000000 00:00 0 [heap]
31c50000-37510000 rw-p 00000000 00:00 0 [heap]
200000000-200400000 ---p 00000000 00:00 0
200400000-200600000 rw-s 00000000 00:06 119821 /dev/nvidiactl
200600000-200800000 rw-s 00000000 00:06 180240 /dev/nvidia0
200800000-200c00000 rw-s 00000000 00:05 559859184 /dev/zero (deleted)
200c00000-200e00000 rw-s 00000000 00:06 180240 /dev/nvidia0
200e00000-201e00000 ---p 00000000 00:00 0
201e00000-202000000 rw-s 00000000 00:06 119821 /dev/nvidiactl
202000000-202200000 rw-s 00000000 00:06 119821 /dev/nvidiactl
202200000-202600000 rw-s 00000000 00:05 559859185 /dev/zero (deleted)
202600000-202a00000 rw-s 00000000 00:05 559859186 /dev/zero (deleted)
202a00000-202e00000 rw-s 00000000 00:05 559859187 /dev/zero (deleted)
202e00000-203200000 rw-s 00000000 00:05 559859188 /dev/zero (deleted)
203200000-203600000 rw-s 00000000 00:05 559859189 /dev/zero (deleted)
203600000-203a00000 rw-s 00000000 00:05 559859190 /dev/zero (deleted)
203a00000-203e00000 rw-s 00000000 00:05 559859191 /dev/zero (deleted)
203e00000-204000000 rw-s 00000000 00:05 559859192 /dev/zero (deleted)
204000000-204200000 rw-s 00000000 00:05 559859193 /dev/zero (deleted)
204200000-204400000 rw-s 00000000 00:05 559859194 /dev/zero (deleted)
204400000-204600000 rw-s 00000000 00:05 559859195 /dev/zero (deleted)
204600000-204800000 rw-s 00000000 00:05 559859196 /dev/zero (deleted)
204800000-204a00000 rw-s 00000000 00:05 559859197 /dev/zero (deleted)
204a00000-204c00000 rw-s 00000000 00:05 559859198 /dev/zero (deleted)
204c00000-204e00000 rw-s 00000000 00:05 559859199 /dev/zero (deleted)
204e00000-205000000 rw-s 00000000 00:05 559859200 /dev/zero (deleted)
205000000-205200000 rw-s 00000000 00:05 559859201 /dev/zero (deleted)
205200000-205400000 rw-s 00000000 00:05 559859202 /dev/zero (deleted)
205400000-205600000 rw-s 00000000 00:05 559859203 /dev/zero (deleted)
205600000-205800000 rw-s 00000000 00:05 559879641 /dev/zero (deleted)
205800000-205a00000 rw-s 00000000 00:05 559879642 /dev/zero (deleted)
205a00000-205c00000 rw-s 00000000 00:05 559879643 /dev/zero (deleted)
205c00000-205e00000 rw-s 00000000 00:05 559879644 /dev/zero (deleted)
205e00000-206000000 rw-s 205e00000 00:06 194563 /dev/nvidia-uvm
206000000-206200000 rw-s 00000000 00:06 119821 /dev/nvidiactl
206200000-206400000 ---p 00000000 00:00 0
206400000-206600000 rw-s 00000000 00:06 119821 /dev/nvidiactl
206600000-206800000 rw-s 00000000 00:05 559893733 /dev/zero (deleted)
206800000-300200000 ---p 00000000 00:00 0
10000000000-10004000000 ---p 00000000 00:00 0
200000000000-200000030000 r-xp 00000000 09:01 201331802 /usr/lib64/ld-2.17.so
200000030000-200000040000 r--p 00020000 09:01 201331802 /usr/lib64/ld-2.17.so
200000040000-200000050000 rw-p 00030000 09:01 201331802 /usr/lib64/ld-2.17.so
200000050000-200000070000 r-xp 00000000 00:00 0 [vdso]
200000070000-2000001b0000 r-xp 00000000 00:2d 79992436 /software/ml/OpenMPI/3.1.4-gcccuda-2018b/lib/libmpi.so.40.10.4
2000001b0000-2000001c0000 r--p 00130000 00:2d 79992436 /software/ml/OpenMPI/3.1.4-gcccuda-2018b/lib/libmpi.so.40.10.4
2000001c0000-2000001e0000 rw-p 00140000 00:2d 79992436 /software/ml/OpenMPI/3.1.4-gcccuda-2018b/lib/libmpi.so.40.10.4
2000001e0000-2000001f0000 rw-p 00000000 00:00 0
2000001f0000-200000210000 r-xp 00000000 00:2d 79992440 /software/ml/OpenMPI/3.1.4-gcccuda-2018b/lib/libmpi_cxx.so.40.10.1
200000210000-200000220000 ---p 00020000 00:2d 79992440 /software/ml/OpenMPI/3.1.4-gcccuda-2018b/lib/libmpi_cxx.so.40.10.1
200000220000-200000230000 r--p 00020000 00:2d 79992440 /software/ml/OpenMPI/3.1.4-gcccuda-2018b/lib/libmpi_cxx.so.40.10.1
200000230000-200000240000 rw-p 00030000 00:2d 79992440 /software/ml/OpenMPI/3.1.4-gcccuda-2018b/lib/libmpi_cxx.so.40.10.1
200000240000-200000270000 r-xp 00000000 00:32 162131686539173694 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib/libboost_filesystem.so.1.71.0
200000270000-200000280000 r--p 00020000 00:32 162131686539173694 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib/libboost_filesystem.so.1.71.0
200000280000-200000290000 rw-p 00030000 00:32 162131686539173694 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib/libboost_filesystem.so.1.71.0
200000290000-2000002a0000 r-xp 00000000 00:32 162131686539190333 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib/libboost_system.so.1.71.0
2000002a0000-2000002b0000 r--p 00000000 00:32 162131686539190333 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib/libboost_system.so.1.71.0
2000002b0000-2000002c0000 rw-p 00010000 00:32 162131686539190333 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib/libboost_system.so.1.71.0
2000002c0000-200000370000 r-xp 00000000 00:32 162131686539174124 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib/libboost_math_tr1.so.1.71.0
200000370000-200000380000 r--p 000a0000 00:32 162131686539174124 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib/libboost_math_tr1.so.1.71.0
200000380000-200000390000 rw-p 000b0000 00:32 162131686539174124 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib/libboost_math_tr1.so.1.71.0
200000390000-2000003a0000 -w-s 00000000 00:06 180240 /dev/nvidia0
2000003a0000-2000003b0000 r--s ff010000 00:06 21669 /dev/infiniband/uverbs1
2000003b0000-2000003d0000 r-xp 00000000 09:01 207014255 /usr/lib64/libpthread-2.17.so
2000003d0000-2000003e0000 r--p 00010000 09:01 207014255 /usr/lib64/libpthread-2.17.so
2000003e0000-2000003f0000 rw-p 00020000 09:01 207014255 /usr/lib64/libpthread-2.17.so
2000003f0000-200000420000 r-xp 00000000 00:2d 68930154 /software/ml/zlib/1.2.11-GCCcore-7.3.0/lib/libz.so.1.2.11
200000420000-200000430000 r--p 00020000 00:2d 68930154 /software/ml/zlib/1.2.11-GCCcore-7.3.0/lib/libz.so.1.2.11
200000430000-200000440000 rw-p 00030000 00:2d 68930154 /software/ml/zlib/1.2.11-GCCcore-7.3.0/lib/libz.so.1.2.11
200000440000-2000004e0000 r-xp 00000000 00:32 162131686539174239 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib/libboost_program_options.so.1.71.0
2000004e0000-2000004f0000 r--p 00090000 00:32 162131686539174239 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib/libboost_program_options.so.1.71.0
2000004f0000-200000500000 rw-p 000a0000 00:32 162131686539174239 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib/libboost_program_options.so.1.71.0
200000500000-200000560000 r-xp 00000000 00:32 162131686539174327 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib/libboost_serialization.so.1.71.0
200000560000-200000570000 r--p 00050000 00:32 162131686539174327 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib/libboost_serialization.so.1.71.0
200000570000-200000580000 rw-p 00060000 00:32 162131686539174327 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib/libboost_serialization.so.1.71.0
200000580000-200000590000 r-xp 00000000 09:01 207014259 /usr/lib64/librt-2.17.so
200000590000-2000005a0000 r--p 00000000 09:01 207014259 /usr/lib64/librt-2.17.so
2000005a0000-2000005b0000 rw-p 00010000 09:01 207014259 /usr/lib64/librt-2.17.so
2000005b0000-200000630000 r-xp 00000000 00:2d 67380129 /software/ml/CUDA/9.2.88-GCC-7.3.0-2.30/targets/ppc64le-linux/lib/libcudart.so.9.2.148
200000630000-200000640000 rw-p 00070000 00:2d 67380129 /software/ml/CUDA/9.2.88-GCC-7.3.0-2.30/targets/ppc64le-linux/lib/libcudart.so.9.2.148
200000640000-200000710000 r-xp 00000000 09:01 201331822 /usr/lib64/libm-2.17.so
200000710000-200000720000 r--p 000c0000 09:01 201331822 /usr/lib64/libm-2.17.so
200000720000-200000730000 rw-p 000d0000 09:01 201331822 /usr/lib64/libm-2.17.so
200000730000-200000990000 r-xp 00000000 00:32 162131686539210679 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib64/libopenPMD.so
200000990000-2000009a0000 ---p 00260000 00:32 162131686539210679 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib64/libopenPMD.so
2000009a0000-2000009b0000 r--p 00260000 00:32 162131686539210679 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib64/libopenPMD.so
2000009b0000-2000009c0000 rw-p 00270000 00:32 162131686539210679 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib64/libopenPMD.so
2000009c0000-2000009d0000 r-xp 00000000 00:32 162131686539199953 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib64/libadios2_cxx11_mpi.so.2.6.0
2000009d0000-2000009e0000 r--p 00000000 00:32 162131686539199953 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib64/libadios2_cxx11_mpi.so.2.6.0
2000009e0000-2000009f0000 rw-p 00010000 00:32 162131686539199953 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib64/libadios2_cxx11_mpi.so.2.6.0
2000009f0000-200000b70000 r-xp 00000000 00:32 162131686539199949 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib64/libadios2_cxx11.so.2.6.0
200000b70000-200000b80000 r--p 00170000 00:32 162131686539199949 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib64/libadios2_cxx11.so.2.6.0
200000b80000-200000b90000 rw-p 00180000 00:32 162131686539199949 /lustre/scratch2/ws/1/s5960712-ml_streaming/pic_env/local/lib64/libadios2_cxx11.so.2.6.0
200000b90000-200000db0000 r-xp 00000000 00:2d 75652055 /software/ml/GCCcore/7.3.0/lib64/libstdc++.so.6.0.24
[taurusml31:26070] *** Process received signal ***
[taurusml31:26070] Signal: Aborted (6)
[taurusml31:26070] Signal code: (-6)
[taurusml31:26070] [ 0] [0x2000000504d8]
[taurusml31:26070] [ 1] /lib64/libc.so.6(abort+0x2b4)[0x200000eb2094]
[taurusml31:26070] [ 2] /lib64/libc.so.6(+0x88d10)[0x200000ef8d10]
[taurusml31:26070] [ 3] /lib64/libc.so.6(cfree+0x4a0)[0x200000f09be0]
[taurusml31:26070] [ 4] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_ZdlPv+0x18)[0x200000c47c38]
[taurusml31:26070] [ 5] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_ZdlPvm+0x18)[0x200000c47c78]
[taurusml31:26070] [ 6] ./picongpu(_ZN16cupla_cuda_async13cuplaFreeHostEPv+0x15c)[0x1055152c]
[taurusml31:26070] [ 7] ./picongpu(_ZN5pmacc6BufferINS_9SuperCellINS_5FrameINS_15ParticlesBufferINS_19ParticleDescriptionINS_4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENS_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESD_NSC_IiLi4EEEEEN5boost3mpl6v_itemIN8picongpu9weightingENSI_INSJ_8momentumENSI_INSJ_8positionINSJ_12position_picENS_13pmacc_isAliasEEENSH_7vector0INSB_2naEEELi0EEELi0EEELi0EEENSI_INSJ_11chargeRatioINSJ_20ChargeRatioElectronsESO_EENSI_INSJ_9massRatioINSJ_18MassRatioElectronsESO_EENSI_INSJ_7currentINSJ_13currentSolver9EsirkepovINSJ_9particles6shapes3TSCENS13_8strategy16CachedSupercellsELj3EEESO_EENSI_INSJ_13interpolationINSJ_28FieldToParticleInterpolationIS17_NSJ_30AssignedTrilinearInterpolationEEESO_EENSI_INSJ_5shapeIS17_SO_EENSI_INSJ_14particlePusherINS15_6pusher5BorisESO_EESS_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENS_17HandleGuardRegionINS_9particles8policies17ExchangeParticlesENS15_8boundary29CallPluginsAndDeleteParticlesEEESS_SS_EESF_N8mallocMC9AllocatorIN6alpaka3acc12AccGpuCudaRtISt17integral_constantImLm3EEjEENS21_16CreationPolicies7ScatterINSJ_16DeviceHeapConfigENS29_11ScatterConf27DefaultScatterHashingParamsEEENS21_20DistributionPolicies4NoopENS21_11OOMPolicies10ReturnNullENS21_19ReservePoolPolicies9AlpakaBufIS28_EENS21_17AlignmentPolicies6ShrinkINS2M_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENS4_IS7_SF_NSI_INS_9multiMaskENSI_INS_12localCellIdxESV_Li0EEELi0EEES1S_S1Z_SS_NSI_INS_12NextFramePtrINSB_3argILi1EEEEENSI_INS_16PreviousFramePtrIS31_EESS_Li0EEELi0EEEEEEEEELj3EED1Ev+0x2c)[0x1046504c]
[taurusml31:26070] [ 8] ./picongpu(_ZN5pmacc18DeviceBufferInternINS_9SuperCellINS_5FrameINS_15ParticlesBufferINS_19ParticleDescriptionINS_4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENS_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESD_NSC_IiLi4EEEEEN5boost3mpl6v_itemIN8picongpu9weightingENSI_INSJ_8momentumENSI_INSJ_8positionINSJ_12position_picENS_13pmacc_isAliasEEENSH_7vector0INSB_2naEEELi0EEELi0EEELi0EEENSI_INSJ_11chargeRatioINSJ_20ChargeRatioElectronsESO_EENSI_INSJ_9massRatioINSJ_18MassRatioElectronsESO_EENSI_INSJ_7currentINSJ_13currentSolver9EsirkepovINSJ_9particles6shapes3TSCENS13_8strategy16CachedSupercellsELj3EEESO_EENSI_INSJ_13interpolationINSJ_28FieldToParticleInterpolationIS17_NSJ_30AssignedTrilinearInterpolationEEESO_EENSI_INSJ_5shapeIS17_SO_EENSI_INSJ_14particlePusherINS15_6pusher5BorisESO_EESS_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENS_17HandleGuardRegionINS_9particles8policies17ExchangeParticlesENS15_8boundary29CallPluginsAndDeleteParticlesEEESS_SS_EESF_N8mallocMC9AllocatorIN6alpaka3acc12AccGpuCudaRtISt17integral_constantImLm3EEjEENS21_16CreationPolicies7ScatterINSJ_16DeviceHeapConfigENS29_11ScatterConf27DefaultScatterHashingParamsEEENS21_20DistributionPolicies4NoopENS21_11OOMPolicies10ReturnNullENS21_19ReservePoolPolicies9AlpakaBufIS28_EENS21_17AlignmentPolicies6ShrinkINS2M_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENS4_IS7_SF_NSI_INS_9multiMaskENSI_INS_12localCellIdxESV_Li0EEELi0EEES1S_S1Z_SS_NSI_INS_12NextFramePtrINSB_3argILi1EEEEENSI_INS_16PreviousFramePtrIS31_EESS_Li0EEELi0EEEEEEEEELj3EED2Ev+0x68)[0x10465188]
[taurusml31:26070] [ 9] ./picongpu(_ZN5pmacc10GridBufferINS_9SuperCellINS_5FrameINS_15ParticlesBufferINS_19ParticleDescriptionINS_4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENS_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESD_NSC_IiLi4EEEEEN5boost3mpl6v_itemIN8picongpu9weightingENSI_INSJ_8momentumENSI_INSJ_8positionINSJ_12position_picENS_13pmacc_isAliasEEENSH_7vector0INSB_2naEEELi0EEELi0EEELi0EEENSI_INSJ_11chargeRatioINSJ_20ChargeRatioElectronsESO_EENSI_INSJ_9massRatioINSJ_18MassRatioElectronsESO_EENSI_INSJ_7currentINSJ_13currentSolver9EsirkepovINSJ_9particles6shapes3TSCENS13_8strategy16CachedSupercellsELj3EEESO_EENSI_INSJ_13interpolationINSJ_28FieldToParticleInterpolationIS17_NSJ_30AssignedTrilinearInterpolationEEESO_EENSI_INSJ_5shapeIS17_SO_EENSI_INSJ_14particlePusherINS15_6pusher5BorisESO_EESS_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENS_17HandleGuardRegionINS_9particles8policies17ExchangeParticlesENS15_8boundary29CallPluginsAndDeleteParticlesEEESS_SS_EESF_N8mallocMC9AllocatorIN6alpaka3acc12AccGpuCudaRtISt17integral_constantImLm3EEjEENS21_16CreationPolicies7ScatterINSJ_16DeviceHeapConfigENS29_11ScatterConf27DefaultScatterHashingParamsEEENS21_20DistributionPolicies4NoopENS21_11OOMPolicies10ReturnNullENS21_19ReservePoolPolicies9AlpakaBufIS28_EENS21_17AlignmentPolicies6ShrinkINS2M_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENS4_IS7_SF_NSI_INS_9multiMaskENSI_INS_12localCellIdxESV_Li0EEELi0EEES1S_S1Z_SS_NSI_INS_12NextFramePtrINSB_3argILi1EEEEENSI_INS_16PreviousFramePtrIS31_EESS_Li0EEELi0EEEEEEEEELj3ES39_ED2Ev+0x5c8)[0x104660d8]
[taurusml31:26070] [10] ./picongpu(_ZThn56_N8picongpu9ParticlesIN5pmacc4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEEN5boost3mpl6v_itemINS_11chargeRatioINS_20ChargeRatioElectronsENS1_13pmacc_isAliasEEENS7_INS_9massRatioINS_18MassRatioElectronsESA_EENS7_INS_7currentINS_13currentSolver9EsirkepovINS_9particles6shapes3TSCENSG_8strategy16CachedSupercellsELj3EEESA_EENS7_INS_13interpolationINS_28FieldToParticleInterpolationISK_NS_30AssignedTrilinearInterpolationEEESA_EENS7_INS_5shapeISK_SA_EENS7_INS_14particlePusherINSI_6pusher5BorisESA_EENS6_7vector0IN4mpl_2naEEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEENS7_INS_9weightingENS7_INS_8momentumENS7_INS_8positionINS_12position_picESA_EES13_Li0EEELi0EEELi0EEEED0Ev+0xbc)[0x1047779c]
[taurusml31:26070] [11] ./picongpu(_ZNSt19_Sp_counted_deleterIPN5pmacc15ISimulationDataESt14default_deleteIS1_ESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0x38)[0x102d4ce8]
[taurusml31:26070] [12] ./picongpu(_ZN5pmacc13DataConnectorD2Ev+0x24c)[0x10456aac]
[taurusml31:26070] [13] /lib64/libc.so.6(+0x44994)[0x200000eb4994]
[taurusml31:26070] [14] /lib64/libc.so.6(exit+0x24)[0x200000eb49e4]
[taurusml31:26070] [15] /lib64/libc.so.6(+0x25208)[0x200000e95208]
[taurusml31:26070] [16] /lib64/libc.so.6(__libc_start_main+0xc4)[0x200000e953f4]
[taurusml31:26070] *** End of error message ***
Aborted
|
Current list of "defective" (?) nodes:
|
I prepared the code we need to debug the issue:
|
The analysis has (nearly) finished: there seems to be no working node. |
@psychocoderHPC I will test your approach. |
@psychocoderHPC The stdout of the picongpu run on a single GPU can be found below:
|
@psychocoderHPC If you need stderr as well, let me know. (I already lost the node access again.) |
native CUDA reproducer
[Updated the example to guarantee a crash (increased the last allocation size)] |
Result of offline work with @psychocoderHPC: |
Setting the reserved GPU memory from 350MB to 2047MB solved the problem for now. |
While attempting to run a 256^3 cells per GPU PIConGPU simulation (laser only), I ran into the following memory error on the taurus V100 nodes:
@steindev did you encounter this before? I
@psychocoderHPC @sbastrakov Do you have any idea what might have caused this? Is this really a device memory issue?