The Kokkos Lectures: Module 5 Q&A
12:27:44 From Ed D'Azevedo : Do we need to know that the scratch memory will fit in hardware shared memory on the GPU, or does Kokkos allocate scratch memory in global device memory?
12:29:42 From Daniel Arndt : That depends on the level you request.
12:31:02 From Ed D'Azevedo : Is level 0 the hardware shared memory in the SM?
12:31:15 From Damien Lebrun-Grandie : Yes.
12:32:16 From Junchao Zhang : What if one asks for more than what the hardware allows?
12:32:31 From Daniel Arndt : Then you get a runtime error.
12:34:16 From Damien Lebrun-Grandie : You can query the limits from the team policy.
12:34:40 From Junchao Zhang : Got it.
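The scratch-memory exchange above corresponds roughly to the pattern below: a minimal sketch, assuming a recent Kokkos release, in which the scratch level is chosen when calling `set_scratch_size` and the per-level limit is queried from the `TeamPolicy` (here via `scratch_size_max`, which is assumed to be the query referred to above).

```c++
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using ExecSpace   = Kokkos::DefaultExecutionSpace;
    using TeamPolicy  = Kokkos::TeamPolicy<ExecSpace>;
    using member_type = TeamPolicy::member_type;
    // Unmanaged View placed in the execution space's scratch memory.
    using ScratchView = Kokkos::View<double*, ExecSpace::scratch_memory_space,
                                     Kokkos::MemoryUnmanaged>;

    const int n_teams  = 128;
    const int per_team = 512;  // doubles of scratch per team
    const size_t bytes = ScratchView::shmem_size(per_team);

    TeamPolicy policy(n_teams, Kokkos::AUTO);

    // Level 0 is backed by fast on-chip shared memory on GPU backends,
    // level 1 by global device memory. Asking for more level-0 scratch than
    // the backend allows is a runtime error, so query the limit and fall
    // back to level 1 if necessary.
    const int level =
        (bytes <= static_cast<size_t>(policy.scratch_size_max(0))) ? 0 : 1;

    Kokkos::parallel_for(
        "scratch_example",
        policy.set_scratch_size(level, Kokkos::PerTeam(bytes)),
        KOKKOS_LAMBDA(const member_type& team) {
          // Per-team temporary buffer allocated from the requested level.
          ScratchView tmp(team.team_scratch(level), per_team);
          Kokkos::parallel_for(Kokkos::TeamThreadRange(team, per_team),
                               [&](const int i) { tmp(i) = double(i); });
          team.team_barrier();
        });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```

In short: level 0 maps to the on-chip shared memory of the SM, level 1 to global device memory, and exceeding the level-0 limit produces a runtime error rather than a silent fallback.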
12:38:31 From Dmitry Mikushin : GPUs have their own SIMD builtin types such as float4 or double2, but that's really a CPU-like concept, nothing to do with warps.
12:40:31 From Junchao Zhang : Is complex a scalar type?
12:42:28 From Daisy Hollman : Yes, complex can be a scalar.
12:42:47 From Daisy Hollman : I believe. Need to check one sec.
13:08:00 From Peter Bosler : Are Kokkos Kernels blas functions blocking?
13:13:15 From Glenn Brook : Can device instances live/execute on different physical devices?
13:14:05 From Daisy Hollman : @Glenn it's possible; Cuda instances can do anything that Cuda streams can do.
13:15:06 From Junchao Zhang : Does reallocation of unmanaged views also imply a fence?
13:15:26 From Daisy Hollman : No.
13:15:35 From Junchao Zhang : Does deallocation ….
13:15:44 From Daisy Hollman : Nothing to do with unmanaged views implies a fence.
13:16:04 From Daisy Hollman : And don't rely on deallocation fencing in the future.
13:16:13 From Daisy Hollman : Just be aware of it for performance reasons.
13:16:18 From Philipp Grete : How important is the lightweight property in terms of getting overlapping kernel execution?
13:16:46 From Daisy Hollman : Redundant fences are fine, so if you need a fence semantically, don't rely on an implicit one happening.
13:16:59 From Daisy Hollman : I am.
13:22:28 From Philipp Grete : Regarding the implicit fence on View dealloc: does this also apply to subviews?
13:25:47 From Daniel Arndt : If the subview is the last View referencing the underlying memory, it is responsible for deallocating it.
13:28:42 From Christian Trott : It's not the reference counting that is blocking, only the deallocation.
13:34:29 From James C Phillips : Can you respawn with multiple dependencies?
13:35:12 From Junchao Zhang : Is there a cudaEvent concept in Kokkos?
13:36:19 From Christian Trott : Yes, you can respawn with multiple dependencies.
13:36:27 From Christian Trott : There is no cudaEvent concept in Kokkos.
13:37:14 From Christian Trott : We are working on the equivalent of CUDA Graphs,
13:37:36 From Christian Trott : which we believe will be more useful (and more efficient) for coarse-grained dependency management.
13:41:12 From Peter Hakel : How does this compare to std::async?
13:42:59 From Christian Trott : That is a complicated question, and will probably spawn a 20 min explanation by Daisy. I will ask her at the end whether she can summarize in 5 mins :-)
13:44:37 From Martin Pokorny : Can Kokkos futures be passed around by value?
13:45:57 From Christian Trott : I think so, but I will pose the question in a min too.
13:47:41 From Christian Trott : The Exercise is in Exercises/tasking, not Intro-Full/Exercises/08.
13:50:16 From Daisy Hollman : Yes, Kokkos futures are value types.
13:50:18 From Daisy Hollman : I'm muted.
13:51:45 From Anders Johansson : Ref. streams and concurrent kernels: does running multiple MPI ranks per GPU give a similar effect? (Presumably less efficient, but how bad?)
13:52:23 From Anders Johansson : Where each MPI rank launches some Kokkos kernels without thinking of streams.
13:52:41 From Christian Trott : Yes, Anders,
13:52:49 From Christian Trott : that is something a lot of people do.
13:53:19 From Peter Hakel : OK, thanks.
13:54:27 From Anders Johansson : OK, thanks! Our code also does quite a bit of non-Kokkos work (including a ScaLAPACK diagonalization), so multiple MPI ranks per GPU seems to run quite a bit faster.
13:56:28 From Sean Isaac Geronimo Anderson : Thank you!
13:56:35 From Henry Moncada : thank
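On the final thread about streams and concurrent kernels: the in-process alternative to oversubscribing a GPU with MPI ranks is to submit independent kernels to separate execution space instances. The sketch below assumes a Kokkos version that provides Kokkos::Experimental::partition_space; on the CUDA backend each instance wraps its own stream, which is what the "Cuda instances can do anything that Cuda streams can do" remark refers to.

```c++
#include <Kokkos_Core.hpp>
#include <vector>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using ExecSpace = Kokkos::DefaultExecutionSpace;

    // Split the default instance into two independent instances
    // (equal weights). On the CUDA backend each gets its own stream.
    std::vector<ExecSpace> inst =
        Kokkos::Experimental::partition_space(ExecSpace(), 1, 1);

    const int N = 1 << 20;
    Kokkos::View<double*> a("a", N), b("b", N);

    // Kernels submitted to different instances may execute concurrently
    // if the hardware has spare resources.
    Kokkos::parallel_for(
        "fill_a", Kokkos::RangePolicy<ExecSpace>(inst[0], 0, N),
        KOKKOS_LAMBDA(const int i) { a(i) = 1.0 * i; });
    Kokkos::parallel_for(
        "fill_b", Kokkos::RangePolicy<ExecSpace>(inst[1], 0, N),
        KOKKOS_LAMBDA(const int i) { b(i) = 2.0 * i; });

    // Fence only the instances that were used, not the whole device.
    inst[0].fence();
    inst[1].fence();
  }
  Kokkos::finalize();
  return 0;
}
```

Running multiple MPI ranks per GPU, as discussed above, achieves a similar overlap without any of this, at the cost of extra ranks sharing the device.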