We need to investigate the best performance-tuning strategy for the CUDA backend. One knob is the trade-off between thread block size (threads per block) and the number of blocks launched per kernel.
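
As a starting point for that investigation, here is a minimal sketch of how the two quantities are coupled for a 1D launch: for a fixed problem size, picking a block size determines the number of blocks. The function name `launch_configs`, the candidate block sizes, and the one-thread-per-element mapping are assumptions made for illustration, not something taken from the backend. It only enumerates configurations to benchmark; it does not predict which one is fastest.

```python
WARP_SIZE = 32               # threads per warp on current NVIDIA GPUs
MAX_THREADS_PER_BLOCK = 1024  # hardware limit on compute capability 2.0 and later


def launch_configs(n_elements, block_sizes=(64, 128, 256, 512, 1024)):
    """Enumerate candidate 1D launch configurations (hypothetical helper).

    Assumes one thread per element. Returns a list of
    (block_size, num_blocks, idle_threads) tuples, where idle_threads is
    the number of launched threads that have no element to process.
    """
    configs = []
    for block_size in block_sizes:
        if block_size % WARP_SIZE != 0 or block_size > MAX_THREADS_PER_BLOCK:
            raise ValueError(f"unsupported block size: {block_size}")
        # Ceiling division: enough blocks to cover every element.
        num_blocks = (n_elements + block_size - 1) // block_size
        idle_threads = num_blocks * block_size - n_elements
        configs.append((block_size, num_blocks, idle_threads))
    return configs


if __name__ == "__main__":
    for block_size, num_blocks, idle in launch_configs(1_000_000):
        print(f"block={block_size:5d} blocks={num_blocks:6d} idle={idle}")
```

Each tuple could then be timed against the real kernels to see where the backend performs best on a given device. Restricting candidates to multiples of the warp size is a common convention rather than a requirement.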