I have completed the CUDA version. It runs at about 20k passwords per second (a single CPU core manages about 200 per second), and the whole cracking process can finish in 3 hours.
I have decided not to publish the code unless Tencent significantly increases the encryption strength and makes it strong enough to resist GPU attacks.
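
For a sense of scale, here is a rough back-of-the-envelope calculation using only the figures quoted above. It assumes the run is purely throughput-bound at the stated rates, which may not hold exactly; the variable names are just for illustration.

```python
# Rough arithmetic from the figures quoted above (all approximate).
GPU_RATE = 20_000   # passwords per second on the GPU (CUDA version)
CPU_RATE = 200      # passwords per second on a single CPU core
GPU_HOURS = 3       # stated duration of the full run on the GPU

# Number of candidate passwords tested in a 3-hour GPU run.
candidates = GPU_RATE * GPU_HOURS * 3600

# Speedup of the GPU over one CPU core.
speedup = GPU_RATE / CPU_RATE

# Time a single CPU core would need for the same number of candidates.
cpu_hours = candidates / CPU_RATE / 3600

print(f"candidates tested: {candidates:,}")
print(f"GPU speedup over one core: {speedup:.0f}x")
print(f"single-core time: {cpu_hours:.0f} hours")
```

Under those assumptions, the same search on a single CPU core would take on the order of days rather than hours.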