Bad performance with Auto-GPTQ kernel

I can reproduce the reported results without real_quant. However, with real_quant enabled and auto_gptq, the results are very poor (~250 on wikitext using the Llama2-7b-W3A16g128 pretrained weights). The problem is probably in the auto_gptq kernel, but I cannot install it normally by following the official procedure (see the last issue). Any help would be appreciated!