Why is CUDA slower than the CPU? #536
Comments
Hi, @Hukongtao! What you're observing also shows up in our benchmark results. Let me break it down into several questions.

What is going on in the system that leads to this performance?

TorchCodec pays some upfront costs to decode frames on the GPU: the initialization cost of creating the context objects that enable GPU decoding, and the transfer cost of moving the encoded frames to the GPU. At the moment, for GPU decoding to be a win, those initialization and transfer costs must be less than the cost of doing the decoding on the CPU. The initialization cost is a constant; the transfer cost is linear in the size of the frames.

The actual decoding on the GPU will always be faster than the actual decoding on a single CPU thread. (I'm confident in this claim because the GPU has dedicated functional units, NVDECs, just for decoding!) Let's call the difference between GPU decoding and single-thread CPU decoding "the win." As the video resolution increases, "the win" tends to get bigger. But if the initialization and transfer costs are greater than the win, it's still faster to decode on the CPU; as resolution grows, the win grows with it, and at some point the GPU comes out ahead.

The wrinkle here is multiple threads. There are only so many NVDECs on a GPU; I believe it's usually fewer than 10. Meanwhile, we're now in the era of dozens of CPU cores. Even though a single CPU thread is slower at decoding a video than a single NVDEC, there may be many, many more available CPU cores than NVDECs, so using all of your CPU cores can be faster overall than using the GPU.

What can TorchCodec do to improve the situation?

We can't eliminate the transfer costs; those are fundamental. But we can reduce the initialization costs. That's something we consider important but have not yet started work on, and it's a great area for new folks to jump into! :) See issue #460.

What should you consider when deciding CPU versus GPU right now?

Based on the above, mainly two things: the resolution of your videos (the higher it is, the bigger the GPU's win) and how many CPU cores you have available relative to the NVDECs on your GPU.
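To make the crossover concrete, here is a toy cost model. The constants are made-up illustrative numbers, not measured TorchCodec figures, and the linear CPU-thread scaling is an assumption:

```python
# Toy cost model for choosing CPU vs. GPU decoding.
# All constants below are illustrative assumptions, not measured values.

GPU_INIT_COST_S = 2.0                # one-time CUDA/NVDEC context setup
GPU_TRANSFER_COST_PER_MB_S = 0.002   # host-to-device copy of encoded packets
GPU_DECODE_PER_FRAME_S = 0.001       # per-frame decode on one NVDEC
CPU_DECODE_PER_FRAME_S = 0.004       # per-frame decode on one CPU thread


def gpu_time(num_frames: int, encoded_mb: float) -> float:
    # Constant init cost + transfer cost linear in data size + per-frame decode.
    return (GPU_INIT_COST_S
            + GPU_TRANSFER_COST_PER_MB_S * encoded_mb
            + GPU_DECODE_PER_FRAME_S * num_frames)


def cpu_time(num_frames: int, num_threads: int = 1) -> float:
    # Assume near-linear scaling across CPU threads for simplicity.
    return CPU_DECODE_PER_FRAME_S * num_frames / num_threads


for frames in (100, 1_000, 10_000):
    g = gpu_time(frames, encoded_mb=frames * 0.05)
    c = cpu_time(frames, num_threads=8)
    print(f"{frames:6d} frames: gpu={g:.2f}s cpu(8 threads)={c:.2f}s "
          f"-> {'GPU' if g < c else 'CPU'} wins")
```

With these made-up numbers, short clips favor the CPU (the fixed GPU setup cost dominates) and long or high-resolution videos favor the GPU; the real crossover point depends on your hardware and content.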
Thank you very much for your detailed and patient explanation, which is really useful to me.
@Hukongtao, good observation! We rely on FFmpeg for all of our decoding, and FFmpeg is smart enough to use the NVDECs on a GPU when they're available. Check out the C++ file CudaDevice.cpp to see where we make all of our CUDA-specific calls to FFmpeg.
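If you want to sanity-check NVDEC availability yourself, here is one quick probe. It inspects whatever `ffmpeg` binary is on your PATH, which may not be the exact FFmpeg build that TorchCodec links against, so treat it only as a rough indicator:

```python
# Check whether the ffmpeg CLI on PATH reports CUDA/NVDEC hardware acceleration.
import shutil
import subprocess

if shutil.which("ffmpeg") is None:
    print("No ffmpeg binary on PATH")
else:
    out = subprocess.run(
        ["ffmpeg", "-hide_banner", "-hwaccels"],
        capture_output=True, text=True, check=True,
    ).stdout
    # "cuda" in the hwaccel list means this ffmpeg build can use NVDEC.
    print("cuda hwaccel listed:", "cuda" in out.split())
```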
🐛 Describe the bug
Minimal reproducible code
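A sketch of the kind of script that produces the timings below (not the exact original; the file path is a placeholder, and it assumes torchcodec's `VideoDecoder` indexing API):

```python
# Hedged sketch: decode every frame of a local file on a given device and time it.
# "video.mp4" is a placeholder; the numbers will vary by machine.
import time

from torchcodec.decoders import VideoDecoder


def decode_all(device: str) -> float:
    start = time.time()
    # Decoder creation is inside the timing, so GPU init cost is included.
    decoder = VideoDecoder("video.mp4", device=device)
    for i in range(len(decoder)):
        _frame = decoder[i]  # decoded frame tensor on `device`
    return time.time() - start


print("cpu spend time:", decode_all("cpu"))
print("gpu spend time:", decode_all("cuda"))
```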
cpu spend time: 0.9169630885124207
gpu spend time: 3.559278666973114
Use different devices, and you will find that CUDA takes more time than the CPU. I don't understand why.
Versions
As mentioned above