Why cuda is slower than cpu #536

Open
Hukongtao opened this issue Mar 5, 2025 · 3 comments
Labels
question Further information is requested

Comments


Hukongtao commented Mar 5, 2025

🐛 Describe the bug

Minimal reproducible code

import time

import torch
from torchcodec.decoders import VideoDecoder

device = "cpu"
# device = "cuda"

video_path = "NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4_small.mp4"
decoder = VideoDecoder(
    video_path,
    device=device,
    # dimension_order="NHWC",
    # seek_mode="approximate",
    num_ffmpeg_threads=8,
)

index_list = [106, 125, 127, 130, 132, 144, 146, 171, 180, 181, 189, 194, 195, 199, 203, 204, 204, 214, 227, 242, 259, 263, 266, 296, 303, 314, 320, 323, 325, 328, 333, 338, 338, 350, 370, 373, 381, 384, 384, 384, 396, 400, 444, 448, 452, 463, 467, 470, 473, 479, 487, 489, 489, 529, 532, 532, 559, 564, 570, 617, 649, 658, 658, 665, 674, 691, 703, 704, 716, 718, 733, 743, 750, 754, 765, 777, 786, 792, 814, 818, 818, 821, 833, 847, 854, 858, 859, 877, 881, 891, 925, 926, 928, 949, 954, 967, 975, 982, 987, 990]

# Warm-up call so one-time setup costs are excluded from the timed loop.
data = decoder.get_frames_at(indices=index_list)

loop_number = 20
start_time = time.time()
for i in range(loop_number):
    data = decoder.get_frames_at(indices=index_list)
    print(data.data.shape)
if device == "cuda":
    torch.cuda.synchronize()  # wait for pending GPU work before stopping the timer
end_time = time.time()
# print(data.data)
# print(data.data.shape)

print(f"spend time: {(end_time - start_time) / loop_number}")

cpu spend time: 0.9169630885124207
gpu spend time: 3.559278666973114

Switch between the two devices and you will find that CUDA takes more time than the CPU. I don't understand why.

Versions

As mentioned above

scotts (Contributor) commented Mar 6, 2025

Hi, @Hukongtao! What you're observing is also apparent in our benchmark results. I'm actually going to break it down into several questions:

What is going on in the system that leads to this performance?

TorchCodec pays some upfront costs to decode frames on the GPU: the initialization cost of creating the context objects that enable GPU decoding, and the transfer cost of moving the encoded frames to the GPU. At the moment, GPU decoding is only a win when those initialization and transfer costs are less than the cost of doing the decoding on the CPU.

The initialization cost is a constant; the transfer cost is linear in the size of the frames. The decoding itself will always be faster on the GPU than on a single CPU thread. (I'm confident in this claim because the GPU has dedicated functional units for decoding: NVDECs!) Let's call the difference between GPU decoding and single-thread CPU decoding "the win." As video resolution increases, "the win" tends to get bigger, so at some resolution it becomes faster to use the GPU; but whenever the initialization and transfer costs exceed the win, it's faster to decode on the CPU.
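
To make that concrete, here is a back-of-the-envelope model of the trade-off in Python. Every constant is a made-up placeholder; only the shape of the comparison matters.

# Illustrative break-even model. All constants are hypothetical placeholders.
t_init = 0.50       # one-time GPU context setup, seconds (made up)
t_transfer = 0.002  # per-frame transfer of encoded packets to the GPU (made up)
t_gpu = 0.003       # per-frame decode on an NVDEC (made up)
t_cpu = 0.010       # per-frame decode on a single CPU thread (made up)

def gpu_total(n_frames: int) -> float:
    return t_init + n_frames * (t_transfer + t_gpu)

def cpu_total(n_frames: int) -> float:
    return n_frames * t_cpu

# "The win" per frame is t_cpu - (t_transfer + t_gpu); the GPU only pays off
# once n_frames times the win exceeds the constant t_init.
for n in (10, 100, 1000):
    print(f"{n} frames: GPU faster? {gpu_total(n) < cpu_total(n)}")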

The wrinkle here is multiple threads. There are only so many NVDECs on a GPU; I believe it's usually fewer than 10. However, we're now in the era of dozens of available CPU cores. Even though a single CPU thread is slower than a single NVDEC at decoding a video, there may be many, many more CPU cores available than NVDECs. That can make it faster overall to use your available CPU cores than the GPU.
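
As a rough illustration of leaning on many CPU cores (a sketch under assumptions, not an official TorchCodec recipe): give each worker thread its own decoder and split the frame indices among the workers. Here video_path and index_list are the ones from the reproduction above, and NUM_WORKERS is a made-up value.

from concurrent.futures import ThreadPoolExecutor

from torchcodec.decoders import VideoDecoder

NUM_WORKERS = 8  # hypothetical; tune to your machine's core count

def decode_chunk(chunk):
    # One decoder per task, since we don't assume decoder objects are thread-safe.
    decoder = VideoDecoder(video_path, device="cpu", num_ffmpeg_threads=1)
    return decoder.get_frames_at(indices=chunk)

# Round-robin split of the requested indices across the workers.
chunks = [index_list[i::NUM_WORKERS] for i in range(NUM_WORKERS)]
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    batches = list(pool.map(decode_chunk, chunks))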

What can TorchCodec do to improve the situation?

We can't eliminate the transfer costs, as that's fundamental. But we can reduce the initialization costs. That is something we consider important, but have not yet started work on. This is a great area for new folks to jump into! :) Issue #460 is a great starting point.

What should you consider when deciding CPU versus GPU right now?

  • Consider the resolution of the videos you're decoding. Lower-resolution videos will tend to be faster on the CPU overall.
  • Consider how many CPU cores your systems have access to when decoding, compared to the number of NVDECs on your GPU.
  • Consider what transforms you want to apply to your decoded frames. That's the purpose of the first column of experiments in our benchmark; that experiment simulates the kind of training jobs we have seen: multiple threads decode videos, and after decoding, each applies a transform to the frames. Such transforms can also run on the GPU, where they are almost always faster. They can often be sped up even more if the decoding also happens on the GPU, because then we only have to send the encoded data (compressed) to the GPU rather than the decoded data (uncompressed). A sketch of this pattern follows the list.
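
A minimal sketch of that last pattern, assuming a CUDA-enabled TorchCodec build and torchvision installed; video_path and index_list are from the reproduction above.

from torchcodec.decoders import VideoDecoder
from torchvision.transforms import v2

# Decode on the GPU so only the compressed packets cross the PCIe bus.
decoder = VideoDecoder(video_path, device="cuda")
frames = decoder.get_frames_at(indices=index_list)

# frames.data is an NCHW uint8 tensor that already lives on the GPU, so the
# resize below also runs on the GPU with no extra copy of decoded pixels.
resize = v2.Resize(size=(224, 224))
resized = resize(frames.data)
print(resized.shape, resized.device)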

Hukongtao (Author)

Thank you very much for your detailed and patient explanation; it's really useful to me. By the way, I'd like to ask another question: according to the official documentation, TorchCodec uses NVIDIA's NVDEC hardware decoder, but I did not see where the NVIDIA Video Codec SDK is used in the source code. Can you explain this? @scotts

[Image: screenshot of the documentation]

scotts (Contributor) commented Mar 7, 2025

@Hukongtao, good observation! We rely on FFmpeg for all of our decoding, and FFmpeg is smart enough to use the NVDECs on a GPU when they're available. Check out the C++ file CudaDevice.cpp to see where we make all of our CUDA-specific calls into FFmpeg.
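
If you want to confirm this on your own machine, one quick (if indirect) check is to ask an FFmpeg binary which hardware acceleration methods it was built with. Note this probes the ffmpeg CLI on your PATH, which may differ from the FFmpeg libraries TorchCodec actually links against.

import subprocess

# `ffmpeg -hwaccels` lists the hardware acceleration methods this FFmpeg
# build supports, one per line (e.g. cuda, vaapi, videotoolbox).
out = subprocess.run(
    ["ffmpeg", "-hide_banner", "-hwaccels"],
    capture_output=True, text=True, check=True,
)
print("cuda available:", "cuda" in out.stdout.split())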

@scotts scotts added the question Further information is requested label Mar 8, 2025