Making picamera2 use all the CPU cores effectively #578
Replies: 8 comments 11 replies
-
Hi, so I guess the first point is that as far as I can tell, you're really only using 2 cores at a time. Because thread t3 waits for t1 to finish (and t4 for t2), they may as well just be the same thread. Apart from that, the MJPEG encode happens in hardware, the event loop shouldn't be doing much apart from your callback and the display, and I would hope the HTTP server doesn't eat much CPU. As regards the GIL, yes it does stop Python running in parallel. But are you calling down into what are really "large C/C++ functions"? Normally the GIL would be released while that C call is in progress, so you would actually get benefits from parallelism. Here are some other things to consider:
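On the GIL point, the effect can be demonstrated without any camera at all: NumPy's matrix multiply runs inside C code that releases the GIL, so two Python threads doing large multiplies really do overlap. This is a sketch of my own (function names and sizes are illustrative, not from the thread):

```python
import threading
import numpy as np

def matmul_work(n=1000, reps=3):
    """CPU-heavy work done inside NumPy's C code, which releases
    the GIL while each multiply runs."""
    a = np.random.rand(n, n)
    for _ in range(reps):
        a = a @ a
        a /= np.abs(a).max()  # keep values bounded between iterations
    return a

def run_in_threads(num_threads=2, n=1000, reps=3):
    """Run matmul_work in several threads; because the GIL is
    released inside the multiply, the threads can use separate cores."""
    threads = [threading.Thread(target=matmul_work, args=(n, reps))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Watching `top` while `run_in_threads(2)` executes should show the process well above 100% CPU, which is the parallelism being described here.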
-
Just to add that changing the resolution to 1600x1200 doesn't need to change the sensor crop. It might do that by default for this sensor, but all you need to do is specify the raw frame size to stop it (add a full-size raw stream to the configuration).
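The exact snippet was lost from this reply; a hedged reconstruction of pinning the raw frame size might look like this (the sizes shown assume the HQ camera's full-frame mode, and it needs a Pi with a camera attached to actually run):

```python
# Sketch only: requires a Raspberry Pi with an attached camera.
# The raw size below is the HQ camera's full-frame mode; adjust
# for your own sensor.
from picamera2 import Picamera2

picam2 = Picamera2()
config = picam2.create_video_configuration(
    main={"size": (1600, 1200)},
    raw={"size": (4056, 3040)},  # pin the sensor to the full frame
)
picam2.configure(config)
```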
-
Important discovery this morning: on my RP 4B, numpy is only using one of the four cores! To check this on your own system, start a large matrix multiply and watch `top` in another terminal.
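The test listing itself was lost from this post; a reconstruction of the sort of benchmark described (names and the matrix size are mine) is:

```python
import time
import numpy as np

def benchmark_matmul(n=2000):
    """Time an n x n matrix multiply. Watch `top` in another terminal
    while it runs: with a multithreaded BLAS the Python process should
    show well over 100% CPU."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    t0 = time.perf_counter()
    c = a @ b
    return time.perf_counter() - t0, c

if __name__ == "__main__":
    elapsed, _ = benchmark_matmul()
    print(f"matmul took {elapsed:.2f} s")
```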
When the matrix multiply begins, if you have proper multi-core support, the CPU column of `top` should show the Python process well above 100%. I installed numpy using apt. I have a completely stock, "by the book", 64-bit raspbian install. Could someone please suggest how to fix my numpy installation?
Cheers,
-
Hi again. Yes, I tried your test case and indeed, it is abominably slow. It took nearly a minute. As far as I can tell the version of numpy in apt is really very old, it seems to be 1.19.5. Try the following:
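The command itself was elided here; presumably something along these lines (the exact invocation is my assumption, and you may need `--user` or `sudo` depending on your setup):

```shell
# Upgrade numpy from PyPI, replacing the old apt version for this user
pip3 install --upgrade numpy
```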
This upgraded me to version 1.24.2. To double-check that you're really using this, check what this reports:
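The check that was elided here is presumably along these lines (my reconstruction):

```python
import numpy as np

print(np.__version__)  # should now report the pip-installed version
np.show_config()       # look for "openblas" in the BLAS/LAPACK sections
```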
This new version appears to use the OpenBLAS library which supports both 64-bit NEON extensions and multithreading. The same test completes in around 4 seconds. Maybe give this a try and report back what you find. Thanks!
-
The situation is not entirely straightforward. Ultimately I'm hoping that we'll be able to use official libcamera packages, and official libcamera Python bindings from pip, but I don't think there's anything like that available at the moment. As things stand you have to get libcamera and python3-libcamera from apt; they aren't available anywhere else. You can get Picamera2 from pip, but then it won't update automatically if libcamera gets updated. Note that libcamera does not yet have a stable API, so sticking to apt will avoid breakages.

OpenCV and pyqt5 are a problem because the pip versions conflict with the versions in apt, and they really don't play nicely together. Also the pip versions usually (in my experience) fail to install on the 32-bit OS; they often compile for many many hours and then fail. Yet the apt ones install perfectly in about 10 seconds. You'd probably have more luck on a 64-bit platform, but there's a limit to how many different combinations of versions we can verify.

I think other stuff should mostly be safe from pip. Certainly numpy is fine; they seem to maintain backwards compatibility. I'll see if we can get a more up to date version in our next major OS update.
-
I made quite a bit of progress. First, I can suggest a good way to make use of newer packages from pip without disturbing the system installation. The idea is to create a virtual environment using `venv`. Here is a transcript of creating a virtual environment, and installing a number of packages there with pip:
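The transcript itself was lost from this post; a hypothetical reconstruction (paths are examples) would be along these lines. The key detail is `--system-site-packages`, which keeps the apt-installed libcamera/picamera2 modules visible inside the venv while newer pip packages shadow the system ones:

```shell
# Create a directory-per-environment layout (paths are examples)
mkdir -p ~/envs/cam1
python3 -m venv --system-site-packages ~/envs/cam1
cd ~/envs/cam1
source bin/activate
pip install --upgrade numpy jupyter
```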
One can create multiple directories, each with its own environment. Once inside the directory, activate it with the usual `source bin/activate` command. Some notes on the commands above:
I found this very helpful, because it enabled me to have two jupyter notebooks running side by side in the same browser, but with completely different library/module stacks inside. This makes a side-by-side comparison very simple.

I'll post something about my own code later. It's now running at 36 FPS. But the speedup came from using four threads rather than two for part of my analysis.
Cheers,
-
I found the reason that my code was running at 36 FPS, but sometimes slowed down and sometimes sped up. The culprit was the noise reduction mode. I found modes 0 and 3 to be fast enough for 40 FPS, and the other modes to be too slow. For the record, here's the configuration:
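The configuration listing was lost from this post; a hedged reconstruction using libcamera's draft NoiseReductionMode control might look like this (in that enum, 0 = Off, 1 = Fast, 2 = HighQuality, 3 = Minimal, 4 = ZSL, which matches "modes 0 and 3" being the fast ones). It needs a Pi with a camera attached to actually run:

```python
# Sketch only: requires a Raspberry Pi with an attached camera.
from picamera2 import Picamera2
from libcamera import controls

picam2 = Picamera2()
config = picam2.create_video_configuration(
    main={"size": (1600, 1200)},
    controls={"NoiseReductionMode":
              controls.draft.NoiseReductionModeEnum.Minimal},  # mode 3
)
picam2.configure(config)
```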
For my application, simply turning off the noise reduction works very well.
-
Thanks for the update. That's very interesting. I would probably go with "minimal" then because that uses only hardware denoising over on the GPU, and therefore runs at the full pixel rate with no latency at all. "Fast" is fast compared to the high quality setting, but does run in software after the hardware has finished and therefore does add some latency (though not enough that video recording apps would notice, unless they are adding further processing).
-
I'm using a RP 4B, which has four cores. My picamera2 code runs at 30 FPS but not faster. According to "top", there are more than two cores idle.
My question: how can I get those other cores working to speed things up to 40FPS?
My code reads high-resolution frames from an HQ camera, analyses them, then uses the analysis results to annotate low-resolution frames, which are displayed and streamed. To avoid copying overhead, I am using `picam2.pre_callback`. There is also a standard picamera2 event loop. The callback is structured as follows:
I have read that the Python Global Interpreter Lock (GIL) prevents parallel execution, but have found that running four threads (as above) speeds up execution compared to the single-thread alternative. Note that threads 1 and 2 only READ the main.array, so they can safely execute in parallel. Likewise, although threads 3 and 4 modify the lores.array, those changes are independent: they can be done in any order and/or interleaved. So threads 3 and 4 can also execute in parallel.
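The original callback listing did not survive; the four-thread pattern just described might be sketched like this, where `analyse`, `annotate`, and the array shapes are stand-ins for the real analysis code:

```python
import threading
import numpy as np

def analyse(main_array, region):
    # t1/t2: read-only analysis of part of the main array (stand-in)
    return float(main_array[region].mean())

def annotate(lores_array, row, value):
    # t3/t4: independent writes to disjoint parts of lores (stand-in)
    lores_array[row, :] = value

def pre_callback_sketch(main_array, lores_array):
    res = {}
    h = main_array.shape[0] // 2

    def read_top():     # t1: analyse top half (read-only)
        res["top"] = analyse(main_array, np.s_[:h])

    def read_bottom():  # t2: analyse bottom half (read-only)
        res["bottom"] = analyse(main_array, np.s_[h:])

    t1 = threading.Thread(target=read_top)
    t2 = threading.Thread(target=read_bottom)
    t1.start(); t2.start()

    t1.join()  # t3 needs t1's result before annotating
    t3 = threading.Thread(target=annotate, args=(lores_array, 0, res["top"]))
    t3.start()

    t2.join()  # likewise t4 waits for t2
    t4 = threading.Thread(target=annotate, args=(lores_array, 1, res["bottom"]))
    t4.start()

    t3.join(); t4.join()
    return res
```

Because t3 only starts after t1 finishes (and t4 after t2), the t1+t3 and t2+t4 pairs could equally be collapsed into two worker threads, each doing its analyse step followed by its annotate step.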
Here is the overall structure of the picamera2 code. (Note: the streaming server part is based on the mjpeg_server.py example code.)
I'd be grateful if someone could clarify why running multiple threads has sped up my code. Can threads t1 and t2 execute in parallel because they are only reading main.array? I have read that because of the GIL, only IO-bound code can benefit from threading. In some sense, my code is "IO-bound" reading main.array.
I would also be grateful for suggestions about how I can better structure the code to use more CPU cores. For example, should/could I run the MJPEG encoder and/or the HTTP server in a different process, connected by pipes or sockets to the picamera2 process?
Thanks!
Bruce