Making picamera2 use all the CPU cores effectively #578
Replies: 8 comments 11 replies
-
Hi, so I guess the first point is that as far as I can tell, you're really only using 2 cores at a time. Because thread t3 waits for t1 to finish (and t4 for t2), they may as well just be the same thread. Apart from that, the MJPEG encode happens in hardware, the event loop shouldn't be doing much apart from your callback and the display, and I would hope the HTTP server doesn't eat much CPU. As regards the GIL, yes it does stop Python running in parallel. But are you calling down into what are really "large C/C++ functions"? Normally the GIL would be released while that C call is in progress, so you would actually get benefits from parallelism. Here are some other things to consider:
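On the GIL point, the effect can be demonstrated without any camera at all: NumPy's matrix multiply runs inside C code that releases the GIL, so two Python threads doing large multiplies really do overlap. This is a sketch of my own (function names and sizes are illustrative, not from the thread):

```python
import threading
import numpy as np

def matmul_work(n=1000, reps=3):
    """CPU-heavy work done inside NumPy's C code, which releases
    the GIL while each multiply runs."""
    a = np.random.rand(n, n)
    for _ in range(reps):
        a = a @ a
        a /= np.abs(a).max()  # keep values bounded between iterations
    return a

def run_in_threads(num_threads=2, n=1000, reps=3):
    """Run matmul_work in several threads; because the GIL is
    released inside the multiply, the threads can use separate cores."""
    threads = [threading.Thread(target=matmul_work, args=(n, reps))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Watching `top` while `run_in_threads(2)` executes should show the process well above 100% CPU, which is the parallelism being described here.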
-
Just to add that changing the resolution to 1600x1200 doesn't need to change the sensor crop. It might do that by default for this sensor, but all you need to do is specify the raw frame size to stop it (add a full-size raw stream to the configuration).
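The exact snippet was lost from this reply; a hedged reconstruction of pinning the raw frame size might look like this (the sizes shown assume the HQ camera's full-frame mode, and it needs a Pi with a camera attached to actually run):

```python
# Sketch only: requires a Raspberry Pi with an attached camera.
# The raw size below is the HQ camera's full-frame mode; adjust
# for your own sensor.
from picamera2 import Picamera2

picam2 = Picamera2()
config = picam2.create_video_configuration(
    main={"size": (1600, 1200)},
    raw={"size": (4056, 3040)},  # pin the sensor to the full frame
)
picam2.configure(config)
```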
-
Important discovery this morning: on my RP 4B, numpy is only using one of the four cores! To check this on your own system, start a large matrix multiply and watch `top` in another terminal.
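The test listing itself was lost from this post; a reconstruction of the sort of benchmark described (names and the matrix size are mine) is:

```python
import time
import numpy as np

def benchmark_matmul(n=2000):
    """Time an n x n matrix multiply. Watch `top` in another terminal
    while it runs: with a multithreaded BLAS the Python process should
    show well over 100% CPU."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    t0 = time.perf_counter()
    c = a @ b
    return time.perf_counter() - t0, c

if __name__ == "__main__":
    elapsed, _ = benchmark_matmul()
    print(f"matmul took {elapsed:.2f} s")
```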
When the matrix multiply begins, if you have proper multi-core support, the CPU column of `top` should show the Python process well above 100%. I installed numpy using apt. I have a completely stock, "by the book", 64-bit raspbian install. Could someone please suggest how to fix my numpy installation?
Cheers,
-
Hi again. Yes, I tried your test case and indeed, it is abominably slow. It took nearly a minute. As far as I can tell the version of numpy in apt is really very old, it seems to be 1.19.5. Try the following:
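The command itself was elided here; presumably something along these lines (the exact invocation is my assumption, and you may need `--user` or `sudo` depending on your setup):

```shell
# Upgrade numpy from PyPI, replacing the old apt version for this user
pip3 install --upgrade numpy
```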
This upgraded me to version 1.24.2. To double-check that you're really using this, check what this reports:
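The check that was elided here is presumably along these lines (my reconstruction):

```python
import numpy as np

print(np.__version__)  # should now report the pip-installed version
np.show_config()       # look for "openblas" in the BLAS/LAPACK sections
```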
This new version appears to use the OpenBLAS library which supports both 64-bit NEON extensions and multithreading. The same test completes in around 4 seconds. Maybe give this a try and report back what you find. Thanks!
-
The situation is not entirely straightforward. Ultimately I'm hoping that we'll be able to use official libcamera packages, and official libcamera Python bindings from pip, but I don't think there's anything like that available at the moment. As things stand you have to get libcamera and python3-libcamera from apt; they aren't available anywhere else. You can get Picamera2 from pip, but then it won't update automatically if libcamera gets updated. Note that libcamera does not yet have a stable API, so sticking to apt will avoid breakages.

OpenCV and pyqt5 are a problem because the pip versions conflict with the versions in apt, and they really don't play nicely together. Also the pip versions usually (in my experience) fail to install on the 32-bit OS; they often compile for many many hours and then fail. Yet the apt ones install perfectly in about 10 seconds. You'd probably have more luck on a 64-bit platform, but there's a limit to how many different combinations of versions we can verify.

I think other stuff should mostly be safe from pip. Certainly numpy is fine; they seem to maintain backwards compatibility. I'll see if we can get a more up to date version in our next major OS update.
-
I made quite a bit of progress. First, I can suggest a good way to make use of newer packages from pip without disturbing the system installation. The idea is to create a virtual environment using `venv`. Here is a transcript of creating a virtual environment, and installing a number of packages there with pip:
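The transcript itself was lost from this post; a hypothetical reconstruction (paths are examples) would be along these lines. The key detail is `--system-site-packages`, which keeps the apt-installed libcamera/picamera2 modules visible inside the venv while newer pip packages shadow the system ones:

```shell
# Create a directory-per-environment layout (paths are examples)
mkdir -p ~/envs/cam1
python3 -m venv --system-site-packages ~/envs/cam1
cd ~/envs/cam1
source bin/activate
pip install --upgrade numpy jupyter
```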
One can create multiple directories, each with its own environment. Once inside the directory, activate it with the usual `source bin/activate` command. Some notes on the commands above:
I found this very helpful, because it enabled me to have two jupyter notebooks running side by side in the same browser, but with completely different library/module stacks inside. This makes a side-by-side comparison very simple.

I'll post something about my own code later. It's now running at 36 FPS. But the speedup came from using four threads rather than two for part of my analysis.
Cheers,
-
I found the reason that my code was running at 36 FPS, but sometimes slowed down and sometimes sped up. The culprit was the noise reduction mode. I found modes 0 and 3 to be fast enough for 40 FPS, and the other modes to be too slow. For the record, here's the configuration:
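The configuration listing was lost from this post; a hedged reconstruction using libcamera's draft NoiseReductionMode control might look like this (in that enum, 0 = Off, 1 = Fast, 2 = HighQuality, 3 = Minimal, 4 = ZSL, which matches "modes 0 and 3" being the fast ones). It needs a Pi with a camera attached to actually run:

```python
# Sketch only: requires a Raspberry Pi with an attached camera.
from picamera2 import Picamera2
from libcamera import controls

picam2 = Picamera2()
config = picam2.create_video_configuration(
    main={"size": (1600, 1200)},
    controls={"NoiseReductionMode":
              controls.draft.NoiseReductionModeEnum.Minimal},  # mode 3
)
picam2.configure(config)
```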
For my application, simply turning off the noise reduction works very well.
-
Thanks for the update. That's very interesting. I would probably go with "minimal" then because that uses only hardware denoising over on the GPU, and therefore runs at the full pixel rate with no latency at all. "Fast" is fast compared to the high quality setting, but does run in software after the hardware has finished and therefore does add some latency (though not enough that video recording apps would notice, unless they are adding further processing).
-
I'm using a RP 4B, which has four cores. My picamera2 code runs at 30 FPS but not faster. According to "top", there are more than two cores idle.
My question: how can I get those other cores working to speed things up to 40FPS?
My code reads high-resolution frames from an HQ camera, analyses them, then uses the analysis results to annotate low-resolution frames, which are displayed and streamed. To avoid copying overhead, I am using `picam2.pre_callback`. There is also a standard picamera2 event loop. The callback is structured as follows:
I have read that the Python Global Interpreter Lock (GIL) prevents parallel execution, but have found that running four threads (as above) speeds up execution compared to the single-thread alternative. Note that threads 1 and 2 only READ the main.array, so they can safely execute in parallel. Likewise, although threads 3 and 4 modify the lores.array, those changes are independent: they can be done in any order and/or interleaved. So threads 3 and 4 can also execute in parallel.
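The original callback listing did not survive; the four-thread pattern just described might be sketched like this, where `analyse`, `annotate`, and the array shapes are stand-ins for the real analysis code:

```python
import threading
import numpy as np

def analyse(main_array, region):
    # t1/t2: read-only analysis of part of the main array (stand-in)
    return float(main_array[region].mean())

def annotate(lores_array, row, value):
    # t3/t4: independent writes to disjoint parts of lores (stand-in)
    lores_array[row, :] = value

def pre_callback_sketch(main_array, lores_array):
    res = {}
    h = main_array.shape[0] // 2

    def read_top():     # t1: analyse top half (read-only)
        res["top"] = analyse(main_array, np.s_[:h])

    def read_bottom():  # t2: analyse bottom half (read-only)
        res["bottom"] = analyse(main_array, np.s_[h:])

    t1 = threading.Thread(target=read_top)
    t2 = threading.Thread(target=read_bottom)
    t1.start(); t2.start()

    t1.join()  # t3 needs t1's result before annotating
    t3 = threading.Thread(target=annotate, args=(lores_array, 0, res["top"]))
    t3.start()

    t2.join()  # likewise t4 waits for t2
    t4 = threading.Thread(target=annotate, args=(lores_array, 1, res["bottom"]))
    t4.start()

    t3.join(); t4.join()
    return res
```

Because t3 only starts after t1 finishes (and t4 after t2), the t1+t3 and t2+t4 pairs could equally be collapsed into two worker threads, each doing its analyse step followed by its annotate step.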
Here is the overall structure of the picamera2 code. (Note: the streaming server part is based on the mjpeg_server.py example code.)
I'd be grateful if someone could clarify why running multiple threads has sped up my code. Can threads t1 and t2 execute in parallel because they are only reading main.array? I have read that because of the GIL, only IO-bound code can benefit from threading. In some sense, my code is "IO-bound" reading main.array.
I would also be grateful for suggestions about how I can better structure the code to use more CPU cores. For example, should/could I run the MJPEG encoder and/or the HTTP server in a different process, connected by pipes or sockets to the picamera2 process?
Thanks!
Bruce