Slow performance when using multiple streams #157
Comments
Hi @kyuhoJeong11, does your machine have enough resources? Likely the issue is that you run out of resources when adding several streams.
I use an RTX 4090 GPU, a 13th Gen Intel(R) Core(TM) i9-13900KS CPU, and 64 GB of memory.

To add further explanation, it works fine with up to 2 streams. However, when using 3 or more streams, the incoming frames start to lag. Even after setting the stream-buffer-size to 1, the frames still continue to lag. The values in the attached image represent [the time from the RTSP stream relay server // the time from Pipeless // and the difference between the two times]. Although the image says 'device', please read it as the stream number. As the number of streams increases, the time difference between the relay server and the moment the data is received keeps growing. I am not sure what is causing this or how to resolve it.
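For reference, the measurement described above (relay time vs. reception time) amounts to something like the following minimal sketch. This is only an illustration, not code from the thread: `relay_time` stands for the timestamp shown by the relay server, however it is actually obtained.

```python
import time

def report_stream_latency(channel: int, relay_time: float) -> float:
    """Print and return how far behind the relay server this stream currently is.

    `relay_time` is the wall-clock time at which the relay server emitted the
    frame; the return value corresponds to the "difference" column described above.
    """
    received_time = time.time()
    lag = received_time - relay_time
    print(f"stream {channel}: relay={relay_time:.3f} received={received_time:.3f} lag={lag:.3f}s")
    return lag
```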
Is the RTSP relay server running on a different machine? Could it be because of the connection? If the time between the relay server sending the stream and Pipeless receiving it increases as you add more streams, it could be due to the connection or to the resources on the relay server. Can you explain your setup further?
Currently, Pipeless and the relay server are running on the same machine. The RTSP streams are generated via mediamtx, and the relay server only relays the streams without performing any additional role. The RTSP streams are accessed through 127.0.0.1. In the processing stage, as shown in the code below, we discard the current frame if inference for that channel is already in progress.

```python
import numpy as np
import torch
from torchvision.transforms import ToTensor, Normalize
import cv2
from PIL import Image
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt
import torch.nn.functional as F
from torch import Tensor
from threading import Lock
import time
from pathlib import Path
from typing import List, Tuple, Union
from numpy import ndarray
# cur_number = 1
inference_status = [False for _ in range(16)]  # Per-channel "inference in progress" flags
streams = [cuda.Stream() for _ in range(16)]   # One CUDA stream per channel
def letterbox(im: ndarray,
              new_shape: Union[Tuple, List] = (640, 640),
              color: Union[Tuple, List] = (114, 114, 114)) \
        -> Tuple[ndarray, float, Tuple[float, float]]:
    # Resize and pad image while meeting stride-multiple constraints
    shape = im.shape[:2]  # current shape [height, width]
    if isinstance(new_shape, int):
        new_shape = (new_shape, new_shape)
    # new_shape: [width, height]
    # Scale ratio (new / old)
    r = min(new_shape[0] / shape[1], new_shape[1] / shape[0])
    # Compute padding [width, height]
    new_unpad = int(round(shape[1] * r)), int(round(shape[0] * r))
    dw, dh = new_shape[0] - new_unpad[0], new_shape[1] - new_unpad[1]  # wh padding
    dw /= 2  # divide padding into 2 sides
    dh /= 2
    if shape[::-1] != new_unpad:  # resize
        im = cv2.resize(im, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    im = cv2.copyMakeBorder(im,
                            top,
                            bottom,
                            left,
                            right,
                            cv2.BORDER_CONSTANT,
                            value=color)  # add border
    return im, r, (dw, dh)
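# Example usage (mirrors the commented-out call in hook() below; the 640x384
# target size is an assumption taken from that call):
#   resized, ratio, (dw, dh) = letterbox(frame_bgr, (640, 384))
# where `frame_bgr` is a hypothetical HxWx3 BGR ndarray.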
def blob(im: ndarray, return_seg: bool = False) -> Union[ndarray, Tuple]:
    seg = None
    if return_seg:
        seg = im.astype(np.float32) / 255
    im = im.transpose([2, 0, 1])
    im = im[np.newaxis, ...]
    im = np.ascontiguousarray(im).astype(np.float32) / 255
    if return_seg:
        return im, seg
    else:
        return im
def xywh2xyxy(i):
    """
    Converts from (center-x, center-y, w, h) to (x1, y1, x2, y2)
    """
    o = i.clone()  # Clone the tensor so the input is not modified in place
    o[..., 0] = i[..., 0] - i[..., 2] / 2
    o[..., 1] = i[..., 1] - i[..., 3] / 2
    o[..., 2] = i[..., 0] + i[..., 2]
    o[..., 3] = i[..., 1] + i[..., 3]
    return o
def clip_boxes(boxes, shape):
    boxes[..., [0, 2]] = torch.clamp(boxes[..., [0, 2]], 0, shape[1])  # x1, x2
    boxes[..., [1, 3]] = torch.clamp(boxes[..., [1, 3]], 0, shape[0])  # y1, y2
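# NOTE: the post-processing below assumes a YOLOv8-style detection output of
# shape (1, 4 + num_classes, num_anchors); after squeeze(0).T each row becomes
# [cx, cy, w, h, class_score_0, ...]. This layout is inferred from the indexing
# used in the function, not stated elsewhere in the thread.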
def postprocess_yolo(original_frame_shape, resized_img_shape, output):
    confidence_thres = 0.45
    iou_thres = 0.5
    original_height, original_width, _ = original_frame_shape
    resized_height, resized_width, _ = resized_img_shape
    outputs = torch.tensor(output[0]).squeeze(0).T  # Convert to torch tensor and transpose
    # Get the number of rows in the outputs array
    rows = outputs.shape[0]
    boxes = []
    scores = []
    class_ids = []
    # Calculate the scaling factors for the bounding box coordinates
    if original_height > original_width:
        scale_factor = original_height / resized_height
    else:
        scale_factor = original_width / resized_width
    # Iterate over each row in the outputs array
    for i in range(rows):
        classes_scores = outputs[i, 4:]
        # Skip rows with NaN or invalid values
        if torch.isnan(classes_scores).any() or (classes_scores == 1).any():
            continue
        max_score = torch.max(classes_scores)
        if max_score >= confidence_thres:
            class_id = torch.argmax(classes_scores)  # Get the class ID with the highest score
            x, y, w, h = outputs[i, 0], outputs[i, 1], outputs[i, 2], outputs[i, 3]
            # Calculate the scaled coordinates of the bounding box
            if original_height > original_width:
                pad = (resized_width - original_width / scale_factor) // 2
                left = int((x - pad) * scale_factor)
                top = int(y * scale_factor)
            else:
                pad = (resized_height - original_height / scale_factor) // 2
                left = int(x * scale_factor)
                top = int((y - pad) * scale_factor)
            width = int(w * scale_factor)
            height = int(h * scale_factor)
            class_ids.append(class_id.item())
            scores.append(max_score.item())
            boxes.append([left, top, width, height])
    if len(boxes) > 0:
        boxes = torch.tensor(boxes, dtype=torch.float32)
        scores = torch.tensor(scores, dtype=torch.float32)
        class_ids = torch.tensor(class_ids, dtype=torch.int64)
        clip_boxes(boxes, original_frame_shape)  # Apply clipping
        boxes = xywh2xyxy(boxes)  # Convert from (cx, cy, w, h) to (x1, y1, x2, y2)
        # Perform Non-Maximum Suppression (NMS) using PyTorch
        indices = torch.ops.torchvision.nms(boxes, scores, iou_thres)
        return boxes[indices], scores[indices], class_ids[indices]
    else:
        return torch.tensor([]), torch.tensor([]), torch.tensor([])
def allocate_buffers(engine, context, batch_size):
    inputs = []
    outputs = []
    allocations = []
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        is_input = False
        if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
            is_input = True
        shape = context.get_tensor_shape(name)
        dtype = np.dtype(trt.nptype(engine.get_tensor_dtype(name)))
        if is_input and shape[0] < 0:
            assert engine.num_optimization_profiles > 0
            profile_shape = engine.get_tensor_profile_shape(name, 0)
            assert len(profile_shape) == 3  # min, opt, max
            # Set the *max* profile as binding shape
            context.set_input_shape(name, (batch_size, profile_shape[2][1], profile_shape[2][2], profile_shape[2][3]))
            shape = context.get_tensor_shape(name)
        size = dtype.itemsize
        for s in shape:
            size *= s
        allocation = cuda.mem_alloc(size)
        host_allocation = None if is_input else np.zeros(shape, dtype)
        binding = {
            "index": i,
            "name": name,
            "dtype": dtype,
            "shape": list(shape),
            "allocation": allocation,
            "host_allocation": host_allocation,
        }
        allocations.append(allocation)
        if is_input:
            inputs.append(binding)
        else:
            outputs.append(binding)
    return inputs, outputs, allocations
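# NOTE: execute_v2() below runs inference synchronously (it does not take a CUDA
# stream), while the two memcpy_*_async() calls are issued on the per-channel
# stream passed in; the final stream.synchronize() waits for those copies.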
def do_inference(context, inputs, outputs, stream, windows):
    # Copy input data from host to device (entire batch)
    cuda.memcpy_htod_async(inputs[0]['allocation'], windows.ravel(), stream)
    # Execute the inference (batch is processed as a whole)
    context.execute_v2([inputs[0]['allocation'], outputs[0]['allocation']])
    # Copy output data from device to host (entire batch)
    cuda.memcpy_dtoh_async(outputs[0]['host_allocation'], outputs[0]['allocation'], stream)
    # Synchronize the stream to ensure all operations are completed
    stream.synchronize()
    # Return the output host buffer as a numpy array
    return outputs[0]['host_allocation']
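# hook() is the per-frame processing callback. As used here, `frame` carries the
# decoded image under 'inference_input' plus metadata such as 'channel_number',
# and `context` carries objects prepared at initialization (the TensorRT engine
# and the pushed/popped CUDA device context). The per-channel inference_status
# flag implements the "discard the current frame if inference is already in
# progress" behaviour described above.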
def hook(frame, context):
    global streams
    # global cur_number
    cur_frame = frame['inference_input']
    engine = context['engine']
    # frame_number = frame['frame_number']
    channel = frame['channel_number']
    device = context['device']
    context_device = context['context']
    # Drop the frame if inference for this channel is still in progress
    if inference_status[channel - 1]:
        return
    inference_status[channel - 1] = True
    context_device.push()
    # print(f'frame_number: {frame["frame_number"]}')
    # print(f'channel{channel}, frame: {frame["frame_number"]}')
    try:
        # cur_frame, ratio, dwdh = letterbox(cur_frame, (640, 384))
        # rgb = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2RGB)
        # input = blob(rgb, return_seg=False)
        batch_size = 1
        # tensor = torch.asarray(tensor, device=device)
        # inference
        if engine is not None:
            with engine.create_execution_context() as exec_context:
                inputs, outputs, allocations = allocate_buffers(engine, exec_context, batch_size)
                # NOTE: 'input' is produced by the blob() call above, which is commented
                # out in the code as posted, and the result of do_inference() is then
                # replaced by the empty placeholder on the following line.
                output = do_inference(exec_context, inputs, outputs, streams[channel - 1], input)
                output = [[[]]]
                bboxes, scores, class_ids = postprocess_yolo(cur_frame.shape, (384, 640, 3), output)
                frame['inference_output'] = bboxes.cpu().numpy().astype('float32')
                frame['user_data'] = True
                # for allocation in allocations:
                #     allocation.free()
    finally:
        # torch.cuda.synchronize()
        torch.cuda.empty_cache()
        context_device.pop()
        inference_status[channel - 1] = False
```
Hello, I have a question.
Currently, I am trying to use multiple RTSP streams. However, when I run several streams at once, processing becomes very slow, making it practically unusable.
I am wondering if there is any way to resolve this issue.
I am currently using a modified version of the create_uri_bin function. Could this be the cause of the issue? Below is the code I am currently using after modification.