Commit acc4fb2

Added CUDA opOutputToCvMat for images as well
1 parent 56bc772 commit acc4fb2

8 files changed, 215 additions and 178 deletions

doc/faq.md

Lines changed: 113 additions & 86 deletions
Large diffs are not rendered by default.

doc/installation.md

Lines changed: 9 additions & 7 deletions
@@ -21,8 +21,8 @@ OpenPose - Installation
 7. [3D Reconstruction Module](#3d-reconstruction-module)
 8. [Calibration Module](#calibration-module)
 9. [Compiling without cuDNN](#compiling-without-cudnn)
-10. [Custom Caffe (Ubuntu Only)](#custom-caffe-ubuntu-only)
-11. [Custom OpenCV (Ubuntu Only)](#custom-opencv-ubuntu-only)
+10. [Custom Caffe](#custom-caffe)
+11. [Custom OpenCV](#custom-opencv)
 12. [Doxygen Documentation Autogeneration (Ubuntu Only)](#doxygen-documentation-autogeneration-ubuntu-only)
 13. [CMake Command Line Configuration (Ubuntu Only)](#cmake-command-line-configuration-ubuntu-only)

@@ -353,18 +353,20 @@ Then, you would have to reduce the `--net_resolution` flag to fit the model into



-#### Custom Caffe (Ubuntu Only)
-Note that OpenPose uses a [custom fork of Caffe](https://github.com/CMU-Perceptual-Computing-Lab/caffe) (rather than the official Caffe master). Our custom fork is only updated if it works on our machines, but we try to keep it updated with the latest Caffe version. This version works on a newly formatted machine (Ubuntu 16.04 LTS) and in all our machines (CUDA 8 and 10 tested). The default GPU version is the master branch, which it is also compatible with CUDA 10 without changes (official Caffe version might require some changes for it). We also use the OpenCL and CPU tags if their CMake flags are selected.
+#### Custom Caffe
+OpenPose uses a [custom fork of Caffe](https://github.com/CMU-Perceptual-Computing-Lab/caffe) (rather than the official Caffe master). Our custom fork is only updated if it works on our machines, but we try to keep it updated with the latest Caffe version. This version works on a newly formatted machine (Ubuntu 16.04 LTS) and in all our machines (CUDA 8 and 10 tested). The default GPU version is the master branch, which it is also compatible with CUDA 10 without changes (official Caffe version might require some changes for it). We also use the OpenCL and CPU tags if their CMake flags are selected. We only modified some Caffe compilation flags and minor details.

-We only modified some Caffe compilation flags and minor details. You can use your own Caffe distribution, simply specify the Caffe include path and the library as shown below. You will also need to turn off the `BUILD_CAFFE` variable. Note that cuDNN is required in order to get the maximum possible accuracy in OpenPose.
+Alternatively, you can use your own Caffe distribution on Ubuntu/Mac by 1) disabling `BUILD_CAFFE`, 2) setting `Caffe_INCLUDE_DIRS` to `{CAFFE_PATH}/include/caffe`, and 3) setting `Caffe_LIBS` to `{CAFFE_PATH}/build/lib/libcaffe.so`, as shown in the image below. Note that cuDNN-compatible Caffe version is required in order to get the maximum possible accuracy in OpenPose.
 <p align="center">
     <img src="media/cmake_installation/im_5.png", width="480">
 </p>

+For Windows, simply replace the OpenCV DLLs and include folder for your custom one.


-#### Custom OpenCV (Ubuntu Only)
-If you have built OpenCV from source and OpenPose cannot find it automatically, you can set the `OPENCV_DIR` variable to the directory where you build OpenCV.
+
+#### Custom OpenCV
+If you have built OpenCV from source and OpenPose cannot find it automatically, you can set the `OPENCV_DIR` variable to the directory where you build OpenCV (Ubuntu and Mac). For Windows, simply replace the OpenCV DLLs and include folder for your custom one.



src/openpose/core/cvMatToOpInput.cpp

Lines changed: 3 additions & 0 deletions
@@ -87,6 +87,9 @@ namespace op
             // CUDA version (if #Gpus > n)
             else
             {
+                // Note: This version reduces the global accuracy about 0.1%, so it is disabled for now
+                error("This version reduces the global accuracy about 0.1%, so it is disabled for now.",
+                      __LINE__, __FUNCTION__, __FILE__);
                 #ifdef USE_CUDA
                     // (Re)Allocate temporary memory
                     const unsigned int inputImageSize = 3 * cvInputData.rows * cvInputData.cols;

src/openpose/core/cvMatToOpOutput.cpp

Lines changed: 2 additions & 1 deletion
@@ -90,7 +90,8 @@ namespace op
             else
             {
                 #ifdef USE_CUDA
-                    // Input image can be shared between this one and cvMatToOpInput.hpp
+                    // Input image can be shared between this one and cvMatToOpInput.hpp
+                    // However, that version reduces the global accuracy a bit
                     // (Free and re-)Allocate temporary memory
                     const unsigned int inputImageSize = 3 * cvInputData.rows * cvInputData.cols;
                     if (pInputMaxSize < inputImageSize)

src/openpose/core/opOutputToCvMat.cpp

Lines changed: 1 addition & 1 deletion
@@ -96,7 +96,7 @@ namespace op
                     cvMat = cv::Mat(outputData.getSize(0), outputData.getSize(1), CV_8UC3);
                     // CUDA --> CPU: Copy output image back to CPU
                     cudaMemcpy(
-                        cvMat.data, pOutputImageUCharCuda, sizeof(unsigned char) * mOutputMaxSizeUChar,
+                        cvMat.data, pOutputImageUCharCuda, sizeof(unsigned char) * volume,
                         cudaMemcpyDeviceToHost);
                     // Indicate memory was copied out
                     *spGpuMemoryAllocated = false;
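
The one-line change above bounds the device-to-host copy by `volume` instead of `mOutputMaxSizeUChar`. The following is a minimal host-only sketch of why that bound matters, assuming `volume` is the byte count of the current image (rows x cols x 3 for `CV_8UC3`; the diff does not show its definition) and that the reusable device buffer can be larger than the current image. `std::memcpy` stands in for `cudaMemcpy`, and `imageVolumeBytes` and `copyCurrentImage` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical helper: bytes in a rows x cols, 3-channel, 8-bit image.
// Assumption: this is what `volume` holds in the patched call.
std::size_t imageVolumeBytes(const int rows, const int cols)
{
    if (rows < 0 || cols < 0)
        throw std::invalid_argument("rows and cols must be non-negative");
    return static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols) * 3u;
}

// Host-only stand-in for the device-to-host copy (std::memcpy replaces cudaMemcpy).
// `scratch` plays the role of a reusable buffer sized for the largest image seen so far.
std::vector<unsigned char> copyCurrentImage(
    const std::vector<unsigned char>& scratch, const int rows, const int cols)
{
    const std::size_t volume = imageVolumeBytes(rows, cols);
    if (volume > scratch.size())
        throw std::invalid_argument("scratch buffer is smaller than the requested image");
    // The destination holds exactly `volume` bytes, like a freshly allocated cv::Mat.
    std::vector<unsigned char> image(volume);
    // Copy only `volume` bytes: copying scratch.size() bytes would write past
    // the end of `image` whenever the scratch buffer is larger than the image.
    if (volume > 0)
        std::memcpy(image.data(), scratch.data(), volume);
    return image;
}
```

In this sketch a 2 x 2 image copied out of a 48-byte scratch buffer yields a 12-byte result, and the remaining 36 bytes of the scratch buffer are never touched.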

src/openpose/gpu/cuda.cpp

Lines changed: 10 additions & 5 deletions
@@ -8,11 +8,16 @@
 namespace op
 {
     #ifdef USE_CUDA
-        const dim3 THREADS_PER_BLOCK_TINY{32, 32, 1};
-        const dim3 THREADS_PER_BLOCK_SMALL{64, 64, 1};
-        const dim3 THREADS_PER_BLOCK_MEDIUM{128, 128, 1};
-        const dim3 THREADS_PER_BLOCK_BIG{256, 256, 1};
-        const dim3 THREADS_PER_BLOCK_HUGE{512, 512, 1};
+        #ifdef DNDEBUG
+            #define base 32
+        #else
+            #define base 64
+        #endif
+        const dim3 THREADS_PER_BLOCK_TINY{base, base, 1}; // 32 |64
+        const dim3 THREADS_PER_BLOCK_SMALL{2*base, 2*base, 1}; // 64 |128
+        const dim3 THREADS_PER_BLOCK_MEDIUM{4*base, 4*base, 1}; // 128|256
+        const dim3 THREADS_PER_BLOCK_BIG{8*base, 8*base, 1}; // 256|512
+        const dim3 THREADS_PER_BLOCK_HUGE{16*base, 16*base, 1}; // 512|1024
     #endif

     void cudaCheck(const int line, const std::string& function, const std::string& file)
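
The hunk above selects `base` with `#ifdef DNDEBUG`. The compiler flag `-DNDEBUG` defines the macro `NDEBUG`, so unless the build defines `DNDEBUG` itself, the `#else` branch (64) is the one that applies. The sketch below is not taken from OpenPose: it keys on `NDEBUG` and checks a block shape against a per-block thread limit. `Dim3` is a plain stand-in for CUDA's `dim3`, all helper names are hypothetical, and the limit of 1024 is an assumption that matches current NVIDIA GPUs; real code would read `cudaDeviceProp::maxThreadsPerBlock`. Whether these constants are passed directly as launch block dimensions is not visible in this diff.

```cpp
// Plain stand-in for CUDA's dim3 so the sketch compiles without the CUDA toolkit.
struct Dim3
{
    unsigned int x, y, z;
};

// NDEBUG is the macro that the -DNDEBUG compiler flag defines.
// Mapping 32 to NDEBUG builds and 64 otherwise mirrors the hunk's two branches.
#ifdef NDEBUG
constexpr unsigned int kBase = 32u;
#else
constexpr unsigned int kBase = 64u;
#endif

// Assumed upper bound on threads per block (x * y * z).
constexpr unsigned int kAssumedMaxThreadsPerBlock = 1024u;

// Build a square, single-layer block shape.
constexpr Dim3 makeSquareBlock(const unsigned int side)
{
    return Dim3{side, side, 1u};
}

// Total number of threads a block of this shape would request.
constexpr unsigned int threadsInBlock(const Dim3 block)
{
    return block.x * block.y * block.z;
}

// True when the block shape stays within the assumed per-block limit.
constexpr bool fitsInOneBlock(const Dim3 block)
{
    return threadsInBlock(block) <= kAssumedMaxThreadsPerBlock;
}

// Block shape of the smallest preset under the current build configuration.
constexpr Dim3 kTinyBlock = makeSquareBlock(kBase);
```

Under these assumptions a 32 x 32 x 1 block requests exactly 1024 threads and fits, while a 64 x 64 x 1 block would request 4096.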

src/openpose/net/resizeAndMergeBase.cu

Lines changed: 74 additions & 74 deletions
@@ -151,48 +151,6 @@ namespace op
         }
     }

-    template <typename T>
-    __global__ void resizeAndAddKernel(
-        T* targetPtr, const T* const sourcePtr, const T scaleWidth, const T scaleHeight, const int widthSource,
-        const int heightSource, const int widthTarget, const int heightTarget)
-    {
-        const auto x = (blockIdx.x * blockDim.x) + threadIdx.x;
-        const auto y = (blockIdx.y * blockDim.y) + threadIdx.y;
-        const auto channel = (blockIdx.z * blockDim.z) + threadIdx.z;
-        if (x < widthTarget && y < heightTarget)
-        {
-            const auto sourceArea = widthSource * heightSource;
-            const auto targetArea = widthTarget * heightTarget;
-            const T xSource = (x + T(0.5f)) * widthSource / T(widthTarget) - T(0.5f);
-            const T ySource = (y + T(0.5f)) * heightSource / T(heightTarget) - T(0.5f);
-            const T* const sourcePtrChannel = sourcePtr + channel * sourceArea;
-            targetPtr[channel * targetArea + y*widthTarget+x] += bicubicInterpolate(
-                sourcePtrChannel, xSource, ySource, widthSource, heightSource, widthSource);
-        }
-    }
-
-    template <typename T>
-    __global__ void resizeAndAverageKernel(
-        T* targetPtr, const T* const sourcePtr, const T scaleWidth, const T scaleHeight, const int widthSource,
-        const int heightSource, const int widthTarget, const int heightTarget, const int counter)
-    {
-        const auto x = (blockIdx.x * blockDim.x) + threadIdx.x;
-        const auto y = (blockIdx.y * blockDim.y) + threadIdx.y;
-        const auto channel = (blockIdx.z * blockDim.z) + threadIdx.z;
-        if (x < widthTarget && y < heightTarget)
-        {
-            const auto sourceArea = widthSource * heightSource;
-            const auto targetArea = widthTarget * heightTarget;
-            const T xSource = (x + T(0.5f)) / scaleWidth - T(0.5f);
-            const T ySource = (y + T(0.5f)) / scaleHeight - T(0.5f);
-            const T* const sourcePtrChannel = sourcePtr + channel * sourceArea;
-            const auto interpolated = bicubicInterpolate(
-                sourcePtrChannel, xSource, ySource, widthSource, heightSource, widthSource);
-            auto& targetPixel = targetPtr[channel * targetArea + y*widthTarget+x];
-            targetPixel = (targetPixel + interpolated) / T(counter);
-        }
-    }
-
     template <typename T>
     __global__ void resizeAndAddAndAverageKernel(
         T* targetPtr, const int counter, const T* const scaleWidths, const T* const scaleHeights,
@@ -227,39 +185,81 @@ namespace op
         }
     }

-    template <typename T>
-    __global__ void resizeAndAddKernelOld(
-        T* targetPtr, const T* const sourcePtr, const T scaleWidth, const T scaleHeight, const int widthSource,
-        const int heightSource, const int widthTarget, const int heightTarget)
-    {
-        const auto x = (blockIdx.x * blockDim.x) + threadIdx.x;
-        const auto y = (blockIdx.y * blockDim.y) + threadIdx.y;
-        if (x < widthTarget && y < heightTarget)
-        {
-            const T xSource = (x + T(0.5f)) / scaleWidth - T(0.5f);
-            const T ySource = (y + T(0.5f)) / scaleHeight - T(0.5f);
-            targetPtr[y*widthTarget+x] += bicubicInterpolate(
-                sourcePtr, xSource, ySource, widthSource, heightSource, widthSource);
-        }
-    }
+    // template <typename T>
+    // __global__ void resizeAndAddKernel(
+    // T* targetPtr, const T* const sourcePtr, const T scaleWidth, const T scaleHeight, const int widthSource,
+    // const int heightSource, const int widthTarget, const int heightTarget)
+    // {
+    // const auto x = (blockIdx.x * blockDim.x) + threadIdx.x;
+    // const auto y = (blockIdx.y * blockDim.y) + threadIdx.y;
+    // const auto channel = (blockIdx.z * blockDim.z) + threadIdx.z;
+    // if (x < widthTarget && y < heightTarget)
+    // {
+    // const auto sourceArea = widthSource * heightSource;
+    // const auto targetArea = widthTarget * heightTarget;
+    // const T xSource = (x + T(0.5f)) * widthSource / T(widthTarget) - T(0.5f);
+    // const T ySource = (y + T(0.5f)) * heightSource / T(heightTarget) - T(0.5f);
+    // const T* const sourcePtrChannel = sourcePtr + channel * sourceArea;
+    // targetPtr[channel * targetArea + y*widthTarget+x] += bicubicInterpolate(
+    // sourcePtrChannel, xSource, ySource, widthSource, heightSource, widthSource);
+    // }
+    // }

-    template <typename T>
-    __global__ void resizeAndAverageKernelOld(
-        T* targetPtr, const T* const sourcePtr, const T scaleWidth, const T scaleHeight, const int widthSource,
-        const int heightSource, const int widthTarget, const int heightTarget, const int counter)
-    {
-        const auto x = (blockIdx.x * blockDim.x) + threadIdx.x;
-        const auto y = (blockIdx.y * blockDim.y) + threadIdx.y;
-        if (x < widthTarget && y < heightTarget)
-        {
-            const T xSource = (x + T(0.5f)) / scaleWidth - T(0.5f);
-            const T ySource = (y + T(0.5f)) / scaleHeight - T(0.5f);
-            const auto interpolated = bicubicInterpolate(
-                sourcePtr, xSource, ySource, widthSource, heightSource, widthSource);
-            auto& targetPixel = targetPtr[y*widthTarget+x];
-            targetPixel = (targetPixel + interpolated) / T(counter);
-        }
-    }
+    // template <typename T>
+    // __global__ void resizeAndAverageKernel(
+    // T* targetPtr, const T* const sourcePtr, const T scaleWidth, const T scaleHeight, const int widthSource,
+    // const int heightSource, const int widthTarget, const int heightTarget, const int counter)
+    // {
+    // const auto x = (blockIdx.x * blockDim.x) + threadIdx.x;
+    // const auto y = (blockIdx.y * blockDim.y) + threadIdx.y;
+    // const auto channel = (blockIdx.z * blockDim.z) + threadIdx.z;
+    // if (x < widthTarget && y < heightTarget)
+    // {
+    // const auto sourceArea = widthSource * heightSource;
+    // const auto targetArea = widthTarget * heightTarget;
+    // const T xSource = (x + T(0.5f)) / scaleWidth - T(0.5f);
+    // const T ySource = (y + T(0.5f)) / scaleHeight - T(0.5f);
+    // const T* const sourcePtrChannel = sourcePtr + channel * sourceArea;
+    // const auto interpolated = bicubicInterpolate(
+    // sourcePtrChannel, xSource, ySource, widthSource, heightSource, widthSource);
+    // auto& targetPixel = targetPtr[channel * targetArea + y*widthTarget+x];
+    // targetPixel = (targetPixel + interpolated) / T(counter);
+    // }
+    // }
+
+    // template <typename T>
+    // __global__ void resizeAndAddKernelOld(
+    // T* targetPtr, const T* const sourcePtr, const T scaleWidth, const T scaleHeight, const int widthSource,
+    // const int heightSource, const int widthTarget, const int heightTarget)
+    // {
+    // const auto x = (blockIdx.x * blockDim.x) + threadIdx.x;
+    // const auto y = (blockIdx.y * blockDim.y) + threadIdx.y;
+    // if (x < widthTarget && y < heightTarget)
+    // {
+    // const T xSource = (x + T(0.5f)) / scaleWidth - T(0.5f);
+    // const T ySource = (y + T(0.5f)) / scaleHeight - T(0.5f);
+    // targetPtr[y*widthTarget+x] += bicubicInterpolate(
+    // sourcePtr, xSource, ySource, widthSource, heightSource, widthSource);
+    // }
+    // }
+
+    // template <typename T>
+    // __global__ void resizeAndAverageKernelOld(
+    // T* targetPtr, const T* const sourcePtr, const T scaleWidth, const T scaleHeight, const int widthSource,
+    // const int heightSource, const int widthTarget, const int heightTarget, const int counter)
+    // {
+    // const auto x = (blockIdx.x * blockDim.x) + threadIdx.x;
+    // const auto y = (blockIdx.y * blockDim.y) + threadIdx.y;
+    // if (x < widthTarget && y < heightTarget)
+    // {
+    // const T xSource = (x + T(0.5f)) / scaleWidth - T(0.5f);
+    // const T ySource = (y + T(0.5f)) / scaleHeight - T(0.5f);
+    // const auto interpolated = bicubicInterpolate(
+    // sourcePtr, xSource, ySource, widthSource, heightSource, widthSource);
+    // auto& targetPixel = targetPtr[y*widthTarget+x];
+    // targetPixel = (targetPixel + interpolated) / T(counter);
+    // }
+    // }

     template <typename T>
     void resizeAndMergeGpu(
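
The kernels touched in this file map each target pixel back to a fractional source coordinate with `(x + 0.5) * widthSource / widthTarget - 0.5`, which aligns pixel centers rather than pixel corners. The averaging variants then compute `(targetPixel + interpolated) / counter`. Below is a CPU-only sketch of just those two expressions. The function names `targetToSource` and `addAndAverage` are hypothetical, and the bicubic interpolation itself is left out.

```cpp
#include <cassert>

// Map a target pixel index to a fractional source coordinate, aligning pixel centers.
// Mirrors the kernel expression: (x + 0.5) * sizeSource / sizeTarget - 0.5.
template <typename T>
T targetToSource(const int target, const int sizeSource, const int sizeTarget)
{
    assert(sizeTarget > 0);
    return (target + T(0.5)) * sizeSource / T(sizeTarget) - T(0.5);
}

// Add the newly interpolated value to the running sum, then divide by the
// number of merged inputs, as the averaging kernels do for each pixel.
template <typename T>
T addAndAverage(const T accumulated, const T interpolated, const int counter)
{
    assert(counter > 0);
    return (accumulated + interpolated) / T(counter);
}
```

For a 2-pixel source upscaled to 4 pixels, target index 0 maps to -0.25 and target index 3 maps to 1.25, so the outermost target pixels sample slightly outside the source centers and depend on how the interpolation handles borders.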

src/openpose/pose/poseGpuRenderer.cpp

Lines changed: 3 additions & 4 deletions
@@ -103,10 +103,9 @@ namespace op
                 scaleKeypoints(poseKeypointsRescaled, scaleInputToOutput);
                 // Render keypoints
                 if (!poseKeypoints.empty())
-                    cudaMemcpy(pGpuPose,
-                               poseKeypointsRescaled.getConstPtr(),
-                               numberPeople * numberBodyParts * 3 * sizeof(float),
-                               cudaMemcpyHostToDevice);
+                    cudaMemcpy(
+                        pGpuPose, poseKeypointsRescaled.getConstPtr(),
+                        numberPeople * numberBodyParts * 3 * sizeof(float), cudaMemcpyHostToDevice);
                 renderPoseKeypointsGpu(
                     *spGpuMemory, pMaxPtr, pMinPtr, pScalePtr, mPoseModel, numberPeople, frameSize, pGpuPose,
                     mRenderThreshold, mShowGooglyEyes, mBlendOriginalFrame, getAlphaKeypoint());
