Fig.1: The configuration of AlexNet
Fig.2: The configurations of VGG models
Training deeper neural networks is difficult because of the vanishing gradient problem that arises in gradient-based learning with backpropagation. A residual learning framework eases the training of networks that are substantially deeper than earlier architectures.
Fig.3: Residual learning: a building block.
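The building block can be illustrated with a minimal sketch, assuming a block with two small fully connected layers in place of the convolutions used in real residual networks. The block computes a residual function F(x) and adds the input x back through an identity shortcut, so the output is relu(F(x) + x). The helper names `relu`, `linear`, and `residual_block` are hypothetical and chosen only for this illustration.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Return relu(F(x) + x), where F(x) = W2 * relu(W1 * x).

    The "+ x" term is the identity shortcut: even if F contributes
    nothing, the input still passes through the block unchanged
    (up to the final ReLU).
    """
    f = linear(w2, relu(linear(w1, x)))
    return relu([fi + xi for fi, xi in zip(f, x)])


# With all-zero weights F(x) is zero, so the block reduces to relu(x).
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block([1.0, 2.0], zeros, zeros))  # [1.0, 2.0]
```

Because the shortcut carries the input forward directly, the layers only have to learn the difference between the desired output and the input, which is the intuition behind the easier optimization of very deep networks.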
Pose estimation refers to computer vision techniques that detect human figures in images and videos, so that one could determine, for example, where someone’s elbow shows up in an image.
Fig.4: Part and pair indices for the COCO dataset.
OpenPose provides a real-time method for multi-person 2D pose estimation that uses a bottom-up approach rather than a detection-based (top-down) approach.
Fig.5: Architecture of the multi-stage CNN.
Fig.6: Body part detection and part association.
The feature maps produced by the first 10 layers of the VGG-19 model are processed by a multi-stage CNN to generate a set of Part Confidence Maps and a set of Part Affinity Fields (PAFs). These are then used by a greedy algorithm to assemble the pose of each person in the image.
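The association step can be sketched in simplified form. The code below is not the OpenPose implementation. It assumes that candidate positions for two body parts have already been extracted from the confidence maps, that a single PAF is stored as two 2D grids (`paf_x`, `paf_y`) indexed as `[row][column]`, and that points are `(x, y)` tuples. A candidate limb is scored by averaging the PAF vectors sampled along the segment, projected onto the segment direction, and pairs are then accepted greedily from the highest score down. The function names, `num_samples`, and `min_score` are hypothetical.

```python
import math


def paf_score(paf_x, paf_y, p1, p2, num_samples=10):
    """Average alignment between the PAF and the segment p1 -> p2."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    if length == 0 or num_samples < 2:
        return 0.0
    ux, uy = dx / length, dy / length  # unit vector along the limb
    total = 0.0
    for i in range(num_samples):
        t = i / (num_samples - 1)
        x = int(round(p1[0] + t * dx))
        y = int(round(p1[1] + t * dy))
        # Dot product of the field vector with the limb direction.
        total += paf_x[y][x] * ux + paf_y[y][x] * uy
    return total / num_samples


def greedy_match(cands_a, cands_b, paf_x, paf_y, min_score=0.0):
    """Pair candidates of part A with part B, best score first."""
    scored = []
    for i, a in enumerate(cands_a):
        for j, b in enumerate(cands_b):
            s = paf_score(paf_x, paf_y, a, b)
            if s > min_score:
                scored.append((s, i, j))
    scored.sort(reverse=True)
    used_a, used_b, pairs = set(), set(), []
    for s, i, j in scored:
        # Each candidate may belong to at most one limb.
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            pairs.append((i, j))
    return pairs


# Toy 5x5 field: vectors point in +x along row 1, zero elsewhere.
paf_x = [[1.0 if r == 1 else 0.0 for _ in range(5)] for r in range(5)]
paf_y = [[0.0] * 5 for _ in range(5)]
shoulders = [(0, 1), (0, 3)]
elbows = [(4, 1), (4, 3)]
print(greedy_match(shoulders, elbows, paf_x, paf_y))  # [(0, 0)]
```

In the toy field only the segment along row 1 is fully supported by the PAF, so the greedy pass keeps that pair and rejects the competing diagonal candidates that would reuse the same endpoints.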