Data Efficient Machine Learning: Infant Pose and Posture Estimation
Institute Reference: INV-21102
Recent advances in computer vision have led to powerful human activity recognition models. However, models trained on large-scale adult activity datasets have limited success in estimating infant actions and behaviors because infants differ significantly from adults in body proportions, pose complexity, and types of activity. More specifically, publicly available large-scale human pose datasets consist predominantly of scenes from sports, TV, and other daily activities performed by adults, and none of these datasets provides exemplars of activities of young children or infants. Additionally, privacy and security considerations limit the availability of the infant images and videos needed to train a robust model from scratch.
To mitigate these data limitations and enable a robust infant behavior estimation and tracking system, this Northeastern University group has developed a two-stage, data-efficient infant pose/posture estimation framework that builds on both transfer learning and synthetic data augmentation.
This posture estimation approach makes the following contributions:
(1) Presents a fine-tuned domain-adapted infant pose (FiDIP) estimation model composed of two sub-networks: a pose estimation sub-network, which leverages transfer learning from a pre-trained adult pose estimation network, and a domain confusion sub-network, which adapts the model to both real and synthetic infant datasets.
(2) Achieves a highly accurate and robust end-to-end posture-based-on-pose estimation pipeline, called FiDIP-Posture, that is trained with limited posture labels. This is possible because pose can be seen as a low-dimensional representation for posture learning.
(3) Builds a synthetic and real infant pose (SyRIP) dataset, which includes 700 fully labeled real infant images in diverse poses as well as 1000 synthetic infant images produced with two different human image generation methods.
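The idea in contribution (2), that a pose is a compact input for posture learning, can be illustrated with a minimal sketch. This is not the FiDIP-Posture classifier: it swaps in a simple nearest-centroid rule, and the three-keypoint skeletons, the posture labels, and the function names (normalize_pose, fit_centroids, predict_posture) are all hypothetical choices made for illustration.

```python
import math


def normalize_pose(keypoints):
    """Flatten (x, y) keypoints into a translation- and scale-invariant vector."""
    n = len(keypoints)
    cx = sum(x for x, _ in keypoints) / n
    cy = sum(y for _, y in keypoints) / n
    centered = [(x - cx, y - cy) for x, y in keypoints]
    # Scale by the largest distance from the centroid; fall back to 1.0
    # if all keypoints coincide.
    scale = max(math.hypot(x, y) for x, y in centered) or 1.0
    return [coord / scale for point in centered for coord in point]


def fit_centroids(poses, labels):
    """Average the normalized pose vectors of each posture label."""
    sums, counts = {}, {}
    for pose, label in zip(poses, labels):
        feat = normalize_pose(pose)
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], feat)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict_posture(centroids, pose):
    """Return the label whose centroid is closest to the normalized pose."""
    feat = normalize_pose(pose)
    return min(centroids, key=lambda label: math.dist(feat, centroids[label]))


# Toy data (hypothetical): three keypoints per pose, e.g. head, hip, foot.
train_poses = [
    [(0, 0), (1, 0), (2, 0)],  # body laid out horizontally
    [(0, 0), (0, 1), (0, 2)],  # body laid out vertically
]
train_labels = ["supine", "standing"]
centroids = fit_centroids(train_poses, train_labels)

# A shifted, larger vertical pose still maps to the vertical class.
print(predict_posture(centroids, [(5, 5), (5, 8), (5, 11)]))
```

Because each pose reduces to a few dozen numbers at most, even a handful of labeled examples per posture can define a usable decision rule, which is the intuition behind training with limited posture labels.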
When applied to a fully novel dataset of infants in their interactive natural environments, FiDIP-Posture achieves a mean average precision (mAP) as high as 86.3 in pose estimation and a classification accuracy of 77.9% for posture recognition.
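The summary does not state which protocol was used to compute mAP. Assuming a COCO-style keypoint evaluation, where mAP is built on the object keypoint similarity (OKS) between each predicted and ground-truth pose, the per-pose score could be sketched as follows. The function name oks and the sigma values in the example are hypothetical; they are not taken from the FiDIP work.

```python
import math


def oks(pred, gt, visible, area, sigmas):
    """Object keypoint similarity for one pose, following the COCO definition.

    pred, gt : lists of (x, y) keypoints in pixels
    visible  : per-keypoint visibility flags; flags <= 0 mean "not labeled"
    area     : ground-truth object area in pixels squared
    sigmas   : per-keypoint falloff constants (assumed values in this sketch)
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visible, sigmas):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute to the score
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2 * sigma
        total += math.exp(-d2 / (2 * area * k ** 2))
        count += 1
    return total / count if count else 0.0


# Example with two labeled keypoints and one unlabeled keypoint.
gt = [(10.0, 10.0), (20.0, 20.0), (0.0, 0.0)]
pred = [(10.0, 10.0), (21.0, 20.0), (50.0, 50.0)]
score = oks(pred, gt, visible=[2, 2, 0], area=400.0, sigmas=[0.05, 0.05, 0.05])
print(round(score, 3))
```

Under this assumption, a prediction counts as correct at a given OKS threshold, average precision is computed per threshold, and mAP averages the result over a range of thresholds.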