Abstract
In football analytics, video recordings of matches play a crucial role in post-game analysis. However, video is inherently limited: viewers can only follow the match from the camera's perspective. This thesis is part of a larger project aimed at creating 3D representations of football matches from video, enabling users to view the game from any vantage point inside a virtual 3D environment. The larger project consists of three parts: this thesis focuses on estimating the camera parameters as well as the 3D poses and locations of the players in the video, while the other two parts address player tracking and player texture generation. A pipeline combining camera calibration and pose estimation is proposed, taking video recordings and bounding-box annotations as input and predicting camera parameters together with the players' 3D poses and locations. For camera calibration, a model specifically tailored to cameras viewing football fields is used; the results indicate accurately predicted positions and viewing angles for the estimated camera. Pose estimation is performed with a pre-trained model and yields visually accurate projections, although perspective ambiguities appear when the 3D poses are viewed from other angles. The main approach for positioning players is to detect when a player touches the ground and to interpolate positions for ambiguous frames. The results are promising, but perspective ambiguities introduce noise into the depth estimates. Finally, an optional refinement of poses and positions using multi-view triangulation is presented, showing the potential for further optimization toward realistic and consistent human poses. Future work on pose and location optimization could yield a pseudo-truth dataset for improving overall poses and positions estimated from strictly monocular video.