Abstract
In recent years, football analysis has grown in popularity with various new tools, ranging from tracking entire teams' movement patterns to monitoring the effort exerted by individual players during matches. One increasingly popular method, facilitated by improved cameras, is recording matches and analysing them afterwards. However, some players may struggle to relate the overhead perspective typically used by such cameras to their own experience on the pitch. This master's thesis focuses on using a single camera angle to reconstruct a match sequence in 3D, allowing interaction with the environment in virtual reality (VR) from arbitrary viewpoints.
This thesis forms part of a pipeline developed jointly with two other master's theses: the first project tracks each player and the ball over time, and its results are passed to the second project, which estimates the poses of all players and in turn provides the input for this project. This project concentrates on creating the interactive VR environment. Additionally, a texture generation testbed has been created to automatically recreate the players' clothing in the 3D environment, making the pipeline from a video to an interactive environment as seamless as possible.
A VR environment was constructed for the visualisation of, and interaction with, three-dimensional reconstructions of recorded football matches. Within this environment, the user may control playback, teleport onto the pitch, and even experience the match from the perspective of one of the players. Furthermore, a texture generation testbed was built utilising differentiable rendering to estimate the football players’ kits based on sequences of video frames and their estimated pose. Various techniques were explored, including perceptual loss and the application of diffusion models for inpainting and image-to-image synthesis, both in texture space and the image plane.
The texture generation testbed produces convincing results in best-case scenarios: players visible from many angles with accurate pose estimates. However, the combination of low-resolution images, a lack of views of certain players, and even slight misalignments between the estimated pose and the player in the image increases the reliance on synthetic imagery. These methods require further tuning, and possibly training, to produce lifelike results.