In this paper, we propose an enhanced architecture based on StyleGAN2 for conditional video generation. We leverage disentangled motion and content spaces for video manipulation. Our method learns dynamic representations of various actions that are independent of image content and can be transferred between different actors. Beyond significantly improving video quality over prevalent methods, our approach can generate videos of actors performing actions that were never seen together during training. Furthermore, we demonstrate that the disentangled dynamics and content representations permit independent manipulation, enabling applications such as changing a person's mood over time.
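The core idea of disentanglement can be illustrated with a toy sketch (this is an illustrative assumption, not the paper's actual implementation): a static content code identifies the actor, a per-frame motion code sequence encodes the action, and the generator combines the two, so the same action transfers to any actor.

```python
# Toy sketch of disentangled content/motion codes (hypothetical names,
# not the paper's API): a "video" is a sequence of frames, each built
# from one static content code plus a per-frame motion code.

def generate_video(content_code, motion_codes):
    """Stand-in for the StyleGAN2-based generator: a frame here is just
    the (content, motion) pair, showing how codes recombine freely."""
    return [(content_code, m) for m in motion_codes]

# Motion codes learned for one action, independent of any actor.
wave_motion = ["wave_t0", "wave_t1", "wave_t2"]

# Content codes identifying two different actors.
actor_a, actor_b = "actor_A", "actor_B"

# The same action transfers to an actor never seen performing it:
video_a = generate_video(actor_a, wave_motion)
video_b = generate_video(actor_b, wave_motion)
```

Because motion is stored separately from content, swapping only the content code re-renders the identical action on a new actor, which is the property the abstract describes.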