In this paper, we propose an enhanced StyleGAN2-based architecture for conditional video generation that leverages disentangled motion and content spaces for video manipulation. Our method learns dynamic representations of various actions that are independent of image content and can be transferred between different actors. Beyond a significant improvement in video quality over existing methods, our approach can generate videos of actors performing actions that were never seen together during training. Furthermore, we show that the disentangled dynamics and content representations can be manipulated independently, which opens up a wide range of applications such as changing a person's mood over time.
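
To illustrate the disentanglement idea in rough terms (this is a toy sketch, not the authors' implementation), the PyTorch snippet below conditions a generator on one content code per video and one motion code per frame; swapping the content code while keeping the motion trajectory corresponds to transferring an action between actors. All class names, dimensions, and the fusion layer are hypothetical.

```python
import torch
import torch.nn as nn

class ToyVideoGenerator(nn.Module):
    """Toy generator: a static content code plus per-frame motion codes -> frames."""

    def __init__(self, content_dim=64, motion_dim=16, frame_size=32):
        super().__init__()
        self.frame_size = frame_size
        # Fuse content and motion codes into a per-frame style vector,
        # loosely mimicking how a StyleGAN2-like synthesis network is modulated.
        self.fuse = nn.Sequential(
            nn.Linear(content_dim + motion_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * frame_size * frame_size), nn.Tanh(),
        )

    def forward(self, content, motion):
        # content: (B, content_dim)    one code per video (appearance/identity)
        # motion:  (B, T, motion_dim)  one code per frame (dynamics)
        B, T, _ = motion.shape
        c = content.unsqueeze(1).expand(B, T, -1)      # repeat content over time
        frames = self.fuse(torch.cat([c, motion], dim=-1))
        return frames.view(B, T, 3, self.frame_size, self.frame_size)

# Motion transfer: reuse actor A's motion trajectory with actor B's content code.
gen = ToyVideoGenerator()
content_a, content_b = torch.randn(1, 64), torch.randn(1, 64)
motion_a = torch.randn(1, 8, 16)                       # 8-frame motion sequence
video_b_doing_a = gen(content_b, motion_a)             # B performs A's action
print(video_b_doing_a.shape)                           # torch.Size([1, 8, 3, 32, 32])
```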
@inproceedings{manandhar2024onestyle,
  title     = {One Style is All You Need to Generate a Video},
  author    = {Sandeep Manandhar and Auguste Genovesio},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2024},
}