One Style is All You Need to Generate a Video

Sandeep Manandhar, Auguste Genovesio
IBENS, École Normale Supérieure, Paris
WACV 2024

Manipulation of facial expressions in the generated video. Top row: An actress changing expression from happy to disgusted, sad, and surprised. Note that the lips remain synced despite the expression change. Bottom row: Other actors changing expression.

Abstract

In this paper, we propose an enhanced architecture based on StyleGAN2 for conditional video generation. We leverage disentangled motion and content spaces for video manipulation. Our method learns dynamic representations of various actions that are independent of image content and can be transferred between different actors. Beyond the significant enhancement of video quality compared to prevalent methods, our approach allows us to generate videos of actors performing actions that were not seen together during the training stage. Furthermore, we demonstrate that the disentangled dynamics and content permit their independent manipulation, which holds potential for a wide range of applications such as changing a person's mood over time.

Generating MEAD sequences

Baseline methods

Generated by MoCoGAN-HD (unconditionally)

Generated by StyleGAN-V (unconditionally)

Generated by ImaGINator (conditionally)

Generated by our method

Each column represents a distinct temporal style. Notice that the lip motion of the different actors is in sync. See the next panel for a different emotional expression. Only conditionally generated videos are shown.

Generating UTD-MHAD sequences

We generated the 27 different actions present in the UTD-MHAD dataset. Action classification is performed by PoseC3D (trained by us), which takes the skeletal data of the image sequence as input.
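As a rough illustration of this evaluation protocol only (the pose extractor and classifier below are placeholders, not the actual PoseC3D training code), the classification step can be sketched as:

import numpy as np

def classify_generated_sequence(frames, pose_extractor, classifier):
    """Classify a generated video by the action its skeleton performs.

    frames: list of H x W x 3 uint8 images produced by the generator.
    pose_extractor: callable returning a (num_joints, 2) keypoint array per frame.
    classifier: skeleton-based action classifier (e.g. a PoseC3D-style model)
                trained on the 27 UTD-MHAD actions.
    """
    # Stack per-frame keypoints into a (T, num_joints, 2) skeletal sequence.
    skeleton = np.stack([pose_extractor(f) for f in frames], axis=0)
    # The classifier consumes the skeletal sequence, not the raw pixels.
    logits = classifier(skeleton)
    return int(np.argmax(logits))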

Ablation of D_t

Top row: Real sequences of actresses making clockwise, counter-clockwise, and triangle gestures. Middle row: Sequences generated after training without D_t. Bottom row: Sequences generated after training with D_t, which clearly improves the results.

Interpolating emotions mid-sequence

We linearly interpolate the emotion label of a sequence with respect to time. The change in labels does not interfere with the motion style. The graph on the right side of the video depicts the LiA signal, which measures the area of the lips throughout the sequence.
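A minimal sketch of the label interpolation and of the lip-area (LiA) measurement, assuming one-hot emotion labels and per-frame lip landmarks; the helper names are ours, not the paper's code:

import numpy as np

def interpolate_emotion_labels(label_a, label_b, num_frames):
    """Linearly blend two one-hot emotion labels over time.

    Returns a (num_frames, num_classes) array whose t-th row conditions
    frame t of the generated sequence.
    """
    alphas = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - alphas) * label_a[None, :] + alphas * label_b[None, :]

def lip_area(lip_points):
    """Area enclosed by the outer lip contour (shoelace formula).

    lip_points: (N, 2) array of lip landmark coordinates for one frame.
    The LiA signal is this value computed for every frame of the sequence.
    """
    x, y = lip_points[:, 0], lip_points[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))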

GAN-inversion for temporal style

We use an off-the-shelf GAN-inversion technique that applies an LPIPS loss and MSE over the video frames to optimize a single temporal vector. The optimized vector, together with the time2vec module, produces the temporal style for each frame of the video. We are able to extract the temporal style from an unseen sequence (a known actor reciting an unknown sentence) and transfer it to other learnt actors.
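A minimal sketch of such an inversion loop, assuming a frozen frame generator G(temporal_style, content_style) and a frozen time2vec module; the latent dimension, loss weights, and optimizer settings below are illustrative assumptions, not the published configuration:

import torch
import lpips  # perceptual loss of Zhang et al. (pip install lpips)

def invert_temporal_vector(G, time2vec, content_style, real_frames,
                           num_steps=500, lr=0.05, mse_weight=1.0):
    """Optimize a single temporal vector so that G reproduces real_frames.

    real_frames: (T, 3, H, W) tensor scaled to [-1, 1].
    G: frozen generator mapping (temporal_style, content_style) -> frame.
    time2vec: frozen module expanding (z_t, frame_index) -> per-frame temporal style.
    """
    device = real_frames.device
    percep = lpips.LPIPS(net='vgg').to(device)
    z_t = torch.zeros(1, 512, device=device, requires_grad=True)  # dimension assumed
    opt = torch.optim.Adam([z_t], lr=lr)

    T = real_frames.shape[0]
    for _ in range(num_steps):
        opt.zero_grad()
        loss = 0.0
        for t in range(T):
            style_t = time2vec(z_t, t)           # temporal style of frame t
            fake = G(style_t, content_style)     # generated frame t
            loss = loss + percep(fake, real_frames[t:t + 1]).mean()
            loss = loss + mse_weight * torch.nn.functional.mse_loss(
                fake, real_frames[t:t + 1])
        loss.backward()
        opt.step()
    return z_t.detach()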


Upon inspection of the PCA components of the trajectory recovered by iterative inversion, wave-like structures are evident. The motion style recovered by our GAN-inversion has the same character, as shown in the bottom row.

PCA waves
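Such an inspection can be sketched with scikit-learn, assuming the per-frame latents recovered by iterative inversion are stacked into a (T, D) matrix (variable names are ours):

import numpy as np
from sklearn.decomposition import PCA

def pca_trajectory(latents, n_components=4):
    """Project a recovered latent trajectory onto its main PCA axes.

    latents: (T, D) array with one inverted latent per frame.
    Plotting each column of the returned (T, n_components) array against
    the frame index reveals the wave-like structure discussed above.
    """
    pca = PCA(n_components=n_components)
    return pca.fit_transform(latents)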

The motion styles recovered by our GAN-inversion are transferable to other actors. This is not possible with traditional iterative inversion. The video below demonstrates the extraction of motion from real videos and its transfer to new videos with different actors. The facial feature points (real and inverted) have been computed using dlib.
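For reference, the facial feature points can be extracted with dlib's standard 68-landmark pipeline, roughly as follows (the predictor weights file is dlib's usual pretrained model, obtained separately):

import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Pretrained 68-point landmark model, downloaded separately from dlib.net.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def face_landmarks(gray_image):
    """Return a (68, 2) array of facial feature points, or None if no face is found."""
    faces = detector(gray_image, 1)
    if len(faces) == 0:
        return None
    shape = predictor(gray_image, faces[0])
    return np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])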

The following are examples of GAN-inversion and re-enactment with models trained with 63, 127, and 255 sinusoidal bases. The video contains audio. We note that the dialogues being spoken were absent from the training set, meaning the input video sequences for the GAN-inversion were never seen during training. Nevertheless, we are able to successfully recover the motion of the lips even though the person is reciting unseen sentences. The leftmost panel depicts the video clip to be inverted. The right panel consists of the inverted clip (top left) and three re-enactment clips. The input clip was downscaled to a 128x128 spatial resolution for the inversion.
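To make the role of the sinusoidal bases concrete, here is a Time2Vec-style temporal embedding with one linear term plus k sine terms, so that k = 63, 127, or 255 yields a 64-, 128-, or 256-dimensional temporal input; this parameterization is our reading of Time2Vec, not the paper's verbatim module:

import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    """Time2Vec-style embedding: one linear term plus k learned sinusoidal bases."""

    def __init__(self, k=63):
        super().__init__()
        self.w0 = nn.Parameter(torch.randn(1))
        self.b0 = nn.Parameter(torch.zeros(1))
        self.w = nn.Parameter(torch.randn(k))
        self.b = nn.Parameter(torch.zeros(k))

    def forward(self, t):
        # t: (batch,) tensor of frame indices (or normalized times).
        t = t.unsqueeze(-1).float()
        linear = self.w0 * t + self.b0                # (batch, 1)
        periodic = torch.sin(self.w * t + self.b)     # (batch, k)
        return torch.cat([linear, periodic], dim=-1)  # (batch, k + 1)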

BibTeX

@inproceedings{manandhar2024onestyle,
  title     = {One Style is All You Need to Generate a Video},
  author    = {Sandeep Manandhar and Auguste Genovesio},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2024},
}