Anonymous Submission

Institution Name
Conference name and year

Manipulation of the facial expression of a generated video. Top row: an actress changing expression from happy to disgusted, sad, and surprised. Note that the lips remain synced despite the expression change. Bottom row: other actors changing expression.

Abstract

In this paper, we propose an enhanced architecture based on StyleGAN2 for conditional video generation. We leverage disentangled motion and content spaces for video manipulation. Our method learns dynamic representations of various actions that are independent of image content and can be transferred between different actors. In addition to significantly improving video quality over prevalent methods, our approach can generate videos of actors performing actions that were not seen together during training. Furthermore, we demonstrate that disentangling dynamics and content permits their independent manipulation, which opens up a wide range of applications such as changing a person's mood over time.

Generating MEAD sequences

Baseline methods

Generated by MoCoGAN-HD (unconditionally)

Generated by StyleGAN-V (unconditionally)

Generated by ImaGINator (conditionally)

Generated by our method

Each column represents a distinct temporal style. Notice that the lip motions of the different actors are in sync; see the next panel for a different emotional expression. Only conditionally generated videos are shown.
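For reference, here is a minimal sketch of how such a grid can be rendered with a disentangled generator. The interface G(content, temporal_style), along with sample_content and temporal_trajectory, are hypothetical stand-ins for our actual modules, not their real signatures.

```python
import torch

# Hypothetical stand-ins for the actual modules (names are placeholders):
#   sample_content()            -> content latent for one actor, shape (1, 512)
#   temporal_trajectory(T)      -> per-frame temporal styles, shape (T, 512)
#   G(content, temporal_style)  -> one generated frame, shape (1, 3, H, W)

def render_column_grid(G, sample_content, temporal_trajectory,
                       n_actors=4, n_frames=32):
    """Render videos where every row is a different actor (content code)
    driven by the same temporal trajectory, so lip motion stays in sync."""
    styles = temporal_trajectory(n_frames)                 # shared temporal style
    videos = []
    for _ in range(n_actors):
        content = sample_content()                         # fixed identity per row
        frames = [G(content, styles[t:t + 1]) for t in range(n_frames)]
        videos.append(torch.cat(frames, dim=0))            # (T, 3, H, W)
    return torch.stack(videos)                             # (n_actors, T, 3, H, W)
```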

Generating UTD-MHAD sequences

We generate all 27 actions present in the UTD-MHAD dataset. Action classification is performed by PoseC3D (trained by us), which takes the skeletal data of the image sequence as input.
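As a rough illustration of this evaluation protocol (not the exact PoseC3D implementation), the sketch below rasterizes per-frame 2D joints into Gaussian heatmap volumes, the style of input PoseC3D consumes, and scores them with a small 3D-CNN classifier. The pose estimator and trained weights are assumed to be available.

```python
import torch
import torch.nn as nn

def joints_to_heatmaps(joints, size=56, sigma=2.0):
    """Rasterize 2D joints into Gaussian heatmaps.
    joints: (T, J, 2) coordinates already scaled to [0, size).
    Returns a (J, T, size, size) pseudo-heatmap volume."""
    T, J, _ = joints.shape
    ys = torch.arange(size).view(size, 1).float()
    xs = torch.arange(size).view(1, size).float()
    heat = torch.zeros(J, T, size, size)
    for t in range(T):
        for j in range(J):
            x, y = joints[t, j]
            heat[j, t] = torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heat

class SkeletonClassifier(nn.Module):
    """Tiny 3D-CNN over joint-heatmap volumes, one logit per UTD-MHAD action."""
    def __init__(self, n_joints=20, n_actions=27):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(n_joints, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(64, n_actions)

    def forward(self, heatmaps):                  # (B, J, T, H, W)
        return self.head(self.backbone(heatmaps).flatten(1))

# Usage on a generated clip, with skeletons from an assumed pose estimator:
# joints = pose_estimator(frames)                           # (T, 20, 2)
# logits = SkeletonClassifier()(joints_to_heatmaps(joints).unsqueeze(0))
# predicted_action = logits.argmax(dim=1)
```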

Ablation of D_t

Top row: real sequences of actresses making clockwise, counter-clockwise, and triangle gestures. Middle row: sequences generated after training without D_t. Bottom row: sequences generated after training with D_t, which clearly improves the results.
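Here D_t denotes the temporal discriminator. The block below is only a generic stand-in for such a component, a 3D-CNN that scores short frame stacks as real or fake so that implausible motion is penalized; it is not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class TemporalDiscriminator(nn.Module):
    """Generic temporal discriminator sketch: scores a short stack of frames,
    so the generator is penalized for implausible motion, not just bad frames."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 64, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(64, 128, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(128, 1),                    # one real/fake logit per clip
        )

    def forward(self, clip):                      # clip: (B, C, T, H, W)
        return self.net(clip)

# With the StyleGAN2-style logistic loss:
# d_loss = softplus(-D_t(real_clip)).mean() + softplus(D_t(fake_clip)).mean()
```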

Interpolating emotions mid-sequence

We linearly interpolate the emotion label of a sequence with respect to time. The change in labels does not interfere with the motion style. The graph on the right side of the video depicts the LiA signal, which measures the area of the lips throughout the sequence.
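A sketch of the two ingredients of this experiment, under mild assumptions about the label format: the emotion condition is blended linearly between two one-hot vectors over time, and the LiA value per frame is taken as the polygon (shoelace) area of the outer-lip landmarks, points 48-59 in the 68-point dlib convention.

```python
import numpy as np

def interpolate_emotions(label_a, label_b, n_frames):
    """Linearly blend two one-hot emotion labels over time.
    Returns an (n_frames, n_classes) array fed to the generator frame by frame."""
    alphas = np.linspace(0.0, 1.0, n_frames)[:, None]
    return (1.0 - alphas) * label_a[None] + alphas * label_b[None]

def lip_area(landmarks68):
    """LiA value for one frame: shoelace area of the outer-lip polygon
    (points 48-59 in the 68-point dlib landmark convention)."""
    lips = landmarks68[48:60]                     # (12, 2) outer-lip contour
    x, y = lips[:, 0], lips[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Example: blend "happy" -> "sad" over 64 frames and track the LiA signal.
happy = np.eye(8)[0]                              # assuming 8 emotion classes, as in MEAD
sad = np.eye(8)[2]
labels = interpolate_emotions(happy, sad, 64)     # (64, 8)
# lia = np.array([lip_area(lm) for lm in per_frame_landmarks])  # landmarks assumed given
```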

GAN-inversion for temporal style

We use an off-the-shelf GAN-inversion technique that optimizes a single temporal vector with an LPIPS loss and MSE over the video frames. The optimized vector, passed through the time2vec module, produces the temporal style for each frame of the video. This lets us extract the temporal style from an unseen sequence (a known actor reciting an unknown sentence) and transfer it to other learned actors.
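A minimal sketch of such an inversion loop, assuming a frozen generator G(z, t) that renders frame t from a single temporal vector (time2vec is applied inside G) and a 512-dimensional temporal vector; the lpips package supplies the perceptual loss.

```python
import torch
import torch.nn.functional as F
import lpips

def invert_temporal_vector(G, video, n_steps=500, lr=0.05, lambda_mse=1.0, device="cuda"):
    """Optimize one temporal vector so that G reproduces the target video.
    video: (T, 3, H, W) real frames in [-1, 1]; G(z, t) is assumed to return
    frame t of the clip driven by temporal vector z (via time2vec inside G)."""
    video = video.to(device)
    perceptual = lpips.LPIPS(net="vgg").to(device)
    z = torch.zeros(1, 512, device=device, requires_grad=True)   # assumed 512-dim
    opt = torch.optim.Adam([z], lr=lr)

    for _ in range(n_steps):
        opt.zero_grad()
        loss = 0.0
        for t in range(video.shape[0]):
            frame = G(z, t)                                       # (1, 3, H, W)
            target = video[t:t + 1]
            loss = loss + perceptual(frame, target).mean() \
                        + lambda_mse * F.mse_loss(frame, target)
        loss.backward()
        opt.step()
    return z.detach()   # can now be paired with any learned actor's content code
```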


Inspecting the PCA components of the trajectory recovered by iterative inversion reveals clear wave-like structures. The motion style recovered by our GAN-inversion has the same character, as shown in the bottom row.

PCA waves
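This inspection can be reproduced with ordinary PCA over the per-frame vectors; below is a sketch using scikit-learn, where trajectory stands for the (T, D) sequence of per-frame style vectors recovered by inversion.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_pca_waves(trajectory, n_components=4):
    """Project a (T, D) latent trajectory onto its leading principal components
    and plot each component against time; periodic motion shows up as waves."""
    pca = PCA(n_components=n_components)
    coords = pca.fit_transform(trajectory)        # (T, n_components)
    for k in range(n_components):
        plt.plot(coords[:, k], label=f"PC{k + 1} ({pca.explained_variance_ratio_[k]:.0%})")
    plt.xlabel("frame")
    plt.ylabel("projection")
    plt.legend()
    plt.show()

# plot_pca_waves(np.stack(per_frame_styles))      # per_frame_styles assumed available
```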

The motion styles recovered by our GAN-inversion are transferable to other actors, which is not possible with traditional iterative inversion. The video below demonstrates the extraction of motion from real videos and its transfer to new videos with different actors. The facial feature points (Real, Inverted) are computed using dlib.
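A sketch of how the real and inverted landmark tracks can be compared with dlib, assuming the standard 68-point shape-predictor model file is available locally (the path below is a placeholder).

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # placeholder path

def face_landmarks(frame_rgb):
    """68 (x, y) facial landmarks for the first detected face, or None."""
    faces = detector(frame_rgb, 1)
    if not faces:
        return None
    shape = predictor(frame_rgb, faces[0])
    return np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)])

def mean_landmark_error(real_frames, inverted_frames):
    """Average per-point distance between real and inverted landmark tracks."""
    errs = []
    for real, fake in zip(real_frames, inverted_frames):
        lr, lf = face_landmarks(real), face_landmarks(fake)
        if lr is not None and lf is not None:
            errs.append(np.linalg.norm(lr - lf, axis=1).mean())
    return float(np.mean(errs))
```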

BibTeX

BibTeX code here