
So, in order to generate photorealistic images of a person in a different outfit, VOGUE first needs to train this pose-conditioned StyleGAN2 architecture. This is harder than simply reusing StyleGAN2, which was mainly developed for face images, where it earned its popularity. They had to make two key modifications. First, they replaced the beginning of the generator with an encoder that takes the pose keypoints of the image as input. The encoder’s output feeds the first “4×4 Style Block” of StyleGAN2 instead of the usual learned constant input, which is what implements the pose conditioning.
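To make this concrete, here is a minimal PyTorch sketch of what such a pose encoder could look like: it downsamples keypoint heatmaps into a 4×4 tensor that would stand in for StyleGAN2’s learned constant input. The layer sizes, the number of keypoints, and the class name are illustrative assumptions, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    """Encodes pose-keypoint heatmaps into the 4x4 tensor that replaces
    StyleGAN2's learned constant input (sizes here are illustrative)."""
    def __init__(self, num_keypoints=17, out_channels=512, input_size=256):
        super().__init__()
        layers = [nn.Conv2d(num_keypoints, 64, 3, padding=1), nn.LeakyReLU(0.2)]
        ch, size = 64, input_size
        while size > 4:                                   # halve until 4x4
            nxt = min(out_channels, ch * 2)
            layers += [nn.Conv2d(ch, nxt, 3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            ch, size = nxt, size // 2
        layers.append(nn.Conv2d(ch, out_channels, 3, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, pose_heatmaps):                     # (B, K, 256, 256)
        return self.net(pose_heatmaps)                    # (B, 512, 4, 4)

pose = torch.randn(1, 17, 256, 256)   # dummy keypoint heatmaps
print(PoseEncoder()(pose).shape)      # torch.Size([1, 512, 4, 4])
```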
Second, they trained their StyleGAN2 to output a segmentation at each resolution in addition to the RGB image, as you can see here. Using this network, they were able to generate many images, along with their segmentations, in the desired pose.
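In a typical StyleGAN2 implementation, each resolution already has a ToRGB layer, so adding a segmentation output can be as simple as a parallel 1×1 convolution on the same features. The sketch below assumes PyTorch and uses made-up names (DualOutputHead, num_classes); it only illustrates the idea, not the authors’ code.

```python
import torch
import torch.nn as nn

class DualOutputHead(nn.Module):
    """At a given generator resolution, produce an RGB image and segmentation
    logits from the same feature map (a minimal illustration, not the
    authors' exact layers)."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.to_rgb = nn.Conv2d(in_channels, 3, 1)            # RGB output
        self.to_seg = nn.Conv2d(in_channels, num_classes, 1)  # segmentation logits

    def forward(self, features):
        return self.to_rgb(features), self.to_seg(features)

feats = torch.randn(1, 512, 64, 64)                  # features at one resolution
rgb, seg = DualOutputHead(512, num_classes=8)(feats)
print(rgb.shape, seg.shape)                          # (1, 3, 64, 64) (1, 8, 64, 64)
```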
Once trained, given an input pair of images, they could “project” the images into the latent space of the generator, computing the latent codes that best capture the characteristics of the pair of input images. An optimizer then searches the space of combinations of these codes for the one where the garment from the second image appears on the person from the first image. They had to maximize changes within the region of interest while minimizing changes outside of it. To do that, they used two latent codes representing the two input images: the first from the image with the person to be generated, and the second from the image with the garment to be transferred. As we saw, they also needed the pose heatmap as input to the StyleGAN2 generator, shown here again in gray. They then had access to the images and segmentations generated by the trained GAN architecture. Finally, they used a loss function composed of three separate terms, each optimizing a part of the generated image.
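Conceptually, this search can be pictured as optimizing per-layer interpolation coefficients between the two projected latent codes while the generator stays frozen. The PyTorch loop below is only a sketch under that assumption: the generator call and the real three-term loss are left as comments, and a trivial stand-in loss keeps the example runnable.

```python
import torch

# Placeholders for the projected latent codes (one per style layer) and the
# pose heatmaps; in practice they come from the projection step and the
# trained pose-conditioned generator.
num_layers, latent_dim = 16, 512
z_person = torch.randn(num_layers, latent_dim)    # code of the person image
z_garment = torch.randn(num_layers, latent_dim)   # code of the garment image

# One interpolation coefficient per style layer, optimized while the
# generator and both latent codes stay fixed.
q = torch.zeros(num_layers, requires_grad=True)
optimizer = torch.optim.Adam([q], lr=0.01)

for step in range(200):
    alpha = torch.sigmoid(q).unsqueeze(1)          # keep coefficients in [0, 1]
    z_mix = (1 - alpha) * z_person + alpha * z_garment
    # image, segmentations = generator(z_mix, pose_heatmaps)
    # loss = localization_term + garment_term + identity_term
    loss = z_mix.square().mean()                   # stand-in so the loop runs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```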
There’s the editing-localization loss term, which encourages the network to interpolate styles only within the region of interest, defined here as M, using the segmentation outputs.
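One simple way to express such a term is to penalize pixel changes outside the mask M, as in the sketch below. The paper’s exact formulation differs, so treat this only as an illustration, with made-up names and a toy mask.

```python
import torch

def localization_loss(generated, reference, roi_mask):
    """Penalize pixel changes outside the region of interest M so that the
    edit stays confined to the garment area (a simplified stand-in for the
    paper's editing-localization term)."""
    outside = 1.0 - roi_mask                       # 1 where edits are NOT allowed
    return ((generated - reference) * outside).abs().mean()

img_ref = torch.rand(1, 3, 256, 256)               # original person image
img_gen = torch.rand(1, 3, 256, 256)               # current generated image
mask = torch.zeros(1, 1, 256, 256)
mask[..., 64:192, 64:192] = 1.0                     # toy region of interest M
print(localization_loss(img_gen, img_ref, mask))
```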
Then, there’s the garment loss, used to transfer the correct shape and texture of the garments.
Using embeddings from a very popular convolutional neural network architecture called VGG-16, they compute the distance between the garment areas of the two images, again using the segmentation labels. The mask created this way is then applied to the generated RGB images.
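A rough sketch of that masked perceptual comparison, assuming PyTorch and torchvision, is shown below; the choice of VGG-16 layer, the way the mask is applied, and the function names are assumptions for illustration, not the paper’s exact recipe (the pretrained weights are downloaded on first use).

```python
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen VGG-16 truncated at an intermediate layer as the feature extractor;
# the layer index is an arbitrary choice for this sketch.
features = vgg16(weights="DEFAULT").features[:16].eval()
for p in features.parameters():
    p.requires_grad_(False)

def garment_loss(generated, garment_image, seg_generated, seg_garment):
    """Distance between VGG-16 embeddings of the garment regions, with the
    segmentation-derived masks applied to the RGB images first."""
    f_gen = features(generated * seg_generated)
    f_src = features(garment_image * seg_garment)
    return F.mse_loss(f_gen, f_src)
```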
Finally, there’s the identity loss, which, as its name suggests, guides the network to preserve the identity of the person.
This is again done using the segmentation labels, following the same procedure as the garment loss.
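Following that same idea, a hypothetical identity term can reuse the masked VGG-16 comparison over the person region, and the three terms can then be combined into a single objective. The names and weights below are assumptions for illustration only.

```python
import torch.nn.functional as F
from torchvision.models import vgg16

features = vgg16(weights="DEFAULT").features[:16].eval()  # same extractor idea as above

def identity_loss(generated, person_image, person_mask):
    """Masked VGG-16 distance over the person region of the original image,
    mirroring the garment loss but aimed at preserving identity."""
    return F.mse_loss(features(generated * person_mask),
                      features(person_image * person_mask))

# Hypothetical weighted combination of the three terms; the actual balance
# used in the paper is not reproduced here.
# total_loss = w_loc * loc_term + w_garment * garment_term + w_id * identity_term
```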
Just take a second to look at how these losses affect the output image. You can clearly see their importance when the localization loss or the identity loss is missing.
More results
As they state: “Our method can synthesize the same style shirt for varied poses and body shapes by fixing the style vector. We present several different styles in multiple poses.” [1]
Just look at how much better the results are with this new approach:
Even more results!
Of course, this was just an overview of this new paper. I strongly invite you to read their paper, linked in the references below, for a better technical understanding.