Recently I came across an interesting paper, In-Domain GAN Inversion for Real Image Editing (a.k.a. IDInvert), and I was amazed by the potential of this technique and the wide range of applications it enables. Below is a list of applications this paper has envisioned.
To me, the applications should not stop at “Real Image Editing”; they should extend to the world of art creation. Just imagine: after an artist has drawn a character, he can edit the character’s pose and expression without redrawing it. He can even pick a certain feature from another character and blend it into his own character within seconds. It certainly opens the door to new ways of creating. With this excitement, I decided to first try it out on anime characters.
Before I jump into my experiment, allow me to give an overview of IDInvert so that you can better appreciate its power and understand its limitations.
StyleGAN is the state of the art in image generation. It was introduced by Nvidia, and it is the technology behind ThisPersonDoesNotExist and ThisWaifuDoesNotExist. The images generated by StyleGAN are near-perfect; most people cannot tell them apart from real photos. Moreover, it enables users to control how the image looks through different levels of styles:
- Coarse styles — pose, hair, face shape
- Middle styles — facial features, eyes
- Fine styles — color scheme
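This layer-wise control can be sketched in a few lines. Treating a style code as one vector per generator layer (as StyleGAN’s extended W+ space does), “mixing” coarse styles from one code into another is just a per-layer selection. The layer counts and ranges below are illustrative stand-ins, not the exact model configuration:

```python
# Toy illustration of StyleGAN-style layer-wise mixing (dummy vectors, no real model).
# A "style code" here is a list of per-layer vectors, one per generator layer.

NUM_LAYERS = 16          # a 512x512 StyleGAN generator has 16 style inputs (2 per resolution)
COARSE = range(0, 4)     # roughly: pose, hair, face shape
MIDDLE = range(4, 8)     # roughly: facial features, eyes
FINE   = range(8, 16)    # roughly: color scheme

def mix_styles(code_a, code_b, layers_from_b):
    """Take the layers in `layers_from_b` from code_b, everything else from code_a."""
    picked = set(layers_from_b)
    return [code_b[i] if i in picked else code_a[i] for i in range(len(code_a))]

# Two dummy codes, constant-valued so the result is easy to inspect.
a = [[0.0] * 512 for _ in range(NUM_LAYERS)]
b = [[1.0] * 512 for _ in range(NUM_LAYERS)]

# Keep a's identity but borrow b's coarse styles (pose / hair / face shape).
mixed = mix_styles(a, b, COARSE)
```

In the real model, feeding `mixed` into the synthesis network would produce an image with b’s pose and face shape but a’s finer details.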
However, there is a great limitation to StyleGAN image editing — it is powerless when the image is a photo taken in the real world. StyleGAN’s image editing works by manipulating the latent space vector, but a photo taken in the real world (the same goes for an image drawn by hand) has no such latent space vector, as it was not generated by StyleGAN.
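To make the latent-manipulation idea concrete: editing usually means moving a latent code along a semantic direction (a “smile” or “pose” direction found by separate analysis) and regenerating the image. A minimal sketch, with dummy vectors standing in for the real latent code and a hypothetical direction:

```python
# Minimal sketch of semantic editing in latent space (dummy data, no real model).
# edit() moves the latent code w along a direction by strength alpha; feeding the
# result back into the generator would change one attribute of the image.

def edit(w, direction, alpha):
    return [wi + alpha * di for wi, di in zip(w, direction)]

w = [0.5, -1.0, 2.0]          # stand-in latent code
smile = [1.0, 0.0, -0.5]      # stand-in "smile" direction (hypothetical)

w_more = edit(w, smile, +2.0)  # exaggerate the attribute
w_less = edit(w, smile, -2.0)  # suppress it
```

None of this is possible for a real photo, because there is no `w` to edit in the first place — which is exactly the gap inversion tries to close.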
In other words, StyleGAN is like Superman in the DC universe. It can do marvelous things in its own universe, but can’t help anyone in our universe.
Nonetheless, the power of StyleGAN is so tempting that many are trying to break the barrier between the real and the generated — and here comes the technique called Inversion.
StyleGAN is trained to convert a latent space vector into an image, so how about we just train another network to convert the image back into its latent space vector?
This is exactly what most people have tried, and the steps are:
- Use StyleGAN to produce thousands of images, so we have the “latent space vector — image” pairs as the training dataset
- Train an encoder network to convert the image back into the latent space vector. Note that if we feed this latent space vector back into StyleGAN, the generated image will slightly differ from the original. An additional optimization step is needed to fine-tune the vector; I omit it here for simplicity. You can refer to the paper for more information.
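The two steps above can be sketched with a toy 1-D “generator” in place of StyleGAN — the real thing is a deep network trained on high-dimensional images, but the structure of the procedure is the same: build (latent, image) pairs from the frozen generator, then fit an encoder to map images back to latents:

```python
import random

# Toy illustration of encoder training (a 1-D invertible "generator", not StyleGAN).
# Step 1: sample latents and run them through the frozen generator to get pairs.
# Step 2: train an encoder on those pairs to recover the latent from the image.

def generator(z):               # frozen "generator": a fixed map latent -> image
    return 2.0 * z + 1.0

random.seed(0)
pairs = [(z, generator(z)) for z in (random.uniform(-1, 1) for _ in range(200))]

# Encoder E(x) = a*x + b, trained with plain gradient descent on squared error.
a, b, lr = 0.0, 0.0, 0.05
for _ in range(500):
    for z, x in pairs:
        err = (a * x + b) - z    # prediction error against the true latent
        a -= lr * err * x        # gradient step on a (constant 2 folded into lr)
        b -= lr * err            # gradient step on b

# The learned encoder should approximate the generator's inverse, z = (x - 1) / 2.
```

In the real setup the encoder is itself a convolutional network and the loss includes perceptual terms, but the training data comes from the generator in exactly this way.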
This method can indeed find a latent space vector that reconstructs a real-world image, but the problem is that this vector may not fall in the domain the StyleGAN was trained on. As a result, the latent space vector may not support the image editing capability well.
The trick in IDInvert is a “domain-guided encoder”. Instead of just comparing the original vector fed into StyleGAN with the vector encoded from the image, IDInvert feeds the encoded vector back into StyleGAN and checks the generated image. The training process is very similar to the StyleGAN training process, except that instead of updating the StyleGAN generator, it updates the encoder network.
In this way, the encoder network learns to output vectors that are in-domain, which ensures that the inverted latent space vector can support image editing well.
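Schematically, the domain-guided objective compares the input image with its regeneration G(E(x)) and adds an adversarial term that pushes the regenerated image to look real — with the generator frozen and only the encoder updated. The stand-in functions and the loss weight below are illustrative, not the paper’s actual networks or values (the paper also includes a perceptual feature term, omitted here):

```python
# Schematic of the domain-guided encoder objective (toy stand-ins, not real nets).
# G is the frozen generator, E the encoder being trained, D a frozen discriminator.

def G(w):                      # frozen generator: latent -> "image" (toy: scale by 2)
    return [2.0 * wi for wi in w]

def E(x):                      # encoder under training: "image" -> latent (toy inverse)
    return [0.5 * xi for xi in x]

def D(x):                      # discriminator: realness score of an "image" (toy: mean)
    return sum(x) / len(x)

def encoder_loss(x, lam_adv=0.1):   # lam_adv is an illustrative weight, not the paper's
    x_rec = G(E(x))                                               # regenerate from encoded latent
    l_pix = sum((u - v) ** 2 for u, v in zip(x, x_rec)) / len(x)  # reconstruction term
    l_adv = -D(x_rec)                                             # reward "real-looking" output
    return l_pix + lam_adv * l_adv

loss = encoder_loss([1.0, 2.0, 3.0])
```

Because the adversarial term is computed on G(E(x)) rather than on E(x) directly, the encoder is pulled toward latents that the generator maps to realistic, in-domain images.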
IDInvert provides source code for both a TensorFlow and a PyTorch implementation. A big thanks to the authors for making it so convenient to try out. I used the PyTorch implementation.
I got the StyleGAN anime portrait pre-trained model from Gwern, who also provides the anime portrait face dataset for download; it consists of about 300k anime face images at 512×512px.
I trained on two Tesla P100 GPUs with a batch size of 8 for 2000k iterations. The training took about two weeks to finish.
However, the results turned out to be not very good. When feeding the encoded vector back into StyleGAN (with the optimization step included), I found that the inverted images bear only a rough resemblance to the input image: the details are very different, and artifacts are present.
Compared to the real-person portrait examples in the IDInvert paper, the anime results are much worse.
Through the experiment, I have identified a few possible reasons for the poor performance on anime portraits, and what could be improved:
- The anime portrait images are 512×512px, larger than the examples in the IDInvert paper. A larger image size usually makes the task more challenging. A potential solution is to train a smaller-size version first, then use it as a pre-trained model for the larger-size version.
- The training hyper-parameters I used are the defaults provided by the IDInvert source code. This set of parameters was optimized for real-person portraits and may need fine-tuning for anime character images.
- The anime portrait dataset mixes many artists’ styles, which makes it challenging to learn. A curated dataset with a more uniform style should give better performance.
Although I did not obtain good results from my experiments, I still believe in the potential of GAN-inversion technology, and dream of the day we can fully utilize GANs to make art creation more enjoyable.