Note that each of the results you saw was produced from a single picture taken from an arbitrary angle. That one image was sent to the model to produce these results, which is incredible to me when you consider the complexity of the task and all the parameters of the initial picture that have to be accounted for: the lighting, the resolution, the size, the angle or viewpoint, the location of the object in the image, and so on! If you are like me, you may be wondering how they do that.
Okay, I lied a little: the network does not take only the image as input; it also takes the camera parameters to help the process.
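To make "camera parameters" more concrete, here is a minimal sketch assuming a standard pinhole camera model: an intrinsics matrix K and a camera-to-world pose are enough to turn every pixel into a ray through the scene. The function and variable names here are mine for illustration, not the paper's.

```python
import numpy as np

# Hedged sketch: camera parameters under a pinhole model (an assumption here).
# K is the 3x3 intrinsics matrix, cam_to_world is a 4x4 (or 3x4) pose matrix.
def get_rays(height, width, K, cam_to_world):
    """Turn camera parameters into one ray (origin, direction) per pixel."""
    i, j = np.meshgrid(np.arange(width), np.arange(height), indexing="xy")
    # Back-project pixel coordinates through the intrinsics into camera space.
    dirs = np.stack([(i - K[0, 2]) / K[0, 0],
                     -(j - K[1, 2]) / K[1, 1],
                     -np.ones_like(i, dtype=float)], axis=-1)
    # Rotate the directions into world space; all rays start at the camera center.
    rays_d = dirs @ cam_to_world[:3, :3].T
    rays_o = np.broadcast_to(cam_to_world[:3, 3], rays_d.shape)
    return rays_o, rays_d
```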
Their algorithm learns a function that converts a 3D point and a viewing direction into an RGB color and a density value for that point, providing enough information to render the scene from any viewpoint later on. This is called a radiance field: it takes a position and a viewing direction as inputs and outputs a color and a volume density value for each of these points.
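As a rough illustration of what such a radiance field function looks like, here is a minimal PyTorch sketch; the layer sizes and the `RadianceField` name are my own illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of a radiance field as a function: it maps a 3D position x and
# a viewing direction d to an RGB color and a density sigma.
class RadianceField(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 color channels + 1 density value
        )

    def forward(self, position, direction):
        out = self.mlp(torch.cat([position, direction], dim=-1))
        rgb = torch.sigmoid(out[..., :3])   # color in [0, 1]
        sigma = torch.relu(out[..., 3:])    # non-negative volume density
        return rgb, sigma
```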
It is very similar to what NeRF does, a paper I already covered. Basically, in the NeRF case, the radiance field function is a neural network trained on images of the scene and the intended outputs. This implies that a large number of images is needed for each scene, and that a different network has to be trained for each of these scenes, making the process very costly and inefficient.
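To see why this is costly, here is a hedged sketch of that per-scene optimization, reusing the `RadianceField` sketch above; the random tensors stand in for the many (position, direction, color) samples that would come from posed images of a single scene, and the loop counts are illustrative.

```python
# Assumes the RadianceField class and imports from the sketch above.
# Dummy data: one entry per scene, each with 1024 random training samples.
scenes = [
    (torch.rand(1024, 3), torch.rand(1024, 3), torch.rand(1024, 3))
    for _ in range(3)
]

for positions, directions, target_rgb in scenes:
    model = RadianceField()                  # a brand-new network for every scene
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
    for step in range(100):                  # in practice, many thousands of steps
        pred_rgb, _ = model(positions, directions)
        loss = ((pred_rgb - target_rgb) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```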
So the goal is to find a better way to obtain this radiance field, composed of RGB and density values, and then render the object in 3D from novel views.
To get the information needed to create such a radiance field, they use what they call a shape network, which maps a latent code of the image into a 3D shape made of voxels. Voxels are the equivalent of pixels but in 3-dimensional space, and the latent code in question is basically a condensed version of all the useful information about the shape of the object in the image. This condensed shape information is decoded by a neural network composed of fully connected layers followed by convolutions, a powerful architecture for computer vision applications since convolutions have two main properties: they are invariant to translations and they exploit the local structure of images.
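Here is a hedged sketch of what such a shape network could look like in PyTorch; the latent size, layer widths, and the 32³ voxel resolution are my own illustrative choices, not the paper's exact values.

```python
import torch
import torch.nn as nn

# Hedged sketch of a shape network: fully connected layers expand the latent
# shape code, then 3D (transposed) convolutions upsample it into a voxel grid.
class ShapeNetwork(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 128 * 4 * 4 * 4), nn.ReLU(),
        )
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, shape_code):
        x = self.fc(shape_code).view(-1, 128, 4, 4, 4)  # seed 4x4x4 feature volume
        return torch.sigmoid(self.deconv(x))            # 32x32x32 occupancy grid

# Example: a single latent shape code mapped to a voxel grid.
voxels = ShapeNetwork()(torch.randn(1, 128))
print(voxels.shape)  # torch.Size([1, 1, 32, 32, 32])
```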
Of course, this network was trained on many images and learned a good function to map the shape information into what we call a latent code. Then, the shape network takes this latent code and produces a first 3D shape estimation.
You would think that we are done, but it’s not the case. This is just the first step: as we discussed, we still need the radiance field of this representation, which is produced here by an appearance network. Here again, it uses a similar latent code, this time for the appearance, along with the 3D shape as inputs to produce the radiance field using another network, referred to as F. This radiance field can finally be combined with the camera parameters to produce the final render of the object.
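Here is a hedged sketch of that appearance side: a network F conditioned on an appearance code and on shape information at the query point, outputting the color and density of the radiance field. Again, the names, sizes, and the way the shape information is fed in are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

# Hedged sketch of an appearance network F: conditioned on an appearance latent
# code and on a shape feature sampled at the query point (e.g. from the voxel
# grid), it outputs color and density like any radiance field.
class AppearanceNetwork(nn.Module):
    def __init__(self, latent_dim=128, shape_feat_dim=1, hidden=256):
        super().__init__()
        in_dim = 3 + 3 + latent_dim + shape_feat_dim  # position, direction, codes
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, position, direction, appearance_code, shape_feature):
        inputs = torch.cat([position, direction, appearance_code, shape_feature], dim=-1)
        out = self.mlp(inputs)
        rgb = torch.sigmoid(out[..., :3])
        sigma = torch.relu(out[..., 3:])
        return rgb, sigma
```

Rays built from the camera parameters (as in the earlier sketch) would then query this network at many points along each ray and integrate the colors and densities to produce the final render.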
Of course, this was just an overview of this new paper. I strongly recommend reading the paper linked below. The code is unfortunately not available right now, but I contacted one of the authors and he said that it will be available in a couple of weeks, so stay tuned for that!
If you like my work and want to stay up-to-date with AI technologies, you should definitely follow me on my social media channels.